github.com/james-bowman/nlp

Version: 0.0.0-20210511120306-26d441fa0ded

Repository: https://github.com/james-bowman/nlp.git

Documentation: pkg.go.dev

# README

Natural Language Processing

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.

Features

LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm, with multi-index and Forest schemes for LSH (Locality Sensitive Hashing). These support fast, approximate cosine similarity/angular distance comparisons and approximate nearest neighbour search using significantly less memory and processing time.
Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
PCA (Principal Component Analysis)
TF-IDF weighting to account for frequently occurring words.
Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
Stop word removal to remove frequently occurring English words, e.g. "the", "and".
Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
Similarity/distance measures to calculate the similarity/distance between feature vectors.
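
As an illustration of the last item, cosine similarity between two feature vectors can be sketched in plain Go. This is a minimal standalone sketch, not the package's own measures code; the function name cosineSimilarity is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity is a hypothetical helper (not part of the nlp package).
// It returns the cosine of the angle between a and b, or 0 when the
// lengths differ or either vector has zero magnitude.
func cosineSimilarity(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	// Vectors pointing the same way score close to 1, orthogonal ones score 0.
	fmt.Println(cosineSimilarity([]float64{1, 2, 0}, []float64{2, 4, 0}))
	fmt.Println(cosineSimilarity([]float64{1, 0, 0}, []float64{0, 3, 0}))
}
```

Cosine similarity ignores vector magnitude, which is why it suits documents of differing lengths.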

Planned

Expanded persistence support
Stemming to treat words with common root as the same e.g. "go" and "going"
Clustering algorithms e.g. Hierarchical, K-means, etc.
Classification algorithms e.g. SVM, KNN, random forest, etc.

# Packages

measures

No description provided by the author

# Functions

ColDo

ColDo executes fn for each column j in m.

ColNonZeroElemDo

ColNonZeroElemDo executes fn for each non-zero element in column j of matrix m.

CreateRandomProjectionTransform

CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims.

NewClassicLSH

NewClassicLSH creates a new ClassicLSH with the configured number of hash tables and hash functions per table.

NewCountVectoriser

NewCountVectoriser creates a new CountVectoriser.

NewHashingVectoriser

NewHashingVectoriser creates a new HashingVectoriser.

NewLatentDirichletAllocation

NewLatentDirichletAllocation returns a new LatentDirichletAllocation type initialised with default values for k topics.

NewLinearScanIndex

NewLinearScanIndex constructs a new empty LinearScanIndex which will use the specified pairwise distance metric to determine nearest neighbours based on similarity.

NewLSHForest

NewLSHForest creates a new LSHForest Locality Sensitive Hashing scheme with the specified number of hash tables and hash functions per table.

NewLSHIndex

NewLSHIndex creates a new LSHIndex.

NewPCA

NewPCA constructs a new Principal Component Analysis transformer to reduce the dimensionality, projecting matrices onto the axis of greatest variance.

NewPipeline

NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.

NewRandomIndexing

NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space.

NewRandomProjection

NewRandomProjection creates and returns a new RandomProjection transformer.

NewReflectiveRandomIndexing

NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing.

NewSignRandomProjection

NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality.

NewSimHash

NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits.
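
The idea behind sign random projection can be sketched without the package: draw one random hyperplane per output bit and record which side of each hyperplane the input vector falls on. The sketch below is an assumption-laden illustration, not the package's implementation; the names simHasher, newSimHasher, hash and hammingDistance are hypothetical, and fingerprints are limited to 64 bits for simplicity.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// simHasher is a hypothetical type holding one random hyperplane per output bit.
type simHasher struct {
	planes [][]float64
}

// newSimHasher draws numBits (at most 64) Gaussian hyperplanes of length dim.
func newSimHasher(numBits, dim int, seed int64) *simHasher {
	rng := rand.New(rand.NewSource(seed))
	planes := make([][]float64, numBits)
	for i := range planes {
		planes[i] = make([]float64, dim)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}
	return &simHasher{planes: planes}
}

// hash sets bit i when v lies on the non-negative side of hyperplane i.
// v is assumed to have the same length as the hyperplanes.
func (s *simHasher) hash(v []float64) uint64 {
	var fp uint64
	for i, p := range s.planes {
		var dot float64
		for j := range v {
			dot += p[j] * v[j]
		}
		if dot >= 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

// hammingDistance counts the bits that differ between two fingerprints.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	h := newSimHasher(64, 3, 42)
	a := h.hash([]float64{1, 2, 3})
	b := h.hash([]float64{1.1, 2, 2.9})
	c := h.hash([]float64{-3, 1, -2})
	// Vectors separated by a small angle tend to differ in fewer bits.
	fmt.Println(hammingDistance(a, b), hammingDistance(a, c))
}
```

The fraction of differing bits estimates the angle between the two vectors divided by pi, so Hamming distance over fingerprints approximates angular distance.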

NewTfidfTransformer

NewTfidfTransformer constructs a new TfidfTransformer.

NewTokeniser

NewTokeniser returns a new, default Tokeniser implementation.

NewTruncatedSVD

NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k.

# Constants

DocBasedRRI

DocBasedRRI indicates columns (documents/contexts in a term-document matrix) form the initial basis for index/elemental vectors in Reflective Random Indexing.

TermBasedRRI

TermBasedRRI indicates rows (terms in a term-document matrix) form the initial basis for index/elemental vectors in Reflective Random Indexing.

# Structs

ClassicLSH

ClassicLSH supports finding top-k Approximate Nearest Neighbours (ANN) using Locality Sensitive Hashing (LSH).

CountVectoriser

CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set.
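
To make the matrix layout concrete, here is a minimal standalone sketch of building a term document count matrix, with rows as terms and columns as documents. It uses simple whitespace tokenisation and a dense matrix, which is an assumption made for brevity; buildTermDocumentMatrix is a hypothetical name and not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// buildTermDocumentMatrix is a hypothetical helper. It learns a vocabulary
// from docs and returns it together with a dense matrix in which
// counts[i][j] is the number of times term i occurs in document j.
func buildTermDocumentMatrix(docs []string) (map[string]int, [][]int) {
	// First pass: assign a row index to every distinct term.
	vocab := make(map[string]int)
	for _, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := vocab[term]; !seen {
				vocab[term] = len(vocab)
			}
		}
	}
	// Second pass: count occurrences per document (one column each).
	counts := make([][]int, len(vocab))
	for i := range counts {
		counts[i] = make([]int, len(docs))
	}
	for j, doc := range docs {
		for _, term := range strings.Fields(strings.ToLower(doc)) {
			counts[vocab[term]][j]++
		}
	}
	return vocab, counts
}

func main() {
	vocab, counts := buildTermDocumentMatrix([]string{"the cat sat", "the cat the dog"})
	fmt.Println(len(vocab), counts[vocab["the"]])
}
```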

HashingVectoriser

HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term.
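
The hashing trick removes the need for a learned vocabulary by hashing each term straight to a row index. The sketch below illustrates this under stated assumptions: it substitutes FNV-1a from the Go standard library for MurmurHash3, and the names hashTerm and hashingVectorise are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashTerm is a hypothetical helper mapping a term to a row index in
// [0, numFeatures). FNV-1a stands in for MurmurHash3 here.
func hashTerm(term string, numFeatures int) int {
	h := fnv.New32a()
	h.Write([]byte(term))
	return int(h.Sum32() % uint32(numFeatures))
}

// hashingVectorise encodes doc as a fixed-length count vector. Distinct
// terms may collide on the same index, which trades accuracy for memory.
func hashingVectorise(doc string, numFeatures int) []float64 {
	vec := make([]float64, numFeatures)
	for _, term := range strings.Fields(strings.ToLower(doc)) {
		vec[hashTerm(term, numFeatures)]++
	}
	return vec
}

func main() {
	fmt.Println(hashingVectorise("go go gopher", 16))
}
```

Because no vocabulary is stored, documents can be vectorised without a prior training pass, at the cost of occasional collisions between terms.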

LatentDirichletAllocation

LatentDirichletAllocation (LDA) for fast unsupervised topic extraction.

LearningSchedule

LearningSchedule is used to calculate the learning rate for each iteration using a natural gradient descent algorithm.

LinearScanIndex

LinearScanIndex supports Nearest Neighbour (NN) similarity searches across indexed vectors performing queries in O(n) and requiring O(n) storage.
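
A linear scan is the brute-force baseline: compare the query with every stored vector and keep the k closest. The following standalone sketch assumes cosine distance as the metric; the names match, cosineDistance and nearest are hypothetical and do not describe the package's types.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// match is a hypothetical result type pairing an item ID with its distance.
type match struct {
	ID       string
	Distance float64
}

// cosineDistance returns 1 minus the cosine similarity of a and b.
// Both vectors are assumed to have the same length.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// nearest scans every indexed vector, so a query costs O(n) comparisons,
// and returns up to k matches ordered from closest to furthest.
func nearest(index map[string][]float64, query []float64, k int) []match {
	matches := make([]match, 0, len(index))
	for id, v := range index {
		matches = append(matches, match{ID: id, Distance: cosineDistance(query, v)})
	}
	// Break ties on ID so results do not depend on map iteration order.
	sort.Slice(matches, func(i, j int) bool {
		if matches[i].Distance != matches[j].Distance {
			return matches[i].Distance < matches[j].Distance
		}
		return matches[i].ID < matches[j].ID
	})
	if k >= 0 && k < len(matches) {
		matches = matches[:k]
	}
	return matches
}

func main() {
	index := map[string][]float64{"a": {1, 0}, "b": {0, 1}, "c": {1, 1}}
	for _, m := range nearest(index, []float64{1, 0.1}, 2) {
		fmt.Println(m.ID, m.Distance)
	}
}
```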

LSHForest

LSHForest is an implementation of the LSH Forest Locality Sensitive Hashing scheme based on the work of M.

LSHIndex

LSHIndex is an LSH (Locality Sensitive Hashing) based index supporting Approximate Nearest Neighbour (ANN) search in O(log n).

Match

Match represents a matching item for nearest neighbour similarity searches.

PCA

PCA calculates the principal components of a matrix, or the axes of greatest variance, and then projects matrices onto those axes.

Pipeline

Pipeline is a mechanism for composing processing pipelines out of vectorisers and transformation steps.

RandomIndexing

RandomIndexing is a method of dimensionality reduction used for Latent Semantic Analysis in a similar way to TruncatedSVD and PCA.

RandomProjection

RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
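
The projection itself is a matrix-vector product with a random matrix. The sketch below uses a dense Gaussian matrix scaled by 1/sqrt(k), which is one common construction and an assumption here; it is not necessarily the construction the package uses, and the names newProjection and project are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// newProjection is a hypothetical helper returning a k x d matrix of
// Gaussian entries scaled by 1/sqrt(k), so that squared lengths are
// preserved in expectation.
func newProjection(k, d int, seed int64) [][]float64 {
	rng := rand.New(rand.NewSource(seed))
	scale := 1 / math.Sqrt(float64(k))
	p := make([][]float64, k)
	for i := range p {
		p[i] = make([]float64, d)
		for j := range p[i] {
			p[i][j] = rng.NormFloat64() * scale
		}
	}
	return p
}

// project maps a d-dimensional vector v into k dimensions.
// v is assumed to have the same length as each row of p.
func project(p [][]float64, v []float64) []float64 {
	out := make([]float64, len(p))
	for i, row := range p {
		for j, x := range v {
			out[i] += row[j] * x
		}
	}
	return out
}

func main() {
	p := newProjection(4, 10, 1)
	v := []float64{1, 0, 0, 2, 0, 0, 0, 0, 3, 0}
	fmt.Println(project(p, v))
}
```

How closely distances are preserved depends on k: larger target dimensions give tighter guarantees.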

RegExpTokeniser

RegExpTokeniser implements the Tokeniser interface using a basic RegExp pattern for unigram word tokenisation, supporting optional stop word removal.
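
A regular-expression tokeniser with stop word filtering can be sketched in a few lines. The pattern and the names wordPattern and tokenise below are hypothetical choices for illustration; the package's actual pattern may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordPattern is an assumed pattern matching runs of Unicode letters.
var wordPattern = regexp.MustCompile(`\p{L}+`)

// tokenise lower-cases text, splits it into single-word tokens and drops
// any token present in stopWords.
func tokenise(text string, stopWords map[string]bool) []string {
	var tokens []string
	for _, w := range wordPattern.FindAllString(strings.ToLower(text), -1) {
		if !stopWords[w] {
			tokens = append(tokens, w)
		}
	}
	return tokens
}

func main() {
	stop := map[string]bool{"the": true, "and": true}
	fmt.Println(tokenise("The quick, brown fox and the dog.", stop))
}
```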

SignRandomProjection

SignRandomProjection represents a transform of a matrix into a lower dimensional space.

SimHash

SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm for angular distance using sign random projections based on the work of Moses S.

TfidfTransformer

TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus.
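
The weighting can be illustrated with a small standalone sketch. It assumes a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. This formula is one common variant and may not match the package exactly; the function name tfidf is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf is a hypothetical helper that reweights a term document count
// matrix (rows are terms, columns are documents) so that terms found in
// many documents contribute less than rare ones.
func tfidf(counts [][]float64) [][]float64 {
	weighted := make([][]float64, len(counts))
	for i, row := range counts {
		n := len(row)
		// df is the number of documents in which term i appears.
		df := 0
		for _, c := range row {
			if c > 0 {
				df++
			}
		}
		// Assumed smoothed idf variant; see the note above.
		idf := math.Log(float64(1+n)/float64(1+df)) + 1
		weighted[i] = make([]float64, n)
		for j, c := range row {
			weighted[i][j] = c * idf
		}
	}
	return weighted
}

func main() {
	counts := [][]float64{
		{1, 1, 1}, // term present in every document
		{1, 0, 0}, // term present in a single document
	}
	for _, row := range tfidf(counts) {
		fmt.Println(row)
	}
}
```

With this variant, a term that occurs in every document keeps its raw count, while rarer terms are scaled up.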

TruncatedSVD

TruncatedSVD implements the Singular Value Decomposition factorisation of matrices.

# Interfaces

Hasher

Hasher interface represents a Locality Sensitive Hashing algorithm whereby the proximity of data points is preserved in the hash space i.e.

Indexer

Indexer indexes vectors to support Nearest Neighbour (NN) similarity searches across the indexed vectors.

LSHScheme

LSHScheme interface represents LSH indexing schemes to support Approximate Nearest Neighbour (ANN) search.

OnlineTransformer

OnlineTransformer is an extension to the Transformer interface that supports online (streaming/mini-batch) training as opposed to just batch.

OnlineVectoriser

OnlineVectoriser is an extension to the Vectoriser interface that supports online (streaming/mini-batch) training as opposed to just batch.

Tokeniser

Tokeniser interface for tokenisers allowing substitution of different tokenisation strategies e.g.

Transformer

Transformer provides a common interface for transformer steps.

Vectoriser

Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.

# Type aliases

RRIBasis

RRIBasis represents the initial basis for the index/elemental vectors used for Reflective Random Indexing.