0.0.0-20210511120306-26d441fa0ded
Repository: https://github.com/james-bowman/nlp.git
Documentation: pkg.go.dev
# README
Natural Language Processing

Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for the package is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.
Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.
Check out the companion blog post or the Go documentation page for full usage and examples.
Features
- LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
- Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm with multi-index and Forest schemes for LSH (Locality Sensitive Hashing) to support fast, approximate cosine similarity/angular distance comparisons and approximate nearest neighbour search using significantly less memory and processing time.
- Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
- Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
- PCA (Principal Component Analysis)
- TF-IDF weighting to account for frequently occurring words
- Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
- Stop word removal to remove frequently occurring English words e.g. "the", "and"
- Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
- Similarity/distance measures to calculate the similarity/distance between feature vectors.
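To make the TF-IDF weighting listed above concrete, here is a minimal standard-library sketch. It is not the package's implementation: the function name `tfidf`, the dense `[][]float64` matrix, and the smoothed formula idf(t) = ln((1+n)/(1+df(t))) + 1 are all assumptions chosen for illustration. Rows are terms and columns are documents.

```go
package main

import (
	"fmt"
	"math"
)

// tfidf weights a term-document count matrix (rows are terms, columns are
// documents) by a smoothed inverse document frequency:
//   idf(t) = ln((1+n)/(1+df(t))) + 1
// where n is the number of documents and df(t) is the number of documents
// containing term t. The smoothing variant is an assumption of this sketch.
func tfidf(counts [][]float64) [][]float64 {
	weighted := make([][]float64, len(counts))
	for t, row := range counts {
		// Document frequency: how many documents contain term t.
		df := 0
		for _, c := range row {
			if c > 0 {
				df++
			}
		}
		idf := math.Log(float64(1+len(row))/float64(1+df)) + 1
		weighted[t] = make([]float64, len(row))
		for d, c := range row {
			weighted[t][d] = c * idf
		}
	}
	return weighted
}

func main() {
	counts := [][]float64{
		{1, 1, 1}, // a term such as "the" that occurs in every document
		{2, 0, 0}, // a term such as "fox" that occurs in one document only
	}
	w := tfidf(counts)
	fmt.Printf("common term: %.3f  rare term: %.3f\n", w[0][0], w[1][0])
}
```

A term that appears in every document keeps its raw count, while a term confined to a single document is weighted upwards, which is the effect the feature list describes.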
Planned
- Expanded persistence support
- Stemming to treat words with common root as the same e.g. "go" and "going"
- Clustering algorithms e.g. Hierarchical, K-means, etc.
- Classification algorithms e.g. SVM, KNN, random forest, etc.
References
- Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
- Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
- Thomo, Alex. Latent Semantic Analysis (Tutorial).
- Latent Semantic Indexing. Stanford NLP Course
- Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380.
- M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
- A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
- Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
- Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
- Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
- QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
- Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation
# Packages
No description provided by the author
# Functions
ColDo executes fn for each column j in m.
ColNonZeroElemDo executes fn for each non-zero element in column j of matrix m.
CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims.
NewClassicLSH creates a new ClassicLSH with the configured number of hash tables and hash functions per table.
NewCountVectoriser creates a new CountVectoriser.
NewHashingVectoriser creates a new HashingVectoriser.
NewLatentDirichletAllocation returns a new LatentDirichletAllocation type initialised with default values for k topics.
NewLinearScanIndex constructs a new empty LinearScanIndex which will use the specified pairwise distance metric to determine nearest neighbours based on similarity.
NewLSHForest creates a new LSHForest Locality Sensitive Hashing scheme with the specified number of hash tables and hash functions per table.
NewLSHIndex creates a new LSHIndex.
NewPCA constructs a new Principal Component Analysis transformer to reduce the dimensionality, projecting matrices onto the axis of greatest variance.
NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.
NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space.
NewRandomProjection creates and returns a new RandomProjection transformer.
NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing.
NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality.
NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits.
NewTfidfTransformer constructs a new TfidfTransformer.
NewTokeniser returns a new, default Tokeniser implementation.
NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k.
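The hashing trick behind NewHashingVectoriser can be sketched with the standard library alone. This is an illustrative assumption, not the package's code: the function name `hashVector` is hypothetical, tokens are split on whitespace, and FNV-1a from `hash/fnv` is swapped in for MurmurHash3 because it ships with Go.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// hashVector maps the whitespace-separated tokens of doc onto a fixed-length
// count vector. Each token's index is its hash modulo numFeatures, so no
// vocabulary has to be learned or stored. Distinct tokens may collide and
// share an index. FNV-1a stands in for MurmurHash3 in this sketch.
func hashVector(doc string, numFeatures int) []float64 {
	vec := make([]float64, numFeatures)
	for _, tok := range strings.Fields(strings.ToLower(doc)) {
		h := fnv.New32a()
		h.Write([]byte(tok)) // Write on a hash.Hash never returns an error
		vec[h.Sum32()%uint32(numFeatures)]++
	}
	return vec
}

func main() {
	v := hashVector("the quick brown fox jumped over the lazy dog", 16)
	fmt.Println(v)
}
```

Because the index comes from the hash alone, unseen words at query time still land in a well-defined row, at the cost of occasional collisions when `numFeatures` is small.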
# Constants
DocBasedRRI represents columns (documents/contexts in a term-document matrix) forming the initial basis for index/elemental vectors in Random Indexing.
TermBasedRRI indicates rows (terms in a term-document matrix) form the initial basis for index/elemental vectors in Reflective Random Indexing.
# Structs
ClassicLSH supports finding top-k Approximate Nearest Neighbours (ANN) using Locality Sensitive Hashing (LSH).
CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set.
HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term.
LatentDirichletAllocation (LDA) for fast unsupervised topic extraction.
LearningSchedule is used to calculate the learning rate for each iteration using a natural gradient descent algorithm.
LinearScanIndex supports Nearest Neighbour (NN) similarity searches across indexed vectors performing queries in O(n) and requiring O(n) storage.
LSHForest is an implementation of the LSH Forest Locality Sensitive Hashing scheme based on the work of M.
LSHIndex is an LSH (Locality Sensitive Hashing) based index supporting Approximate Nearest Neighbour (ANN) search in O(log n).
Match represents a matching item for nearest neighbour similarity searches.
PCA calculates the principal components of a matrix, or the axes of greatest variance, and then projects matrices onto those axes.
Pipeline is a mechanism for composing processing pipelines out of a vectoriser and transformation steps.
RandomIndexing is a method of dimensionality reduction used for Latent Semantic Analysis in a similar way to TruncatedSVD and PCA.
RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
RegExpTokeniser implements the Tokeniser interface using a basic RegExp pattern for unigram word tokenisation, supporting optional stop word removal.
SignRandomProjection represents a transform of a matrix into a lower dimensional space.
SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm for angular distance using sign random projections based on the work of Moses S.
TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus.
TruncatedSVD implements the Singular Value Decomposition factorisation of matrices.
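The sign random projection idea behind SimHash and SignRandomProjection can be illustrated as follows. This is a sketch under stated assumptions rather than the package's implementation: the names `simHasher`, `newSimHasher`, and `hammingDistance` are hypothetical, and fingerprints are limited to 64 bits so they fit in a `uint64`.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// simHasher holds one random hyperplane (a normal vector) per fingerprint bit.
type simHasher struct {
	planes [][]float64
}

// newSimHasher draws nbits random hyperplanes for vectors of length dim.
// nbits must be at most 64 in this sketch.
func newSimHasher(nbits, dim int, seed int64) *simHasher {
	rng := rand.New(rand.NewSource(seed))
	planes := make([][]float64, nbits)
	for i := range planes {
		planes[i] = make([]float64, dim)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}
	return &simHasher{planes: planes}
}

// hash sets bit i when v lies on the non-negative side of hyperplane i.
func (s *simHasher) hash(v []float64) uint64 {
	var fp uint64
	for i, p := range s.planes {
		dot := 0.0
		for j, x := range v {
			dot += p[j] * x
		}
		if dot >= 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// hammingDistance counts the bits in which two fingerprints differ. A larger
// count indicates a larger angle between the original vectors.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	h := newSimHasher(64, 3, 42)
	a := h.hash([]float64{1, 2, 3})
	b := h.hash([]float64{1, 2, 3.5})
	c := h.hash([]float64{-3, 1, -2})
	fmt.Println("near:", hammingDistance(a, b), "far:", hammingDistance(a, c))
}
```

Only the sign of each projection is kept, so scaling a vector by a positive constant leaves its fingerprint unchanged. This is why the Hamming distance between fingerprints tracks angular distance and not Euclidean distance.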
# Interfaces
Hasher interface represents a Locality Sensitive Hashing algorithm whereby the proximity of data points is preserved in the hash space i.e.
Indexer indexes vectors to support Nearest Neighbour (NN) similarity searches across the indexed vectors.
LSHScheme interface represents LSH indexing schemes to support Approximate Nearest Neighbour (ANN) search.
OnlineTransformer is an extension to the Transformer interface that supports online (streaming/mini-batch) training as opposed to just batch.
OnlineVectoriser is an extension to the Vectoriser interface that supports online (streaming/mini-batch) training as opposed to just batch.
Tokeniser interface for tokenisers allowing substitution of different tokenisation strategies e.g.
Transformer provides a common interface for transformer steps.
Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.
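The way a vectoriser and a chain of transformers compose into a pipeline can be sketched with simplified stand-ins. Everything here is hypothetical and only illustrates the pattern: the lower-case `vectoriser` and `transformer` interfaces, the dense `matrix` type, and the `countVectoriser` and `binariser` steps are not the package's types, and the package's own interfaces are not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// matrix is a dense term-document matrix: rows are terms, columns are documents.
type matrix [][]float64

// vectoriser and transformer are simplified, hypothetical interfaces used
// only to show how pipeline steps compose.
type vectoriser interface {
	FitTransform(docs ...string) matrix
}

type transformer interface {
	FitTransform(m matrix) matrix
}

// pipeline feeds the vectoriser's output through each transformer in order.
type pipeline struct {
	vec   vectoriser
	steps []transformer
}

func (p pipeline) FitTransform(docs ...string) matrix {
	m := p.vec.FitTransform(docs...)
	for _, t := range p.steps {
		m = t.FitTransform(m)
	}
	return m
}

// countVectoriser learns a vocabulary in first-seen order and counts terms.
type countVectoriser struct {
	vocab map[string]int
}

func (c *countVectoriser) FitTransform(docs ...string) matrix {
	c.vocab = map[string]int{}
	var m matrix
	for d, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			row, seen := c.vocab[tok]
			if !seen {
				// New term: assign the next row and allocate its counts.
				row = len(c.vocab)
				c.vocab[tok] = row
				m = append(m, make([]float64, len(docs)))
			}
			m[row][d]++
		}
	}
	return m
}

// binariser replaces every non-zero count with 1.
type binariser struct{}

func (binariser) FitTransform(m matrix) matrix {
	out := make(matrix, len(m))
	for i, row := range m {
		out[i] = make([]float64, len(row))
		for j, v := range row {
			if v != 0 {
				out[i][j] = 1
			}
		}
	}
	return out
}

func main() {
	p := pipeline{vec: &countVectoriser{}, steps: []transformer{binariser{}}}
	fmt.Println(p.FitTransform("the cat sat", "the the dog"))
	// prints [[1 1] [1 0] [1 0] [0 1]]
}
```

Keeping each step behind a small interface means a weighting step or a dimensionality reduction step can be swapped in without changing the code that drives the pipeline.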
# Type aliases
RRIBasis represents the initial basis for the index/elemental vectors used for Reflective Random Indexing.