github.com/james-bowman/nlp (module, package)
Version: 0.0.0-20210511120306-26d441fa0ded
Repository: https://github.com/james-bowman/nlp.git
Documentation: pkg.go.dev

# README

Natural Language Processing


nlp

Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.

Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.

Check out the companion blog post or the Go documentation page for full usage and examples.
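
A typical use of the package is to compose a vectoriser, one or more transformers and a similarity measure into a single pipeline. The sketch below illustrates Latent Semantic Indexing (LSI) using the constructors listed under Functions further down this page; the corpus and query strings are invented, the pairwise.CosineSimilarity helper is assumed to live in the repository's measures/pairwise subpackage, and the exact signatures should be treated as illustrative rather than authoritative.

```go
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
	"github.com/james-bowman/nlp/measures/pairwise"
	"gonum.org/v1/gonum/mat"
)

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"hey diddle diddle, the cat and the fiddle",
		"the cow jumped over the moon",
	}
	query := "the cat jumped over the dog"

	// Compose an LSI pipeline: raw term counts -> TF-IDF weighting ->
	// dimensionality reduction via truncated SVD (down to 2 dimensions here).
	vectoriser := nlp.NewCountVectoriser()
	transformer := nlp.NewTfidfTransformer()
	reducer := nlp.NewTruncatedSVD(2)
	pipeline := nlp.NewPipeline(vectoriser, transformer, reducer)

	// Fit the pipeline to the corpus and project the documents into LSI space
	// (documents are columns of the resulting matrix).
	lsi, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Printf("failed to process corpus: %v\n", err)
		return
	}

	// Project the query through the already fitted pipeline into the same space.
	queryVector, err := pipeline.Transform(query)
	if err != nil {
		fmt.Printf("failed to process query: %v\n", err)
		return
	}

	// Compare the query against each document column using cosine similarity.
	highest, matched := -1.0, -1
	_, docs := lsi.Dims()
	for i := 0; i < docs; i++ {
		similarity := pairwise.CosineSimilarity(
			queryVector.(mat.ColViewer).ColView(0),
			lsi.(mat.ColViewer).ColView(i),
		)
		if similarity > highest {
			highest, matched = similarity, i
		}
	}
	fmt.Printf("most similar document: %d (cosine similarity %.2f)\n", matched, highest)
}
```

FitTransform fits the whole pipeline to the corpus in a single pass, while Transform reuses the already fitted model so the query lands in the same reduced space as the documents.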


Features

Planned

  • Expanded persistence support
  • Stemming to treat words with common root as the same e.g. "go" and "going"
  • Clustering algorithms e.g. hierarchical, K-means, etc.
  • Classification algorithms e.g. SVM, KNN, random forest, etc.

References

  1. Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
  2. Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
  3. Thomo, Alex. Latent Semantic Analysis (Tutorial).
  4. Latent Semantic Indexing. Stanford NLP Course
  5. Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC ’02, 2002, p. 380.
  6. M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
  7. A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
  8. Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
  9. Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
  10. Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
  11. QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
  12. Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation


# Functions

ColDo executes fn for each column j in m.
ColNonZeroElemDo executes fn for each non-zero element in column j of matrix m.
CreateRandomProjectionTransform returns a new random matrix for Random Projections of shape newDims x origDims.
NewClassicLSH creates a new ClassicLSH with the configured number of hash tables and hash functions per table.
NewCountVectoriser creates a new CountVectoriser.
NewHashingVectoriser creates a new HashingVectoriser.
NewLatentDirichletAllocation returns a new LatentDirichletAllocation type initialised with default values for k topics (see the topic-modelling sketch after this list).
NewLinearScanIndex constructs a new empty LinearScanIndex which will use the specified pairwise distance metric to determine nearest neighbours based on similarity.
NewLSHForest creates a new LSHForest Locality Sensitive Hashing scheme with the specified number of hash tables and hash functions per table.
NewLSHIndex creates a new LSHIndex.
NewPCA constructs a new Principal Component Analysis transformer to reduce dimensionality by projecting matrices onto the axes of greatest variance.
NewPipeline constructs a new processing pipeline with the supplied Vectoriser and one or more transformers.
NewRandomIndexing returns a new RandomIndexing transformer configured to transform term document matrices into k dimensional space.
NewRandomProjection creates and returns a new RandomProjection transformer.
NewReflectiveRandomIndexing returns a new RandomIndexing type configured for Reflective Random Indexing.
NewSignRandomProjection constructs a new SignRandomProjection transformer to reduce the dimensionality.
NewSimHash constructs a new SimHash creating a set of locality sensitive hash functions which are combined to accept input vectors of length dim and produce hashed binary vector fingerprints of length bits.
NewTfidfTransformer constructs a new TfidfTransformer.
NewTokeniser returns a new, default Tokeniser implementation.
NewTruncatedSVD creates a new TruncatedSVD transformer with K (the truncated dimensionality) being set to the specified value k.
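
Topic extraction with Latent Dirichlet Allocation reuses the same pipeline machinery: a CountVectoriser feeds raw term counts into the LDA transformer, whose output columns are per-document topic distributions. The following sketch is loosely based on the author's documented LDA example; the corpus strings are invented and the Components() accessor on the fitted model is an assumption rather than an item listed on this page.

```go
package main

import (
	"fmt"

	"github.com/james-bowman/nlp"
)

func main() {
	corpus := []string{
		"The quick brown fox jumped over the lazy dog",
		"the cow jumped over the moon",
		"the little dog laughed to see such fun",
	}

	vectoriser := nlp.NewCountVectoriser()
	lda := nlp.NewLatentDirichletAllocation(2) // k = 2 topics
	pipeline := nlp.NewPipeline(vectoriser, lda)

	// Fit the model and obtain each document's distribution over topics
	// (topics are rows, documents are columns).
	docsOverTopics, err := pipeline.FitTransform(corpus...)
	if err != nil {
		fmt.Printf("failed to model topics: %v\n", err)
		return
	}

	topics, docs := docsOverTopics.Dims()
	for doc := 0; doc < docs; doc++ {
		for topic := 0; topic < topics; topic++ {
			fmt.Printf("doc %d, topic %d: p=%.3f\n", doc, topic, docsOverTopics.At(topic, doc))
		}
	}

	// The fitted LDA also exposes the topic over word distributions
	// (assumed accessor, see the note above).
	topicsOverWords := lda.Components()
	_, words := topicsOverWords.Dims()
	fmt.Printf("learned %d topics over %d vocabulary terms\n", topics, words)
}
```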

# Constants

DocBasedRRI represents columns (documents/contexts in a term-document matrix) forming the initial basis for index/elemental vectors in Reflective Random Indexing.
TermBasedRRI indicates rows (terms in a term-document matrix) form the initial basis for index/elemental vectors in Reflective Random Indexing.

# Structs

ClassicLSH supports finding top-k Approximate Nearest Neighbours (ANN) using Locality Sensitive Hashing (LSH).
CountVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term present in the training data set.
HashingVectoriser can be used to encode one or more text documents into a term document matrix where each column represents a document within the corpus and each row represents a term.
LatentDirichletAllocation (LDA) for fast unsupervised topic extraction.
LearningSchedule is used to calculate the learning rate for each iteration using a natural gradient descent algorithm.
LinearScanIndex supports Nearest Neighbour (NN) similarity searches across indexed vectors performing queries in O(n) and requiring O(n) storage.
LSHForest is an implementation of the LSH Forest Locality Sensitive Hashing scheme based on the work of M. Bawa, T. Condie and P. Ganesan (reference 6 above).
LSHIndex is an LSH (Locality Sensitive Hashing) based index supporting Approximate Nearest Neighbour (ANN) search in O(log n).
Match represents a matching item for nearest neighbour similarity searches.
PCA calculates the principal components of a matrix, i.e. the axes of greatest variance, and then projects matrices onto those axes.
Pipeline is a mechanism for composing processing pipelines out of vectoriser and transformer steps.
RandomIndexing is a method of dimensionality reduction used for Latent Semantic Analysis in a similar way to TruncatedSVD and PCA.
RandomProjection is a method of dimensionality reduction based upon the Johnson–Lindenstrauss lemma, which states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
RegExpTokeniser implements Tokeniser interface using a basic RegExp pattern for unary-gram word tokeniser supporting optional stop word removal.
SignRandomProjection represents a transform of a matrix into a lower dimensional space.
SimHash implements the SimHash Locality Sensitive Hashing (LSH) algorithm for angular distance using sign random projections, based on the work of Moses S. Charikar (reference 5 above).
TfidfTransformer takes a raw term document matrix and weights each raw term frequency value depending upon how commonly it occurs across all documents within the corpus.
TruncatedSVD implements the Singular Value Decomposition factorisation of matrices.

# Interfaces

Hasher interface represents a Locality Sensitive Hashing algorithm whereby the proximity of data points is preserved in the hash space, i.e. similar items are hashed to similar values with high probability.
Indexer indexes vectors to support Nearest Neighbour (NN) similarity searches across the indexed vectors.
LSHScheme interface represents LSH indexing schemes to support Approximate Nearest Neighbour (ANN) search.
OnlineTransformer is an extension to the Transformer interface that supports online (streaming/mini-batch) training as opposed to just batch.
OnlineVectoriser is an extension to the Vectoriser interface that supports online (streaming/mini-batch) training as opposed to just batch.
Tokeniser interface for tokenisers, allowing substitution of different tokenisation strategies e.g. the default regular-expression based word tokenisation with optional stop word removal.
Transformer provides a common interface for transformer steps.
Vectoriser provides a common interface for vectorisers that take a variable set of string arguments and produce a numerical matrix of features.
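
The pipeline composition shown in the examples above works because every step satisfies one of two small contracts: Vectoriser for the text-to-matrix step and Transformer for the matrix-to-matrix steps. The sketch below shows what those contracts plausibly look like, assuming Gonum's mat.Matrix as the numerical representation; the method sets are inferred from the descriptions above rather than copied from the source.

```go
package sketch

import "gonum.org/v1/gonum/mat"

// Vectoriser turns a variadic set of plain-text documents into a feature matrix
// (terms as rows, documents as columns in this package's convention).
type Vectoriser interface {
	Fit(docs ...string) Vectoriser
	Transform(docs ...string) (mat.Matrix, error)
	FitTransform(docs ...string) (mat.Matrix, error)
}

// Transformer maps one matrix into another, e.g. TF-IDF weighting, truncated SVD
// or LDA, which is what allows steps to be chained inside a Pipeline.
type Transformer interface {
	Fit(m mat.Matrix) Transformer
	Transform(m mat.Matrix) (mat.Matrix, error)
	FitTransform(m mat.Matrix) (mat.Matrix, error)
}
```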

# Type aliases

RRIBasis represents the initial basis for the index/elemental vectors used for Reflective Random Indexing.