github.com/chewxy/lingo
Version: 0.0.0-20200918122423-491e816b48d4
Repository: https://github.com/chewxy/lingo.git
Documentation: pkg.go.dev

# README

lingo


package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS Tagger (lingo/pos), a Dependency Parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

| Package | Used For | Vitality | Notes | Licence |
|---------|----------|----------|-------|---------|
| gorgonia | Machine learning | Vital. It won't be hard to rewrite them, but why? | Same author | Gorgonia Licence (Apache 2.0-like) |
| gographviz | Visualization of annotations, and other graph-related visualizations | Vital for visualizations, which are a nice-to-have feature | API last changed 12th April 2017 | gographviz licence (Apache 2.0) |
| errors | Errors | The package won't die without it, but it's a very nice to have | Stable API for the past year | errors licence (MIT/BSD-like) |
| set | Set operations | Can be easily replaced | Stable API for the past year | set licence (MIT/BSD-like) |

Usage

See the individual packages for usage. There is also a set of executables in the cmd directory; they're meant as examples of how a natural language processing pipeline can be set up.

A natural language processing pipeline with this package is heavily channel-driven. Here's an example for dependency parsing:

package main

import (
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

// posModel and depModel are assumed to have been loaded elsewhere.

func main() {
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag.
	dp := dep.New(depModel)                                  // Creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			// do something with the parse d
			_ = d
		case err := <-dp.Error:
			// handle error
			_ = err
		}
	}
}

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages will use.

Perhaps the most important data structure is the *Annotation structure. It holds a word and the associated metadata for that word.

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain rationale behind each data type.
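
To make the three forms concrete, here's a minimal sketch. The summarize helper is hypothetical and not part of this package; only the type names come from lingo:

```go
package example

import "github.com/chewxy/lingo"

// summarize is a hypothetical helper showing that a dependency parse may
// arrive in any of the three forms described above.
func summarize(parse interface{}) string {
	switch parse.(type) {
	case *lingo.Dependency:
		return "a *lingo.Dependency: the dependency parse of a sentence"
	case *lingo.DependencyTree:
		return "a *lingo.DependencyTree: the tree form of the parse"
	case *lingo.Annotation:
		return "a *lingo.Annotation: a word plus its metadata, linked into the parse"
	default:
		return "not a dependency parse"
	}
}
```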

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that the POSTag and DependencyType are hard-coded in as constants. This package in fact provides two variations of each: one from Stanford/Penn Treebank and one from Universal Dependencies.

The main reason for hardcoding these is performance: knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes with a tradeoff: programs are limited to these two options. Thankfully there are only a limited number of POS tag and dependency relation types, and two of the most popular sets (Stanford/PTB and Universal Dependencies) have been implemented.
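
As an illustration of the allocation point, here's a sketch that counts tags into a fixed-size array. It assumes lingo.POSTag is an integer type; lingo.MAXTAG is the constant the package provides as index support:

```go
package example

import "github.com/chewxy/lingo"

// tagCounts counts how often each POS tag occurs. Because the tag set is
// fixed at compile time, a plain array suffices: no map, no resizing.
func tagCounts(tags []lingo.POSTag) [lingo.MAXTAG]int {
	var counts [lingo.MAXTAG]int
	for _, t := range tags {
		counts[t]++
	}
	return counts
}
```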

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program thusly: go build -tags='stanfordtags'.

The default tag and dependency relation types are the Universal Dependencies versions.
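
If your own code assumes a particular tagset, you can guard it with the same build tags. A sketch (the file and package names are made up; only the stanfordtags tag comes from lingo):

```go
//go:build stanfordtags

// Package ptbonly contains code that only makes sense when lingo was built
// with the Stanford/PTB tag set, i.e. go build -tags='stanfordtags'.
package ptbonly
```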

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers; I thought it'd be cool to write something like that for NLP.

The test cases in package lingo/lexer showcase how it handles Unicode and other pathological English.
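
Here's a minimal sketch of running the lexer on its own, using the same calls as the pipeline example above. It assumes lx.Output is closed once the input is exhausted, so the range loop terminates; check the lingo/lexer docs and tests before relying on that:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/lexer"
)

func main() {
	// The lexer splits on spaces, with some English-specific rules.
	lx := lexer.New("example", strings.NewReader("Don't panic, it's only £5.50."))
	go lx.Run()

	// Assumption: lx.Output is closed when the lexer is done, so this loop ends.
	// %v is used so nothing is assumed about the lexeme's fields.
	for lex := range lx.Output {
		fmt.Printf("%v\n", lex)
	}
}
```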

Contributing

See CONTRIBUTING.md for more info.

Licence

This package is licenced under the MIT licence.


# Functions

AllocTree allocates the lefts and rights.
AnnotationFromLexTag is only ever used in tests.
FromAnnotatedSentence creates a dependency from an AnnotatedSentence.
POSTag-related functions.
IsIN returns true if the POSTag is a subordinating conjunction.
NewDependency creates a new *Dependency.
POSTagShortcut is a shortcut function to help the POSTagger short-circuit some decisions about what the tag is.
ReadCluster reads Percy Liang's cluster file format and returns a map of strings to Cluster.

# Constants

predicate dependencies.
RCMod in stanford deps.
Predicate Dependencies.
Modifier Word.
modifier word.
http://universaldependencies.github.io/docs/en/dep/all.html (the reference shared by most of the Universal Dependencies relation constants).
Auxiliary.
Case Marking, preposition, possessive.
Compounding and Unanalyzed.
CC.
Unused in English.
for NER.
Loose Joining Relations.
Other.
MAXTAG is provided here as index support.
Nominal dependencies.
nominal dependencies.
aka NULLTAG.

# Variables

NumberWords was generated with this Python code:

    numberWords = {}
    simple = '''zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty'''.split()
    for i, word in zip(xrange(0, 20+1), simple):
        numberWords[word] = i
    tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
    for i, word in zip(xrange(30, 100+1, 10), tense):
        numberWords[word] = i
    larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
    for i, word in zip(xrange(3, 24+1, 3), larges):
        numberWords[word] = 10**i

# Structs

Annotation is the word and its metadata.
Dependency represents the dependency parse of a sentence.
A DependencyTree is an alternate form of representing a dependency parse.

# Interfaces

Corpus is the interface for the corpus.
Lemmatizer is anything that can lemmatize.
Sentencer is anything that returns an AnnotatedSentence.
Stemmer is anything that can stem.
WordEmbeddings is any type that is both a corpus and can return word vectors.

# Type aliases

AnnotatedSentence is a sentence, but each word has been annotated.
Cluster represents a Brown cluster.
DependencyType represents the relation between two words.
DependencyTypeSet is a set of all the DependencyTypes.
Lexeme Sentence.
POSTag represents a Part of Speech Tag.
Shape represents the shape of a word.
TagSet is a set of all the POSTags.
WordFlags represent the types a word may be.