github.com/chewxy/lingo
Version: 0.0.0-20200918122423-491e816b48d4
Repository: https://github.com/chewxy/lingo.git
Documentation: pkg.go.dev

# README

lingo


package lingo provides the data structures and algorithms required for natural language processing.

Specifically, it provides a POS Tagger (lingo/pos), a Dependency Parser (lingo/dep), and a basic tokenizer (lingo/lexer) for English. It also provides data structures for holding corpora (lingo/corpus) and treebanks (lingo/treebank).

The aim of this package is to provide a production quality pipeline for natural language processing.

Install

The package is go-gettable: go get -u github.com/chewxy/lingo

This package and its subpackages depend on very few external packages. Here they are:

| Package | Used For | Vitality | Notes | Licence |
|---------|----------|----------|-------|---------|
| gorgonia | Machine learning | Vital. It won't be hard to rewrite them, but why? | Same author | Gorgonia Licence (Apache 2.0-like) |
| gographviz | Visualization of annotations, and other graph-related visualizations | Vital for visualizations, which are a nice-to-have feature | API last changed 12th April 2017 | gographviz licence (Apache 2.0) |
| errors | Errors | The package won't die without it, but it's a very nice to have | Stable API for the past year | errors licence (MIT/BSD-like) |
| set | Set operations | Can be easily replaced | Stable API for the past year | set licence (MIT/BSD-like) |

Usage

See the individual packages for usage. There is also a set of executables in the cmd directory; they're meant as examples of how a natural language processing pipeline can be set up.

A natural language processing pipeline with this package is heavily channel-driven. Here's an example for dependency parsing:

package main

import (
	"strings"

	"github.com/chewxy/lingo/dep"
	"github.com/chewxy/lingo/lexer"
	"github.com/chewxy/lingo/pos"
)

// posModel and depModel are assumed to have been loaded elsewhere.

func main() {
	inputString := `The cat sat on the mat`
	lx := lexer.New("dummy", strings.NewReader(inputString)) // lexer - required to break a sentence up into words.
	pt := pos.New(pos.WithModel(posModel))                   // POS Tagger - required to tag the words with a part of speech tag.
	dp := dep.New(depModel)                                  // Creates a new parser

	// set up a pipeline
	pt.Input = lx.Output
	dp.Input = pt.Output

	// run all
	go lx.Run()
	go pt.Run()
	go dp.Run()

	// wait to receive:
	for {
		select {
		case d := <-dp.Output:
			// do something with the parse d
			_ = d
		case err := <-dp.Error:
			// handle error
			_ = err
		}
	}
}

How It Works

For specific tasks (POS tagging, parsing, named entity recognition, etc.), refer to the README of each subpackage. This package on its own mainly provides the data structures that the subpackages will use.

Perhaps the most important data structure is the *Annotation structure. It holds a word and the associated metadata for that word.

For dependency parses, the graph takes three forms: *Dependency, *DependencyTree and *Annotation. All three forms are convertible from one to another. TODO: explain rationale behind each data type.
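
To make the three forms concrete, here's a minimal sketch. The summarize helper is hypothetical and not part of this package; only the type names come from lingo:

```go
package example

import "github.com/chewxy/lingo"

// summarize is a hypothetical helper showing that a dependency parse may
// arrive in any of the three forms described above.
func summarize(parse interface{}) string {
	switch parse.(type) {
	case *lingo.Dependency:
		return "a *lingo.Dependency: the dependency parse of a sentence"
	case *lingo.DependencyTree:
		return "a *lingo.DependencyTree: the tree form of the parse"
	case *lingo.Annotation:
		return "a *lingo.Annotation: a word plus its metadata, linked into the parse"
	default:
		return "not a dependency parse"
	}
}
```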

Quirks

Very Oddly Specific POS Tags and Dependency Rel Types

A particular quirk you may have noticed is that the POSTag and DependencyType are hard-coded in as constants. This package in fact provides two variations of each: one from Stanford/Penn Treebank and one from Universal Dependencies.

The main reason for hardcoding these is performance: knowing ahead of time how much to allocate saves the program a lot of additional work. It also reduces the chances of mutating a global variable.

Of course, this comes with a tradeoff: programs are limited to these two options. Thankfully there are only a limited number of POS tag and dependency relation types, and two of the most popular sets (Stanford/PTB and Universal Dependencies) have been implemented.
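
As an illustration of the allocation point, here's a sketch that counts tags into a fixed-size array. It assumes lingo.POSTag is an integer type; lingo.MAXTAG is the constant the package provides as index support:

```go
package example

import "github.com/chewxy/lingo"

// tagCounts counts how often each POS tag occurs. Because the tag set is
// fixed at compile time, a plain array suffices: no map, no resizing.
func tagCounts(tags []lingo.POSTag) [lingo.MAXTAG]int {
	var counts [lingo.MAXTAG]int
	for _, t := range tags {
		counts[t]++
	}
	return counts
}
```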

The following build tags are supported:

  • stanfordtags
  • universaltags
  • stanfordrel
  • universalrel

To use a specific tagset or relset, build your program thusly: go build -tags='stanfordtags'.

The default tag and dependency relation types are the Universal Dependencies versions.
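
If your own code assumes a particular tagset, you can guard it with the same build tags. A sketch (the file and package names are made up; only the stanfordtags tag comes from lingo):

```go
//go:build stanfordtags

// Package ptbonly contains code that only makes sense when lingo was built
// with the Stanford/PTB tag set, i.e. go build -tags='stanfordtags'.
package ptbonly
```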

Lexer

You should also note that the tokenizer, lingo/lexer, is not your usual run-of-the-mill NLP tokenizer. It tokenizes by space, with some specific rules for English. It was inspired by Rob Pike's talk on lexers; I thought it'd be cool to write something like that for NLP.

The test cases in package lingo/lexer showcase how it handles Unicode and other pathological English.
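
Here's a minimal sketch of running the lexer on its own, using the same calls as the pipeline example above. It assumes lx.Output is closed once the input is exhausted, so the range loop terminates; check the lingo/lexer docs and tests before relying on that:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/chewxy/lingo/lexer"
)

func main() {
	// The lexer splits on spaces, with some English-specific rules.
	lx := lexer.New("example", strings.NewReader("Don't panic, it's only £5.50."))
	go lx.Run()

	// Assumption: lx.Output is closed when the lexer is done, so this loop ends.
	// %v is used so nothing is assumed about the lexeme's fields.
	for lex := range lx.Output {
		fmt.Printf("%v\n", lex)
	}
}
```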

Contributing

See CONTRIBUTING.md for more info.

Licence

This package is licenced under the MIT licence.


# Functions

AllocTree allocates the lefts and rights.
AnnotationFromLexTag is only ever used in tests.
FromAnnotatedSentence creates a dependency from an AnnotatedSentence.
POSTag-related functions.
IsIN returns true if the POSTag is a subordinating conjunction.
NewDependency creates a new *Dependency.
POSTagShortcut is a shortcut function to help the POSTagger short-circuit some decisions about what the tag is.
ReadCluster reads Percy Liang's cluster file format and returns a map of strings to Cluster.

# Constants

predicate dependencies.
RCMod in stanford deps.
Predicate Dependencies.
Modifier Word.
modifier word.
http://universaldependencies.github.io/docs/en/dep/all.html (the reference shared by most of the Universal Dependencies relation constants).
Auxiliary.
Case Marking, preposition, possessive.
Compounding and Unanalyzed.
CC.
Unused in English.
for NER.
Loose Joining Relations.
Other.
MAXTAG is provided here as index support.
Nominal dependencies.
nominal dependencies.
aka NULLTAG.

# Variables

NumberWords was generated with this Python code:

    numberWords = {}
    simple = '''zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty'''.split()
    for i, word in zip(xrange(0, 20+1), simple):
        numberWords[word] = i
    tense = '''thirty forty fifty sixty seventy eighty ninety hundred'''.split()
    for i, word in zip(xrange(30, 100+1, 10), tense):
        numberWords[word] = i
    larges = '''thousand million billion trillion quadrillion quintillion sextillion septillion'''.split()
    for i, word in zip(xrange(3, 24+1, 3), larges):
        numberWords[word] = 10**i

# Structs

Annotation is the word and its metadata.
Dependency represents the dependency parse of a sentence.
A DependencyTree is an alternate form of representing a dependency parse.

# Interfaces

Corpus is the interface for the corpus.
Lemmatizer is anything that can lemmatize.
Sentencer is anything that returns an AnnotatedSentence.
Stemmer is anything that can stem.
WordEmbeddings is any type that is both a corpus and can return word vectors.

# Type aliases

AnnotatedSentence is a sentence, but each word has been annotated.
Cluster represents a Brown cluster.
DependencyType represents the relation between two words.
DependencyTypeSet is a set of all the DependencyTypes.
Lexeme Sentence.
POSTag represents a Part of Speech Tag.
Shape represents the shape of a word.
TagSet is a set of all the POSTags.
WordFlags represent the types a word may be.