Module: github.com/jdkato/prose/v2
Version: 2.0.0
Repository: https://github.com/jdkato/prose.git
Documentation: pkg.go.dev

# README


prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get gopkg.in/jdkato/prose.v2

Usage

Contents

Overview

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Tokenizing

prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.

| Type | Example |
|---|---|
| Email addresses | [email protected] |
| Hashtags | #trending |
| Mentions | @jdkato |
| URLs | https://github.com/jdkato/prose |
| Emoticons | :-), >:(, o_0, etc. |

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

| Name | Language | License | GRS (English) | GRS (Other) | Speed† |
|---|---|---|---|---|---|
| Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
| prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
| TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
| OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
| Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
| Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
| Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
| SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
| Scapel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|---|---|---|
| NLTK | 0.893 | 7.224 |
| prose | 0.961 | 2.538 |

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

| TAG | DESCRIPTION |
|---|---|
| ( | left round bracket |
| ) | right round bracket |
| , | comma |
| : | colon |
| . | period |
| '' | closing quotation mark |
| `` | opening quotation mark |
| # | number sign |
| $ | currency |
| CC | conjunction, coordinating |
| CD | cardinal number |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
| IN | conjunction, subordinating or preposition |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | verb, modal auxiliary |
| NN | noun, singular or mass |
| NNP | noun, proper singular |
| NNPS | noun, proper plural |
| NNS | noun, plural |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | pronoun, personal |
| PRP$ | pronoun, possessive |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | adverb, particle |
| SYM | symbol |
| TO | infinitival to |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | verb, gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |
| WDT | wh-determiner |
| WP | wh-pronoun, personal |
| WP$ | wh-pronoun, possessive |
| WRB | wh-adverb |

NER

prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geopolitical entities (GPE) by default.

package main

import (
    "fmt"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

# Functions

Asset loads and returns the asset for the given name.
AssetDir returns the file names below a certain directory embedded in the file by go-bindata.
AssetInfo loads and returns the asset info for the given name.
AssetNames returns the names of the assets.
ModelFromData creates a new Model from user-provided training data.
ModelFromDisk loads a Model from the user-provided location.
MustAsset is like Asset but panics when Asset would return an error.
NewDocument creates a Document according to the user-specified options.
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
RestoreAsset restores an asset under the given directory.
RestoreAssets restores an asset under the given directory recursively.
UsingEntities creates a named-entity recognizer from labeled data.
UsingModel applies a custom Model to the document-creation process.
WithExtraction can enable (the default) or disable named-entity extraction.
WithSegmentation can enable (the default) or disable sentence segmentation.
WithTagging can enable (the default) or disable POS tagging.
WithTokenization can enable (the default) or disable tokenization.

# Structs

DocOpts controls the Document creation process.
A Document represents a parsed body of text.
An Entity represents an individual named-entity.
EntityContext represents text containing named-entities.
LabeledEntity represents an externally-labeled named-entity.
A Model holds the structures and data used internally by prose.
A Sentence represents a segmented portion of text.
A Token represents an individual token of text such as a word or punctuation symbol.

# Type aliases

DataSource provides training data to a Model.
A DocOpt represents a setting that changes the document creation process.
TupleSlice is a slice of tuples in the form (words, tags).