Module: github.com/jdkato/prose/v2
Version: 2.0.0
Repository: https://github.com/jdkato/prose.git
Documentation: pkg.go.dev

# README


prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get gopkg.in/jdkato/prose.v2

Usage

Contents

Overview

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Tokenizing

prose includes a tokenizer capable of handling modern text, including the non-word character spans shown below.

| Type | Example |
|---|---|
| Email addresses | [email protected] |
| Hashtags | #trending |
| Mentions | @jdkato |
| URLs | https://github.com/jdkato/prose |
| Emoticons | :-), >:(, o_0, etc. |

package main

import (
    "fmt"
    "log"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

| Name | Language | License | GRS (English) | GRS (Other) | Speed† |
|---|---|---|---|---|---|
| Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
| prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
| TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
| OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
| Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
| Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
| Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
| SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
| Scapel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|---|---|---|
| NLTK | 0.893 | 7.224 |
| prose | 0.961 | 2.538 |

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

| TAG | DESCRIPTION |
|---|---|
| ( | left round bracket |
| ) | right round bracket |
| , | comma |
| : | colon |
| . | period |
| '' | closing quotation mark |
| `` | opening quotation mark |
| # | number sign |
| $ | currency |
| CC | conjunction, coordinating |
| CD | cardinal number |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
| IN | conjunction, subordinating or preposition |
| JJ | adjective |
| JJR | adjective, comparative |
| JJS | adjective, superlative |
| LS | list item marker |
| MD | verb, modal auxiliary |
| NN | noun, singular or mass |
| NNP | noun, proper singular |
| NNPS | noun, proper plural |
| NNS | noun, plural |
| PDT | predeterminer |
| POS | possessive ending |
| PRP | pronoun, personal |
| PRP$ | pronoun, possessive |
| RB | adverb |
| RBR | adverb, comparative |
| RBS | adverb, superlative |
| RP | adverb, particle |
| SYM | symbol |
| TO | infinitival to |
| UH | interjection |
| VB | verb, base form |
| VBD | verb, past tense |
| VBG | verb, gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |
| WDT | wh-determiner |
| WP | wh-pronoun, personal |
| WP$ | wh-pronoun, possessive |
| WRB | wh-adverb |

NER

prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geopolitical entities (GPE) by default.

package main

import (
    "fmt"

    "gopkg.in/jdkato/prose.v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

# Functions

Asset loads and returns the asset for the given name.
AssetDir returns the file names below a certain directory embedded in the file by go-bindata.
AssetInfo loads and returns the asset info for the given name.
AssetNames returns the names of the assets.
ModelFromData creates a new Model from user-provided training data.
ModelFromDisk loads a Model from the user-provided location.
MustAsset is like Asset but panics when Asset would return an error.
NewDocument creates a Document according to the user-specified options.
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
RestoreAsset restores an asset under the given directory.
RestoreAssets restores an asset under the given directory recursively.
UsingEntities creates a named-entity recognizer from labeled data.
UsingModel applies a custom Model to the document-creation process.
WithExtraction can enable (the default) or disable named-entity extraction.
WithSegmentation can enable (the default) or disable sentence segmentation.
WithTagging can enable (the default) or disable POS tagging.
WithTokenization can enable (the default) or disable tokenization.

# Structs

DocOpts controls the Document creation process.
A Document represents a parsed body of text.
An Entity represents an individual named-entity.
EntityContext represents text containing named-entities.
LabeledEntity represents an externally-labeled named-entity.
A Model holds the structures and data used internally by prose.
A Sentence represents a segmented portion of text.
A Token represents an individual token of text such as a word or punctuation symbol.

# Type aliases

DataSource provides training data to a Model.
A DocOpt represents a setting that changes the document creation process.
TupleSlice is a slice of tuples in the form (words, tags).