Package github.com/cdipaolo/goml/text
Version: 0.0.0-20220715001353-00e0c845ae1c
Repository: https://github.com/cdipaolo/goml.git
Documentation: pkg.go.dev
# README
Text Classification
import "github.com/cdipaolo/goml/text"
This package implements text classification algorithms. For algorithms that could also operate on plain numerical data (most, if not all, of them), this package makes working with text documents easier than hand-rolling a bag-of-words model and integrating it with other models.
Implemented models:
- multiclass naive Bayes
- term frequency - inverse document frequency (TF-IDF)
  - this model lets you easily calculate keywords from documents, as well as general importance scores for any word (within its document) that you can throw at it!
  - because this is so similar to naive Bayes under the hood, you train TF-IDF by casting a trained Bayes model to it (a fuller sketch follows this list):
    tfidf := TFIDF(*myNaiveBayesModel)
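Expanding on that cast, here is a minimal sketch: myNaiveBayesModel is assumed to be a trained *NaiveBayes as in the example below, and MostImportantWords is assumed from the keyword-extraction description above, so verify the exact signature against the package docs.

// convert the trained naive Bayes model into a TFIDF model;
// TFIDF is a type alias of NaiveBayes, so this is a plain Go
// type conversion rather than a new training step
tfidf := TFIDF(*myNaiveBayesModel)

// rank the three most important words in a sentence
// (assumed method; returns a sorted Frequencies value)
keywords := tfidf.MostImportantWords("I love the city of Los Angeles", 3)
fmt.Printf("%+v\n", keywords)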
Example: online naive Bayes sentiment analysis
This is the general text classification example from the GoDoc package comment. Look there and at the tests for more detailed and varied examples of usage:
// create the channel of data and errors
stream := make(chan base.TextDatapoint, 100)
errors := make(chan error)

// make a new NaiveBayes model with
// 2 classes expected (classes in
// datapoints will now expect {0,1}.
// in general, given n as the classes
// variable, the model will expect
// datapoint classes in {0,...,n-1})
//
// Note that the model is filtering
// the text to omit anything except
// words and numbers (and spaces
// obviously)
model := NewNaiveBayes(stream, 2, base.OnlyWordsAndNumbers)

go model.OnlineLearn(errors)

stream <- base.TextDatapoint{
    X: "I love the city",
    Y: 1,
}

stream <- base.TextDatapoint{
    X: "I hate Los Angeles",
    Y: 0,
}

stream <- base.TextDatapoint{
    X: "My mother is not a nice lady",
    Y: 0,
}

close(stream)

// block until training is done; OnlineLearn
// closes the errors channel once the data
// stream is exhausted
for {
    err, more := <-errors
    if more {
        fmt.Printf("Error passed: %v", err)
    } else {
        // training is done!
        break
    }
}

// now you can predict like normal
class := model.Predict("My mother is in Los Angeles") // 0
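If you also want the model's confidence in that prediction, NaiveBayes exposes a Probability method; the (class, probability) return shape shown here is an assumption, so check the package docs before relying on it:

// Probability returns the predicted class together with
// the model's estimated probability for that class
c, p := model.Probability("My mother is in Los Angeles")
fmt.Printf("predicted class %d with probability %f\n", c, p)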
# Functions
NewNaiveBayes returns a NaiveBayes model instantiated with the given number of classes, ready to learn from the given data stream.
TermFrequencies gives the TermFrequency of every word in a document, and is more efficient than calling that function separately for each word.
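A minimal sketch of how that might look, assuming TermFrequencies takes an already-tokenized document as a []string and returns the Frequencies slice described below (both assumptions, so check the package docs):

// compute the term frequency of every word in a
// pre-tokenized document in a single pass
doc := []string{"the", "city", "the", "lady"}
freqs := TermFrequencies(doc)
for _, f := range freqs {
    fmt.Printf("%+v\n", f)
}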
# Structs
Frequency holds word frequency information so you don't have to hold a map[string]float64 and can sort the results.
NaiveBayes is a general classification model that calculates the probability that a datapoint belongs to a class by using Bayes' rule: P(y|x) = P(x|y)*P(y)/P(x). The distinguishing assumption of this model is that it treats words as independent of each other.
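That independence assumption is what makes the model tractable: the likelihood of a sentence x = (x1, ..., xn) factors into per-word terms, so training reduces to counting word occurrences per class:

P(y|x) ∝ P(y) * P(x1|y) * P(x2|y) * ... * P(xn|y)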
SimpleTokenizer splits sentences into tokens delimited by its SplitOn string (a single space, for example).
Word holds the structural information needed to calculate the probability of a word occurring within each class.
# Interfaces
Tokenizer accepts a sentence as input and breaks it down into a slice of tokens.
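For illustration, a minimal sketch of the interface in use, assuming SimpleTokenizer exposes a SplitOn field and a Tokenize method (both names taken from the descriptions above, so treat the exact shape as an assumption):

// a Tokenizer that splits sentences on single spaces
var tok Tokenizer = &SimpleTokenizer{SplitOn: " "}
tokens := tok.Tokenize("I love the city")
fmt.Println(tokens) // a slice of word tokens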
# Type aliases
Frequencies is a slice of word frequencies (stored as a separate type so it can be sorted).
TFIDF is a term frequency - inverse document frequency model that is created from a trained NaiveBayes model (they are very similar under the hood, so you can just train a NaiveBayes model and convert it into a TFIDF model). This is not necessarily a probabilistic model, and it doesn't give classifications.
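To make that concrete, here is a hedged sketch of scoring one word against a document, assuming the TFIDF type exposes a TFIDF(word, sentence) method returning a float64 score (the method name and signature are assumptions; see the package docs):

// cast a trained naive Bayes model and compute one word's
// tf-idf importance score within a document (assumed method)
t := TFIDF(*myNaiveBayesModel)
score := t.TFIDF("angeles", "I hate Los Angeles")
fmt.Printf("tf-idf score: %f\n", score)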