Package: github.com/cdipaolo/goml/text
Version: 0.0.0-20220715001353-00e0c845ae1c
Repository: https://github.com/cdipaolo/goml.git
Documentation: pkg.go.dev

# README

Text Classification

import "github.com/cdipaolo/goml/text"


This package implements text classification algorithms. For algorithms that could also be used numerically (most, if not all, of them), this package makes working with text documents easier than hand-rolling a bag-of-words model and integrating it with other models.

implemented models

  • multiclass naive bayes
  • term frequency - inverse document frequency
    • this model lets you easily calculate keywords from documents, as well as general importance scores for any word (within its document) that you can throw at it!
    • because this is so similar to Bayes under the hood, you train TFIDF by casting a trained Bayes model to it such as tfidf := TFIDF(*myNaiveBayesModel)
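The TF-IDF score itself is simple arithmetic; the following stdlib-only sketch (not goml's API — just the standard formula this model family is built on) shows how a word's score combines its frequency in one document with its rarity across the corpus:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf computes the classic term frequency * inverse document
// frequency score of a word within one document of a corpus.
func tfidf(word, doc string, corpus []string) float64 {
	words := strings.Fields(doc)
	count := 0
	for _, w := range words {
		if w == word {
			count++
		}
	}
	tf := float64(count) / float64(len(words))

	docsWith := 0
	for _, d := range corpus {
		if strings.Contains(" "+d+" ", " "+word+" ") {
			docsWith++
		}
	}
	// add 1 to the denominator to avoid division by zero
	idf := math.Log(float64(len(corpus)) / float64(1+docsWith))
	return tf * idf
}

func main() {
	corpus := []string{
		"the cat sat on the mat",
		"the dog sat on the log",
		"cats and dogs",
	}
	fmt.Printf("%.4f\n", tfidf("cat", corpus[0], corpus))
}
```

Words common to every document score near zero, while words concentrated in one document score high — which is exactly why the model can surface keywords.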

example online naive bayes sentiment analysis

This is the general text classification example from the GoDoc package comment. Look there and at the tests for more detailed and varied examples of usage:

// create the channel of data and errors
stream := make(chan base.TextDatapoint, 100)
errors := make(chan error)

// make a new NaiveBayes model with
// 2 classes expected (classes in
// datapoints will now expect {0,1}.
// in general, given n as the classes
// variable, the model will expect
// datapoint classes in {0,...,n-1})
//
// Note that the model is filtering
// the text to omit anything except
// words and numbers (and spaces
// obviously)
model := NewNaiveBayes(stream, 2, base.OnlyWordsAndNumbers)

go model.OnlineLearn(errors)

stream <- base.TextDatapoint{
	X: "I love the city",
	Y: 1,
}

stream <- base.TextDatapoint{
	X: "I hate Los Angeles",
	Y: 0,
}

stream <- base.TextDatapoint{
	X: "My mother is not a nice lady",
	Y: 0,
}

close(stream)

for {
	err, more := <-errors
	if more {
		fmt.Printf("Error passed: %v", err)
	} else {
		// training is done!
		break
	}
}

// now you can predict like normal
class := model.Predict("My mother is in Los Angeles") // 0
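Under the hood, a multinomial naive Bayes classifier like this scores each class as the class prior times the product of smoothed per-word likelihoods. This stdlib-only sketch (an illustration of the technique, not goml's actual internals) reproduces that arithmetic on the training sentences above:

```go
package main

import (
	"fmt"
	"strings"
)

// classify scores each class as P(y) * Π P(word|y) using
// add-one (Laplace) smoothed word counts, and returns the argmax.
func classify(doc string, docsByClass map[int][]string) int {
	// count words per class and build the vocabulary
	counts := map[int]map[string]int{}
	totals := map[int]int{}
	vocab := map[string]bool{}
	nDocs := 0
	for class, docs := range docsByClass {
		counts[class] = map[string]int{}
		for _, d := range docs {
			nDocs++
			for _, w := range strings.Fields(d) {
				counts[class][w]++
				totals[class]++
				vocab[w] = true
			}
		}
	}

	best, bestScore := 0, -1.0
	for class, docs := range docsByClass {
		// class prior P(y)
		score := float64(len(docs)) / float64(nDocs)
		for _, w := range strings.Fields(doc) {
			// smoothed likelihood P(w|y)
			score *= float64(counts[class][w]+1) /
				float64(totals[class]+len(vocab))
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	train := map[int][]string{
		1: {"i love the city"},
		0: {"i hate los angeles", "my mother is not a nice lady"},
	}
	fmt.Println(classify("my mother is in los angeles", train)) // 0
}
```

The query shares several words ("my", "mother", "los", "angeles") with the negative class, so class 0 wins — matching the library example above.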

# Functions

NewNaiveBayes returns a NaiveBayes model with the given number of classes instantiated, ready to learn from the given data stream.
TermFrequencies gives the term frequency of every word in a document, and is more efficient than computing each word's frequency individually.
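Computing every word's frequency in a single pass over the document is what makes a bulk function cheaper than repeated per-word lookups. A stdlib-only illustration of the idea (not goml's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// termFrequencies counts every word once, in a single pass,
// then normalizes each count by the document length.
func termFrequencies(doc string) map[string]float64 {
	words := strings.Fields(doc)
	freqs := make(map[string]float64, len(words))
	for _, w := range words {
		freqs[w]++
	}
	for w := range freqs {
		freqs[w] /= float64(len(words))
	}
	return freqs
}

func main() {
	fmt.Println(termFrequencies("the cat and the hat")["the"]) // 0.4
}
```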

# Structs

Frequency holds word frequency information so you don't have to hold a map[string]float64 and can sort the results.
NaiveBayes is a general classification model that calculates the probability that a datapoint is part of a class by using Bayes' Rule: P(y|x) = P(x|y)*P(y)/P(x). The unique part of this model is that it assumes words are unrelated to each other.
SimpleTokenizer splits sentences into tokens delimited by its SplitOn string – space, for example.
Word holds the structural information needed to calculate per-class word probabilities.

# Interfaces

Tokenizer accepts a sentence as input and breaks it down into a slice of tokens.
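A custom Tokenizer is just any type that satisfies this contract. Assuming a Tokenize(sentence string) []string method (verify the exact signature against the GoDoc), a hypothetical whitespace tokenizer could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer mirrors the shape of the interface described above:
// turn a sentence into a slice of tokens. (Assumed signature —
// check the package documentation for the real one.)
type Tokenizer interface {
	Tokenize(sentence string) []string
}

// SpaceTokenizer is a hypothetical implementation that lowercases
// the input and splits on runs of whitespace.
type SpaceTokenizer struct{}

func (SpaceTokenizer) Tokenize(sentence string) []string {
	return strings.Fields(strings.ToLower(sentence))
}

func main() {
	var t Tokenizer = SpaceTokenizer{}
	fmt.Println(t.Tokenize("I love the city")) // [i love the city]
}
```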

# Type aliases

Frequencies is a slice of word frequencies (stored as a separate type so it can be sorted).
TFIDF is a Term Frequency - Inverse Document Frequency model that is created from a trained NaiveBayes model (they are very similar under the hood, so you can just train NaiveBayes and convert it into TFIDF). This is not necessarily a probabilistic model, and it doesn't give classifications.