github.com/adrg/strutil
Version: 0.3.1
Repository: https://github.com/adrg/strutil.git
Documentation: pkg.go.dev

# README

strutil

strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.

Installation

go get github.com/adrg/strutil

String metrics

The package defines the StringMetric interface, which all string metrics implement. Pass a metric to the Similarity function to calculate the similarity between two strings.

type StringMetric interface {
    Compare(a, b string) float64
}

func Similarity(a, b string, metric StringMetric) float64

All defined string metrics can be found in the metrics package.
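
Because StringMetric has a single method, a custom metric is easy to plug in. The following is a minimal, self-contained sketch rather than library code: the interface is redeclared locally, `similarity` is a stand-in for strutil.Similarity that is assumed to simply delegate to Compare, and `exactMatch` is a hypothetical metric invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// StringMetric mirrors the interface shown above.
type StringMetric interface {
	Compare(a, b string) float64
}

// similarity is a local stand-in for strutil.Similarity. Delegating
// directly to metric.Compare is an assumption made for this sketch.
func similarity(a, b string, metric StringMetric) float64 {
	return metric.Compare(a, b)
}

// exactMatch is a hypothetical metric: it returns 1 when the strings are
// equal and 0 otherwise, optionally ignoring case.
type exactMatch struct {
	CaseSensitive bool
}

// Compare implements StringMetric.
func (m exactMatch) Compare(a, b string) float64 {
	if a == b || (!m.CaseSensitive && strings.EqualFold(a, b)) {
		return 1
	}
	return 0
}

func main() {
	fmt.Printf("%.2f\n", similarity("Go", "go", exactMatch{}))
	fmt.Printf("%.2f\n", similarity("Go", "go", exactMatch{CaseSensitive: true}))
}
```

Any type with a matching Compare method should be usable wherever a StringMetric is expected.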

Hamming

Calculate similarity.

similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.

ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2
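
To make the distance values above easier to interpret, here is a rough standalone sketch of a Hamming-style distance. It is not the library's implementation; `hammingDistance` is a hypothetical helper, and counting each extra rune of the longer string as one difference is an assumption chosen because it reproduces the distance of 2 for "one" and "once".

```go
package main

import "fmt"

// hammingDistance counts the positions at which a and b differ.
// Assumption: every extra rune of the longer string counts as one
// difference, so inputs of unequal length are accepted.
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) > len(rb) {
		ra, rb = rb, ra // make ra the shorter input
	}
	distance := len(rb) - len(ra)
	for i := range ra {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance
}

func main() {
	fmt.Println(hammingDistance("text", "test")) // prints 1
	fmt.Println(hammingDistance("one", "once"))  // prints 2
}
```

Dividing the distance by the length of the longer string and subtracting from 1 gives 1 - 1/4 = 0.75 for "text" and "test", which is consistent with the similarity shown earlier.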

More information and additional examples can be found on pkg.go.dev.

Levenshtein

Calculate similarity using default options.

similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.

lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1

similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.

lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4
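
The effect of the edit operation costs can be illustrated with a small dynamic-programming sketch. This is a textbook weighted edit distance, not the library's code: `editDistance` and its parameter order are hypothetical, and case folding is left out.

```go
package main

import "fmt"

// editDistance returns the cost of the cheapest sequence of inserts,
// deletes and replacements that turns a into b. It keeps only two rows
// of the dynamic-programming table at a time.
func editDistance(a, b string, insertCost, replaceCost, deleteCost int) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j * insertCost // build b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i * deleteCost // reduce a[:i] to an empty string
		for j := 1; j <= len(rb); j++ {
			best := prev[j-1] // runes match: no extra cost
			if ra[i-1] != rb[j-1] {
				best += replaceCost
			}
			if v := prev[j] + deleteCost; v < best {
				best = v
			}
			if v := curr[j-1] + insertCost; v < best {
				best = v
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(editDistance("graph", "giraffe", 1, 1, 1)) // prints 4
	fmt.Println(editDistance("make", "cake", 1, 2, 1))     // prints 2
}
```

With a replace cost of 2, replacing a rune costs the same as deleting it and inserting another, so both routes yield the same distance for "make" and "cake".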

More information and additional examples can be found on pkg.go.dev.

Jaro

similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78

More information and additional examples can be found on pkg.go.dev.

Jaro-Winkler

similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80

More information and additional examples can be found on pkg.go.dev.

Smith-Waterman-Gotoh

Calculate similarity using default options.

swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.

swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch {
    Match:    1,
    Mismatch: -0.5,
}

similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96

More information and additional examples can be found on pkg.go.dev.

Sorensen-Dice

Calculate similarity using default options.

sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.

sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53

More information and additional examples can be found on pkg.go.dev.

Jaccard

Calculate similarity using default options.

j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.

j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36

The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related. In fact, each coefficient can be calculated from the other.

Sorensen-Dice to Jaccard.

J = SD/(2-SD)

Jaccard to Sorensen-Dice.

SD = 2*J/(1+J)

In both formulas, SD is the Sorensen-Dice coefficient and J is the Jaccard index.
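
The two conversion formulas translate directly into code. The helper names `jaccardFromDice` and `diceFromJaccard` below are hypothetical and not part of the package. The sketch feeds in the rounded Sorensen-Dice outputs from the earlier examples to check that they convert to the Jaccard outputs.

```go
package main

import "fmt"

// jaccardFromDice converts a Sorensen-Dice coefficient to a Jaccard index
// using J = SD/(2-SD).
func jaccardFromDice(sd float64) float64 {
	return sd / (2 - sd)
}

// diceFromJaccard converts a Jaccard index to a Sorensen-Dice coefficient
// using SD = 2*J/(1+J).
func diceFromJaccard(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	fmt.Printf("%.2f\n", jaccardFromDice(0.62)) // prints 0.45
	fmt.Printf("%.2f\n", jaccardFromDice(0.53)) // prints 0.36
	fmt.Printf("%.2f\n", diceFromJaccard(jaccardFromDice(0.62)))
}
```

Applying one conversion and then the other should return the starting value, up to floating-point error.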

More information and additional examples can be found on pkg.go.dev.

Overlap Coefficient

Calculate similarity using default options.

oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.

oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57

More information and additional examples can be found on pkg.go.dev.

Contributing

Contributions in the form of pull requests, issues, or general feedback are always welcome.
See CONTRIBUTING.MD.

License

Copyright (c) 2019 Adrian-George Bostan.

This project is licensed under the MIT license. See LICENSE for more details.

# Packages

No description provided by the author

# Functions

CommonPrefix returns the common prefix of the specified strings.
NgramCount returns the n-gram count of the specified size for the provided term.
NgramIntersection returns a map of the n-grams of the specified size found in both terms, along with their frequency.
NgramMap returns a map of all n-grams of the specified size for the provided term, along with their frequency.
Ngrams returns all the n-grams of the specified size for the provided term.
Similarity returns the similarity of a and b, computed using the specified string metric.
SliceContains returns true if terms contains q, or false otherwise.
UniqueSlice returns a slice containing the unique items from the specified string slice.
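
As a rough illustration of what two of these helpers describe, the sketch below reimplements the behavior with the standard library only. The lowercase names `commonPrefix` and `uniqueSlice` are hypothetical stand-ins, not the package's functions, and preserving first-seen order in `uniqueSlice` is an assumption.

```go
package main

import "fmt"

// commonPrefix returns the longest prefix shared by a and b, compared
// rune by rune.
func commonPrefix(a, b string) string {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return string(ra[:n])
}

// uniqueSlice returns the distinct items of the input.
// Assumption: items keep the order in which they first appear.
func uniqueSlice(items []string) []string {
	seen := make(map[string]struct{}, len(items))
	unique := make([]string, 0, len(items))
	for _, item := range items {
		if _, ok := seen[item]; ok {
			continue
		}
		seen[item] = struct{}{}
		unique = append(unique, item)
	}
	return unique
}

func main() {
	fmt.Println(commonPrefix("answer", "anvil"))        // prints an
	fmt.Println(uniqueSlice([]string{"a", "b", "a"})) // prints [a b]
}
```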

# Interfaces

StringMetric represents a metric for measuring the similarity between strings.