Module: github.com/rivo/uniseg
Version: 0.4.7
Repository: https://github.com/rivo/uniseg.git
Documentation: pkg.go.dev

# README

Unicode Text Segmentation for Go


This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29, Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 15.0.0), and monospace font string width calculation similar to wcwidth.

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points (runes) by iterating with a for range loop or by conversion: []rune(str). However, multiple code points may be combined into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

| String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters |
|--------|---------------|---------------------|-------------------|
| Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65] |
| 🏳️‍🌈 | 14 bytes: f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 4 code points: 1f3f3 fe0f 200d 1f308 | 1 cluster: [1f3f3 fe0f 200d 1f308] |
| 🇩🇪 | 8 bytes: f0 9f 87 a9 f0 9f 87 aa | 2 code points: 1f1e9 1f1ea | 1 cluster: [1f1e9 1f1ea] |

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).

Monospace Width

Most terminals and text editors that use a monospace font (for example, source code editors) give each character a fixed width. Some characters, such as emojis or characters found in Asian and other languages, may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the package documentation for more information.

Installation

go get github.com/rivo/uniseg

Examples

Counting Characters in a String

n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2

Calculating the Monospace String Width

width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
fmt.Println(width)
// 5

Using the Graphemes Class

This is the most convenient method of iterating over grapheme clusters:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]

Using the Step or StepString Function

This avoids allocating a new Graphemes object but it requires the handling of states and boundaries:

str := "🇩🇪🏳️‍🌈"
state := -1
var c string
for len(str) > 0 {
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]

Advanced Examples

The Graphemes class offers the most convenient way to access all functionality of this package. But in some cases, it may be better to use the specialized functions directly. For example, if you're only interested in word segmentation, use FirstWord or FirstWordInString:

str := "Hello, world!"
state := -1
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)

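Similarly, use FirstSentence or FirstSentenceInString for sentence segmentation only, and FirstLineSegment or FirstLineSegmentInString for line breaking only.

If you're only interested in the width of characters, use FirstGraphemeCluster or FirstGraphemeClusterInString. These are much faster than Step, StepString, or the Graphemes class because they do not include the logic for word / sentence / line boundaries.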

Finally, if you need to reverse a string while preserving grapheme clusters, use ReverseString:

fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
// 🏳️‍🌈🇩🇪

Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Add your issue on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.

# Functions

FirstGraphemeCluster returns the first grapheme cluster found in the given byte slice according to the rules of [Unicode Standard Annex #29, Grapheme Cluster Boundaries].
FirstGraphemeClusterInString is like [FirstGraphemeCluster] but its input and outputs are strings.
FirstLineSegment returns the prefix of the given byte slice after which a decision to break the string over to the next line can or must be made, according to the rules of [Unicode Standard Annex #14].
FirstLineSegmentInString is like [FirstLineSegment] but its input and outputs are strings.
FirstSentence returns the first sentence found in the given byte slice according to the rules of [Unicode Standard Annex #29, Sentence Boundaries].
FirstSentenceInString is like [FirstSentence] but its input and outputs are strings.
FirstWord returns the first word found in the given byte slice according to the rules of [Unicode Standard Annex #29, Word Boundaries].
FirstWordInString is like [FirstWord] but its input and outputs are strings.
GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string.
HasTrailingLineBreak returns true if the last rune in the given byte slice is one of the hard line break code points defined in LB4 and LB5 of [UAX #14].
HasTrailingLineBreakInString is like [HasTrailingLineBreak] but for a string.
NewGraphemes returns a new grapheme cluster iterator.
ReverseString reverses the given string while observing grapheme cluster boundaries.
Step returns the first grapheme cluster (user-perceived character) found in the given byte slice.
StepString is like [Step] but its input and outputs are strings.
StringWidth returns the monospace width for the given string, that is, the number of same-size cells to be occupied by the string.

# Constants

LineCanBreak: You may or may not break the line here.
LineDontBreak: You may not break the line here.
LineMustBreak: You must break the line here.
MaskLine, MaskSentence, MaskWord: The bit masks used to extract boundary information returned by [Step].
ShiftWidth: The number of bits to shift the boundary information returned by [Step] to obtain the monospace width of the grapheme cluster.

# Variables

EastAsianAmbiguousWidth specifies the monospace width for East Asian characters classified as Ambiguous.

# Structs

Graphemes implements an iterator over Unicode grapheme clusters, or user-perceived characters.