github.com/mywrap/textproc
Version: 0.3.2
Repository: https://github.com/mywrap/textproc.git
Documentation: pkg.go.dev

# README

Text processing

Extract information from text and HTML.
Created for daominah/scraper, a scraper that filters near-duplicate news and extracts keywords.

Functions

  • NormalizeText normalizes different representations of the same character.

  • TextToNGrams creates a set of lowercase n-grams from the input text.

  • HTMLXPath finds all HTML nodes that match the XPath query.

  • HTMLGetHREFs returns all URLs (in absolute form) found in an HTML document.

  • HTMLGetText gets the text content of an HTML document (JavaScript and redundant spaces removed).

Example

See text_test.go and html_test.go for details.
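
The n-gram idea behind TextToNGrams and WordsToNGrams can be sketched with the standard library alone. The snippet below is an illustration, not the package's own code: the helper name wordsToNGrams, its signature, and the choice to join words with a single space are all assumptions, and splitting on whitespace stands in for the package's real word splitting.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// wordsToNGrams is a hypothetical helper, not part of textproc.
// It returns the set of n-grams in words, where each n-gram is
// n contiguous words joined by a single space (an assumed format).
func wordsToNGrams(words []string, n int) map[string]struct{} {
	grams := make(map[string]struct{})
	if n <= 0 || n > len(words) {
		return grams // no n-gram of that size exists
	}
	for i := 0; i+n <= len(words); i++ {
		grams[strings.Join(words[i:i+n], " ")] = struct{}{}
	}
	return grams
}

func main() {
	// Lowercase and split on whitespace; the real package also strips punctuation.
	words := strings.Fields(strings.ToLower("The quick brown fox"))
	grams := wordsToNGrams(words, 2)

	// Sort the keys so the output order is deterministic.
	keys := make([]string, 0, len(grams))
	for g := range grams {
		keys = append(keys, g)
	}
	sort.Strings(keys)
	fmt.Printf("%q\n", keys) // ["brown fox" "quick brown" "the quick"]
}
```

Returning a set rather than a list means repeated n-grams collapse into one entry, which suits near-duplicate detection where only the presence of an n-gram matters.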

# Packages

No description provided by the author

# Functions

CheckValidXPath returns nil if the input xPath is valid.
GenRandomVarName returns an alphanumeric string whose first character is a letter.
No description provided by the author
HashTextToInt is a unique and fast hash func.
HTMLGetHREFs returns all URLs in the HTML as absolute URLs; URLs that differ only in their fragment are treated as one URL.
HTMLGetImgSrc returns the absolute URL of the image.
HTMLGetText returns all text in the HTML.
HTMLParseToNode parses HTML content (string, []byte or io.Reader) into an html.Node (returns an empty node on error).
HTMLXPath finds all HTML nodes that match the XPath query.
There are often several ways to represent the same string.
RemoveRedundantSpace replaces consecutive spaces with one space.
example: Đào => Dao.
TextToNGrams creates a set of lowercase n-grams from the input text.
TextToWords splits a text into a list of words (punctuation removed).
WordsToNGrams creates a set of n-grams from the input words (an n-gram is a contiguous sequence of n words).
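
The behavior described for HTMLGetHREFs (absolute URLs, with fragments ignored) can be approximated for a single link using only net/url. This is a minimal sketch under stated assumptions, not the package's implementation: the helper name absoluteNoFragment and the example.com URLs are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteNoFragment is a hypothetical helper, not part of textproc.
// It resolves href against base and drops the fragment, so links that
// differ only after '#' map to the same string.
func absoluteNoFragment(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	abs := b.ResolveReference(h)
	abs.Fragment = "" // ignore the fragment when comparing URLs
	return abs.String(), nil
}

func main() {
	base := "https://example.com/news/index.html" // assumed page URL
	seen := make(map[string]struct{})
	for _, href := range []string{"a.html#top", "a.html#bottom", "/about"} {
		u, err := absoluteNoFragment(base, href)
		if err != nil {
			continue // skip links that fail to parse
		}
		seen[u] = struct{}{}
	}
	fmt.Println(len(seen)) // 2
}
```

The two a.html links collapse into one entry because they differ only in their fragment.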

# Variables

Runes grouped by type, used for checking the type of a character (Vietnamese alphabet).