# README
Text processing
Extract information from text and HTML.
Created for daominah/scraper, a scraper that filters near-duplicate
news and extracts keywords.
Functions
- NormalizeText normalizes different representations of a character.
- TextToNGrams creates a set of lowercase n-grams from the input text.
- HTMLXPath finds all HTML nodes that match the XPath query.
- HTMLGetHREFs returns all URLs (in absolute form) in an HTML document.
- HTMLGetText gets the text content of an HTML document (JavaScript and redundant spaces removed).
Example
See text_test.go and html_test.go for details.
# Packages
No description provided by the author
# Functions
CheckValidXPath returns nil if the input xPath is valid.
GenRandomVarName returns an alphanumeric string whose first character is a letter.
No description provided by the author
HashTextToInt is a unique and fast hash func.
HTMLGetHREFs returns all URLs in the HTML as absolute URLs; URLs that differ only in their fragment are treated as one URL.
HTMLGetImgSrc returns the absolute URL of the image.
HTMLGetText returns all text in the HTML.
HTMLParseToNode parses HTML content (string, []byte or io.Reader) into an html.Node (returns an empty node on error).
HTMLXPath finds all HTML nodes that match the XPath query.
There are often several ways to represent the same string.
RemoveRedundantSpace replaces consecutive spaces with a single space.
example: Đào => Dao.
TextToNGrams creates a set of lowercase n-grams from the input text.
TextToWords splits a text into a list of words (punctuation removed).
WordsToNGrams creates a set of n-grams from the input words (an n-gram is a contiguous sequence of n words).
# Variables
Runes grouped by type, used for checking character types (Vietnamese alphabet).
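One common way to group runes by type is a lookup set per group. The sketch below assumes that approach; the variable names `vnLowerLetters` and `vnUpperLetters`, the helper `isVietnameseSpecific`, and the choice of groups are all hypothetical and may not match the package's actual variables.

```go
package main

import (
	"fmt"
	"unicode"
)

// toRuneSet builds a lookup set from a list of runes.
func toRuneSet(rs ...rune) map[rune]bool {
	set := make(map[rune]bool, len(rs))
	for _, r := range rs {
		set[r] = true
	}
	return set
}

// Hypothetical groups: Vietnamese-specific base letters without tone marks.
var (
	// ă â đ ê ô ơ ư
	vnLowerLetters = toRuneSet('\u0103', '\u00E2', '\u0111', '\u00EA', '\u00F4', '\u01A1', '\u01B0')
	// Ă Â Đ Ê Ô Ơ Ư
	vnUpperLetters = toRuneSet('\u0102', '\u00C2', '\u0110', '\u00CA', '\u00D4', '\u01A0', '\u01AF')
)

// isVietnameseSpecific reports whether r belongs to either group above.
func isVietnameseSpecific(r rune) bool {
	return vnLowerLetters[r] || vnUpperLetters[r]
}

func main() {
	for _, r := range "\u0110\u00E0o" { // "Đào"
		fmt.Printf("%c specific=%v upper=%v\n", r, isVietnameseSpecific(r), unicode.IsUpper(r))
	}
}
```

Map-based sets give constant-time membership checks, which matters when every rune of a large text is classified.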