github.com/mywrap/textproc

module, package

0.3.2

Repository: https://github.com/mywrap/textproc.git

Documentation: pkg.go.dev

# README

Text processing

Extract information from text and HTML.
Created for daominah/scraper, a scraper that filters near-duplicate news and extracts keywords.

Functions

NormalizeText normalizes different representations of a character.
TextToNGrams creates a set of lowercase n-grams from the input text.
HTMLXPath finds all HTML nodes that match the XPath query.
HTMLGetHREFs returns all URLs (in absolute form) in an HTML document.
HTMLGetText gets the text content of an HTML document (JavaScript and redundant spaces removed).

Example

See text_test.go and html_test.go for details.

# Packages

example

No description provided by the author

# Functions

CheckValidXPath

CheckValidXPath returns nil if the input xPath is valid.

GenRandomVarName

GenRandomVarName returns an alphanumeric string whose first character is a letter.
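The behaviour described above can be sketched with the standard library alone. The function name `randomVarName` and its `length` parameter are assumptions made for illustration; the actual signature and character set of GenRandomVarName may differ.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	letters          = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	lettersAndDigits = letters + "0123456789"
)

// randomVarName is a hypothetical stand-in, not the package's implementation.
// It returns an alphanumeric string of the given length whose first
// character is always a letter, and "" when length <= 0.
func randomVarName(length int) string {
	if length <= 0 {
		return ""
	}
	buf := make([]byte, length)
	// The first character is drawn from letters only.
	buf[0] = letters[rand.Intn(len(letters))]
	// The remaining characters may be letters or digits.
	for i := 1; i < length; i++ {
		buf[i] = lettersAndDigits[rand.Intn(len(lettersAndDigits))]
	}
	return string(buf)
}

func main() {
	fmt.Println(randomVarName(8)) // prints a random 8-character name
}
```

Starting with a letter keeps the result usable as an identifier in most programming languages.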

GenRandomWord

No description provided by the author

HashTextToInt

HashTextToInt is a unique and fast hash func.
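The hashing algorithm behind HashTextToInt is not stated here. As a rough illustration of hashing a text to an integer, the sketch below uses 64-bit FNV-1a from the standard library; `hashText` and its `int64` return type are assumptions, not the package's API. Like any fixed-size hash, it can in principle map two distinct texts to the same value.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashText is a hypothetical sketch: it hashes text with 64-bit FNV-1a.
// The real HashTextToInt may use a different algorithm and return type.
func hashText(text string) int64 {
	h := fnv.New64a()
	h.Write([]byte(text)) // Write on a hash.Hash never returns an error
	return int64(h.Sum64())
}

func main() {
	// The same input always yields the same value.
	fmt.Println(hashText("hello") == hashText("hello")) // true
	fmt.Println(hashText("hello") == hashText("hellp")) // false
}
```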

HTMLGetHREFs

HTMLGetHREFs returns all URLs in the HTML as absolute URLs; URLs that differ only in their fragment are treated as one URL.
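The two rules above (resolve to absolute form, ignore fragments) can be sketched with `net/url`. The helper `absoluteURLs`, its parameters, and the example.com addresses are assumptions for illustration; the sketch starts from already-extracted href values and does not parse HTML.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURLs is a hypothetical helper, not the package's implementation.
// It resolves each href against base, drops the fragment, and removes
// duplicates while keeping first-seen order.
func absoluteURLs(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var result []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip hrefs that cannot be parsed
		}
		abs := baseURL.ResolveReference(ref)
		abs.Fragment = "" // URLs differing only by fragment collapse into one
		key := abs.String()
		if !seen[key] {
			seen[key] = true
			result = append(result, key)
		}
	}
	return result, nil
}

func main() {
	urls, err := absoluteURLs("https://example.com/news/index.html",
		[]string{"/about", "item.html#top", "item.html#comments"})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
	// https://example.com/about
	// https://example.com/news/item.html
}
```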

HTMLGetImgSrc

HTMLGetImgSrc returns the absolute URL of the image.

HTMLGetText

HTMLGetText returns all text in the HTML.

HTMLParseToNode

HTMLParseToNode parses HTML content (string, []byte or io.Reader) into an html.Node (returns an empty node on error).

HTMLXPath

HTMLXPath finds all HTML nodes that match the XPath query.

NormalizeText

There are often several ways to represent the same string.
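To make the problem concrete, the sketch below shows one character, "é", written in two ways that render identically but compare as different strings. It only demonstrates why normalization is needed and does not normalize anything itself; `codePoints` is a hypothetical helper, and which normalization form NormalizeText applies is not stated here.

```go
package main

import "fmt"

// codePoints is a hypothetical helper that lists the Unicode code points
// of s in "U+XXXX" notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	precomposed := "\u00e9" // "é" as a single code point
	decomposed := "e\u0301" // "e" followed by a combining acute accent

	fmt.Println(precomposed == decomposed) // false
	fmt.Println(codePoints(precomposed))   // [U+00E9]
	fmt.Println(codePoints(decomposed))    // [U+0065 U+0301]
}
```

Without normalization, comparing, hashing, or de-duplicating such texts treats the two forms as unrelated strings.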

RemoveRedundantSpace

RemoveRedundantSpace replaces consecutive spaces with a single space.
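A minimal sketch of this idea using the standard library follows. `collapseSpaces` is a hypothetical name; unlike what is documented above, this version also treats tabs and newlines as spaces and trims leading and trailing whitespace, which the real function may or may not do.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseSpaces is a hypothetical stand-in, not the package's implementation.
// strings.Fields splits on any run of whitespace, so joining the fields with
// one space collapses every run and trims both ends.
func collapseSpaces(text string) string {
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("  too    many \t spaces\n here ")) // "too many spaces here"
}
```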

RemoveVietnamDiacritic

RemoveVietnamDiacritic removes Vietnamese diacritics, for example: Đào => Dao.
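One way to picture this mapping is a rune-to-rune lookup table. The sketch below is an assumption about the approach, not the package's code: `removeDiacritics` is a hypothetical name, the table covers only a handful of characters, and it expects precomposed input.

```go
package main

import (
	"fmt"
	"strings"
)

// diacriticTable is a deliberately partial, illustrative table.
// A full table would cover every accented Vietnamese vowel in both cases.
var diacriticTable = map[rune]rune{
	'\u0110': 'D', // Đ
	'\u0111': 'd', // đ
	'\u00e0': 'a', // à
	'\u00e1': 'a', // á
	'\u1ea3': 'a', // ả
	'\u00e3': 'a', // ã
	'\u1ea1': 'a', // ạ
}

// removeDiacritics is a hypothetical stand-in: it replaces each rune found
// in diacriticTable and leaves every other rune unchanged.
func removeDiacritics(text string) string {
	return strings.Map(func(r rune) rune {
		if plain, ok := diacriticTable[r]; ok {
			return plain
		}
		return r
	}, text)
}

func main() {
	fmt.Println(removeDiacritics("\u0110\u00e0o")) // "Đào" becomes "Dao"
}
```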

TextToNGrams

TextToNGrams creates a set of lowercase n-grams from the input text.

TextToWords

TextToWords splits a text into a list of words (punctuation removed).
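A rough sketch of such a split, assuming that every character that is neither a letter nor a digit acts as a separator. `splitWords` is a hypothetical name, and the real function may apply different rules (for example for apostrophes or Vietnamese characters).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords is a hypothetical stand-in, not the package's implementation.
// Any rune that is neither a letter nor a digit separates words, so
// punctuation and whitespace never appear in the result.
func splitWords(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(splitWords("Extract text, then split: words!")) // [Extract text then split words]
}
```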

WordsToNGrams

WordsToNGrams creates a set of n-grams from the input words (an n-gram is a contiguous sequence of n words).
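The definition above can be sketched as a sliding window of n words. `ngramSet` is a hypothetical name, and representing the set as `map[string]bool` with words joined by a single space is an assumption; the package's return type is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// ngramSet is a hypothetical stand-in, not the package's implementation.
// It slides a window of n words over the input and stores each window,
// joined by one space, as a key of the returned set.
func ngramSet(words []string, n int) map[string]bool {
	set := make(map[string]bool)
	if n <= 0 {
		return set
	}
	for i := 0; i+n <= len(words); i++ {
		set[strings.Join(words[i:i+n], " ")] = true
	}
	return set
}

func main() {
	words := []string{"the", "quick", "brown", "fox"}
	for gram := range ngramSet(words, 2) {
		fmt.Println(gram) // "the quick", "quick brown", "brown fox" in any order
	}
}
```

Because the result is a set, repeated n-grams are stored once, which is convenient when comparing two texts by the n-grams they share.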

# Variables

AlphaEnList

Runes grouped by type, used for checking the character type (Vietnamese alphabet).

AlphaNumeric

Runes grouped by type, used for checking the character type (Vietnamese alphabet).

AlphaNumericEnList

Runes grouped by type, used for checking the character type (Vietnamese alphabet).

AlphaNumericList

Runes grouped by type, used for checking the character type (Vietnamese alphabet).