Exploring HTML structure

HTML is parsed using golang.org/x/net/html, which produces a tree.

The module provides basic functionality to compare HTML tags (nodes) and their trees. Searching for an HTML tag using a *html.Node value ignores pointers and always returns the first match. Because some properties are ignored, tags like <button> are easy to count. The text value of a tag (title, error message, ...) can also be checked.

Good to know

Parsing does not apply a complete HTML syntax check. For instance, tags like <p>, whose closing tag may be omitted, can fail a comparison.

Siblings must always have the same order or comparison fails. Order of attributes is treated as irrelevant.

How to start

Detailed documentation includes examples.

Versions

v1.0.6 updates the golang.org/x/net package to remove CVE-2022-27664, which does not affect x/net/html.
v1.0.5 requires Go 1.16+ as use of the ioutil package is removed.
v1.0.4 requires Go 1.17+ which implements lazy loading of modules to avoid go.mod updates.
v1.0.0 was created on Go 1.12 which supports modules.

# Functions

AttrIncluded

AttrIncluded returns true if the list of attributes of n is included in that of reference node m, whatever their order.
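As an illustration of order-independent attribute inclusion, here is a minimal sketch. It is not the module's implementation: the Attribute type is a hypothetical stand-in mirroring the Key and Val fields of html.Attribute so that the example runs with the standard library alone, and the name attrIncluded and its slice-based signature are assumptions.

```go
package main

import "fmt"

// Attribute is a hypothetical stand-in for html.Attribute (Key and Val only).
type Attribute struct {
	Key, Val string
}

// attrIncluded reports whether every attribute of n also appears in m,
// whatever the order. Name and signature are assumptions for illustration.
func attrIncluded(m, n []Attribute) bool {
	seen := make(map[Attribute]bool, len(m))
	for _, a := range m {
		seen[a] = true
	}
	for _, a := range n {
		if !seen[a] {
			return false
		}
	}
	return true
}

func main() {
	ref := []Attribute{{"class", "btn"}, {"type", "submit"}}
	fmt.Println(attrIncluded(ref, []Attribute{{"type", "submit"}, {"class", "btn"}}))
	fmt.Println(attrIncluded(ref, []Attribute{{"type", "reset"}}))
}
```

The first call succeeds although the attributes are listed in a different order; the second fails because the value differs.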

Equal

Equal returns true if all fields of nodes m and n are equal, except pointers. reflect.DeepEqual(tag1, tag2) is unusable as pointers are checked too.
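The difference with reflect.DeepEqual can be sketched as follows. The Node type is a hypothetical, reduced stand-in for html.Node (attributes are left out for brevity), and equalIgnoringPointers is an illustrative name, not the module's function.

```go
package main

import (
	"fmt"
	"reflect"
)

// Node is a hypothetical stand-in keeping only a few fields of html.Node.
type Node struct {
	Parent    *Node
	Type      int
	Data      string
	Namespace string
}

// equalIgnoringPointers compares the value fields only and never follows
// Parent, so two tags located in different places can still be equal.
func equalIgnoringPointers(m, n *Node) bool {
	if m == nil || n == nil {
		return m == n
	}
	return m.Type == n.Type && m.Data == n.Data && m.Namespace == n.Namespace
}

func main() {
	a := &Node{Parent: &Node{Data: "body"}, Data: "p"}
	b := &Node{Parent: &Node{Data: "div"}, Data: "p"}
	fmt.Println(reflect.DeepEqual(a, b))     // false: the parents differ
	fmt.Println(equalIgnoringPointers(a, b)) // true: same tag
}
```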

ExploreNode

ExploreNode prints node tags with name s and type t. Without a name, all tags are printed. When the type is ErrorNode (iota == 0), tags of all types are printed.
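The filtering rule described above can be sketched like this. Node and NodeType are hypothetical stand-ins mirroring html.Node and html.NodeType, and explore is an assumed name that returns the matching values instead of printing them, which keeps the example easy to check.

```go
package main

import "fmt"

// NodeType mirrors the first values of html.NodeType.
type NodeType int

const (
	ErrorNode NodeType = iota // zero value, treated here as "any type"
	TextNode
	DocumentNode
	ElementNode
)

// Node is a hypothetical stand-in for html.Node.
type Node struct {
	FirstChild, NextSibling *Node
	Type                    NodeType
	Data                    string
}

// explore collects Data of every node matching name and type t.
// An empty name matches any name; ErrorNode matches any type.
func explore(n *Node, name string, t NodeType) []string {
	if n == nil {
		return nil
	}
	var out []string
	if (name == "" || n.Data == name) && (t == ErrorNode || n.Type == t) {
		out = append(out, n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, explore(c, name, t)...)
	}
	return out
}

func main() {
	// Hand-built tree for <p>hi</p>.
	txt := &Node{Type: TextNode, Data: "hi"}
	p := &Node{Type: ElementNode, Data: "p", FirstChild: txt}
	fmt.Println(explore(p, "", ErrorNode))   // [p hi]
	fmt.Println(explore(p, "", ElementNode)) // [p]
}
```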

FindNode

FindNode finds the first occurrence of a node.

FindTag

FindTag finds the first occurrence of a tag name (i.e.

FindTags

FindTags finds all occurrences of a tag name whatever their attributes.
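Collecting every occurrence of a tag name amounts to a depth-first walk of the tree. The sketch below assumes a hypothetical Node type mirroring the FirstChild, NextSibling, Type and Data fields of html.Node; findTags is an illustrative name and not the module's code.

```go
package main

import "fmt"

// ElementNode has the same value as html.ElementNode.
const ElementNode = 3

// Node is a hypothetical stand-in for html.Node.
type Node struct {
	FirstChild, NextSibling *Node
	Type                    int
	Data                    string
}

// findTags walks the tree depth-first and returns every element named name,
// whatever its attributes.
func findTags(root *Node, name string) []*Node {
	var found []*Node
	var walk func(*Node)
	walk = func(n *Node) {
		if n == nil {
			return
		}
		if n.Type == ElementNode && n.Data == name {
			found = append(found, n)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return found
}

func main() {
	// Hand-built tree for <form><button></button><p><button></button></p></form>.
	inner := &Node{Type: ElementNode, Data: "button"}
	p := &Node{Type: ElementNode, Data: "p", FirstChild: inner}
	first := &Node{Type: ElementNode, Data: "button", NextSibling: p}
	form := &Node{Type: ElementNode, Data: "form", FirstChild: first}
	fmt.Println(len(findTags(form, "button"))) // 2
}
```

Taking the first element of the result gives the first match in document order.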

GetText

GetText prints the text content of a tree structure, like PrintNodes without any formatting. TODO: check usage of the (*Tokenizer) Text equivalent in the net/html package.
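Extracting unformatted text can be sketched as concatenating the Data of every text node in document order. The Node type is a hypothetical stand-in for html.Node, and text is an assumed helper that returns the string rather than printing it.

```go
package main

import (
	"fmt"
	"strings"
)

// Same values as html.TextNode and html.ElementNode.
const (
	TextNode    = 1
	ElementNode = 3
)

// Node is a hypothetical stand-in for html.Node.
type Node struct {
	FirstChild, NextSibling *Node
	Type                    int
	Data                    string
}

// text concatenates the content of all text nodes under root, depth-first.
func text(root *Node) string {
	var sb strings.Builder
	var walk func(*Node)
	walk = func(n *Node) {
		if n == nil {
			return
		}
		if n.Type == TextNode {
			sb.WriteString(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return sb.String()
}

func main() {
	// Hand-built tree for <p>Hello <b>world</b></p>.
	world := &Node{Type: TextNode, Data: "world"}
	b := &Node{Type: ElementNode, Data: "b", FirstChild: world}
	hello := &Node{Type: TextNode, Data: "Hello ", NextSibling: b}
	p := &Node{Type: ElementNode, Data: "p", FirstChild: hello}
	fmt.Println(text(p)) // Hello world
}
```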

IdenticalNodes

IdenticalNodes fails if trees have different size.
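A tree comparison in which sibling order matters and trees of different size fail could look like the sketch below. The Node type and the el helper are hypothetical (only the tag name is compared here), and identical is not the module's function.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for html.Node, reduced to what is compared.
type Node struct {
	FirstChild, NextSibling *Node
	Data                    string
}

// el builds an element and links its children as ordered siblings.
func el(name string, children ...*Node) *Node {
	n := &Node{Data: name}
	for i := len(children) - 1; i >= 0; i-- {
		children[i].NextSibling = n.FirstChild
		n.FirstChild = children[i]
	}
	return n
}

// identical compares two trees node by node; children are paired in order,
// so reordered siblings or a missing child make the comparison fail.
func identical(m, n *Node) bool {
	if m == nil || n == nil {
		return m == n
	}
	if m.Data != n.Data {
		return false
	}
	c, d := m.FirstChild, n.FirstChild
	for c != nil && d != nil {
		if !identical(c, d) {
			return false
		}
		c, d = c.NextSibling, d.NextSibling
	}
	return c == nil && d == nil
}

func main() {
	ref := el("div", el("h1"), el("p"))
	fmt.Println(identical(ref, el("div", el("h1"), el("p")))) // true
	fmt.Println(identical(ref, el("div", el("p"), el("h1")))) // false: order
	fmt.Println(identical(ref, el("div", el("h1"))))          // false: size
}
```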

IncludedNode

IncludedNode checks if n is included in m.

IncludedNodeTyped

IncludedNodeTyped is like IncludedNode, but only tags of type t are compared.

IsTextNode

IsTextNode checks the presence of a node and its text value in a buffer.

IsTextTag

IsTextTag checks the presence of a tag and its text value in a buffer.

ParseFile

ParseFile returns a *Node containing the parsed file or an error (file or parsing).

PrintData

PrintData returns a string with Node information (not its relationships). A nil node will panic.

PrintNodes

PrintNodes prints the tree structure of node m until a node equal to n is found.

PrintTags

PrintTags prints the node structure until a tag name is found (whatever its attributes). Without a name, all tags are printed. tagOnly selects ElementNode; otherwise tags are printed whatever their type.
