github.com/parsiya/parsia-code/gophercises/04-link

0.0.0-20240228044302-56ad08b2fa1c

Repository: https://github.com/parsiya/parsia-code.git

Documentation: pkg.go.dev

# README

Gophercises - 4 - Link

Problem

Solutions

link: link package.
main: Use link package to extract links from HTML.

Lessons Learned

/x/net/html

Read the package example: https://godoc.org/golang.org/x/net/html

Token struct:

type Token struct {
    Type     TokenType
    DataAtom atom.Atom
    Data     string
    Attr     []Attribute
}

Type tells us what kind of token it is. The important ones for this exercise are:
- StartTagToken: <a href>
- EndTagToken: </a>
- TextToken: The text in between. Collecting only text tokens skips the tags of other elements inside the link but keeps their text.

Data holds the token's data, which depends on its type:
- Anchor tags: the tag name, a.
- Text tokens: The actual text.

Attribute is of type:

type Attribute struct {
    Namespace, Key, Val string
}

Key is the name of the attribute and Val is its value.
- <a href="example.net">: Key = href and Val = example.net.

Parse

Parse is easy.

Go through the nodes. When you reach a start anchor tag, set the capturing flag and store the href.
While capturing, add the text of every text node (trim the whitespace, but add a space between nodes).
When you reach the end anchor tag, stop capturing and add the finished link to the links slice.

Issues:

Nested links are ignored. Child links are not stored and their text is stored as part of the parent link.
- For an example, run: go run main.go -f ex5.html

strings.Builder

Example: https://golang.org/pkg/strings/#example_Builder

var sb strings.Builder  // Create the builder.
sb.WriteString("whatever")  // Write to it. The argument can also come from fmt.Sprintf.
return sb.String()  // Get the final string.
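
A self-contained sketch of the same idea, applied to joining link text. The joinText helper is a hypothetical name used for illustration; it trims each piece and separates the non-empty ones with a single space.

```go
package main

import (
	"fmt"
	"strings"
)

// joinText (hypothetical helper) trims each piece and joins the
// non-empty ones with single spaces using a strings.Builder.
func joinText(pieces []string) string {
	var sb strings.Builder
	for _, p := range pieces {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		if sb.Len() > 0 {
			sb.WriteByte(' ') // separator between pieces
		}
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	fmt.Println(joinText([]string{" Dog ", "\n", "photos"})) // Dog photos

	// A *strings.Builder is also an io.Writer, so fmt.Fprintf can
	// write formatted text into it directly.
	var sb strings.Builder
	fmt.Fprintf(&sb, "%d links", 2)
	fmt.Println(sb.String()) // 2 links
}
```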