github.com/yhat/scrape
Version: 0.0.0-20161128144610-24b7890b0945
Repository: https://github.com/yhat/scrape.git
Documentation: pkg.go.dev

# README

scrape

A simple, higher level interface for Go web scraping.

When scraping with Go, I find myself redefining tree traversal and other utility functions.

This package is a place to put some simple tools which build on top of the Go HTML parsing library.

For the full interface, check out the godoc.

Sample

Scrape defines traversal functions like Find and FindAll while attempting to be generic. It also defines convenience functions such as Attr and Text.

// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
    // handle error
}
// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
    // Print the title
    fmt.Println(scrape.Text(title))
}

A full example: Scraping Hacker News

package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// close the response body once parsing is done
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}
	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
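
The matcher above climbs two levels by hand and has to guard each step against nil. An upward search of the kind FindParent is described as performing can be sketched as follows. This is a standard-library-only illustration: `Node` is a hypothetical stand-in for `*html.Node`, `findParent` is a helper written for this example, and starting the search at the node's parent is an assumption.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for *html.Node: a class and a parent link.
type Node struct {
	Class  string
	Parent *Node
}

// findParent is an illustrative helper, not the library's code. It starts at
// n's parent (an assumption) and walks up until match returns true or the
// top of the tree has been passed.
func findParent(n *Node, match func(*Node) bool) (*Node, bool) {
	for p := n.Parent; p != nil; p = p.Parent {
		if match(p) {
			return p, true
		}
	}
	return nil, false
}

func main() {
	// Toy version of a row > cell > link chain.
	row := &Node{Class: "athing"}
	cell := &Node{Class: "title", Parent: row}
	link := &Node{Parent: cell}

	isRow := func(n *Node) bool { return n.Class == "athing" }
	if p, ok := findParent(link, isRow); ok {
		fmt.Println("found ancestor with class", p.Class)
	}
}
```

Note the difference in meaning: the hand-written matcher accepts a link only when its grandparent has the class, whereas an upward search accepts any matching ancestor.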

# Functions

Attr returns the value of an HTML attribute.
ByClass returns a Matcher which matches all nodes with the provided class.
ById returns a Matcher which matches all nodes with the provided id.
ByTag returns a Matcher which matches all nodes of the provided tag type.
Find returns the first node which matches the matcher using depth-first search.
FindAll returns all nodes which match the provided Matcher.
FindAllNested returns all nodes which match the provided Matcher and _will_ discover matching subnodes of matching nodes.
FindNextSibling returns the first node which matches the matcher using next sibling search.
FindParent searches up HTML tree from the current node until either a match is found or the top is hit.
FindPrevSibling returns the first node which matches the matcher using previous sibling search.
Text returns text from all descendant text nodes joined.
TextJoin returns a string from all descendant text nodes joined by a caller provided join function.
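
The difference between FindAll and FindAllNested is easiest to see on a small tree. The sketch below is standard-library-only and does not call the package: `Node` is a hypothetical stand-in for `*html.Node`, and `collect` is an illustrative helper whose `nested` flag models the two behaviours as the descriptions above suggest. It is an assumption about the semantics, not the library's code.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for *html.Node, reduced to a tag and children.
type Node struct {
	Tag      string
	Children []*Node
}

// collect is an illustrative helper, not the library's code. It gathers
// matching nodes depth-first. When nested is false, it does not descend
// into a node that has already matched.
func collect(n *Node, match func(*Node) bool, nested bool) []*Node {
	var out []*Node
	if match(n) {
		out = append(out, n)
		if !nested {
			return out
		}
	}
	for _, c := range n.Children {
		out = append(out, collect(c, match, nested)...)
	}
	return out
}

// sampleTree builds body > div > (div > span, p).
func sampleTree() *Node {
	inner := &Node{Tag: "div", Children: []*Node{{Tag: "span"}}}
	outer := &Node{Tag: "div", Children: []*Node{inner, {Tag: "p"}}}
	return &Node{Tag: "body", Children: []*Node{outer}}
}

func main() {
	isDiv := func(n *Node) bool { return n.Tag == "div" }
	root := sampleTree()
	fmt.Println(len(collect(root, isDiv, false))) // outer div only
	fmt.Println(len(collect(root, isDiv, true)))  // outer and inner div
}
```

In this model the non-nested search is the natural choice when matches are containers, such as table rows, and the nested one when every occurrence matters regardless of depth.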

# Type aliases

Matcher should return true when a desired node is found.
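
Because a matcher is just a predicate over a node, small matchers compose into larger ones. The following standard-library-only sketch shows the pattern on a toy type. `Node`, `byTag`, `byClass`, and `all` are hypothetical names written for this example, and treating the class attribute as a whitespace-separated list is an assumption, not a statement about how ByClass is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for *html.Node with only the fields used here.
type Node struct {
	Tag   string
	Class string // raw value of the class attribute, e.g. "athing submission"
}

// Matcher has the same shape as the package's Matcher, but over the toy Node.
type Matcher func(*Node) bool

// byTag is an illustrative constructor: it matches nodes with the given tag.
func byTag(tag string) Matcher {
	return func(n *Node) bool { return n.Tag == tag }
}

// byClass is an illustrative constructor. It treats the class attribute as a
// whitespace-separated list (an assumption) and matches any single entry.
func byClass(class string) Matcher {
	return func(n *Node) bool {
		for _, c := range strings.Fields(n.Class) {
			if c == class {
				return true
			}
		}
		return false
	}
}

// all is a hypothetical combinator: it matches when every given matcher does.
func all(ms ...Matcher) Matcher {
	return func(n *Node) bool {
		for _, m := range ms {
			if !m(n) {
				return false
			}
		}
		return true
	}
}

func main() {
	rowMatcher := all(byTag("tr"), byClass("athing"))
	fmt.Println(rowMatcher(&Node{Tag: "tr", Class: "athing submission"})) // true
	fmt.Println(rowMatcher(&Node{Tag: "td", Class: "athing"}))            // false
}
```

The same shape should carry over to real matchers, since any function that takes a node and returns a bool can be wrapped or combined this way.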