github.com/yields/ant

Version: 0.0.0-20220301182912-6c7a55c5708c

Repository: https://github.com/yields/ant.git

Documentation: pkg.go.dev

# README

ant (alpha) is a web crawler for Go.

Declarative

The package includes functions that scan data from a page into your structs or slices of structs, which reduces noise and complexity in your source code.

You can also use a jQuery-like API that allows you to scrape complex HTML pages if needed.


var data struct { Title string `css:"title"` }
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
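
To make the tag-driven approach more concrete, here is a minimal sketch of how a scanner could discover the CSS selector attached to each struct field, using only the standard reflect package. It is not ant's actual implementation, and the helper name `selectors` is hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// selectors returns a map of field name to the value of its `css` tag.
// Hypothetical helper for illustration; it is not part of ant.
func selectors(v interface{}) map[string]string {
	out := make(map[string]string)
	if v == nil {
		return out
	}
	t := reflect.TypeOf(v)
	if t.Kind() == reflect.Ptr {
		t = t.Elem() // accept both structs and pointers to structs
	}
	if t.Kind() != reflect.Struct {
		return out
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Fields without a css tag are skipped.
		if sel, ok := f.Tag.Lookup("css"); ok {
			out[f.Name] = sel
		}
	}
	return out
}

func main() {
	var data struct {
		Title string `css:"title"`
		Note  string // no css tag, so it is skipped
	}
	fmt.Println(selectors(&data)) // prints "map[Title:title]"
}
```

A real scanner would then run each selector against the parsed document and assign the matched text to the field.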

Headless

By default the crawler uses http.Client. If you're crawling SPAs, you can use the antcdp.Client implementation, which crawls pages using a headless Chrome browser.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})

Polite

The crawler automatically fetches and caches robots.txt, so that it does not cause problems for owners of small websites. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)
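
For readers unfamiliar with what a polite crawler checks, the following is a simplified sketch of a robots.txt test, assuming only `User-agent: *` groups and `Disallow` prefix rules matter. It ignores `Allow`, wildcards and crawl delays, and it does not show how ant itself parses robots.txt; the function name `allowed` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// allowed reports whether path may be fetched according to the
// "User-agent: *" groups of a robots.txt body. Simplified sketch:
// only Disallow prefix rules are honored.
func allowed(robots, path string) bool {
	applies := false // true while inside a group that names "*"
	prevUA := false  // true if the previous directive was User-agent
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.ToLower(strings.TrimSpace(parts[0]))
		val := strings.TrimSpace(parts[1])
		if key == "user-agent" {
			if !prevUA {
				applies = false // a new group starts here
			}
			applies = applies || val == "*"
			prevUA = true
			continue
		}
		prevUA = false
		if key == "disallow" && applies && val != "" && strings.HasPrefix(path, val) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private\n"
	fmt.Println(allowed(robots, "/private/a")) // prints "false"
	fmt.Println(allowed(robots, "/public"))    // prints "true"
}
```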

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL reads. If you don't have enough resources, you can reduce the number of workers instead.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)
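
The worker model described above can be sketched with a channel and two wait groups. This is an illustration of the pattern only, not the engine's code; `crawl` and `visit` are hypothetical names, and `visit` stands in for fetching and scraping a page.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl starts `workers` goroutines that dequeue URLs from a channel
// and spawn one goroutine per URL. It returns once every URL has
// been visited.
func crawl(urls []string, workers int, visit func(string)) {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan string)
	var workerWG sync.WaitGroup // tracks the worker goroutines
	var visitWG sync.WaitGroup  // tracks the per-URL goroutines

	for i := 0; i < workers; i++ {
		workerWG.Add(1)
		go func() {
			defer workerWG.Done()
			for u := range queue {
				visitWG.Add(1)
				go func(u string) {
					defer visitWG.Done()
					visit(u)
				}(u)
			}
		}()
	}

	for _, u := range urls {
		queue <- u
	}
	close(queue)
	workerWG.Wait() // all URLs dequeued, all visitWG.Add calls done
	visitWG.Wait()  // all visits finished
}

func main() {
	var mu sync.Mutex
	visited := 0
	crawl([]string{"a", "b", "c"}, 2, func(string) {
		mu.Lock()
		visited++
		mu.Unlock()
	})
	fmt.Println(visited) // prints "3"
}
```

In this pattern the workers only bound how quickly URLs are dequeued; the number of in-flight visits is not capped by the worker count.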

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1) // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com") // 5 rps on amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*") // 5 rps on URLs starting with `amazon.com.`.
ant.LimitRegexp(5, `^apple.com\/iphone\/*`) // 5 rps on URLs that match the regex.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.
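
As an illustration of what a custom limiter could look like, here is a minimal per-hostname limiter built on the standard library. The `Limiter` interface below is declared locally and its method signature is an assumption about the shape of ant.Limiter; check the package documentation before relying on it. `hostLimiter`, `newHostLimiter` and `timeLimit` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"sync"
	"time"
)

// Limiter is a local stand-in; the signature is an assumption.
type Limiter interface {
	Limit(ctx context.Context, u *url.URL) error
}

// hostLimiter spaces requests to the same host by a fixed interval.
type hostLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	next     map[string]time.Time // earliest allowed time per host
}

func newHostLimiter(rps int) *hostLimiter {
	if rps < 1 {
		rps = 1
	}
	return &hostLimiter{
		interval: time.Second / time.Duration(rps),
		next:     make(map[string]time.Time),
	}
}

// Limit blocks until a request to u's host is allowed or ctx is done.
func (l *hostLimiter) Limit(ctx context.Context, u *url.URL) error {
	l.mu.Lock()
	now := time.Now()
	at := l.next[u.Host]
	if at.Before(now) {
		at = now
	}
	l.next[u.Host] = at.Add(l.interval) // reserve the next slot
	l.mu.Unlock()

	wait := at.Sub(now)
	if wait <= 0 {
		return nil
	}
	t := time.NewTimer(wait)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// timeLimit measures how long l blocks for rawURL.
func timeLimit(l Limiter, rawURL string) (time.Duration, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	start := time.Now()
	err = l.Limit(context.Background(), u)
	return time.Since(start), err
}

func main() {
	l := newHostLimiter(5) // 5 rps per host
	for _, raw := range []string{"https://example.com/a", "https://example.com/b"} {
		d, err := timeLimit(l, raw)
		fmt.Println(raw, d.Round(10*time.Millisecond), err)
	}
}
```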

Matchers

Another powerful interface is ant.Matcher, which allows you to define URL matchers. The matchers are called before URLs are queued.

ant.MatchHostname("amazon.com") // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")
ant.MatchRegexp(`amazon\.com\/help/.+`)
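
A custom matcher could be written as a plain function, in the same way the package's MatcherFunc type suggests. The sketch below declares `Matcher` and `MatcherFunc` locally; their signatures are assumptions about ant's types rather than copies of them, and `matchScheme`, `matchPathPrefix`, `all` and `matches` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Matcher is a local stand-in; the signature is an assumption.
type Matcher interface {
	Match(u *url.URL) bool
}

// MatcherFunc adapts a function to the Matcher interface.
type MatcherFunc func(u *url.URL) bool

func (f MatcherFunc) Match(u *url.URL) bool { return f(u) }

// matchScheme matches URLs with the given scheme.
func matchScheme(scheme string) Matcher {
	return MatcherFunc(func(u *url.URL) bool {
		return u.Scheme == scheme
	})
}

// matchPathPrefix matches URLs whose path starts with prefix.
func matchPathPrefix(prefix string) Matcher {
	return MatcherFunc(func(u *url.URL) bool {
		return strings.HasPrefix(u.Path, prefix)
	})
}

// all matches only when every given matcher matches.
func all(ms ...Matcher) Matcher {
	return MatcherFunc(func(u *url.URL) bool {
		for _, m := range ms {
			if !m.Match(u) {
				return false
			}
		}
		return true
	})
}

// matches parses rawURL and applies m; unparsable URLs never match.
func matches(m Matcher, rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return m.Match(u)
}

func main() {
	m := all(matchScheme("https"), matchPathPrefix("/help/"))
	fmt.Println(matches(m, "https://amazon.com/help/returns")) // prints "true"
	fmt.Println(matches(m, "http://amazon.com/help/returns"))  // prints "false"
}
```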

Robust

The crawl engine automatically retries any error that implements a Temporary() bool method that returns true.

Because the standard library returns errors that implement that interface, the engine will retry most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)
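
The retry rule can be illustrated with a small standalone loop. This is a sketch under the assumption that "temporary" means the error, or one it wraps, has a `Temporary() bool` method returning true; it is not the engine's retry code, and `retry`, `isTemporary` and `tempErr` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// temporary is the interface many standard library errors implement.
type temporary interface {
	Temporary() bool
}

// isTemporary reports whether err, or any error it wraps, is temporary.
func isTemporary(err error) bool {
	var t temporary
	return errors.As(err, &t) && t.Temporary()
}

// retry calls fn until it succeeds, returns a non-temporary error,
// or maxAttempts calls have been made.
func retry(maxAttempts int, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !isTemporary(err) {
			return err
		}
	}
	return fmt.Errorf("max attempts (%d) reached: %w", maxAttempts, err)
}

// tempErr is a sample error that always reports itself as temporary.
type tempErr struct{}

func (tempErr) Error() string   { return "temporary failure" }
func (tempErr) Temporary() bool { return true }

func main() {
	calls := 0
	err := retry(5, func() error {
		calls++
		if calls < 3 {
			return tempErr{} // fail the first two attempts
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```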

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers to make this ridiculously easy. Here's a full crawler that extracts quotes into stdout.

package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/yields/ant"
)

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}

Testing

The anttest package makes it easy to test your scraper implementation. It fetches a page by URL, caches it in the OS's temporary directory and re-uses it.

The func depends on the file's modtime and the file expires daily. You can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")
_, err := myscraper.Scrape(ctx, page)
assert.NoError(err)
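
The modtime-based expiry can be sketched as follows. This is an illustration of the caching idea only, assuming a file is considered fresh while its modification time is younger than the TTL; it is not anttest's implementation, and `fresh`, `cached` and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// fresh reports whether the file at path exists and was modified
// less than ttl ago.
func fresh(path string, ttl time.Duration) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false
	}
	return time.Since(info.ModTime()) < ttl
}

// cached returns the file contents if the file is fresh; otherwise
// it calls fetch and stores the result at path.
func cached(path string, ttl time.Duration, fetch func() ([]byte, error)) ([]byte, error) {
	if fresh(path, ttl) {
		return os.ReadFile(path)
	}
	body, err := fetch()
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(path, body, 0o600); err != nil {
		return nil, err
	}
	return body, nil
}

// demo requests the same page twice with a daily TTL and returns
// how many times fetch was actually called.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "cache-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "page.html")
	calls := 0
	fetch := func() ([]byte, error) {
		calls++
		return []byte("<html></html>"), nil // placeholder body
	}
	for i := 0; i < 2; i++ {
		if _, err := cached(path, 24*time.Hour, fetch); err != nil {
			return calls, err
		}
	}
	return calls, nil
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err) // prints "1 <nil>"
}
```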

# Packages

antcache

Package antcache implements an HTTP client that caches responses.

antcdp

Package antcdp is an experimental package that implements an `ant.Client` that performs HTTP requests using Chrome and returns a rendered response.

anttest

Package anttest implements scraper test helpers.

# Functions

DedupeBF

DedupeBF returns a new deduper backed by a bloom filter.

DedupeMap

DedupeMap returns a new deduper backed by sync.Map.

Fetch

Fetch fetches a page from a URL.

JSON

JSON returns a new JSON scraper.

Limit

Limit returns a new limiter.

LimitHostname

LimitHostname returns a hostname limiter.

LimitPattern

LimitPattern returns a pattern limiter.

LimitRegexp

LimitRegexp returns a new regexp limiter.

MatchHostname

MatchHostname returns a new hostname matcher.

MatchPattern

MatchPattern returns a new pattern matcher.

MatchRegexp

MatchRegexp returns a new regexp matcher.

MemoryQueue

MemoryQueue returns a new memory queue.

NewEngine

NewEngine returns a new engine.

# Variables

DefaultClient

DefaultClient is the default client to use.

DefaultFetcher

DefaultFetcher is the default fetcher to use.

UserAgent

UserAgent is the default user agent to use.

# Structs

Engine

Engine implements a web crawler engine.

EngineConfig

EngineConfig configures the engine.

Fetcher

Fetcher implements a page fetcher.

FetchError

FetchError represents a fetch error.

Page

Page represents a page.

# Interfaces

Client

Client represents an HTTP client.

Deduper

Deduper represents a URL de-duplicator.

Limiter

Limiter controls how many requests can be made by the engine.

Matcher

Matcher represents a URL matcher.

Queue

Queue represents a URL queue.

Scraper

Scraper represents a scraper.

# Type aliases

LimiterFunc

LimiterFunc implements a limiter.

List

List represents a list of nodes.

MatcherFunc

MatcherFunc implements a Matcher.

StaticAgent

StaticAgent is a static user agent string.

URL

URL represents a parsed URL.

URLs

URLs represents a slice of parsed URLs.