github.com/gosom/scrapemate
module / package
Version: 0.7.1
Repository: https://github.com/gosom/scrapemate.git
Documentation: pkg.go.dev

# README

scrapemate


Scrapemate is a web crawling and scraping framework written in Go. It is designed to be simple and easy to use, yet powerful enough to handle complex scraping tasks.

Features

  • Low-level API & easy high-level API
  • Customizable retry and error handling
  • JavaScript rendering with the ability to control the browser
  • Screenshot support (when JS rendering is enabled)
  • Capability to write your own result exporter
  • Capability to write results to multiple sinks
  • Default CSV writer
  • Caching (File/LevelDB/Custom)
  • Custom job providers (memory provider included)
  • Headless and headful support when using JS rendering
  • Automatic cookie and session handling
  • Rotating SOCKS5 proxy support

Installation

go get github.com/gosom/scrapemate

Quickstart

package main

import (
	"context"
	"encoding/csv"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
	"github.com/gosom/scrapemate"
	"github.com/gosom/scrapemate/adapters/writers/csvwriter"
	"github.com/gosom/scrapemate/scrapemateapp"
)

func main() {
	csvWriter := csvwriter.NewCsvWriter(csv.NewWriter(os.Stdout))

	cfg, err := scrapemateapp.NewConfig(
		[]scrapemate.ResultWriter{csvWriter},
	)
	if err != nil {
		panic(err)
	}
	app, err := scrapemateapp.NewScrapeMateApp(cfg)
	if err != nil {
		panic(err)
	}
	seedJobs := []scrapemate.IJob{
		&SimpleCountryJob{
			Job: scrapemate.Job{
				ID:     "identity",
				Method: http.MethodGet,
				URL:    "https://www.scrapethissite.com/pages/simple/",
				Headers: map[string]string{
					"User-Agent": scrapemate.DefaultUserAgent,
				},
				Timeout:    10 * time.Second,
				MaxRetries: 3,
			},
		},
	}
	err = app.Start(context.Background(), seedJobs...)
	if err != nil && err != scrapemate.ErrorExitSignal {
		panic(err)
	}
}

type SimpleCountryJob struct {
	scrapemate.Job
}

func (j *SimpleCountryJob) Process(ctx context.Context, resp *scrapemate.Response) (any, []scrapemate.IJob, error) {
	doc, ok := resp.Document.(*goquery.Document)
	if !ok {
		return nil, nil, fmt.Errorf("failed to cast response document to goquery document")
	}
	var countries []Country
	doc.Find("div.col-md-4.country").Each(func(i int, s *goquery.Selection) {
		var country Country
		country.Name = strings.TrimSpace(s.Find("h3.country-name").Text())
		country.Capital = strings.TrimSpace(s.Find("div.country-info span.country-capital").Text())
		country.Population = strings.TrimSpace(s.Find("div.country-info span.country-population").Text())
		country.Area = strings.TrimSpace(s.Find("div.country-info span.country-area").Text())
		countries = append(countries, country)
	})
	return countries, nil, nil
}

type Country struct {
	Name       string
	Capital    string
	Population string
	Area       string
}

func (c Country) CsvHeaders() []string {
	return []string{"Name", "Capital", "Population", "Area"}
}

func (c Country) CsvRow() []string {
	return []string{c.Name, c.Capital, c.Population, c.Area}
}

go mod tidy
go run main.go 1>countries.csv

(hit CTRL-C to exit)
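
The NewConfig call in the quickstart also accepts functional options, which is where the features above (JS rendering, caching, concurrency) are switched on. A minimal sketch, assuming option names such as WithConcurrency, WithJS and WithCache exist in the scrapemateapp package (verify against the package documentation before use):

	// Hedged sketch: the option names below are assumptions inferred from the
	// feature list; check the scrapemateapp docs for the exact API.
	cfg, err := scrapemateapp.NewConfig(
		[]scrapemate.ResultWriter{csvWriter},
		scrapemateapp.WithConcurrency(8),            // number of concurrent workers
		scrapemateapp.WithJS(),                      // render pages with a headless browser
		scrapemateapp.WithCache("leveldb", "cache"), // cache fetched pages on disk
	)
	if err != nil {
		panic(err)
	}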

Documentation

For the High Level API see this example.

Read also how to use the high level API.

For the Low Level API see books.toscrape.com.

Additionally, for the low level API you can read the blog post.

See an example of how you can use scrapemate to scrape Google Maps: https://github.com/gosom/google-maps-scraper

Contributing

Contributions are welcome.

Licence

Scrapemate is licensed under the MIT License. See the LICENCE file.

# Packages

No description provided by the author
Package mock is a generated GoMock package.
No description provided by the author

# Functions

ContextWithLogger returns a new context with the logger.
GetLoggerFromContext returns a logger from the context or a default logger.
New creates a new scrapemate.
WithCache sets the cache for the scrapemate.
WithConcurrency sets the concurrency for the scrapemate.
WithContext sets the context for the scrapemate.
No description provided by the author
WithFailed sets the failed jobs channel for the scrapemate.
WithHTMLParser sets the html parser for the scrapemate.
WithHTTPFetcher sets the http fetcher for the scrapemate.
WithInitJob sets the first job to be processed. It will be processed before the jobs from the job provider. It is useful if you want to start the scraper with a specific job instead of the first one from the job provider; a real use case is when you want to obtain some cookies before starting the scraping process (e.g.
WithJobProvider sets the job provider for the scrapemate.
WithLogger sets the logger for the scrapemate.
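
Taken together, these follow Go's functional-options pattern: New builds a Scrapemate instance and each With* function configures one dependency. A hedged sketch of how they might be composed; the exact signatures (notably whether WithContext also takes a cancel function, and what New returns) are assumptions to verify against the package documentation:

package main

import (
	"context"

	"github.com/gosom/scrapemate"
)

// buildScraper wires the low-level pieces together using the option
// functions listed above. The argument types are the package's own
// interfaces; the option signatures themselves are assumptions.
func buildScraper(provider scrapemate.JobProvider, fetcher scrapemate.HTTPFetcher, parser scrapemate.HTMLParser) (*scrapemate.Scrapemate, error) {
	ctx, cancel := context.WithCancel(context.Background())

	return scrapemate.New(
		scrapemate.WithContext(ctx, cancel),
		scrapemate.WithJobProvider(provider),
		scrapemate.WithHTTPFetcher(fetcher),
		scrapemate.WithHTMLParser(parser),
		scrapemate.WithConcurrency(4),
	)
}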

# Constants

DefaultMaxRetryDelay the default max delay between 2 consecutive retries.
DefaultUserAgent is the default user agent scrapemate uses.
DiscardJob discards the job in case crawling fails.
PriorityHigh high priority.
PriorityLow low priority.
PriorityMedium medium priority.
RefreshIP refreshes the IP and then retries the job.
RetryJob retries the job.
StopScraping exits scraping completely when an error happens.
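
The priority constants are meant to be set on the base Job from the quickstart. A hedged sketch, assuming Job exposes Priority and MaxRetryDelay fields that these constants plug into (the field names are assumptions, not confirmed by this page):

	// Hedged sketch: the Priority and MaxRetryDelay fields are assumptions
	// inferred from the constants above; check the Job struct before use.
	job := scrapemate.Job{
		ID:            "country-page",
		Method:        http.MethodGet,
		URL:           "https://www.scrapethissite.com/pages/simple/",
		Priority:      scrapemate.PriorityHigh,         // picked up before medium/low priority jobs
		MaxRetries:    3,
		MaxRetryDelay: scrapemate.DefaultMaxRetryDelay, // cap the backoff between retries
		Timeout:       10 * time.Second,
	}
	_ = job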

# Variables

ErrInactivityTimeout returned when the system exits because of inactivity.
ErrorConcurrency returned when you try to initialize it with concurrency < 1.
ErrorExitSignal is returned when scrapemate exits because of a system interrupt.
ErrorNoCacher returned when you try to initialize it with a nil Cacher.
ErrorNoContext returned when you try to initialize it with a nil context.
ErrorNoHTMLFetcher returned when you try to initialize it with a nil httpFetcher.
ErrorNoHTMLParser returned when you try to initialize it with a nil HtmlParser.
ErrorNoJobProvider returned when you do not set a job provider at initialization.
ErrorNoLogger returned when you try to initialize it with a nil logger.
ErrorNoCsvCapable returned when you try to write a csv file without csv-capable Data.

# Structs

Job is the base job that we may use.
Response is the struct that is returned when crawling finishes.
Result is the struct of the items that the Results channel carries.
Scrapemate contains unexported fields.

# Interfaces

Cacher is an interface for caching. go:generate mockgen -destination=mock/mock_cacher.go -package=mock .
CsvCapable is an interface for types that can be converted to csv. It is used to convert the Data of a Result to csv.
HTMLParser is an interface for html parsers. go:generate mockgen -destination=mock/mock_parser.go -package=mock .
HTTPFetcher is an interface for http fetchers. go:generate mockgen -destination=mock/mock_http_fetcher.go -package=mock .
IJob is a job to be processed by the scrapemate.
JobProvider is an interface for job providers. A job provider is a service that provides jobs to scrapemate; scrapemate will call the job provider to get jobs. go:generate mockgen -destination=mock/mock_provider.go -package=mock .
ProxyRotator is an interface for proxy rotators. go:generate mockgen -destination=mock/mock_proxy_rotator.go -package=mock .
ResultWriter is an interface for result writers. go:generate mockgen -destination=mock/mock_writer.go -package=mock .
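
The ResultWriter interface is what the "write your own result exporter" feature refers to: the quickstart passes the bundled CSV writer, but any type satisfying the interface can be handed to scrapemateapp.NewConfig. A hedged sketch of a writer that just logs each Result's Data; the Run signature below is an assumption based on how the bundled csvwriter is wired in, so verify it against the interface definition:

package main

import (
	"context"
	"log"

	"github.com/gosom/scrapemate"
)

// logWriter is a hedged sketch of a custom ResultWriter; the Run
// signature is an assumption to verify against the interface.
type logWriter struct{}

// Run drains the results channel and logs each item's Data until the
// channel closes or the context is cancelled.
func (logWriter) Run(ctx context.Context, in <-chan scrapemate.Result) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case result, ok := <-in:
			if !ok {
				return nil
			}
			log.Printf("scraped: %+v", result.Data)
		}
	}
}

Such a writer would then go into the []scrapemate.ResultWriter slice passed to scrapemateapp.NewConfig, alongside or instead of the CSV writer.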

# Type aliases

No description provided by the author