github.com/gosom/scrapemate
module / package
Version: 0.7.1
Repository: https://github.com/gosom/scrapemate.git
Documentation: pkg.go.dev

# README

scrapemate


Scrapemate is a web crawling and scraping framework written in Go. It is designed to be simple and easy to use, yet powerful enough to handle complex scraping tasks.

Features

  • Low-level API & easy high-level API
  • Customizable retry and error handling
  • JavaScript rendering with the ability to control the browser
  • Screenshot support (when JS rendering is enabled)
  • Capability to write your own result exporter
  • Capability to write results to multiple sinks
  • Default CSV writer
  • Caching (File/LevelDB/Custom)
  • Custom job providers (memory provider included)
  • Headless and headful support when using JS rendering
  • Automatic cookie and session handling
  • Rotating SOCKS5 proxy support

Installation

go get github.com/gosom/scrapemate

Quickstart

package main

import (
	"context"
	"encoding/csv"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
	"github.com/gosom/scrapemate"
	"github.com/gosom/scrapemate/adapters/writers/csvwriter"
	"github.com/gosom/scrapemate/scrapemateapp"
)

func main() {
	csvWriter := csvwriter.NewCsvWriter(csv.NewWriter(os.Stdout))

	cfg, err := scrapemateapp.NewConfig(
		[]scrapemate.ResultWriter{csvWriter},
	)
	if err != nil {
		panic(err)
	}
	app, err := scrapemateapp.NewScrapeMateApp(cfg)
	if err != nil {
		panic(err)
	}
	seedJobs := []scrapemate.IJob{
		&SimpleCountryJob{
			Job: scrapemate.Job{
				ID:     "identity",
				Method: http.MethodGet,
				URL:    "https://www.scrapethissite.com/pages/simple/",
				Headers: map[string]string{
					"User-Agent": scrapemate.DefaultUserAgent,
				},
				Timeout:    10 * time.Second,
				MaxRetries: 3,
			},
		},
	}
	err = app.Start(context.Background(), seedJobs...)
	if err != nil && err != scrapemate.ErrorExitSignal {
		panic(err)
	}
}

type SimpleCountryJob struct {
	scrapemate.Job
}

func (j *SimpleCountryJob) Process(ctx context.Context, resp *scrapemate.Response) (any, []scrapemate.IJob, error) {
	doc, ok := resp.Document.(*goquery.Document)
	if !ok {
		return nil, nil, fmt.Errorf("failed to cast response document to goquery document")
	}
	var countries []Country
	doc.Find("div.col-md-4.country").Each(func(i int, s *goquery.Selection) {
		var country Country
		country.Name = strings.TrimSpace(s.Find("h3.country-name").Text())
		country.Capital = strings.TrimSpace(s.Find("div.country-info span.country-capital").Text())
		country.Population = strings.TrimSpace(s.Find("div.country-info span.country-population").Text())
		country.Area = strings.TrimSpace(s.Find("div.country-info span.country-area").Text())
		countries = append(countries, country)
	})
	return countries, nil, nil
}

type Country struct {
	Name       string
	Capital    string
	Population string
	Area       string
}

func (c Country) CsvHeaders() []string {
	return []string{"Name", "Capital", "Population", "Area"}
}

func (c Country) CsvRow() []string {
	return []string{c.Name, c.Capital, c.Population, c.Area}
}

go mod tidy
go run main.go 1>countries.csv

(hit CTRL-C to exit)
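
The NewConfig call in the quickstart also accepts functional options, which is where the features above (JS rendering, caching, concurrency) are switched on. A minimal sketch, assuming option names such as WithConcurrency, WithJS and WithCache exist in the scrapemateapp package (verify against the package documentation before use):

	// Hedged sketch: the option names below are assumptions inferred from the
	// feature list; check the scrapemateapp docs for the exact API.
	cfg, err := scrapemateapp.NewConfig(
		[]scrapemate.ResultWriter{csvWriter},
		scrapemateapp.WithConcurrency(8),            // number of concurrent workers
		scrapemateapp.WithJS(),                      // render pages with a headless browser
		scrapemateapp.WithCache("leveldb", "cache"), // cache fetched pages on disk
	)
	if err != nil {
		panic(err)
	}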

Documentation

For the High Level API see this example.

Read also how to use the high level API.

For the Low Level API see books.toscrape.com.

Additionally, for the low level API you can read the blog post.

See an example of how you can use scrapemate to scrape Google Maps: https://github.com/gosom/google-maps-scraper

Contributing

Contributions are welcome.

Licence

Scrapemate is licensed under the MIT License. See the LICENCE file.

# Packages

No description provided by the author
Package mock is a generated GoMock package.
No description provided by the author

# Functions

ContextWithLogger returns a new context with the logger.
GetLoggerFromContext returns a logger from the context or a default logger.
New creates a new scrapemate.
WithCache sets the cache for the scrapemate.
WithConcurrency sets the concurrency for the scrapemate.
WithContext sets the context for the scrapemate.
No description provided by the author
WithFailed sets the failed jobs channel for the scrapemate.
WithHTMLParser sets the html parser for the scrapemate.
WithHTTPFetcher sets the http fetcher for the scrapemate.
WithInitJob sets the first job to be processed. It will be processed before the jobs from the job provider. It is useful if you want to start the scraper with a specific job instead of the first one from the job provider; a real use case is when you want to obtain some cookies before starting the scraping process (e.g.
WithJobProvider sets the job provider for the scrapemate.
WithLogger sets the logger for the scrapemate.
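
Taken together, these follow Go's functional-options pattern: New builds a Scrapemate instance and each With* function configures one dependency. A hedged sketch of how they might be composed; the exact signatures (notably whether WithContext also takes a cancel function, and what New returns) are assumptions to verify against the package documentation:

package main

import (
	"context"

	"github.com/gosom/scrapemate"
)

// buildScraper wires the low-level pieces together using the option
// functions listed above. The argument types are the package's own
// interfaces; the option signatures themselves are assumptions.
func buildScraper(provider scrapemate.JobProvider, fetcher scrapemate.HTTPFetcher, parser scrapemate.HTMLParser) (*scrapemate.Scrapemate, error) {
	ctx, cancel := context.WithCancel(context.Background())

	return scrapemate.New(
		scrapemate.WithContext(ctx, cancel),
		scrapemate.WithJobProvider(provider),
		scrapemate.WithHTTPFetcher(fetcher),
		scrapemate.WithHTMLParser(parser),
		scrapemate.WithConcurrency(4),
	)
}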

# Constants

DefaultMaxRetryDelay the default max delay between 2 consecutive retries.
DefaultUserAgent is the default user agent scrapemate uses.
DiscardJob discards the job in case crawling fails.
PriorityHigh high priority.
PriorityLow low priority.
PriorityMedium medium priority.
RefreshIP refreshes the IP and then retries the job.
RetryJob retries the job.
StopScraping exits scraping completely when an error happens.
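
The priority constants are meant to be set on the base Job from the quickstart. A hedged sketch, assuming Job exposes Priority and MaxRetryDelay fields that these constants plug into (the field names are assumptions, not confirmed by this page):

	// Hedged sketch: the Priority and MaxRetryDelay fields are assumptions
	// inferred from the constants above; check the Job struct before use.
	job := scrapemate.Job{
		ID:            "country-page",
		Method:        http.MethodGet,
		URL:           "https://www.scrapethissite.com/pages/simple/",
		Priority:      scrapemate.PriorityHigh,         // picked up before medium/low priority jobs
		MaxRetries:    3,
		MaxRetryDelay: scrapemate.DefaultMaxRetryDelay, // cap the backoff between retries
		Timeout:       10 * time.Second,
	}
	_ = job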

# Variables

ErrInactivityTimeout returned when the system exits because of inactivity.
ErrorConcurrency returned when you try to initialize it with concurrency < 1.
ErrorExitSignal is returned when scrapemate exits because of a system interrupt.
ErrorNoCacher returned when you try to initialize it with a nil Cacher.
ErrorNoContext returned when you try to initialize it with a nil context.
ErrorNoHTMLFetcher returned when you try to initialize it with a nil httpFetcher.
ErrorNoHTMLParser returned when you try to initialize it with a nil HtmlParser.
ErrorNoJobProvider returned when you do not set a job provider at initialization.
ErrorNoLogger returned when you try to initialize it with a nil logger.
ErrorNoCsvCapable returned when you try to write a csv file without csv-capable Data.

# Structs

Job is the base job that we may use.
Response is the struct that is returned when crawling finishes.
Result is the struct of the items that the Results channel carries.
Scrapemate contains unexported fields.

# Interfaces

Cacher is an interface for caching. go:generate mockgen -destination=mock/mock_cacher.go -package=mock .
CsvCapable is an interface for types that can be converted to csv. It is used to convert the Data of a Result to csv.
HTMLParser is an interface for html parsers. go:generate mockgen -destination=mock/mock_parser.go -package=mock .
HTTPFetcher is an interface for http fetchers. go:generate mockgen -destination=mock/mock_http_fetcher.go -package=mock .
IJob is a job to be processed by the scrapemate.
JobProvider is an interface for job providers. A job provider is a service that provides jobs to scrapemate; scrapemate will call the job provider to get jobs. go:generate mockgen -destination=mock/mock_provider.go -package=mock .
ProxyRotator is an interface for proxy rotators. go:generate mockgen -destination=mock/mock_proxy_rotator.go -package=mock .
ResultWriter is an interface for result writers. go:generate mockgen -destination=mock/mock_writer.go -package=mock .
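
The ResultWriter interface is what the "write your own result exporter" feature refers to: the quickstart passes the bundled CSV writer, but any type satisfying the interface can be handed to scrapemateapp.NewConfig. A hedged sketch of a writer that just logs each Result's Data; the Run signature below is an assumption based on how the bundled csvwriter is wired in, so verify it against the interface definition:

package main

import (
	"context"
	"log"

	"github.com/gosom/scrapemate"
)

// logWriter is a hedged sketch of a custom ResultWriter; the Run
// signature is an assumption to verify against the interface.
type logWriter struct{}

// Run drains the results channel and logs each item's Data until the
// channel closes or the context is cancelled.
func (logWriter) Run(ctx context.Context, in <-chan scrapemate.Result) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case result, ok := <-in:
			if !ok {
				return nil
			}
			log.Printf("scraped: %+v", result.Data)
		}
	}
}

Such a writer would then go into the []scrapemate.ResultWriter slice passed to scrapemateapp.NewConfig, alongside or instead of the CSV writer.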

# Type aliases

No description provided by the author