# scrapemate
Scrapemate is a web crawling and scraping framework written in Golang. It is designed to be simple and easy to use, yet powerful enough to handle complex scraping tasks.
## Features
- Low-level API and an easy high-level API
- Customizable retry and error handling
- JavaScript rendering with the ability to control the browser
- Screenshot support (when JS rendering is enabled)
- Capability to write your own result exporter
- Capability to write results to multiple sinks
- Default CSV writer
- Caching (file/LevelDB/custom)
- Custom job providers (a memory provider is included)
- Headless and headful support when using JS rendering
- Automatic cookie and session handling
- Rotating SOCKS5 proxy support
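The "write your own result exporter" feature revolves around a writer that drains scraped results from a channel until it is closed. The sketch below shows that pattern in a self-contained form; the `Result` and `ResultWriter` types here are simplified stand-ins, not scrapemate's actual interfaces:

```go
package main

import "fmt"

// Result is a simplified stand-in for scrapemate's Result type.
type Result struct {
	Data any
}

// ResultWriter mirrors the shape of a channel-consuming writer:
// it drains results until the channel is closed.
type ResultWriter interface {
	Run(in <-chan Result) error
}

// stdoutWriter prints each result; a real exporter might write
// to a database, a message queue, or a file instead.
type stdoutWriter struct{}

func (w stdoutWriter) Run(in <-chan Result) error {
	for r := range in {
		fmt.Println(r.Data)
	}
	return nil
}

func main() {
	ch := make(chan Result, 2)
	ch <- Result{Data: "Andorra"}
	ch <- Result{Data: "Greece"}
	close(ch)

	var w ResultWriter = stdoutWriter{}
	if err := w.Run(ch); err != nil {
		panic(err)
	}
}
```

Because each writer only sees a channel, several writers can consume copies of the same results, which is how writing to multiple sinks composes.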
## Installation

```shell
go get github.com/gosom/scrapemate
```
## Quickstart

```go
package main

import (
	"context"
	"encoding/csv"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"

	"github.com/gosom/scrapemate"
	"github.com/gosom/scrapemate/adapters/writers/csvwriter"
	"github.com/gosom/scrapemate/scrapemateapp"
)

func main() {
	csvWriter := csvwriter.NewCsvWriter(csv.NewWriter(os.Stdout))

	cfg, err := scrapemateapp.NewConfig(
		[]scrapemate.ResultWriter{csvWriter},
	)
	if err != nil {
		panic(err)
	}

	app, err := scrapemateapp.NewScrapeMateApp(cfg)
	if err != nil {
		panic(err)
	}

	seedJobs := []scrapemate.IJob{
		&SimpleCountryJob{
			Job: scrapemate.Job{
				ID:     "identity",
				Method: http.MethodGet,
				URL:    "https://www.scrapethissite.com/pages/simple/",
				Headers: map[string]string{
					"User-Agent": scrapemate.DefaultUserAgent,
				},
				Timeout:    10 * time.Second,
				MaxRetries: 3,
			},
		},
	}

	err = app.Start(context.Background(), seedJobs...)
	if err != nil && err != scrapemate.ErrorExitSignal {
		panic(err)
	}
}

type SimpleCountryJob struct {
	scrapemate.Job
}

func (j *SimpleCountryJob) Process(ctx context.Context, resp *scrapemate.Response) (any, []scrapemate.IJob, error) {
	doc, ok := resp.Document.(*goquery.Document)
	if !ok {
		return nil, nil, fmt.Errorf("failed to cast response document to goquery document")
	}

	var countries []Country

	doc.Find("div.col-md-4.country").Each(func(i int, s *goquery.Selection) {
		var country Country
		country.Name = strings.TrimSpace(s.Find("h3.country-name").Text())
		country.Capital = strings.TrimSpace(s.Find("div.country-info span.country-capital").Text())
		country.Population = strings.TrimSpace(s.Find("div.country-info span.country-population").Text())
		country.Area = strings.TrimSpace(s.Find("div.country-info span.country-area").Text())
		countries = append(countries, country)
	})

	return countries, nil, nil
}

type Country struct {
	Name       string
	Capital    string
	Population string
	Area       string
}

func (c Country) CsvHeaders() []string {
	return []string{"Name", "Capital", "Population", "Area"}
}

func (c Country) CsvRow() []string {
	return []string{c.Name, c.Capital, c.Population, c.Area}
}
```
Run it:

```shell
go mod tidy
go run main.go 1>countries.csv
```

Hit CTRL-C to exit.
## Documentation

For the high-level API, see this example and read how to use the high-level API.

For the low-level API, see the books.toscrape.com example; you can also read the accompanying blog post.

For a real-world example of how you can use scrapemate to scrape Google Maps, see https://github.com/gosom/google-maps-scraper
## Contributing

Contributions are welcome.
## Licence

Scrapemate is licensed under the MIT License. See the LICENCE file.
# Packages

- Package mock is a generated GoMock package.
# Functions

- `ContextWithLogger` returns a new context with the logger.
- `GetLoggerFromContext` returns the logger from the context, or a default logger.
- `New` creates a new scrapemate.
- `WithCache` sets the cache for the scrapemate.
- `WithConcurrency` sets the concurrency for the scrapemate.
- `WithContext` sets the context for the scrapemate.
- `WithFailed` sets the failed-jobs channel for the scrapemate.
- `WithHTMLParser` sets the HTML parser for the scrapemate.
- `WithHTTPFetcher` sets the HTTP fetcher for the scrapemate.
- `WithInitJob` sets the first job to be processed. It runs before the jobs from the job provider, which is useful when you want to start the scraper with a specific job, for example to obtain some cookies before the scraping process begins.
- `WithJobProvider` sets the job provider for the scrapemate.
- `WithLogger` sets the logger for the scrapemate.
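The `With*` functions above follow Go's functional-options pattern: each returns a closure that configures the instance and can reject invalid settings (mirroring errors like `ErrorConcurrency` below). A self-contained sketch of how such options compose; the `scraper` struct and constructor here are simplified stand-ins, not scrapemate's actual types:

```go
package main

import (
	"context"
	"fmt"
)

// scraper is a simplified stand-in for the Scrapemate struct.
type scraper struct {
	ctx         context.Context
	concurrency int
}

// Option mutates the scraper during construction and may
// return an error for invalid settings.
type Option func(*scraper) error

func WithConcurrency(n int) Option {
	return func(s *scraper) error {
		if n < 1 {
			return fmt.Errorf("concurrency must be >= 1")
		}
		s.concurrency = n
		return nil
	}
}

func WithContext(ctx context.Context) Option {
	return func(s *scraper) error {
		if ctx == nil {
			return fmt.Errorf("nil context")
		}
		s.ctx = ctx
		return nil
	}
}

// New applies each option in order and fails fast on the first error.
func New(opts ...Option) (*scraper, error) {
	s := &scraper{ctx: context.Background(), concurrency: 1}
	for _, opt := range opts {
		if err := opt(s); err != nil {
			return nil, err
		}
	}
	return s, nil
}

func main() {
	s, err := New(WithConcurrency(8))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.concurrency)
}
```

The pattern keeps the constructor signature stable as new options are added, which is why option-heavy libraries like this one favor it.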
# Constants

- `DefaultMaxRetryDelay` is the default maximum delay between two consecutive retries.
- `DefaultUserAgent` is the default user agent scrapemate uses.
- `DiscardJob` just discards the job in case crawling fails.
- `PriorityHigh` high priority.
- `PriorityLow` low priority.
- `PriorityMedium` medium priority.
- `RefreshIP` refreshes the IP and then retries the job.
- `RetryJob` retries the job.
- `StopScraping` exits scraping completely when an error happens.
# Variables

- `ErrInactivityTimeout` is returned when the system exits because of inactivity.
- `ErrorConcurrency` is returned when you try to initialize with concurrency < 1.
- `ErrorExitSignal` is returned when scrapemate exits because of a system interrupt.
- `ErrorNoCacher` is returned when you try to initialize with a nil Cacher.
- `ErrorNoContext` is returned when you try to initialize with a nil context.
- `ErrorNoHTMLFetcher` is returned when you try to initialize with a nil HTTPFetcher.
- `ErrorNoHTMLParser` is returned when you try to initialize with a nil HTMLParser.
- `ErrorNoJobProvider` is returned when you do not set a job provider at initialization.
- `ErrorNoLogger` is returned when you try to initialize with a nil logger.
- `ErrorNoCsvCapable` is returned when you try to write a CSV file with Data that is not CSV-capable.
# Structs

- `Job` is the base job that we may use.
- `Response` is the struct returned when crawling finishes.
- `Result` is the struct of the items carried by the Results channel.
- `Scrapemate` contains unexported fields.
# Interfaces

- `Cacher` is an interface for caches.
- `CsvCapable` is an interface for types that can be converted to CSV. It is used to convert the Data of a Result to CSV.
- `HTMLParser` is an interface for HTML parsers.
- `HTTPFetcher` is an interface for HTTP fetchers.
- `IJob` is a job to be processed by the scrapemate.
- `JobProvider` is an interface for job providers. A job provider is a service that provides jobs; scrapemate calls it to get jobs to process.
- `ProxyRotator` is an interface for proxy rotators.
- `ResultWriter` is an interface for result writers.