Module: github.com/gocolly/colly/v2
Version: 2.1.0
Repository: https://github.com/gocolly/colly.git
Documentation: pkg.go.dev

# README

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain (see the sketch after this list)
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic decoding of non-Unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions
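
Two of the features above are worth a concrete illustration: per-domain delays and parallelism are configured with a LimitRule, and asynchronous scraping is enabled with the Async option. A minimal sketch (the domain glob, delay and parallelism values are placeholders):

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async(true) makes Visit return immediately; Wait blocks until all
	// pending requests have finished.
	c := colly.NewCollector(colly.Async(true))

	// Allow at most 2 parallel requests to matching domains, 1s apart.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*go-colly.org*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
	c.Wait()
}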

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}
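
Failed requests can be observed by registering an OnError callback on the same collector; a short sketch (the log message is only illustrative):

// OnError is called when the request fails or the server
// responds with an error status code.
c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request to", r.Request.URL, "failed:", err)
})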

See the examples folder for more detailed examples.

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 v2.1.0
)
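
Alternatively, running the go tool from inside your module adds the requirement and records a concrete version in go.mod for you:

go get github.com/gocolly/colly/v2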

Bugs

Bugs or suggestions? Visit the issue tracker or join the #colly channel on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project, please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

Colly is released under the Apache License 2.0; see the LICENSE file in the repository for the full text.

# Packages

Package extensions implements various helper addons for Colly.
The module's other subpackages (debug, proxy, queue and storage, among others) do not provide a package description.
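
The extensions package provides small, ready-made helpers that hook into a Collector's callbacks. A minimal sketch using two of them:

package main

import (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	c := colly.NewCollector()

	// RandomUserAgent sets a randomly chosen User-Agent header on every request.
	extensions.RandomUserAgent(c)
	// Referer sets the Referer header to the URL the link was found on.
	extensions.Referer(c)

	c.Visit("http://go-colly.org/")
}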

# Functions

AllowedDomains sets the domain whitelist used by the Collector.
AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL.
Async turns on asynchronous network requests.
CacheDir specifies the location where GET requests are cached as files.
CheckHead performs a HEAD request before every GET to pre-validate the response.
Debugger sets the debugger used by the Collector.
DetectCharset enables character-encoding detection for non-UTF-8 response bodies that lack an explicit charset declaration.
DisallowedDomains sets the domain blacklist used by the Collector.
DisallowedURLFilters sets the list of regular expressions which restricts visiting URLs.
ID sets the unique identifier of the Collector.
IgnoreRobotsTxt instructs the Collector to ignore any restrictions set by the target host's robots.txt file.
MaxBodySize sets the limit of the retrieved response body in bytes.
MaxDepth limits the recursion depth of visited URLs.
NewCollector creates a new Collector instance with default configuration.
NewContext initializes a new Context instance.
NewHTMLElementFromSelectionNode creates an HTMLElement from a goquery.Selection node.
NewXMLElementFromHTMLNode creates an XMLElement from an html.Node.
NewXMLElementFromXMLNode creates an XMLElement from an xmlquery.Node.
ParseHTTPErrorResponse allows parsing responses with HTTP errors.
SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.
TraceHTTP instructs the Collector to collect and report request trace data on the Response.Trace.
UnmarshalHTML declaratively extracts text or attributes from an HTML response into a struct, using struct tags composed of CSS selectors.
URLFilters sets the list of regular expressions which restricts visiting URLs.
UserAgent sets the user agent used by the Collector.
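
Most of the functions above are functional options passed to NewCollector. A minimal sketch combining several of them (the domains, depth, cache directory and User-Agent string are placeholders):

package main

import "github.com/gocolly/colly/v2"

func main() {
	c := colly.NewCollector(
		// Only visit URLs on the listed domains.
		colly.AllowedDomains("go-colly.org", "www.go-colly.org"),
		// Follow links at most two levels deep.
		colly.MaxDepth(2),
		// Cache GET responses on disk between runs.
		colly.CacheDir("./colly_cache"),
		// Identify the scraper with a custom User-Agent header.
		colly.UserAgent("my-colly-bot/1.0"),
	)

	c.Visit("http://go-colly.org/")
}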

# Constants

ProxyURLKey is the context key for the request proxy address.

# Variables

ErrAbortedAfterHeaders is the error returned when OnResponseHeaders aborts the transfer.
ErrAlreadyVisited is the error type for already visited URLs.
ErrEmptyProxyURL is the error type for empty Proxy URL list.
ErrForbiddenDomain is the error returned when visiting a domain that is not allowed by AllowedDomains.
ErrForbiddenURL is the error returned when visiting a URL that is not allowed by URLFilters.
ErrMaxDepth is the error type for exceeding max depth.
ErrMissingURL is the error type for missing URL errors.
ErrNoCookieJar is the error type for missing cookie jar.
ErrNoPattern is the error type for LimitRules without patterns.
ErrNoURLFiltersMatch is the error returned when a visited URL matches none of the URLFilters.
ErrQueueFull is the error returned when the queue is full.
ErrRobotsTxtBlocked is the error type for robots.txt errors.
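
These sentinel errors are returned by methods such as Visit, so callers can branch on them. A minimal sketch using errors.Is (the URLs are placeholders):

package main

import (
	"errors"
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(colly.AllowedDomains("go-colly.org"))

	c.Visit("http://go-colly.org/")

	// Revisiting the same URL is rejected with ErrAlreadyVisited by default.
	if err := c.Visit("http://go-colly.org/"); errors.Is(err, colly.ErrAlreadyVisited) {
		fmt.Println("already visited, skipping")
	}

	// A domain outside AllowedDomains is rejected with ErrForbiddenDomain.
	if err := c.Visit("http://example.com/"); errors.Is(err, colly.ErrForbiddenDomain) {
		fmt.Println("domain not allowed")
	}
}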

# Structs

Collector provides the scraper instance for a scraping job.
Context provides a tiny layer for passing data between callbacks.
HTMLElement is the representation of an HTML tag.
HTTPTrace provides a data structure for storing an HTTP trace.
LimitRule provides connection restrictions for domains.
Request is the representation of an HTTP request made by a Collector.
Response is the representation of an HTTP response received by a Collector.
XMLElement is the representation of an XML tag.
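
HTMLElement exposes helpers such as ChildText and ChildAttr for the matched node, and Context carries values from a request callback to the callbacks handling its response. A minimal sketch (the selectors and context key are placeholders):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Store a value on the request's Context before the request is sent.
	c.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("started", "true")
	})

	// HTMLElement gives access to the matched tag and its children.
	c.OnHTML("article", func(e *colly.HTMLElement) {
		title := e.ChildText("h1")
		link := e.ChildAttr("a[href]", "href")
		// The same Context instance is shared with the request callback above.
		fmt.Println(title, link, e.Request.Ctx.Get("started"))
	})

	c.Visit("http://go-colly.org/")
}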

# Type aliases

A CollectorOption sets an option on a Collector.
ErrorCallback is a type alias for OnError callback functions.
HTMLCallback is a type alias for OnHTML callback functions.
ProxyFunc is a type alias for proxy setter functions.
RequestCallback is a type alias for OnRequest callback functions.
ResponseCallback is a type alias for OnResponse callback functions.
ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions.
ScrapedCallback is a type alias for OnScraped callback functions.
XMLCallback is a type alias for OnXML callback functions.
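
ProxyFunc, for example, matches the signature expected by Collector.SetProxyFunc. A minimal hand-rolled sketch (the proxy address is a placeholder; the proxy subpackage also offers ready-made switchers):

package main

import (
	"net/http"
	"net/url"

	"github.com/gocolly/colly/v2"
)

// fixedProxy routes every request through a single proxy address.
func fixedProxy(_ *http.Request) (*url.URL, error) {
	return url.Parse("socks5://127.0.0.1:1337")
}

func main() {
	c := colly.NewCollector()
	c.SetProxyFunc(fixedProxy)
	c.Visit("http://go-colly.org/")
}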