Module: github.com/gocolly/colly/v2
Version: 2.1.0
Repository: https://github.com/gocolly/colly.git
Documentation: pkg.go.dev

# README

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Features

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain (see the sketch after this list)
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic decoding of non-Unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions
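
Two of the features above are worth a concrete illustration: per-domain delays and parallelism are configured with a LimitRule, and asynchronous scraping is enabled with the Async option. A minimal sketch (the domain glob, delay and parallelism values are placeholders):

package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async(true) makes Visit return immediately; Wait blocks until all
	// pending requests have finished.
	c := colly.NewCollector(colly.Async(true))

	// Allow at most 2 parallel requests to matching domains, 1s apart.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*go-colly.org*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
	c.Wait()
}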

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}
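
Failed requests can be observed by registering an OnError callback on the same collector; a short sketch (the log message is only illustrative):

// OnError is called when the request fails or the server
// responds with an error status code.
c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request to", r.Request.URL, "failed:", err)
})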

See the examples folder for more detailed examples.

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 v2.1.0
)
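
Alternatively, running the go tool from inside your module adds the requirement and records a concrete version in go.mod for you:

go get github.com/gocolly/colly/v2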

Bugs

Bugs or suggestions? Visit the issue tracker or join the #colly channel on freenode.

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project, please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

License

Colly is released under the Apache License 2.0; see the LICENSE file in the repository for the full text.

# Packages

Package extensions implements various helper addons for Colly.
The module's other subpackages (debug, proxy, queue and storage, among others) do not provide a package description.
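
The extensions package provides small, ready-made helpers that hook into a Collector's callbacks. A minimal sketch using two of them:

package main

import (
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	c := colly.NewCollector()

	// RandomUserAgent sets a randomly chosen User-Agent header on every request.
	extensions.RandomUserAgent(c)
	// Referer sets the Referer header to the URL the link was found on.
	extensions.Referer(c)

	c.Visit("http://go-colly.org/")
}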

# Functions

AllowedDomains sets the domain whitelist used by the Collector.
AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL.
Async turns on asynchronous network requests.
CacheDir specifies the location where GET requests are cached as files.
CheckHead performs a HEAD request before every GET to pre-validate the response.
Debugger sets the debugger used by the Collector.
DetectCharset enables character-encoding detection for non-UTF-8 response bodies that lack an explicit charset declaration.
DisallowedDomains sets the domain blacklist used by the Collector.
DisallowedURLFilters sets the list of regular expressions which restricts visiting URLs.
ID sets the unique identifier of the Collector.
IgnoreRobotsTxt instructs the Collector to ignore any restrictions set by the target host's robots.txt file.
MaxBodySize sets the limit of the retrieved response body in bytes.
MaxDepth limits the recursion depth of visited URLs.
NewCollector creates a new Collector instance with default configuration.
NewContext initializes a new Context instance.
NewHTMLElementFromSelectionNode creates an HTMLElement from a goquery.Selection node.
NewXMLElementFromHTMLNode creates an XMLElement from an html.Node.
NewXMLElementFromXMLNode creates an XMLElement from an xmlquery.Node.
ParseHTTPErrorResponse allows parsing responses with HTTP errors.
SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.
TraceHTTP instructs the Collector to collect and report request trace data on the Response.Trace.
UnmarshalHTML declaratively extracts text or attributes from an HTML response into a struct, using struct tags composed of CSS selectors.
URLFilters sets the list of regular expressions which restricts visiting URLs.
UserAgent sets the user agent used by the Collector.
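
Most of the functions above are functional options passed to NewCollector. A minimal sketch combining several of them (the domains, depth, cache directory and User-Agent string are placeholders):

package main

import "github.com/gocolly/colly/v2"

func main() {
	c := colly.NewCollector(
		// Only visit URLs on the listed domains.
		colly.AllowedDomains("go-colly.org", "www.go-colly.org"),
		// Follow links at most two levels deep.
		colly.MaxDepth(2),
		// Cache GET responses on disk between runs.
		colly.CacheDir("./colly_cache"),
		// Identify the scraper with a custom User-Agent header.
		colly.UserAgent("my-colly-bot/1.0"),
	)

	c.Visit("http://go-colly.org/")
}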

# Constants

ProxyURLKey is the context key for the request proxy address.

# Variables

ErrAbortedAfterHeaders is the error returned when OnResponseHeaders aborts the transfer.
ErrAlreadyVisited is the error type for already visited URLs.
ErrEmptyProxyURL is the error type for empty Proxy URL list.
ErrForbiddenDomain is the error returned when visiting a domain that is not allowed by AllowedDomains.
ErrForbiddenURL is the error returned when visiting a URL that is not allowed by URLFilters.
ErrMaxDepth is the error type for exceeding max depth.
ErrMissingURL is the error type for missing URL errors.
ErrNoCookieJar is the error type for missing cookie jar.
ErrNoPattern is the error type for LimitRules without patterns.
ErrNoURLFiltersMatch is the error returned when a visited URL matches none of the URLFilters.
ErrQueueFull is the error returned when the queue is full.
ErrRobotsTxtBlocked is the error type for robots.txt errors.
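
These sentinel errors are returned by methods such as Visit, so callers can branch on them. A minimal sketch using errors.Is (the URLs are placeholders):

package main

import (
	"errors"
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(colly.AllowedDomains("go-colly.org"))

	c.Visit("http://go-colly.org/")

	// Revisiting the same URL is rejected with ErrAlreadyVisited by default.
	if err := c.Visit("http://go-colly.org/"); errors.Is(err, colly.ErrAlreadyVisited) {
		fmt.Println("already visited, skipping")
	}

	// A domain outside AllowedDomains is rejected with ErrForbiddenDomain.
	if err := c.Visit("http://example.com/"); errors.Is(err, colly.ErrForbiddenDomain) {
		fmt.Println("domain not allowed")
	}
}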

# Structs

Collector provides the scraper instance for a scraping job.
Context provides a tiny layer for passing data between callbacks.
HTMLElement is the representation of an HTML tag.
HTTPTrace provides a data structure for storing an HTTP trace.
LimitRule provides connection restrictions for domains.
Request is the representation of an HTTP request made by a Collector.
Response is the representation of an HTTP response received by a Collector.
XMLElement is the representation of an XML tag.
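
HTMLElement exposes helpers such as ChildText and ChildAttr for the matched node, and Context carries values from a request callback to the callbacks handling its response. A minimal sketch (the selectors and context key are placeholders):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Store a value on the request's Context before the request is sent.
	c.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("started", "true")
	})

	// HTMLElement gives access to the matched tag and its children.
	c.OnHTML("article", func(e *colly.HTMLElement) {
		title := e.ChildText("h1")
		link := e.ChildAttr("a[href]", "href")
		// The same Context instance is shared with the request callback above.
		fmt.Println(title, link, e.Request.Ctx.Get("started"))
	})

	c.Visit("http://go-colly.org/")
}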

# Type aliases

A CollectorOption sets an option on a Collector.
ErrorCallback is a type alias for OnError callback functions.
HTMLCallback is a type alias for OnHTML callback functions.
ProxyFunc is a type alias for proxy setter functions.
RequestCallback is a type alias for OnRequest callback functions.
ResponseCallback is a type alias for OnResponse callback functions.
ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions.
ScrapedCallback is a type alias for OnScraped callback functions.
XMLCallback is a type alias for OnXML callback functions.
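
ProxyFunc, for example, matches the signature expected by Collector.SetProxyFunc. A minimal hand-rolled sketch (the proxy address is a placeholder; the proxy subpackage also offers ready-made switchers):

package main

import (
	"net/http"
	"net/url"

	"github.com/gocolly/colly/v2"
)

// fixedProxy routes every request through a single proxy address.
func fixedProxy(_ *http.Request) (*url.URL, error) {
	return url.Parse("socks5://127.0.0.1:1337")
}

func main() {
	c := colly.NewCollector()
	c.SetProxyFunc(fixedProxy)
	c.Visit("http://go-colly.org/")
}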