# Grawler
Grawler is a web crawler written in Go. It scrapes the website at the given URL, finds all relative links, and visits those URLs in turn. Initially, this application was developed to build up a page cache and to check the availability of existing pages.
## Install
### Binary

Download and use a binary suitable for your system from the prebuilt releases.
### Go

If you have Go installed (version >= 1.22.3), you can use `go install` to install the application on your system:

```shell
go install github.com/robole-dev/grawler@latest
```
## Usage

```shell
grawler grawl <url>
```

Example:

```shell
grawler grawl https://www.google.de
```

All features can be listed via the help flag:

```shell
grawler -h
```

More examples below.
## Examples

### Crawl a website

```shell
grawler grawl https://books.toscrape.com
```

This command:

- crawls the given URL
- searches the `href` attributes of anchor tags (`<a href="...">`) and crawls these URLs too (see the sketch below)
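For the curious, here is a minimal sketch of how such anchor extraction can be done in Go with the `golang.org/x/net/html` package. This is an illustration of the general technique only, not Grawler's actual implementation:

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// extractLinks fetches a page and returns the href values of all anchor tags.
func extractLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var visit func(n *html.Node)
	visit = func(n *html.Node) {
		// Collect the href attribute of every <a> element.
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					links = append(links, attr.Val)
				}
			}
		}
		// Recurse into child nodes.
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			visit(c)
		}
	}
	visit(doc)
	return links, nil
}

func main() {
	links, err := extractLinks("https://books.toscrape.com")
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

A real crawler would then resolve relative links against the page URL and enqueue them for visiting.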
### Save the result to a CSV file

```shell
grawler grawl https://books.toscrape.com -o out.csv
```
### Allow parallel requests

Set the limit to 8 parallel requests:

```shell
grawler grawl https://books.toscrape.com -l 8
```
### Limit the search depth

Limit the search recursion depth to 2:

```shell
grawler grawl https://books.toscrape.com --max-depth 2
```
### Set a delay for each request

Set a delay of 500 milliseconds:

```shell
grawler grawl https://books.toscrape.com --delay 500
```
### Request a page with HTTP basic auth

To request a website that requires HTTP basic auth, you can set the username and password as flags:

```shell
grawler grawl https://books.toscrape.com --username user_xy --password mypassword
```

Optionally, you can omit the password. You will then be asked to enter it when you start grawling:

```shell
grawler grawl https://books.toscrape.com --username user_xy
```

```text
No config file found.
Grawling https://books.toscrape.com
✔ Password: █
```
### Add allowed domains

By default, only the domain of the start URL is allowed to be crawled; URLs on other domains are skipped. You can allow more domains with the `-a` flag:

```shell
grawler grawl https://quotes.toscrape.com -a example.com
```

You can also add multiple domains:

```shell
grawler grawl https://quotes.toscrape.com -a example.com -a google.de
```
### Skip/disallow URLs

You can define one or multiple regular expressions to skip the URLs that match them.

Here we skip all URLs starting with `https://books.toscrape.com/catalogue/category/books/`, with a max depth of 2:

```shell
grawler grawl https://books.toscrape.com --disallowed-url-filters "^https://books.toscrape.com/catalogue/category/books/.*" --max-depth 2
```

Here we skip all URLs that contain the word `category` or the word `art`:

```shell
grawler grawl https://books.toscrape.com --disallowed-url-filters "category" --disallowed-url-filters "art"
```
## Configuration
Precedence for configuration is given first to the flags set on the command line, then to the values in your configuration file.

Grawler looks first for the command-line flag `--config` (the path to a config file), then for the file `grawler.yaml` in the current working directory, and lastly at the path `$HOME/.config/grawler/conf.yaml`.
You can generate a config file with default values with the `init` command.
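For example (the exact invocation here is an assumption; check `grawler -h` for the real command name):

```shell
# Assumed invocation of the init command.
grawler init
```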
A sample config file can be found here: sample-conf.yaml.
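For illustration, a minimal sketch of what such a config might look like. The key names below are assumptions that mirror the long command-line flags used above; the linked sample-conf.yaml is the authoritative reference:

```yaml
# Hypothetical grawler.yaml sketch -- key names are assumed to mirror the
# long command-line flags; consult sample-conf.yaml for the real keys.
parallel: 8        # like -l 8
max-depth: 2       # like --max-depth 2
delay: 500         # like --delay 500 (milliseconds)
username: user_xy  # like --username user_xy
```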
## Need to know

Currently there is some trouble with tracking redirect HTTP status codes.

More info about that: