github.com/gtsteffaniak/html-web-crawler
module | package
Version: 0.2.2
Repository: https://github.com/gtsteffaniak/html-web-crawler.git
Documentation: pkg.go.dev

# README

HTML Web Crawler

A Golang library to crawl the web for links and information.

About

This web crawler was initially written in Python, a language I deemed well suited to this type of task. However, once I realized I would need multithreaded processing to make it fast enough, it became clear the crawler would benefit more from Go's native concurrency.

In stark contrast to the Python implementation, the Go version outperformed its predecessor even without leveraging concurrency: it processed the identical task in under 4 seconds versus Python's 32-second execution time, an 8-fold speedup, while consuming considerably fewer resources.

The decision to opt for Go over Python is an interesting topic that I intend to cover extensively on my blog. In the meantime, this exists as a library, ready to be integrated seamlessly into other projects of mine.

How to use

CLI

First, install or download the program:

# Make sure your go bin is in your PATH if installing via go
go install github.com/gtsteffaniak/html-web-crawler@latest
html-web-crawler crawl --urls https://apnews.com/

Use --help to see more options:

usage: ./html-web-crawler <command> [options] --urls <urls>
  commands:
    crawl    Gathers URLs that match a search, crawling URLs recursively. The result is a map of URLs to their full HTML content. This is quicker and more efficient than collect.
    collect  More intensive and specific than crawl. The result is a map of URLs to arrays of matched items, such as image URLs, search terms, etc. This does not return full HTML. Defaults to searching HTML content.
    install  Installs the Chrome browser for JavaScript-enabled scraping.
               Note: Consider installing via your native package manager instead,
                     then set "CHROME_EXECUTABLE" in the environment.

Available flags will vary by the command given.

Example commands and their purpose

To get all links on a given page, but not crawl any further:

html-web-crawler crawl --urls https://apnews.com/ --max-depth 1

To query DuckDuckGo search with JavaScript enabled:

Note: JavaScript-enabled searching requires Chrome to be installed and the CHROME_EXECUTABLE path set.

$ ./html-web-crawler collect --urls "https://duckduckgo.com/?t=h_&q=puppies&iax=images&ia=images" \
--js-depth 1 --filetypes images
Collect function returned data:
https://duckduckgo.com/assets/icons/meta/DDG-iOS-icon_60x60.png
https://duckduckgo.com/assets/icons/meta/DDG-iOS-icon_76x76.png
https://duckduckgo.com/assets/icons/meta/DDG-iOS-icon_120x120.png
https://duckduckgo.com/assets/icons/meta/DDG-iOS-icon_152x152.png
https://duckduckgo.com/assets/icons/meta/DDG-icon_256x256.png
https://duckduckgo.com/i/a49fa21e.jpg

To collect pages that include text:

$ ./html-web-crawler collect --urls "https://gportal.link/blog" --search-any "my search string"

To collect images from crawled pages:

$ ./html-web-crawler collect --urls "https://gportal.link/blog" --filetypes images
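
The collect behavior can also be approximated from Go code using the library described in the next section. The sketch below is only an illustration, not the library's own collect logic: the img-src regular expression is my own, and it assumes Crawl returns a URL-to-HTML map plus an error, matching the crawl description above and the module example further down.

package main

import (
	"fmt"
	"regexp"

	"github.com/gtsteffaniak/html-web-crawler/crawler"
)

func main() {
	// Sketch only: pull image URLs out of crawled HTML with a regexp.
	// The real `collect --filetypes images` command is more thorough.
	imgSrc := regexp.MustCompile(`<img[^>]+src="([^"]+)"`)

	Crawler := crawler.NewCrawler()
	crawledData, err := Crawler.Crawl("https://gportal.link/blog")
	if err != nil {
		fmt.Println("crawl failed:", err)
		return
	}
	// crawledData is assumed to be a map of URL -> full HTML content,
	// as described for the crawl command above.
	for page, html := range crawledData {
		for _, match := range imgSrc.FindAllStringSubmatch(html, -1) {
			fmt.Println(page, "->", match[1])
		}
	}
}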

Include as a module in your Go program

Note: You can also see ai-earthquake-tracker as an example.

package main

import (
	"fmt"

	"github.com/gtsteffaniak/html-web-crawler/crawler"
)

func main() {
	Crawler := crawler.NewCrawler()
	// Add crawling HTML selector classes
	Crawler.Selectors.Classes = []string{"PageList-items-item"}
	// Allow up to 50 pages to be crawled concurrently
	Crawler.Threads = 50
	// Crawl starting with a given URL
	crawledData, _ := Crawler.Crawl("https://apnews.com/hub/earthquakes")
	// Print/Process the crawled data
	fmt.Println("Total: ", len(crawledData))
}
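
Since Crawl returns a map of URLs to their full HTML content (per the crawl command description above), post-processing is plain map iteration. The variant below is a sketch along those lines, not an official example: it checks the error instead of discarding it and filters crawled pages for a keyword, similar in spirit to the CLI's --search-any option.

package main

import (
	"fmt"
	"strings"

	"github.com/gtsteffaniak/html-web-crawler/crawler"
)

func main() {
	Crawler := crawler.NewCrawler()
	Crawler.Selectors.Classes = []string{"PageList-items-item"}
	Crawler.Threads = 50
	crawledData, err := Crawler.Crawl("https://apnews.com/hub/earthquakes")
	if err != nil {
		fmt.Println("crawl failed:", err)
		return
	}
	// Keep only pages whose HTML mentions a keyword.
	for url, html := range crawledData {
		if strings.Contains(strings.ToLower(html), "magnitude") {
			fmt.Println("match:", url)
		}
	}
}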
