github.com/patrickbucher/checklinks
module · package · v0.0.8
Repository: https://github.com/patrickbucher/checklinks.git
Documentation: pkg.go.dev

# README

checklinks: Crawl a Website for Dead URLs

The checklinks utility takes a single website address and crawls that page for links (i.e. href attributes of <a> tags). TLS issues are ignored.
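Ignoring TLS issues typically means skipping certificate verification. A minimal sketch of how such an HTTP client can be set up in Go; the helper name newClient and the URL are illustrative, not the package's actual code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// newClient builds an HTTP client that skips TLS certificate
// verification, one way of ignoring TLS issues while crawling.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
}

func main() {
	client := newClient(10 * time.Second)
	resp, err := client.Get("https://self-signed.example.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```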

Run It

$ go run cmd/checklinks.go [url]

If the URL does not start with an http:// or https:// prefix, http:// is automatically assumed.
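A minimal sketch of such a normalization step in Go; the helper name addScheme is illustrative, not necessarily the tool's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// addScheme prepends "http://" when the given address lacks a scheme,
// mirroring the behavior described above.
func addScheme(raw string) string {
	if strings.HasPrefix(raw, "http://") || strings.HasPrefix(raw, "https://") {
		return raw
	}
	return "http://" + raw
}

func main() {
	fmt.Println(addScheme("example.com")) // http://example.com
}
```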

Build It, Then Run It

$ go build cmd/checklinks.go
$ ./checklinks [url]

Install It

Pick a tag (e.g. v0.0.7) and use go install to install that particular version:

$ go install github.com/patrickbucher/checklinks/cmd@v0.0.7
go: downloading github.com/patrickbucher/checklinks v0.0.7

Flags

The success or failure of each individual link is reported to the terminal. Use the following flags to control the output and the request timeout:

$ ./checklinks -help
Usage of ./checklinks:
  -ignored
        report ignored links (e.g. mailto:...)
  -nofailed
        do NOT report failed links (e.g. 404)
  -success
        report succeeded links (OK)
  -timeout int
        request timeout (in seconds) (default 10)
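
For example, to also report successful links and to give up on requests after five seconds, combine the flags listed above:

$ ./checklinks -success -timeout 5 example.com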

TODO

  • introduce command line flags
    • user agent (optional)
    • level of parallelism (optional)
    • allow insecure SSL/TLS
  • refactor code
    • introduce Config struct for handing over the entire configuration from the command line to the crawler function (one possible shape is sketched below)
    • introduce Channels struct for handing over channels to Process functions
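
One possible shape for such a Config struct, purely as an illustration; no such struct exists in the package yet, and the fields are assumptions derived from the flags and TODO items above:

```go
package checklinks

import "time"

// Config could bundle the settings handed over from the command line
// to the crawler, as outlined in the TODO list. Hypothetical sketch.
type Config struct {
	Timeout       time.Duration // request timeout (-timeout)
	UserAgent     string        // user agent (planned optional flag)
	Parallelism   int           // level of parallelism (planned optional flag)
	Insecure      bool          // allow insecure SSL/TLS (planned flag)
	ReportOK      bool          // -success
	ReportIgnored bool          // -ignored
	ReportFailed  bool          // negation of -nofailed
}
```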


# Functions

CrawlPage crawls the given site's URL and reports successfully checked links, ignored links, and failed links (according to the flags ok, ignore, fail, respectively).
ExtractTagAttribute traverses the given node's tree, searches it for nodes with the given tag name, and extracts the given attribute's value from them.
FetchDocument gets the document indicated by the given URL using the given client, and returns its root (document) node.
NewLink creates a Link from the given address.
ProcessLeaf uses the given http.Client to fetch the given link using a GET request, and reports the result of that request.
ProcessNode uses the given http.Client to fetch the given link, and reports the extracted links on the page (indicated by <a href="...">).
QualifyInternalURL creates a new URL by merging the scheme and host information from the page URL with the path information from the link URL.
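
The descriptions above suggest a fetch-then-extract pipeline. The following self-contained sketch illustrates the same idea with golang.org/x/net/html; the lowercase helper names and their signatures are assumptions for illustration, not the package's actual API:

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// fetchDocument gets the page at url and parses it into an HTML node
// tree, roughly what FetchDocument is described to do.
func fetchDocument(url string, client *http.Client) (*html.Node, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return html.Parse(resp.Body)
}

// extractTagAttribute walks the node tree and collects the values of
// the given attribute on nodes with the given tag name, e.g. href on a.
func extractTagAttribute(node *html.Node, tag, attr string) []string {
	var values []string
	if node.Type == html.ElementNode && node.Data == tag {
		for _, a := range node.Attr {
			if a.Key == attr {
				values = append(values, a.Val)
			}
		}
	}
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		values = append(values, extractTagAttribute(child, tag, attr)...)
	}
	return values
}

func main() {
	doc, err := fetchDocument("http://example.com", http.DefaultClient)
	if err != nil {
		panic(err)
	}
	for _, href := range extractTagAttribute(doc, "a", "href") {
		fmt.Println(href)
	}
}
```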

# Constants

Parallelism is the maximum number of links processed concurrently.
UserAgent defines a value used for the "User-Agent" header to avoid being blocked.
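
Sending such a header is plain net/http usage. A minimal sketch, with a placeholder value for UserAgent; the package's actual string is not documented here:

```go
package main

import (
	"fmt"
	"net/http"
)

// UserAgent mirrors the constant described above; the value is only a
// placeholder, not the package's actual string.
const UserAgent = "checklinks"

func main() {
	req, err := http.NewRequest(http.MethodGet, "http://example.com", nil)
	if err != nil {
		panic(err)
	}
	// Send the custom User-Agent header to avoid being blocked.
	req.Header.Set("User-Agent", UserAgent)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```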

# Structs

Link represents a link (URL) in the context of a web site (Site).
Result describes the result of processing a Link.
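
The fields of these structs are not documented here; a hypothetical sketch of what they might look like, based only on the one-line descriptions above:

```go
package checklinks

import "net/url"

// Link could pair a parsed URL with the site it was found on.
// Hypothetical sketch; the actual fields may differ.
type Link struct {
	URL  *url.URL
	Site *url.URL
}

// Result could report the outcome of processing a Link.
type Result struct {
	Link *Link
	Err  error // nil on success
}
```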