checklinks (Go module/package), version 0.0.8
Repository: https://github.com/patrickbucher/checklinks.git
Documentation: pkg.go.dev
# README
## checklinks: Crawl a Website for Dead URLs
The `checklinks` utility takes a single website address and crawls that page for links (i.e. `href` attributes of `<a>` tags). TLS issues are ignored.
### Run It

```
$ go run cmd/checklinks.go [url]
```
If the URL does not start with an `http://` or `https://` prefix, `http://` is automatically assumed.
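This scheme fallback can be pictured with a short Go sketch (illustrative only; `normalizeURL` is a hypothetical helper, not part of the checklinks code):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeURL prepends "http://" if the address carries neither an
// http:// nor an https:// prefix, mirroring the behavior described above.
func normalizeURL(addr string) string {
	if !strings.HasPrefix(addr, "http://") && !strings.HasPrefix(addr, "https://") {
		return "http://" + addr
	}
	return addr
}

func main() {
	fmt.Println(normalizeURL("example.com"))         // http://example.com
	fmt.Println(normalizeURL("https://example.com")) // https://example.com
}
```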
### Build It, Then Run It

```
$ go build cmd/checklinks.go
$ ./checklinks [url]
```
### Install It

Pick a tag (e.g. `v0.0.7`) and use `go install` to install that particular version:

```
$ go install github.com/patrickbucher/checklinks/[email protected]
go: downloading github.com/patrickbucher/checklinks v0.0.7
```
### Flags

The success or failure of each individual link is reported to the terminal. Use the flags to control the output and the request timeout:

```
$ ./checklinks -help
Usage of ./checklinks:
  -ignored
        report ignored links (e.g. mailto:...)
  -nofailed
        do NOT report failed links (e.g. 404)
  -success
        report succeeded links (OK)
  -timeout int
        request timeout (in seconds) (default 10)
```
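For example, to also report succeeded links and to give up on slow requests after five seconds:

```
$ ./checklinks -success -timeout 5 example.com
```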
### TODO

- introduce command line flags
    - user agent (optional)
    - level of parallelism (optional)
    - allow insecure SSL/TLS
- refactor code
    - introduce a `Config` struct for handing over the entire configuration from the command line to the crawler function (a sketch follows after this list)
    - introduce a `Channels` struct for handing over channels to the `Process` functions
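A possible shape for that `Config` struct, purely as a sketch: the type and all field names below are assumptions, not existing checklinks API.

```go
package checklinks

import "time"

// Config is a hypothetical sketch of the struct contemplated in the TODO
// list above; the type and all field names are assumptions, not actual API.
type Config struct {
	SiteURL       string        // address of the page to crawl
	Timeout       time.Duration // per-request timeout
	UserAgent     string        // optional custom User-Agent header
	Parallelism   int           // optional level of parallelism
	Insecure      bool          // allow insecure SSL/TLS
	ReportOK      bool          // report succeeded links (OK)
	ReportIgnored bool          // report ignored links (e.g. mailto:...)
	ReportFailed  bool          // report failed links (e.g. 404)
}
```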
# Functions
- `CrawlPage` crawls the given site's URL and reports successfully checked links, ignored links, and failed links (according to the flags `ok`, `ignore`, and `fail`, respectively).
- `ExtractTagAttribute` traverses the given node's tree, searches it for nodes with the given tag name, and extracts the given attribute's value from them.
- `FetchDocument` gets the document indicated by the given URL using the given client, and returns its root (document) node.
- `NewLink` creates a Link from the given address.
- `ProcessLeaf` uses the given http.Client to fetch the given link using a GET request, and reports the result of that request.
- `ProcessNode` uses the given http.Client to fetch the given link, and reports the links extracted from the page (indicated by `<a href="...">`).
- `QualifyInternalURL` creates a new URL by merging scheme and host information from the page URL with the rest of the URL indication from the link URL.
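As an illustration of what `ExtractTagAttribute` and `QualifyInternalURL` describe, the following self-contained sketch collects `href` attributes from `<a>` tags using golang.org/x/net/html and resolves relative links against the page URL using net/url. It is an independent example of the technique, not the package's actual implementation:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"

	"golang.org/x/net/html"
)

// extractTagAttribute walks the node tree and collects the given
// attribute's values from all elements with the given tag name.
func extractTagAttribute(n *html.Node, tag, attr string) []string {
	var values []string
	if n.Type == html.ElementNode && n.Data == tag {
		for _, a := range n.Attr {
			if a.Key == attr {
				values = append(values, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		values = append(values, extractTagAttribute(c, tag, attr)...)
	}
	return values
}

func main() {
	page, err := url.Parse("https://example.com/docs/")
	if err != nil {
		panic(err)
	}
	doc, err := html.Parse(strings.NewReader(
		`<html><body><a href="/about">About</a> <a href="faq">FAQ</a></body></html>`))
	if err != nil {
		panic(err)
	}
	for _, href := range extractTagAttribute(doc, "a", "href") {
		link, err := url.Parse(href)
		if err != nil {
			continue
		}
		// Qualify internal (relative) URLs against the page URL.
		fmt.Println(page.ResolveReference(link))
	}
}
```

Run as-is, this prints `https://example.com/about` and `https://example.com/docs/faq`.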
# Constants
- `Parallelism` is the max. number of links checked concurrently.
- `UserAgent` defines a value used for the "User-Agent" header to avoid being blocked.
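Setting such a header on outgoing requests works with the standard net/http API; the following is a generic sketch (the User-Agent string below is made up, not the value of the `UserAgent` constant):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	// Send a browser-like User-Agent so that the request is less likely
	// to be blocked, which is the purpose the constant's description names.
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; link-checker-sketch)")
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```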