github.com/kavorite/spaider v0.2.1
Repository: https://github.com/kavorite/spaider

# README

## What is this?

This program crawls a domain and tries to find all of the documents on it, for use as conditioning context for an AI. Currently, that means it fetches anything with a text content type under 500KiB, converts it to Markdown if it's HTML, and writes it to standard output.
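For a sense of what that filter looks like, here's a minimal Go sketch. It's illustrative only: the helper name and constant are assumptions, and the HTML-to-Markdown conversion step is elided.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// maxSize mirrors the 500KiB cap described above.
const maxSize = 500 << 10

// fetchText returns the body at url if it has a text content type and
// fits under the size cap; otherwise it returns an empty string.
// (Hypothetical helper, not spaider's actual code; the HTML-to-Markdown
// step would happen after this.)
func fetchText(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// Skip anything that isn't text: images, archives, binaries, etc.
	if !strings.HasPrefix(resp.Header.Get("Content-Type"), "text/") {
		return "", nil
	}
	// Read at most one byte past the cap so we can tell if it was exceeded.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxSize+1))
	if err != nil {
		return "", err
	}
	if len(body) > maxSize {
		return "", nil // over the cap; skip
	}
	return string(body), nil
}

func main() {
	doc, err := fetchText("https://flax.readthedocs.io/en/latest/")
	if err != nil {
		panic(err)
	}
	fmt.Print(doc)
}
```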

## How do I use this?

Note: To ensure the following works as intended, install Go and make sure `$GOPATH/bin` is on your `PATH`.

```sh
go install github.com/kavorite/spaider@latest
# you can use any documentation website, I guess.
spaider https://flax.readthedocs.io/en/latest/ > summary.txt
# you're now free to upload the summary to an LLM inference platform somewhere
```

## Background

The program treats the starting URL as the "index." It may navigate through hyperlinks on the same domain whose paths aren't "children" of the starting path, but it only prints the content of pages whose paths are children of the index path. This means its runtime depends on the size of the website as navigable from the start page, while the size of its output depends only on the size of the path hierarchy found under the chosen index.
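That scoping rule could be sketched like this (hypothetical helper names, not spaider's actual internals): follow any same-host link, but only emit pages under the index path.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldFollow reports whether a link is worth crawling at all:
// anything on the same host qualifies, regardless of path.
func shouldFollow(index, link *url.URL) bool {
	return link.Host == index.Host
}

// shouldPrint reports whether a crawled page's content gets emitted:
// only pages whose paths sit under the index path, even though the
// crawler may have followed other same-host links to discover them.
func shouldPrint(index, page *url.URL) bool {
	return page.Host == index.Host && strings.HasPrefix(page.Path, index.Path)
}

func main() {
	// Errors ignored for brevity; these are well-formed literals.
	index, _ := url.Parse("https://flax.readthedocs.io/en/latest/")
	under, _ := url.Parse("https://flax.readthedocs.io/en/latest/guides/")
	aside, _ := url.Parse("https://flax.readthedocs.io/en/v0.6.0/")
	fmt.Println(shouldFollow(index, under), shouldPrint(index, under)) // true true
	fmt.Println(shouldFollow(index, aside), shouldPrint(index, aside)) // true false
}
```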

Big shoutout to httparchive.org for making all the data from their crawls publicly accessible via Google BigQuery, because otherwise getting the 99th percentile of HTML payload sizes would probably be pretty difficult. It wasn't 500KiB, it was around 300K, but I just rounded up to be safe.