github.com/mal0ner/wikiscrape
module / package
Version: v0.0.0-20240418084808-8ee748d62865
Repository: https://github.com/mal0ner/wikiscrape.git
Documentation: pkg.go.dev

# README

🌐 Get wiki pages from the command line. Increase brain volume 🧠.



[!WARNING]
This project is unfinished! Not all of the features listed in the README are available.

Wikiscrape

Get wiki pages. Export to your desired format. Wikiscrape is a command-line tool which aims to provide a wiki-agnostic method for retrieving and exporting data from wiki pages.

[Demo GIF made with VHS]

The whole motivation for this project is to provide a consistent and convenient interface for interacting with the sometimes frustrating world of wiki APIs. Although the vast majority of wikis are built upon a small number of frameworks, I often found that even wikis sharing a backend framework have vastly different access patterns.

For example, despite both being built on top of MediaWiki, Wikipedia and the Old School RuneScape Wiki differ in the following (a short sketch of what this means for an API client follows the list):

  • API Endpoint: en.wikipedia.org/w/api.php vs. oldschool.runescape.wiki/api.php
  • Page Prefix: en.wikipedia.org/wiki/pageName vs. oldschool.runescape.wiki/w/pageName
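
To make the contrast concrete, here is a rough Go sketch of what fetching a page from each wiki looks like for a client. The buildParseURL helper is hypothetical and not part of wikiscrape; the query parameters follow MediaWiki's standard action=parse API, and only the two endpoint URLs are taken from the examples above.

package main

import (
    "fmt"
    "net/url"
)

// buildParseURL is a hypothetical helper: it builds a MediaWiki
// action=parse request for a page title against a given api.php endpoint.
func buildParseURL(endpoint, title string) string {
    q := url.Values{}
    q.Set("action", "parse")
    q.Set("page", title)
    q.Set("prop", "wikitext")
    q.Set("format", "json")
    return endpoint + "?" + q.Encode()
}

func main() {
    // Same backend (MediaWiki), same query shape; only the endpoint differs.
    fmt.Println(buildParseURL("https://en.wikipedia.org/w/api.php", "Bear"))
    fmt.Println(buildParseURL("https://oldschool.runescape.wiki/api.php", "Hammer"))
}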

Features

  • Bl-Moderately Fast 🚀🔥
  • Effortless retrieval of full wiki pages or specific sections
  • Support for multiple wiki backends
  • Manifest file support: wikiscrape can iteratively scrape from a list of pages given a JSON file.

Wiki Support

Because of the differences in API access patterns mentioned above, wikis must be explicitly supported by Wikiscrape in order to retrieve content from them. "Support" involves the following:

  • A wikiInfo entry in internal/util/wikisupport.go, which maps known wiki names or URL host segments to information about their backend, API endpoint, and page prefix, so that page names can be parsed from URLs (see the sketch after this list).

  • A scraper and response type in internal/scraper, written specifically for the wiki's backend, which handle parsing API responses and their content.
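
Concretely, the support data amounts to a lookup keyed by wiki name or URL host. The sketch below is only an approximation of that shape: the wikiHostInfo map name comes from this README, but the entry type and its fields are illustrative assumptions, and the values are taken from the Wikipedia / Old School RuneScape example above.

package util

// wikiInfo is an illustrative approximation of the per-wiki entry described
// above; the real definition lives in internal/util/wikisupport.go.
type wikiInfo struct {
    Backend    string // backend framework, e.g. "mediawiki"
    APIPath    string // full URL of the wiki's API endpoint
    PagePrefix string // URL path segment that precedes page names
}

// wikiHostInfo maps URL hosts to wiki metadata (hypothetical entries).
var wikiHostInfo = map[string]wikiInfo{
    "en.wikipedia.org": {
        Backend:    "mediawiki",
        APIPath:    "https://en.wikipedia.org/w/api.php",
        PagePrefix: "/wiki/",
    },
    "oldschool.runescape.wiki": {
        Backend:    "mediawiki",
        APIPath:    "https://oldschool.runescape.wiki/api.php",
        PagePrefix: "/w/",
    },
}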

For a list of the wikis and backends supported by Wikiscrape, run wikiscrape list -h. The currently supported backends are:

  • MediaWiki

If you have a wiki that you would like supported, and support for its backend already exists in internal/scraper, feel free to submit an issue. If you have the skill or the time, you are also welcome to contribute directly by adding the wiki to the wikiHostInfo and wikiNameInfo maps in internal/util/wikisupport.go! Please see the contribution guide below.

Installation

Right now, the best way to get wikiscrape on your machine is to just use go:

go install github.com/mal0ner/wikiscrape@latest

Usage

Wikiscrape gives you a simple and intuitive command-line interface.

Scrape a single page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear"

# by name
wikiscrape page "Bear" --wiki wikipedia

Scrape the list of section headings from a page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear" --section-titles

# by name
wikiscrape page "Bear" --wiki wikipedia -t

Scrape a specific section:

wikiscrape page "Bear" --wiki wikipedia --section "Taxonomy"

# short
wikiscrape page Bear -w wikipedia -s Taxonomy

Scrape multiple pages from a manifest file:

wikiscrape pages --wiki wikipedia --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -f path/to/manifest.json

Scrape just references from a list of pages:

wikiscrape pages --wiki wikipedia --section "References" --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -s References -f path/to/manifest.json

Manifest

The format of the manifest file is just a simple JSON array. This was probably a strange design decision, but I don't really want to change it! Page titles can be included raw without the need for URL encoding, as this step is taken care of by the program.

["Hammer", "Zulrah/Strategies"]

This could potentially be expanded in the future to let the user specify a section to scrape on a per-page basis, e.g. {"page": "Hammer", "section": "Uses"}, but I have no plans for that now.
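
Because the manifest is just a JSON array of titles, reading it takes only a few lines of Go. The sketch below is not wikiscrape's internal code; the loadManifest name and the example path are assumptions, and url.PathEscape merely illustrates the encoding step the tool performs for you.

package main

import (
    "encoding/json"
    "fmt"
    "net/url"
    "os"
)

// loadManifest reads a manifest in the format shown above:
// a plain JSON array of page titles.
func loadManifest(path string) ([]string, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var pages []string
    if err := json.Unmarshal(data, &pages); err != nil {
        return nil, err
    }
    return pages, nil
}

func main() {
    pages, err := loadManifest("path/to/manifest.json")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, p := range pages {
        // Titles may contain spaces or slashes ("Zulrah/Strategies");
        // they only need escaping when the request URL is built.
        fmt.Println(p, "->", url.PathEscape(p))
    }
}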

FAQ

Will you ever fix the logo alignment?

No 👍

Contribution

We welcome contributions! If you'd like to help out, please follow these steps:

  • Fork the repository
  • Create a new branch for your feature or bug fix
  • Make your changes and commit them with descriptive messages
  • Push your changes to your forked repository
  • Submit a pull request to the main repository

Roadmap

  • Multi-language support
  • Fuzzy-find pages (low priority)
  • Fuzzy-find sections (low priority)
  • Add more export formats
  • Link preservation
  • Table parsing
  • List parsing
  • Reference parsing and potentially BibTeX export? Could have a --references flag
  • Tests!
  • Adding more wikis (and the confluence backend)
  • Proper SemVer
  • Add configuration file for configuring default behaviour (for less verbosity)
