github.com/mal0ner/wikiscrape
module / package
Version: v0.0.0-20240418084808-8ee748d62865
Repository: https://github.com/mal0ner/wikiscrape.git
Documentation: pkg.go.dev

# README

🌐 Get wiki pages from the command line. Increase brain volume 🧠.



[!WARNING]
This project is unfinished! Not all of the features listed in the README are available.

Wikiscrape

Get wiki pages. Export to your desired format. Wikiscrape is a command-line tool which aims to provide a wiki-agnostic method for retrieving and exporting data from wiki pages.

[Demo GIF made with VHS]

The whole motivation for this project is to provide a consistent and convenient interface for interacting with the sometimes frustrating world of wiki APIs. Although the vast majority of wikis are built upon a small number of frameworks, I often found that even wikis sharing a backend framework have vastly different access patterns.

For example, despite both being built on top of MediaWiki, Wikipedia and the Old School RuneScape Wiki differ in the following (a short sketch of what this means for an API client follows the list):

  • API Endpoint: en.wikipedia.org/w/api.php vs. oldschool.runescape.wiki/api.php
  • Page Prefix: en.wikipedia.org/wiki/pageName vs. oldschool.runescape.wiki/w/pageName
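
To make the contrast concrete, here is a rough Go sketch of what fetching a page from each wiki looks like for a client. The buildParseURL helper is hypothetical and not part of wikiscrape; the query parameters follow MediaWiki's standard action=parse API, and only the two endpoint URLs are taken from the examples above.

package main

import (
    "fmt"
    "net/url"
)

// buildParseURL is a hypothetical helper: it builds a MediaWiki
// action=parse request for a page title against a given api.php endpoint.
func buildParseURL(endpoint, title string) string {
    q := url.Values{}
    q.Set("action", "parse")
    q.Set("page", title)
    q.Set("prop", "wikitext")
    q.Set("format", "json")
    return endpoint + "?" + q.Encode()
}

func main() {
    // Same backend (MediaWiki), same query shape; only the endpoint differs.
    fmt.Println(buildParseURL("https://en.wikipedia.org/w/api.php", "Bear"))
    fmt.Println(buildParseURL("https://oldschool.runescape.wiki/api.php", "Hammer"))
}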

Features

  • Bl-Moderately Fast 🚀🔥
  • Effortless retrieval of full wiki pages or specific sections
  • Support for multiple wiki backends
  • Manifest file support: wikiscrape can iteratively scrape from a list of pages given a JSON file.

Wiki Support

Because of the differences in API access patterns mentioned above, wikis must be explicitly supported by Wikiscrape in order to retrieve content from them. "Support" involves the following:

  • A wikiInfo entry in internal/util/wikisupport.go, which maps known wiki names or URL host segments to information about their backend, API endpoint, and page prefix, so that page names can be parsed from URLs (see the sketch after this list).

  • A scraper and response type in internal/scraper, written specifically for the wiki's backend, which handle parsing API responses and their content.
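
Concretely, the support data amounts to a lookup keyed by wiki name or URL host. The sketch below is only an approximation of that shape: the wikiHostInfo map name comes from this README, but the entry type and its fields are illustrative assumptions, and the values are taken from the Wikipedia / Old School RuneScape example above.

package util

// wikiInfo is an illustrative approximation of the per-wiki entry described
// above; the real definition lives in internal/util/wikisupport.go.
type wikiInfo struct {
    Backend    string // backend framework, e.g. "mediawiki"
    APIPath    string // full URL of the wiki's API endpoint
    PagePrefix string // URL path segment that precedes page names
}

// wikiHostInfo maps URL hosts to wiki metadata (hypothetical entries).
var wikiHostInfo = map[string]wikiInfo{
    "en.wikipedia.org": {
        Backend:    "mediawiki",
        APIPath:    "https://en.wikipedia.org/w/api.php",
        PagePrefix: "/wiki/",
    },
    "oldschool.runescape.wiki": {
        Backend:    "mediawiki",
        APIPath:    "https://oldschool.runescape.wiki/api.php",
        PagePrefix: "/w/",
    },
}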

For a list of the wikis and backends supported by Wikiscrape, run wikiscrape list -h. The currently supported backends are:

  • MediaWiki

If you have a wiki that you would like supported, and support for its backend already exists in internal/scraper, feel free to submit an issue. If you have the skill or the time, you are also welcome to contribute directly by adding the wiki to the wikiHostInfo and wikiNameInfo maps in internal/util/wikisupport.go! Please see the contribution guide below.

Installation

Right now, the best way to get wikiscrape on your machine is to just use go:

go install github.com/mal0ner/wikiscrape@latest

Usage

Wikiscrape gives you a simple and intuitive command-line interface.

Scrape a single page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear"

# by name
wikiscrape page "Bear" --wiki wikipedia

Scrape the list of section headings from a page:

# by url
wikiscrape "https://en.wikipedia.org/wiki/Bear" --section-titles

# by name
wikiscrape page "Bear" --wiki wikipedia -t

Scrape a specific section:

wikiscrape page "Bear" --wiki wikipedia --section "Taxonomy"

# short
wikiscrape page Bear -w wikipedia -s Taxonomy

Scrape multiple pages from a manifest file:

wikiscrape pages --wiki wikipedia --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -f path/to/manifest.json

Scrape just references from a list of pages:

wikiscrape pages --wiki wikipedia --section "References" --from-manifest "path/to/manifest.json"

# short
wikiscrape pages -w wikipedia -s References -f path/to/manifest.json

Manifest

The format of the manifest file is just a simple JSON array. This was probably a strange design decision, but I don't really want to change it! Page titles can be included raw without the need for URL encoding, as this step is taken care of by the program.

["Hammer", "Zulrah/Strategies"]

This could potentially be expanded in the future to let the user specify a section to scrape on a per-page basis, e.g. {"page": "Hammer", "section": "Uses"}, but I have no plans for that now.
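
Because the manifest is just a JSON array of titles, reading it takes only a few lines of Go. The sketch below is not wikiscrape's internal code; the loadManifest name and the example path are assumptions, and url.PathEscape merely illustrates the encoding step the tool performs for you.

package main

import (
    "encoding/json"
    "fmt"
    "net/url"
    "os"
)

// loadManifest reads a manifest in the format shown above:
// a plain JSON array of page titles.
func loadManifest(path string) ([]string, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var pages []string
    if err := json.Unmarshal(data, &pages); err != nil {
        return nil, err
    }
    return pages, nil
}

func main() {
    pages, err := loadManifest("path/to/manifest.json")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, p := range pages {
        // Titles may contain spaces or slashes ("Zulrah/Strategies");
        // they only need escaping when the request URL is built.
        fmt.Println(p, "->", url.PathEscape(p))
    }
}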

FAQ

Will you ever fix the logo alignment?

No 👍

Contribution

We welcome contributions! If you'd like to help out, please follow these steps:

  • Fork the repository
  • Create a new branch for your feature or bug fix
  • Make your changes and commit them with descriptive messages
  • Push your changes to your forked repository
  • Submit a pull request to the main repository

Roadmap

  • Multi-language support
  • Fuzzy-find pages (low priority)
  • Fuzzy-find sections (low priority)
  • Add more export formats
  • Link preservation
  • Table parsing
  • List parsing
  • Reference parsing and potentially BibTeX export? Could have a --references flag
  • Tests!
  • Adding more wikis (and the confluence backend)
  • Proper SemVer
  • Add configuration file for configuring default behaviour (for less verbosity)
