github.com/miku/span

Version: 0.2.4

Repository: https://github.com/miku/span.git

Documentation: pkg.go.dev

# README

Span

Span started as a single tool to convert Crossref API data into a VuFind/SOLR format as used in finc. An intermediate representation for article metadata is used to normalize various input formats. Go was chosen as the implementation language because it is easy to deploy and has concurrency support built into the language. A basic scatter-gather design made it possible to process millions of records quickly.

While span has a few independent tools (like fetching or compacting crossref feeds), it is mostly used inside siskin, a set of tasks to build an aggregated index.

Installation

$ go install github.com/miku/span/cmd/...@latest

Span has frequent releases, although not all versions will be packaged as deb or rpm.

Background

The initial import dates to Tue Feb 3 19:11:08 2015 and consisted of a single span command. In March 2015, span-import and span-export appeared, along with some rudimentary commands for dealing with holding files of various formats. In early 2016, a licensing tool was briefly named span-label before becoming span-tag. In summer 2016, span-check, span-deduplicate and span-redact were added; a first man page followed later.

In summer 2017, span-deduplicate was removed, and the DOI-based deduplication was split between the blunt but fast groupcover and the generic span-update-labels. A new span-oa-filter helped to mark open-access records. In winter 2017, span-freeze was added to allow for a fixed configuration across dozens of files. The span-crossref-snapshot tool replaced a sequence of luigi tasks responsible for creating a snapshot of crossref data (the process has been summarized in a comment).

In summer 2018, three new tools were added: span-compare for generating index diffs for index update tickets, span-review for generating reports based on SOLR queries, and span-webhookd for triggering index reviews and ticket updates through GitLab. During development, new input and output formats were added. The parallel processing of records has been streamlined with the help of a small library called parallel. Since winter 2017, the zek struct generator has taken care of the initial screening of sources serialized as XML, making the process of mapping new data sources easier.

Since about 2018 (0.1.211), the span tools have seen mostly small fixes and additions. Notably, since 2021, the scripts previously used to fetch daily metadata updates from crossref have been moved into a standalone tool, span-crossref-sync, which merely adds some retry logic and consistent file naming to the API harvest. In 2024, span-webhookd, span-check, span-review and span-tagger were removed.

Documentation

See: manual source.

Performance

Ideally, no complete processing of the data should take more than two hours or run slower than 20000 records/s. The most expensive part currently seems to be the JSON serialization, but we keep JSON for now for the sake of readability. Experiments with faster JSON serializers and msgpack have been encouraging; faster serialization should be the next measure to improve performance.

Most tools that work on lines will try to use as many workers as CPU cores. Except for span-tag - which needs to keep all holdings data in memory - all tools work well in a low-memory environment.

More cores can help (but returns may diminish): On a 64 core 2021 Xeon, we find that e.g. span-export can process (decompression, deserialization, conversion, serialization, compression) on average 130000 JSON documents/s. The final pipeline stage (from normalized data to deduplicated and indexable data) seems to take about three hours.

Integration

The span tools are used in various tasks in siskin (which contains all orchestration code). All span tools work fine standalone, and most will accept input from stdin as well, allowing for one-off things like:

$ metha-cat http://oai.web | span-import -i name | span-tag -c amsl | span-export | solrbulk

# Packages

assetutil

atomic

No description provided by the author

cmd

No description provided by the author

configutil

Package configutil handles application configuration and location and loading of various mapping files.

container

dateutil

Package dateutil provides interval handling.

doi

Package doi helps to find DOI in JSON documents.
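As an illustration of what finding a DOI in a JSON document can look like, here is a minimal sketch based on a regular expression. The pattern and the function name `findDOI` are assumptions for this example and do not reflect the actual API or matching rules of the doi package.

```go
package main

import (
	"fmt"
	"regexp"
)

// doiPattern is an assumed, approximate pattern for DOI-like strings:
// the "10." prefix, a 4 to 9 digit registrant code, a slash and a suffix
// that ends at whitespace or a double quote.
var doiPattern = regexp.MustCompile(`10\.[0-9]{4,9}/[^\s"]+`)

// findDOI returns the first DOI-like string found in a serialized JSON
// document, or the empty string if there is none. Hypothetical helper.
func findDOI(doc string) string {
	return doiPattern.FindString(doc)
}

func main() {
	doc := `{"title": "Example", "url": "https://doi.org/10.1000/xyz123"}`
	fmt.Println(findDOI(doc))
}
```

Working on the raw bytes avoids decoding the document, but it will miss DOIs whose slash is escaped as `\/` in the JSON text.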

encoding

No description provided by the author

filter

XXX: Generalize.

folio

Package folio adds support for a minimal subset of the FOLIO library platform API.

formats

No description provided by the author

licensing

Package licensing implements support for KBART and ISIL attachments.

parallel

Package parallel implements helpers for fast processing of line oriented inputs.

solrutil

Package solrutil implements helpers to access a SOLR index.

strutil

No description provided by the author

xflag

Package xflag adds an additional flag type Array for repeated string flags.
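A repeated string flag is usually built on the standard library's `flag.Value` interface. The sketch below shows that pattern; the type `stringArray` and the helper `parseInputs` are hypothetical stand-ins and not the definitions found in xflag.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringArray collects every occurrence of a flag. It is a hypothetical
// stand-in for a repeated string flag type, not xflag's own type.
type stringArray []string

// String implements flag.Value.
func (a *stringArray) String() string { return strings.Join(*a, ",") }

// Set implements flag.Value; it is called once per occurrence of the flag.
func (a *stringArray) Set(value string) error {
	*a = append(*a, value)
	return nil
}

// parseInputs parses args and returns all values passed via -i.
func parseInputs(args []string) ([]string, error) {
	var inputs stringArray
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(&inputs, "i", "input file (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return inputs, nil
}

func main() {
	inputs, err := parseInputs([]string{"-i", "a.json", "-i", "b.json"})
	fmt.Println(inputs, err)
}
```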

xio

No description provided by the author

# Functions

DetectLang3

DetectLang3 returns the best guess 3-letter language code for a given text.

GenFincID

GenFincID returns a finc.id string consisting of an arbitrary prefix (e.g.

LanguageIdentifier

LanguageIdentifier returns the three letter identifier from a variety of language name notations.

UnfreezeFilterConfig

UnfreezeFilterConfig takes the name of a zipfile (from span-freeze) and returns the path of the thawed filterconfig (along with the temporary directory and error).

# Constants

AppVersion

AppVersion of span package.

KeyLengthLimit

KeyLengthLimit was a limit imposed by the memcached protocol, which was used for blob storage until Q1 2017.

# Variables

ISO639BibliographicToThree

ISO639BibliographicToThree maps a 639-2 identifier for bibliographic applications to a three-letter 639-3 identifier.

ISO639NameToThree

ISO639NameToThree maps a language name to three letter identifier.

ISO639NameToThreeLower

ISO639NameToThreeLower converts lowercase ISO language name to ISO639-3.

ISO639OneToThree

ISO639OneToThree maps 639-1 identifier (two letters) (if there is one) to a three-letter 639-3 identifier.

Static

go:embed assets.

# Structs

Skip

Skip marks records to skip.