github.com/datatogether/warc
Version: 0.0.0-20190806125150-74ef3f5ea69f
Repository: https://github.com/datatogether/warc.git
Documentation: pkg.go.dev

# README

warc

warc is an implementation of ISO 28500 1.0, the Web ARChive (WARC) specification. It provides readers, writers, and structs for working with WARC records.

From the spec:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file is used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.
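
To make the layout in that passage concrete, here is a minimal sketch of one record on the wire, written with the standard library only. `formatRecord` is a hypothetical helper, not part of this package, and the header values in `main` are placeholders. It assumes the WARC 1.0 layout: a version line, named fields, a blank line, the block, and two trailing CRLFs.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// formatRecord is a hypothetical helper (not this package's API). It lays out
// a single record: version line, headers, blank line, block, two CRLFs.
// Content-Length is derived from the block; all other fields, including the
// mandatory ones, are the caller's responsibility.
func formatRecord(headers [][2]string, block []byte) []byte {
	var buf bytes.Buffer
	buf.WriteString("WARC/1.0\r\n")
	for _, h := range headers {
		buf.WriteString(h[0] + ": " + h[1] + "\r\n")
	}
	buf.WriteString("Content-Length: " + strconv.Itoa(len(block)) + "\r\n")
	buf.WriteString("\r\n") // blank line ends the header section
	buf.Write(block)
	buf.WriteString("\r\n\r\n") // record terminator
	return buf.Bytes()
}

func main() {
	rec := formatRecord([][2]string{
		{"WARC-Type", "resource"},
		{"WARC-Record-ID", "<urn:uuid:00000000-0000-4000-8000-000000000000>"}, // placeholder ID
		{"WARC-Date", "2019-08-06T12:51:50Z"},
		{"WARC-Target-URI", "http://example.com/"},
	}, []byte("hello"))

	// A WARC file is simply records concatenated one after another.
	file := bytes.Join([][]byte{rec, rec}, nil)
	fmt.Printf("%d bytes in %d records\n", len(file), 2)
}
```

For real use, the package's own writers (NewWriterRaw, NewWriterCompressed) are the intended route; the sketch only illustrates the format.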

License & Copyright

Affero General Public License v3

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests, and Pull Requests (PRs) for submitting changes.

Usage

import "github.com/datatogether/warc"

# Functions

CanonicalKey conforms keys to CanonicalMIMEHeaderKey (which is Capitals-For-First-Letter-Separated-By-Dashes) for any general input with exceptions for capitalized "WARC" header keys.
CountWriter implements a limited version of io.Seeker around the provided Writer.
NewReader creates a new WARC reader from an io.Reader. Always use NewReader instead of manually allocating a Reader.
NewRequestResponseRecords creates a new request/response record pair for the provided HTTP request and response.
NewUUID generates a new version 4 uuid.
NewWriterCompressed initializes a WARC Writer writing to a compressed stream.
NewWriterRaw initializes a WARC Writer writing to an uncompressed stream.
ParseRecordType parses a RecordType from a string.
Sanitize removes any data from a warc record body that may interfere with parsing.
Sha1Digest calculates the SHA-1 digest of a slice of bytes.
UnmarshalRecord reads a single record from data.
UnmarshalRecords reads a slice of records from a slice of bytes.
WriteHTTPHeaders writes all HTTP headers to an io.Writer, separated by newlines. It is used to add HTTP headers to a record.
WriteRecords calls Write on each record to w.
WriteRequestMethodAndHeaders calls req.Write(w).
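
The key normalization described for CanonicalKey can be approximated as below. This is a sketch under assumptions, not the package's implementation: `canonicalKey` and `upperTokens` are hypothetical names, and the set of tokens kept in capitals (WARC, ID, URI, IP) is a guess based on the field names in the WARC specification.

```go
package main

import (
	"fmt"
	"net/textproto"
	"strings"
)

// upperTokens lists the key segments this sketch assumes stay fully
// capitalized. The real package may use a different set of exceptions.
var upperTokens = map[string]string{
	"Warc": "WARC",
	"Id":   "ID",
	"Uri":  "URI",
	"Ip":   "IP",
}

// canonicalKey is a hypothetical stand-in for CanonicalKey. It first applies
// the standard MIME canonical form, then restores the capitalized segments.
func canonicalKey(key string) string {
	parts := strings.Split(textproto.CanonicalMIMEHeaderKey(key), "-")
	for i, p := range parts {
		if up, ok := upperTokens[p]; ok {
			parts[i] = up
		}
	}
	return strings.Join(parts, "-")
}

func main() {
	for _, k := range []string{"warc-record-id", "WARC-TARGET-URI", "content-type"} {
		fmt.Println(canonicalKey(k))
	}
}
```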

# Constants

The number of octets in the block, similar to [RFC2616].
The MIME type [RFC2045] of the information contained in the record's block.
An optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.
The WARC-Record-IDs of any records created as part of the same capture event as the current record.
A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601 [W3CDTF].
The WARC-Filename field may be used in 'warcinfo' type records and shall not be used for other record types.
The content-type of the record's payload as determined by an independent check.
The numeric Internet address contacted to retrieve any included content.
An optional parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record - which is not necessarily equivalent to the record block.
A URI signifying the kind of analysis and handling applied in a 'revisit' record.
An identifier assigned to the current record that is globally unique for its period of intended use.
The WARC-Refers-To field may be used to associate a 'metadata' record to another record it describes.
Reports the current record's relative ordering in a sequence of segmented records.
Identifies the starting record in a series of segmented records whose content blocks are reassembled to obtain a logically complete content block.
In the final record of a segmented series, reports the total length of all segment content blocks when concatenated together.
The original URI whose capture gave rise to the information content in this record.
For practical reasons, writers of the WARC format may place limits on the time or storage allocated to archiving a single resource.
The type of WARC record: one of 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', or 'continuation'.
When present, indicates the WARC-Record-ID of the associated 'warcinfo' record for this record.
RecordFormatUnknown represents an unknown or errored record format.
RecordFormatWarc is the default record format, WARC 1.0.
RecordTypeContinuation blocks from 'continuation' records must be appended to corresponding prior record block(s) (e.g., from other WARC files) to create the logically complete full-sized original record.
RecordTypeConversion shall contain an alternative version of another record's content that was created as the result of an archival process.
RecordTypeMetadata contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types.
RecordTypeRequest holds the details of a complete scheme-specific request, including network protocol information where possible.
RecordTypeResource contains a resource, without full protocol response information.
RecordTypeResponse should contain a complete scheme-specific response, including network protocol information where possible.
RecordTypeRevisit describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record.
RecordTypeUnknown is the default type of record, which shouldn't be accepted by anything that wants to know a type of record.
RecordTypeWarcInfo describes the records that follow it, up through end of file, end of input, or until next 'warcinfo' record.
TimeFormat is time.RFC3339, but with no timezone (just a Z).

# Structs

CaptureHelper is used for the NewRequestResponseRecords() method.
Reader parses WARC records from an underlying scanner.
A Record consists of a version indicator (e.g., WARC/1.0), zero or more headers, and possibly a content block.
Writer provides functionality for writing WARC files in compressed and uncompressed formats.

# Type aliases

Header mimics net/http's Header type, but with string values. Users should use the Get and Set methods instead of accessing the map directly.
RecordFormat determines different formats for records; it exists for any later support of ARC files, should we need to add it.
Records provides utility functions for slices of records.
RecordType enumerates different types of WARC Records.
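
The advice to use Get and Set on Header rather than indexing the map follows from key canonicalization. The sketch below shows the idea with a hypothetical `header` type built on `net/textproto`; the package's Header uses its own canonical form (see CanonicalKey), so treat the details as assumptions.

```go
package main

import (
	"fmt"
	"net/textproto"
)

// header is a hypothetical stand-in for the package's Header type:
// a map with single string values and canonicalized keys.
type header map[string]string

// Set stores value under the canonical form of key.
func (h header) Set(key, value string) {
	h[textproto.CanonicalMIMEHeaderKey(key)] = value
}

// Get looks up key by its canonical form, so casing does not matter.
func (h header) Get(key string) string {
	return h[textproto.CanonicalMIMEHeaderKey(key)]
}

func main() {
	h := header{}
	h.Set("content-type", "application/http")

	fmt.Println(h.Get("CONTENT-TYPE"))   // application/http
	fmt.Println(h["content-type"] == "") // true: direct access skips canonicalization
}
```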