github.com/caltechlibrary/dataset

Version: 1.1.0

Repository: https://github.com/caltechlibrary/dataset.git

Documentation: pkg.go.dev

# README

Dataset Project

The Dataset Project provides tools for working with collections of JSON Object documents stored on the local file system. Two tools are provided.

dataset command line tool

dataset is a command line tool for working with collections of JSON objects. Collections are stored on the file system. JSON objects are stored in collections as plain UTF-8 text files. This means the objects can be accessed with common Unix text processing tools as well as most programming languages.

The dataset command line tool supports common data management operations such as initialization of collections; document creation, reading, updating and deleting; listing keys of JSON objects in the collection; and associating non-JSON documents (attachments) with specific JSON documents in the collection.

Enhanced features include:

aggregate objects into data frames
import, export and synchronize JSON objects to and from CSV files
generate sample sets of keys and objects

See Getting started with dataset for a tour and tutorial.

dataset as a web service

datasetd is a web service implementation of the dataset command line program. It features a subset of the capabilities found in the command line tool. This allows dataset collections to be integrated safely into other web applications or used by multiple processes.

Design choices

dataset and datasetd are intended to be simple tools for managing collections of JSON object documents in a predictable, structured way.

dataset and datasetd are guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. dataset is intended to be simple to use with minimal setup (e.g. dataset init mycollection.ds creates a new collection called 'mycollection.ds').

dataset and datasetd store JSON object documents in collections
- collections are folder(s) containing
  - collection.json metadata file describing the collection and keys
  - a pairtree of JSON object documents
  - non-JSON attachments can be associated with a JSON document and found in a semver (semantic version number) named sub directory
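
To make the pairtree idea concrete, here is a minimal sketch of how a key could be mapped to a nested directory path by splitting it into two-character segments, which is the general pairtree convention. The helper name `pairtreePath` is hypothetical and is not part of the dataset package; the exact layout dataset uses on disk (prefix directories, file names, character escaping) is not described here and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// pairtreePath is a hypothetical helper (not dataset's API). It lower-cases
// the key, then splits it into two-character segments joined by "/".
// Character escaping from the full pairtree convention is omitted.
func pairtreePath(key string) string {
	runes := []rune(strings.ToLower(key))
	var parts []string
	for i := 0; i < len(runes); i += 2 {
		end := i + 2
		if end > len(runes) {
			end = len(runes) // odd-length keys end with a one-character segment
		}
		parts = append(parts, string(runes[i:end]))
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(pairtreePath("Frieda")) // prints "fr/ie/da"
}
```

Spreading documents across short directory names like this keeps any single directory from growing very large, which is the usual motivation for a pairtree.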

The choice of plain UTF-8 is intended to help future-proof reading dataset collections. Care has been taken to keep dataset simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. dataset can be re-implemented in any programming language supporting file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.

Features

dataset supports

Listing Keys in a collection
Object level actions
- create
- read
- update
- delete
- Documents as attachments
  - attach
  - retrieve
  - prune
Import and export of CSV files
The ability to reshape data by performing simple object joins
The ability to create data frames from whole collections or based on key lists
- frames are defined using dot paths describing what is to be pulled out of stored JSON objects
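
As an illustration of what a dot path does, the sketch below walks a decoded JSON object one field name at a time. It assumes a path syntax such as `.name.family`; the helper `lookupDotPath` is hypothetical, and dataset's own dot path syntax and implementation may differ (array indexes, for example, are not handled here).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookupDotPath is a hypothetical helper (not dataset's API). It follows a
// path like ".name.family" through nested JSON objects and reports whether
// the value was found.
func lookupDotPath(obj map[string]interface{}, dotPath string) (interface{}, bool) {
	var cur interface{} = obj
	for _, name := range strings.Split(strings.TrimPrefix(dotPath, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false // reached a non-object before the path ended
		}
		cur, ok = m[name]
		if !ok {
			return nil, false // field is missing
		}
	}
	return cur, true
}

func main() {
	src := []byte(`{"name":{"family":"Doe","given":"Jane"},"year":2021}`)
	obj := map[string]interface{}{}
	if err := json.Unmarshal(src, &obj); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	v, ok := lookupDotPath(obj, ".name.family")
	fmt.Println(v, ok) // prints "Doe true"
}
```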

datasetd supports

List collections available from the web service
List or update a collection's metadata
List a collection's Keys
Object level actions
- create
- read
- update
- delete
- Documents as attachments
  - attach
  - retrieve
  - prune

Both dataset and datasetd may be useful for general data science applications needing intermediate JSON object management but not a full-blown database or repository system.

Limitations of dataset and datasetd

dataset has many limitations; some are listed below:

it is not a multi-process, multi-user data store
it is not a general purpose database system
it does not supply automatic version control on collections, objects or attachments
it stores all keys in lower case in order to deal with file systems that are not case sensitive
it does not have a built-in query language, search or sorting
it should NOT be used for sensitive or secret information

datasetd is a simple web service intended to run on "localhost:8485".

it is not a RESTful service
it does not include support for authentication
it does not support a query language, search or sorting
it does not support data frames
it does not support access control by users or roles
it does not provide auto key generation or versioning
it limits the size of JSON documents stored to less than 1 MiB
it limits the size of attached files to less than 250 MiB
it does not support partial JSON record updates or retrieval
it does not provide an interactive Web UI for working with dataset collections
it does not support HTTPS or "at rest" encryption
it should NOT be used for sensitive or secret information

Authors and history

R. S. Doiel
Tommy Morrell

Releases

Compiled versions are provided for Linux (x86), Mac OS X (x86 and M1), Windows 10 (x86) and Raspberry Pi OS (ARM7).

github.com/caltechlibrary/dataset/releases

Related projects

You can use dataset from Python via the py_dataset package.

# Packages

cli

cli is a package intended to encourage some standardization in the command line user interface for programs developed for Caltech Library.

cmd

No description provided by the author

tbl

tbl.go provides some utility functions to move string one and two dimensional slices into/out of one and two dimensional slices.

# Functions

Analyzer

Analyzer checks the collection version and analyzes the current state of the collection, reporting any errors.

DecodeJSON

DecodeJSON provides a common method for decoding data for use in Dataset.

DisplayLicense

DisplayLicense returns the license associated with dataset application.

DisplayUsage

DisplayUsage displays a usage message.

DisplayVersion

DisplayVersion returns the version of the dataset application.

EncodeJSON

EncodeJSON provides a common method for encoding data for use in Dataset.

Init

Init creates a new collection and opens it.

InitDatasetAPI

InitDatasetAPI initializes the web service by reading in a configuration file.

IsCollection

IsCollection checks to see if a given path contains a collection.json file.

LoadConfig

LoadConfig reads the JSON configuration file provided, validates it and either returns a Config structure or error.

Open

Open reads in a collection's metadata and returns a new collection structure or an error.

ParseSemver

ParseSemver takes a byte slice and returns a version struct, and an error value.
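
For readers unfamiliar with the idea, here is a minimal sketch of parsing a `major.minor.patch` string from a byte slice. The `version` type and `parseVersion` function are hypothetical stand-ins, not the package's Semver struct or ParseSemver function, and pre-release or build suffixes are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a hypothetical stand-in for a semantic version value.
type version struct {
	Major, Minor, Patch int
}

// parseVersion is a hypothetical helper (not the package's ParseSemver).
// It accepts input such as "1.1.0" or "v1.1.0".
func parseVersion(src []byte) (*version, error) {
	s := strings.TrimPrefix(strings.TrimSpace(string(src)), "v")
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("expected major.minor.patch, got %q", s)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid number %q in %q", p, s)
		}
		nums[i] = n
	}
	return &version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

func main() {
	v, err := parseVersion([]byte("1.1.0"))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%d.%d.%d\n", v.Major, v.Minor, v.Patch) // prints "1.1.0"
}
```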

Repair

Repair takes a collection name, walks the pairtree and repairs collection.json as appropriate.

RunDatasetAPI

RunDatasetAPI runs a dataset web service.

Shutdown

Shutdown shuts down the dataset web service started with RunDatasetAPI.

# Constants

Asc

Asc is used to identify ascending sorts.

Desc

Desc is used to identify descending sorts.

License

License is the formatted license text for dataset package based command line tools.

Version

Version of package.

# Structs

Attachment

Attachment is a structure for holding non-JSON content metadata you wish to store alongside a JSON document in a collection.

Collection

Collection is the container holding a pairtree containing JSON docs.

Config

Config holds a configuration file structure used by EPrints Extended API. The configuration file is expected to be in JSON format.

DataFrame

DataFrame is the basic structure holding a list of objects as well as the definition of the list (so you can regenerate an updated list from a changed collection).

Err

Err holds Semver's error messages.

KeyValue

KeyValue holds an ID string and value interface, this lets us work with numeric keys and to sort them.

PersonOrOrg

PersonOrOrg holds the description of a person or organization associated with the dataset collection.

Semver

Semver holds the information to generate a semver string.

Settings

Settings holds the specific settings for a collection.