Module: github.com/caltechlibrary/datatools
Version: 1.3.0
Repository: https://github.com/caltechlibrary/datatools.git
Documentation: pkg.go.dev

# README

datatools

datatools is a rich collection of command line programs targeting data conversion, cleanup and analysis directly from your favorite POSIX shell. It has proven useful for data collaborations where individual members of a project may prefer different toolsets in their analysis (e.g. Julia, R, Python) but want to work from a common baseline. It has also been used intensively for internal reporting from various Caltech Library metadata sources.

The tools fall into three broad categories:

  • data transformation and conversion
  • shell scripting helpers
  • "string", a tool providing the common string operations missing from shell

See the user manual for a complete list of the command line programs. The data transformation tools include support for formats such as Excel XML, CSV, tab delimited files, JSON, YAML and TOML.

Compiled versions of the datatools collection are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/datatools/releases.

Use the "-help" option for a full list of options for each utility (e.g. csv2json -help).

Data transformation

The transformation tooling includes data conversion tools that work with CSV, tab delimited, JSON, TOML, YAML and Excel XML.

There is also tooling to change data shapes using JSON as the intermediate data format.
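To make the idea of JSON as an intermediate format concrete, here is a minimal Go sketch that turns CSV text into a JSON array of objects using only the standard library. It is an illustration of the general technique, not the implementation behind csv2json or any other datatools program; the names csvToObjects and csvStringToJSON are hypothetical.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// csvToObjects (hypothetical helper) reads CSV from r, treats the first row
// as the header, and returns one map per remaining row keyed by column name.
func csvToObjects(r io.Reader) ([]map[string]string, error) {
	// With default settings csv.Reader rejects rows whose field count
	// differs from the first row, so indexing row[i] below is safe.
	rows, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, err
	}
	objects := []map[string]string{}
	if len(rows) == 0 {
		return objects, nil
	}
	header := rows[0]
	for _, row := range rows[1:] {
		obj := make(map[string]string, len(header))
		for i, name := range header {
			obj[name] = row[i]
		}
		objects = append(objects, obj)
	}
	return objects, nil
}

// csvStringToJSON (hypothetical helper) wraps csvToObjects and returns
// compact JSON text.
func csvStringToJSON(src string) (string, error) {
	objects, err := csvToObjects(strings.NewReader(src))
	if err != nil {
		return "", err
	}
	out, err := json.Marshal(objects)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := csvStringToJSON("name,language\nalice,Julia\nbob,R\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Note that encoding/json writes map keys in sorted order, so this sketch does not preserve the original column order. Once the data is in JSON form it can be reshaped before being written back out in another format.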

For the shell

Various utilities for simplifying work on the command line.

  • findfile - find files based on prefix, suffix or contained string
  • finddir - find directories based on prefix, suffix or contained string
  • mergepath - prefix, append, clip path variables
  • range - emit a range of integers (useful for numbered loops in Bash)
  • reldate - display a relative date in YYYY-MM-DD format
  • reltime - display a relative time in 24 hour notation, HH:MM:SS format
  • timefmt - format a time value based on Golang's time format language
  • urlparse - split a URL into parts
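The date and time helpers above build on Go's time format language, in which a layout is written as a rendering of the reference time "2006-01-02 15:04:05". The sketch below shows that mechanism with the standard time package only. It is not the source of reldate, reltime or timefmt; the function names shiftDate and reformatTime are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// shiftDate (hypothetical helper) parses a YYYY-MM-DD date, moves it by the
// given number of days (negative values go back in time) and returns the
// result in the same format.
func shiftDate(date string, days int) (string, error) {
	t, err := time.Parse("2006-01-02", date)
	if err != nil {
		return "", err
	}
	return t.AddDate(0, 0, days).Format("2006-01-02"), nil
}

// reformatTime (hypothetical helper) parses value using inLayout and
// renders it using outLayout; both are Go time layouts.
func reformatTime(value, inLayout, outLayout string) (string, error) {
	t, err := time.Parse(inLayout, value)
	if err != nil {
		return "", err
	}
	return t.Format(outLayout), nil
}

func main() {
	d, err := shiftDate("2024-02-27", 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d) // 2024-03-01 (2024 is a leap year)

	c, err := reformatTime("2024-02-27 14:05:09", "2006-01-02 15:04:05", "15:04:05")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c) // 14:05:09
}
```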

For strings

datatools provides the string command for working with text strings (limited to memory available). This is commonly needed when cleaning up data for analysis. The string command was created for when the old Unix standbys (grep, awk, sed, tr) are unwieldy or inconvenient. string provides operations that are common in most languages, like trimming, splitting, and transforming letter case. The string command also makes it easy to join JSON string arrays into a single string using a delimiter, or to split a string into a JSON array based on a delimiter. The form of the command is string [OPTIONS] [ACTION] [ACTION_PARAMETERS...]. For example:

    string toupper "one two three"

This would yield "ONE TWO THREE".

Some of the features included:

  • change case (upper, lower, title, English title)
  • length, position and count of substrings
  • has prefix, suffix or contains
  • trim prefix, suffix and cutsets
  • split and join to/from JSON string arrays

See string for full details.
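For readers who think in Go rather than shell, the operations listed above map closely onto the standard strings and encoding/json packages. The following is a rough sketch of that mapping under the assumption that the split and join actions exchange plain JSON arrays of strings; it is not the source of the string command, and splitToJSON and joinFromJSON are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitToJSON (hypothetical helper) splits s on delimiter and returns the
// parts encoded as a JSON array of strings.
func splitToJSON(s, delimiter string) (string, error) {
	out, err := json.Marshal(strings.Split(s, delimiter))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// joinFromJSON (hypothetical helper) decodes a JSON array of strings and
// joins its elements with delimiter.
func joinFromJSON(src, delimiter string) (string, error) {
	var parts []string
	if err := json.Unmarshal([]byte(src), &parts); err != nil {
		return "", err
	}
	return strings.Join(parts, delimiter), nil
}

func main() {
	fmt.Println(strings.ToUpper("one two three")) // ONE TWO THREE

	arr, err := splitToJSON("one,two,three", ",")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(arr) // ["one","two","three"]

	joined, err := joinFromJSON(arr, " ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(joined) // one two three

	fmt.Println(strings.TrimPrefix("prefix-body", "prefix-")) // body
	fmt.Println(strings.Count("banana", "an"))                // 2
}
```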

Installation

See INSTALL.md for details for installing pre-compiled versions of the programs.

# Packages

Package reldate generates a date in YYYY-MM-DD format based on a relative time description (e.g.
timefmt provides additional common formats found around the web that are missing from Golang's own time package.

# Functions

ApplyStopWords takes a list of words (array of strings) and removes any occurrences of the stop words, returning a revised list of words.
CodemetaToCitationCff converts a Codemeta.json file to the CITATION.cff format.
CSVMarshal takes a list of strings and returns a byte array of CSV formatted output.
CSVRandomRows reads from in, creates a csv Reader and Writer, and randomly selects rowCount rows to write out.
CSVRows renders the row numbers in rowNos to out using the delimiter.
CSVRowsAll renders all the rows in rowNos to out using the delimiter.
EnglishTitle uses improved capitalization rules for English titles.
Filter filters out characters from string.
FmtHelp lets you process a text block with simple curly brace markup.
JSONMarshal provides a custom JSON encoder to solve an issue with HTML entities getting converted to UTF-8 code points by json.Marshal(), json.MarshalIndent().
JSONMarshalIndent provides a custom JSON encoder to solve an issue with HTML entities getting converted to UTF-8 code points by json.Marshal(), json.MarshalIndent().
JSONObjectsToCSV takes a JSON array of objects and maps it to CSV columns/rows.
JSONUnmarshal is a custom JSON decoder that makes numbers easier to handle.
Levenshtein does a fuzzy match on two strings.
NormalizeDelimiters handles the messy translation from a format string received as an option in the cli to something useful to pass to Join.
NormalizeDelimiterRune takes a delimiter string and returns a single Rune.
OpenSQLStore opens a MySQL, Postgres or SQLite database based on a data source name expressed as a URL.
ParseRange takes a string in the form of a "range expression" like 1,2 (one and two), 1-3 (one, two, three) or 1,2,8-10 (one, two, eight, nine, ten) and returns an array of ints holding the values of the range expression.
SetEOLToCRLF sets whether the funcs with a csv.Writer in csv.go end lines with CRLF (true) or not (false).
Text2Fields processes an io.Reader as input and returns a byte array of fields and an error. Options provides the configuration to apply.
UseCRLF returns the current setting of CRLF used in funcs with csv.Writer in csv.go.
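The range expression format described for ParseRange is compact enough to sketch. The code below is an independent illustration of parsing such expressions, not the library's implementation; its signature is not taken from the package, and the name parseRangeSketch is hypothetical. It assumes non-negative integers and ascending spans.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRangeSketch (hypothetical, not the datatools ParseRange) expands a
// range expression such as "1,2,8-10" into the integers it names.
// Assumptions: values are non-negative and spans are written low-high.
func parseRangeSketch(expr string) ([]int, error) {
	var values []int
	for _, term := range strings.Split(expr, ",") {
		term = strings.TrimSpace(term)
		// A term is either a single value ("2") or a span ("8-10").
		lo, hi, isSpan := strings.Cut(term, "-")
		start, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, fmt.Errorf("invalid value in %q: %w", term, err)
		}
		end := start
		if isSpan {
			end, err = strconv.Atoi(strings.TrimSpace(hi))
			if err != nil {
				return nil, fmt.Errorf("invalid value in %q: %w", term, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending span %q", term)
		}
		for n := start; n <= end; n++ {
			values = append(values, n)
		}
	}
	return values, nil
}

func main() {
	values, err := parseRangeSketch("1,2,8-10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(values) // [1 2 8 9 10]
}
```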

# Constants

Constants for datatools functions.
ReleaseDate, the date version.go was generated.
ReleaseHash, the Git hash when version.go was generated.
Version number of release.

# Structs

Options is the data structure to configure the Text2Fields parser.
SQLCfg holds the information for connecting to a SQLStore and options for the CSV output.
SQLSrouce represents a wrapper around SQL database drivers using a common struct.