Module: github.com/aih/bills
Version: 0.0.0-20211213174351-a9b073a84c7c
Repository: https://github.com/aih/bills.git
Documentation: pkg.go.dev

# README

:toc: auto

Process Bills with Golang

This repository (bills) defines a module (github.com/aih/bills) with a number of packages defined in the /cmd directory.

To build the commands in the bills/cmd directory, run `make build` (which applies the Makefile).

To test, run `make test` (the tests are not complete).

To run both, simply run `make`.

The packages in the cmd directory, which build to cmd/bin, are:

badgerkv:: a test for storing data in the badger database. (TODO: convert this instead to a test for the badgerkv package.)
billmeta:: a command-line tool to create bill metadata and store it to a file. Command-line options include `-p` to specify a parent path for the bills to process, or `-billNumber` to process a specific bill. The metadata is created by makeBillsMeta and enriched by finding bills that have the same titles and main titles. To run a sample and store results in testMeta.json, run `cmd/bin/billmeta -parentPath ./samples -billMetaPath ./samples/test/results/testMeta.json`
committees:: a command-line tool to download committees.yaml to tmp/committees.yaml
comparematrix:: a command-line tool that takes a list of bills as input and outputs a matrix of bill similarity (including the category of similarity)
esquery:: finds the similar bills for each section of bills. It depends on having an Elasticsearch index of bills, divided into sections. The esquery command can be run on a sample of bills, or on all bills. Bills are not yet processed concurrently, but the architecture (processing one bill at a time, by bill number) is designed to allow this.
jsonpgx:: a command-line tool to work with PostgreSQL
legislators:: a command-line tool to download legislators.yaml to tmp/legislators.yaml
unitedstates:: a stub (not currently working) that will download and process bill data and metadata

Note: Some of these commands process many files in parallel. To prevent problems on systems that limit the number of open files (e.g. Ubuntu), we've added a max open files parameter (see, e.g., billmeta). In addition, to prevent crashes due to system memory limitations on the production server, I increased the swap file size to 4 GB (see https://askubuntu.com/a/1075516/686037).

Bill Processing Pipeline

Bills are downloaded into the file structure defined in https://github.com/unitedstates/congress. Additional JSON metadata files are created and stored in the filepath for each bill, e.g. [path]/congress/data/117/bills/hr/hr1500. The files are stored separately so that each process can be run independently and concurrently.

  1. Download documents to the congress directory (using the unitedstates repository at https://github.com/unitedstates/congress). TODO: develop a Go alternative for downloads.
  2. Process bill metadata (using billmeta) and store it in the path for each bill. There is also an option to store all metadata in a single file, [path]/congress/billMetaGo.json, and in Golang key/value stores. This processing also creates key/value stores for titles and for main titles, which are saved to files (titleNoYearIndexGo.json and mainTitleNoYearIndexGo.json). TODO: add an option to save these indexes to a database.
  3. Index bill XML to Elasticsearch. Currently, this is done in Python in https://github.com/aih/BillMap. The processing there is relatively fast (< 10 minutes to index all bills), and processing performance may be limited by calls to Elasticsearch, so a Go alternative may not result in much of a performance boost. Note that the billtoxml.go file contains utilities to parse XML and select sections.
  4. For each bill, find similar bills by section using the esquery command. The list of similar sections for each bill is stored in the filename defined as esSimilarityFileName = "esSimilarity.json" in cmd/esquery/main.go. The bills that are similar to the latest version of a given bill are collected in another file, defined as esSimilarBillsDictFileName = "esSimilarBillsDict.json".
  5. For the most similar bills, calculate similarity scores and assign categories (e.g. identical, nearly identical, includes, includedby). A map of bill:categories is stored (also as part of esquery) in a file defined by esSimilarCategoryFileName = "esSimilarCategory.json"
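The title index from step 2 can be pictured as a map from a title with its trailing year removed to the bills that carry that title. The sketch below is an assumption about the general shape of such an index, not the billmeta implementation; the regular expression, the function names, and the sample titles are all invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// titleYearRe matches a trailing " of 1979"-style suffix (4-digit year).
var titleYearRe = regexp.MustCompile(`\s+of\s+\d{4}$`)

// titleNoYear strips the trailing year, so that reintroductions of a bill
// in different years share one key. Hypothetical helper.
func titleNoYear(title string) string {
	return titleYearRe.ReplaceAllString(strings.TrimSpace(title), "")
}

// buildTitleIndex maps each year-less title to the sorted bill numbers
// that use it. Hypothetical helper.
func buildTitleIndex(titlesByBill map[string][]string) map[string][]string {
	index := make(map[string][]string)
	for bill, titles := range titlesByBill {
		for _, t := range titles {
			key := titleNoYear(t)
			index[key] = append(index[key], bill)
		}
	}
	for key := range index {
		sort.Strings(index[key]) // map iteration order is random; sort for stable output
	}
	return index
}

func main() {
	// Invented sample data, not real bill titles.
	titles := map[string][]string{
		"116hr1": {"Example Relief Act of 2019"},
		"115hr2": {"Example Relief Act of 2017"},
		"116s3":  {"Sample Transparency Act"},
	}
	out, err := json.MarshalIndent(buildTitleIndex(titles), "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Bills that share a key in such an index are candidates for the "same title" enrichment that billmeta performs.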

# Functions

Converts a bill_id of the form `hr299-116` into `116hr299`.
Gets the bill number and version from the bill path.
Converts a bill number of the form `116hr299` into `hr299-116`.
For each path to a data file, creates a random sample of tokenized words of length sampleFraction * (number of tokenized words) and sends the result to a channel.
Collects a random sample of tokenized words for each bill in the 'document.xml' files in the 'congress' directory and writes the results to wordSamplePath.
Compares a sample list of documents, defined in dOC_PATHS.
Copies the src file to dst.
Tokenizer function that returns words longer than 3 characters which do not have certain punctuation.
DownloadFile downloads a URL to a local file.
Extracts bill metadata from a path to a data.json file; sends it to the billMetaStorageChannel as part of a WaitGroup passed as wg.
Find takes a slice and looks for an element in it.
Returns a map of regex capture groups to the items that are matched.
Gets all ids, which includes bill and version.
Sorts the eh, es, and enr versions as latest, then sorts by date. TODO: a better method is to get the latest version in Fdsys_billstatus.
Set sample size to <= 0 to use all sections.
similars is the result of the MLT query.
Gets keys of a sync.Map.
Walks the 'congress' directory and gets filepaths to 'data.json', which contains metadata for the bill.
Walks the 'congress' directory and gets filepaths to 'document.xml', which contains the bill XML.
Walks the 'congress' directory and creates three metadata files: bills, titlesJson, and billMeta. bills is the list of bill numbers (billCongressTypeNumber); titles is a list of titles (no year); billMeta collects metadata from data.json files.
Creates a map with n-grams as keys and number of occurrences as values; n is the number of words in each n-gram.
Creates a list of n-grams.
Makes a local tmp directory if it doesn't exist.
Returns the keys of a map of type map[string]int.
Marshals a sync.Map object of the type map[string]BillMeta; see https://stackoverflow.com/a/46390611/628748 and https://stackoverflow.com/a/65442862/628748.
Marshals a sync.Map object of the type map[string][]string; see https://stackoverflow.com/a/46390611/628748 and https://stackoverflow.com/a/65442862/628748.
Gets the bill path from the bill number and version.
Removes duplicates in a list of strings and returns the deduplicated list; trims leading and trailing space for each element.
Reverses a slice of strings.
Saves Data in JSON to bill directory.
Saves bill metadata to billMeta.json.
Saves bill metadata to db (badger or bolt) via bh.
Performs a scroll query over the indices in `searchIndices`; sends the result to resultChan for processing to extract bill numbers. See https://github.com/elastic/go-elasticsearch/issues/44#issuecomment-483974031.
Unmarshals from JSON to a syncMap. See https://stackoverflow.com/a/65442862/628748.
Walks a directory with a filter.
TODO: return saved path.

# Variables

Constants for this package.
globals.
titleSyncMap = new(sync.Map).
for xpath.
Set to ../../congress.
matches strings of the form '...of 1979', where the year is a 4-digit number.
matches the title in the <dc:title> element.
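The two pattern descriptions above (a trailing "of 1979"-style year and the contents of a `<dc:title>` element) can be combined with the "map of regex capture groups" helper from the function list. The sketch below uses Go's named capture groups for that purpose. The exact patterns and the names `titleYearRe`, `dcTitleRe`, and `captureGroups` are assumptions; the package's real expressions are not shown in this listing.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed pattern for a title ending in "of <4-digit year>".
	titleYearRe = regexp.MustCompile(`of (?P<year>\d{4})$`)
	// Assumed pattern for the text inside a <dc:title> element.
	dcTitleRe = regexp.MustCompile(`(?s)<dc:title>(?P<title>.*?)</dc:title>`)
)

// captureGroups returns a map from each named capture group in re to the
// text it matched in s, or nil if s does not match. Hypothetical helper.
func captureGroups(re *regexp.Regexp, s string) map[string]string {
	match := re.FindStringSubmatch(s)
	if match == nil {
		return nil
	}
	groups := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" { // index 0 is the whole match; skip unnamed groups
			groups[name] = match[i]
		}
	}
	return groups
}

func main() {
	// Invented sample inputs, not real bill data.
	fmt.Println(captureGroups(titleYearRe, "Example Relief Act of 1979")["year"])
	fmt.Println(captureGroups(dcTitleRe, "<metadata><dc:title>Example Relief Act</dc:title></metadata>")["title"])
}
```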

# Structs

ResultHits represents the result of the search hits.
SearchResult represents the result of the search operation.
This is the form of item in `es_similar_bills_dict`; for each billnumber (e.g.
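For orientation, a search result and its hits could be decoded from an Elasticsearch response along the following lines. This is a sketch under assumptions: it follows the general response shape of Elasticsearch 7 and later (where `hits.total` is an object), the struct and field names are invented rather than copied from this package, and the `billnumber` field inside `_source` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hit is a single search hit; _source is kept raw so that callers can
// decode it into whatever document type the index holds.
type hit struct {
	ID     string          `json:"_id"`
	Score  float64         `json:"_score"`
	Source json.RawMessage `json:"_source"`
}

// resultHits mirrors the "hits" object of a search response.
type resultHits struct {
	Total struct {
		Value int `json:"value"`
	} `json:"total"`
	MaxScore float64 `json:"max_score"`
	Hits     []hit   `json:"hits"`
}

// searchResult mirrors the top level of a search response.
type searchResult struct {
	Took int        `json:"took"`
	Hits resultHits `json:"hits"`
}

// parseSearchResult decodes a raw search response body.
func parseSearchResult(data []byte) (searchResult, error) {
	var r searchResult
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Invented sample response; the _source contents are hypothetical.
	sample := []byte(`{"took":3,"hits":{"total":{"value":1,"relation":"eq"},
		"max_score":12.5,"hits":[{"_id":"example-id","_score":12.5,
		"_source":{"billnumber":"116hr299"}}]}}`)

	r, err := parseSearchResult(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, h := range r.Hits.Hits {
		fmt.Printf("%s scored %.1f\n", h.ID, h.Score)
	}
}
```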
