# README
gitcollector

gitcollector collects and stores git repositories.
gitcollector is the source{d} tool to download and update git repositories at large scale. To that end, it uses a custom repository storage file format called siva optimized for saving storage space and keeping repositories up-to-date.
Status
The project is in a preliminary stable stage and under active development.
Storing repositories using rooted repositories
A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.
Rooted repositories have a few particularities that you should know to work with them effectively:
- They have no
HEAD
reference. - All references are of the following form:
{REFERENCE_NAME}/{REMOTE_NAME}
. For example, the referencerefs/heads/master
of the remotefoo
would be/refs/heads/master/foo
. - Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
- A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
- The root commit for a repository is obtained following the first parent of each commit from HEAD.
Getting started
Plain command
gitcollector entry point usage is done through the subcommand download
(at this time is the only subcommand):
Usage:
gitcollector [OPTIONS] download [download-OPTIONS]
Help Options:
-h, --help Show this help message
[download command options]
--library= path where download to [$GITCOLLECTOR_LIBRARY]
--bucket= library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
--tmp= directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
--workers= number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
--half-cpu set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
--no-updates don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
--no-forks github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
--orgs= list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
--excluded-repos= list of repos to exclude separated by comma [$GITCOLLECTOR_EXCLUDED_REPOS]
--token= github token [$GITHUB_TOKEN]
--metrics-db= uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
--metrics-db-table= table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
--metrics-sync-timeout= timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]
Log Options:
--log-level=[info|debug|warning|error] Logging level (default: info) [$LOG_LEVEL]
--log-format=[text|json] log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
--log-fields= default fields for the logger, specified in json [$LOG_FIELDS]
--log-force-format ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]
Usage example, --library
and --orgs
are always required:
gitcollector download --library=/path/to/repos/directoy --orgs=src-d
To collect repositories from several github organizations:
gitcollector download --library=/path/to/repos/directoy --orgs=src-d,bblfsh
Note that all the download command options are also configurable with environment variables.
Docker
gitcollector upload a new docker image to docker hub on each new release. To use it:
docker run --rm --name gitcollector_1 \
-e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
-e "GITHUB_TOKEN=foo" \
-v /path/to/repos/directory:/library \
srcd/gitcollector:latest
Note that you must mount a local directory into the specific container path shown in -v /path/to/repos/directory:/library
. This directory is where the repositories will be downloaded into rooted repositories in siva files format.
License
GPL v3.0, see LICENSE