Package: github.com/lorenzosaino/go-bqstreamer
Version: 2.0.1+incompatible
Repository: https://github.com/lorenzosaino/go-bqstreamer.git
Documentation: pkg.go.dev

# README

BigQuery Streamer

Stream-insert data into BigQuery quickly and concurrently, using InsertAll().

Features

  • Insert rows from multiple tables, datasets, and projects, and insert them in bulk. No need to manage data structures and sort rows by table - bqstreamer does it for you.
  • Multiple background workers (i.e. goroutines) to enqueue and insert rows.
  • Inserts can be performed in a blocking manner or in the background (asynchronously).
  • Perform insert operations in predefined batch sizes, according to BigQuery's quota policy.
  • Handle and retry BigQuery server errors.
  • Backoff interval between failed insert operations.
  • Error reporting.
  • Production ready, and thoroughly tested. We - at Rounds - are using it in our data gathering workflow.
  • Thorough testing and documentation for great good!

Getting Started

  1. Install Go, version 1.5 or newer.
  2. Download the package and its dependencies:
     • Version v2: go get gopkg.in/rounds/go-bqstreamer.v2
     • Version v1: go get gopkg.in/rounds/go-bqstreamer.v1
  3. Acquire Google OAuth2/JWT credentials, so you can authenticate with BigQuery.

How Does It Work?

There are two types of inserters you can use:

  1. SyncWorker, a single blocking (synchronous) worker. It enqueues rows and performs insert operations in a blocking manner.
  2. AsyncWorkerGroup, which employs multiple background SyncWorkers. The group enqueues rows, and its background workers pull and insert them in a fan-out model. Each background worker executes an insert operation once a row-amount or time threshold is reached. Errors are reported to an error channel for processing by the user. This provides higher insert throughput for larger-scale scenarios.

Examples

Check the GoDoc examples section.

Contribute

  1. Please check the issues page.
  2. File new bugs and ask for improvements.
  3. Pull requests welcome!

Test

# Run unit tests and check coverage.
$ make test

# Run integration tests.
# This requires an active project, dataset and pem key.
$ export BQSTREAMER_PROJECT=my-project
$ export BQSTREAMER_DATASET=my-dataset
$ export BQSTREAMER_TABLE=my-table
$ export BQSTREAMER_KEY=my-key.json
$ make testintegration

# Functions

New returns a new AsyncWorkerGroup using given OAuth2/JWT configuration.
NewJWTConfig returns a new JWT configuration from a JSON key, acquired via https://console.developers.google.com.
NewRow returns a new Row instance, with an automatically generated insert ID used for deduplication purposes.
NewRowWithID returns a new Row instance with given insert ID.
NewSyncWorker returns a new SyncWorker.
SetAsyncErrorChannel sets the asynchronous workers' error channel.
SetAsyncIgnoreUnknownValues sets whether to accept rows that contain values that do not match the table schema.
SetAsyncMaxDelay sets the maximum time delay a worker should wait before an insert operation is executed.
SetAsyncMaxRetries sets the maximum number of times a failed insert operation may be retried, before the rows are dropped and the insert operation is abandoned entirely.
SetAsyncMaxRows sets the maximum number of rows a worker can enqueue before an insert operation is executed.
SetAsyncNumWorkers sets the number of background workers.
SetAsyncRetryInterval sets the time delay before retrying a failed insert operation (if required).
SetAsyncSkipInvalidRows sets whether to insert all valid rows of a request, even if invalid rows exist.
SetSyncIgnoreUnknownValues sets whether to accept rows that contain values that do not match the table schema.
SetSyncMaxRetries sets the maximum number of times a failed insert operation may be retried, before the rows are dropped and the insert operation is abandoned entirely.
SetSyncRetryInterval sets the time delay before retrying a failed insert operation (if required).
SetSyncSkipInvalidRows sets whether to insert all valid rows of a request, even if invalid rows exist.

# Constants

BigQuery enforces a quota policy regarding how large and how frequent insert operations may be.

# Structs

AsyncWorkerGroup asynchronously streams rows to BigQuery in bulk.
InsertErrors is returned from an insert attempt.
Row associates a single BigQuery table row to a project, dataset and table.
RowErrors contains errors relating to a single row.
SyncWorker streams rows to BigQuery in bulk using synchronous calls.
TableInsertAttemptErrors contains errors relating to a single table insert attempt.
TableInsertErrors contains errors relating to a specific table from a bulk insert operation.
TooManyFailedInsertAttemptsError is returned when a specific insert attempt has been retried and failed multiple times, causing the worker to stop retrying and drop that table's insert operation entirely.

# Type aliases
