package
1.0.1
Repository: https://github.com/couchbaselabs/cbmigrate.git
Documentation: pkg.go.dev

# README

Hugging Face Dataset to Couchbase Migrator CLI

A command-line tool for inspecting Hugging Face datasets and migrating them to Couchbase, with support for streaming data.

Features

  • List dataset configurations, splits, and fields
  • Migrate datasets directly to Couchbase
  • Support for private datasets with authentication
  • Streaming data support
  • Batch processing capabilities
  • Customizable document ID generation

Commands

1. List Configurations

Lists all available configurations for a dataset.

cbmigrate hugging-face list-configs --path dataset

Flags:

  • --path: Path or name of the dataset (required)
  • --revision: Version of the dataset script to load
  • --download-config: Specific download configuration parameters
  • --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
  • --dynamic-modules-path: Path to dynamic modules
  • --data-files: Path(s) to source data file(s)
  • --token: Authentication token for private datasets
  • --json-output: Output the configurations in JSON format
  • --debug: Enable debug output
  • --trust-remote-code: Allow loading arbitrary code from the dataset repository

2. List Splits

Lists all available splits for a dataset.

cbmigrate hugging-face list-splits --path dataset

Flags:

  • --path: Path or name of the dataset (required)
  • --name: Configuration name of the dataset
  • --data-files: Path(s) to source data file(s)
  • --download-config: Specific download configuration parameters
  • --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
  • --revision: Version of the dataset script to load
  • --token: Authentication token for private datasets
  • --json-output: Output the splits in JSON format
  • --debug: Enable debug output
  • --trust-remote-code: Allow loading arbitrary code from the dataset repository

3. List Fields

Lists all fields (columns) in a dataset.

cbmigrate hugging-face list-fields --path dataset

Flags:

  • --path: Path or name of the dataset (required)
  • --name: Configuration name of the dataset
  • --data-files: Path(s) to source data file(s)
  • --download-config: Specific download configuration parameters
  • --revision: Version of the dataset script to load
  • --token: Authentication token for private datasets
  • --split: Which split of the data to load
  • --json-output: Output the fields in JSON format
  • --debug: Enable debug output
  • --trust-remote-code: Allow loading arbitrary code from the dataset repository
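
The three list commands are naturally run in sequence before a migration: configurations first, then the splits of one configuration, then the fields of one split. The following is a minimal sketch of that sequence, not part of cbmigrate itself. The function name inspect_dataset and the CBMIGRATE_BIN variable are hypothetical; only the subcommands and the --path, --name, and --split flags come from the lists above.

```shell
# Hypothetical helper: run the three list commands in order and stop at the
# first one that fails. CBMIGRATE_BIN is an assumption of this sketch; it
# lets the binary be overridden and defaults to "cbmigrate".
inspect_dataset() {
    dataset=$1   # value for --path
    config=$2    # value for --name
    split=$3     # value for --split
    bin=${CBMIGRATE_BIN:-cbmigrate}

    "$bin" hugging-face list-configs --path "$dataset" &&
    "$bin" hugging-face list-splits --path "$dataset" --name "$config" &&
    "$bin" hugging-face list-fields --path "$dataset" --name "$config" --split "$split"
}

# Example usage (commented out so the sketch does not contact Hugging Face):
# inspect_dataset dataset config-name train
```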

4. Migrate Dataset

Migrates data from Hugging Face to Couchbase.

cbmigrate hugging-face migrate \
    --path dataset \
    --id-fields id_field \
    --cb-url couchbase://localhost \
    --cb-username user \
    --cb-password pass \
    --cb-bucket my_bucket \
    --cb-scope my_scope \
    --cb-collection my_collection

Flags:

  • --path: Path or name of the dataset (required)
  • --id-fields: Comma-separated list of field names to use as document ID (required)
  • --cb-url: Couchbase cluster URL (required)
  • --cb-username: Couchbase username (required)
  • --cb-password: Couchbase password (required)
  • --cb-bucket: Couchbase bucket name (required)
  • --cb-scope: Couchbase scope name (required)
  • --cb-collection: Couchbase collection name
  • --name: Configuration name of the dataset
  • --data-files: Path(s) to source data file(s)
  • --split: Which split of the data to load
  • --cache-dir: Cache directory for datasets
  • --download-config: Specific download configuration parameters
  • --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
  • --verification-mode: Verification mode (no_checks, basic_checks, or all_checks)
  • --keep-in-memory: Keep dataset in memory
  • --save-infos: Save dataset information
  • --revision: Version of the dataset script to load
  • --token: Authentication token for private datasets
  • --no-streaming: Disable streaming mode (streaming is used unless this flag is passed)
  • --num-proc: Number of processes to use
  • --storage-options: Storage options for remote filesystems
  • --trust-remote-code: Allow loading arbitrary code from the dataset repository
  • --cb-batch-size: Number of documents to insert per batch (default: 1000)
  • --debug: Enable debug output

Examples

List configurations for a public dataset:

cbmigrate hugging-face list-configs --path dataset

List configurations for a private dataset:

cbmigrate hugging-face list-configs --path my-dataset --token YOUR_HF_TOKEN

List splits with specific configuration:

cbmigrate hugging-face list-splits --path dataset --name config-name

Migrate a dataset with multiple ID fields:

cbmigrate hugging-face migrate \
    --path dataset \
    --id-fields field1,field2 \
    --cb-url couchbase://localhost \
    --cb-username user \
    --cb-password pass \
    --cb-bucket my_bucket \
    --cb-scope my_scope \
    --cb-collection my_collection

Migrate a specific split with streaming (no --no-streaming flag is passed):

cbmigrate hugging-face migrate \
    --path dataset \
    --split train \
    --id-fields id_field \
    --cb-url couchbase://localhost \
    --cb-username user \
    --cb-password pass \
    --cb-bucket my_bucket \
    --cb-scope my_scope

Error Handling

The CLI exits with a non-zero status code if an error occurs during execution. Error messages are written to stderr.
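
Because failures are signalled through the exit status, a wrapper script can check that status and stop before running later steps. The sketch below shows one way to do this; run_or_report is a hypothetical helper name and is not provided by cbmigrate.

```shell
# Hypothetical helper: run any command, and if it fails, report its exit
# status on stderr and return that same status to the caller.
run_or_report() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit status $status: $*" >&2
    fi
    return "$status"
}

# Example usage (commented out so the sketch does not run the CLI):
# run_or_report cbmigrate hugging-face list-splits --path dataset || exit 1
```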

Logging

  • Use the --debug flag with any command to enable debug-level logging
  • Use --json-output with the list commands (list-configs, list-splits, list-fields) for machine-readable output
  • Progress information is displayed during migration
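
For scripting, the JSON output of a list command can be saved to a file for later processing. The sketch below assumes that the JSON is written to stdout and that --json-output takes no value; neither is stated explicitly above. The names save_listing_json and CBMIGRATE_BIN are hypothetical.

```shell
# Hypothetical helper: run one of the list-* subcommands with --json-output
# and save its stdout to a file. Assumes the JSON is printed on stdout.
# CBMIGRATE_BIN is an assumption of this sketch; it defaults to "cbmigrate".
save_listing_json() {
    subcommand=$1   # list-configs, list-splits, or list-fields
    dataset=$2      # value for --path
    outfile=$3      # file that receives the JSON

    "${CBMIGRATE_BIN:-cbmigrate}" hugging-face "$subcommand" \
        --path "$dataset" --json-output > "$outfile"
}

# Example usage (commented out so the sketch does not run the CLI):
# save_listing_json list-splits dataset splits.json
```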

Authentication

  • For private Hugging Face datasets, use the --token option
  • Couchbase credentials (--cb-username and --cb-password) are required for migration operations
  • Credentials can be provided via command-line options
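
One way to avoid typing credentials into each command is to keep them in environment variables and expand them into the documented flags. This is a sketch only: the variable names CB_USERNAME, CB_PASSWORD, HF_TOKEN, and CBMIGRATE_BIN are assumptions made for illustration, and nothing above says that cbmigrate reads them itself.

```shell
# Hypothetical wrapper: take credentials from environment variables (names
# chosen for this sketch) and pass them through the documented flags.
# CBMIGRATE_BIN defaults to "cbmigrate" and exists only so the binary can
# be overridden.
migrate_with_env_credentials() {
    if [ -z "${CB_USERNAME:-}" ] || [ -z "${CB_PASSWORD:-}" ]; then
        echo "CB_USERNAME and CB_PASSWORD must be set" >&2
        return 1
    fi
    # Append --token only when HF_TOKEN is set (needed for private datasets).
    if [ -n "${HF_TOKEN:-}" ]; then
        set -- "$@" --token "$HF_TOKEN"
    fi
    "${CBMIGRATE_BIN:-cbmigrate}" hugging-face migrate \
        --cb-username "$CB_USERNAME" \
        --cb-password "$CB_PASSWORD" \
        "$@"
}

# Example usage (commented out so the sketch does not run a migration):
# migrate_with_env_credentials --path dataset --id-fields id_field \
#     --cb-url couchbase://localhost --cb-bucket my_bucket \
#     --cb-scope my_scope --cb-collection my_collection
```

The remaining flags are passed through unchanged, so the wrapper accepts the same options as the migrate command.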
