Module/package: github.com/cockroachdb/pebble/v2
Version: 2.0.2
Repository: https://github.com/cockroachdb/pebble.git
Documentation: pkg.go.dev

# README

Pebble

Pebble is a LevelDB/RocksDB-inspired key-value store focused on performance and internal usage by CockroachDB. Pebble inherits the RocksDB file formats and a few extensions such as range deletion tombstones, table-level bloom filters, and updates to the MANIFEST format.

Pebble intentionally does not aspire to include every feature in RocksDB and specifically targets the use case and feature set needed by CockroachDB:

  • Block-based tables
  • Checkpoints
  • Indexed batches
  • Iterator options (lower/upper bound, table filter)
  • Level-based compaction
  • Manual compaction
  • Merge operator
  • Prefix bloom filters
  • Prefix iteration
  • Range deletion tombstones
  • Reverse iteration
  • SSTable ingestion
  • Single delete
  • Snapshots
  • Table-level bloom filters

RocksDB has a large number of features that are not implemented in Pebble:

  • Backups
  • Column families
  • Delete files in range
  • FIFO compaction style
  • Forward iterator / tailing iterator
  • Hash table format
  • Memtable bloom filter
  • Persistent cache
  • Pin iterator key / value
  • Plain table format
  • SSTable ingest-behind
  • Sub-compactions
  • Transactions
  • Universal compaction style

WARNING: Pebble may silently corrupt data or behave incorrectly if used with a RocksDB database that uses a feature Pebble doesn't support. Caveat emptor!

Production Ready

Pebble was introduced as an alternative storage engine to RocksDB in CockroachDB v20.1 (released May 2020) and was used in production successfully at that time. Pebble was made the default storage engine in CockroachDB v20.2 (released Nov 2020). Pebble is being used in production by users of CockroachDB at scale and is considered stable and production ready.

Advantages

Pebble offers several improvements over RocksDB:

  • Faster reverse iteration via backwards links in the memtable's skiplist.
  • Faster commit pipeline that achieves better concurrency.
  • Seamless merged iteration of indexed batches. The mutations in the batch conceptually occupy another memtable level.
  • L0 sublevels and flush splitting for concurrent compactions out of L0 and reduced read-amplification during heavy write load.
  • Faster LSM edits in LSMs with large numbers of sstables through use of a copy-on-write B-tree to hold file metadata.
  • Delete-only compactions that drop whole sstables that fall within the bounds of a range deletion.
  • Block-property collectors and filters that enable iterators to skip tables, index blocks and data blocks that are irrelevant, according to user-defined properties over key-value pairs.
  • Range keys API, allowing KV pairs defined over a range of keyspace with user-defined semantics and interleaved during iteration.
  • Smaller, more approachable code base.
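The delete-only compaction bullet above comes down to a containment test on key bounds. The following is a minimal sketch of that test only, not Pebble's implementation: the types `tableBounds` and `rangeDel` and the function `coversTable` are hypothetical names, and the sketch ignores sequence numbers and snapshots, which a real engine would also have to consider before dropping a table.

```go
package main

import (
	"bytes"
	"fmt"
)

// tableBounds is a hypothetical stand-in for an sstable's user-key bounds.
// Both ends are inclusive.
type tableBounds struct {
	smallest, largest []byte
}

// rangeDel is a hypothetical range deletion covering [start, end).
type rangeDel struct {
	start, end []byte
}

// coversTable reports whether every key in t falls inside d. The start bound
// is inclusive and the end bound is exclusive, so a table whose largest key
// equals d.end is not covered.
func coversTable(d rangeDel, t tableBounds) bool {
	return bytes.Compare(d.start, t.smallest) <= 0 &&
		bytes.Compare(t.largest, d.end) < 0
}

func main() {
	del := rangeDel{start: []byte("b"), end: []byte("m")}
	tables := []tableBounds{
		{[]byte("c"), []byte("f")}, // fully inside [b, m)
		{[]byte("k"), []byte("p")}, // straddles the end bound
		{[]byte("a"), []byte("c")}, // starts before the range
	}
	for _, t := range tables {
		fmt.Printf("[%s, %s] covered: %t\n", t.smallest, t.largest, coversTable(del, t))
	}
}
```

Only a table that passes this check could be dropped whole; a table that straddles either bound would still need its surviving keys rewritten.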

See the Pebble vs RocksDB: Implementation Differences doc for more details on implementation differences.

RocksDB Compatibility

Pebble v1 strives for forward compatibility with RocksDB 6.2.1 (the latest version of RocksDB used by CockroachDB). Forward compatibility means that a DB generated by RocksDB 6.2.1 can be upgraded for use by Pebble. Pebble versions in the v1 series may open DBs generated by RocksDB 6.2.1. Since its introduction, Pebble has adopted various backwards-incompatible format changes that are gated behind new 'format major versions'. Pebble v2 and newer do not support opening DBs generated by RocksDB. DBs generated by RocksDB may only be used with recent versions of Pebble after migrating them through format major version upgrades using previous versions of Pebble. See the section on format major versions below.

Even the RocksDB-compatible versions of Pebble only provide compatibility with the subset of functionality and configuration used by CockroachDB. The scope of RocksDB functionality and configuration is too large to adequately test and document all the incompatibilities. The list below contains known incompatibilities.

  • Pebble's use of WAL recycling is only compatible with RocksDB's kTolerateCorruptedTailRecords WAL recovery mode. Older versions of RocksDB would automatically map incompatible WAL recovery modes to kTolerateCorruptedTailRecords. New versions of RocksDB will disable WAL recycling.
  • Column families. Pebble does not support column families, nor does it attempt to detect their usage when opening a DB that may contain them.
  • Hash table format. Pebble does not support the hash table sstable format.
  • Plain table format. Pebble does not support the plain table sstable format.
  • SSTable format version 3 and 4. Pebble does not support version 3 and version 4 format sstables. The sstable format version is controlled by the BlockBasedTableOptions::format_version option. See #97.

Format major versions

Over time Pebble has introduced new physical file formats. Backwards-incompatible changes are made through the introduction of 'format major versions'. By default, when Pebble opens a database, it uses the lowest supported version. In v1, this is FormatMostCompatible, which is bi-directionally compatible with RocksDB 6.2.1 (with the caveats described above).

Databases created by RocksDB or Pebble versions v1 and earlier must be upgraded to a compatible format major version before running newer Pebble versions. Newer Pebble versions will refuse to open databases in formats that are no longer supported.

To opt into new formats, a user may set FormatMajorVersion on the Options supplied to Open, or upgrade the format major version at runtime using DB.RatchetFormatMajorVersion. Format major version upgrades are permanent; there is no option to return to an earlier format.
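The one-way nature of these upgrades can be pictured as a ratchet. The sketch below is a toy model in plain Go, not the Pebble API: `store`, `ratchet`, and both error values are hypothetical, and how the real DB.RatchetFormatMajorVersion reacts to a lower or unknown target is an assumption here rather than something this document states.

```go
package main

import (
	"errors"
	"fmt"
)

// formatMajorVersion models a format major version as a plain number.
type formatMajorVersion uint64

// store is a hypothetical stand-in for a database that records the format
// major version it has been upgraded to.
type store struct {
	current formatMajorVersion // version the store is at now
	newest  formatMajorVersion // newest version this build understands
}

var (
	errUnknownFormat = errors.New("target format is newer than this build supports")
	errDowngrade     = errors.New("format major version upgrades are permanent")
)

// ratchet moves the store forward to target. In this toy model a target
// below the current version is rejected (an assumption), and a target equal
// to the current version is a no-op.
func (s *store) ratchet(target formatMajorVersion) error {
	switch {
	case target > s.newest:
		return errUnknownFormat
	case target < s.current:
		return errDowngrade
	}
	s.current = target
	return nil
}

func main() {
	s := &store{current: 13, newest: 19}
	for _, target := range []formatMajorVersion{16, 13, 20, 19} {
		err := s.ratchet(target)
		fmt.Printf("ratchet(%d): err=%v, now at %d\n", target, err, s.current)
	}
}
```

The version only ever moves up, which is why a store should be ratcheted only once every binary that may need to open it understands the target format.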

The table below outlines the history of format major versions, along with what range of Pebble versions support that format.

| Name | Value | Migration | Pebble support |
| --- | --- | --- | --- |
| FormatMostCompatible | 1 | No | v1 |
| FormatVersioned | 3 | No | v1 |
| FormatSetWithDelete | 4 | No | v1 |
| FormatBlockPropertyCollector | 5 | No | v1 |
| FormatSplitUserKeysMarked | 6 | Background | v1 |
| FormatSplitUserKeysMarkedCompacted | 7 | Blocking | v1 |
| FormatRangeKeys | 8 | No | v1 |
| FormatMinTableFormatPebblev1 | 9 | No | v1 |
| FormatPrePebblev1Marked | 10 | Background | v1 |
| FormatSSTableValueBlocks | 12 | No | v1 |
| FormatFlushableIngest | 13 | No | v1, v2, master |
| FormatPrePebblev1MarkedCompacted | 14 | Blocking | v1, v2, master |
| FormatDeleteSizedAndObsolete | 15 | No | v1, v2, master |
| FormatVirtualSSTables | 16 | No | v1, v2, master |
| FormatSyntheticPrefixSuffix | 17 | No | v2, master |
| FormatFlushableIngestExcises | 18 | No | v2, master |
| FormatColumnarBlocks | 19 | No | v2, master |

Upgrading to a format major version with 'Background' in the migration column may trigger background activity to rewrite physical file formats, typically through compactions. Upgrading to a format major version with 'Blocking' in the migration column will block until a migration is complete. The database may continue to serve reads and writes if upgrading a live database through RatchetFormatMajorVersion, but the method call will not return until the migration is complete.
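For scripting upgrade checks, the history table can be encoded as data. The sketch below copies the names, values, migration kinds, and support columns from the table in this section; the type `formatInfo` and the helpers `supportedRange` and `migrationFor` are hypothetical and not part of Pebble.

```go
package main

import "fmt"

// formatInfo mirrors one row of the format major version table.
type formatInfo struct {
	name      string
	value     int
	migration string   // "No", "Background", or "Blocking"
	series    []string // Pebble release series that support the format
}

var (
	v1Only = []string{"v1"}
	v1Up   = []string{"v1", "v2", "master"}
	v2Up   = []string{"v2", "master"}
)

var formats = []formatInfo{
	{"FormatMostCompatible", 1, "No", v1Only},
	{"FormatVersioned", 3, "No", v1Only},
	{"FormatSetWithDelete", 4, "No", v1Only},
	{"FormatBlockPropertyCollector", 5, "No", v1Only},
	{"FormatSplitUserKeysMarked", 6, "Background", v1Only},
	{"FormatSplitUserKeysMarkedCompacted", 7, "Blocking", v1Only},
	{"FormatRangeKeys", 8, "No", v1Only},
	{"FormatMinTableFormatPebblev1", 9, "No", v1Only},
	{"FormatPrePebblev1Marked", 10, "Background", v1Only},
	{"FormatSSTableValueBlocks", 12, "No", v1Only},
	{"FormatFlushableIngest", 13, "No", v1Up},
	{"FormatPrePebblev1MarkedCompacted", 14, "Blocking", v1Up},
	{"FormatDeleteSizedAndObsolete", 15, "No", v1Up},
	{"FormatVirtualSSTables", 16, "No", v1Up},
	{"FormatSyntheticPrefixSuffix", 17, "No", v2Up},
	{"FormatFlushableIngestExcises", 18, "No", v2Up},
	{"FormatColumnarBlocks", 19, "No", v2Up},
}

// supportedRange returns the lowest and highest format values that the given
// release series supports, or ok=false if the series is not in the table.
func supportedRange(series string) (lo, hi int, ok bool) {
	for _, f := range formats {
		for _, s := range f.series {
			if s != series {
				continue
			}
			if !ok || f.value < lo {
				lo = f.value
			}
			if !ok || f.value > hi {
				hi = f.value
			}
			ok = true
		}
	}
	return lo, hi, ok
}

// migrationFor returns the migration kind listed for a named format.
func migrationFor(name string) (string, bool) {
	for _, f := range formats {
		if f.name == name {
			return f.migration, true
		}
	}
	return "", false
}

func main() {
	for _, s := range []string{"v1", "v2", "master"} {
		lo, hi, _ := supportedRange(s)
		fmt.Printf("%s supports format major versions %d through %d\n", s, lo, hi)
	}
	m, _ := migrationFor("FormatPrePebblev1MarkedCompacted")
	fmt.Println("FormatPrePebblev1MarkedCompacted migration:", m)
}
```

A check like `migrationFor` helps to anticipate whether a ratchet call may block until a migration finishes, as described above for 'Blocking' versions.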

Upgrading existing stores can be performed via the RatchetFormatMajorVersion method. If the database does not use a custom comparer, merger, or block property collectors, the pebble tool can also be used, at the latest version that supports the format. For example:

# WARNING: only use if no custom comparer/merger/property collector are necessary.
go run github.com/cockroachdb/pebble/v2/cmd/[email protected] db upgrade <db-dir>

For reference, the table below lists the range of supported Pebble format major versions for CockroachDB releases.

| CockroachDB release | Earliest supported | Latest supported |
| --- | --- | --- |
| 20.1 through 21.1 | FormatMostCompatible | FormatMostCompatible |
| 21.2 | FormatMostCompatible | FormatSetWithDelete |
| 22.1 | FormatMostCompatible | FormatSplitUserKeysMarked |
| 22.2 | FormatMostCompatible | FormatPrePebblev1Marked |
| 23.1 | FormatSplitUserKeysMarkedCompacted | FormatFlushableIngest |
| 23.2 | FormatPrePebblev1Marked | FormatVirtualSSTables |
| 24.1 | FormatFlushableIngest | FormatSyntheticPrefixSuffix |
| 24.2 | FormatVirtualSSTables | FormatSyntheticPrefixSuffix |
| 24.3 | FormatSyntheticPrefixSuffix | FormatColumnarBlocks |
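As a rough illustration of how to read the CockroachDB table, the sketch below encodes each release's supported window using the numeric values from the format major version table and lists which releases fall around a given store format. The names `releaseRange` and `openableBy` are hypothetical, and the sketch says nothing about CockroachDB's own upgrade rules beyond what the table records.

```go
package main

import "fmt"

// releaseRange is one row of the CockroachDB table, with format names
// replaced by their numeric values from the format major version table.
type releaseRange struct {
	release          string
	earliest, latest int
}

var releases = []releaseRange{
	{"20.1 through 21.1", 1, 1}, // FormatMostCompatible only
	{"21.2", 1, 4},              // up to FormatSetWithDelete
	{"22.1", 1, 6},              // up to FormatSplitUserKeysMarked
	{"22.2", 1, 10},             // up to FormatPrePebblev1Marked
	{"23.1", 7, 13},             // FormatSplitUserKeysMarkedCompacted to FormatFlushableIngest
	{"23.2", 10, 16},            // FormatPrePebblev1Marked to FormatVirtualSSTables
	{"24.1", 13, 17},            // FormatFlushableIngest to FormatSyntheticPrefixSuffix
	{"24.2", 16, 17},            // FormatVirtualSSTables to FormatSyntheticPrefixSuffix
	{"24.3", 17, 19},            // FormatSyntheticPrefixSuffix to FormatColumnarBlocks
}

// openableBy lists the releases whose supported window contains the given
// format major version value.
func openableBy(value int) []string {
	var out []string
	for _, r := range releases {
		if r.earliest <= value && value <= r.latest {
			out = append(out, r.release)
		}
	}
	return out
}

func main() {
	for _, v := range []int{1, 13, 19} {
		fmt.Printf("format %d is within the window of: %v\n", v, openableBy(v))
	}
}
```

Read this way, a store left at FormatMostCompatible falls outside the 23.1 window and would have to be ratcheted to at least FormatSplitUserKeysMarkedCompacted under an earlier release first.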

Pedigree

Pebble is based on the incomplete Go version of LevelDB:

https://github.com/golang/leveldb

The Go version of LevelDB is based on the C++ original:

https://github.com/google/leveldb

Optimizations and inspiration were drawn from RocksDB:

https://github.com/facebook/rocksdb

Getting Started

Example Code

package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble/v2"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	key := []byte("hello")
	if err := db.Set(key, []byte("world"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	value, closer, err := db.Get(key)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s %s\n", key, value)
	if err := closer.Close(); err != nil {
		log.Fatal(err)
	}
	if err := db.Close(); err != nil {
		log.Fatal(err)
	}
}

# Packages

Package batchrepr provides interfaces for reading and writing the binary batch representation.
Package bloom implements Bloom filters.
Package metamorphic provides a testing framework for running randomized tests over multiple Pebble databases with varying configurations.
Package rangekey provides functionality for working with range keys.
Package record reads and writes sequences of records.
Package replay implements collection and replaying of compaction benchmarking workloads.
Package sstable implements readers and writers of pebble tables.

# Functions

CanDeterministicallySingleDelete takes a valid iterator and examines internal state to determine if a SingleDelete deleting Iterator.Key() would deterministically delete the key.
DebugCheckLevels calls CheckLevels on the provided database.
GetVersion returns the engine version string from the latest options file present in dir.
IsCorruptionError returns true if the given error indicates database corruption.
LockDirectory acquires the database directory lock in the named directory, preventing another process from opening the database.
MakeInternalKey constructs an internal key from a specified user key, sequence number and kind.
MakeInternalKeyTrailer constructs a trailer from a specified sequence number and kind.
MakeLoggingEventListener creates an EventListener that logs all events to the specified logger.
NewCache creates a new cache of the specified size.
NewExternalIter takes an input 2d array of sstable files which may overlap across subarrays but not within a subarray (at least as far as points are concerned; range keys are allowed to overlap arbitrarily even within a subarray), and returns an Iterator over the merged contents of the sstables.
NewExternalIterWithContext is like NewExternalIter, and additionally accepts a context for tracing.
NewTableCache will create a reference to the table cache.
Open opens a DB whose files live in the given directory.
Peek looks for an existing database in dirname on the provided FS.
TableCacheSize can be used to determine the table cache size for a single db, given the maximum open files which can be used by a table cache which is only used by a single db.
TeeEventListener wraps two EventListeners, forwarding all events to both.
WithApproximateSpanBytes enables capturing the approximate number of bytes that overlap the provided key span for each sstable.
WithFlushedWAL enables flushing and syncing the WAL prior to constructing a checkpoint.
WithInitialSizeBytes sets a custom initial size for the batch.
WithKeyRangeFilter ensures returned sstables overlap start and end (end-exclusive). If start and end are both nil, these properties have no effect.
WithMaxRetainedSizeBytes sets a custom max size for the batch to be re-used.
WithProperties enables returning sstable properties in each TableInfo.
WithRestrictToSpans specifies spans of interest for the checkpoint.
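The input rule described for NewExternalIter (files may overlap across subarrays but not within one, as far as point keys are concerned) can be checked ahead of time if the key bounds of each file are known. This is a sketch under that assumption: `fileBounds` and `firstOverlappingLevel` are hypothetical, and the real function takes sstable files rather than bounds.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// fileBounds is a hypothetical stand-in for the point-key bounds of one
// sstable. Both ends are inclusive, and smallest <= largest is assumed.
type fileBounds struct {
	smallest, largest []byte
}

// firstOverlappingLevel returns the index of the first subarray that holds
// two files with overlapping bounds, or -1 if every subarray is internally
// disjoint. Overlap between different subarrays is allowed and not checked.
func firstOverlappingLevel(levels [][]fileBounds) int {
	for i, level := range levels {
		// Sort a copy by smallest key; once sorted, checking neighbouring
		// files is enough to detect any overlap within the subarray.
		sorted := append([]fileBounds(nil), level...)
		sort.Slice(sorted, func(a, b int) bool {
			return bytes.Compare(sorted[a].smallest, sorted[b].smallest) < 0
		})
		for j := 1; j < len(sorted); j++ {
			if bytes.Compare(sorted[j].smallest, sorted[j-1].largest) <= 0 {
				return i
			}
		}
	}
	return -1
}

func main() {
	ok := [][]fileBounds{
		{{[]byte("a"), []byte("c")}, {[]byte("d"), []byte("f")}},
		{{[]byte("b"), []byte("e")}}, // overlaps the first subarray, which is fine
	}
	bad := [][]fileBounds{
		{{[]byte("a"), []byte("d")}, {[]byte("c"), []byte("f")}},
	}
	fmt.Println(firstOverlappingLevel(ok), firstOverlappingLevel(bad))
}
```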

# Constants

BackingTypeExternal denotes an sstable stored on external storage, not owned by any Pebble instance and with no refcounting/cleanup methods or lifecycle management.
BackingTypeLocal denotes an sstable stored on local disk according to the objprovider.
BackingTypeShared denotes an sstable stored on shared storage, created by this Pebble instance and possibly shared by other Pebble instances.
BackingTypeSharedForeign denotes an sstable stored on shared storage, created by a Pebble instance other than this one.
Exported Compression constants.
FormatColumnarBlocks is a format major version enabling use of the TableFormatPebblev5 table format, that encodes sstable data blocks, index blocks and keyspan blocks by organizing the KVs into columns within the block.
FormatDefault leaves the format version unspecified.
FormatDeleteSizedAndObsolete is a format major version that adds support for deletion tombstones that encode the size of the value they're expected to delete.
FormatFlushableIngest is a format major version that enables lazy addition of ingested sstables into the LSM structure.
FormatFlushableIngestExcises is a format major version that adds support for having excises unconditionally being written as flushable ingestions.
FormatMinForSharedObjects is the minimum format version that supports shared objects (see CreateOnShared option).
FormatMinSupported is the minimum format version that is supported by this Pebble version.
FormatNewest is the most recent format major version.
FormatPrePebblev1MarkedCompacted is a format major version that guarantees that all sstables explicitly marked for compaction in the manifest (see FormatPrePebblev1Marked) have been compacted.
FormatSyntheticPrefixSuffix is a format major version that adds support for sstables to have their content exposed in a different prefix or suffix of keyspace than the actual prefix/suffix persisted in the keys in such sstables.
FormatVirtualSSTables is a format major version that adds support for virtual sstables that can reference a sub-range of keys in an underlying physical sstable.
InterfaceCall represents calls to Iterator.
InternalIterCall represents calls by Iterator to its internalIterator.
These constants are part of the file format, and should not be changed.
IterAtLimit represents an Iterator that has a non-exhausted internalIterator, but has reached a limit without any key for the caller.
IteratorLevelFlushable indicates a flushable (i.e.
IteratorLevelLSM indicates an LSM level.
IteratorLevelUnknown indicates an unknown LSM level.
IterExhausted represents an Iterator that is exhausted.
IterKeyTypePointsAndRanges configures an iterator to iterate over both point keys and range keys simultaneously.
IterKeyTypePointsOnly configures an iterator to iterate over point keys only.
IterKeyTypeRangesOnly configures an iterator to iterate over range keys only.
IterValid represents an Iterator that is valid.
NumStatsKind is the number of kinds, and is used for array sizing.
Exported TableFilter constants.

# Variables

CheckComparer is a mini test suite that verifies a comparer implementation.
DefaultComparer exports the base.DefaultComparer variable.
DefaultLogger logs to the Go stdlib logs.
DefaultMerger exports the base.DefaultMerger variable.
ErrBatchTooLarge indicates that a batch has exceeded the maximum allowed size.
ErrCancelledCompaction is returned if a compaction is cancelled by a concurrent excise or ingest-split operation.
ErrClosed is panicked when an operation is performed on a closed snapshot or DB.
ErrCorruption is a marker to indicate that data in a file (WAL, MANIFEST, sstable) isn't in the expected format.
ErrDBAlreadyExists is generated when ErrorIfExists is set and the database already exists.
ErrDBDoesNotExist is generated when ErrorIfNotExists is set and the database does not exist.
ErrDBNotPristine is generated when ErrorIfNotPristine is set and the database already exists and is not pristine.
ErrInvalidBatch indicates that a batch is invalid or otherwise corrupted.
ErrInvalidSkipSharedIteration is returned by ScanInternal if it was called with a shared file visitor function, and a file in a shareable level (i.e.
ErrNotFound is returned when a get operation does not find the requested key.
ErrNotIndexed means that a read operation on a batch failed because the batch is not indexed and thus doesn't support reads.
ErrReadOnly is returned when a write operation is performed on a read-only database.
FsyncLatencyBuckets are prometheus histogram buckets suitable for a histogram that records latencies for fsyncs.
JemallocSizeClasses exports sstable.JemallocSizeClasses.
NoSync specifies the default write options for writes which do not synchronize to disk.
SecondaryCacheChannelWriteBuckets is exported from package pebble so that CRDB can export metrics using these buckets.
SecondaryCacheIOBuckets is exported from package pebble so that CRDB can export metrics using these buckets.
Sync specifies the default write options for writes which synchronize to disk.

# Structs

A Batch is a sequence of Sets, Merges, Deletes, DeleteRanges, RangeKeySets, RangeKeyUnsets, and/or RangeKeyDeletes that are applied atomically.
BatchCommitStats exposes stats related to committing a batch.
CheckLevelsStats provides basic stats on points and tombstones encountered.
CheckpointSpan is a key range [Start, End) (inclusive on Start, exclusive on End) of interest for a checkpoint.
CloneOptions configures an iterator constructed through Iterator.Clone.
CompactionInfo contains the info for a compaction event.
DB provides a concurrent, persistent ordered key/value store.
DBDesc briefly describes high-level state about a database.
DeferredBatchOp represents a batch operation (eg.
DownloadInfo contains the info for a DB.Download() event.
DownloadSpan is a key range passed to the Download method.
ErrMissingWALRecoveryDir is an error returned when an attempt is made to open a database without supplying an Options.WALRecoveryDir entry for a directory that may contain WALs required to recover a consistent database state.
EventListener contains a set of functions that will be invoked when various significant DB events occur.
EventuallyFileOnlySnapshot (aka EFOS) provides a read-only point-in-time view of the database state, similar to Snapshot.
ExternalFile describes external sstables that can be referenced through objprovider and ingested as remote files that will not be refcounted or cleaned up.
FlushInfo contains the info for a flush event.
IngestOperationStats provides some information about where in the LSM the bytes were ingested.
Iterator iterates over a DB's key/value pairs in key order.
IteratorLevel is used with scanInternalIterator to surface additional iterator-specific info where possible.
IteratorMetrics holds per-iterator metrics.
IteratorStats contains iteration stats.
IterOptions hold the optional per-query parameters for NewIter.
KeyRange encodes a key range in user key space.
KeyStatistics keeps track of the number of keys that have been pinned by a snapshot as well as counts of the different key kinds in the lsm.
LevelInfo contains info pertaining to a particular level.
LevelMetrics holds per-level metrics such as the number of files and total size of the files, and compaction related metrics.
LevelOptions holds the optional per-level parameters.
Lock represents a file lock on a directory.
LSMKeyStatistics is used by DB.ScanStatistics.
ManifestCreateInfo contains info about a manifest creation event.
ManifestDeleteInfo contains the info for a Manifest deletion event.
Metrics holds metrics for various subsystems of the DB such as the Cache, Compactions, WAL, and per-Level metrics.
NoMultiLevel will never add an additional level to the compaction.
Options holds the optional parameters for configuring pebble.
ParseHooks contains callbacks to create options fields which can have user-defined implementations.
RangeKeyData describes a range key's data, set through RangeKeySet.
RangeKeyIteratorStats contains miscellaneous stats about range keys encountered by the iterator.
RangeKeyMasking configures automatic hiding of point keys by range keys.
ScanStatisticsOptions is used by DB.ScanStatistics.
SharedSSTMeta represents an sstable on shared storage that can be ingested by another pebble instance.
Snapshot provides a read-only point-in-time view of the DB state.
SSTableInfo exports manifest.TableInfo with sstable.Properties alongside other file backing info.
TableCache is a shareable cache for open sstables.
TableCreateInfo contains the info for a table creation event.
TableDeleteInfo contains the info for a table deletion event.
TableIngestInfo contains the info for a table ingestion event.
TableStatsInfo contains the info for a table stats loaded event.
TableValidatedInfo contains information on the result of a validation run on an sstable.
WALCreateInfo contains info about a WAL creation event.
WALDeleteInfo contains the info for a WAL deletion event.
WALFailoverOptions configures the WAL failover mechanics to use during transient write unavailability on the primary WAL volume.
WriteAmpHeuristic defines a multi level compaction heuristic which will add an additional level to the picked compaction if it reduces predicted write amp of the compaction + the addPropensity constant.
WriteOptions hold the optional per-query parameters for Set and Delete operations.
WriteStallBeginInfo contains the info for a write stall begin event.

# Interfaces

BlockPropertyFilterMask extends the BlockPropertyFilter interface for use with range-key masking.
CPUWorkHandle represents a handle used by the CPUWorkPermissionGranter API.
CPUWorkPermissionGranter is used to request permission to opportunistically use additional CPUs to speed up internal background work.
MultiLevelHeuristic evaluates whether to add files from the next level into the compaction.
Reader is a readable key/value store.
Writer is a writable key/value store.

# Type aliases

AbbreviatedKey exports the base.AbbreviatedKey type.
ArchiveCleaner exports the base.ArchiveCleaner type.
AttributeAndLen exports the base.AttributeAndLen type.
BackingType denotes the type of storage backing a given sstable.
BatchOption allows customizing the batch.
BlockPropertyCollector exports the sstable.BlockPropertyCollector type.
BlockPropertyFilter exports the sstable.BlockPropertyFilter type.
Cache exports the cache.Cache type.
CacheMetrics holds metrics for the block and table cache.
CheckpointOption sets optional parameters used by `DB.Checkpoint`.
Cleaner exports the base.Cleaner type.
Compare exports the base.Compare type.
Comparer exports the base.Comparer type.
Compression exports the base.Compression type.
DeletableValueMerger exports the base.DeletableValueMerger type.
DeleteCleaner exports the base.DeleteCleaner type.
DiskSlowInfo contains the info for a disk slowness event when writing to a file.
Equal exports the base.Equal type.
FileNum is an identifier for a file within a database.
FilterMetrics holds metrics for the filter policy.
FilterPolicy exports the base.FilterPolicy type.
FilterType exports the base.FilterType type.
FilterWriter exports the base.FilterWriter type.
FormatMajorVersion is a constant controlling the format of persisted data.
InternalIteratorStats contains miscellaneous stats produced by internal iterators.
InternalKey exports the base.InternalKey type.
InternalKeyKind exports the base.InternalKeyKind type.
InternalKeyTrailer exports the base.InternalKeyTrailer type.
IteratorLevelKind is used to denote whether the current ScanInternal iterator is unknown, belongs to a flushable, or belongs to an LSM level type.
IteratorStatsKind describes the two kinds of iterator stats.
IterKeyType configures which types of keys an iterator should surface.
IterValidityState captures the state of the Iterator.
JobID identifies a job (like a compaction).
KeySchema exports the colblk.KeySchema type.
LazyFetcher exports the base.LazyFetcher type.
LazyValue is a lazy value.
Logger defines an interface for writing log messages.
LoggerAndTracer defines an interface for logging and tracing.
Merge exports the base.Merge type.
Merger exports the base.Merger type.
ReadaheadConfig controls the use of read-ahead.
SecondaryCacheMetrics holds metrics for the persistent secondary cache that caches commonly accessed blocks from blob storage on a local file system.
Separator exports the base.Separator type.
SeqNum exports the base.SeqNum type.
ShortAttribute exports the base.ShortAttribute type.
ShortAttributeExtractor exports the base.ShortAttributeExtractor type.
Split exports the base.Split type.
SSTablesOption sets optional parameters used by `DB.SSTables`.
Successor exports the base.Successor type.
TableInfo exports the manifest.TableInfo type.
ThroughputMetric is a cumulative throughput metric.
UserKeyPrefixBound exports the sstable.UserKeyPrefixBound type.
ValueMerger exports the base.ValueMerger type.