Module: github.com/segmentio/parquet-go
Version: v0.0.0-20230712180008-5d42db8f0d47
Repository: https://github.com/segmentio/parquet-go.git
Documentation: pkg.go.dev
# README
Project has been archived
Development has moved to https://github.com/parquet-go/parquet-go. No APIs have changed; we just decided to create a new organization for this library. Thank you to all of the contributors for your hard work.
segmentio/parquet-go
High-performance Go library to manipulate parquet files.
# Packages
Package bloom implements parquet bloom filters.
Package compress provides the generic APIs implemented by parquet compression codecs.
Package encoding provides the generic APIs implemented by parquet encodings in its sub-packages.
Package hashprobe provides implementations of probing tables for various data types.
Package sparse contains abstractions to help work on arrays of values in sparse memory locations.
# Functions
AppendRow appends to row the given list of column values.
Ascending constructs a SortingColumn value which sorts the column at the path given as argument in ascending order.
AsyncPages wraps the given Pages instance to perform page reads asynchronously in a separate goroutine.
BloomFilters creates a configuration option which defines the bloom filters that parquet writers should generate.
BooleanValue constructs a BOOLEAN parquet value from the bool passed as argument.
BSON constructs a leaf node of BSON logical type.
ByteArrayValue constructs a BYTE_ARRAY parquet value from the byte slice passed as argument.
ColumnBufferCapacity creates a configuration option which defines the size of row group column buffers.
ColumnIndexSizeLimit creates a configuration option to customize the size limit of page boundaries recorded in column indexes.
ColumnPageBuffers creates a configuration option to customize the buffer pool used when constructing row groups.
CompareDescending constructs a comparison function which inverses the order of values.
CompareNullsFirst constructs a comparison function which assumes that null values are smaller than all other values.
CompareNullsLast constructs a comparison function which assumes that null values are greater than all other values.
Compressed wraps the node passed as argument to use the given compression codec.
Compression creates a configuration option which sets the default compression codec used by a writer for columns where none were defined.
Convert constructs a conversion function from one parquet schema to another.
ConvertRowGroup constructs a wrapper of the given row group which applies the given schema conversion to its rows.
ConvertRowReader constructs a wrapper of the given row reader which applies the given schema conversion to the rows.
CopyPages copies pages from src to dst, returning the number of values that were copied.
CopyRows copies rows from src to dst.
CopyValues copies values from src to dst, returning the number of values that were written.
CreatedBy creates a configuration option which sets the name of the application that created a parquet file.
DataPageStatistics creates a configuration option which defines whether data page statistics are emitted.
DataPageVersion creates a configuration option which configures the version of data pages used when creating a parquet file.
Date constructs a leaf node of DATE logical type.
Decimal constructs a leaf node of decimal logical type with the given scale, precision, and underlying type.
DedupeRowReader constructs a row reader which drops duplicated consecutive rows, according to the comparator function passed as argument.
DedupeRowWriter constructs a row writer which drops duplicated consecutive rows, according to the comparator function passed as argument.
DeepEqual returns true if v1 and v2 are equal, including their repetition levels, definition levels, and column indexes.
DefaultFileConfig returns a new FileConfig value initialized with the default file configuration.
DefaultReaderConfig returns a new ReaderConfig value initialized with the default reader configuration.
DefaultRowGroupConfig returns a new RowGroupConfig value initialized with the default row group configuration.
DefaultSortingConfig returns a new SortingConfig value initialized with the default sorting configuration.
DefaultWriterConfig returns a new WriterConfig value initialized with the default writer configuration.
Descending constructs a SortingColumn value which sorts the column at the path given as argument in descending order.
DoubleValue constructs a DOUBLE parquet value from the float64 passed as argument.
DropDuplicatedRows configures whether a sorting writer will keep or remove duplicated rows.
Encoded wraps the node passed as argument to use the given encoding.
Enum constructs a leaf node with a logical type representing enumerations.
Equal returns true if v1 and v2 are equal.
FileReadMode is a file configuration option which controls the way pages are read.
FileSchema is used to pass a known schema in while opening a Parquet file.
FilterRowReader constructs a RowReader which exposes rows from reader for which the predicate has returned true.
FilterRowWriter constructs a RowWriter which writes rows to writer for which the predicate has returned true.
Find uses the ColumnIndex passed as argument to locate the page of a column chunk in which the given value is expected to be found.
FixedLenByteArrayType constructs a type for fixed-length values of the given size (in bytes).
FixedLenByteArrayValue constructs a FIXED_LEN_BYTE_ARRAY parquet value from the byte slice passed as argument.
FloatValue constructs a FLOAT parquet value from the float32 passed as argument.
Int constructs a leaf node of signed integer logical type of the given bit width.
Int32Value constructs an INT32 parquet value from the int32 passed as argument.
Int64Value constructs an INT64 parquet value from the int64 passed as argument.
Int96Value constructs an INT96 parquet value from the deprecated.Int96 passed as argument.
JSON constructs a leaf node of JSON logical type.
KeyValueMetadata creates a configuration option which adds key/value metadata to add to the metadata of parquet files.
Leaf returns a leaf node of the given type.
List constructs a node of LIST logical type.
LookupCompressionCodec returns the compression codec associated with the given code.
LookupEncoding returns the parquet encoding associated with the given code.
MakeRow constructs a Row from a list of column values.
Map constructs a node of MAP logical type.
MaxRowsPerRowGroup configures the maximum number of rows that a writer will produce in each row group.
MergeRowGroups constructs a row group which is a merged view of rowGroups.
MergeRowReader constructs a RowReader which creates an ordered sequence of all the readers using the given compare function as the ordering predicate.
MultiRowGroup wraps multiple row groups to appear as if they were a single RowGroup.
MultiRowWriter constructs a RowWriter which dispatches writes to all the writers passed as arguments.
NewBuffer constructs a new buffer, using the given list of buffer options to configure the buffer returned by the function.
NewBufferPool creates a new in-memory page buffer pool.
NewColumnIndex constructs a ColumnIndex instance from the given parquet format column index.
NewFileBufferPool creates a new on-disk page buffer pool.
NewFileConfig constructs a new file configuration applying the options passed as arguments.
NewGenericBuffer is like NewBuffer but returns a GenericBuffer[T] suited to write rows of Go type T.
NewGenericReader is like NewReader but returns a GenericReader[T] suited to read rows of Go type T.
NewGenericWriter is like NewWriter but returns a GenericWriter[T] suited to write rows of Go type T.
NewReader constructs a parquet reader reading rows from the given io.ReaderAt.
NewReaderConfig constructs a new reader configuration applying the options passed as arguments.
NewRowBuffer constructs a new row buffer.
NewRowBuilder constructs a RowBuilder which builds rows for the parquet schema passed as argument.
NewRowGroupConfig constructs a new row group configuration applying the options passed as arguments.
NewRowGroupReader constructs a new Reader which reads rows from the RowGroup passed as argument.
NewSchema constructs a new Schema object with the given name and root node.
NewSortingConfig constructs a new sorting configuration applying the options passed as arguments.
NewSortingWriter constructs a new sorting writer which writes a parquet file where rows of each row group are ordered according to the sorting columns configured on the writer.
NewWriter constructs a parquet writer writing a file to the given io.Writer.
NewWriterConfig constructs a new writer configuration applying the options passed as arguments.
NullsFirst wraps the SortingColumn passed as argument so that it instructs the row group to place null values first in the column.
NullValue constructs a null value, which is the zero-value of the Value type.
OpenFile opens a parquet file and reads the content between offset 0 and the given size in r.
Optional wraps the given node to make it optional.
PageBufferSize configures the size of column page buffers on parquet writers.
Read reads and returns rows from the parquet file in the given reader.
ReadBufferSize is a file configuration option which controls the default buffer sizes for reads made to the provided io.Reader.
ReadFile reads rows of the parquet file at the given path.
Release is a helper function to decrement the reference counter of pages backed by memory which can be granularly managed by the application.
Repeated wraps the given node to make it repeated.
Required wraps the given node to make it required.
Retain is a helper function to increment the reference counter of pages backed by memory which can be granularly managed by the application.
ScanRowReader constructs a RowReader which exposes rows from reader until the predicate returns false for one of the rows, or EOF is reached.
SchemaOf constructs a parquet schema from a Go value.
Search is like Find, but uses the default ordering of the given type.
SkipBloomFilters is a file configuration option which prevents automatically reading the bloom filters when opening a parquet file, when set to true.
SkipPageIndex is a file configuration option which prevents automatically reading the page index when opening a parquet file, when set to true.
SortingBuffers creates a configuration option which sets the pool of buffers used to hold intermediary state when sorting parquet rows.
SortingColumns creates a configuration option which defines the sorting order of columns in a row group.
SortingRowGroupConfig is a row group option which applies configuration specific to sorting row groups.
SortingWriterConfig is a writer option which applies configuration specific to sorting writers.
SplitBlockFilter constructs a split block bloom filter object for the column at the given path, with the given bitsPerValue.
String constructs a leaf node of UTF8 logical type.
Time constructs a leaf node of TIME logical type.
Timestamp constructs a leaf node of TIMESTAMP logical type.
TransformRowReader constructs a RowReader which applies the given transform to each row read from reader.
TransformRowWriter constructs a RowWriter which applies the given transform to each row written to writer.
Uint constructs a leaf node of unsigned integer logical type of the given bit width.
UUID constructs a leaf node of UUID logical type.
ValueOf constructs a parquet value from a Go value v.
Write writes the given list of rows to a parquet file written to w.
WriteBufferSize configures the size of the write buffer.
WriteFile writes the given list of rows to a parquet file created at the given path.
ZeroValue constructs a zero value of the given kind.
# Constants
MaxColumnDepth is the maximum column depth supported by this package.
MaxColumnIndex is the maximum column index supported by this package.
MaxDefinitionLevel is the maximum definition level supported by this package.
MaxRepetitionLevel is the maximum repetition level supported by this package.
MaxRowGroups is the maximum number of row groups which can be contained in a single parquet file.
ReadModeAsync reads pages asynchronously in the background.
ReadModeSync reads pages synchronously on demand (Default).
# Variables
BitPacked is the deprecated bit-packed encoding for repetition and definition levels.
Brotli is the BROTLI parquet compression codec.
ByteStreamSplit is an encoding for floating-point data.
DeltaBinaryPacked is the delta binary packed parquet encoding.
DeltaByteArray is the delta byte array parquet encoding.
DeltaLengthByteArray is the delta length byte array parquet encoding.
ErrCorrupted is an error returned by the Err method of ColumnPages instances when they encountered a mismatch between the CRC checksum recorded in a page header and the one computed while reading the page data.
ErrConversion is used to indicate that a conversion between two values cannot be done because there are no rules to translate between their physical types.
ErrMissingPageHeader is an error returned when a page reader encounters a malformed page header which is missing page-type-specific information.
ErrMissingRootColumn is an error returned when opening an invalid parquet file which does not have a root column.
ErrRowGroupSchemaMismatch is an error returned when attempting to write a row group but the source and destination schemas differ.
ErrRowGroupSchemaMissing is an error returned when attempting to write a row group but the source has no schema.
ErrRowGroupSortingColumnsMismatch is an error returned when attempting to write a row group but the sorting columns differ in the source and destination.
ErrSeekOutOfRange is an error returned when seeking to a row index which is less than the first row of a page.
ErrTooManyRowGroups is returned when attempting to generate a parquet file with more than MaxRowGroups row groups.
ErrUnexpectedDefinitionLevels is an error returned when attempting to decode definition levels into a page which is part of a required column.
ErrUnexpectedDictionaryPage is an error returned when a page reader encounters a dictionary page after the first page, or in a column which does not use a dictionary encoding.
ErrUnexpectedRepetitionLevels is an error returned when attempting to decode repetition levels into a page which is not part of a repeated column.
Gzip is the GZIP parquet compression codec.
Lz4Raw is the LZ4_RAW parquet compression codec.
Plain is the default parquet encoding.
PlainDictionary is the plain dictionary parquet encoding.
RLE is the hybrid bit-pack/run-length parquet encoding.
RLEDictionary is the RLE dictionary parquet encoding.
Snappy is the SNAPPY parquet compression codec.
Uncompressed is a parquet compression codec representing uncompressed pages.
Zstd is the ZSTD parquet compression codec.
# Structs
Buffer represents an in-memory group of parquet rows.
Column represents a column in a parquet file.
ConvertError is an error type returned by calls to Convert when the conversion of parquet schemas is impossible or the input row for the conversion is malformed.
DataPageHeaderV1 is an implementation of the DataPageHeader interface representing data pages version 1.
DataPageHeaderV2 is an implementation of the DataPageHeader interface representing data pages version 2.
DictionaryPageHeader is an implementation of the PageHeader interface representing dictionary pages.
File represents a parquet file.
The FileConfig type carries configuration options for parquet files.
GenericBuffer is similar to a Buffer but uses a type parameter to define the Go type representing the schema of rows in the buffer.
GenericReader is similar to a Reader but uses a type parameter to define the Go type representing the schema of rows being read.
GenericWriter is similar to a Writer but uses a type parameter to define the Go type representing the schema of rows being written.
LeafColumn is a struct type representing leaf columns of a parquet schema.
Deprecated: A Reader reads Go values from parquet files.
The ReaderConfig type carries configuration options for parquet readers.
RowBuffer is an implementation of the RowGroup interface which stores parquet rows in memory.
RowBuilder is a type which helps build parquet rows incrementally by adding values to columns.
The RowGroupConfig type carries configuration options for parquet row groups.
Schema represents a parquet schema created from a Go value.
The SortingConfig type carries configuration options for parquet row groups.
SortingWriter is a type similar to GenericWriter but it ensures that rows are sorted according to the sorting columns configured on the writer.
The Value type is similar to the reflect.Value abstraction of Go values, but for parquet values.
Deprecated: A Writer uses a parquet schema and sequence of Go values to produce a parquet file to an io.Writer.
The WriterConfig type carries configuration options for parquet writers.
# Interfaces
BloomFilter is an interface allowing applications to test whether a key exists in a bloom filter.
The BloomFilterColumn interface is a declarative representation of bloom filters used when configuring filters on a parquet writer.
BooleanReader is an interface implemented by ValueReader instances which expose the content of a column of boolean values.
BooleanWriter is an interface implemented by ValueWriter instances which support writing columns of boolean values.
BufferPool is an interface abstracting the underlying implementation of page buffer pools.
ByteArrayReader is an interface implemented by ValueReader instances which expose the content of a column of variable length byte array values.
ByteArrayWriter is an interface implemented by ValueWriter instances which support writing columns of variable length byte array values.
ColumnBuffer is an interface representing columns of a row group.
The ColumnChunk interface represents individual columns of a row group.
The ColumnIndexer interface is implemented by types that support generating parquet column indexes.
Conversion is an interface implemented by types that provide conversion of parquet rows from one schema to another.
DataPageHeader is a specialization of the PageHeader interface implemented by data pages.
The Dictionary interface represents type-specific implementations of parquet dictionaries.
DoubleReader is an interface implemented by ValueReader instances which expose the content of a column of double-precision float point values.
DoubleWriter is an interface implemented by ValueWriter instances which support writing columns of double-precision floating point values.
Field instances represent fields of a parquet node, associating each node with its name in the parent node.
FileOption is an interface implemented by types that carry configuration options for parquet files.
FixedLenByteArrayReader is an interface implemented by ValueReader instances which expose the content of a column of fixed length byte array values.
FixedLenByteArrayWriter is an interface implemented by ValueWriter instances which support writing columns of fixed length byte array values.
FloatReader is an interface implemented by ValueReader instances which expose the content of a column of single-precision floating point values.
FloatWriter is an interface implemented by ValueWriter instances which support writing columns of single-precision floating point values.
Int32Reader is an interface implemented by ValueReader instances which expose the content of a column of int32 values.
Int32Writer is an interface implemented by ValueWriter instances which support writing columns of 32 bits signed integer values.
Int64Reader is an interface implemented by ValueReader instances which expose the content of a column of int64 values.
Int64Writer is an interface implemented by ValueWriter instances which support writing columns of 64 bits signed integer values.
Int96Reader is an interface implemented by ValueReader instances which expose the content of a column of int96 values.
Int96Writer is an interface implemented by ValueWriter instances which support writing columns of 96 bits signed integer values.
Node values represent nodes of a parquet schema.
Page values represent sequences of parquet values.
PageHeader is an interface implemented by parquet page headers.
PageReader is an interface implemented by types that support producing a sequence of pages.
Pages is an interface implemented by page readers returned by calling the Pages method of ColumnChunk instances.
PageWriter is an interface implemented by types that support writing pages to an underlying storage medium.
ReaderOption is an interface implemented by types that carry configuration options for parquet readers.
RowGroup is an interface representing a parquet row group.
RowGroupOption is an interface implemented by types that carry configuration options for parquet row groups.
RowGroupReader is an interface implemented by types that expose sequences of row groups to the application.
RowGroupWriter is an interface implemented by types that allow the program to write row groups.
RowReader reads a sequence of parquet rows.
RowReaderFrom reads parquet rows from reader.
RowReaderWithSchema is an extension of the RowReader interface which advertises the schema of rows returned by ReadRow calls.
RowReadSeeker is an interface implemented by row readers which support seeking to arbitrary row positions.
Rows is an interface implemented by row readers returned by calling the Rows method of RowGroup instances.
RowSeeker is an interface implemented by readers of parquet rows which can be positioned at a specific row index.
RowWriter writes parquet rows to an underlying medium.
RowWriterTo writes parquet rows to a writer.
RowWriterWithSchema is an extension of the RowWriter interface which advertises the schema of rows expected to be passed to WriteRow calls.
SortingColumn represents a column by which a row group is sorted.
SortingOption is an interface implemented by types that carry configuration options for parquet sorting writers.
TimeUnit represents units of time in the parquet type system.
The Type interface represents logical types of the parquet type system.
ValueReader is an interface implemented by types that support reading batches of values.
ValueReaderAt is an interface implemented by types that support reading values at offsets specified by the application.
ValueReaderFrom is an interface implemented by value writers to read values from a reader.
ValueWriter is an interface implemented by types that support writing batches of values.
ValueWriterTo is an interface implemented by value readers to write values to a writer.
WriterOption is an interface implemented by types that carry configuration options for parquet writers.
# Type aliases
Kind is an enumeration type representing the physical types supported by the parquet type system.
ReadMode is an enum that is used to configure the way that a File reads pages.
Row represents a parquet row as a slice of values.
RowReaderFunc is a function type implementing the RowReader interface.
RowWriterFunc is a function type implementing the RowWriter interface.
ValueReaderFunc is a function type implementing the ValueReader interface.
ValueWriterFunc is a function type implementing the ValueWriter interface.