package dataframe // import "github.com/rom1mouret/ml-essentials/dataframe"
Version: 0.1.0
Repository: https://github.com/rom1mouret/ml-essentials.git
Documentation: pkg.go.dev

# README

This README covers the basics.

Imports

import "github.com/rom1mouret/ml-essentials/dataframe"

DataFrame construction

DataFrames accept 4 types of columns.

| type | missing value | comment |
|---|---|---|
| float64 | NaN | |
| int | -1 | meant to store categorical values |
| bool | not supported | |
| interface{} | nil | called "object" columns |

Strings are stored in the interface{} columns. ml-essentials distinguishes between regular object columns and string columns by keeping around the names of the string columns. Some functions are specialized for string columns, e.g. Encode(newEncoding encoding.Encoding).

Storing categorical values is the preferred use of integer columns. That said, you are free to use them to store any kind of integers, including negative integers. Negative integers won't be treated as missing values unless you run IntImputer.
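
For instance, with OverwriteInts (covered in the Views section below) and a hypothetical "level" column, a -1 behaves like any other category until an imputer is explicitly applied:

df.OverwriteInts("level", []int{4, -1, 2, 1}) // -1 stays a regular value unless IntImputer is run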

Construction with a DataBuilder
builder := dataframe.DataBuilder{RawData: dataframe.NewRawData()}
builder.AddFloats("height", 170, 180, 165)
builder.AddStrings("name", "Karen", "John", "Sophie")
df := builder.ToDataFrame()
df.PrintSummary().PrintHead(-1, "%.3f")
Construction from a CSV file
spec := dataframe.CSVReadingSpec{
  MaxCPU: -1,
  MissingValues: []string{"", " ", "NA","-"},
  IntAsFloat: true,
  BoolAsFloat: false,
  BinaryAsFloat: true,
}
rawdata, err := dataframe.FromCSVFile("/path/to/csvfile.csv", spec)

or

rawdata, err := dataframe.FromCSVFilePattern("/path/to/csvdir/*.csv", spec)
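
Both functions return a RawData structure rather than a DataFrame. A minimal sketch of turning it into a dataframe, assuming the DataBuilder from the previous section accepts a pre-populated RawData:

if err != nil {
  panic(err)
}
builder := dataframe.DataBuilder{RawData: rawdata}
df := builder.ToDataFrame()
df.PrintSummary()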

Column names

You can manipulate column names via the ColumnHeader structure.

h := df.FloatHeader().And(df.IntHeader()).Except("target", "id").NameList()
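
The resulting name list can be handed directly to other helpers, for example the gonum batching shown in the next section (assuming NameList returns the plain []string that NewDense64Batching expects):

batching := dataframe.NewDense64Batching(h)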

Iterate over a dataframe

Option 1: ColumnAccess
height := df.Floats("height")
for i := 0; i < height.Size(); i++ {
  height.Set(i, height.Get(i) / 2)
}
Option 2: Gonum Batching
batching := dataframe.NewDense64Batching([]string{"age", "height", "gender"})
for _, batch := range df.SplitView(batchSize) {
  // get a gonum matrix with columns age, height and gender (in that order)
  rows := batching.DenseMatrix(batch)
}
Option 3: Row Iterator
iterator := dataframe.NewFloat32Iterator(df, []string{"age", "height", "gender"})
for row, rowIdx, _ := iterator.NextRow(); row != nil; row, rowIdx, _ = iterator.NextRow() {
  // row is a float32 slice
}

Views

Views are dataframes that share their data with other dataframes. There is no separate View type: views and regular dataframes are both of type DataFrame. Quick example:

view := df.ShuffleView()
view.OverwriteInts("level", []int{4, 1, 2, 1})

Here view shares its data with df. This is useful in two ways. First, ShuffleView doesn't copy the data, thus it is fast and memory-efficient. Second, it allows you to overwrite df's data from anywhere in your program. The "side effects" section explains why this is an advantage when it comes to handling indexed data.

If you want to avoid such side effects, you can detach the view from its parent dataframe.

view := df.ShuffleView().DetachedView("level")
view.OverwriteInts("level", []int{4, 1, 2, 1})

Now, OverwriteInts does not alter df because view has its own level data. Other columns of df remain shared.
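
For instance, writing to a column that was not detached (here the float column "height" from the earlier examples) is still visible through df:

view.OverwriteFloats64("height", []float64{171, 181, 166}) // "height" is still shared, so df sees the new values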

ml-essentials provides a variety of functions to manage data copies at a fine-grained level.

View < TransferRawDataFrom < ShallowCopy < Unshare < DetachedView < Copy

On one side of the spectrum, View only copies pointers. On the opposite side, Copy copies almost everything. View, DetachedView and Copy cover 99% of the cases.
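
A sketch of the fully independent end of the spectrum, assuming Copy is exposed as a no-argument dataframe method like the other functions above:

independent := df.Copy()
independent.OverwriteInts("level", []int{9, 9, 9, 9}) // df keeps its original "level" values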

View is handy if you want to execute an in-place operation without altering the original dataframe, as in this example:

view := df.View()
view.Rename("level", "degree")

Now, view and df still share their data, but their columns are named differently.

Side effects

Side effects are normally considered anti-patterns but they do facilitate manipulating indexed data. For instance, consider this scenario:

  1. at the top level, the data is separated into "features" and "metadata". Example of metadata: unique identifier, timestamps.
  2. the model makes predictions from the features and predictions with low confidence are thrown away.
  3. back to the top level, we combine "metadata" columns with predictions using the indices of high-confidence rows.

Step 3 is error-prone. With ml-essentials, the idiomatic way is to avoid separating "features" and "metadata" in the first place. Instead, we would rely on views to enforce that the metadata always aligns with the features and predicted values.
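
A minimal sketch of that idiom, using MaskView from the filtering section below and hypothetical "confidence", "id" and "timestamp" columns that live in the same dataframe as the features:

confidence := df.Floats("confidence")
mask := df.EmptyMask()
for i := 0; i < confidence.Size(); i++ {
  mask[i] = confidence.Get(i) >= 0.9 // keep high-confidence rows only
}
confident := df.MaskView(mask)
// confident still carries the "id" and "timestamp" metadata columns,
// automatically row-aligned with the surviving predictions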

Pandas can solve this problem by combining "features" and "metadata" in an index-aware fashion, but this makes pandas.concat error-prone in other scenarios: it fills dataframes with NaN wherever indices don't align, that is, whenever ignore_index is left at its default value.

Filtering, masking and indexing

Unlike in Pandas and NumPy, there is no syntactic sugar to create masks and index arrays. Sugar aside, this section will look familiar to Pandas and NumPy users.

If you want to filter rows where "age" is over 18, you can do so with MaskView:

ages := df.Floats("age")
mask := df.EmptyMask()
for i := 0; i < ages.Size(); i++ {
  mask[i] = ages.Get(i) >= 18
}
view := df.MaskView(mask)

Getting a mask from EmptyMask() is advantageous because it recycles []bool slices across dataframes, but it is not mandatory.
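
A plain slice works just as well:

mask := make([]bool, df.NumRows())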

Equivalent filtering with IndexView:

ages := df.Floats("age")
indices := make([]int, 0, ages.Size())
for i := 0; i < ages.Size(); i++ {
  if ages.Get(i) >= 18 {
    indices = append(indices, i)
  }
}
view := df.IndexView(indices)

In the future, we may add syntactic sugar for common scenarios, e.g. Condition("age").Higher(18).

Write in a dataframe

You can use the Set function as shown above. Alternatively, you might find it more convenient to write an entire column in one line of code:

df.OverwriteFloats64("height", []float64{170, 180, 165})

This is almost the same as:

height := df.Floats("height")
height.Set(0, 170)
height.Set(1, 180)
height.Set(2, 165)

The only difference is that OverwriteFloats64 will create a new column if it doesn't already exist.
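
For example, writing to a name the dataframe doesn't have yet simply adds a column (the "height_cm" name is only illustrative):

df.OverwriteFloats64("height_cm", []float64{170, 180, 165}) // creates "height_cm" if it doesn't exist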

Complete example

This is an example adapted from linear_regression.go; reg is the fitted regression model, which carries the Weights, the Features column names and a TargetScaler:

import (
  "gonum.org/v1/gonum/mat"
  "github.com/rom1mouret/ml-essentials/dataframe"
)

func Predict(df *dataframe.DataFrame, batchSize int, resultColumn string) *dataframe.DataFrame {
  df = df.ResetIndexView() // makes batching.DenseMatrix faster

  // pre-allocation
  weights := mat.NewVecDense(len(reg.Weights), reg.Weights)
  pred := make([]float64, df.NumRows())

  // prediction
  batching := dataframe.NewDense64Batching(reg.Features)
  for i, batch := range df.SplitView(batchSize) {
    rows := batching.DenseMatrix(batch)
    offset := i * batchSize
    yData := pred[offset:offset+batch.NumRows()]
    yVec := mat.NewVecDense(len(yData), yData)
    yVec.MulVec(rows, weights)
  }

  // write the result in the output dataframe
  result := df.View()
  result.OverwriteFloats64("_target", pred)
  reg.TargetScaler.InverseTransformInplace(result)
  result.Rename("_target", resultColumn)

  return result
}

# Functions

CheckNoColumnOverlap returns an error if two or more dataframes have one or more columns in common, regardless of their type.
ColumnConcatView merges the columns from multiple dataframes.
ColumnCopyConcat merges the columns from multiple dataframes.
Columns creates a ColumnHeader from a list of column names.
ColumnSmartConcat merges the columns from multiple dataframes.
EmptyDataFrame creates a new dataframe with no columns.
FromCSV reads CSV data and returns a RawData structure with automatically inferred column types.
FromCSVFile reads a CSV file and returns a RawData structure with automatically inferred column types.
FromCSVFilePattern searches for file paths that match the given glob pattern, reads them and returns a single RawData structure containing all the data packed in an unordered fashion.
MergeRawDataColumns transfers data from multiple RawData structures.
MergeRawDataRows concatenates multiple RawData together in a row-wise manner.
NewDense64Batching allocates a new Dense64Batching structure.
NewFloat32Iterator allocates a new row iterator to allow you to iterate over float, bool and int columns as floats.
NewRawData allocates a new RawData structure.
RowConcat concatenates the rows of the given dataframes.

# Structs

BoolAccess is a random-access iterator for boolean columns.
ColumnHeader helps you manipulate column names.
DataBuilder is a helper structure to build dataframes.
DataFrame is a structure that lets you manipulate both original data and views on other dataframes' data by sharing the underlying data.
DataFrameInternals is a helper structure that lets you access the internals of a dataframe.
Float32Iterator is a structure to iterate over a dataframe one row at a time.
FloatAccess is a random-access iterator for float columns.
IntAccess is a random-access iterator for integer columns.
ObjectAccess is a random-access iterator for object columns, including string columns.
RawData is the structure that holds the data of the dataframes.
StringAccess is a random-access iterator for string columns.

# Type aliases

ObjectType allows us to distinguish between the possible types of data contained in the object columns.