Categorygithub.com/absolute8511/hyperloglog2
modulepackage
0.1.1
Repository: https://github.com/absolute8511/hyperloglog2.git
Documentation: pkg.go.dev

# README

HyperLogLog GoDoc Go Report Card cover.run go

An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations.

This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".

Implementation

The core differences between this and other implementations are:

  • use metro hash instead of xxhash
  • sparse representation for lower cardinalities (like HyperLogLog++)
  • loglog-beta for dynamic bias correction medium and high cardinalities.
  • 4-bit register instead of 5 (HLL) and 6 (HLL++), but most implementations use 1-byte registers out of convenience

In general it borrows a lot from InfluxData's fork of Clark Duvall's HyperLogLog++ implementation, but uses 50% less space.

Results

A direct comparison with the HyperLogLog++ implementation used by InfluxDB yielded the following results:

ExactAxiom (8.2 KB)Influx (16.39 KB)
1010 (0.0% off)10 (0.0% off)
5050 (0.0% off)50 (0.0% off)
250250 (0.0% off)250 (0.0% off)
12501249 (0.08% off)1249 (0.08% off)
62506250 (0.0% off)6250 (0.0% off)
3125031008 (0.7744% off)31565 (1.0080% off)
156250156013 (0.1517% off)156652 (0.2573% off)
781250782364 (0.1426% off)775988 (0.6735% off)
39062503869332 (0.9451% off)3889909 (0.4183% off)
100000009952682 (0.4732% off)9889556 (1.1044% off)

Note

A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".

# Packages

No description provided by the author

# Functions

New returns a HyperLogLog Sketch with 2^14 registers (precision 14).
New14 returns a HyperLogLog Sketch with 2^14 registers (precision 14).
New16 returns a HyperLogLog Sketch with 2^16 registers (precision 16).
New16NoSparse returns a HyperLogLog Sketch with 2^16 registers (precision 16) that will not use a sparse representation.
NewNoSparse returns a HyperLogLog Sketch with 2^14 registers (precision 14) that will not use a sparse representation.

# Variables

ErrorTooShort is an error that UnmarshalBinary try to parse too short binary.

# Structs

Sketch is a HyperLogLog data-structure for the count-distinct problem, approximating the number of distinct elements in a multiset.