gobsp

Go Binary Stream Protocol

Description

gobsp is an extensible binary protocol for passing messages between computer programs. It is designed to be consumed the way other streams of information are consumed: by a scanner. gobsp includes a scanner for decoding an incoming stream of data, and a writer for encoding and writing data to an outgoing stream.

Usage Example

    package example

    import (
        "fmt"
        "io"
        "log"

        "github.com/karrick/gobsp"
    )
    
    // NOTE: Define some message types our program can understand.
    const (
        MTError gobsp.MessageType = iota
        MTGreeting
        MTFarewell
        // NOTE: You ought not rearrange these for newer protocol message
        // types, just add more below.
    )

    func handleError(ior io.Reader) error {
        buf, err := io.ReadAll(ior)
        if err != nil {
            return err
        }
        fmt.Printf("An error took place: %s", string(buf))
        return nil
    }
    
    func handleGreeting(ior io.Reader) error {
        buf, err := io.ReadAll(ior)
        if err != nil {
            return err
        }
        fmt.Printf("Hello, %s", string(buf))
        return nil
    }
    
    func handleFarewell(ior io.Reader) error {
        buf, err := io.ReadAll(ior)
        if err != nil {
            return err
        }
        fmt.Printf("Farewell, %s", string(buf))
        return nil
    }
    
    func main() {
        handlers := map[uint32]gobsp.MessageHandler{
            MTError:    handleError,
            MTGreeting: handleGreeting,
            MTFarewell: handleFarewell,
        }

        // NOTE: iow (the stream) and dh (the default handler) are assumed to
        // be defined elsewhere in the program.
        proc, err := gobsp.NewProcessor(iow,
            gobsp.DefaultHandler(dh),
            gobsp.Handlers(handlers),
        )
        if err != nil {
            log.Fatal(err)
        }
        if err := proc.Process(); err != nil {
            log.Printf("WARNING: there was an error; moving on: %s", err)
        }
    }

WARNING: There are two kinds of errors: (1) those that occur due to failure to read data from the stream; and (2) those that occur during processing of a particular message. It is imperative that message handlers only return errors when there is a failure to read data from the stream. If a handler can read data, but there is an error interpreting it, or an error taking some action on it, the handler ought not return that error. Rather, it should handle it in some way and return nil, which tells the processor to continue reading from the stream and handle more messages.

Protocol

gobsp is organized into three protocol layers, each serving a different purpose:

stream layer (message framing)
user-defined message type layer
primitive data type layer

The stream layer describes how bytes are pulled off the wire and divided into messages. The user-defined layer is controlled by the code using this library. Finally, the primitive data type layer describes how individual pieces of data are encoded in the user-defined messages.

Primitive Data Type Layer

gobsp supports several primitive data types: signed and unsigned integers in both fixed width and variable width encodings, floating point numbers, strings, and lists of strings.

Stream Layer

Messages are read from the byte stream one after the other using a simple framing method. Each message is framed by two control integers, the message type and the message size, followed by the message payload. The format of the payload is determined by the message type, and the application decides which payload format each message type uses.

One drawback of this simple framing is that a parser cannot resynchronize if it ever loses sync with the byte stream.

Message type and message size are each encoded as unsigned 16-bit integers in big-endian format. This keeps the per-message overhead to 4 bytes while allowing up to 65,536 distinct message types and message sizes up to 65,535 bytes. Anything larger must be split across multiple messages.

Message Type and Version

The message type integer does double duty and, for a particular application, encodes both the message's type and version. For example, if message type 1 has a particular meaning to your application, and you must upgrade the message payload format, simply allocate a new number for the new message format. Instead of creating version 2 of message type 1, you simply create message type N+1, where N was previously the largest message type number.

For this reason, once your application associates particular integers with particular message payload formats, it is never advisable to modify those formats. Rather, keep the old source code for processing older message types and add source code for processing new message types.
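
A minimal sketch of that practice follows. The local messageType type stands in for the library's message type, and mtGreetingWithLang and its payload format are hypothetical. The point is only that the newer format takes the next free number while the original number keeps its original meaning.

```go
package main

import "fmt"

// messageType stands in for the library's message type in this sketch.
type messageType uint16

const (
	mtError messageType = iota
	mtGreeting
	mtFarewell
	// Added later: a greeting whose payload also carries a language tag.
	// It takes the next free number instead of redefining mtGreeting.
	mtGreetingWithLang
)

// describe reports which payload format a message type number selects.
func describe(t messageType) string {
	switch t {
	case mtGreeting:
		return "greeting, original format: name"
	case mtGreetingWithLang:
		return "greeting, newer format: language tag, then name"
	default:
		return "some other message"
	}
}

func main() {
	fmt.Println(mtGreeting, describe(mtGreeting))
	fmt.Println(mtGreetingWithLang, describe(mtGreetingWithLang))
}
```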

The side effect of combining message type and version is a larger set of message types.

The advantages are that upgrading the payload format of a particular message type is simple, and that message types are not tied as a group to a particular version of the protocol.

Furthermore, there is no protocol negotiation phase.

The disadvantage of combining message type and version is that the protocol itself offers no way to specify a minimum protocol version. However, a particular application could create a message type for the protocol version. For instance, perhaps message type 0 is a protocol version declaration, and message type 1 signals an error in the protocol negotiation, which in this application means that the other end ought to cease data transmission. In this example, the message types themselves carry out the protocol negotiation phase.

Primitive Data Types

While applications are free to format message payloads in any way suitable to their purpose, the following primitive data type encodings are provided. All numbers are encoded in big-endian format. Floats are IEEE-754.

Float32
Float64
Int8 & Uint8
Int16 & Uint16
Int32 & Uint32
Int64 & Uint64
String
StringSlice
Variable Width Integer (VWI) & Unsigned Variable Width Integer (UVWI)
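
To illustrate what big-endian and IEEE-754 mean for the bytes on the wire, here is a sketch using only the Go standard library. The function names are hypothetical and are not gobsp's API.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeUint32 returns the four big-endian bytes of v.
func encodeUint32(v uint32) []byte {
	buf := make([]byte, 4)
	binary.BigEndian.PutUint32(buf, v)
	return buf
}

// encodeFloat64 returns the IEEE-754 bits of f as eight big-endian bytes.
func encodeFloat64(f float64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, math.Float64bits(f))
	return buf
}

// decodeFloat64 reverses encodeFloat64; buf must hold at least 8 bytes.
func decodeFloat64(buf []byte) float64 {
	return math.Float64frombits(binary.BigEndian.Uint64(buf))
}

func main() {
	fmt.Printf("% x\n", encodeUint32(1)) // most significant byte first
	fmt.Printf("% x\n", encodeFloat64(1.0))
	fmt.Println(decodeFloat64(encodeFloat64(1.0)))
}
```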

Performance

When encoding or decoding the Int8, Uint8, VWI, and UVWI data types, the program benefits from an approximately threefold performance boost if the provided stream implements the io.ByteReader interface (for decoding) or the io.ByteWriter interface (for encoding). In the benchmarks below, the buffer.Buffer instance implements only the io.Reader and io.Writer interfaces. The bytes.Buffer instance also implements the io.ByteReader and io.ByteWriter interfaces.

BenchmarkBinaryBuffer-4            	 2000000	       963 ns/op
BenchmarkBinaryBytes-4             	 5000000	       312 ns/op

Variable Width Integer (VWI)

Variable Width Integers encode numbers as large as 64-bit integers by repurposing the high bit of every byte as a flag that indicates whether additional bytes are used to encode the number. Therefore, each encoded byte carries only 7 bits of the original value. This may cause some numbers to be encoded using more bytes than a fixed width integer would need. For example, while the number 127 requires one byte to encode, the number 128 requires two bytes.

The advantage of using VWI is that both large and small numbers can be encoded using this data type, but a particular encoded value will consume as few bytes as required to encode that number. Because numbers in many applications are relatively small, most applications benefit from the compromise.

The disadvantage of using VWI is the small computational overhead required to encode and decode VWI numbers compared to equivalent fixed width integers.

Benchmarks

There is no doubt that VWI imposes a certain computational overhead during encoding or decoding when compared to merely writing the byte values from fixed width integers. The following benchmark values were collected on my laptop, but one may expect similar percentage impacts on other platforms.

BenchmarkBinaryOneByteFixed-4      	20000000	       123 ns/op
BenchmarkBinaryOneByteVariable-4   	10000000	       140 ns/op

BenchmarkBinaryTwoBytesFixed-4     	10000000	       211 ns/op
BenchmarkBinaryTwoBytesVariable-4  	10000000	       157 ns/op

BenchmarkBinaryFourBytesFixed-4    	10000000	       202 ns/op
BenchmarkBinaryFourBytesVariable-4 	10000000	       208 ns/op

BenchmarkBinaryEightBytesFixed-4   	 5000000	       211 ns/op
BenchmarkBinaryEightBytesVariable-4	 5000000	       328 ns/op

In general, the more bytes required to store a number, the greater the performance impact. When most numbers are expected to be rather small, using VWI data types has negligible impact while still accommodating the few numbers that are large in magnitude.

The takeaway here is that while variable width integers provide greater flexibility and more compact representations of many numbers, they require additional overhead. Use them when you need them, and use fixed width integers when you do not.

String

This protocol encodes strings as a VWI representing the number of bytes for the string, followed by the actual bytes in the string.

StringSlice

This protocol encodes string slices as a VWI representing the number of strings in the slice, followed by the encoded form of each string.
