Varints

Today I learned about protocol buffer encoding and one of the related topics is the varint. Here I will talk about what they are and why we use them in protobufs.

Integer Representations

Imagine you want to represent an integer value like 100,000. How many bytes do you need to represent it? Well, 2^16 is ~65k, so to represent 100k you actually need 17 bits (2^17 ~= 131k) (and assumes the integer is unsigned).

In programming when we say something is an integer we usually mean either a 32-bit integer or a 64-bit integer.

During processing our processors take these fixed-length integers to perform floating-point operations (FLOPs) on. (You often hear “64 bit processor – x86-64 architecture” or “32 bit processor.”) Since your processor’s physical design is to process “words” (instructions) of that size, integer representations more compact than that don’t gain you anything for processing. But you can see that if you’re storing 64-bit integers and all you actually store are small numbers, you waste a lot of bits!

Storing Numbers in Databases

Unlike your processor, databases give you the opportunity to tune how big the data is that goes into an integer / other field. You can say you want 8-bit ints for storing tiny things (tinyint), or more commonly 32-bit ints for normal sized integers. However you can see there’s still a drawback: you’ll always face wasting some bits since most of the time you won’t utilize all of the reserved bits for data storage. Now, because of the way memory works (paging, etc) having a set size for integers still makes sense (you can have variable length fields but they tend to only be useful for larger quantities of stored data and are probably page-aligned).

But turns out there is also a situation where this waste can/should be optimized away: over the wire.

Numbers on the Wire

OK what if you’re not talking about databases and now talking about wire formats for transfering data? What if you want the most efficient usage since your machine is reading off the wire 1 byte at a time?

Answer: use a varint!!!

A varint is a method of serializing integers using one or more bytes where smaller numbers take smaller number of bytes. That means when you send the number 100,000 over the wire, it takes longer than the value 100. If we use normal 32-bit integers, it would take the same amount of time for either value. But varints leverage the fact that you don’t need to send all 32 bits (4 bytes) to know the value 100, or even 100,000.

The way it works: for a given byte (8 bits), the first bit is reserved to indicate if more bits are coming. The lower 7 bits of each byte are used to store the two’s complement representation of the number in groups of 7 bits, least significant group first.

Varint Example

1 in varint is: 00000001. 300 in varint is: 10101100 00000010.

To get 300 from that byte sequence, you first drop the MSB from each byte (since it was only used to tell us whether we’ve reached the end of the number):

So now we have: 0101100 0000010

Then you reverse the two groups of 7 bits because varints store least significant group first.

So now we have: 0000010 0101100

Finally, you concatenate them to get your final value: 00000100101100

…(and all those leading zeros can go away): 100101100

OK. Now we have this binary number! 256+32+8+4 = 300.

Who uses varints

Protocol buffers! Since they’re designed to be used in conjunction with an RPC system (stubby) they try to maximize wire throughput!