How to Save Millions of Bytes with One Simple Trick - [Varint Encoding]
Introduction
In many applications, such as search engines and relational database systems, data is stored in the form of arrays of integers. For example, in a relational database, column values are transformed into integer values by dictionary coding. Encoding and decoding these arrays of integers consumes considerable CPU time and memory bandwidth. Therefore, it is important to use efficient compression techniques to reduce the size of the data and improve the performance of the applications.
One of the most common and simple compression techniques for integer arrays is Varint encoding. Varint encoding is a way of representing unsigned 64-bit integers using a variable number of bytes, with smaller values using fewer bytes.
How varint encoding works
The basic idea of varint encoding is to use the most significant bit (MSB) of each byte as a continuation bit, and the lower 7 bits as a payload. The continuation bit indicates whether the byte is the last one in the varint or not. If the continuation bit is 0, it means that the byte is the last one, and the varint is complete. If the continuation bit is 1, it means that the byte is followed by another byte that is part of the varint. The payload bits are used to store the actual value of the integer. The resulting integer is built by appending together the 7-bit payloads of its constituent bytes in little-endian order.
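Since each byte carries 7 payload bits, the encoded length follows directly from the bit length of the value. The snippet below is a minimal sketch of that arithmetic; the helper name `varint_size` is ours for illustration and does not come from any particular library.

```python
def varint_size(value: int) -> int:
    """Return how many bytes a varint needs for a non-negative integer.

    Hypothetical helper: each byte holds 7 payload bits, and even the
    value 0 still occupies one byte.
    """
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    # Ceiling division of the bit length by 7, with a minimum of one byte.
    return max(1, (value.bit_length() + 6) // 7)


for v in (0, 127, 128, 16383, 16384, 2**32 - 1, 2**64 - 1):
    print(v, varint_size(v))
```

Values below 128 fit in a single byte, while the largest unsigned 64-bit value needs ten bytes, two more than its fixed-width form. The saving therefore depends on most values being small.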
For example, let’s say we want to encode the number 150 using varint encoding. In binary, 150 is 10010110. To encode it using varint, we first split it into two 7-bit chunks, padding the upper chunk with zeros: 0000001 and 0010110. Because the encoding is little-endian, the low-order chunk 0010110 is written first. Then, we add a continuation bit to each chunk: 10010110 and 00000001. The continuation bit of the first byte is 1, indicating that there is another byte following it. The continuation bit of the second byte is 0, indicating that it is the last byte. The varint encoding of 150 is then 9601 in hexadecimal, or 10010110 00000001 in binary.
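The encoding steps can be sketched in a few lines of Python. This is an illustrative implementation under the assumptions stated in the post (unsigned values, little-endian 7-bit groups); the function name `encode_varint` is our own choice, not an API from a specific library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        payload = value & 0x7F  # take the lowest 7 bits
        value >>= 7             # move on to the next 7-bit group
        if value:
            # More groups remain: set the continuation bit (MSB) to 1.
            out.append(payload | 0x80)
        else:
            # Last group: continuation bit stays 0.
            out.append(payload)
            return bytes(out)


print(encode_varint(150).hex())  # 9601
```

The loop emits the least significant group first, which is why 150 comes out as 96 01 rather than 81 16.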
To decode a varint, we simply drop the continuation bit from each byte, and concatenate the 7-bit payloads in little-endian order. For example, to decode 9601 (10010110 00000001 in binary), we first remove the continuation bits: 0010110 and 0000001. Then, we concatenate them in reverse order, because the first byte holds the least significant bits: 0000001 0010110. Dropping the leading zeros gives us 10010110, which is 150 in binary.
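Decoding can be sketched the same way. The function below is a hypothetical helper (`decode_varint` is our name) that reads one varint from a byte string and also returns the position of the next unread byte, which is convenient when several varints are stored back to back. It assumes Python 3.9 or later for the `tuple[int, int]` annotation, and it only guards against over-long input, not against a tenth byte that overflows 64 bits.

```python
def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one varint starting at `offset`.

    Returns (value, next_offset).
    """
    result = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        # Drop the continuation bit and place the 7-bit payload at its
        # position; the first byte carries the least significant bits.
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos  # continuation bit 0: this was the last byte
        shift += 7
        if shift >= 70:
            # A 64-bit value never needs more than ten bytes.
            raise ValueError("varint too long for a 64-bit value")


value, next_offset = decode_varint(bytes.fromhex("9601"))
print(value, next_offset)  # 150 2
```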
Summary
Varint encoding is a simple and efficient way to store and transmit integers using a variable number of bytes. Because small values take fewer bytes, it shrinks data that is dominated by small integers, which in turn reduces the memory bandwidth an application spends on it, and it is widely used in various domains. In this post, we explained how varint encoding works and walked through encoding and decoding a value by hand. We hope this post helps you understand and appreciate the power and beauty of varint encoding.