How GZIP works


March 10, 2023

I created a project, gzip.go, a toy implementation that decompresses gzip files. The blog content below is copied from the README. It aims to explain the techniques used in GZIP.


The implementation is largely derived from an existing reference, but I try to add some overview explanations.

Technique 1: Huffman Encoding

What is encoding? Encoding is a mapping from one set of symbols to another. In this case we will map symbols to binary values, which we will simply call codes. It is important that the encoded result is decodable: some (invalid) mappings make the encoded data undecodable because of ambiguity.

Basic example

One example of encoding is to encode the 4 shapes of a card set as binary values, like this:

Shape Code
club 00
heart 01
spade 10
diamond 11

With this mapping, a string of codes like 01101111 can be decoded as heart, spade, diamond, diamond.

In this particular encoding, we use 2 bits for each symbol, so the average length of the codes is 2 bits as well.

Encoding Performance

In most cases, the main goal of an encoding system is to reduce space, which is equivalent to minimizing the average code length. An example of a very redundant encoding system is to spell out the symbols as ASCII characters, with a mapping like:

Shape Code
club club
heart heart
spade spade
diamond diamond

In this case, the average length is (4 + 5 + 5 + 7) bytes / 4 = 5.25 bytes = 42 bits. This is 21 times longer than our previous 2-bit encoding.

Can we do better than 2 bits? How about the following encoding?

Invalid encoding and lower-bound of information

Shape Code
club 0
heart 1
spade 00
diamond 01

Assuming that all shapes appear with equal frequency, this encoding has an average length of 1.5 bits.

However, there is an issue with this encoding. When we encounter the code 00, we don’t know whether it means 2 clubs or 1 spade. We call this an invalid encoding.

So it seems like the 2-bit encoding is the best one for now. But how do we know, or how do we prove it? We can construct another valid encoding whose codes are not all of equal length, which we can call a “variable-length encoding”, for example:

Shape Code
club 1
heart 01
spade 001
diamond 0001

This has no ambiguity at the decoding stage, but the average code length is 2.5 bits.

But all this while we have assumed the frequencies are equal. Assume we have a different card set where the shapes appear with different frequencies:

Shape Frequency
club 0.5
heart 0.3
spade 0.1
diamond 0.1

In this case, to measure the encoding performance we cannot simply take the average code length; we need to weight each length by its frequency.

Shape Frequency Fixed Length Enc Contribution
club 0.5 00 1 bit
heart 0.3 01 0.6 bit
spade 0.1 10 0.2 bit
diamond 0.1 11 0.2 bit

*Because we are talking in terms of expectation / frequency, we can use non-integer value for “bit count” despite bit being a discrete measure.

Here, the fixed-length 2-bit encoding uses (1 + 0.6 + 0.2 + 0.2) = 2 bits per symbol under this frequency distribution. This is not very surprising: regardless of the distribution, each symbol contributes the same amount.

Let’s compare it with our variable length encoding:

Shape Frequency Variable Length Enc Contribution
club 0.5 1 0.5 bit
heart 0.3 01 0.6 bit
spade 0.1 001 0.3 bit
diamond 0.1 0001 0.4 bit

Here, the sum of contribution is 1.8 bits.

In fact, we can make the encoding shorter:

Shape Frequency Variable Length Enc Contribution
club 0.5 1 0.5 bit
heart 0.3 01 0.6 bit
spade 0.1 001 0.3 bit
diamond 0.1 000 0.3 bit

This encoding might be slightly less intuitive, but a stream like 000010010001 can still be decoded unambiguously as 000|01|001|000|1, i.e. diamond, heart, spade, diamond, club.

This encoding has a total contribution of 1.7 bits per symbol. It can be proven that this encoding is optimal for this frequency distribution.
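The weighted averages above are just the sum of frequency × code length; a quick sketch (not part of gzip.go):

```go
package main

import "fmt"

// expectedLength computes the average bits per symbol for a code,
// weighting each symbol's code length by its frequency.
func expectedLength(freqs []float64, lengths []int) float64 {
	total := 0.0
	for i, f := range freqs {
		total += f * float64(lengths[i])
	}
	return total
}

func main() {
	// club, heart, spade, diamond with codes 1, 01, 001, 000
	fmt.Println(expectedLength([]float64{0.5, 0.3, 0.1, 0.1}, []int{1, 2, 3, 3}))
	// ≈ 1.7 bits per symbol
}
```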

Huffman Algorithm

The Huffman algorithm generates such an encoding. It works on the distribution of the symbols and produces an optimal* encoding for the set of symbols.

*this method is only optimal if we want to encode one symbol at a time (this can be proven). If we remove that constraint and encode strings of symbols instead, we can achieve better compression (fewer average bits per symbol).

Algorithm:

  0. Given n symbols with probabilities/frequencies p_1, p_2, …, p_n
  1. Build a subtree using the 2 symbols with the lowest p_i.
  2. At each step, choose the 2 symbols/subtrees with the lowest total probability/frequency and combine them into a new subtree.
  3. Result: an optimal tree, built bottom-up.
  4. Assign 0 to each left edge and 1 to each right edge (or vice versa, it doesn’t matter).

For example, given this distribution:

Symbol Frequency
A 1/3
B 1/2
C 1/12
D 1/12


Thus, we will get the encoding:

Symbol Encoding
A 11
B 0
C 100
D 101
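The bottom-up construction can be sketched directly. This is a toy version, not the one in gzip.go; the exact 0/1 labels it assigns may differ from the table above, but the code lengths come out the same (B gets 1 bit, A gets 2, C and D get 3):

```go
package main

import (
	"fmt"
	"sort"
)

type hnode struct {
	freq        float64
	symbol      string // non-empty only for leaves
	left, right *hnode
}

// buildHuffman repeatedly merges the two lowest-frequency subtrees
// until a single tree remains (steps 1-3 of the algorithm).
func buildHuffman(symbols []string, freqs []float64) *hnode {
	nodes := make([]*hnode, len(symbols))
	for i := range symbols {
		nodes[i] = &hnode{freq: freqs[i], symbol: symbols[i]}
	}
	for len(nodes) > 1 {
		sort.Slice(nodes, func(i, j int) bool { return nodes[i].freq < nodes[j].freq })
		merged := &hnode{freq: nodes[0].freq + nodes[1].freq, left: nodes[0], right: nodes[1]}
		nodes = append([]*hnode{merged}, nodes[2:]...)
	}
	return nodes[0]
}

// codes walks the tree, appending 0 for left edges and 1 for right
// edges (step 4 of the algorithm).
func codes(n *hnode, prefix string, out map[string]string) {
	if n.left == nil && n.right == nil {
		out[n.symbol] = prefix
		return
	}
	codes(n.left, prefix+"0", out)
	codes(n.right, prefix+"1", out)
}

func main() {
	root := buildHuffman([]string{"A", "B", "C", "D"}, []float64{1.0 / 3, 1.0 / 2, 1.0 / 12, 1.0 / 12})
	out := map[string]string{}
	codes(root, "", out)
	fmt.Println(out) // B: 1-bit code, A: 2 bits, C and D: 3 bits each
}
```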

Read more: (Entropy, Huffman Algorithm)

Huffman Trees Encoding

So far we have only discussed Huffman trees in the abstract, or in your program’s memory. How do we actually encode the tree itself? The naivest solution is simply listing out each symbol and its code, one after another:

Symbol Encoding
A 11
B 0
C 100
D 101

But this is not very efficient. One improvement is to skip listing the symbols, because we can pre-agree on what the symbols are and their order.


The next thing we can improve: we actually don’t have to provide the full codes; giving just the bit length of each symbol’s code is enough to build a reproducible tree. This is possible because what really matters in a Huffman tree is the code length for each symbol, not the structure of the tree itself.


Here’s one way to fill the codes in, given only the bit lengths:

  0000: 0
  0001: 1
101100: 2 <- start a new sequence at (10101 + 1) << 1
101101: 3
101110: 4
  0010: 5 <- pick up four-bit sequence where it left off
  0011: 6
 01110: 7 <- start a new sequence at (0110 + 1) << 1
 01111: 8
 10000: 9
 10001: 10
 10010: 11
 10011: 12
 10100: 13
 10101: 14
101111: 15 <- pick up six-bit sequence where it left off
110000: 16
110001: 17
110010: 18
  0100: 19 <- pick up four-bit sequence where it left off
  0101: 20
  0110: 21
110011: 22 <- pick up six-bit sequence where it left off
110100: 23
110101: 24
110110: 25
110111: 26
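The fill-in rule above is what RFC 1951 calls canonical code construction: count how many codes each length has, compute the smallest code for each length (“previous smallest plus previous count, shifted left by one”), then count up within each length in symbol order. A sketch (not the gzip.go code) that reproduces the table from the bit lengths alone:

```go
package main

import "fmt"

// canonicalCodes assigns codes from bit lengths only (RFC 1951 style).
func canonicalCodes(lengths []int) []string {
	maxLen := 0
	for _, l := range lengths {
		if l > maxLen {
			maxLen = l
		}
	}
	// count how many codes exist at each bit length
	count := make([]int, maxLen+1)
	for _, l := range lengths {
		if l > 0 {
			count[l]++
		}
	}
	// smallest code for each bit length: (previous smallest +
	// previous count) << 1 -- the "start a new sequence" rule above
	next := make([]int, maxLen+1)
	code := 0
	for bits := 1; bits <= maxLen; bits++ {
		code = (code + count[bits-1]) << 1
		next[bits] = code
	}
	// assign codes in symbol order, counting up within each length
	out := make([]string, len(lengths))
	for sym, l := range lengths {
		if l == 0 {
			continue
		}
		out[sym] = fmt.Sprintf("%0*b", l, next[l])
		next[l]++
	}
	return out
}

func main() {
	// bit lengths for symbols 0-26, taken from the table above
	lengths := []int{4, 4, 6, 6, 6, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 4, 4, 4, 6, 6, 6, 6, 6}
	codes := canonicalCodes(lengths)
	fmt.Println(codes[0], codes[2], codes[7]) // 0000 101100 01110
}
```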

The next step: instead of providing all the bit lengths one by one, we can provide them run-length encoded (RLE):

func TestRLE(t *testing.T) {
	values := []int{
		3, 0, 0, 0, 4, // 0-4
		4, 3, 2, 3, 3, // 5-9
		4, 5, 0, 0, 0, // 10-14
		0, 6, 7, 7, // 15-18
	}
	expectedRanges := []rleRange{
		{0, 3},
		{3, 0},
		{5, 4},
		{6, 3},
		{7, 2},
		{9, 3},
		{10, 4},
		{11, 5},
		{15, 0},
		{16, 6},
		{18, 7},
	}
	assert.Equal(t, expectedRanges, runLengthEncoding(values))
}

If you are worried that the RLE adds more overhead than benefit, note that we have about 285 codes in total in GZIP, with bit lengths only up to about 10, so we should see many runs of consecutive values, especially runs of 0s.

The actual RLE system used in GZIP:

for i < alphabetCount {
    code, _ := getCode(stream, codeLengthsRoot)
    // 0-15: literal bit length
    // 16: repeat the previous value n+3 times (2 extra bits for n)
    // 17: insert n+3 0's (3 extra bits for n), max run is 10
    // 18: insert n+11 0's (7 extra bits for n); the base 11 is where code 17 maxes out
    if code == 16 {
        repeatLength := readBitsInv(stream, 2) + 3
        for j := 0; j < repeatLength; j++ {
            alphabetBitLengths[i] = alphabetBitLengths[i-1]
            i++
        }
    } else if code == 17 || code == 18 {
        var repeatLength int
        if code == 17 {
            repeatLength = readBitsInv(stream, 3) + 3
        } else {
            repeatLength = readBitsInv(stream, 7) + 11
        }
        for j := 0; j < repeatLength; j++ {
            alphabetBitLengths[i] = 0
            i++
        }
    } else {
        alphabetBitLengths[i] = code
        i++
    }
}

The scheme above contains some clever design choices: code 17’s 3 extra bits cover runs of 3-10 zeros, and code 18’s base of 11 starts exactly where code 17 maxes out, covering runs of 11-138 zeros with 7 extra bits.

Finally, these codes from 0-18 are huffman-encoded as bit-lengths:

func readCodesBitLengths(stream *bitstream, hclen int) []int {
	// The specification refers to the repetition codes as the values 16, 17, and 18,
	//	but these numbers don't have any real physical meaning:
	//	16 means "repeat the previous character n times",
	//	and 17 & 18 mean "insert n 0's".
	//	The number n follows the repeat codes and is encoded
	//	(without compression) in 2, 3 or 7 bits, respectively.

	// Because codes of lengths 15, 1, 14 and 2 are likely to be very rare in real-world data,
	// 	the codes themselves are given in order of expected frequency
	codeLengthOffsets := []int{
		16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15,
	}

	codeBitLengths := make([]int, 19) // max hclen (0b1111) + 4
	for i := 0; i < (hclen + 4); i++ {
		codeBitLengths[codeLengthOffsets[i]] = readBitsInv(stream, 3)
	}

	return codeBitLengths
}

So we will have this Huffman tree of up to 19 symbols (at most 57 bits to transmit, 3 bits per length), which we then use to decode the bit lengths of the 285-code literals Huffman tree.

Technique 2: LZ77

LZ77 is a data compression algorithm published by Lempel and Ziv in 1977 (not to be confused with LZ78, which is similar but slightly different; see the footnote). In principle, LZ77 maintains a sliding window in which we can reference earlier data.

One implementation could use triplets of data: (back-pointer, copy-length, next-byte), like so:


Notice in a triplet like (3, 4, b), the copied string aaca is a self-reference: the last a of the copied region refers back to the first a of that same region. This happens whenever the copy-length is bigger than the back-pointer.
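A toy decoder for the triplet form (a sketch, not the GZIP code) makes the overlap case concrete. Copying byte-by-byte means a self-referencing copy works naturally, because each copied byte is already in the buffer by the time it is read:

```go
package main

import "fmt"

type triplet struct {
	back, length int  // distance back into the buffer, bytes to copy
	next         byte // literal byte that follows the copy
}

// decodeLZ77 expands triplets of (back-pointer, copy-length, next-byte).
func decodeLZ77(ts []triplet) string {
	var buf []byte
	for _, t := range ts {
		start := len(buf) - t.back
		for i := 0; i < t.length; i++ {
			// when length > back, this reads bytes appended
			// earlier in this same loop (the self-reference case)
			buf = append(buf, buf[start+i])
		}
		buf = append(buf, t.next)
	}
	return string(buf)
}

func main() {
	// emit "aac", then (3, 4, 'b'): copy 4 bytes from 3 back, then 'b'
	ts := []triplet{{0, 0, 'a'}, {0, 0, 'a'}, {0, 0, 'c'}, {3, 4, 'b'}}
	fmt.Println(decodeLZ77(ts)) // aacaacab
}
```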

The actual LZ77 implementation used in GZIP is slightly more complicated, for more efficient use of space. It doesn’t use the triplet, but instead differentiates literals from “back-pointers”.

LZ77 in GZIP

Consider this body of text (1408 characters):

So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children's blood;
Nor more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery
Shall now, in mutual well-beseeming ranks,
March all one way and be no more opposed
Against acquaintance, kindred and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ,
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight,
Forthwith a power of English shall we levy;
Whose arms were moulded in their mothers' womb
To chase these pagans in those holy fields
Over whose acres walk'd those blessed feet
Which fourteen hundred years ago were nail'd
For our advantage on the bitter cross.
But this our purpose now is twelve month old,
And bootless 'tis to tell you we will go:
Therefore we meet not now. Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our council did decree
In forwarding this dear expedience.

After applying LZ77:

So shaken as we are, so wan with c<16,4>(are,)
Find<12,5>( we a) time for frighted peace to pant,
A<41,3>(nd )breathe<2,3>( sh)ort-w<40,3>(ind)<64,3>(ed )accents of new<85,3>( br)oils
To be commenc<104,3>(ed )in strands afar remote.
No mor<71,3>(e t)h<175,4>(e th)irsty <110,3>(ent)<150,3>(ran)<70,3>(ce )<115,3>(of )<181,3>(thi)s<20,3>( so)il
Shall daub her lips<27,6>( with )<222,4>(her )own children's blood;<168,3>(
No)r<171,6>( more )s<212,5>(hall )t<249,3>(ren)<244,3>(chi)ng<23,3>( wa)r<243,3>( ch)annel<235,5>( her )fields,<261,5>(
Nor )bruise<298,6>( her f)loweret<229,7>(s with )<177,4>(the )arm<142,3>(ed )hoofs
Of<350,3>( ho)stile<75,3>( pa)ces:<340,3>( th)o<319,3>(se )opp<377,3>(ose)d eye<308,3>(s,
)Which,<225,3>( li)k<175,6>(e the )meteor<113,5>(s of )<47,3>(a t)roubl<348,4>(ed h)eaven<80,3>(,
A)<274,3>(ll )<419,3>(of )one natu<17,4>(re, )<445,7>(of one )subst<193,5>(ance )<86,3>(bre)d,
Did lately<410,3>( me)et<144,4>( in )<407,4>(the )inte<362,3>(sti)n<92,5>(e sho)ck<81,5>(
And )furious cl<377,5>(ose o)f civil butc<322,3>(her)y<210,7>(
Shall )now,<498,4>( in )mutual<43,3>( we)ll-beseem<283,4>(ing )<192,3>(ran)k<392,3>(s,
)March <560,4>(all )<463,4>(one )way a<83,4>(nd b)<450,3>(e n)<170,7>(o more )<381,7>(opposed)
Against<106,3>( ac)qu<644,3>(ain)<471,5>(tance), k<101,3>(ind)r<104,4>(ed a)<620,3>(nd )<607,3>(all)i<371,3>(es:)
T<503,3>(he )edg<538,5>(e of )<287,3>(war)<400,7>(, like )<25,3>(an )i<581,3>(ll-)sh<88,5>(eathe)d knife,<168,9>(
No more )<271,6>(shall )cut <202,4>(his )master. <684,3>(The)re<54,3>(for)<661,3>(e, )<58,3>(fri)en<307,4>(ds,
)As <157,4>(far )<10,3>(as )<73,3>(to )<90,5>(the s)epulchr<691,5>(e of )Christ<393,4>(,
Wh)<536,4>(ose )soldi<323,3>(er )<564,5>(now, )u<102,3>(nde)r w<818,5>(hose )<428,3>(ble)s<385,4>(sed )cross
W<14,5>(e are) impr<850,6>(essed )<672,4>(and )engag<876,3>(ed )<789,3>(to )f<60,4>(ight)<37,3>(,
F)<96,3>(ort)h<336,5>(with )a p<328,4>(ower)<805,4>( of )English<736,7>( shall )<44,3>(we )levy;<816,7>(
Whose )<345,3>(arm)<11,4>(s we)<866,3>(re )mould<142,6>(ed in )<792,3>(the)i<264,4>(r mo)<972,3>(the)rs' womb<128,4>(
To )<291,3>(cha)s<405,5>(e the)s<366,4>(e pa)gans<968,6>( in th)<947,4>(ose )ho<491,3>(ly )<303,6>(fields)
Ov<839,9>(er whose )ac<872,3>(res)<695,3>( wa)lk'd<1016,7>( those )<848,8>(blessed )f<495,3>(eet)<394,6>(
Which)<53,3>( fo)urte<7,3>(en )hu<666,6>(ndred )yea<416,3>(rs )ago<955,6>( were )nail'd<900,4>(
For) <1085,3>(our) adv<77,3>(ant)a<690,4>(ge o)<500,6>(n the )bit<754,3>(ter)<855,6>( cross).
B<744,3>(ut )<201,5>(this )<1127,4>(our )pur<636,4>(pose)<830,4>( now) <1168,3>(is )t<579,3>(wel)v<959,4>(e mo)n<908,3>(th )<824,3>(old)<80,7>(,
And b)oot<1066,4>(less) 'ti<787,6>(s to t)<580,3>(ell) you<935,4>( we )w<709,3>(ill) go<682,5>(:
The)<762,6>(refore)<1237,4>( we )<494,5>(meet )no<1266,4>(t no)w<757,5>(. The)n<938,3>( le)t<1262,3>( me)<432,4>( hea)r<356,4>(
Of )<1234,3>(you), my g<189,3>(ent)l<133,4>(e co)us<1014,3>(in )W<509,3>(est)<732,4>(more)l<879,3>(and)<815,4>(,
Wh)at<1100,3>( ye)<753,4>(ster)n<895,4>(ight)<1170,5>( our )<1312,3>(cou)nc<546,3>(il )d<484,3>(id )de<1047,3>(cre)e
In<53,4>( for)<696,3>(war)d<590,4>(ing )<1166,5>(this )d<1290,3>(ear) expe<826,3>(die)<659,3>(nce).

Summary Report: literalCount 599, backPointerCount 202, totalBytes 1408

We can reduce the space usage to only 42.5% of the original (ignoring all the overhead for now). The actual space usage is maybe around 50-60% of the original.

But how do we actually encode a back-pointer? One naive solution is to use escape characters to differentiate the metadata (back-pointers) from the literals, but that might not perform well. In GZIP, we actually don’t even store the literals as literal bytes, but as Huffman codes!

Here’s a simplified snippet from the implementation:

for {
    node := resolveHuffmanCode(literalsRoot, stream)
    if node.code >= 0 && node.code < 256 {
        // literal
        buf = append(buf, byte(node.code))
    } else if node.code == 256 {
        // stop code
        break
    } else if node.code > 256 && node.code <= 285 {
        // This is a back-pointer

        // get length
        var length int
        if node.code < 265 {
            length = node.code - 254
        } else {
            length = getExtraLength(node.code, stream)
        }

        // get distance (from its own huffman tree)
        distNode := resolveHuffmanCode(distancesRoot, stream)
        dist := distNode.code
        if dist > 3 {
            dist = getExtraDist(dist, stream)
        }

        // copy bytes from the back-pointer, one at a time
        backPointer := len(buf) - dist - 1
        for length > 0 {
            buf = append(buf, buf[backPointer])
            backPointer++
            length--
        }
    } else {
        panic("invalid code!")
    }
}

Notice that there are two Huffman trees in use here: the literals tree and the distances tree. We will discuss the construction of these trees later. Here is the flow of data:

control lz77 as outer
control resolveLiteralsHuffman
control resolveDistancesHuffman
queue stream
queue buf

outer -> resolveLiteralsHuffman++: get code
loop not yet leaf node
  resolveLiteralsHuffman -> stream++: next bit
  resolveLiteralsHuffman <-- stream--
outer <-- resolveLiteralsHuffman--: code

alt if literal code < 256
  outer -> buf: literal

else if literal code == 256
  outer -> buf: flush (done)
  outer -> outer: break

else if literal code > 256
  group "Get length"
    alt "code >= 265"
      outer -> stream++: get extra bits for length
      outer <-- stream--: bits
    outer -> outer: calculate length
  group "Resolve distance code"
    outer -> resolveDistancesHuffman++: get distances code
    loop not yet leaf node
      resolveDistancesHuffman -> stream++: next bit
      resolveDistancesHuffman <-- stream--
    outer <-- resolveDistancesHuffman--: code
  group "Get distances"
    alt distanceCode > 3
      outer -> stream++: get extra bits for distances
      outer <-- stream--: bits
    outer -> outer: calculate distance
  group "Write backpointer"
    loop length
      outer -> buf: write using backpointer


The technique used in getting the extra distances and extra lengths is similar to the earlier codes 17 and 18: a variable number of extra bits plus a base offset.
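As an illustration of that pattern, the DEFLATE length codes 257-285 (values per RFC 1951) can be derived rather than hard-coded: each code's base length is the previous base plus 2^extraBits of the previous code. This sketch is illustrative and is not the gzip.go `getExtraLength` implementation:

```go
package main

import "fmt"

// extra gives the number of extra bits for length codes 257-284:
// 0 for the first 8 codes, then +1 for every further group of 4.
func extra(code int) int {
	if code < 265 {
		return 0
	}
	return (code - 261) / 4
}

// lengthBase returns the base copy-length and extra-bit count for a
// DEFLATE length code (257-285). The decoder reads extraBits more
// bits from the stream and adds them to base.
func lengthBase(code int) (base, extraBits int) {
	if code == 285 {
		return 258, 0 // special-cased maximum length
	}
	base = 3
	for c := 257; c < code; c++ {
		base += 1 << extra(c)
	}
	return base, extra(code)
}

func main() {
	for _, c := range []int{257, 264, 265, 269, 280, 285} {
		base, bits := lengthBase(c)
		fmt.Printf("code %d: base length %d, %d extra bits\n", c, base, bits)
	}
}
```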

Putting it all together



Live demo or backup video:


Actual performance:

Generally, the bigger the file, the better the compression, because there are more opportunities for back-referencing.


The difference between LZ77 and LZ78 is slight but fundamental. LZ77 builds the “shortcut dictionary” on the fly, by referencing earlier data. LZ78 pre-calculates the “shortcut dictionary”, which can then be used as building blocks. There is no concept of a “back-pointer” in LZ78, only a “pointer”, where one dictionary entry can refer to a simpler dictionary entry. A bit like defining “going” by referencing “go”.

The main drawback of LZ77 compared to LZ78 is the need to keep the sliding window; LZ78 can build dictionary items from anywhere within the compressed text. However, LZ77’s limitation also means it can be used for streaming applications like media streaming on the web, whereas LZ78 can’t.