Dissecting ZIP File Header with YARA


March 2, 2020

I am currently taking a newly offered course from NTU, “Concepts and Techniques of Malware Analysis”. One of the questions was to analyze a ZIP file and check whether it may contain malicious files without unzipping it. The approach is to check the file names of the compressed files (i.e. whether the archive contains an .exe, .dll, or .scr file). Of course, this approach is not foolproof, but it is a good way to get started with the YARA tool.


YARA is marketed as a “pattern matching swiss knife”. Pattern matching here means that we try to match certain patterns of strings or bytes against the suspected malicious file.

// sample_rule.yar

rule contain_mz_header {
	strings:
		$mz_header = { 4D 5A }
	condition:
		$mz_header at 0
}

This is a very simple example of a YARA specification (a.k.a. a rule). The rule above checks whether the two consecutive bytes 4D 5A appear at position 0, i.e. at the very start of the file. The curly braces mean that the pattern is written in hexadecimal, and whitespace between the bytes of the pattern is ignored. This example is interesting because it looks for the MZ header, which indicates that the file is a DOS (or Windows PE) executable. We can then run it simply with yara sample_rule.yar tested_file.ext.
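To make the "at 0" condition concrete, here is a rough Python equivalent of what the rule tests. The function name `has_mz_header` is my own and exists only for this illustration; it is not part of YARA.

```python
def has_mz_header(data: bytes) -> bool:
    """Return True if the buffer starts with the bytes 4D 5A ("MZ")."""
    # Same idea as `$mz_header at 0`: the two bytes must sit at offset 0.
    # bytes.fromhex() skips the space, much like YARA's hex strings do.
    return data[:2] == bytes.fromhex("4D 5A")


if __name__ == "__main__":
    print(has_mz_header(b"MZ\x90\x00"))  # buffer starts with MZ
    print(has_mz_header(b"\x00MZ"))      # MZ is present, but not at offset 0
```

The second call is the interesting one: the bytes are in the buffer, but not at the start, so a check anchored at offset 0 should not accept it.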


Note that in this exercise, the instructor is not focusing on the compression method, but only on the layout of the ZIP file format. The figure here (ZIP file format illustration) shows that layout; it describes the header for a single file contained in the ZIP. AFAIK, if the ZIP contains multiple files, their entries are simply stacked one after another.
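As a text companion to the illustration, here is a small Python sketch that decodes the fixed 30-byte part of a local file header, following the field order in the ZIP specification. It builds a tiny archive in memory with the standard `zipfile` module. The helper names (`parse_local_header`, `make_sample_zip`) and the entry names are made up for this example.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # 50 4B 03 04
# Fields: signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes, little-endian


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    """Decode one local file header that starts at `offset`."""
    fields = LOCAL_HEADER.unpack_from(data, offset)
    if fields[0] != LOCAL_SIG:
        raise ValueError("not a local file header")
    compressed_size = fields[7]
    name_len, extra_len = fields[9], fields[10]
    name_start = offset + LOCAL_HEADER.size  # 0x1e = 30
    return {
        "name": data[name_start:name_start + name_len],
        "name_len": name_len,
        # Start of the next entry, assuming the sizes are stored in the
        # header (no data descriptor), which holds for this sample archive.
        "next_offset": name_start + name_len + extra_len + compressed_size,
    }


def make_sample_zip() -> bytes:
    """Build a two-entry, uncompressed archive in memory (sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("hello.txt", "hi")
        zf.writestr("tool.exe", "not really an executable")
    return buf.getvalue()


if __name__ == "__main__":
    archive = make_sample_zip()
    first = parse_local_header(archive)
    second = parse_local_header(archive, first["next_offset"])
    print(first["name"], second["name"])
```

In this sample, the second header is found right where the first entry's data ends, which matches the "stacked together" picture.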

Checking the file extension

So basically the assignment requires us to write a rule that checks whether the ZIP contains any file with a suspicious file extension. This is my step-by-step breakdown of how to achieve that:

  1. Find all the occurrences of the ZIP local file header signature 50 4B 03 04. I’m not really sure what happens if 50 4B 03 04 turns up elsewhere in the file without being a signature, but at least for the sample file I’m working on this is not a problem.
  2. Locate the file name length field (the two bytes at 0x1a-0x1b, relative to the signature) and read its value to get the length of the file name.
  3. Read the actual file name, which starts at 0x1e and runs for the length found in step 2, and check whether it contains any of the unwanted file extensions.
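The three steps above can be sketched in Python to sanity-check the offsets before writing the rule. This is only an illustration of the logic, not how YARA works internally. `find_local_names`, `has_unwanted_file`, `build_zip`, and the sample entry names are hypothetical.

```python
import io
import struct
import zipfile

ZIP_SIG = bytes.fromhex("504B0304")
UNWANTED = (b".exe", b".dll", b".scr")


def find_local_names(data: bytes) -> list:
    """Return the file name stored after every 50 4B 03 04 signature."""
    names = []
    pos = data.find(ZIP_SIG)  # step 1: locate each signature
    while pos != -1:
        if pos + 30 <= len(data):
            # step 2: 16-bit little-endian name length at offset 0x1a (26)
            (name_len,) = struct.unpack_from("<H", data, pos + 26)
            # step 3: the name itself starts at offset 0x1e (30)
            names.append(data[pos + 30:pos + 30 + name_len])
        pos = data.find(ZIP_SIG, pos + 1)
    return names


def has_unwanted_file(data: bytes) -> bool:
    """True if any stored name contains one of the unwanted extensions."""
    return any(ext in name for name in find_local_names(data) for ext in UNWANTED)


def build_zip(names) -> bytes:
    """Create an in-memory, uncompressed test archive with dummy contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name in names:
            zf.writestr(name, "dummy")
    return buf.getvalue()


if __name__ == "__main__":
    print(has_unwanted_file(build_zip(["notes.txt", "payload.exe"])))
    print(has_unwanted_file(build_zip(["notes.txt", "photo.png"])))
```

Like the plan in step 1, this sketch trusts every occurrence of the signature, so a stray 50 4B 03 04 inside file data would be treated as a header here too.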

Skipping ahead, my final solution is:

rule check_unwanted_files {
	strings:
		$zip_header = {50 4B 03 04}
		$exe_file_ext = ".exe"
		$dll_file_ext = ".dll"
		$scr_file_ext = ".scr"
	condition:
		// check the file name range of every local file header
		for any i in (1..#zip_header): (
			$exe_file_ext in (@zip_header[i]+30..@zip_header[i]+30+uint16(@zip_header[i]+26)) or
			$dll_file_ext in (@zip_header[i]+30..@zip_header[i]+30+uint16(@zip_header[i]+26)) or
			$scr_file_ext in (@zip_header[i]+30..@zip_header[i]+30+uint16(@zip_header[i]+26))
		)
}

Let us break it down. for any i in (1..#zip_header): means that YARA counts the number of occurrences of $zip_header in the file (that is #zip_header) and then iterates i from 1 up to that number. (Match indexes in YARA are 1-based, so unlike most languages, which start from 0, the first occurrence is @zip_header[1].)

Then @zip_header[i] is the offset of the i-th occurrence of $zip_header. We add 30 to get @zip_header[i]+30 because the file name starts at 0x1e, which is 16+14 = 30 in decimal. uint16(@zip_header[i]+26) reads the 2-byte (16-bit) little-endian unsigned integer starting at offset @zip_header[i]+26, which corresponds to offset 0x1a in the local file header: the length of the file name.

Thus, when we assemble it, (@zip_header[i]+30..@zip_header[i]+30+uint16(@zip_header[i]+26)) becomes the range of offsets that the file name occupies. Then we simply check whether any of the file extensions occurs within that range!
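To double-check my reading of the condition, here is a rough Python model of it. It rests on two assumptions of mine, not on YARA's source: that `$s in (a..b)` is true when some match of `$s` starts at an offset between a and b, and that both ends are inclusive. `offsets_of`, `rule_matches`, and `fake_entry` are hypothetical helpers, and `fake_entry` produces a bare header with zeroed fields purely as test data.

```python
import struct


def offsets_of(data: bytes, pattern: bytes) -> list:
    """All start offsets of `pattern`, like the list behind @pattern[i]."""
    found = []
    pos = data.find(pattern)
    while pos != -1:
        found.append(pos)
        pos = data.find(pattern, pos + 1)
    return found


def rule_matches(data: bytes) -> bool:
    """Model of the rule's condition (assumed inclusive range semantics)."""
    headers = offsets_of(data, b"PK\x03\x04")
    for i in range(1, len(headers) + 1):      # 1..#zip_header
        base = headers[i - 1]                 # @zip_header[i] is 1-based
        if base + 28 > len(data):
            continue                          # name length field is missing
        (name_len,) = struct.unpack_from("<H", data, base + 26)
        lo, hi = base + 30, base + 30 + name_len
        for ext in (b".exe", b".dll", b".scr"):
            if any(lo <= off <= hi for off in offsets_of(data, ext)):
                return True
    return False


def fake_entry(name: bytes) -> bytes:
    """Minimal local file header: signature, 22 zero bytes, name length,
    zero extra length, then the name. No file data (test data only)."""
    return struct.pack("<4s22xHH", b"PK\x03\x04", len(name), 0) + name


if __name__ == "__main__":
    print(rule_matches(fake_entry(b"a.exe")))
    print(rule_matches(fake_entry(b"a.exe.txt")))
    print(rule_matches(fake_entry(b"readme.txt") + b"xx.dll"))
```

Under this model, a name such as a.exe.txt also matches, because the check is "the extension occurs somewhere in the name" rather than "the name ends with the extension". That fits the earlier caveat that the approach is not foolproof.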