Getting Started with ssdeep

Introduction

This document provides an introduction to using ssdeep.

This guide starts with an explanation of the basic functions of ssdeep and then gives some examples of using fuzzy hashing in real-world situations.

Installation

Windows

Users running Microsoft Windows are strongly encouraged to download the precompiled binaries from the GitHub project page.

Please note that these binaries are created using a MinGW-w64 cross compiler. Compiling the programs directly on Windows is not supported.

Automatic Installation

If your distribution provides an ssdeep package, you can install it with your package manager.

See the Platforms section on the home page to check whether your distribution or operating system has an ssdeep package.

Manual Installation

If your operating system does not support the automatic installation methods described above, you will have to download the source code and compile the programs yourself.

First, download the latest tarball of the program from the GitHub project page. This file should be named something like ssdeep-2.14.tar.gz. Uncompress the file with the following command:

$ tar zxf ssdeep-2.14.tar.gz

Change into the decompressed directory:

$ cd ssdeep-2.14

and configure the program.

$ ./configure

If you checked out the source code from the git repository or downloaded a bare distribution file (which GitHub automatically generates), you may not find a configure script in the decompressed directory. If so, install GNU Autotools and run:

$ ./bootstrap

to generate the configure script.

The configure script accepts many options. Run ./configure --help for the complete list. The most commonly used option is --prefix, which installs the program in a location other than the default, /usr/local. If you wanted to install the program elsewhere, for example, /tmp/ssdeep, you would run ./configure --prefix=/tmp/ssdeep instead.

On 32-bit (or narrower) platforms, the bit-parallel algorithms introduced in version 2.14 may be slower in the worst case (though they should not be, even if 64-bit integers are emulated). You can try the --disable-bitparallel-string-ops option and benchmark whether it suits your platform.

You can now compile the program using the make command:

$ make

and install it:

# make install

Note that you must be root on most operating systems to install the program to its default location, /usr/local/bin. The tool sudo may help:

$ sudo make install
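
If you installed to a non-default prefix, the ssdeep binary may not be on your PATH. The following is a minimal sketch of a check, not part of ssdeep itself: the function name check_installed is hypothetical, and it relies only on the shell's command -v builtin.

```shell
# check_installed NAME: report whether NAME can be found on the PATH.
# The function name is hypothetical; it is not part of ssdeep.
check_installed() {
    if command -v "$1" >/dev/null 2>&1; then
        # command -v prints the resolved location of the program
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found in PATH"
        return 1
    fi
}
```

After a successful install, check_installed ssdeep should report the installed location. With --prefix=/tmp/ssdeep you would probably need to add /tmp/ssdeep/bin to PATH first.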

Basic Operation

By default, ssdeep generates context triggered piecewise hashes, or fuzzy hashes, for each input file. The output is preceded by a file header.

[user@localhost /workdir]$ ssdeep config.h INSTALL m4/libtool.m4
ssdeep,1.1--blocksize:hash:hash,filename
96:s4Ud1Lj96tHHlZDrwciQmA+4uy1I0G4HYuL8N3TzS8QsO/wqWXLcMSx:sF1LjEtHHlZDrJzrhuyZvHYm8tKp/RWO,"/workdir/config.h"
384:EWo4X1WaPW9ZWhWzLo+lWpct/fWbkWsWIwW0/S7dZhgG8:EWo4X1WmW9ZWhWH/WpchfWgWsWTWtf8,"/workdir/INSTALL"
6144:3wSQSlrBHFjOvwYAU/Fsgi/2WDg5+YaNk5xcHrYw+Zg+XrZsGEREYRGAFU25ttR/:ctM7E0L4q,"/workdir/m4/libtool.m4"
[user@localhost /workdir]$

Notice how the above output shows the full path of each file. You can have ssdeep print relative filenames instead of absolute ones; that is, it omits all path information except what was specified on the command line. To enable relative paths, use the -l flag. Repeating our first example with the -l flag:

[user@localhost /workdir]$ ssdeep -l config.h INSTALL m4/libtool.m4
ssdeep,1.1--blocksize:hash:hash,filename
96:s4Ud1Lj96tHHlZDrwciQmA+4uy1I0G4HYuL8N3TzS8QsO/wqWXLcMSx:sF1LjEtHHlZDrJzrhuyZvHYm8tKp/RWO,"config.h"
384:EWo4X1WaPW9ZWhWzLo+lWpct/fWbkWsWIwW0/S7dZhgG8:EWo4X1WmW9ZWhWH/WpchfWgWsWTWtf8,"INSTALL"
6144:3wSQSlrBHFjOvwYAU/Fsgi/2WDg5+YaNk5xcHrYw+Zg+XrZsGEREYRGAFU25ttR/:ctM7E0L4q,"m4/libtool.m4"
[user@localhost /workdir]$

You can also have ssdeep print only the basename of each file it processes. That is, all directory information will be stripped off. To enable basename mode, use the -b flag:

[user@localhost /workdir]$ ssdeep -b config.h INSTALL m4/libtool.m4
ssdeep,1.1--blocksize:hash:hash,filename
96:s4Ud1Lj96tHHlZDrwciQmA+4uy1I0G4HYuL8N3TzS8QsO/wqWXLcMSx:sF1LjEtHHlZDrJzrhuyZvHYm8tKp/RWO,"config.h"
384:EWo4X1WaPW9ZWhWzLo+lWpct/fWbkWsWIwW0/S7dZhgG8:EWo4X1WmW9ZWhWH/WpchfWgWsWTWtf8,"INSTALL"
6144:3wSQSlrBHFjOvwYAU/Fsgi/2WDg5+YaNk5xcHrYw+Zg+XrZsGEREYRGAFU25ttR/:ctM7E0L4q,"libtool.m4"
[user@localhost /workdir]$
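
Each signature line in the listings above follows the layout named in the header: a block size, two hashes, and a quoted filename. As a rough sketch of how saved output could be post-processed, the hypothetical function below (list_signatures is not part of ssdeep) prints the block size and filename of each signature line. It assumes the hashes contain no colons or commas and does not try to handle escaped characters in filenames.

```shell
# list_signatures: read ssdeep output on stdin and print "BLOCKSIZE FILENAME"
# for each signature line. Assumes the layout named in the header line,
# blocksize:hash:hash,"filename". The function name is hypothetical.
list_signatures() {
    # -n with the p flag prints only lines that match the pattern,
    # so the header line (which does not start with a number) is skipped.
    sed -n 's/^\([0-9][0-9]*\):[^:,]*:[^:,]*,"\(.*\)"$/\1 \2/p'
}
```

For example, ssdeep -l config.h INSTALL m4/libtool.m4 | list_signatures would list one block size and filename per input file.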

From the Standard Input

If no input files are specified, ssdeep reads from the standard input. (Older versions may display an error message instead.)

[user@localhost /workdir]$ ssdeep
(ssdeep waits for input from the standard input)
<Ctrl+D><Ctrl+D>
ssdeep,1.1--blocksize:hash:hash,filename
3:q8wK6FuFWcEqlv:3wK6FN1I,"stdin"
[user@localhost /workdir]$

Error Messages

If an input file can't be found or can't be accessed, an error message is normally printed. These, and all other error messages, can be suppressed by using the -s flag.

[user@localhost /workdir]$ ssdeep doesnotexist.txt cannotaccess.txt
/workdir/doesnotexist.txt: No such file or directory
/workdir/cannotaccess.txt: Permission denied
[user@localhost /workdir]$ ssdeep -s doesnotexist.txt cannotaccess.txt
[user@localhost /workdir]$

Of course, you can also redirect the standard error output:

[user@localhost /workdir]$ ssdeep doesnotexist.txt cannotaccess.txt 2>/dev/null
[user@localhost /workdir]$

Recursive Mode

Normally, attempting to process a directory will generate an error message. In recursive mode, ssdeep hashes the specified files as well as every file in the specified directories and their subdirectories. Recursive mode is activated by using the -r flag.

[user@localhost /workdir]$ ssdeep *
ssdeep,1.1--blocksize:hash:hash,filename
/workdir/backups: Is a directory
96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"/workdir/config.h"
/workdir/www: Is a directory
[user@localhost /workdir]$ ssdeep -r *
ssdeep,1.1--blocksize:hash:hash,filename
768:McAQ8tPlH25e85Q2OiYpD08NvHmjJ97UfPMO47sekO:uN9M553OiiN/OJ9MM+e3,"/workdir/mystuff.zip"
384:bcEKuglk+GUYIk90a1lEF+Wfsy2solvW8mK1enQXP79:bmlFGUNk9L1roy4K1enQ,"/workdir/backups/ssdeep.exe"
96:CFzROqsgconvv7uUo6jTcEGEvpVCN116S:CNVnqj8cMVCv16,"/workdir/backups/foo.doc"
96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"/workdir/config.h"
96:aN0jOc0WlWW+LWQnjv7ufGcE5ESr5YaZ6uicEDEO9VCN116Sb5EutkB:aSeoF+L/zqfGtfr5YiWcsVCv16W5htk,"/workdir/www/index.html"
[user@localhost /workdir]$

Matching Mode

One of the more powerful features of ssdeep is the ability to match the hashes of input files against a list of known hashes. Because fuzzy hashing is inexact, a match reported by ssdeep does not necessarily mean that the two files are related. You should examine every pair of matching files individually to see how well they correspond.

Here's a simple example of how ssdeep can match files that are not identical. We take an existing file, make a copy of it, and append a single character (plus a newline) to the copy.

[user@localhost /workdir]$ ls -l foo.txt
-rw-r--r-- 1 user users 4274 Jan  2 03:04 foo.txt
[user@localhost /workdir]$ cp foo.txt bar.txt
[user@localhost /workdir]$ echo 1 >>bar.txt

A cryptographic hashing algorithm like MD5 can't be used to match these files; they have wildly different hashes.

[user@localhost /workdir]$ md5sum foo.txt bar.txt
33e63a6fb553396089206212a5af17e3  foo.txt
890aecccf13601c80f194bce9f5f6d09  bar.txt
[user@localhost /workdir]$

But fuzzy hashing can! We compute the fuzzy hash of one file and use the matching mode to match the other one.

[user@localhost /workdir]$ ssdeep -b foo.txt >hashes.txt
[user@localhost /workdir]$ ssdeep -b -m hashes.txt bar.txt
bar.txt matches hashes.txt:foo.txt (99)
[user@localhost /workdir]$

The number at the end of the line is a match score: a weighted measure, from 0 to 100, of how similar these files are. The higher the number, the more similar the files.
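
When there are many matches, it can help to look at the strongest ones first. The sketch below is one way to do that, assuming each match line ends in a parenthesized score as in the output above. The function name filter_matches and the threshold you pass to it are illustrative choices, not part of ssdeep.

```shell
# filter_matches MIN: copy match lines from stdin whose score is >= MIN.
# Assumes each match line ends in "(score)", as in the output above.
# The function name is hypothetical.
filter_matches() {
    awk -v min="$1" '
        {
            score = $NF                 # last field, for example "(99)"
            gsub(/[()]/, "", score)     # strip the parentheses
            if (score ~ /^[0-9]+$/ && score + 0 >= min + 0)
                print
        }'
}
```

For example, piping the output of ssdeep -b -m hashes.txt bar.txt through filter_matches 50 would keep the bar.txt line, since its score is 99. A low score is not proof that files are unrelated, so treat any cutoff as a way to prioritize, not to decide.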

Use Case: Source Code Reuse

As a more practical example of ssdeep's matching functionality, you can use matching mode to help find source code reuse. Let's say we have two folders, ssdeep-1.1 and md5deep-1.12, that contain the source code for each of those tools. You can compare their contents by computing fuzzy hashes for one tree and then comparing them against the other:

[user@localhost /workdir]$ ssdeep -l -r md5deep-1.12 >md5deep-hashes.txt
[user@localhost /workdir]$ ssdeep -l -r -m md5deep-hashes.txt ssdeep-1.1
ssdeep-1.1/cycles.c matches md5deep-1.12/cycles.c (94)
ssdeep-1.1/dig.c matches md5deep-1.12/dig.c (35)
ssdeep-1.1/helpers.c matches md5deep-1.12/helpers.c (57)
[user@localhost /workdir]$

Ta da! You can see that Jesse Kornblum, ssdeep's original author, reused code from the md5deep project when writing ssdeep.

Use Case: Truncated Files

Along with source code reuse, you can also use fuzzy hashing to find truncated files. Here's a sample using a fake filename. We'll compute the fuzzy hash for the file, make a copy that contains only the first 29% of the original, and then try to match the truncated version back to the original.

[user@localhost /workdir]$ ls -l all-the-kings-men.avi
-rw-r--r-- 1 user users 733478912 Jan  2 03:04 all-the-kings-men.avi
[user@localhost /workdir]$ ssdeep -b all-the-kings-men.avi >sig.txt
[user@localhost /workdir]$ cat sig.txt
ssdeep,1.1--blocksize:hash:hash,filename
12582912:imU4zlwQ1LYdr1uKWM31bN0v1NySeBDBxs7/gOpQWzFLp1uLeBi18MP8:imU0wgMdwTMdN0v83xiHQWzz1uLo698,"all-the-kings-men.avi"
[user@localhost /workdir]$ dd bs=1M count=200 if=all-the-kings-men.avi of=partial.avi
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.23091 s, 908 MB/s
[user@localhost /workdir]$ ls -l partial.avi
-rw-r--r-- 1 user users 209715200 Jan  5 06:42 partial.avi
[user@localhost /workdir]$ ssdeep -b -m sig.txt partial.avi
partial.avi matches sig.txt:all-the-kings-men.avi (43)
[user@localhost /workdir]$
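
The 29% figure follows from the two file sizes shown above. As a small sketch, the hypothetical helper below (percent_of is not part of ssdeep) computes one file's size as a percentage of another's, using wc -c for the byte counts and awk for the division.

```shell
# percent_of PART WHOLE: print the size of file PART as a percentage of
# the size of file WHOLE, to one decimal place. The name is hypothetical.
percent_of() {
    part=$(wc -c <"$1")
    whole=$(wc -c <"$2")
    awk -v p="$part" -v w="$whole" 'BEGIN {
        if (w + 0 == 0) exit 1          # avoid dividing by zero
        printf "%.1f\n", p * 100 / w
    }'
}
```

With the sizes listed above, 209715200 * 100 / 733478912 comes to about 28.6, which the text rounds to 29%.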

Needles in a Haystack

You can also compare many files against each other without writing any hashes to disk, using either of two methods. Let's say that we have a whole bunch of files in two or three directories and want to know which ones are similar to each other. We can use the -d mode to display these matches. This switch causes ssdeep to compute a fuzzy hash for each input file and compare it against all of the other input files.

In this example, we've gathered a whole bunch of Microsoft Word documents in the folders Incoming, Outgoing, and Trash. Rather than go through all of the documents, it would be nice to eliminate those that are substantially the same.

[user@localhost /workdir]$ ssdeep -l -r -d Incoming Outgoing Trash
Outgoing/Corporate Espionage/Our Budget.doc matches Incoming/Budget 2007.doc (99)
Outgoing/Personnel Mayhem/Your Buddy Makes More Than You.doc matches Incoming/Salaries.doc (45)
Trash/DO NOT DISTRIBUTE.doc matches Outgoing/Plan for Hostile Takeover.doc (88)
[user@localhost /workdir]$

Oh my!

The -p mode works similarly, but displays the results in a slightly nicer format. If there are two input files A and B that match, the -d mode will only display "A matches B." The -p mode will display "A matches B," skip a line, and then display "B matches A." This greatly increases the length of the output, but can make files easier to find. Here's the above input again, this time using the -p flag.

[user@localhost /workdir]$ ssdeep -l -r -p Incoming Outgoing Trash
Incoming/Budget 2007.doc matches Outgoing/Corporate Espionage/Our Budget.doc (99)

Incoming/Salaries.doc matches Outgoing/Personnel Mayhem/Your Buddy Makes More Than You.doc (45)

Outgoing/Corporate Espionage/Our Budget.doc matches Incoming/Budget 2007.doc (99)

Outgoing/Personnel Mayhem/Your Buddy Makes More Than You.doc matches Incoming/Salaries.doc (45)

Outgoing/Plan for Hostile Takeover.doc matches Trash/DO NOT DISTRIBUTE.doc (88)

Trash/DO NOT DISTRIBUTE.doc matches Outgoing/Plan for Hostile Takeover.doc (88)

[user@localhost /workdir]$

Comparing Files of Signatures

After you've generated several files of fuzzy hashes, you may wish to compare those signatures to each other. You can compare one or more files of signatures against each other using the -x flag.

[user@localhost /workdir]$ ssdeep -r /etc >list1.txt
[user@localhost /workdir]$ ssdeep -r /usr >list2.txt
[user@localhost /workdir]$ ssdeep -l -r ./known_malware >malware-list.txt
[user@localhost /workdir]$ ssdeep -x list1.txt list2.txt malware-list.txt
list1.txt:/etc/rcc.d/init.d matches malware-list.txt:./known_malware/wlk_rootkit/dropper (86)

malware-list.txt:./known_malware/wlk_rootkit/dropper matches list1.txt:/etc/rcc.d/init.d (86)

[user@localhost /workdir]$

The above method compares all of the signatures against each other. This can take some time, especially if the files are large. If you'd rather compare some unknown signatures against a set of known signatures, you can use the -k flag. Let's say you have some signatures for malicious programs, badfiles.txt and worsefiles.txt. You then compute the fuzzy hashes for programs on some workstations, which are saved to comp1.txt, comp2.txt, and comp3.txt. You can compare these unknowns to the knowns like this:

[user@localhost /workdir]$ ssdeep -k badfiles.txt -k worsefiles.txt comp1.txt comp2.txt comp3.txt
comp1.txt:WINWORD2.EXE matches badfiles.txt:some_trojan.exe (84)
comp3.txt:ntoskrrnl.exe matches worsefiles.txt:delete_all_data.exe (77)
[user@localhost /workdir]$
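
When many workstations are checked at once, a per-file tally can show where to look first. The following sketch counts match lines per signature file on the left-hand side of the output. The function name count_hits is hypothetical, and it assumes signature file names contain no colon.

```shell
# count_hits: read match lines on stdin and print "SIGFILE COUNT" for each
# signature file named on the left-hand side. Assumes signature file names
# contain no colon. The function name is hypothetical.
count_hits() {
    # With ":" as the field separator, $1 is the signature file name.
    awk -F: '/ matches / { hits[$1]++ }
             END { for (f in hits) print f, hits[f] }' | LC_ALL=C sort
}
```

Piping the -k output above through count_hits would report one hit each for comp1.txt and comp3.txt, and nothing for comp2.txt.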