Experimental study of fuzzy hashing in malware clustering analysis
Abstract
Malware triaging is the process of analyzing the behavior of malicious software applications in order to develop detection signatures. The task is challenging, chiefly because vendors receive an enormous number of samples while analyst time is limited. Triaging usually starts with an analyst classifying samples into known and unknown malware. Recently, there have been various attempts to automate the grouping of similar malware using a technique called fuzzy hashing, a type of compression function for computing the similarity between individual digital files. Unfortunately, the research literature contains no rigorous experimentation or evaluation of fuzzy hashing algorithms for malware similarity analysis. In this paper, we perform an extensive study of existing fuzzy hashing algorithms with the goal of understanding their applicability to clustering similar malware. Our experiments indicate that currently popular fuzzy hashing algorithms suffer from serious limitations that preclude them from being used in similarity analysis. We identify novel ways to construct fuzzy hashing algorithms, and our experiments show that the resulting algorithms perform better than existing ones.