Evaluation of Hidden Markov Model for Malware Behavioral Classification


Malware is a growing threat to computer systems and networks around the world. Ever since the malware construction kits and metamorphic virus generators became easily available, creating and spreading obfuscated malware has become a simple matter. The cyber-security vendors receive thousands of new malware samples everyday for analysis. It has become a challenging task for the malware analysts to identify if a given malware sample is a variant of a known malware or belongs to a new breed altogether. Since making an accurate decision about the nature of an unknown malware sample is crucial for updating of signature databases and propagation of the update to their customers, therefore vendors of cyber-security products need accurate malware classification techniques for this purpose.

The research community has been active for providing a solution to the above problem, and a number of diverse avenues have been explored such as machine learning, graph theory, finite state machines, etc. Furthermore, many syntactic and semantic aspects of computer programs have been tried out in search of the best aspect that could be used to distinguish between harmful and harmless computer programs, and to differentiate malware belonging to different families. All the proposed approaches have merits and demerits of their own, and the search for a solution that maximizes the classification accuracy with minimal computational costs is continued.

This dissertation formulates malware classification as a sequence classification problem, and evaluates a widely used sequence classification tool, Hidden Markov Model (HMM), for the task of malware classification. HMM has been a method of choice for a broad range of sequential pattern matching applications such as speech analysis, behavior modeling and handwriting recognition to name a few. The dissertation first proposes and evaluates novel methods of malware classification by combining HMM and malware behavioral features, which are attributes frequently used to distinguish between normal and malicious programs and to differentiate among malware families. As an another major contribution, the dissertation fills a significant research gap by studying the role of an important HMM parameter, the number of hidden states, in malware classification applications. Based on observations from comprehensive experiments conducted on a large and diverse dataset consisting of malware behavioral reports, the dissertation concludes that although HMM shows encouraging results when used for malware classification tasks, its potential from a practical standpoint is fairly limited. The dissertation makes the third contribution by proposing to replace the HMM component of malware classification method with Markov Chain Model (MCM), and performing comparative evaluation between the two models. Results of the comparison prove that classification performance achieved by HMM can be attained much more efficiently by MCM, and therefore MCM should be preferred over HMM for malware classification applications.

Download full paper