The file fragment classification problem: a combined neural network and linear programming discriminant model approach
Abstract
The increased use of digital media to store legal as well as illegal data has created the need for specialized tools that can monitor, control and even recover this data. An important task in computer forensics and security is to identify the true file type to which a computer file or computer file fragment belongs. File type identification is traditionally done by means of metadata, such as file extensions and file header and footer signatures. As a result, metadata-based file object type identification techniques work well in cases where the required metadata is available and unaltered. However, these traditional approaches are not reliable when the integrity of the metadata is not guaranteed or the metadata is unavailable. As an alternative, patterns in the content of a file object can be used to determine the associated file type. This is called content-based file object type identification. Supervised learning techniques can be used to infer a file object type classifier by exploiting some unique pattern that underlies a file type’s common file structure. This study builds on existing literature regarding the use of supervised learning techniques for content-based file object type identification, and explores the combined use of multilayer perceptron neural network classifiers and linear programming-based discriminant classifiers as a solution to the multiple class file fragment type identification problem. The purpose of this study was to investigate and compare the use of a single multilayer perceptron neural network classifier, a single linear programming-based discriminant classifier and a combined ensemble of these classifiers in the field of file type identification. The ability of each individual classifier and of the ensemble of these classifiers to accurately predict the file type to which a file fragment belongs was tested empirically.
The study found that both a multilayer perceptron neural network classifier and a linear programming-based discriminant classifier (used in a round-robin scheme) appeared to perform well in solving the multiple class file fragment type identification problem. Combining multilayer perceptron neural network classifiers and linear programming-based discriminant classifiers in an ensemble did not produce better results than the single optimized classifiers.
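
The distinction between metadata-based and content-based identification drawn above can be illustrated with a minimal sketch. It assumes a small, hypothetical signature table and uses a byte-frequency histogram as the content-based feature; the histogram is an illustrative assumption and is not necessarily the feature representation used in this study. The function names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical signature table holding a few well-known header signatures.
SIGNATURES = {
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
}


def type_from_signature(data: bytes):
    """Metadata-based check: only works if the header is present and unaltered."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def byte_histogram(fragment: bytes):
    """Content-based feature (assumed): relative frequency of each byte value."""
    counts = Counter(fragment)
    total = len(fragment) or 1  # avoid division by zero for empty input
    return [counts.get(value, 0) / total for value in range(256)]


# A toy "file" and a mid-file fragment that no longer carries the header.
whole = b"%PDF-1.4\n" + b"stream data " * 10
fragment = whole[20:]

print(type_from_signature(whole))     # signature found at the start of the file
print(type_from_signature(fragment))  # no signature available in the fragment
features = byte_histogram(fragment)   # 256 values usable as classifier input
```

The signature lookup fails on the fragment because the header is missing, whereas the histogram can still be computed from any fragment and passed to a supervised classifier.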