When I receive storage media from an estate, I want a quick overview of what is on the floppy disk, the CD-ROM, the USB stick or the hard disk drive so that I can look at the interesting things first.
But I only know what is there once I have read the media, right? A typical chicken-and-egg problem.
Image source: https://openclipart.org/detail/212857/sci-fi-scanner-device
What is Random Sampling?
Random sampling is nothing more than looking at a small, randomly chosen part of a total set and inferring the big picture from it.
To find out what is on a medium, it is sufficient to read randomly chosen blocks and decide for each one, based on its byte structure, whether it falls into the category "empty", "random", "text", "video" or "undef".
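The idea can be illustrated with a minimal Python sketch. It is not the code of the Perl module: the 512-byte block size, the thresholds and the helper names (shannon_entropy, classify_block, sample_image) are assumptions chosen for illustration, and the "video" category is left out because it needs format-specific heuristics.

```python
import math
import os
import random
from collections import Counter

BLOCK_SIZE = 512  # assumption: sector size in bytes


def shannon_entropy(block: bytes) -> float:
    """Entropy of the byte distribution in bits per byte (0 to 8)."""
    counts = Counter(block)
    total = len(block)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def classify_block(block: bytes) -> str:
    """Assign a block to a coarse category (thresholds are illustrative)."""
    if not block or block.count(0) == len(block):
        return "empty"
    # Printable ASCII plus tab, line feed and carriage return.
    printable = sum(1 for b in block if 32 <= b < 127 or b in (9, 10, 13))
    if printable / len(block) > 0.95:
        return "text"
    if shannon_entropy(block) > 7.0:
        return "random"  # also covers encrypted or compressed data
    return "undef"


def sample_image(path: str, fraction: float, seed=None) -> Counter:
    """Classify a randomly chosen fraction of the blocks of an image."""
    rng = random.Random(seed)
    tally = Counter()
    with open(path, "rb") as image:
        size = image.seek(0, os.SEEK_END)  # seek to the end to get the size
        total_blocks = size // BLOCK_SIZE
        if total_blocks == 0:
            return tally
        wanted = min(total_blocks, max(1, int(total_blocks * fraction)))
        # Visit the sampled blocks in ascending order to limit seeking.
        for index in sorted(rng.sample(range(total_blocks), wanted)):
            image.seek(index * BLOCK_SIZE)
            tally[classify_block(image.read(BLOCK_SIZE))] += 1
    return tally
```

Calling sample_image("disk.img", 0.001) on a hypothetical image file returns a Counter of categories, from which the percentages follow by dividing by the number of sampled blocks.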
Exactly this approach is implemented in the Perl module File::FormatIdentification::RandomSampling, which can be found on CPAN under https://metacpan.org/pod/File::FormatIdentification::RandomSampling.
Example
The above Perl module contains the program crazy_fast_image_scan.pl. The following simple call:
perl -I lib bin/crazy_fast_image_scan.pl --percent=0.000001 --image=/dev/mapper/laptop--vg-home
provides the following output:
Scanning Image /dev/mapper/laptop--vg-home with size 728982618112, checking 1423 sectors
scanning [...]
Estimate, that the image '/dev/mapper/laptop--vg-home'
has percent of following data types:
44.6% random/encrypted/compressed
35.6% undef
11.0% empty
5.4% video/audio
3.5% text
The full output is more detailed. It is important to note that the examined partition was about 679 GiB (roughly 729 GB) in size and was scanned in just 15 s.
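The sample count in the output can be checked with a little arithmetic. The sketch below assumes 512-byte sectors, which the output does not state. Under that assumption, the numbers fit the value 0.000001 being applied as a plain fraction of all sectors.

```python
SECTOR_SIZE = 512  # assumption: bytes per sector, not stated in the output

image_size = 728_982_618_112  # bytes, taken from the scan output
fraction = 0.000001           # value passed on the command line

total_sectors = image_size // SECTOR_SIZE
sampled_sectors = int(total_sectors * fraction)

print(total_sectors)    # 1423794176
print(sampled_sectors)  # 1423, matching "checking 1423 sectors"
```

So roughly one sector in a million is read, which explains the short run time.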
Limits
Importantly, the output provides only a rough estimate of what might be on the media. The choice of the sample size (here: via the --percent parameter) determines both how informative the estimate is and how long it takes to deliver a result.
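How rough the estimate is can be approximated with the standard margin of error for a proportion. The sketch below assumes that the sampled blocks are independent and that the normal approximation holds. The function name margin_of_error is made up for this example and is not part of the module.

```python
import math


def margin_of_error(share: float, samples: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated share (0 to 1)."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return z * math.sqrt(share * (1.0 - share) / samples)


# Shares taken from the example output, with 1423 sampled sectors.
for label, share in (("random/encrypted/compressed", 0.446), ("text", 0.035)):
    print(f"{label}: +/- {100 * margin_of_error(share, 1423):.1f} points")
```

With the 1423 samples of the example, the 44.6% share carries a margin of roughly 2.6 percentage points. Because the margin shrinks with the square root of the sample count, halving it requires about four times as many samples and therefore about four times the reading time.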
More ideas
In the above module, I have implemented an experimental output of the MIME types potentially present on the media. It is not very stable yet and needs more work, but it can help to judge even better whether the files on a disk are interesting enough to prioritize it. Here is an example output:
The next mimetype estimation is experimental and needs further work:
87.9% unknown
3.5% application/pdf
1.1% video/quicktime
0.8% image/gif
0.8% text/java
0.7% application/msword
0.6% text/markdown
0.6% application/vnd.openxmlformats-officedocument.wordprocessingml.document
0.6% application/xml
0.4% application/msaccess
0.4% application/navimap
0.4% application/rtf
0.3% image/png
0.2% application/arj
0.1% application/vnd.ms-powerpoint
0.1% text/html
The approach is to determine the MIME types of the files in a test corpus using other tools, compute typical bytegram values for them and pass the result to a decision tree learner. If you are interested, you are welcome to contribute to the module.
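To give an impression of what such features might look like, here is a minimal Python sketch. It assumes that "bytegram" means a byte n-gram, that is, a sequence of n consecutive bytes. The function name and the choice of features are hypothetical and are not taken from the module.

```python
from collections import Counter


def byte_ngram_features(block: bytes, n: int = 2, top: int = 3):
    """Return the most common byte n-grams of a block with their relative frequency."""
    grams = Counter(block[i:i + n] for i in range(len(block) - n + 1))
    total = sum(grams.values())
    return [(gram, count / total) for gram, count in grams.most_common(top)]


# Example: the bigram b"ab" makes up 3 of the 5 bigrams in b"ababab".
print(byte_ngram_features(b"ababab", n=2, top=1))
```

Feature vectors of this kind, labelled with the MIME type reported by another tool, could then serve as training data for a decision tree learner.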
Happy scanning!