Thursday, April 1, 2021

Backup is digital long-term preservation!

Exponential growth

An important observation is that the number of files produced each year continues to increase worldwide. The number of digital objects grows in the same measure, and for each of them we must decide: keep or throw away?

The truth is, the discard scenario becomes the more likely one with each passing year.

Magnificent diversity

Another observation is that about 90 new file formats are added every year.
And that figure already accounts for the file formats that are being dropped.


The truth is, no one can build up format knowledge for this yet.


A fuzzy concept

When talking to colleagues, the topic of validation plays no role. For one thing, no one is clear about what "valid" means. Valid against a specification? Valid against a profile? Valid because programs can open it? For another, nothing follows from the result. If a file is broken, it is archived anyway. If it is not broken, fine.

The truth is, validation is useless.


Success factors

Do you know how the success of digital preservation is measured? I'll tell you: in terabytes per year. If the numbers go up, that's a good thing to sell to politicians. Whether it was difficult to prepare digital objects for long-term availability doesn't matter. Whether born-digitals are more at risk, never mind.

Is that the truth?


It used to be said that long-term digital archiving could only be handled by organizations with a certain minimum of resources. Look around and you'll find dozens of one-man bands and part-time archives. And do you think that human resources grow along with the amount of data? Oh, come on!

You know the truth!

That's too exhausting

If you've ever heard of format migration as a principle of long-term preservation, you will have read phrases like this in textbooks:

To ensure format migration, the significant properties of groups of objects that must be preserved must be determined. 

Have you ever seen an archive that has actually determined and documented significant properties?

The truth is, significant properties are determined after the fact from technical metadata.


So what is digital long-term preservation? Only an expensive backup.

Wednesday, January 27, 2021

Impossible - or how I learned to read data storage media at the speed of light and what it's good for

When I receive data carriers from an estate, I want a quick overview of what is on the floppy disk, the CD-ROM, the USB stick or the hard disk drive, so that I can look at the interesting things first.

But I only know what is there when I read the media, right? A typical chicken and egg problem.
I found the crucial clue to the solution in a 2014 talk by Simson Garfinkel, "Digital Forensics Innovation: Searching A Terabyte of Data in 10 minutes".

What is Random Sampling?

Random sampling is nothing more than looking at only a small, randomly chosen part of a total set and inferring the big picture from it.

To find out what is on a medium, it would be sufficient to look at random blocks and determine for them, based on their byte structure, whether they fall into the categories "empty", "random", "text", "video" or "undef".
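The sampling step can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's actual code: the 512-byte sector size and the function names `sample_offsets` and `read_samples` are assumptions made for this example.

```python
import io
import random

SECTOR_SIZE = 512  # assumed sector size in bytes


def sample_offsets(image_size, count, rng=None):
    """Return up to `count` distinct, sector-aligned byte offsets in ascending order."""
    rng = rng or random.Random()
    total_sectors = image_size // SECTOR_SIZE
    count = min(count, total_sectors)
    # random.sample on a range draws without replacement and needs no big list.
    sectors = rng.sample(range(total_sectors), count)
    return sorted(sector * SECTOR_SIZE for sector in sectors)


def read_samples(stream, offsets):
    """Yield (offset, block) for each sampled sector of a seekable binary stream."""
    for offset in offsets:
        stream.seek(offset)
        yield offset, stream.read(SECTOR_SIZE)


if __name__ == "__main__":
    # Stand-in for a disk image: 1000 empty sectors held in memory.
    image = io.BytesIO(bytes(1000 * SECTOR_SIZE))
    offsets = sample_offsets(1000 * SECTOR_SIZE, 10, random.Random(42))
    for offset, block in read_samples(image, offsets):
        print(offset, len(block))
```

Sorting the offsets keeps the reads moving in one direction across the medium, which should matter for floppy disks, optical media and spinning hard disks.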

Exactly this approach is implemented in the Perl module File::FormatIdentification::RandomSampling, which is available on CPAN.

The category "empty" is dominated by sequences of zero bytes. In the category "random", the byte values are almost equally distributed. In the category "text", values for the characters "a-z" from the ASCII character set appear frequently. The category "video" contains frequent byte sequences that result from the basic structure of MPEG. Everything else is subsumed under "undef".
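To make the categories more concrete, here is a rough Python sketch of such a block classifier. It is not how the Perl module decides: the thresholds (90% zero bytes, 50% lowercase letters, 7.0 bits of entropy per byte) are guesses chosen for illustration, and the "video" category is left out because it needs knowledge of the MPEG structure.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((n / total) * math.log2(n / total) for n in Counter(block).values())


def classify_block(block):
    """Assign a block to 'empty', 'text', 'random' or 'undef' (thresholds are guesses)."""
    if not block:
        return "undef"
    total = len(block)
    zero_share = block.count(0) / total
    # Share of bytes that are ASCII lowercase letters a-z (0x61 to 0x7A).
    lower_share = sum(1 for b in block if 0x61 <= b <= 0x7A) / total
    if zero_share >= 0.9:
        return "empty"
    if lower_share >= 0.5:
        return "text"
    # Nearly equally distributed byte values give an entropy close to 8 bits.
    if shannon_entropy(block) >= 7.0:
        return "random"
    return "undef"


if __name__ == "__main__":
    print(classify_block(bytes(512)))
    print(classify_block(b"the quick brown fox jumps over the lazy dog " * 11))
```

Counting how often each label comes up over all sampled blocks then gives a percentage estimate per category.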


The above Perl module ships with a command-line program. The following simple call:

perl -I lib bin/ --percent=0.000001 --image=/dev/mapper/laptop--vg-home

provides the following output:

Scanning Image /dev/mapper/laptop--vg-home with size 728982618112, checking 1423 sectors
scanning [...]   
Estimate, that the image '/dev/mapper/laptop--vg-home'
has percent of following data types:
    44.6% random/encrypted/compressed
    35.6% undef
    11.0% empty
     5.4% video/audio
     3.5% text

The complete output is even more extensive. Note that the examined partition was about 679 GiB in size (728,982,618,112 bytes) and was scanned in just 15 s.


Importantly, the output provides only a rough estimate of what might be on the media. The choice of the sample size (here: via the --percent parameter) determines both how informative the estimate is and how long it takes to deliver a result.
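The trade-off can be put into numbers with a small Python sketch. Two assumptions go into it. First, the reported 1423 sectors are consistent with treating the --percent value as a plain fraction of all 512-byte sectors; this is inferred from the output above, not checked against the module source. Second, the margin of error uses the usual normal approximation for a proportion and assumes independent, uniformly drawn samples.

```python
import math

SECTOR_SIZE = 512  # assumed; consistent with the 1423 sectors reported above


def sectors_to_check(image_size, fraction):
    """Number of sectors to sample when `fraction` of all sectors is examined."""
    total_sectors = image_size // SECTOR_SIZE
    return int(total_sectors * fraction)


def margin_of_error(share, samples, z=1.96):
    """Half-width of the approximate 95% confidence interval for a sampled share."""
    return z * math.sqrt(share * (1.0 - share) / samples)


if __name__ == "__main__":
    n = sectors_to_check(728982618112, 0.000001)
    print(n)
    for label, share in [("random", 0.446), ("text", 0.035)]:
        print(label, round(100 * margin_of_error(share, n), 1))
```

Under these assumptions, the 44.6% estimate for "random/encrypted/compressed" carries a margin of roughly ±2.6 percentage points, and the 3.5% estimate for "text" roughly ±1 percentage point. Quadrupling the sample size halves the margin but also takes about four times as long.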

More ideas

In the above module, I have implemented an experimental output of the MIME types potentially present on the media. This is not very stable yet and needs more work, but it can help to estimate even better whether the files on a disk are interesting enough to prioritize it. Here is an example output:

The next mimetype estimation is experimental and needs further work:
    87.9% unknown
     3.5% application/pdf
     1.1% video/quicktime
     0.8% image/gif
     0.8% text/java
     0.7% application/msword
     0.6% text/markdown
     0.6% application/vnd.openxmlformats-officedocument.wordprocessingml.document
     0.6% application/xml
     0.4% application/msaccess
     0.4% application/navimap
     0.4% application/rtf
     0.3% image/png
     0.2% application/arj
     0.1% application/
     0.1% text/html

The approach is to determine the MIME types of the files in a test corpus using other tools, determine typical bytegram values for them, and pass the whole thing to a decision tree learner. If you are interested, you are welcome to contribute to the module.
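For readers who want to experiment, the feature extraction step could look roughly like the following Python sketch. It reads "bytegram" as byte n-gram, which is an assumption, and the function name `bytegram_features` is made up for this example. The decision tree learner itself is left out.

```python
from collections import Counter


def bytegram_features(block, n=2, top=8):
    """Return the `top` most frequent byte n-grams of a block with relative frequencies."""
    if len(block) < n:
        return []
    grams = Counter(block[i:i + n] for i in range(len(block) - n + 1))
    total = sum(grams.values())
    # Sort by descending count, then by the n-gram bytes, so that ties are deterministic.
    ranked = sorted(grams.items(), key=lambda item: (-item[1], item[0]))
    return [(gram, count / total) for gram, count in ranked[:top]]


if __name__ == "__main__":
    for gram, share in bytegram_features(b"%PDF-1.4 %PDF-1.7 %PDF-2.0", top=3):
        print(gram, round(share, 3))
```

Feature vectors like these, labelled with the MIME type found by another tool, would be the training input for the learner.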

Happy scanning!