Wednesday, 22 July 2020

Format recognition, new analysis options?

Previous work

In an older article (see https://kulturreste.blogspot.com/2018/10/heres-tool-make-it-work.html) I analyzed the PRONOM signatures. As of today, the module for this is available on CPAN; see https://metacpan.org/pod/File::FormatIdentification::Pronom for details.

In addition to the statistics on PRONOM signatures, the Perl package comes with two more helper scripts that can make the work of a long-term archivist easier.

Format identification

First, there is classic format recognition. The script reports all hits, and the output indicates the quality of the regex for each one. This value does not say how well the PRONOM signature matches the file, but how specific the signature itself is.

Here is an example output for a TIFF file that DROID wrongly recognized as GeoTIFF:

perl -I lib bin/pronomidentify.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif

/tmp/00000007.tif identified as Tagged Image File Format with PUID fmt/353 (regex quality 1)
/tmp/00000007.tif identified as Geographic Tagged Image File Format (GeoTIFF) with PUID fmt/155 (regex quality 2)

Colorized output of possible signature hits in the hexeditor wxHexEditor

Under Linux you can use the editor wxHexEditor to analyze files. It allows you to create tag-files, in which you can define sections that are marked with colors and annotated.

The script pronom2wxhexeditor creates such a file. The call looks like this:

perl -I lib bin/pronom2wxhexeditor.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif

What next?

Well, it's up to us as a community to use the existing tools and what they offer to improve our daily work. Anyone who has suggestions for improvement or ideas is welcome to share them with us.

I would be especially happy if some helpful souls took the PRONOM statistics to heart and helped improve the PRONOM signatures.

It makes sense to start with the orphaned signatures and to re-check the signatures that are used more than once.

Monday, 13 July 2020

Why it is a stupid idea to consider CSV as a valid long-term preservation file format

Take CSV!

It's so nice and quick and easy to say. Take CSV!

For simple cases that may be true. CSV files look so simple, so innocent, so sweet. Yet by their very nature they are insidious and vicious, and working with them resembles a bloody descent into the deepest dungeons of a classic role-playing game.

Let us begin our journey.

Innocent simplicity

You take a separator, e.g. the comma, and use it to separate your values. Write both out in readable form. Done.

Okay. We need a second separator to show us the next line. But then, done! It's a CSV.

Hmm. There was something. Line separator. Now, is that line feed, carriage return, or carriage return followed by line feed? It depends, for example, on which operating system you're running.
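To make the three variations concrete, here is a small sketch in Python (my choice for illustration, the sample data is made up). It parses the same two records once with each line ending, and compares that with a naive split on line feed.

```python
import csv
import io

# The same two records, written with the three common line terminators.
variants = {
    "LF": "a,b\n1,2\n",
    "CR": "a,b\r1,2\r",
    "CRLF": "a,b\r\n1,2\r\n",
}

def parse(text):
    # newline="" passes the line endings through untranslated, which is
    # what the csv module documentation recommends for its input.
    return list(csv.reader(io.StringIO(text, newline="")))

# The csv reader copes with all three variants ...
rows = {name: parse(text) for name, text in variants.items()}

# ... while a naive split on "\n" only really works for the first one.
naive = {name: text.split("\n") for name, text in variants.items()}

for name in variants:
    print(name, rows[name], naive[name])
```

The parser returns the same two records each time, while the naive split sees the CR-only variant as one long line and leaves stray carriage returns in the CRLF variant.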

The monster is growing

It is not a bad idea to separate values of a list by commas. Especially for Americans, this feels quite natural.

In other parts of the world, the decimal places of fractional numbers are separated by commas. Good, then we'll give the spreadsheets the opportunity to define the separator freely. Problem solved.

Well, not quite. In other contexts the separator itself could still turn up inside the individual values of a list. Good, then we'll introduce quoting. We define a character that allows us to recognize whether a separator is a separator or just a text component of a list value. Quotation marks would fit. That was easy, wasn't it?
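A minimal sketch of this idea, assuming Python's csv module and the double quotation mark as quoting character. The helper name write_row and the sample values are made up for illustration.

```python
import csv
import io

def write_row(row, delimiter, quotechar='"'):
    """Write one row with a freely chosen separator and minimal quoting."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quotechar=quotechar,
        quoting=csv.QUOTE_MINIMAL,  # quote only the values that need it
        lineterminator="\n",
    )
    writer.writerow(row)
    return buf.getvalue()

# One value contains a comma, another one a semicolon.
row = ["3,14", "plain", "a;b"]

comma_line = write_row(row, ",")
semicolon_line = write_row(row, ";")

print(comma_line, end="")
print(semicolon_line, end="")
```

Which value gets quoted depends on the chosen separator: with the comma it is the first value, with the semicolon the last one.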

Short break

So, to sum up. CSV files are easy. You need a separator, which can be a comma or anything else. We have a second separator that separates the lines. Usually there are three variations. We need quoting so that a separator inside a value cannot be confused with a real separator.

Yeah, it may have been a little more complex than it looked at first. But what could possibly make it worse?

Little toothy pegs!

Hmm, what if I want to store a text like this as a value after the raw value 1:

And he said "Oh, no!"

In the text, we have a comma, which would be protected by quoting. But we also have quotation marks, which we need for our quoting. No problem, then we double each quotation mark inside the text to indicate that the text is not finished at that point. So in the CSV it looks like this now:

1,"And he said ""Oh, no!"""

I got it.

But, wait, what happens if my text consists of a single quotation mark?

1,""""

You're lucky. It seems to be working.

Wait, so what if I have a lot of quotation marks? As in

""""""

This is translated to

1,""""""""""""""

It works, too.
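If you don't want to count quotation marks by hand, you can let a CSV library do it. The following sketch assumes Python's csv module with its default dialect (comma as separator, double quotation mark for quoting). The helper names to_csv_line and from_csv_line are made up for illustration.

```python
import csv
import io

def to_csv_line(row):
    """Serialize one row with the default dialect, without the line end."""
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")

def from_csv_line(line):
    """Parse a single CSV line back into its list of values."""
    return next(csv.reader([line]))

# The three texts from above: quoted speech, one quotation mark, six of them.
examples = ['And he said "Oh, no!"', '"', '"' * 6]

lines = [to_csv_line(["1", text]) for text in examples]

for text, line in zip(examples, lines):
    # Every line must parse back to exactly the value we started with.
    assert from_csv_line(line) == ["1", text]
    print(line)
```

Each embedded quotation mark is doubled and the whole value is wrapped in one more pair, so six quotation marks turn into fourteen.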

The problem is in the details

Now, a nasty little devil might get the idea to construct a text as value that contains line breaks, for example this one:

Evil Text
",
",

That would then become:

1,"Evil Text
"",
"","

Oops! If I now stubbornly read this in line by line, I end up with some rather strange lines.

Good thing there is real software out there that reads and parses CSV files cleanly from the beginning. Not that anyone here still uses 'grep' and co.
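For the sceptics, here is a sketch of the difference, again assuming Python's csv module with its default dialect. It writes the evil value once and then reads it back in two ways: stubbornly line by line, and with a real parser.

```python
import csv
import io

# The evil value: three lines, two of which are just a quotation mark
# followed by a comma.
evil = 'Evil Text\n",\n",'

buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\n").writerow(["1", evil])
data = buf.getvalue()

# Stubborn approach: one physical line is assumed to be one record.
naive_lines = data.splitlines()
naive_fields = [line.split(",") for line in naive_lines]

# Real parser: quoted line breaks stay inside the value.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines), "physical lines,", len(records), "record")
print(naive_fields)
print(records)
```

The line-based reading finds three "records", none of which exists in the data. The parser finds one record with two values, and the second value is exactly the evil text.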

The Abyss

Have we actually talked about character encoding yet? ASCII, Latin-1, UTF-32, UTF-8? With or without byte-order mark? No. Let's turn back. We still have a chance.
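For those who want one quick look over the edge before turning back, a small sketch in Python. The sample text is made up. It encodes the same two lines in several ways and then decodes them with the wrong assumption.

```python
import codecs

text = "name,city\nZoë,Köln\n"

# The same CSV content as bytes, in four different encodings.
encoded = {
    "latin-1": text.encode("latin-1"),
    "utf-8": text.encode("utf-8"),
    "utf-8-sig": text.encode("utf-8-sig"),  # UTF-8 with byte-order mark
    "utf-32": text.encode("utf-32"),
}
sizes = {name: len(data) for name, data in encoded.items()}

# Reading UTF-8 bytes as Latin-1 raises no error. It silently produces
# wrong characters for every non-ASCII letter.
mojibake = encoded["utf-8"].decode("latin-1")

# Reading a file with byte-order mark as plain UTF-8 glues the invisible
# character U+FEFF to the name of the first column.
has_bom = encoded["utf-8-sig"].startswith(codecs.BOM_UTF8)
first_header = encoded["utf-8-sig"].decode("utf-8").split(",")[0]

print(sizes)
print(repr(mojibake))
print(has_bom, repr(first_header))
```

Nothing in the bytes themselves tells the reader which of these variants it is looking at, apart from the optional byte-order mark.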

Later, at the pub.

I admit it was a terrible trip. Now, over a cold beer, we can laugh about it. But our hearts were in our mouths. We had no idea what to expect.

If only there had been a sign that said what character encoding, what line endings, and what separators for lines and columns we could expect, yes, then we would have been able to understand CSV and we would have been spared the horror. But the horror comes from the darkness, from the premonitions of the unknown.
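What such a sign could look like is sketched below, assuming Python's csv module. The dictionary and its keys are purely hypothetical and not a standard. The point is only that the reader is told everything instead of guessing.

```python
import csv
import io

# Hypothetical "sign" that travels with the file. The key names are
# made up for this sketch.
description = {
    "encoding": "utf-8",
    "delimiter": ";",
    "quotechar": '"',
    # Python's csv reader recognizes line ends on its own, so this entry
    # is documentation only.
    "lineterminator": "\r\n",
}

def read_with_description(raw, description):
    """Decode and parse CSV bytes exactly as the description says."""
    text = raw.decode(description["encoding"])
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=description["delimiter"],
        quotechar=description["quotechar"],
    )
    return list(reader)

# Made-up sample data that matches the description.
raw = 'id;text\r\n1;"He said ""Oh; no!"""\r\n'.encode("utf-8")
rows = read_with_description(raw, description)
print(rows)
```

With the sign in hand, the semicolon inside the quoted text and the doubled quotation marks are no longer a surprise.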

Therefore, be warned!

Don't use CSV, it could get you!