Wednesday, 22 July 2020

Format recognition, new analysis options?

Previous work


In an older article (see https://kulturreste.blogspot.com/2018/10/heres-tool-make-it-work.html) I already analyzed the PRONOM signatures. As of today, the module for this is available on CPAN; see https://metacpan.org/pod/File::FormatIdentification::Pronom for details.

In addition to the statistics on PRONOM signatures, the Perl package comes with two more helper scripts that can make everyday work in long-term digital archiving easier.

Format identification


First, there is classic format identification. The script reports all hits, not only one. For each hit, the output indicates the quality of the regex. This value does not say how well the PRONOM signature matches the file; it says how specific the signature itself is.

Here is an example output for a TIFF file that DROID misidentified as GeoTIFF:

perl -I lib bin/pronomidentify.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif
/tmp/00000007.tif identified as Tagged Image File Format with PUID fmt/353 (regex quality 1)
/tmp/00000007.tif identified as Geographic Tagged Image File Format (GeoTIFF) with PUID fmt/155 (regex quality 2)
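If you want to process these hits further, the output lines are regular enough to parse. The following is only a minimal Python sketch, assuming the line layout is exactly as in the example above; the names `Hit` and `parse_hits` are made up for illustration and are not part of the Perl package. It makes no assumption about whether a higher or lower quality value is "better"; it only sorts by the number.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One line of pronomidentify.pl output (hypothetical helper type)."""
    path: str
    format_name: str
    puid: str
    regex_quality: int


# Assumed layout, taken from the example output:
# <path> identified as <format name> with PUID <puid> (regex quality <n>)
LINE_RE = re.compile(
    r"^(?P<path>.+?) identified as (?P<name>.+) with PUID (?P<puid>\S+) "
    r"\(regex quality (?P<quality>\d+)\)$"
)


def parse_hits(output: str) -> List[Hit]:
    """Collect all lines that follow the assumed layout; skip the rest."""
    hits = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            hits.append(
                Hit(m["path"], m["name"], m["puid"], int(m["quality"]))
            )
    return hits


sample = """\
/tmp/00000007.tif identified as Tagged Image File Format with PUID fmt/353 (regex quality 1)
/tmp/00000007.tif identified as Geographic Tagged Image File Format (GeoTIFF) with PUID fmt/155 (regex quality 2)
"""

# List the hits ordered by the reported regex quality.
for hit in sorted(parse_hits(sample), key=lambda h: h.regex_quality):
    print(hit.puid, hit.regex_quality, hit.format_name)
```

With the hits in a structured form, it becomes easy to compare them against the result of another tool for the same file.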


Colorized output of possible signature hits in the hex editor wxHexEditor


On Linux you can use the hex editor wxHexEditor to analyze files. It lets you create tag files, in which you define sections of a file that are highlighted in color and annotated.

The script pronom2wxhexeditor creates such a tag file. The call looks like this:

perl -I lib bin/pronom2wxhexeditor.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif
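To give a rough idea of what happens here, the following Python sketch finds the byte ranges where a pattern matches and writes them as tags. It is not the Perl script's implementation. The TIFF magic numbers are used as a stand-in for a real PRONOM signature. The XML element names (`wxHexEditor_XML_TAG`, `TAG`, `start_offset` and so on), the inclusive end offset, and the colour values are assumptions based on how I remember wxHexEditor tag files; compare them with a tag file saved by the editor itself before relying on them.

```python
import re
import xml.etree.ElementTree as ET

# TIFF magic numbers: little-endian "II*\0" or big-endian "MM\0*",
# anchored at the beginning of the file.
TIFF_MAGIC = re.compile(rb"\A(?:II\x2a\x00|MM\x00\x2a)")


def find_spans(data, pattern):
    """Return (start, end) pairs for every match; end is inclusive."""
    return [(m.start(), m.end() - 1) for m in pattern.finditer(data)]


def build_tag_xml(path, spans, label):
    """Build a tag document for the given spans.

    Element names and colour format are ASSUMED, see the text above.
    """
    root = ET.Element("wxHexEditor_XML_TAG")
    file_el = ET.SubElement(root, "filename", path=path)
    for i, (start, end) in enumerate(spans):
        tag = ET.SubElement(file_el, "TAG", id=str(i))
        ET.SubElement(tag, "start_offset").text = str(start)
        ET.SubElement(tag, "end_offset").text = str(end)
        ET.SubElement(tag, "tag_text").text = label
        ET.SubElement(tag, "font_colour").text = "#000000"
        ET.SubElement(tag, "note_colour").text = "#FFFF00"
    return ET.tostring(root, encoding="unicode")


# First eight bytes of a little-endian TIFF header as sample data.
data = b"II*\x00\x08\x00\x00\x00"
spans = find_spans(data, TIFF_MAGIC)
print(build_tag_xml("/tmp/00000007.tif", spans, "TIFF magic number"))
```

The real script works from the signature file, so it can mark every part of every matching signature, which is what makes the colored view useful for debugging a wrong identification.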


What next?


Well, it's up to us as a community to use the existing tools and make the most of them to improve our daily work. Anyone who has suggestions for improvement or new ideas is welcome to share them with us.

I would be especially happy if some helpful souls would take a close look at the PRONOM statistics and help improve the PRONOM signatures.

It makes sense to start with the orphaned signatures and then to re-check the signatures that are used more than once.
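For anyone who wants a quick starting list without the Perl package, here is a small Python sketch. It assumes that "orphaned" means an `InternalSignature` that no `FileFormat` references, and that "used more than once" means a signature ID referenced by several `FileFormat` entries; the statistics in the module may define these terms differently. The function name `audit_signatures` and the PUIDs in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip the XML namespace, e.g. '{ns}FileFormat' -> 'FileFormat'."""
    return tag.rsplit("}", 1)[-1]


def audit_signatures(xml_text):
    """Return (orphaned signature IDs, {shared signature ID: [PUIDs]})."""
    root = ET.fromstring(xml_text)
    defined = set()
    used_by = defaultdict(list)
    for el in root.iter():
        name = local(el.tag)
        if name == "InternalSignature":
            defined.add(el.get("ID"))
        elif name == "FileFormat":
            for child in el:
                if local(child.tag) == "InternalSignatureID" and child.text:
                    used_by[child.text.strip()].append(el.get("PUID"))
    orphaned = sorted(defined - set(used_by), key=int)
    shared = {sig: puids for sig, puids in used_by.items() if len(puids) > 1}
    return orphaned, shared


# Tiny made-up signature file in the DROID layout (PUIDs are hypothetical).
sample = """\
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="0">
  <InternalSignatureCollection>
    <InternalSignature ID="1" Specificity="Specific"/>
    <InternalSignature ID="2" Specificity="Specific"/>
    <InternalSignature ID="3" Specificity="Specific"/>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Format A" PUID="x-fmt/9001">
      <InternalSignatureID>1</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="11" Name="Format B" PUID="x-fmt/9002">
      <InternalSignatureID>1</InternalSignatureID>
      <InternalSignatureID>2</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
"""

orphaned, shared = audit_signatures(sample)
print("orphaned:", orphaned)
print("shared:", shared)
```

To run it against a real file, read something like DROID_SignatureFile_V96.xml into a string and pass it to the function.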
