Tuesday, 30 May 2017

Bibtag - and a little something learned

Today I made a side trip to the Bibliothekartag 2017 in Frankfurt am Main - partly to meet a number of former fellow students, and partly because I was interested in the workshop on format identification given by Yvonne Tunnat of the ZBW.

Yvonne has a wonderfully pragmatic way of explaining complicated matters. If you want to get to know her: the nestor Praktikertag 2017 on format validation still has open places.

Two things I took away. First, I did not yet know the tool peepdf, a CLI program for dissecting PDF files that originally comes from the forensics community.

Second, there is Bad Peggy, a validation tool for analyzing JPEGs.

One discussion that keeps coming up is how to deal with unknown file formats. In my opinion they are not fit for archiving and should be treated as binary garbage. This deserves a longer article of its own, though, and a closer analysis of whether and under which conditions such files are negligible, or whether the long tail strikes.

By the way: if you are still at the Bibtag on Wednesday, have a look at the talk by our colleague Sabine on the results of PDF/A validation.

Tuesday, 16 May 2017

On the idea of measuring a long-term archive

OpenClipart by yves_guillou, see link in the image
At some point in any organization you reach the stage where you meet people who have devoted themselves to numbers - people who work as mathematicians, financial accountants or controllers. That is fine: invoices need to be paid, resources planned and funds provided.

Omnimetry


The encounter with the numbers people becomes problematic when they take over the steering of the organization - when everything revolves around key performance indicators, around throughput, around measurable output, around omnimetry.

Gunter Dueck already wrote in Wild Duck¹: "In our knowledge and service society there are more and more activities that so far cannot be measured in meters, kilograms or megabytes, because they have, as it were, a 'higher', in the broadest sense artistic touch. The working world has so far failed at standardizing higher principles."
  

Numbers don't lie


Let us look concretely at a digital long-term archive. With demands for key figures such as:
  • the number of files that go into the archive per month, or
  • the number of Submission Information Packages (SIPs) that come from certain workflows,
you demotivate a committed archive team.

Because these numbers say nothing. Even with automated workflows, digital long-term archives sit at the end of the exploitation chain. It would be roughly like trying to measure sausage sales by the number of visitors to the customer toilet.

In practice, Intellectual Entities (IEs) that are to be archived for the long term are sorted by their degree of archivability and their conformance with the archive's own format policies.

Those IEs that are considered valid are packed into Archival Information Packages (AIPs) and go into long-term storage. The IEs that are not fit for archiving end up in quarantine, where a Technical Analyst (TA) either works out a solution or rejects the transfer packages (SIPs) containing these IEs.

If we look at a largely homogeneous workflow, such as the long-term archiving of retrodigitized material, the largest share of IEs should reach long-term storage without problems. In that case it is tempting to simply count the IEs and the number and size of the associated files in order to obtain a statement about the archive's throughput and the performance of the preservation team.

Exception: the standard case


But this view ignores that it is not the standard case, where IEs are homogenized and moved into the archival system automatically, that is time-consuming, but the individual case, in which the TA has to work out why an IE is structured differently and how to find a matching solution.

Format knowledge


What the simple throughput view also ignores is that the archive team has to build up format knowledge for data and metadata formats that are unknown so far, or known only superficially. This learning process depends heavily on how well the formats are already documented and how complex their internal structures are.

An organizational process


A third point that management by omnimetry ignores is the insight, already formulated in the nestor handbook², that digital long-term preservation has to be an organizational process.

If, as in many memory institutions that produce retrodigitized material, digitization has been done for the stockpile and the preservation team only receives the resulting digital images one or two years later, it can hardly influence the producer of the digitized material when errors are found. The often project-based handling of digitization tasks by external service providers aggravates the problem further. What one would measure in such a case would in truth not be underperformance of the preservation team, but an expression of the organizational failure to consider the long-term availability of the digitized material from the very beginning.

Of course it is sensible to accompany the development of the archive with key figures. Storage has to be procured in time, bandwidth has to be provided. Here, too, good judgement and common sense apply.

¹ Gunter Dueck, Wild Duck -- Empirische Philosophie der Mensch-Computer-Vernetzung, Springer-Verlag Berlin-Heidelberg, (c)2008, 4th edition, p. 71
² Nestor Handbuch -- Eine kleine Enzyklopädie der digitalen Langzeitarchivierung, ed. Heike Neuroth et al., chapter 8 "Vertrauenswürdigkeit von digitalen Langzeitarchiven" by Susanne Dobratz and Astrid Schoger, http://nestor.sub.uni-goettingen.de/handbuch/artikel/text_84.pdf, p. 3

Saturday, 29 April 2017

FFV1 - some compression results

In a pilot we received some retrodigitized films and videos in Matroska/FFV1 format. The following tables summarize the results:


film/video         | 1                  | 2                  | 3                   | 4                     | 5
description        | 8mm, positive, b/w | 8mm, positive, b/w | 16mm, positive, b/w | 35mm, combined, color | 35mm, combined, color
width              | 2500               | 2500               | 2048                | 4096                  | 4096
height             | 1524               | 1524               | 1520                | 3460                  | 2976
bits per pixel     | 48                 | 48                 | 48                  | 48                    | 48
pxfmt              | gbrp16le           | gbrp16le           | gbrp16le            | gbrp16le              | gbrp16le
duration in s      | 12                 | 12                 | 11.459              | 2.5                   | 2.5
fps                | 24                 | 24                 | 24                  | 24                    | 24
frames             | 288                | 288                | 275                 | 60                    | 60
original size      | 6583680000         | 6583680000         | 5136682844.16       | 5101977600            | 4388290560
compressed size    | 3861943880         | 3790690517         | 3680779719          | 3908475344            | 3576745774
compression ratio  | 1.704              | 1.736              | 1.395               | 1.305                 | 1.226
(DPX size)         | 6584159232         | 6584159232         | 5136841600          | 5102077440            | 4388390400
(h264 lossless)    | n/a                | n/a                | n/a                 | n/a                   | n/a
(h265 lossless)    | 3573420309         | 3559442475         | 2756504247          | 3015053822            | 2992764833
(jp2k lossless)    | 4589886341         | 4534014321         | 3732555539          | 3869665916            | 3514687046
with audio         | n                  | n                  | n                   | y                     | n

film/video         | 6                     | 7           | 8              | 9              | 10
description        | 35mm, combined, color | vhs, color  | betacam, color | betacam, color | Digi-beta, color
width              | 4096                  | 720         | 720            | 720            | 720
height             | 3200                  | 576         | 576            | 576            | 576
bits per pixel     | 48                    | 20          | 20             | 20             | 20
pxfmt              | gbrp16le              | yuv422p10le | yuv422p10le    | yuv422p10le    | yuv422p10le
duration in s      | 1088.042              | 280         | 280            | 280            | 280
fps                | 24                    | 25          | 25             | 25             | 25
frames             | 26113                 | 7000        | 7000           | 7000           | 7000
original size      | 2053610510746         | 7257600000  | 7257600000     | 7257600000     | 7257600000
compressed size    | 1575415175611         | 3565437155  | 3838500934     | 3449372280     | 4451325952
compression ratio  | 1.303                 | 2.035       | 1.890          | 2.104          | 1.630
(DPX size)         | 2053653333632         | 17472217728 | 17429888000    | 17429888000    | 17429888000
(h264 lossless)    | n/a                   | n/a         | n/a            | n/a            | n/a
(h265 lossless)    | 1248031292634         | 3659828688  | 3772522257    | 3442739259     | 4323623225
(jp2k lossless)    | 1517117560575         | 3300899483  | 3470434177    | 3150727081     | 4022908822
with audio         | n                     | y           | y              | y              | y
All files are encoded with FFV1 version 3 with slices, slice CRCs and GOP size 1. If audio exists, it is included (linear PCM, 48 kHz, 16 bit) in the compressed size, but not in the original size, because the original size is calculated as width*height*bits_per_pixel*frames / 8, while the compressed size is simply the file size. The frame count is calculated from the duration value of the MKV files. Files 1 to 5 and 7 to 10 are the first parts of the respective movies (4 GB splits each).
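That arithmetic can be reproduced directly from the table values; a minimal sketch using film no. 1:

```python
def original_size(width, height, bits_per_pixel, frames):
    """Uncompressed payload in bytes: width * height * bits per pixel / 8 * frames."""
    return width * height * bits_per_pixel // 8 * frames

# film no. 1 from the table: 2500x1524 px, 48 bpp (gbrp16le), 288 frames
orig = original_size(2500, 1524, 48, 288)
ffv1 = 3861943880   # compressed size (= file size) from the table
print(orig)         # 6583680000, as in the table
print(orig / ffv1)  # ~1.7048; the table truncates ratios to three digits (1.704)
```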

Hint: Once the project is completed, rights must be clarified. If possible, I will publish the sources.

Update 2017-06-09

  • added file size for DPX after using "ffmpeg -i input.mkv DPX/frame_%06d.dpx"
  • added file size for h264 after using "ffmpeg -i input.mkv -c:v libx264 -g 1 -qp 0 -crf 0 output.mkv" (RGB without lossy conversion to YUV not supported yet)
  • added file size for h265 after using "ffmpeg -i input.mkv -c:v libx265 -preset veryslow -x265-params lossless=1 output.mkv"
  • added file size for openjpeg2000 after using "ffmpeg -i input.mkv -c:v libopenjpeg output.mkv"
Update 2017-06-29

  • added sizes for film no 6
  • in general, the processing time of h265 and jp2k is an order of magnitude greater than for ffv1

Interpretation


The files 1-3 are all originally b/w. It seems that the codec does not decorrelate the color channels. Also, the material 1-6 is retrodigitized from film and is noisy. File 1 is very special: when decoding it, FFV1 produces a very high CPU load (eight cores at 100%), and most of the decoding time is spent in the method get_rac(). The original film has the highest noise level of all the files.

I think the compression-ratio difference between the video and film files comes from the different pixel formats. A ratio between 1.5 and 2 was expected, but 1.3 is a surprise.

Update 2017-06-09

The reason for the high CPU load was that the digitization service provider had created a file with a frame rate of 1000 fps, while the scanner had provided 24 or 25 fps. Therefore runs of 40 to 42 identical frames were encoded.
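The duplication factor follows directly from the frame rates; a quick sanity check, assuming a 1000 fps container around a 24 or 25 fps scan:

```python
CONTAINER_FPS = 1000

# each real scanner frame is repeated until the container timeline catches up
for scanner_fps in (24, 25):
    duplicates = CONTAINER_FPS / scanner_fps
    print(scanner_fps, duplicates)  # 24 -> ~41.7 duplicates, 25 -> 40.0 duplicates
```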



Thursday, 30 March 2017

Nestor - DIN workshop "Digitale Langzeitarchivierung", a review

Yesterday a workshop of the competence network for digital preservation, nestor, and DIN took place at the premises of DIN e.V. This is only a small summary for those who stayed at home and makes no claim to being an objective, let alone complete, record :)
If you spot errors, please send an email with corrections ;)

Work of the NID 15 committee


At its core the workshop was about the question: which standards do we want to have in digital preservation in the next 5-8 years, and how do we get there?

Prof. Keitel opened the workshop with this question and then sketched the starting position of 2005:

  • the abstract topic of "digital archiving"
  • DIN 31646/31644/31645, originating from the nestor orbit
  • DIN 31647 "preservation of evidence for cryptographically signed documents"
  • whether the standards are actually used in practice is difficult to tell from the feedback
  • they refer to OAIS (ISO 14721)
  • they show whether one is still moving within the bounds of digital preservation.

Today, practical experience complements these early theoretical considerations. The question is therefore whether there are areas where the initial assumptions have since become outdated.

According to Prof. Keitel, the tasks now are
  • to distill focus areas that lend themselves to standardization
  • to find people who want to get involved in standardization work in the new fields

Whether one is suited for standardization work can, tongue in cheek, be judged by the following criteria (quote):
  • I can sit on a chair for a long time
  • I like improving other people's writing
  • I take precise terminological definitions deadly seriously and make no compromises
  • I enjoy reading documents with titles like...
Afterwards, the difficulty of getting feedback on existing DIN standards was discussed.

PDF standardization


Olaf Drümmer of callas software GmbH began by sketching the history of PDF and pointed to the new version 2:

  • 1993-2006 Adobe PDF 1.0 -> 1.7
  • 2008 ISO: PDF 1.7 as ISO 32000-1
  • 2017 ISO: PDF 2.0 as ISO 32000-2 (due next quarter, >1000 pages)
    • new cryptographic methods
    • tagging reworked
    • colors were a problem area in the standardization process
    • namespaces were introduced, e.g. to embed tags from HTML 5
He then went into the PDF specializations:

  • 2001 PDF/X, delivery of print masters
  • 2005 PDF/A, archiving, ISO 19005 series
    • arose from the needs of the US courts and the Library of Congress
  • 2008 PDF/E, ISO 24517, engineering (CAD), not yet widely adopted; 3D models to follow at the end of the year
  • 2010 PDF/VT ISO 16612-2 + PDF/VCR ISO 16612-3, variable data printing (high-volume invoices, form letters)
  • 2012 PDF/UA, ISO 14289 series, accessibility
According to Drümmer, the importance of standardization already follows from the sheer spread of PDF documents:
  • number of PDF documents worldwide: at least trillions (10¹²), 6 million at the US courts alone
  • life expectancy per PDF: hours to years
He then addressed the challenge of variant diversity:
  • PDF/X: 8 parts, 12 conformance levels in total
  • PDF/A series: 3 parts, 8 conformance levels in total
  • confusing, lacking selectivity?
  • flexibility and expressive power
  • open character
  • broad coverage
He then sketched how standardization is to continue from 2017:
  • PDF 2.0 is largely backwards compatible; no validation of publications is planned
  • the project "Camelot2" is meant to bring the classic PDF document world and the Open Web Platform together; more at the PDF Days Europe 2017, Berlin, 15-16 May 2017
  • goal for PDF/A-4: no conformance levels
  • PDF/E allows interactive elements (JS); PDF/E-2 is to become more of an archival profile and less of a working-document profile
  • XMP can be attached at *any* place in a PDF, so that sources or e.g. UUIDs can be stored there
  • PDF/A-3 can also store alternative links to the content of arbitrary files; problem: this is not mandatory and has to be regulated via policy

nestor


Prof. Keitel briefly sketched the work of nestor:

  •  …is in any case a cooperation network
  • presented the working groups

Trustworthy archives

  • 2004-2008 nestor criteria catalogue
  • 2008-2012 DIN 31644
  • 2013-… nestor seal

Submission Information Packages - revising the ingest standards


In an impulse talk, Dr. Sina Westphal and Dr. Sebastian Gleixner (German Federal Archives) proposed standardizing the ingest process and the SIPs.

  • Bundesarchiv: 4 PB/year of growth
  • an incentive for the gradual alignment of systems
  • unified metadata
  • improved data exchange
  • unified interfaces
Consequences:
  • unification of existing SIPs (possibly also AIPs/DIPs)
  • unification of existing digital archival systems

Two sub-areas:
  • standardization of the SIP (concrete)
    • structure
    • metadata
    • primary data
    • cf. E-ARK, e-CH, EMEA
  • standardization of the ingest process (abstract)
    • connection to the cataloguing tool
    • validation
    • ingest
    • handling of primary data

Questions:
  • Is unification possible?
  • Is standardization of AIPs/DIPs and the associated processes necessary?

A discussion about scope and concrete exchange procedures followed, with the following results:

  • the trend goes towards abstract module descriptions
  • a conceptual framework is desired
  • a definition of which modules are mandatory and which are optional
  • a recommended entry point for automation

Video archiving as a new challenge: long-term preservation of audiovisual media beyond film and television


In this impulse talk, Alfred Werner (HUK Coburg) sketched the problems of long-term archiving of videos.

  • bandwidth at branch offices: 5-15 MBit/s
  • documents are converted to monochrome multipage TIFF (small files) and to JPG
  • videos are welcome:
    • 2011: 5 videos/day
    • 2016: 20 videos/day (compared to 10,000 claims per day)
    • 2021: 100? 1000? videos/day
  • dashcam videos are allowed since this year

Problem: the widest variety of formats, and the trend is rising; it is not getting better (3D, HDR, 4k, dual lenses, special sensors)

Possible solution: conversion into a long-term archival format for videos

Requirements:
  • a standard for the next 50 years
  • license-free
  • best possible quality
  • low storage footprint
  • good response times even at low bandwidth

Plus functions for claims handlers, such as: zooming, setting jump marks, extracting single frames, redacting, extracting scenes.

The subsequent discussion made the tension clear between robustness and faithful reproduction on the one hand and resource demands (storage, bandwidth, processing time) on the other.

Note: a complementary contribution on this topic was posted on the nestor mailing list.


Digital Curation


Here, too, Prof. Keitel gave an impulse talk. I hope I can reproduce its content correctly:

Difference between data curation and long-term preservation according to OAIS: we no longer talk about institutions/organizations, but about techniques - that is, the organizational responsibility is missing.

OAIS goes records management, i.e. how can the requirements of digital preservation be brought to the producers (through digital curation); the AIP then effectively resides with the producer.
How do the preservation functions named by OAIS/PREMIS harmonize with the rules of records management? Which elements/groups do we have to distinguish for preservation reasons?

Keitel: "So far we always assumed a caretaker who preserves things permanently. Digital curation starts earlier, with the producer."

Summary


From our point of view, an attempt should be made to standardize the ingest more thoroughly. Only then would it be possible to give producers tools that are not archive-specific. The road there is steep, especially since archives and libraries alone are already taking very different paths.

PDF is and unfortunately remains a minefield. PDF 2.0 neither resolves existing ambiguities nor simplifies the standard. The lack of an official validation is likely to prove particularly detrimental. In addition, the format zoo around PDF keeps growing, and mixed forms of documents are possible, i.e. a PDF can be both PDF/E and PDF/A.

The need for video formats suitable for long-term preservation exists. Standardization could help push support by vendors. The video topic made it clear that digital preservation causes costs that are not easy to communicate. Data compression, especially lossy compression, leads to a higher risk of damage from bit errors. The discussion about the tension between robustness/quality and cost has to be held within the community, but it belongs outside of standardization efforts.

Data curation is a separate story. Gaps arise when documents have life cycles of several decades. My gut feeling tells me that this, too, can be subsumed under long-term availability, since in long-term preservation we want to keep documents usable for an indefinite time. Data curation thus seems to me to be nothing more than the special case in which producer and archive coincide as roles.

Monday, 13 February 2017

Where have all the standards gone? A singalong for archivists.

Recently, we noticed that the specification for the TIFF 6 file format has vanished from Adobe's website, where it was last hosted. As you might know, Adobe owns TIFF 6 due to legal circumstances created by the acquisition of Aldus in 1994.

Up until now, we used to rely on the fact that TIFF is publicly specified by the document that was always available. However, since Adobe has taken down the document, all we have left are the local copies on our workstations, and we only have those out of pure luck. The link to http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf has been dead for several months now.

This made us think about the standards and specifications themselves. We've always said, half jokingly, that we would have to preserve the standard documents in our repositories as well if we wanted to do our jobs right. We also thought that this would never actually be necessary. Boy, were we wrong.

We're now gathering all the standard and specification documents for the file formats that we are using and that we are planning to use. These documents will then be ingested into the repository using separate workflows to keep our documents apart from the actual repository content. That way, we hope to have all documents at hand even if they vanished from the web.

From our new perspective, we urge all digital repositories to take care of not only their digital assets, but also of the standard documents they are using.

The TIFF user community just recently had to take a major hit when the domain owners of http://www.remotesensing.org/libtiff/ lost control of their domain, thus making libtiff and the infrastructure around it unavailable for several weeks. Even though libtiff is now available again at its new home (http://libtiff.maptools.org), we need to be aware that even widely available material might become unavailable from one day to the next.


Friday, 20 January 2017

repairing TIFF images - a preliminary report

During two years of operation, more than 3,000 ingests have been piling up in the Technical Analyst's workbench of our digital preservation software. The vast majority of them have been singled out by the format validation routines, indicating a problem with the standard compliance of these files. One can easily see that repairing these files is a lot of work which, because the repository software doesn't support batch operations for TIFF repairs, would require months of repetitive tasks. Being IT personnel, we did the only sane thing we could think of: let the computer take care of it. We extracted the files from our repository's working directory, copied them to a safe storage area and ran an automated repair routine on those files. In this article, we want to go into some detail about how much of an effort repairing a large corpus of inhomogeneously invalid TIFFs actually is, which errors we encountered and which tools we used to repair them.

So, let's first see how big our problem actually is. The Technical Analyst's workbench contains 3,101 submission information packages (SIPs), each of them containing exactly one Intellectual Entity (IE). These SIPs contain 107,218 TIFF files, adding up to a grand total of about 1.95 TB of storage. That's an average of 19.08 MB per TIFF image.
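As a side note, the average works out if TB and MB are read as binary units (TiB/MiB); a quick check of the numbers above:

```python
total_bytes = 1.95 * 2**40   # "1.95 TB" read as binary units (TiB)
tiff_count = 107218
avg_mib = total_bytes / tiff_count / 2**20
print(round(avg_mib, 2))     # ~19.07; the 19.08 MB above follows from the unrounded total
```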

While the repository software does show an error message for invalid files in the WebUI, these messages cannot be extracted automatically, making them useless for our endeavour. Moreover, our preservation repository uses JHove's TIFF-hul module for TIFF validation, which cannot be modified to accommodate local validation policies. We use a policy that is largely based on Baseline TIFF, including a few extensions. To validate TIFFs against this policy (or any other policy you can think of, for that matter), my colleague Andreas has created the tool checkit_tiff, which is freely (free as in free speech AND free beer) available on GitHub for anyone to use. We used this tool to validate our TIFF files and single out those that didn't comply with our policy. (If you are interested: we used the policy as configured in the config file cit_tiff6_baseline_SLUB.cfg, which covers the conditions described in the German document http://www.slub-dresden.de/ueber-uns/slubarchiv/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten/langzeitarchivfaehige-dateiformate/handreichung-tiff/ as published on 2016-06-08.)

For the correction operations, we used the tool fixit_tiff (also created by Andreas and freely available), the tools tiffset and tiffcp from the libtiff suite, and convert from ImageMagick. All of the operations ran on a virtual machine with two 2.2 GHz CPUs and 3 GB RAM on a recent and fairly minimal Debian 8 installation. The storage was mounted via NFS 3 from a NetApp enterprise NAS system connected via 10 GBit Ethernet. Nevertheless, we only got around 35 MB/s throughput during copy operations (and, presumably, also during repair operations), which we'll have to investigate further in the future.

The high-level algorithm for the complete repair task was as follows:
  1. copy all of the master data from the digital repository to a safe storage for backup
  2. duplicate that backup data to a working directory to run the actual validation/repair in
  3. split the whole corpus into smaller chunks of 500 SIPs to keep processing times low and be able to react if something goes wrong
  4. run repair script, looping through all TIFFs in the chunk
    1. validate a tiff using checkit_tiff
    2. if TIFF is valid, go to next TIFF (step 4), else continue (try to repair TIFF)
    3. parse validation output to find necessary repair steps
    4. run necessary repair operations
    5. validate the corrected tiff using checkit_tiff to detect errors that haven't been corrected
    6. recalculate the checksums for the corrected files and replace the old checksums in the metadata with the new ones
  5. write report to log file
  6. parse through the report log to identify unsolved problems, create repair recipes for those and/or enhance fixit_tiff
  7. restore unrepaired TIFFs from backup, rerun repair script
  8. steps 4-7 are run until only those files are left that cannot be repaired in an automatic workflow
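The steps above can be sketched as a driver loop. Note that `validate` and `repair` below are illustrative stand-ins for a checkit_tiff run and for the matching fixit_tiff/tiffset/convert recipe, not the real command-line interfaces:

```python
def repair_corpus(tiffs, validate, repair, max_rounds=3):
    """Driver loop for steps 4-8: validate, repair, re-validate.

    validate(path) returns a list of policy violations (empty = compliant);
    repair(path, error) tries the recipe matching one reported error.
    Files that still fail after max_rounds are returned so they can be
    restored from backup and handled manually (step 7).
    """
    unrepaired = []
    for tiff in tiffs:
        for _ in range(max_rounds):
            errors = validate(tiff)
            if not errors:
                break
            for error in errors:
                repair(tiff, error)   # parse the message, pick a recipe, run it
        if validate(tiff):            # still invalid: out of automatic options
            unrepaired.append(tiff)
    return unrepaired

# tiny demo with stubbed tools: one repairable file, one hopeless one
faults = {"a.tif": ["tag 306 (DateTime) malformed"], "b.tif": ["unknown vendor damage"]}
fake_validate = lambda f: list(faults[f])
def fake_repair(f, err):
    if "DateTime" in err:             # the only recipe that exists in this demo
        faults[f].remove(err)

print(repair_corpus(["a.tif", "b.tif"], fake_validate, fake_repair))  # ['b.tif']
```

Checksum recalculation and metadata updates (step 4.6) are left out of the sketch; they would run right after a successful re-validation.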
During the several iterations of validation, failed correction and enhancements to the repair recipes, we found the following correctable errors. Brace yourself, it's a long list. Feel free to scroll past it for more condensed information.
  • "baseline TIFF should have only one IFD, but IFD0 at 0x00000008 has pointer to IFDn 0x<HEX_ADDRESS>"
    • This is a multipage TIFF with a second Image File Directory (IFD). Baseline TIFF requires only the first IFD to be interpreted by baseline TIFF readers.
  • "Invalid TIFF directory; tags are not sorted in ascending order"
    • This is a violation of the TIFF6 specification, which requires that TIFF tags in an IFD must be sorted ascending by their respective tag number.
  • "tag 256 (ImageWidth) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag is required by the baseline TIFF specification, but wasn't found in the file.
  • "tag 257 (ImageLength) should have value , but has value (values or count) was not found, but requested because defined"
    • Same here.
  • "tag 259 (Compression) should have value 1, but has value X"
    • This is a violation of our internal policy, which requires that TIFFs must be stored without any compression in place. Values for X that were found are 4, 5 and 7, which are CCITT T.6 bi-level encoding, LZW compression and TIFF/EP JPEG baseline DCT-based lossy compression, respectively. The latter one would be a violation of the TIFF6 specification. However, we've noticed that a few files in our corpus were actually TIFF/EPs, where Compression=7 is a valid value.
  • "tag 262 (Photometric) should have value <0-2>, but has value (values or count) 3"
    • The pixels in this TIFF are color map encoded. While this is valid TIFF 6, we don't allow it in the context of digital preservation.
  • "tag 262 (Photometric) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 269 (DocumentName) should have value ^[[:print:]]*$, but has value (values or count) XXXXX"
    • The field is of ASCII type, but contains characters that are not from the 7-Bit ASCII range. Often, these are special characters that are specific to a country/region, like the German "ä, ö, ü, ß".
  • "tag 270 (ImageDescription) should have value word-aligned, but has value (values or count) pointing to 0x00000131 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Make tag is empty, even though the specification requires it to contain a string with the manufacturer's name.
  • "tag 271 (Make) should have value ^[[:print:]]*$, but has value (values or count) Mekel"
    • That's a special case where scanners from the manufacturer Mekel write multiple NULL bytes ("\0") at the end of the Make tag, presumably for padding. This, however, violates the TIFF6 specification.
  • "tag 272 (Model) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Model tag is empty, even though the specification requires it to contain a string with the scanner device's name.
  • "tag 273 (StripOffsets) should have value , but has value (values or count) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 278 (RowsPerStrip) should have value , but has value (values or count) was not found, but requested because defined"
    • Same here.
  • "tag 278 (RowsPerStrip) should have value , but has value (values or count) with incorrect type: unknown type (-1)"
    • This error results from the previous one: if a field doesn't exist, then checkit_tiff will assume data type "-1", knowing that this is no valid type in the real world.
  • "tag 278 (RowsPerStrip) was not found, but requested because defined"
    • The tag isn't present at all, even though it's required by the TIFF6 specification.
  • "tag 279 (StripByteCounts) should have value , but has value (values or count) was not found, but requested because defined"
    • The field doesn't contain a value, which violates the TIFF6 specification.
  • "tag 282 (XResolution) should have value word-aligned, but has value (values or count) pointing to 0x00000129 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 292 (Group3Options) is found, but is not whitelisted"
    • As compression is not allowed in our repository, we disallow this field that comes with certain compression types as well.
  • "tag 293 (Group4Options) is found, but is not whitelisted"
    • Same here.
  • "tag 296 (ResolutionUnit) should have value , but has value"
    • The tag ResolutionUnit is a required field and is set to "2" (inch) by default. However, if the field is completely missing (as was the case here), this is a violation of the TIFF6 specification.
  • "tag 296 (ResolutionUnit) should have value , but has value (values or count) with incorrect type: unknown type (-1)"
    • This error results from the previous one: if a field doesn't exist, then checkit_tiff will assume data type "-1", knowing that this is no valid type in the real world.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=0"
    • The TIFF6 specification states that: "If PageNumber[1] is 0, the total number of pages in the document is not available.". We don't allow this in our repository by local policy.
  • "tag 306 (DateTime) should have value ^[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] [012][0-9]:[0-5][0-9]:[0-6][0-9]$, but has value (values or count) XXXXX"
    • That's one of the most common errors. It's utterly unbelievable how many software manufacturers don't manage to comply with the very clear rules of how the DateTime string in a TIFF needs to be formatted. This is a violation of the TIFF6 specification.
  • "tag 306 (DateTime) should have value should be  "yyyy:MM:DD hh:mm:ss", but has value (values or count) of datetime was XXXXX"
    • Same here.
  • "tag 306 (DateTime) should have value word-aligned, but has value (values or count) pointing to 0x00000167 and is not word-aligned"
    • The TIFF6 specification requires IFD tag fields to always start at word boundaries, but this field does not, thus violating the specification.
  • "tag 315 (Artist) is found, but is not whitelisted"
    • The tag Artist may contain personal data and is forbidden by local policy.
  • "tag 317 (Predictor) is found, but is not whitelisted"
    • The tag Predictor is needed for encoding schemes that are not part of the Baseline TIFF6 specification, so we forbid it by local policy.
  • "tag 320 (Colormap) is found, but is not whitelisted"
    • TIFFs with this error message contain a color map instead of being encoded as bilevel/greyscale/RGB images. This is something that is forbidden by policy, hence we need to correct it.
  • "tag 339 (SampleFormat) is found, but is not whitelisted"
    • This tag is forbidden by local policy.
  • "tag 33432 (Copyright) should have value ^[[:print:]]*$, but has value (values or count)"
    • The Copyright tag may only contain characters from the 7-bit ASCII range. TIFFs that violate this rule from the TIFF6 specification will throw this error.
  • "tag 33434 (EXIF ExposureTime) is found, but is not whitelisted"
    • EXIF tags may never be referenced from IFD0, but always from their own ExifIFD. As this apparently was not done here, it has to be seen as a violation of the TIFF6 specification.
  • "tag 33437 (EXIF FNumber) is found, but is not whitelisted"
    • Same here.
  • "tag 33723 (RichTIFFIPTC / NAA) is found, but is not whitelisted"
    • This tag is not allowed by local policy.
  • "tag 34665 (EXIFIFDOffset) should have value , but has value"
    • In all cases that we encountered, the tag EXIFIFDOffset was set to the wrong type. Instead of being of type 4, it was of type 13, which violates the TIFF specification.
  • "tag 34377 (Photoshop Image Ressources) is found, but is not whitelisted"
    • This proprietary tag is not allowed by local policy.
  • "tag 34675 (ICC Profile) should have value pointing to valid ICC profile, but has value (values or count) preferred cmmtype ('APPL') should be empty or (possibly, because ICC validation is alpha code) one of following strings: 'ADBE' 'ACMS' 'appl' 'CCMS' 'UCCM' 'UCMS' 'EFI ' 'FF  ' 'EXAC' 'HCMM' 'argl' 'LgoS' 'HDM ' 'lcms' 'KCMS' 'MCML' 'WCS ' 'SIGN' 'RGMS' 'SICC' 'TCMM' '32BT' 'WTG ' 'zc00'"
    • This is a juicy one. This error message indicates that something's wrong with the embedded ICC profile. In fact, the TIFF itself might be completely intact, but the ICC profile has the value of the cmmtype field set to a value that is not part of the controlled vocabulary for this field, so the ICC standard is violated.
  • "tag 34852 (EXIF SpectralSensitivity) is found, but is not whitelisted"
    • EXIF tags may never be referenced out of IFD0, but always out of their own ExifIFD. 
  • "tag 34858 (TimeZoneOffset (TIFF/EP)) is found, but is not whitelisted"
    • TIFF/EP tags are not allowed in plain TIFF6 images.
  • "tag 36867 (EXIF DateTimeOriginal) is found, but is not whitelisted"
    • EXIF tags may never be referenced out of IFD0, but always out of their own ExifIFD.
  • "tag 37395 (ImageHistory (TIFF/EP)) is found, but is not whitelisted"
    • Same here.
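As an aside, the DateTime rule behind the most common error is simple enough to check outside of checkit_tiff as well. The following Python sketch is a hypothetical helper written for this post, not part of the tool; it applies the same regular expression that is quoted in the tag 306 error message above:

```python
import re

# The pattern quoted by checkit_tiff for tag 306 (DateTime); the TIFF6
# specification mandates the 19-character layout "YYYY:MM:DD HH:MM:SS".
DATETIME_RE = re.compile(
    r"^[12][901][0-9][0-9]:[01][0-9]:[0-3][0-9] "
    r"[012][0-9]:[0-5][0-9]:[0-6][0-9]$"
)

def is_valid_tiff_datetime(value: str) -> bool:
    """Return True if value follows the TIFF6 DateTime layout."""
    return DATETIME_RE.match(value) is not None

print(is_valid_tiff_datetime("2017:05:30 14:23:59"))  # True
print(is_valid_tiff_datetime("30.05.2017 14:23"))     # False
```

Anything a camera or scanner writes into this field that doesn't match, e.g. German-style dates or missing seconds, triggers the error above.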
Some of the errors, however, could not be corrected by means of an automatic workflow. These images will have to be rescanned from their respective originals:
  • "tag 282 (XResolution) should have value <300-4000>, but has value (values or count) 200, 240, 273, 72"
    • This tag contains a value for the image's horizontal resolution that is too low for what is needed to comply with the policy. In this special case, that policy is not our own, but the one stated in the German Research Foundation's (Deutsche Forschungsgemeinschaft, DFG) "Practical Guidelines for Digitisation" (DFG-Praxisregeln "Digitalisierung", document in German, http://www.dfg.de/formulare/12_151/12_151_de.pdf), where a minimum of 300 dpi is required for digital documents that were scanned from an analog master and are intended for close examination. 1.717 files contained this error.
  • "tag 283 (YResolution) should have value <300-4000>, but has value (values or count) 200, 240, 273, 72"
    • Same here, but for the vertical resolution.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=2"
    • This error message indicates that the TIFF has more than one page (in this case two master images), which is forbidden by our internal policy. Five images contained this error.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=3"
    • Same here. One image contained this error.
  • "tag 297 (PageNumber) should have value at [1]=1, but has value (values or count) at [1]=5"
    • Same here. One image contained this error.
  • "TIFF Header read error3: Success"
    • This TIFF was actually broken, had a file size of only 8 Bytes and was already defective when it was ingested into the repository. One image contained this error.
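The resolution check behind the first two messages is easy to illustrate. XResolution and YResolution are stored as TIFF RATIONALs (a numerator and a denominator); a minimal sketch of a policy check against the <300-4000> dpi window from the error messages could look like this (the function name is an assumption for illustration, not checkit_tiff code):

```python
from fractions import Fraction

def resolution_in_policy(numerator: int, denominator: int,
                         lo: int = 300, hi: int = 4000) -> bool:
    """Check a TIFF RATIONAL resolution against the <300-4000> dpi window."""
    if denominator == 0:      # an invalid RATIONAL can never pass
        return False
    dpi = Fraction(numerator, denominator)
    return lo <= dpi <= hi

# 72 dpi screen-resolution scans fail the DFG minimum and must be rescanned:
print(resolution_in_policy(300, 1))  # True
print(resolution_in_policy(72, 1))   # False
```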
From our experiences, Andreas has created eight new commits for fixit_tiff (commits f51f71d to cf9b824) that made fixit_tiff more capable and more independent of libtiff, which contained quite a few bugs and sometimes even created problems in corrected TIFFs that hadn't existed before. He also improved checkit_tiff to vastly increase performance (by 3-4 orders of magnitude) and helped build correction recipes.

The results are quite stunning and saved us a lot of work:
  • Only 1.725 out of 107.218 TIFF files have not been corrected and will have to be rescanned. That's about 1,6% of all files. All other files were either correct from the beginning or have successfully been corrected.
  • 26 out of 3.103 SIPs still have incorrect master images in them, which is a ratio of 0,8%.
  • 11 new correction recipes have been created to fix a total of 41 errors (as listed above).
  • The validation of a subset of 6.987 files just took us 37m 46s (= 2.266 seconds) on the latest checkit_tiff version, which is a rate of about 3,1 files/sec. At this speed, checking all 107.218 files would theoretically take approximately 9,7 hours. However, this version hasn't been available during all of the correction, so the speed was drastically lower in the beginning. We think that 24 - 36 hours would be a more accurate estimate.
  • UPDATE: After further improvements in checkit_tiff (commit 22ced80), checking 87.873 TIFFs took only 51m 53s, which is 28,2 TIFFs per second (yes, that's 28,2 Hz!), marking a ninefold improvement over the previous version for this commit alone. With this new version, we can validate TIFFs at a stable speed, independent of their actual file size, meaning that we get TIFF validation practically for free (compared to the effort for things like MD5 calculation).
  • 10.774 out of 107.218 TIFF files were valid from the start, which is almost exactly 10%.
The pie chart shows our top ten errors as extracted from all validation runs. The tag IDs are color coded.


This logarithmically scaled graph shows an assembly of all tags that had any errors, regardless of their nature. The X-axis is labelled with the TIFF tag IDs, and the data itself is labeled with the number of error messages for their respective tag IDs.


Up until now, we've invested 26 person days in this matter (not counting script run times, of course); however, we haven't finished yet. Some steps are still missing before the SIPs can actually be transferred to the permanent storage. First of all, we will revalidate all of the corrected TIFFs to make sure that we haven't made any mistakes while moving corrected data out of the way and replacing it with yet-to-correct data. When this step has completed successfully, we'll reject all of the SIPs from the Technical Analyst's workbench in the repository and re-ingest them. We hope that there won't be any errors this time, but we assume that some will come up and brace for the worst. Also, we'll invest some time in generating statistics. We hope that this will enable us to make qualified estimates of the costs of repairing TIFF images, of the number of images affected by a certain type of error, and of the total quality of our production.

A little hint for those of you that want to try this at home: make sure you run the latest checkit_tiff compliance checker with the "-m" option set to enable memory-mapped operation and get drastically increased performance, especially during batch operation.
For the purpose of analysing TIFF files, checkit_tiff comes with a handy "-c" switch that enables colored output, so you can easily spot any errors on the text output.

I want to use the end of this article for a few words of warning. On the one hand, we have shown that we are capable of successfully repairing large amounts of invalid or non-compliant files in an automatic fashion. On the other hand, however, this sets a dangerous precedent for all the people who don't want to make the effort to increase quality as early as possible during production, because they find it easier to let others fix their sloppy output. Please, dear digital preservation community, always demand the highest quality from your producers. It's nothing less than your job, and it's for their own good.

Dienstag, 22. November 2016

Some thoughts about risks in the TIFF file format

Introduction

TIFF in general is a very simple file format. It starts with a constant header entry, which indicates that the file is a TIFF and how it is encoded (byte order).
The header contains an offset entry which points to the first image file directory (IFD). Each IFD has a field which counts the number of associated tags, followed by an array of these tags and an offset entry that points to the next IFD or to zero, which means there is no further IFD.
Each tag in the array is 12 bytes long. The first 2 bytes indicate the tag itself, the next 2 bytes declare the value type, followed by 4 bytes counting the values. The last 4 bytes either hold an offset or the values themselves.
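This layout fits into a few lines of Python. The following reader is a simplified sketch for illustration only (it assumes a little-endian file and reads just the first IFD); it is not the parser used by checkit_tiff:

```python
import struct

def parse_ifd0(data: bytes):
    """Parse the header and first IFD of a little-endian TIFF.

    Returns a list of (tag, type, count, value_or_offset) entries.
    Real TIFFs may also be big-endian ('MM') and chain several IFDs.
    """
    byteorder, magic, ifd0_offset = struct.unpack_from("<2sHI", data, 0)
    assert byteorder == b"II" and magic == 42, "not a little-endian TIFF"
    (n_tags,) = struct.unpack_from("<H", data, ifd0_offset)
    entries = []
    for i in range(n_tags):
        # Each IFD entry is 12 bytes: tag (2), type (2), count (4),
        # value or offset (4).
        entries.append(struct.unpack_from("<HHII", data,
                                          ifd0_offset + 2 + 12 * i))
    return entries

# A synthetic two-tag TIFF, just enough to exercise the parser:
tiff = (b"II" + struct.pack("<HI", 42, 8)     # header, IFD0 at offset 8
        + struct.pack("<H", 2)                # 2 entries in IFD0
        + struct.pack("<HHII", 256, 3, 1, 1)  # ImageWidth = 1
        + struct.pack("<HHII", 257, 3, 1, 1)  # ImageLength = 1
        + struct.pack("<I", 0))               # no further IFD

for tag, typ, count, value in parse_ifd0(tiff):
    print(tag, typ, count, value)
```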

What makes a TIFF robust?

In the TIFF specification, there are some hints which help us to repair broken TIFFs.
The first hint is that all offset addresses must be even. The second important rule is that the tags in an IFD must be sorted in ascending order.
Lastly, the TIFF spec defines different areas in the tag range. This guarantees that the important values are well defined.
If we can guarantee that a valid TIFF was stored, there is a good chance of detecting and repairing broken TIFFs using these three hints.
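The first two hints translate directly into plausibility checks. A minimal sketch (hypothetical helper functions written for this post, not the actual fixit_tiff code):

```python
def offsets_word_aligned(offsets):
    # Hint 1: every offset address in a TIFF must be even (word-aligned).
    return all(offset % 2 == 0 for offset in offsets)

def tags_ascending(entries):
    # Hint 2: the tags of an IFD must be sorted in ascending order.
    # Each entry is a (tag, type, count, value_or_offset) tuple.
    tags = [tag for tag, _type, _count, _value in entries]
    return tags == sorted(tags)

# An IFD whose tags appear out of order is a strong hint of corruption:
print(tags_ascending([(256, 3, 1, 1), (257, 3, 1, 1)]))  # True
print(tags_ascending([(273, 4, 1, 8), (256, 3, 1, 1)]))  # False
print(offsets_word_aligned([8, 0x129]))                  # False, 0x129 is odd
```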

What are the caveats of TIFF?

As a proof of concept, there is also a tool "checkit_tiff_risks" provided in this repository. Using this tool, users can analyze the layout of any baseline TIFF file.
The riskiest memory ranges are the offsets. If a bitflip occurs there, the user must search the complete 4 GB address range. In practice, TIFF files are smaller, so the file size limits the search space for offsets.
The riskiest offsets are the indirect ones, namely the IFD0 offset and the StripOffset tag (code 273).
Here is an example of a possible complex StripOffset encoding:
The problem in this example is that TIFF has no way to find out how many bytes are part of the pixel-data stream. The existing StripByteCounts tag only stores the expected pixel data length after decompression.
This makes the StripOffset tag very fragile. If a bitflip changes the offset of the StripOffset tag, the whole pixel information might be lost.
Also, if a bitflip occurs in the offset area that the StripOffset tag points to, the pixel data of the affected strip is partially lost.
If compression is used, the risk of losing the whole picture is even higher, because the compression methods do not use an end-symbol. Instead, the buffer sizes as stored in the StripByteCount tag are used. Therefore, a bit-error in the Compression tag, the StripOffset tag, the StripByteCount tag or in the memory-map where StripOffset points to, could destroy the picture information.
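To illustrate why a flipped bit in an offset is so destructive, consider this tiny sketch (the values are purely illustrative): a single-bit error in a 32-bit offset silently redirects the reader far away from the real data, and nothing in the file format flags the damage.

```python
# A single bitflip in a 32-bit offset moves a pointer by up to 2 GiB.
offset = 0x00000129           # an illustrative offset value
flipped = offset ^ (1 << 20)  # flip one single bit
print(hex(flipped))           # now points about 1 MiB further into the file
```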

Upcoming next…


In upcoming versions of checkit_tiff, we will provide a tool to analyze the distribution of risky offsets in given TIFF files. This will put the discussion about robust file formats vs. compression on an objective footing.

Here is a short preview:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid.tiff

This reports the following statistics:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01238
[02], type=                             ifd, bytes=       130, ratio=0.40248
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01238
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01238
[05], type= ifd_embedded_standardized_value, bytes=        52, ratio=0.16099
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        12, ratio=0.03715
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=         0, ratio=0.00000
[12], type=               stripoffset_value, bytes=        30, ratio=0.09288
[13], type=              standardized_value, bytes=        87, ratio=0.26935
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 323 bytes, size: 323 bytes


In this example, the StripOffset is encoded directly (there is only one strip). The problematic bytes are the offset addresses (20 of the 323 bytes are affected).

In contrast to this example, here is a special file using multiple strips:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid_multiple_stripoffsets.tiff

This reports the following statistics:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01250
[02], type=                             ifd, bytes=       122, ratio=0.38125
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01250
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01250
[05], type= ifd_embedded_standardized_value, bytes=        44, ratio=0.13750
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        16, ratio=0.05000
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=        40, ratio=0.12500
[12], type=               stripoffset_value, bytes=        30, ratio=0.09375
[13], type=              standardized_value, bytes=        56, ratio=0.17500
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 320 bytes, size: 320 bytes


Here you can see that we have type 11, which points StripOffset to an array of offset addresses where the pixel data can be found. This is similar to the diagram above. In this case, we have 40 bytes with a high bitflipping risk.