tag:blogger.com,1999:blog-90529408877562665772024-03-14T05:43:42.793-07:00Kulturreste – Was von uns übrig bleibt…Zwei Kollegen und die Langzeitarchivierung…Unknownnoreply@blogger.comBlogger51125tag:blogger.com,1999:blog-9052940887756266577.post-6200309350411855622023-03-16T15:11:00.005-07:002023-03-16T15:11:32.288-07:00Some thoughts about a minimalistic Archival Information System, part 3<p> <span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">In the <a href="http://kulturreste.blogspot.com/2023/03/some-thoughts-about-minimalistic_7.html">last blog post</a> we considered what the basic structure of the information packages should look like and how we will deal with versioning.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">In the following I would like to describe further cornerstones of a minimalistic archival information system.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">These will then form the basis for a first implementation, the details of which would go beyond the scope of this blog.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">As soon as there is news worth reporting, I will announce it here.</span></span></span></p><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Data management</span></span></span></h1><p style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Other archival information systems sometimes make it too easy for themselves and use a database to manage information about the AIPs in the archive.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">In principle there is nothing wrong with this, but it often seems to be forgotten that a basic principle of information packages is that data and metadata together form one intellectual entity (IE).</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">What does this mean?</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">The idea is that an IE should be able to stand on its own at all times.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Following this principle has two consequences.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">First, hierarchically nested IEs cannot exist unless self-contained IEs are encapsulated like a box of boxes.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">In other words: IEs that only contain references to other IEs are not possible because they would not be viable on their own.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></p><p style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">The second consequence is that all metadata must always be in a consistent state, regardless of the state of the archival information system.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">In other words, there must be no contradictions between the information stored in the AIP and the information in the system's database.</span></span></span></p><p style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Why is this important?</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">The Archival Information Packages are ultimately the time capsules that will outlast the archive.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">If everything breaks but a copy of an AIP is still found on tape, that copy contains all the information needed to interpret the data to be preserved.</span></span></span> </span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">So for the management of the AIPs we define the following:</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span></p><ol style="text-align: left;"><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">The AIP is the basis for everything.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">If there are inconsistencies in the archival information system, we consult the AIPs first.</span></span></span></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz"><span class="ryNqvb"></span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">To speed up processing, we can use a database.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">But then there must be a way to regenerate the database from the AIPs.</span></span></span></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz"><span class="ryNqvb"></span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">If the AIP is the basis for everything, then we need a mechanism that ensures that errors during the creation of an AIP or of a new AIP version can be rolled back.</span></span><span class="jCAhz"><span class="ryNqvb"> </span></span></span></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">The archive can and should assume responsibility for the data entrusted to it only once an AIP or a new AIP version has been generated successfully.</span></span></span></span></span></span></li></ol><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" 
lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Things that make life easier</span></span><span class="jCAhz"><span class="ryNqvb"> </span></span></span></span></span></span></span></span></span></h1><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">What has proven to be very helpful is the following:</span></span><span class="jCAhz"><span class="ryNqvb"> </span></span></span></span></span></span></span></span></span></p><ol style="text-align: left;"><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">We should only allow 1:1 mappings.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">This means that a SIP contains exactly one digital object</span></span><span class="jCAhz"><span class="ryNqvb"> </span></span></span></span></span></span></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">We do without nested IEs.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">We ignore for now that copy operations are expensive.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">It follows that AIP updates always consist of SIPs with complete data and metadata.</span></span></span></span></span></span></span></span></span></li></ol><h1><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Architectural decisions</span></span></span> </span></span></span></span></span></span></span></span></span></h1><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"></span></span></span></span></span></span></span></span></h1><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"></span></span></span></span></span></span></span></span></h1><p style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">A minimalist archival information system (MAIS) should have the following properties:</span></span></span></p><ul style="text-align: left;"><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span 
class="ryNqvb">Implemented as open source for study, improvement and reuse</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Command line oriented so there is a clear interface.</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Concentration on the essentials, therefore no routing, but suitable for parallel use.</span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Fast and small enough not to waste resources.</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Avoiding XML to keep code simple and metadata human-readable</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">BagIt as base for SIPs, AIPs and DIPs</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Preservation Planning and Action as an external operation on a set of AIPs</span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Implementation in a programming language that can be used for all common operating systems without contortions.</span></span></span></span></span></span></li></ul><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">And now?</span></span></span></span></span></span></h1><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span 
class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">I plan to tackle the programming in the coming weeks and months.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">I will probably not go into detail about the individual steps of programming here.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">As soon as there is something presentable, I'll let you know.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Otherwise let me know what your experiences are, which details are important to you with an AIS, especially if it should be particularly lightweight.</span></span></span> <br /></span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"></span></span></p><div><br /><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span> </span></span></span></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-43694222532649494882023-03-07T10:17:00.001-08:002023-03-07T10:19:03.192-08:00Some thoughts about a minimalistic Archival Information System, part 2<p> <span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">In <a href="https://kulturreste.blogspot.com/2023/03/some-thoughts-about-minimalistic.html">the last post</a> I explained some basic terms.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Now it's time for the real thing.</span></span></span><span class="ZSCsVd"><span class="azoIfb"> </span></span></p><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz 
ChMk0b"><span class="ryNqvb">Choosing the right format</span></span></span><span class="ZSCsVd"><span class="azoIfb"></span></span></h1><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">The first question is: what should the information packages (SIP, AIP, DIP) look like?</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">It is important that they are easy to process, easy to understand and easy to extend.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Fortunately, <a href="https://www.rfc-editor.org/rfc/rfc8493">RFC 8493</a> has the solution ready for us: BagIt.</span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVd3NgSvmvDODX7axKKkR5YodvgNI2pxGFHMLyYfj3OM9N7b2RenGt3pU9sF9W9B5dJjGtfHwSe5wqnFgMW5ETmB4I06SZLF29nMwrbWzSnJrStHfJt05ILYya8u3aKRGEzyuf0hy_STW9KD4c9McO6EuC4jiP3Kiwz1BEWvQWp9sn4nvJO84TKBrz1Q/s270/bag.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="182" data-original-width="270" height="269" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVd3NgSvmvDODX7axKKkR5YodvgNI2pxGFHMLyYfj3OM9N7b2RenGt3pU9sF9W9B5dJjGtfHwSe5wqnFgMW5ETmB4I06SZLF29nMwrbWzSnJrStHfJt05ILYya8u3aKRGEzyuf0hy_STW9KD4c9McO6EuC4jiP3Kiwz1BEWvQWp9sn4nvJO84TKBrz1Q/w400-h269/bag.png" width="400" /></a></div><br /> <span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Area (1) holds the metadata, area (2) has room for our payload.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">BagIt is simple: it defines a directory structure and a few files that take over certain functions.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">What is very interesting for us: if we want to store digital 
objects, we can store them in the BagIt payload.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">We can carry this area over completely unchanged when processing a SIP and creating the AIP.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">The same is possible later when creating the DIPs from the AIPs. </span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">BagIt gives a lot of freedom.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">To limit ourselves, we choose UTF-8 for all metadata and text files.</span></span></span> <span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">We also do not use fetch files (fetch.txt).</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Since BagIt is now standardized, we use version 1.0.</span></span></span><p></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></p><div style="text-align: left;"><h1 style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Metadata and AIP update considerations</span></span></span></h1></div><div style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></div><div style="text-align: left;"><span class="HwtZe" lang="en"><span class="jCAhz"><span class="ryNqvb"></span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">Many archival information systems are insufficiently prepared for metadata and AIP updates.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">In my experience, it is important to think about which data is updated, how it is updated, and what the consequences are.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">For a producer to be able to submit supplements, these must be unambiguously assignable to an existing AIP.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">One option is to hand the producer an ID when their first submission is ingested.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">This is not a good choice because it couples the process tightly and exposes internals to the outside world.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">In addition, if a producer switches to a different AIS, IDs can collide.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">A better choice is to ask producers to choose a unique ID for their data themselves and to transmit it in their SIPs.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Internally, we then use this ID to look up the matching AIP. The ID is called "ExternalID" and is the basis for our internal MAIS-AIP-ID.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">More on that later.</span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></div><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">In the last post I already mentioned that we have to think about the versioning of AIPs.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Not only because of metadata or AIP updates, but also because of Preservation Planning and Action (PP&amp;A), i.e. format migration.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">A simple idea is to chain the AIP versions as a linked list.</span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></span></span></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidVcqqf7wlyKC_wEnh7ewseiKpMRnd9IvbNCRWcMqbu4-HnpVI-dz9bg1cA8v1zEITZWDWTdKfy9kyK8YVC431jn7twZo_1asK-s5Z3ye3ifPYHYzn0I30C-KMWYgA8xuaUT4ZoXxsuSnNnfp8FL_UzpUzAQjhQWR05lPNSPoFCwKMrWjiFk_T_8EJjA/s640/aip_versions.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="168" data-original-width="640" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidVcqqf7wlyKC_wEnh7ewseiKpMRnd9IvbNCRWcMqbu4-HnpVI-dz9bg1cA8v1zEITZWDWTdKfy9kyK8YVC431jn7twZo_1asK-s5Z3ye3ifPYHYzn0I30C-KMWYgA8xuaUT4ZoXxsuSnNnfp8FL_UzpUzAQjhQWR05lPNSPoFCwKMrWjiFk_T_8EJjA/w640-h168/aip_versions.png" width="640" /></a></div><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">This also makes it easy to implement rolling back an AIP version.</span></span></span></p><p>A new AIP version points to its predecessor through a reference entry that the new version receives in its "bag-info.txt":</p><p></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span></p><ul style="text-align: left;"><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> 'MAIS-previous-AIP' - </span></span></span></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b C1N51c"><span 
class="ryNqvb">contains AIP-ID of the current AIS (MAIS-AIP-ID)<br /></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">'MAIS-migrated-AIP' - </span></span></span></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b C1N51c"><span class="ryNqvb">contains AIP-ID of the previous AIS if AIP was migrated from there<br /></span></span></span></li><li><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">'MAIS-origin-AIS' - </span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b C1N51c"><span class="ryNqvb">contains identifiers of the previous AIS from where the AIP was migrated</span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></li></ul><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> The last two keys are optional and only needed if AIP-AIP-Transfer is needed to move digital objects from one archival information system to another.</span></span></span><span class="ZSCsVd"><span class="azoIfb"></span></span></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-40780286539241929312023-03-02T12:37:00.002-08:002023-03-07T09:37:53.842-08:00Some thoughts about a minimalistic Archival Information System, part 1<p>Many of those who are dealing with the digital preservation of objects for the first time and who work in small memory organizations are often helpless in the face of the vast range of functions and requirements of current archival information systems.<br />Students of library or archival science often appear to be similarly overwhelmed when they are supposed to learn what constitutes archival software.<br /><br />This has motivated me to write down thoughts on a minimalist archive information system. 
Because it really doesn't need much.<br /><br /></p><h1 style="text-align: left;">The basic terms</h1><p>An archive essentially has three roles: the submitter, called the <u>producer</u>, the user, also called the <u>consumer</u>, and the problem solver who maintains the archive, also called the <u>technical analyst</u>.</p><p> <br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc3DkdGDB6lJyxwnOKF5VQe8vhY1LNE5HRnB9FZHm4FLvL3rmuSU_rfIMSY7WSD53gfvNN4udcqoHuhL4pum8AimBEetAWt50prnSdpB81RXZsn38onHzZF4YxWmSlyvbq9bAwkE4gLfiOVGTMIPpRBO0u9GqYpVYtUaHIfOwhEzcIhhIY6uNQRQXo2A/s345/AIS.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="291" data-original-width="345" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc3DkdGDB6lJyxwnOKF5VQe8vhY1LNE5HRnB9FZHm4FLvL3rmuSU_rfIMSY7WSD53gfvNN4udcqoHuhL4pum8AimBEetAWt50prnSdpB81RXZsn38onHzZF4YxWmSlyvbq9bAwkE4gLfiOVGTMIPpRBO0u9GqYpVYtUaHIfOwhEzcIhhIY6uNQRQXo2A/s320/AIS.png" width="320" /></a></div><p></p><p>When digital objects are transferred to the archive, it is called the <u>ingest</u> process. When they are requested from the archive, then this is the <u>access</u> process. </p><p>The digital objects to be preserved are provided with all the necessary information for the archive ingest and are packaged in a predefined structure. This is called a <u>Submission Information Package</u> (<u>SIP</u>). You can actually imagine this just like in real life. For example, if you want to store a vase, you put it in a box, label it and put it on a shelf.<br /></p><p>In the archive it is checked whether (allegorically) the vase is in the box and intact, and if there is a stamp and signature that says that the content of the package is indeed a vase. 
A file number and a storage location are assigned, and the box goes on the shelf, sealed and neatly labeled. The "box" is now called an <u>Archival Information Package</u> (<u>AIP</u>). With the seal, the archive takes responsibility.</p><p>At some point, when a user would like to see the vase again, the archive processes the request and sends the vase and its accompanying information to the user. This is then called a <u>Dissemination Information Package</u> (<u>DIP</u>).</p><p>In addition to this simple "I store something safely and retrieve it again at some point" approach, an archive fulfills another task that is not so obvious: it ensures that objects entrusted to it are kept usable. </p><p>What does that mean in the digital world?<br />Even if it is possible in principle to store a digital object securely with bit accuracy over a very long period of time (<u>bitstream archival</u>), the object can still age because the environment needed to use it is no longer available. <br /><br />There are essentially three concepts for keeping digital objects usable (<u>content preservation</u>): <i>hardware museum</i>, <i>emulation</i> or <u>format migration</u>. </p><p>Hardware museums (e.g. a slot machine museum) try to keep old equipment running in a controlled environment. 
To do that, they have to build up a stock of equipment in good time and acquire the knowledge needed to maintain and repair these devices.<br /></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">With emulation, I try to recreate the environment for the digital object so that it feels at home and doesn't notice any difference from the previous, real world.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">A very good example of an emulator is MAME,</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">but there are various others that emulate,</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">for example, retro computers like the Amiga or C64, so that old programs from their time can still run.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">Here, too, I need knowledge about what the environment to be emulated looks like and how I can recreate it with today's means.</span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">With format migration, I try to find, in good time, a new format that retains the essential properties (<u>significant properties</u>), and to convert the files of a digital object to that newer data format.</span></span></span> <br /></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">From this point onwards, it is assumed that format migration is the preferred way of maintaining usability.</span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">It follows that an <u>Archival Information System</u> (<u>AIS</u>) must be able to support this process of format migration.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">The process (also called <u>Preservation Planning and Action</u>) results in a new version of the Archival Information Package being created.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">The AIS must be able to manage these versions.</span></span></span></span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">That would basically be 
all there is to Archival Information Systems if it hadn't been for the librarians.</span></span></span> <br /></span></span></span></span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">Unlike archivists, where a record is complete and closed, librarians understand the concepts of supplements and metadata submissions.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">A page that has fallen out has turned up here, a letter has been discovered there in an estate, or it has dawned on some people that there is now money for costly in-depth indexing.</span></span><span class="jCAhz"><span class="ryNqvb">
</span></span><span class="jCAhz ChMk0b"><span class="ryNqvb">Ergo, librarians expect people to think about how to handle metadata and data updates on existing AIPs (called <u>metadata update</u> and <u>AIP update</u>).</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">This is not trivial, since some AIPs are very large and you want to avoid pointless copying. </span></span></span></span></span></span></span></span></span></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">For such an update, we also need a good way for producers to tell the archive which AIP is to be supplemented or updated.</span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"></span></span></span></span></span></span></span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">However, since AIPs are already versioned for format migration, the same mechanism can be used here as well.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Any change to an AIP creates a new version of that AIP.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">And so that nothing can be broken by accident, it should always be possible to go back to an old version.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">And since that is also error-prone, the result of the rollback process 
will simply be a new version.</span></span></span> </span></span></span></span></span></span></span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span></span></span></span></span></span></span></span></span></span><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb">That's it.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">There is nothing more to it.</span></span> <span class="jCAhz ChMk0b"><span class="ryNqvb">Easy, isn't it?</span></span></span></span></span></span></span></span></span> </span></span></span></span></span></span></p><p><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"><span class="HwtZe" lang="en"><span class="jCAhz ChMk0b"><span class="ryNqvb"> </span></span></span> <br /></span></span></span></p><p>Draft for a differential BagIt (published 2022-05-12, updated 2022-05-18)</p><h2 style="text-align: left;">The problem</h2><p>BagIt (<a href="https://datatracker.ietf.org/doc/html/rfc8493">RFC 8493</a>) forms the basis for Submission Information Packages (SIP) and Archival Information Packages (AIP) in many digital archives.<br />Especially in the library environment, it is necessary to support supplemental submissions in the Archival Information System (AIS) 
software. Supplements may be limited to metadata, or they may add new files, remove existing files, or replace existing files.<br />Unfortunately, the BagIt specification offers no clean and easy way to implement a differential SIP.<br /><br /></p><h2 style="text-align: left;">The constraints</h2><p>A design for a differential BagIt (dBagIt) should meet the following conditions:</p><ol style="text-align: left;"><li>existing BagIt should remain untouched</li><li>it should be based on the BagIt structure so that the conversion effort is minimal</li><li>it should be easy to implement</li><li>it should support the "add" and "delete" operations</li><li>checksum protection should be guaranteed</li><li>the referenced bag should be specified explicitly</li></ol><h2 style="text-align: left;">The proposal of dBagIt</h2><p>The basis is the structure of BagIt. Only the following changes are mandatory.</p><h3 class="newpage" style="text-align: left;"><span class="h4">Bag Declaration: dbagit.txt</span></h3><p class="newpage" style="text-align: left;"><span class="h4">In contrast to section 2.1.1 of RFC 8493, the file name is <i>dbagit.txt</i>.</span></p><h3 class="newpage" style="text-align: left;"><span class="h4">Payload Manifest</span></h3><p>In contrast to section 2.1.3 of RFC 8493, each line of a payload manifest file MUST be of the form</p><pre class="newpage"> sign checksum filepath</pre><p class="newpage" style="text-align: left;">where <i>sign</i> is either <i>+</i> for adding a file or <i>-</i> for deleting a file. </p><p class="newpage" style="text-align: left;">Replacing a file is expressed by two entries: one that deletes the old file and one that adds the new file.</p><h3 class="newpage" style="text-align: left;"><span class="h4">Bag Metadata: bag-info.txt</span></h3><p class="newpage" style="text-align: left;"><span class="h4">In addition to </span><span class="h4">RFC 8493, the key <i>Updates-External-Identifier</i> becomes mandatory. 
It references the original data object that this dBagIt updates.<br /></span></p><h3 class="newpage" style="text-align: left;">Optional Tag Manifest</h3><p>The Tag Manifest follows RFC 8493. </p><p>Although tag manifest files in BagIt can describe additional proprietary subdirectories of a bag that are not specified in the RFC, dBagIt does not define add/delete signs for them as it does for the payload manifest in the previous section. This facilitates the creation and processing of dBagIts. </p><h2 style="text-align: left;">Implementation of the behavior</h2><p style="text-align: left;">The implementation must ensure that:</p><ol style="text-align: left;"><li>the target object referenced by the key <span class="h4"><i>Updates-External-Identifier </i>exists</span></li><li><span class="h4">the dBagIt is valid</span></li><li><span class="h4">the add/delete operations are atomic and can be rolled back</span></li><li><span class="h4">the checksums of the files to be added are correct and the files are part of the current payload<br /></span></li><li><span class="h4">the checksums of the files to be deleted match the checksums of the corresponding files in the referenced digital object <br /></span></li><li><span class="h4">the files in tag manifests are handled correctly if proprietary extensions are used <br /></span></li><li><span class="h4">the metadata in bag-info.txt completely replaces the previous version in the referenced object<br /></span></li></ol><h2 style="text-align: left;">Future</h2><p>If there is interest, I would be happy to receive feedback via art1pirat ATgmail.com. Maybe a new RFC can grow out of it.</p><h2 style="text-align: left;">Alternative consideration</h2><p style="text-align: left;">A very simple alternative would be to use a unified 'diff'. 
This also allows partial changes in files, but would hardly bring any advantages with binary data and is not quite as intuitive for users who are not familiar with IT.</p><h2 style="text-align: left;">FAQ (Update 2022-05-18)</h2><ol style="text-align: left;"><li><i>What if "delete" references a non-existing file?</i> All operations of a differential BagIt, taken together, should be atomic and consistent. In this case the operations are rolled back and the update is aborted with an error. This ensures that no unintended updates are applied.</li><li><i>Wouldn't it be nice to allow a simple rename instead of a replace, to avoid transferring files?</i> This would be worth considering. However, a secure rename requires the checksum, the old file name, and the new file name. That makes it complicated again. Since the case would probably not be too frequent, this could be specified later if needed.</li><li><i>If several files have the same checksum, how is it ensured that the wrong file is not deleted or replaced?</i> Since "delete" requires both the checksum and the path of the already existing file, a mix-up is impossible.</li><li><i>Is it correct that when I pass metadata in bag-info.txt, it overwrites the metadata in the referenced object? If yes, why?</i> Yes, that is so. Focusing only on the payload simplifies the design. Besides, the purpose of differential BagIt is to avoid the cost of transferring all files again for supplemental deliveries. 
And most of the cost is usually incurred in transferring the payload.<br /></li></ol><p>Detectorist - Part two "A crumb of knowledge" (published 2021-12-03)</p><h1 style="text-align: left;">A crumb of knowledge</h1><p><br /></p><p>In the <a href="http://kulturreste.blogspot.com/2021/11/detectorist-part-one-first-indications.html">first part</a> I described how I learned to read the floppy disks<span> (using Kryoflux)</span>. Now I would like to give an interim report on the floppy disk format of the Panasonic typewriter - in the quiet hope that someone can uncover the last secret.<br /><br /><br />I found the most important clue while researching a successor model - the Panasonic KX-W1000. I stumbled across the following old blog post: <a href="https://surrey.lug.org.uk/panasonic-kx-w1000">https://surrey.lug.org.uk/panasonic-kx-w1000</a>.</p><h2 style="text-align: left;">My findings <br /></h2><p>Even if it didn't lead to full success, there were some interesting insights. The floppy image is strongly related to FAT12.</p><p>Here is my summary. <br /></p><p>The filesystem is based on FAT12 with proprietary extensions. </p><h3 style="text-align: left;">Header / MBR <br /></h3><p>The first bytes are: 0x00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 20, which corresponds to the string "KX-W1510 1.00" from the fourth byte onwards.</p><p>The first 256 bytes are very similar to an MBR of old DOS floppies:</p><p>
</p><pre>0000:0000 | 00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 | ...KX-W1510 1.00
0000:0010 | 20 F9 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ù..............
0000:0020 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0030 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0040 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0050 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0060 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0070 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0080 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:0090 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00A0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00B0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00C0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00D0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00E0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0000:00F0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................</pre><h3 style="text-align: left;">FATs <br /></h3><p>There are two identical blocks that probably represent the FATs, one at address 0x200:</p><pre>0000:0200 | F9 FF FF 03 40 00 05 B0 00 07 80 00 09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0210 | FF FF 0D E0 00 0F 00 01 FF 8F 01 13 40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0220 | 01 17 F0 FF 19 90 02 1B 10 02 1D E0 01 1F 00 02 | ..ðÿ.......à....
0000:0230 | FF 2F 02 23 F0 FF 25 60 02 2D 80 02 2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0240 | F0 FF FF EF 02 35 00 03 31 20 03 33 F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0250 | FF 37 80 03 39 F0 FF 3B C0 03 3D E0 03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0260 | 41 20 04 43 F0 FF 45 60 04 47 80 04 49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0270 | C0 04 4D E0 04 4F F0 FF 51 20 05 53 40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0280 | FF 57 80 05 59 A0 05 5B F0 FF 5D E0 05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0290 | 7F 20 06 63 40 06 65 F0 FF 6E 80 06 6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:02A0 | CF 06 6D F0 FF 7E 00 07 71 20 07 73 40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:02B0 | 07 77 80 07 79 A0 07 FF CF 07 7D 60 08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:02C0 | 81 20 08 FF 4F 08 85 80 08 87 F0 FF FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:02D0 | F0 08 8D E0 08 90 20 09 91 40 09 93 F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:02E0 | 09 A2 F0 09 99 A0 09 9B C0 09 9D F0 FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:02F0 | B2 60 0A A3 40 0A A5 F0 FF AC 80 0A A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0300 | F0 FF AD E0 0A AF 00 0B B1 B0 0B BA 40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0310 | 0C B7 80 0B B9 60 0C C3 C0 0B BD E0 0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0320 | C1 20 0C CA 40 0C C5 80 0C C7 90 0C E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0330 | F0 FF CD E0 0C CF B0 0D D1 50 0D D3 40 0D DE F0 | ðÿÍà.Ï°.ÑP.Ó@.Þð
0000:0340 | FF D7 E0 0E D9 70 0E E2 C0 0D DD F0 FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0350 | E1 50 0E E3 40 0E E6 90 0E FB C0 0E FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0360 | F0 FF ED 60 0F EF 00 0F F1 20 0F F3 40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0370 | FF F7 80 0F F9 A0 0F FF CF 0F FD E0 0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....
</pre><p>and one at address 0x800:</p><pre>0000:0800 | F9 FF FF 03 40 00 05 B0 00 07 80 00 09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0810 | FF FF 0D E0 00 0F 00 01 FF 8F 01 13 40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0820 | 01 17 F0 FF 19 90 02 1B 10 02 1D E0 01 1F 00 02 | ..ðÿ.......à....
0000:0830 | FF 2F 02 23 F0 FF 25 60 02 2D 80 02 2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0840 | F0 FF FF EF 02 35 00 03 31 20 03 33 F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0850 | FF 37 80 03 39 F0 FF 3B C0 03 3D E0 03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0860 | 41 20 04 43 F0 FF 45 60 04 47 80 04 49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0870 | C0 04 4D E0 04 4F F0 FF 51 20 05 53 40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0880 | FF 57 80 05 59 A0 05 5B F0 FF 5D E0 05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0890 | 7F 20 06 63 40 06 65 F0 FF 6E 80 06 6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:08A0 | CF 06 6D F0 FF 7E 00 07 71 20 07 73 40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:08B0 | 07 77 80 07 79 A0 07 FF CF 07 7D 60 08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:08C0 | 81 20 08 FF 4F 08 85 80 08 87 F0 FF FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:08D0 | F0 08 8D E0 08 90 20 09 91 40 09 93 F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:08E0 | 09 A2 F0 09 99 A0 09 9B C0 09 9D F0 FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:08F0 | B2 60 0A A3 40 0A A5 F0 FF AC 80 0A A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0900 | F0 FF AD E0 0A AF 00 0B B1 B0 0B BA 40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0910 | 0C B7 80 0B B9 60 0C C3 C0 0B BD E0 0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0920 | C1 20 0C CA 40 0C C5 80 0C C7 90 0C E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0930 | F0 FF CD E0 0C CF B0 0D D1 50 0D D3 40 0D DE F0 | ðÿÍà.Ï°.ÑP.Ó@.Þð
0000:0940 | FF D7 E0 0E D9 70 0E E2 C0 0D DD F0 FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0950 | E1 50 0E E3 40 0E E6 90 0E FB C0 0E FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0960 | F0 FF ED 60 0F EF 00 0F F1 20 0F F3 40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0970 | FF F7 80 0F F9 A0 0F FF CF 0F FD E0 0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....
0000:0980 | 00 00 00 00 00 00 00 00 00 00 00 00 FF 0F 00 00 | ............ÿ... <br /></pre><h3 style="text-align: left;">Directory <br /></h3><p>The main directory always starts at address 0xE00:</p><pre>0000:0E00 | 20 20 20 20 20 20 44 49 5B 54 20 FF 00 00 00 00 | DI[T ÿ....
0000:0E10 | 00 00 00 00 00 00 06 00 21 00 02 00 F5 13 00 00 | ........!...õ...
0000:0E20 | 20 20 20 20 20 20 41 46 46 45 20 FF 00 00 00 00 | AFFE ÿ....
0000:0E30 | 00 00 00 00 00 00 06 00 21 00 06 00 F6 13 00 00 | ........!...ö...
0000:0E40 | 20 20 20 54 52 5D 46 46 45 4C 20 FF 00 00 00 00 | TR]FFEL ÿ....
0000:0E50 | 00 00 00 00 00 00 06 00 21 00 0C 00 C4 13 00 00 | ........!...Ä...
0000:0E60 | 20 20 45 52 42 50 52 49 4E 5A 20 FF 00 00 00 00 | ERBPRINZ ÿ....
0000:0E70 | 00 00 00 00 00 00 06 00 21 00 11 00 17 14 00 00 | ........!.......
0000:0E80 | 20 20 20 20 42 49 53 54 52 4F 20 FF 00 00 00 00 | BISTRO ÿ....
0000:0E90 | 00 00 00 00 00 00 06 00 21 00 12 00 61 14 00 00 | ........!...a...
0000:0EA0 | 20 20 20 48 55 48 4E 20 49 49 20 FF 00 00 00 00 | HUHN II ÿ....
0000:0EB0 | 00 00 00 00 00 00 06 00 21 00 1C 00 CC 13 00 00 | ........!...Ì...
0000:0EC0 | 20 20 20 20 57 41 43 48 41 55 20 FF 00 00 00 00 | WACHAU ÿ....
0000:0ED0 | 00 00 00 00 00 00 06 00 21 00 1A 00 C0 13 00 00 | ........!...À...
0000:0EE0 | 20 20 20 20 20 4B 41 4B 41 4F 20 FF 00 00 00 00 | KAKAO ÿ....
0000:0EF0 | 00 00 00 00 00 00 06 00 21 00 24 00 1D 14 00 00 | ........!.$.....
0000:0F00 | 20 20 20 20 20 20 4D 5D 4C 4C 20 FF 00 00 00 00 | M]LL ÿ....
0000:0F10 | 00 00 00 00 00 00 06 00 21 00 27 00 22 0B 00 00 | ........!.'."...
0000:0F20 | 20 46 52 41 55 20 4D 4F 44 45 20 FF 00 00 00 00 | FRAU MODE ÿ....
0000:0F30 | 00 00 00 00 00 00 06 00 21 00 2F 00 AC 13 00 00 | ........!./.¬...
0000:0F40 | 20 20 20 20 53 55 50 50 45 4E 20 FF 00 00 00 00 | SUPPEN ÿ....
0000:0F50 | 00 00 00 00 00 00 06 00 21 00 34 00 C7 13 00 00 | ........!.4.Ç...
0000:0F60 | 55 4E 53 45 52 20 42 52 4F 54 20 FF 00 00 00 00 | UNSER BROT ÿ....
0000:0F70 | 00 00 00 00 00 00 06 00 21 00 3A 00 B8 13 00 00 | ........!.:.¸...
0000:0F80 | 20 20 20 20 20 20 31 39 39 34 20 FF 00 00 00 00 | 1994 ÿ....
0000:0F90 | 00 00 00 00 00 00 06 00 21 00 3F 00 AA 13 00 00 | ........!.?.ª...
0000:0FA0 | 20 20 20 20 20 4B 5D 43 48 45 20 FF 00 00 00 00 | K]CHE ÿ....
0000:0FB0 | 00 00 00 00 00 00 06 00 21 00 44 00 3D 14 00 00 | ........!.D.=...
0000:0FC0 | 20 55 43 4B 45 52 4D 41 52 4B 20 FF 00 00 00 00 | UCKERMARK ÿ....
0000:0FD0 | 00 00 00 00 00 00 06 00 21 00 4A 00 2B 14 00 00 | ........!.J.+...
0000:0FE0 | 20 20 52 49 45 53 4C 49 4E 47 20 FF 00 00 00 00 | RIESLING ÿ....
0000:0FF0 | 00 00 00 00 00 00 06 00 21 00 50 00 31 14 00 00 | ........!.P.1...
0000:1000 | 43 48 49 4E 41 54 52 5D 46 46 20 FF 00 00 00 00 | CHINATR]FF ÿ....
0000:1010 | 00 00 00 00 00 00 06 00 21 00 56 00 25 14 00 00 | ........!.V.%...
0000:1020 | 20 4B 5B 53 45 52 45 53 54 45 20 FF 00 00 00 00 | K[SERESTE ÿ....
0000:1030 | 00 00 00 00 00 00 06 00 21 00 5C 00 E3 12 00 00 | ........!.\.ã...
0000:1040 | 4B 41 54 5A 45 4E 46 55 54 54 20 FF 00 00 00 00 | KATZENFUTT ÿ....
0000:1050 | 00 00 00 00 00 00 06 00 21 00 61 00 CC 12 00 00 | ........!.a.Ì...
0000:1060 | 20 20 52 4F 42 55 43 48 4F 4E 20 FF 00 00 00 00 | ROBUCHON ÿ....
0000:1070 | 00 00 00 00 00 00 06 00 21 00 5F 00 55 14 00 00 | ........!._.U...
0000:1080 | 20 20 20 4D 41 4E 41 47 45 52 20 FF 00 00 00 00 | MANAGER ÿ....
0000:1090 | 00 00 00 00 00 00 06 00 21 00 67 00 FC 13 00 00 | ........!.g.ü...
0000:10A0 | 20 20 4D 49 43 48 45 4C 49 4E 20 FF 00 00 00 00 | MICHELIN ÿ....
0000:10B0 | 00 00 00 00 00 00 06 00 21 00 6F 00 8C 14 00 00 | ........!.o.....
0000:10C0 | 20 20 50 49 4D 45 4E 54 4F 53 20 FF 00 00 00 00 | PIMENTOS ÿ....
0000:10D0 | 00 00 00 00 00 00 06 00 21 00 75 00 14 14 00 00 | ........!.u.....
0000:10E0 | 54 48 4F 4D 41 53 4D 41 4E 4E 20 FF 00 00 00 00 | THOMASMANN ÿ....
0000:10F0 | 00 00 00 00 00 00 06 00 21 00 66 00 20 14 00 00 | ........!.f. ...
0000:1100 | 20 20 38 2D 4D 41 49 2D 34 35 20 FF 00 00 00 00 | 8-MAI-45 ÿ....
0000:1110 | 00 00 00 00 00 00 06 00 21 00 60 00 2A 14 00 00 | ........!.`.*...
0000:1120 | 20 20 43 4F 51 41 55 56 49 4E 20 FF 00 00 00 00 | COQAUVIN ÿ....
0000:1130 | 00 00 00 00 00 00 06 00 21 00 89 00 0B 14 00 00 | ........!.......
0000:1140 | 20 47 55 44 45 20 53 54 55 42 20 FF 00 00 00 00 | GUDE STUB ÿ....
0000:1150 | 00 00 00 00 00 00 06 00 21 00 8C 00 A0 14 00 00 | ........!... ...
0000:1160 | 20 20 4D 4F 4E 54 43 41 55 44 20 FF 00 00 00 00 | MONTCAUD ÿ....
0000:1170 | 00 00 00 00 00 00 06 00 21 00 95 00 63 15 00 00 | ........!...c...
0000:1180 | 20 53 50 41 52 47 45 4C 45 49 20 FF 00 00 00 00 | SPARGELEI ÿ....
0000:1190 | 00 00 00 00 00 00 06 00 21 00 98 00 BD 14 00 00 | ........!...½...
0000:11A0 | 53 45 4D 49 42 45 4C 47 49 45 20 FF 00 00 00 00 | SEMIBELGIE ÿ....
0000:11B0 | 00 00 00 00 00 00 06 00 21 00 97 00 B3 25 00 00 | ........!...³%..
0000:11C0 | 20 53 45 4D 49 4E 41 52 39 35 20 FF 00 00 00 00 | SEMINAR95 ÿ....
0000:11D0 | 00 00 00 00 00 00 06 00 21 00 9E 00 C9 4B 00 00 | ........!...ÉK..
0000:11E0 | 20 20 54 41 4E 54 41 4C 55 53 20 FF 00 00 00 00 | TANTALUS ÿ....
0000:11F0 | 00 00 00 00 00 00 06 00 21 00 A7 00 DC 12 00 00 | ........!.§.Ü...</pre><p>In contrast to FAT12, each directory entry uses 10 bytes for the file name, left-padded with spaces. Umlauts in file names are possible (see below). There is no file name suffix. This matches the description in the typewriter manual.</p><p>Sometimes there is a special directory at offset 0x100; it could hold the address lists or dictionaries:</p><pre>0000:0100 | 20 20 20 20 57 41 53 53 45 52 20 FF 00 00 00 00 | WASSER ÿ....
0000:0110 | 00 00 00 00 00 00 06 00 21 00 48 00 36 0A 00 00 | ........!.H.6...
0000:0120 | 20 20 20 20 20 4B 5D 43 48 45 20 FF 00 00 00 00 | K]CHE ÿ....
0000:0130 | 00 00 00 00 00 00 06 00 21 00 49 00 3D 14 00 00 | ........!.I.=...
0000:0140 | 20 20 20 41 55 53 54 45 52 4E 20 FF 00 00 00 00 | AUSTERN ÿ....
0000:0150 | 00 00 00 00 00 00 06 00 21 00 4E 00 D2 0A 00 00 | ........!.N.Ò...
0000:0160 | 20 20 20 20 54 52 5B 55 4D 45 20 FF 00 00 00 00 | TR[UME ÿ....
0000:0170 | 00 00 00 00 00 00 06 00 21 00 50 00 59 25 00 00 | ........!.P.Y%..
0000:0180 | 20 20 52 45 43 48 4E 55 4E 47 20 FF 00 00 00 00 | RECHNUNG ÿ....
0000:0190 | 00 00 00 00 00 00 06 00 21 00 54 00 C3 08 00 00 | ........!.T.Ã...
0000:01A0 | 20 20 20 20 20 48 45 4E 52 59 20 FF 00 00 00 00 | HENRY ÿ....
0000:01B0 | 00 00 00 00 00 00 06 00 21 00 59 00 66 11 00 00 | ........!.Y.f...
0000:01C0 | 53 43 48 57 41 52 5A 41 44 4C 20 FF 00 00 00 00 | SCHWARZADL ÿ....
0000:01D0 | 00 00 00 00 00 00 06 00 21 00 57 00 94 0A 00 00 | ........!.W.....
0000:01E0 | 20 50 4C 41 43 48 55 54 54 41 20 FF 00 00 00 00 | PLACHUTTA ÿ....
0000:01F0 | 00 00 00 00 00 00 06 00 21 00 5C 00 49 09 00 00 | ........!.\.I...
</pre>
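<p>The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the current guesses, not a verified parser: the function names (<i>decode_name</i>, <i>parse_entry</i>, <i>parse_directory</i>) are made up for this sketch, the 32-byte entry size is read off the dumps, the character table is the one listed further below, and reading bytes 26-27 and 28-29 as 16-bit little-endian values (start cluster and file size, as in FAT12) is an unconfirmed assumption.</p>

```python
import struct

# Character codes as listed in this post's umlaut table (further below).
SPECIAL_CHARS = {
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü",
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x85: "ß", 0xBC: "-",
}

ENTRY_SIZE = 32   # one entry spans two 16-byte rows of the hex dump
NAME_LENGTH = 10  # file name, left-padded with spaces


def decode_name(raw: bytes) -> str:
    """Translate the typewriter's character codes and strip the left padding."""
    text = "".join(SPECIAL_CHARS.get(byte, chr(byte)) for byte in raw)
    return text.lstrip(" ")


def parse_entry(entry: bytes) -> dict:
    """Split one 32-byte directory entry into the fields guessed so far."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError(f"expected {ENTRY_SIZE} bytes, got {len(entry)}")
    # Assumption: both fields are 16-bit little-endian values, as in FAT12.
    start_cluster, size = struct.unpack_from("<HH", entry, 26)
    return {
        "name": decode_name(entry[:NAME_LENGTH]),
        "start_cluster": start_cluster,
        "size": size,
    }


def parse_directory(image: bytes, offset: int, count: int) -> list:
    """Parse `count` consecutive entries starting at `offset` in the image."""
    return [
        parse_entry(image[pos:pos + ENTRY_SIZE])
        for pos in range(offset, offset + count * ENTRY_SIZE, ENTRY_SIZE)
    ]


# The "TR]FFEL" entry from the main directory dump (address 0x0E40).
sample = bytes.fromhex(
    "20 20 20 54 52 5D 46 46 45 4C 20 FF 00 00 00 00 "
    "00 00 00 00 00 00 06 00 21 00 0C 00 C4 13 00 00"
)
print(parse_entry(sample))
```

<p>For the sample entry this yields the name "TRÜFFEL", start cluster 12 and size 5060. With a complete image loaded into memory, <i>parse_directory(image, 0xE00, 30)</i> would list the 30 entries of the main directory shown above; whether the cluster and size values really mean what the names suggest remains an open question.</p>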
<p>But sometimes this area contains text fragments instead (example from another floppy):</p><pre>0000:0100 | 64 20 73 63 68 E9 64 6C 69 63 68 21 C9 20 20 20 | d schédlich!É
0000:0110 | 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |
0000:0120 | 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |
0000:0130 | 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |
0000:0140 | 55 64 6F 20 50 6F 6C 6C 6D 65 72 2C 20 65 69 6E | Udo Pollmer, ein
0000:0150 | 20 4C 65 62 65 6E 73 6D 69 74 74 65 6C 63 68 65 | Lebensmittelche
0000:0160 | 6D 69 6B 65 72 20 75 6E 64 20 65 72 66 6F 6C 67 | miker und erfolg
0000:0170 | 72 65 69 63 68 65 72 20 20 20 20 20 20 20 20 20 | reicher
0000:0180 | 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |
0000:0190 | 46 61 63 68 62 75 63 68 61 75 74 6F 72 20 68 61 | Fachbuchautor ha
0000:01A0 | 74 20 69 6E 20 65 69 6E 65 6D 20 5A 65 69 74 75 | t in einem Zeitu
0000:01B0 | 6E 67 73 69 6E 74 65 72 76 69 65 77 20 65 72 6B | ngsinterview erk
0000:01C0 | 6C E9 72 74 3A 20 20 20 20 20 20 20 20 20 20 20 | lért:
0000:01D0 | 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |
0000:01E0 | 22 44 69 E9 74 65 6E 20 6D 61 63 68 65 6E 20 64 | "Diéten machen d
0000:01F0 | 69 63 6B 22 2E 20 57 65 69 6C 20 64 65 72 20 4B | ick". Weil der K
</pre><h3 style="text-align: left;">Umlauts and special chars<br /></h3><p>Umlauts and special chars are mapped as follows:</p><p>ä → 0x7b<br />ö → 0x7c<br />ü → 0x7d<br />Ä → 0x5b<br />Ö → 0x5c<br />Ü → 0x5d<br />ß → 0x85<br />hyphen → 0xbc</p><h2 style="text-align: left;">Open Questions</h2><p>What is still completely unclear is how the FATs are constructed. They do look like FAT12 entries: the first bytes 0xf9 0xff 0xff 0x03... and the frequently occurring 0xff suggest this. Yet there seems to be no connection between the addresses of the text fragments in the image and the FAT byte sequences.</p><p><br />In the directory entries, everything suggests that byte 26 indicates the start cluster and bytes 28-29 the file size. However, I have not yet been able to decipher how these values relate to the FAT and to the actual offset (or cluster) of the data.</p><p></p><p>The meaning of the area at offset 0x100 is unclear. </p><p>If you have any ideas how to read the FATs, how to interpret bytes 26 and 28-29 of the directory entries, or what the cluster size should be, feel free to write to me.<br /><br />If you are the owner of such an old typewriter, it would be helpful to have a clean-room floppy copy, i.e. 
a freshly formatted floppy with a small test text, so that I can reverse engineer the data format even better.<br /><br />Just contact me at art1piratatgoogledotcom </p><p> </p><h2 style="text-align: left;">Supporting Links </h2><p style="text-align: left;"><a href="https://archive.org/details/MSXTechnicalDataBook/page/n269/mode/2up">https://archive.org/details/MSXTechnicalDataBook/page/n269/mode/2up</a></p><p style="text-align: left;"><a href="https://github.com/Konamiman/MSX2-Technical-Handbook/blob/master/md/Chapter3.md#3--structure-of-disk-files">https://github.com/Konamiman/MSX2-Technical-Handbook/blob/master/md/Chapter3.md#3--structure-of-disk-files</a></p><p style="text-align: left;"><a href="https://manualsbrain.com/ja/products/panasonic-kx-w1510/">https://manualsbrain.com/ja/products/panasonic-kx-w1510/</a></p><h2 style="text-align: left;">Thanks</h2><p style="text-align: left;">My thanks go to </p><ul style="text-align: left;"><li>Panasonic Museum</li><li>David Murray, <a href="https://en.wikipedia.org/wiki/The_8-Bit_Guy">the 8-bit Guy</a></li><li>John Wash, for his analysis of KX-W1000 floppies, <a href="https://surrey.lug.org.uk/panasonic-kx-w1000">https://surrey.lug.org.uk/panasonic-kx-w1000</a></li><li>Jason Kleiner</li><li>the developers and maintainers of Okteta, Kaitai, wxHexEditor, Debian, Kryoflux…<br /></li></ul><p>Detectorist - Part one "First indications" (published 2021-11-28, updated 2021-12-03)</p><h1 style="text-align: left;">First indications <br /></h1><p>An estate contained several CD-ROMs, DVDs and, above all, floppy disks. We were able to read most of them with Linux, including the floppy disks.
Only on the last 8 floppy disks did we have a hard time. </p><p>On one of the floppy disks a small inscription peeked out, referring to a Panasonic electronic typewriter.
There was none in the estate, and no other information was available. </p><p>Eight floppy disks, 3.5", double density, not readable. </p><p>Time passed, and I was constantly haunted by the voice in my mind: "There's something on the disks, only what?" </p><div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWDyIj7hVxxe9ybIumT_H2CBpgZCzNoh6BJI8FxvVJKXH0kUwICnAcCf70PGWn7wyAgzGoXnYgqJVZf0AVa7iJBPjW1a8u1stE75HKKS0X3SnWLncF541TYe8xW0m3dB3I95THtq2vi2u/s4608/IMG_20211111_155604_680.jpg" style="clear: right; display: block; float: right; margin-bottom: 1em; margin-left: 1em; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="4608" data-original-width="3456" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWDyIj7hVxxe9ybIumT_H2CBpgZCzNoh6BJI8FxvVJKXH0kUwICnAcCf70PGWn7wyAgzGoXnYgqJVZf0AVa7iJBPjW1a8u1stE75HKKS0X3SnWLncF541TYe8xW0m3dB3I95THtq2vi2u/w240-h320/IMG_20211111_155604_680.jpg" width="240" /></a></div><p>We managed to purchase a Kryoflux controller (see <a href="https://kryoflux.com/">https://kryoflux.com/</a>; there are also free open-source alternatives). This is a special disk controller that allows you to record the magnetic flux as the read heads move over the disk.</p><p> </p><p>After the first attempts, I was able to create an image file with the following command: </p><pre>./dtc -fIMAGEFILE -dd1 -g2 -i4</pre><p>The options mean:</p><ul style="text-align: left;"><li>"-dd1" - double density</li><li>"-g2" - double sided</li><li>"-i4" - MFM sector image 40/80+ tracks<br /></li></ul><p> </p><p>A look at the image in a hex editor confirmed my hunch. After the first three zero bytes, the string "KX-W1510 v1.00" followed (and to my happy surprise, a lot of readable text fragments). 
</p><p>Yep, there is exactly one electronic typewriter series from Panasonic with that name.</p><h1 style="text-align: left;">Disillusionment<br /></h1><p>I was able to find a manual at <a href="https://manualsbrain.com/en/manuals/1814281/">https://manualsbrain.com/en/manuals/1814281/</a>. And yes, the machine used 3.5" floppy disks, double sided, double density, with a capacity of 713,000 characters. Unfortunately, the manual gives no exact description of the disk format or the file system.</p><p>I then contacted Panasonic support - no success. I started researching patent databases in Japan, the USA and Germany - nothing. I wrote to the <a href="https://www.panasonic.com/global/corporate/history/panasonic-museum.html">Panasonic museum</a> in Japan, but unfortunately they could not help me.</p><p><br />A proprietary disk format, forgotten after 30 years.</p><p><br /></p><p><i>In the next part I will report what I could find out about the disk system of the Panasonic typewriter KX-W1510, and where I (still) fail...<br /></i></p><p><br /></p><p>Backup is digital long-term preservation! (published 2021-04-01, updated 2021-11-28)</p><h2 style="text-align: left;">Exponential growth<br /></h2><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://cdn.statcdn.com/Infographic/images/normal/17727.jpeg" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="570" data-original-width="800" height="229" src="https://cdn.statcdn.com/Infographic/images/normal/17727.jpeg" title="https://www.statista.com/chart/17727/global-data-creation-forecasts/" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a 
href="https://www.statista.com/chart/17727/global-data-creation-forecasts/"><span style="font-size: xx-small;">https://www.statista.com/chart/17727/global-data-creation-forecasts/</span></a></td></tr></tbody></table><br />An important observation is that the number of files produced each year continues to increase worldwide (see <a href="https://en.wikipedia.org/wiki/Information_explosion">https://en.wikipedia.org/wiki/Information_explosion</a>). And with it the number of digital objects increases in the same measure, for which we must decide: Keep or throw away? <p></p><p>The truth is, the discard scenario becomes the more likely one with each passing year.</p><h2 style="text-align: left;">Magnificent diversity <br /></h2><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii4ayed8oyKgvH7KMRruNs2vZ1ObsElCYTwMKNO2eYUVCKV8l27YQcy6_X4kBifvRccOAsuZ0Gv0s1EKyExsar-9sEeULr5BC0J0BGEa6x-ZUukQD10_zonfjSw4e8N_4DPJJZJcbEVxbZ/s1140/droid_categories.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="710" data-original-width="1140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii4ayed8oyKgvH7KMRruNs2vZ1ObsElCYTwMKNO2eYUVCKV8l27YQcy6_X4kBifvRccOAsuZ0Gv0s1EKyExsar-9sEeULr5BC0J0BGEa6x-ZUukQD10_zonfjSw4e8N_4DPJJZJcbEVxbZ/s320/droid_categories.png" width="320" /></a></div> Another observation is that about 90 new file formats are added every year.<br /> And the file formats that are being dropped are already in place. <p></p><p> </p><p>The truth is, no one can build up format knowledge for this yet.</p><p> </p><h2 style="text-align: left;">A fuzzy concept <br /></h2><p>When talking to colleagues, the topic of validation does not play a role. For one thing, no one is clear about what "<a href="https://en.wikipedia.org/wiki/Validity">valid</a>" means. Valid against a specification? Valid against a profile? 
Valid because it can be opened by programs? On the other hand, nothing happens after that. If a file is broken, it is still archived. If it is not broken, fine. </p><p>The truth is, validation is useless.</p><p></p><p> </p><h2 style="text-align: left;">Success factors <br /></h2><p>Do you know how the success of digital preservation is measured? I'll tell you, in terabytes per year. If the numbers go up, that's a good thing to sell to politicians. Whether it was difficult to prepare digital objects for long-term availability doesn't matter. Whether born-digitals are more at risk, never mind. </p><p>Is that the truth?<br /></p><h2 style="text-align: left;">Overrated <br /></h2><p>It used to be said that long-term digital archiving could only be handled by organizations with a minimum of resources. Look around and you'll find dozens of one-man orchestras and part-time archives. And do you think that as the amount of data increases, so do the human resources? Oh, come on! </p><p> You know the truth!</p><h2 style="text-align: left;">That's too exhausting <br /></h2><p>If you've ever heard of format migration as a principle of long-term preservation, you've read in textbooks phrases like </p><p></p><blockquote><i>To ensure format migration, the significant properties of groups of objects that must be preserved must be determined.</i> </blockquote><p></p><p>Have you ever seen an archive that has actually determined and documented <a href="https://duckduckgo.com/?q=signifcant+properties+preservation&t=ffab&ia=web">significant properties</a>? </p><p>The truth is, significant properties are determined after the fact from technical metadata.</p><p></p><h2 style="text-align: left;">Summary <br /></h2><p>So what is digital long-term preservation? 
Only an expensive <a href="https://en.wikipedia.org/wiki/Backup">backup</a>.<br /><br /></p>Unknownnoreply@blogger.com0Dresden, Deutschland51.0504088 13.737262122.740174963821154 -21.418987899999998 79.360642636178852 48.8935121tag:blogger.com,1999:blog-9052940887756266577.post-74558689909608675312021-01-27T06:59:00.011-08:002021-01-28T05:19:32.985-08:00Impossible - or how I learned to read data storage media at the speed of light and what it's good for<p></p><br />When I receive data carriers from an estate, I want to get a quick overview of what is on the floppy disk, the CD-ROM, the USB stick or the hard disk drive so that I can look at the interesting things first.<br /><br />But I only know what is there once I have read the media, right? A typical chicken-and-egg problem. <p></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://openclipart.org/image/800px/212857" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="598" height="320" src="https://openclipart.org/image/800px/212857" width="239" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;">https://openclipart.org/detail/212857/sci-fi-scanner-device</span></td></tr></tbody></table>I discovered the crucial clue to the solution in a 2014 talk by Simson Garfinkel, "<i>Digital Forensics Innovation: Searching A Terabyte of Data in 10 minutes</i>" (<a 
href="http://simson.net/ref/2014/2014-02-21_RPI_Forensics_Innovation.pdf">http://simson.net/ref/2014/2014-02-21_RPI_Forensics_Innovation.pdf</a>)<br /><p></p><h2 style="text-align: left;">What is Random Sampling?</h2><p>Random sampling is nothing more than looking at only every n-th part of a total set and inferring the big picture.</p><p>To find out what is on a medium, it would be sufficient to look at random blocks and determine for them, based on their byte structure, whether they fall into the categories "<i>empty</i>", "<i>random</i>", "<i>text</i>", "<i>video</i>" or "<i>undef</i>".<br /><br />Exactly this approach is implemented in the Perl module <span style="font-family: courier;">File::FormatIdentification::RandomSampling</span>, which can be found on CPAN under <a href="https://metacpan.org/pod/File::FormatIdentification::RandomSampling">https://metacpan.org/pod/File::FormatIdentification::RandomSampling</a>.</p>The category "<i>empty</i>" is dominated by sequences of zero bytes, in the category "<i>random</i>" the byte values are almost equally distributed, in the category "<i>text</i>" values for the characters "a-z" from the ASCII character set appear frequently, "<i>video</i>" contains frequent byte sequences resulting from the basic structure of MPEG. And under "<i>undef</i>" everything else is subsumed.<h2 style="text-align: left;">Example</h2><p>The above Perl module contains the program <span style="font-family: courier;">crazy_fast_image_scan.pl</span>. The following simple call:</p><p><span style="font-family: courier;">perl -I lib bin/crazy_fast_image_scan.pl --percent=0.000001 --image=/dev/mapper/laptop--vg-home</span></p><p>provides the following output:</p><p><span style="font-family: courier;">Scanning Image /dev/mapper/laptop--vg-home with size 728982618112, checking 1423 sectors<br />scanning [...] 
<br />Estimate, that the image '/dev/mapper/laptop--vg-home'<br />has percent of following data types:<br /> 44.6% random/encrypted/compressed<br /> 35.6% undef<br /> 11.0% empty<br /> 5.4% video/audio<br /> 3.5% text</span><br /></p><p>The complete output is even more extensive. It is important to note that the examined partition was about 679 GiB in size (728982618112 bytes) and was scanned in just 15s.<br /></p><p></p><h2 style="text-align: left;">Limits</h2><p>Importantly, the output provides only a rough estimate of what might be on the media. The choice of the sample size (here: via the <span style="font-family: courier;">--percent</span> parameter) determines how reliable the estimate is, as well as how long it takes until a result can be delivered.</p><h2 style="text-align: left;">More ideas</h2><p>In the above module, I have implemented an experimental output of the MIME types potentially present on the media. This is not very stable yet and needs more work, but it can help to estimate even better whether the files on a disk are interesting enough to prioritize it. Here is an example output:</p><p><span style="font-family: courier;">The next mimetype estimation is experimental and needs further work:<br /> 87.9% unknown<br /> 3.5% application/pdf<br /> 1.1% video/quicktime<br /> 0.8% image/gif<br /> 0.8% text/java<br /> 0.7% application/msword<br /> 0.6% text/markdown<br /> 0.6% application/vnd.openxmlformats-officedocument.wordprocessingml.document<br /> 0.6% application/xml<br /> 0.4% application/msaccess<br /> 0.4% application/navimap<br /> 0.4% application/rtf<br /> 0.3% image/png<br /> 0.2% application/arj<br /> 0.1% application/vnd.ms-powerpoint<br /> 0.1% text/html</span><br /></p><p>The approach is to determine the MIME type of the files in a test corpus using other tools, determine typical bytegram values and pass the whole thing to a decision tree learner. If you are interested, you are welcome to contribute to the module. 
</p><p>Happy scanning!</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-85298449605095531342020-08-10T11:01:00.001-07:002020-08-10T11:01:43.223-07:00It is nonsense to consider significant properties only at file level<p>As it looks, most archives record significant properties at the file level (by the way, they often mean technical properties, which is not the same thing, but that is a topic for another blog post). This is insufficient, and I will give two examples.</p><h1 style="text-align: left;">Example 1 - Retro-digitised material</h1><p>If monographs are scanned, as we do in-house, in order to preserve the originals and make them accessible to users, images are created. If you look at these image files, you can determine the following significant properties:<br /></p><ul style="text-align: left;"><li>readable</li><li>accessible for OCR analysis</li><li>reproducible</li><li>maybe even true to color</li></ul><p style="text-align: left;"> </p><p>These properties can then be used to define technical parameters that can be found in certain requirement profiles and can lead, for example, to the recommendation of the TIFF file format.<br /></p><p>In the above consideration, the significant property "the order of the scans should correspond to the original" (pagination) is missing from the list. This property could be implemented by combining all scanned pages into one file, e.g. as BigTIFF or PDF/A. However, there may be good reasons not to include all pages in one file. What next? The remaining option is to add a file describing the structure of the digitized material in addition to the TIFF files. This can be a METS XML file, for example. METS is a good choice because it was created for this very purpose. Hmmm, is METS not a metadata format? And doesn't metadata belong outside of the payload? And isn't METS used by several archive information systems to map the AIPs? 
So can I not pack the structuring data into it?<br /></p><p><b>Stop!</b><br /><br />It is true, METS is a metadata format. And it is true that METS is often used to describe container structures in SIPs or AIPs. But we have to distinguish between metadata describing the IE (i.e. the payload) and metadata inherently belonging to the payload. This is not easy, but here the significant properties help us: If the METS is used, as in our example, to represent the significant property "pagination", then the METS is part of the IE, otherwise it is not.<br /></p><p>Now you might be tempted to get sloppy and just put the "pagination" into the METS of the AIP. Is that a good idea? No, because the IE should be kept available and usable. The AIP should only contain the metadata necessary to ensure availability. But when users later access the payload via a DIP, they should have everything together, i.e. an intellectual unit as it was actually intended. This is the principle of independence.<br /><br />I admit that sounds abstract and difficult. But let us try an analogy. If I have loose pages where the order is important, then the order is important, whether the pages are archived or not. For example, I bind them into a book or use other techniques. This is my intellectual unit that I want to archive. I put the whole thing in a box and write on it what is in it and what happened to the box or the content during archiving. This is then my AIP. If I want to hand over the contents of this box to someone later, they don't necessarily have to be interested in what happened to the box; they can take the contents, work with them, and know exactly in which order the pages follow each other. <br /><br /></p><h1 style="text-align: left;">Example 2 - Web page</h1><p><br />I would like to present a second example to illustrate another aspect. Let us assume that we are to archive a very specific web page, which for the sake of simplicity consists of an HTML document, CSV files and graphic files. 
If you look at the web page, there is always a link in the text between one of the CSV files and one graphic file. The assignment could be the visualization of an experiment. It is only important to the department that the values, the textual content and the assignment to the graphic are not lost. Together with the department we determined the significant properties and after a lot of effort we transferred the website (IE) into the long-term archive. After some time we found out that the graphic files were subject to format obsolescence and had to be migrated to a new format. We decide on the new image archive format PNG/A and migrate the old files.</p><p>But is this sufficient? No. The HTML document still contains the file names of the old format. Should we change the file names or leave them as they are? The principle of least surprise argues for "change". But if we change the file names during the migration, we <b>inevitably</b> have to change the file names <b>in</b> the HTML document as well.<br /><br /></p><h1 style="text-align: left;">Let's summarize</h1><ol style="text-align: left;"><li>Significant properties should be recorded at the level of the IE. They are not file dependent.</li><li>Metadata that is essential to represent the relationship of objects <b>within</b> an IE is a <b>mandatory</b> part of the IE.</li><li>Format migrations can result in changes to other parts of the IE, even if those parts are not migrated themselves.</li><li>Metadata and data that are <b>inside</b> an IE must never refer to data or metadata outside it.</li><li>Metadata outside of an IE, however, may well reference metadata and data of an IE. 
</li></ol><p><br />Whew, that was a lot of thinking, but I hope it was worth it.<br /><br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-78883460131715762482020-07-22T08:57:00.002-07:002020-07-22T08:57:48.969-07:00Format recognition, new analysis options?<div><h1 style="text-align: left;">Previous work</h1></div><div><br /></div><div>In an older article (see <a href="https://kulturreste.blogspot.com/2018/10/heres-tool-make-it-work.html">https://kulturreste.blogspot.com/2018/10/heres-tool-make-it-work.html</a>) I have already done an analysis of PRONOM signatures. As of today, the module for this is available on CPAN, see <a href="https://metacpan.org/pod/File::FormatIdentification::Pronom">https://metacpan.org/pod/File::FormatIdentification::Pronom</a> for details.<br /></div><br />In addition to the statistics on PRONOM signatures, the Perl package comes with two more helper scripts that can make the work of a long-term archivist easier.<br /><div><br /></div><div><h1 style="text-align: left;">Format identification</h1></div><div><br /></div>On the one hand, we have the functionality of classic format recognition. The script delivers all hits. The output indicates the quality of the regular expression. This value does not say how well the PRONOM signature matches the file, but how specific the signature itself is.<br /><br /><div>Here is an example output for a TIFF file, which was wrongly recognized as GeoTIFF by DROID:</div><div><br /></div><div><div style="margin-left: 40px; text-align: left;"><code>perl -I lib bin/pronomidentify.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif</code></div>
<div style="margin-left: 40px; text-align: left;"><pre>/tmp/00000007.tif identified as Tagged Image File Format with PUID fmt/353 (regex quality 1)
/tmp/00000007.tif identified as Geographic Tagged Image File Format (GeoTIFF) with PUID fmt/155 (regex quality 2)</pre></div></div><br /><div><br /></div><div><h1 style="text-align: left;">Colorized output of possible signature hits in the hexeditor wxHexEditor</h1></div><div><br /></div><div>Under Linux you can use the editor wxHexEditor to analyze files. It allows you to create tag-files, in which you can define sections that are marked with colors and annotated.<br /><br />The script pronom2wxhexeditor creates such a file. In the following you can see the call and a screenshot.</div><div><br /></div><div style="margin-left: 40px; text-align: left;"><code>perl -I lib bin/pronom2wxhexeditor.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif</code></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaCdXQpEtVlYap-p_urCHKg5O5iBubNH6KkqPEqUmX99ZOHOmDgcQCjcsXfbOW5BQ_rPKl5d-4Ow1Yiql814c_lL6G8Pb2QjEm7rtVAJG0tpwveicsWNYdUsilPfb0cyDGTw8rRSMu6Zth/s902/Screenshot_wxHexEditor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="433" data-original-width="902" height="301" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaCdXQpEtVlYap-p_urCHKg5O5iBubNH6KkqPEqUmX99ZOHOmDgcQCjcsXfbOW5BQ_rPKl5d-4Ow1Yiql814c_lL6G8Pb2QjEm7rtVAJG0tpwveicsWNYdUsilPfb0cyDGTw8rRSMu6Zth/w625-h301/Screenshot_wxHexEditor.png" width="625" /></a></div><div><br /><h1 style="text-align: left;">What next?</h1><br />Well, it's up to us as a community to use the existing tools and use their possibilities to improve our daily work. 
Anyone who has suggestions for improvement or ideas is welcome to share them with us.<br /><br />I would be especially happy if some helpful souls would take the PRONOM statistics to heart and help improve the PRONOM signatures.<br /><br />It makes sense to start with the orphaned signatures and to re-check signatures that are used multiple times.<br /><br /></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-73055112584168798622020-07-13T05:05:00.001-07:002020-07-13T05:05:27.771-07:00Why it is a stupid idea to consider CSV as a valid long-term preservation file format<div>
<h1 style="text-align: left;">
Take CSV!</h1>
</div>
<div>
It's so nice and quick and easy to say. Take CSV!</div>
<br />
For simple cases that may be true. CSV files look so simple, so innocent, so sweet. Yet by their very nature they are insidious, vicious, and resemble a bloody walk into the deepest dungeons of classic role-players.<br />
<br />
<div style="text-align: left;">
Let us begin our journey.</div>
<div>
<h1 style="text-align: left;">
Innocent simplicity</h1>
You take a separator, e.g. the comma, use it to separate your values. Pour both into readable form. Done.<br />
<br />
Okay. We need a second separator to show us the next line. But then, done! It's a CSV.<br />
<br />
Hmm. There was something. Line separator. Now, is that line feed, carriage return, or carriage return followed by line feed? It depends, for example, on which operating system you're running.</div>
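<div>
As a rough illustration of why the line separator matters, here is a minimal Python sketch. It is not part of any tool mentioned in this blog; the function name <code>count_line_endings</code> and the sample bytes are made up for this example. It counts which of the three conventions occur in the raw bytes of a file:</div>

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR and lone LF line terminators in raw bytes."""
    crlf = len(re.findall(rb"\r\n", data))
    lone_cr = len(re.findall(rb"\r(?!\n)", data))   # CR not followed by LF
    lone_lf = len(re.findall(rb"(?<!\r)\n", data))  # LF not preceded by CR
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# Hypothetical sample that mixes all three conventions.
sample = b"a,b\r\nc,d\ne,f\rg,h\r\n"
counts = count_line_endings(sample)
print(counts)
```

<div>
If more than one counter is non-zero, the file mixes conventions, and a reader that assumes a single one may split records in the wrong places.</div>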
<div>
<h1 style="text-align: left;">
The monster is growing</h1>
</div>
<div>
It is not a bad idea to separate values of a list by commas. Especially for Americans, this feels quite natural.</div>
<div>
<br />
In other parts of the world, the decimal places of fractional numbers are separated by commas. Good, then we'll give the spreadsheets the opportunity to define the separator freely. Problem solved.</div>
<div>
<br /></div>
<div>
Well, not quite. In other contexts the separator could still appear inside the individual values of a list. Good, then we'll introduce quoting. We define a character that allows us to recognize whether a separator is a separator or just a text component of a list value. Quotation marks would fit. That was easy, wasn't it?</div>
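<div>
To make the interplay of a freely chosen separator and quoting concrete, here is a small sketch using Python's standard <code>csv</code> module. The sample rows and the choice of the semicolon are assumptions for this example only:</div>

```python
import csv
import io

# Hypothetical data: one value contains the separator, one a decimal comma.
rows = [["name", "price"], ["Tea; green", "1,5"]]

buf = io.StringIO(newline="")
writer = csv.writer(buf, delimiter=";", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()
print(text)

# Reading with the wrong assumption (comma as separator, no quoting rules)
wrong = [line.split(",") for line in text.splitlines()]

# Reading with the same separator and quote character that were used to write
right = list(csv.reader(io.StringIO(text, newline=""),
                        delimiter=";", quotechar='"'))
```

<div>
Only the reader that knows both the separator and the quote character gets the original rows back; the naive split cuts the price in two and leaves the quotation marks in the value.</div>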
<div>
<h1 style="text-align: left;">
Short break</h1>
So, to sum up. CSV files are easy. You need a separator, which can be a comma or anything else. We have a second separator that separates the lines. Usually there are three variations. We need quoting so that a separator inside a value is not mistaken for a real separator.</div>
<div>
<br /></div>
<div>
Yeah, it may have been a little more complex than it looked at first. But what is there to make it worse?</div>
<div>
<h1 style="text-align: left;">
Little toothy pegs!</h1>
Hmm, what if I want to store a text like this as a value after the raw value 1:<br />
<br />
<div style="margin-left: 40px; text-align: left;">
<i>And he said "Oh, no!"</i></div>
<br />
In the text, we have a comma, which would be protected by quoting. But we also have quotation marks, which we need for our quoting. No problem, then we double each quotation mark inside the value to indicate that the text is not finished. So in the CSV it now looks like this:<br />
<br />
<div style="margin-left: 40px; text-align: left;">
<span style="font-family: courier;">1,"And he said ""Oh, no!"""</span></div>
<br />
I got it.<br />
<br />
But, wait, what happens if my text consists of a single quotation mark?<br />
<br />
<div style="margin-left: 40px; text-align: left;">
<span style="font-family: courier;">1,""""</span></div>
<br />
You're lucky. It seems to be working.</div>
<div>
<br /></div>
<div style="text-align: left;">
Wait, so what if I have a lot of quotation marks? As in<br />
<br />
<div style="margin-left: 40px; text-align: left;">
<i>""""""</i></div>
<div style="text-align: left;">
This is translated to</div>
<div style="margin-left: 40px; text-align: left;">
1, """"""""""""""</div>
<div style="margin-left: 40px; text-align: left;">
<br /></div>
It works, too.<br />
<br />
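<div>
The doubling rule described above can be checked with Python's standard <code>csv</code> module. This is only a sketch; the helper name <code>to_csv_line</code> is made up for this example:</div>

```python
import csv
import io


def to_csv_line(*values) -> str:
    """Serialize one record with the csv module and drop the line terminator."""
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n").writerow(values)
    return buf.getvalue()[:-1]


line_text = to_csv_line(1, 'And he said "Oh, no!"')  # comma and quotes inside
line_single = to_csv_line(1, '"')                     # a single quotation mark
line_six = to_csv_line(1, '"' * 6)                    # six quotation marks

for line in (line_text, line_single, line_six):
    print(line)
```

<div>
Every quotation mark inside a value is doubled, and the whole value is wrapped in one more pair. Six quotation marks therefore turn into fourteen.</div>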
<h1 style="text-align: left;">
The problem is in the details</h1>
Now, a nasty little devil might get the idea to construct a text as value that contains line breaks, for example this one:<br />
<br />
<div style="margin-left: 40px; text-align: left;">
<i>Evil Text<br />",<br />",</i></div>
<br />
That would then become:<br />
<br />
<div style="margin-left: 40px;">
<span style="font-family: courier;">1,"Evil Text<br />"",<br />"","</span></div>
<br />
Oops! If I now stubbornly read this in line by line, I would have read strange lines. </div>
<div style="text-align: left;">
Good thing there is real software out there that reads and parses CSV files cleanly from the beginning. Not that anyone here still uses 'grep' and co.</div>
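<div>
How badly line-by-line reading goes wrong for such a value can be shown in a few lines of Python. This is a sketch using the standard <code>csv</code> module; the variable names are chosen for this example only:</div>

```python
import csv
import io

# The nasty value from above: a text with line breaks, quotes and commas.
evil = 'Evil Text\n",\n",'

buf = io.StringIO(newline="")
csv.writer(buf).writerow([1, evil])
data = buf.getvalue()

# Naive approach: treat every physical line as one record.
naive_lines = data.splitlines()

# Proper approach: let a CSV parser decide where a record ends.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines), "physical lines")
print(len(records), "logical record(s)")
```

<div>
The parser returns exactly one record with two fields, while the naive split produces three fragments, none of which is a valid record on its own.</div>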
<h1 style="text-align: left;">
The Abyss</h1>
Have we actually talked about character encoding yet? ASCII, Latin-1, UTF-32? UTF-8? With or without byte-order mark? No. Let's turn back. We still have a chance.<br />
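<div>
For those who do not turn back, here is a brief glimpse into the abyss, as a Python sketch. The sample text is made up, and the helper <code>has_utf8_bom</code> is a name invented for this example:</div>

```python
import codecs

text = "Größe;1,5\n"  # hypothetical CSV line with non-ASCII characters

as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
as_utf8_bom = text.encode("utf-8-sig")  # UTF-8 with byte-order mark


def has_utf8_bom(data: bytes) -> bool:
    """Check whether the bytes start with the UTF-8 byte-order mark."""
    return data.startswith(codecs.BOM_UTF8)


# Same text, three different byte sequences of different lengths.
print(len(as_latin1), len(as_utf8), len(as_utf8_bom))

# Guessing wrong in one direction silently produces garbage ...
misread = as_utf8.decode("latin-1")

# ... and in the other direction it fails outright.
try:
    as_latin1.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

<div>
Nothing in the file itself says which of these encodings was used, apart from the optional byte-order mark.</div>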
<h1 style="text-align: left;">
Later, at the pub.</h1>
I admit it was a terrible trip. Now, over a cold beer, we can laugh about it. But our hearts were in our mouths. We had no idea what to expect.<br />
<div>
<br /></div>
<div>
If only there had been a sign that said what character encoding, what line end encoding, what separators for lines and columns we could expect, yes, then we would have been able to understand CSV and we would have been spared the horror. But the horror comes from the darkness, from the premonitions of the unknown. </div>
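<div>
Such a "sign" could be a small description file stored next to the CSV. The following Python sketch assumes a made-up JSON descriptor; the key names (<code>encoding</code>, <code>delimiter</code>, <code>quotechar</code>, <code>lineterminator</code>) are hypothetical and do not follow any particular standard:</div>

```python
import csv
import io
import json

# Hypothetical sidecar file content describing how the CSV was written.
descriptor_json = json.dumps({
    "encoding": "utf-8",
    "delimiter": ";",
    "quotechar": '"',
    "lineterminator": "\r\n",
})


def read_with_descriptor(raw: bytes, descriptor: dict) -> list:
    """Decode and parse CSV bytes according to the sidecar description."""
    text = raw.decode(descriptor["encoding"])
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=descriptor["delimiter"],
                        quotechar=descriptor["quotechar"])
    # Note: Python's csv reader detects line ends itself, so the
    # "lineterminator" entry is documentation for other tools here.
    return list(reader)


raw = 'name;price\r\n"Tea; green";1,5\r\n'.encode("utf-8")
rows = read_with_descriptor(raw, json.loads(descriptor_json))
print(rows)
```

<div>
With the description at hand, nothing has to be guessed. Without it, every one of these four parameters is a premonition of the unknown.</div>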
<div>
<br /></div>
<div>
Therefore, be warned!<b><br /></b></div>
<div>
<b><br /></b></div>
<div>
<b>Don't use CSV, it could get you!</b></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-64030660696486855922020-02-18T02:26:00.002-08:002020-02-18T05:38:33.540-08:00format zoo for videos - a bad idea in digital preservation<h2>
Background </h2>
<br />
In an article on <a href="https://axfelix.github.io/ffv1">https://axfelix.github.io/ffv1</a>, reasons are given not to apply the existing normalization of born-digital videos to FFV1, but to convert to lossy codecs instead. Elsewhere I even heard that normalization is not applied at all because it requires so many resources.<br />
<br />
<br />
<h2>
Why is normalization a good idea after all?</h2>
<br />
Normalization ensures that a manageable set of file formats remains from the huge format zoo, which can be handled well in the future. Normalization therefore reduces the organizational complexity above all.<br />
<br />
<h2>
And why should you use Matroska/FFV1?</h2>
<br />
<a href="https://github.com/FFmpeg/FFV1">FFV1</a> has the disadvantage of imposing higher storage requirements on its users, but in my opinion, the following points outweigh it:<br />
<br />
<ul>
<li>FFV1 is much less complex than h264 (read "reduced technical complexity")</li>
<li>FFV1 (like other lossless codecs) allows automatic format migration (see also <a href="https://mediaarea.net/RAWcooked">RAWcooked</a>) — this reduces organizational complexity</li>
<li>FFV1 is freely available, widely used, well documented and standardized</li>
</ul>
<br />
<br />
The point that FFV1 is also more resistant to bit rot is just the icing on the cake.<br />
<br />
<br />
<h2>
Summary </h2>
<br />
Incidentally, personnel cost is the cost driver in digital preservation, as opposed to the pure storage cost.<br />
<br />
Hence, the ultimate question is: how expensive is storage capacity in relation to the reduced technical and organizational complexity?Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-86254089279574232812019-05-29T01:43:00.000-07:002019-05-29T01:43:26.281-07:00Legacy media<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPxq24HydaR060-y3dNUJ_Thvil9lk-5Qi8H9RfWhYzMiiE5CtbS89f_1SVuMDw_8kyOWQN_zUoG1oZcpmVYr6QtTtjNc8RGzyipyopIOfRrpHqqDLYhg3-3cDF0jrF-OR_a-QXcSFqc6F/s1600/IMG_20190529_103216498.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPxq24HydaR060-y3dNUJ_Thvil9lk-5Qi8H9RfWhYzMiiE5CtbS89f_1SVuMDw_8kyOWQN_zUoG1oZcpmVYr6QtTtjNc8RGzyipyopIOfRrpHqqDLYhg3-3cDF0jrF-OR_a-QXcSFqc6F/s320/IMG_20190529_103216498.jpg" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdblGpX0KHoqUf5ySAtp29eRVygNrrvkQwQlASfJdWXn1I9P2NJxqfJ2XO2r6s2OueNhLSJwjZtcASbtUb2lUUEPIIZDdQYc2l2VBzByuc6izEI_ZcLcVnhhcT7UcG-XOsaeUtjUa0oDVW/s1600/IMG_20190529_103141806.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdblGpX0KHoqUf5ySAtp29eRVygNrrvkQwQlASfJdWXn1I9P2NJxqfJ2XO2r6s2OueNhLSJwjZtcASbtUb2lUUEPIIZDdQYc2l2VBzByuc6izEI_ZcLcVnhhcT7UcG-XOsaeUtjUa0oDVW/s400/IMG_20190529_103141806.jpg" width="300" /></a>This is the reason why you have to pay special attention to legacy digital media. 
Defective tracks on a floppy disk: special hardware (and knowledge) is necessary here.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-36443096366597155922019-04-01T00:08:00.000-07:002019-04-01T00:08:23.013-07:00Beware of bitfish - collection preservation in the digital agePest control is a perennial problem in libraries and archives. Silverfish, paperfish and other culprits feast on the holdings and cause considerable damage in the process.<br />
<br />
Since pest control is not listed as an explicit task in the OAIS reference model, some digital long-term archives have so far shown clear deficits here. Meanwhile, however, these institutions are also feeling more and more clearly that pest control must not be neglected.<br />
<br />
Attracted by extensive digital holdings, bitfish and beetles (called "bugs" in technical jargon) nest in piles of cables and multiply there undisturbed. The abundant cable spaghetti provides plenty of food, so the populations grow quickly. Leftover junk and crumbs of binary garbage make the problem even worse.<br />
<br />
Not only the number of these little fish is a problem, but also their long lifespan. Many of them live to be eight to ten years old, microfichefish even considerably older.<br />
<br />
Magnetic tapeworms also feel at home in the musty milieu of many digital archives, feasting above all on the data on WORM tapes. Data that is not destroyed by the little pests decays in the putrid environment through bit rot into unreadable data compost, which clogs the data lines and thus disrupts processing.<br />
<br />
The new plague does have one good side, though: resourceful computer scientists have found out that bitfish are excellently suited to the production of bit grease. They use it to lubricate line connections and thus reduce friction during data transmission, which in turn has a positive effect on throughput.Jörg Sachsehttp://www.blogger.com/profile/17097541683565972324noreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-69769757121193486532018-10-05T02:06:00.003-07:002018-10-08T01:00:36.834-07:00Here's a tool, make it work!As you may have noticed in the last post, to analyze the hits of DROID signatures I wrote a small Perl script that converts DROID signatures into Perl regular expressions and writes the matches into tag files for the hex editor wxHexEditor, so that you can see which signatures matched where in a file.<br />
<br />
From this small script, a larger Perl module called "File::FormatIdentification::Pronom" has grown. It is not meant to replace DROID, Fido or Siegfried. It only serves to analyze which patterns can be optimized, and it provides statistics that can help to improve the PRONOM database in the future.<br />
The following statistics for the current DROID signature file give you a feeling for what is possible.<br />
<blockquote class="tr_bq">
<pre style="white-space: pre-wrap;">perl -I lib/ bin/pronom_statistics.pl ../DROID_SignatureFile_V94.xml
Statistics of file ../DROID_SignatureFile_V94.xml
=======================================
Countings
---------------------------------------
Count of PUIDs: 1670
internal IDs: 1441
regular expressions: 1730
file endings: 1167
PUIDs with file endings only: 503
(56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
orphaned internal IDs: 20
(56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
Quality of internal IDs
---------------------------------------
1-best quality internal ID (PUID, name): 110 (fmt/75, Drawing Interchange File Format (ASCII)) -> 4.882;3.135
combined regex: (?=((\x0A)|(\x0D\x0A)(0))SECTION((\x0A)|(\x0D\x0A)(\x20\x202)((\x0A)|(\x0D\x0A)(HEADER)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(9))\$ACADVER((\x0A)|(\x0D\x0A)(\x20\x201)((\x0A)|(\x0D\x0A)(AC1009)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(0))ENDSEC((\x0A)|(\x0D\x0A)))(?=(((\x0A)|(\x0D\x0A)(0))EOF((\x0A)|(\x0D\x0A)))\Z)
2-best quality internal ID (PUID, name): 105 (fmt/70, Drawing Interchange File Format (ASCII)) -> 4.736;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC((1001)|(2\x2E21)|(2\x2E22)(\x0D\x0A))0
ENDSEC
)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
3-best quality internal ID (PUID, name): 104 (fmt/69, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC2\x2E10\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
4-best quality internal ID (PUID, name): 103 (fmt/68, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E50\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
5-best quality internal ID (PUID, name): 102 (fmt/67, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E40\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
1-worst quality internal ID (PUID, name): 1299 (fmt/950, MIME Email) -> -1.993;-2.91;-2.776;-2.776;-2.29
combined regex: (?=\A.{0,16384}(((V)|(v)(\x2D)((IME)|(ime)(M)))ersion: 1\.0))(?=\A.{0,16384}(To\x3A\x20))(?=\A.{0,16384}(From\x3A\x20))(?=\A.{0,16384}(Date\x3A\x20))(?=\A.{0,16384}(Content\x2DType\x3A\x20))
2-worst quality internal ID (PUID, name): 527 (fmt/358, Internet Data Query File) -> -2.806;-2.743;-2.629;-2.981
combined regex: (?=\A.{0,3424}(\x5BQuery\x5D).*(((S)|(s)(i)((C)|(c)))cope=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((C)|(c)(i)((C)|(c)))olumns=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((T)|(t)(i)((C)|(c)))emplate=\/))(?=\A.{0,3424}(\x5BQuery\x5D).*(((R)|(r)(i)((C)|(c)))estriction=.?(\x25)))
3-worst quality internal ID (PUID, name): 532 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
4-worst quality internal ID (PUID, name): 533 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
5-worst quality internal ID (PUID, name): 835 (fmt/532, Drawing Interchange File Format (ASCII)) -> -3.614;-3.842
combined regex: (?=\A.{1,3}((0).{1,2}SECTION.{1,2}(\x20\x202).{1,2}(HEADER)).+((9).{1,2}\$ACADVER.{1,2}(\x20\x201).{1,2}(AC1027)).+((0).{1,2}ENDSEC))(?=((0).{1,2}EOF).{1,3}\Z)
Regular expressions
---------------------------------------
Count of multiple used regular expressions: 67
common regex group no 0:
regex='(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z'
internal IDs: 111,112,113
</pre>
[…]
</blockquote>
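<br />
To give a feeling for the kind of conversion that produces the regular expressions listed above, here is a deliberately simplified sketch in Python (the module itself is written in Perl). The function name <code>hex_signature_to_regex</code> is made up for this example, and it only understands plain hex pairs and the single-byte wildcard <code>??</code>. Real DROID signatures use a much richer syntax, so this is an illustration under those assumptions, not the module's actual implementation:<br />

```python
import re


def hex_signature_to_regex(signature: str) -> "re.Pattern[bytes]":
    """Turn a simplified hex byte sequence into a bytes regex anchored at file start.

    Only two constructs are handled: hex pairs (e.g. '49') and the
    single-byte wildcard '??'. Anything else raises ValueError.
    """
    compact = signature.replace(" ", "")
    if len(compact) % 2 != 0:
        raise ValueError("expected an even number of characters")
    parts = []
    for i in range(0, len(compact), 2):
        pair = compact[i:i + 2]
        if pair == "??":
            parts.append(b".")  # any single byte (DOTALL also covers 0x0A)
        else:
            parts.append(re.escape(bytes([int(pair, 16)])))  # literal byte
    return re.compile(rb"\A" + b"".join(parts), re.DOTALL)


# The little-endian TIFF magic number: 'II', then 42 as a 16-bit value.
tiff_le = hex_signature_to_regex("49 49 2A 00")
print(bool(tiff_le.match(b"II*\x00\x08\x00\x00\x00")))
```

The shorter and less specific such a pattern is, the more files it will match by accident, which is what the quality values in the statistics try to capture.<br />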
<br />
<br />
<br />
I would be pleased to receive feedback. The code is available at <a href="http://andreas-romeyke.de/software.html#_file_formatidentification_pronom">http://andreas-romeyke.de/software.html#_file_formatidentification_pronom</a> .<br />
<br />
Have fun!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-36984151794392646522018-09-17T07:43:00.001-07:002018-09-17T07:43:53.061-07:00A file is a TIFF is a MP3 is a…In den letzten Tagen sind uns einige Dateien aufgefallen, die in der Formatidentifizierung hängengeblieben sind. Diese wurden von Droid als TIFF (<a href="http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1099&strPageToDisplay=summary">fmt/353</a>) und als MP3 (<a href="http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=687&strPageToDisplay=summary">fmt/134</a>) erkannt.<br />
<br />
The question we asked ourselves: was this an error, or are these really files that, going by the Pronom signatures, can be interpreted both as TIFF and as MP3?<br />
<br />
To investigate this more closely, we wrote a Perl script¹ that uses the patterns from the Droid signature file and makes the corresponding matches visible in a hex editor. Here is a screenshot:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXoFXawCoqiyt6HNxvkjk9FAQl-UdqYjsaOezFqkJxuQ2lN50iq4tjSU3AH4w6LTgOFsixdZ-w_oQ-v5qjjYODIoT5SThRYoswUKDZlUcgkiCBdBd30DL8M4A4ASiyH8-q-I0YgYwoLRj_/s1600/BildschirmfotoWxHexEditor_TIFF.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="948" data-original-width="1600" height="377" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXoFXawCoqiyt6HNxvkjk9FAQl-UdqYjsaOezFqkJxuQ2lN50iq4tjSU3AH4w6LTgOFsixdZ-w_oQ-v5qjjYODIoT5SThRYoswUKDZlUcgkiCBdBd30DL8M4A4ASiyH8-q-I0YgYwoLRj_/s640/BildschirmfotoWxHexEditor_TIFF.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">wxHexEditor, screenshot with a special tags file</td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
<br />
<br />
As you can see, several patterns match. One is the pattern for TIFF files, because the magic byte string "0x4949" occurs at the beginning. Another is one of the recipes that describe an MP3 data stream.<br />
<br />
On Wikipedia, under XXX, you can find the following depiction of an MP3 frame. The pattern in the Droid signature matches because 8 frames occur in succession:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Mp3filestructure.svg/1280px-Mp3filestructure.svg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="318" data-original-width="800" height="254" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Mp3filestructure.svg/1280px-Mp3filestructure.svg.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">MP3 structure, source: Wikipedia, see https://commons.wikimedia.org/wiki/File:Mp3filestructure.svg (CC-BY/GFDL)</td><td class="tr-caption" style="text-align: center;"><br /></td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
This file is a good example of the fact that the patterns in the Pronom database are not the problem. Rather, format-specific properties make it necessary to design the ingest process so that it can handle multiple matches in format identification.<br />
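A minimal sketch of how one byte stream can satisfy two format checks at once. This is not the Droid/Pronom signature logic: the function names, the `min_sync` threshold and the simple counting of sync candidates (instead of verifying eight consecutive, correctly spaced frames) are assumptions made purely for illustration.

```python
import struct


def looks_like_tiff(data: bytes) -> bool:
    """Classic TIFF header: byte order mark 'II' or 'MM', then the number 42."""
    if len(data) < 4:
        return False
    if data[:2] == b"II":
        return struct.unpack("<H", data[2:4])[0] == 42
    if data[:2] == b"MM":
        return struct.unpack(">H", data[2:4])[0] == 42
    return False


def count_mp3_sync_candidates(data: bytes) -> int:
    """Count positions where 11 set bits (the MPEG audio frame sync) occur.

    Simplification: a real signature also checks the remaining header
    fields and the distance between successive frames.
    """
    return sum(
        1
        for i in range(len(data) - 1)
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0
    )


def identify(data: bytes, min_sync: int = 8) -> list:
    """Return every format whose (simplified) check matches, not just the first."""
    matches = []
    if looks_like_tiff(data):
        matches.append("TIFF")
    if count_mp3_sync_candidates(data) >= min_sync:
        matches.append("MP3")
    return matches


# Toy data: a little-endian TIFF header followed by eight MP3-like frame headers.
sample = b"II*\x00" + b"\xff\xfb\x90\x00" * 8
print(identify(sample))
```

Returning a list rather than a single result is the design point here: the caller, for example an ingest workflow, has to decide what to do when more than one format matches.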
<br />
See also our post "Formatidentifikation vs. Formatvalidierung - Wem glauben wir eigentlich?" at <a href="https://kulturreste.blogspot.com/2016/06/formatidentifikation-vs.html">https://kulturreste.blogspot.com/2016/06/formatidentifikation-vs.html</a><br />
<br />
<br />
--<br />
¹ We will make the Perl script available soon <br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-84529581382803145572018-04-16T04:57:00.004-07:002018-04-16T10:18:31.893-07:00Wie verwirrend! How confusing! Defaults in TIFF<i>Hint: english version below :)</i><br />
<h2>
Erste Überlegung: Hä? </h2>
Ernsthaft? Was soll denn an den Defaults von TIFF so problematisch sein? Steht doch alles in der <a href="https://archive.org/details/TIFF6">Spezifikation</a>. Es gilt:<br />
<ol>
<li>Enthält ein TIFF ein Tag nicht, für das ein Default definiert ist, gilt der Default.</li>
<li>Wenn ein TIFF ein Tag enthält, gilt der Wert des Tags.</li>
<li>Sonst gilt, der Wert ist nicht definiert und demnach nicht vorhanden.</li>
</ol>
<h2>
Der zweite Blick</h2>
Leider ist es in der Praxis komplizierter. Ich bekam die Frage<i>, </i>wenn <a href="http://jhove.openpreservation.org/"><i>jhove</i></a> bei der Prüfung der von <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> mitgelieferten Beispiel-TIFFs für das Thresholding-Tag 263 den Wert "1" ausgibt:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> jhove tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Jhove (Rel. 1.6, 2011-01-04)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Date: 2018-04-16 12:41:25 MESZ</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RepresentationInformation: tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ReportingModule: TIFF-hul, Rel. 1.5 (2007-10-02)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> LastModified: 2017-07-14 11:28:57 MESZ</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Size: 323</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Format: TIFF</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Version: 5.0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Status: Well-Formed and valid</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SignatureMatches:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFF-hul</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MIMEtype: image/tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Profile: Baseline bilevel (Class B), TIFF/IT-BP (ISO 12639:1998), TIFF/IT-BP/P1 (ISO 12639:1998), TIFF/IT-BP/P2 (ISO 12639:1998), TIFF/IT-MP (ISO 12639:1998)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFFMetadata: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ByteOrder: little-endian</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> IFDs: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Number: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> IFD: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Offset: 38</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Type: TIFF</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Entries: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> NisoImageMetadata: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ByteOrder: little_endian</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> CompressionScheme: uncompressed</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageWidth: 20</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageHeight: 10</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ColorSpace: white is zero</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Orientation: normal</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SamplingFrequencyUnit: inch</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XSamplingFrequency: 376,193</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> YSamplingFrequency: 376,193</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerSample: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerSampleUnit: integer</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SamplesPerPixel: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> NewSubfileType: 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SampleFormat: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MinSampleValue: 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MaxSampleValue: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b><span style="color: #cc0000;">Threshholding: 1</span></b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFFITProperties: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BackgroundColorIndicator: background not defined</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageColorIndicator: image not defined</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TransparencyIndicator: no transparency</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> PixelIntensityRange: 0, 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RasterPadding: 1 byte</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerRunLength: 8</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerExtendedRunLength: 16</span></blockquote>
aber <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> mit dem beigefügten Beispiel keinen Fehler wirft, obwohl doch <b>keine</b> Positiv-Regel in der Konfigurationsdatei hinterlegt ist:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> checkit_tiff example_configs/cit_tiff6_baseline_SLUB.cfg tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">'./build/checkit_tiff' version: development_v0.4.0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> revision: 408</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">licensed under conditions of libtiff (see http://libtiff.maptools.org/misc.html)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">cfg_file=example_configs/cit_tiff6_baseline_SLUB.cfg</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tiff file/dir=tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">file: tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> TIFF should have just one IFD, (lineno: 12)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All tag offsets should be word aligned, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All offsets may only be used once, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All tag offsets should be greater than zero, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All IFDs should be word aligned, (lineno: 15)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> Tags should be sorted in ascending order, (lineno: 15)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 256 (ImageWidth) --> Tag should have a value in a range of (lineno: 23)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 257 (ImageLength) --> Tag should have a value in a range of (lineno: 25)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 258 (BitsPerSample) --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 30)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 259 (Compression) --> Tag should have one exact value. (lineno: 36)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 262 (Photometric) --> Tag should have a value in a range of (lineno: 40)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 273 (StripOffsets) --> TIFF should contain this tag. (lineno: 45)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 277 (SamplesPerPixel) --> Tag should have one exact value. (lineno: 52)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 278 (RowsPerStrip) --> Tag should have a value in a range of (lineno: 55)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 279 (StripByteCounts) --> TIFF should contain this tag. (lineno: 60)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 282 (XResolution) --> Tag should have a value in a range of (lineno: 63)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 283 (YResolution) --> Tag should have a value in a range of (lineno: 66)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 296 (ResolutionUnit) --> Tag should have one exact value. (lineno: 69)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 254 (SubFileType) --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 77)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 274 (Orientation) --> Tag should have one exact value. (lineno: 113)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 284 (PlanarConfig) --> Tag should have one exact value. (lineno: 122)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./)</span><br />
<span style="color: #cc0000; font-family: "courier new" , "courier" , monospace;"><b>(./)Yes, the given tif is valid :)</b></span></blockquote>
Zuerst war ich etwas erschrocken, war ich mir doch sicher, dass <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> funktioniert und ich alles sorgfältig geprüft hatte. Zur Sicherheit habe ich die Ausgabe mit <i>tiffdump</i> der <a href="http://www.simplesystems.org/libtiff/"><i>libtiff</i></a> geprüft:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> tiffdump tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tiffs_should_pass/minimal_valid_baseline.tiff:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Directory 0: offset 38 (0x26) next 0 (0)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SubFileType (254) LONG (4) 1<0></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ImageWidth (256) SHORT (3) 1<20></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ImageLength (257) SHORT (3) 1<10></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">BitsPerSample (258) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Compression (259) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Photometric (262) SHORT (3) 1<0></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">StripOffsets (273) LONG (4) 1<8></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Orientation (274) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SamplesPerPixel (277) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">RowsPerStrip (278) SHORT (3) 1<64></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts (279) LONG (4) 1<30></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">XResolution (282) RATIONAL (5) 1<376.193></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">YResolution (283) RATIONAL (5) 1<376.193></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">PlanarConfig (284) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ResolutionUnit (296) SHORT (3) 1<2></span></blockquote>
Gut, <i>tiffdump</i> war auf meiner Seite. Was ist also der Grund für diese Diskrepanz? Schauen wir zuerst in die <a href="https://archive.org/details/TIFF6">TIFF-6.0 Spezifikation</a>, dort steht auf Seite 41:<br />
<blockquote class="tr_bq">
For black and white TIFF files that represent shades of gray, the technique used to<br />
convert from gray to black and white pixels.<br />
Tag = 263 (107.H)<br />
Type = SHORT<br />
N = 1<br />
1 = No dithering or halftoning has been applied to the image data.<br />
2 = An ordered dither or halftone technique has been applied to the image data.<br />
3 = A randomized process such as error diffusion has been applied to the image data.<br />
Default is Threshholding = 1. See also CellWidth, CellLength.</blockquote>
Okay. Für das oben benutzte TIFF trifft zu, dass es schwarz-weiß ist und <b>kein</b> Tag 263 enthält. Daher wird der Default = 1 angenommen.<br />
<br />
<a href="http://jhove.openpreservation.org/"><i>Jhove</i></a> präsentiert die Metadaten der TIFF-Dateien also so, wie ein TIFF-Reader sie <b>interpretieren</b> würde. Die Tools <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> und <i>tiffdump</i> zeigen dagegen, welche TIFF-Tags mit welchen Werten <b>tatsächlich</b> in den TIFF-Dateien <b>explizit kodiert</b> sind.<br />
<h2>
Fazit</h2>
Kenne Deine Tools! Statt Default-Werte zu interpretieren, sollten solche Annahmen <b>explizit</b> gekennzeichnet werden. Für den Durchschnittsanwender ist sonst nicht ersichtlich, wie die Ergebnisse zustande kommen. Als Lektion für <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> nehme ich diese Frage mit in die FAQ auf.<br />
<br />
<br />
<br />
<br />
<a href="https://www.blogger.com/null" name="more"></a><br />
<br />
<h2>
First thought: WTF?</h2>
Seriously? What's supposed to be so problematic about TIFF's defaults? After all, it's all in the <a href="https://archive.org/details/TIFF6">specification</a>. The rules are:<br />
<ol>
<li>If a TIFF does not contain a tag that has a well-defined default value, then that default value is used.</li>
<li>If a TIFF does contain a tag, then that tag's value is used.</li>
<li>In all other cases, the value is undefined and hence nonexistent.</li>
</ol>
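The three rules can be sketched as a small lookup. This is only an illustration: the names `TIFF6_DEFAULTS`, `effective_value` and `report` are made up for this example, and the defaults table is a tiny excerpt (Threshholding = 1, Orientation = 1, ResolutionUnit = 2 as given in TIFF 6.0), not a complete list.

```python
from typing import Optional

# Excerpt of TIFF 6.0 defaults, keyed by tag number (assumption: only
# these three tags are needed for the illustration).
TIFF6_DEFAULTS = {
    263: 1,  # Threshholding
    274: 1,  # Orientation
    296: 2,  # ResolutionUnit (inch)
}


def effective_value(explicit: dict, tag: int) -> Optional[int]:
    """Apply the rules: encoded value first, then the default, else undefined."""
    if tag in explicit:
        return explicit[tag]
    return TIFF6_DEFAULTS.get(tag)


def report(explicit: dict, tag: int) -> str:
    """Describe a tag value and say where the value comes from."""
    value = effective_value(explicit, tag)
    if value is None:
        return f"tag {tag}: undefined"
    origin = "encoded" if tag in explicit else "default, not encoded"
    return f"tag {tag}: {value} ({origin})"


# Tags as they might be physically present in a file (hypothetical excerpt).
explicit_tags = {262: 0, 274: 1, 296: 2}
for tag in (262, 263, 700):
    print(report(explicit_tags, tag))
```

Printing the origin next to the value is the point of the sketch: a reader of the report can tell an encoded value from an assumed one.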
<h2>
A second look</h2>
Unfortunately, the real world is a little more complicated. I was asked why <a href="http://jhove.openpreservation.org/"><i>jhove</i></a> reports a value of "1" for the Thresholding tag 263 when validating the sample TIFFs shipped with <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a>, as shown below:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> jhove tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Jhove (Rel. 1.6, 2011-01-04)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Date: 2018-04-16 12:41:25 MESZ</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RepresentationInformation: tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ReportingModule: TIFF-hul, Rel. 1.5 (2007-10-02)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> LastModified: 2017-07-14 11:28:57 MESZ</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Size: 323</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Format: TIFF</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Version: 5.0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Status: Well-Formed and valid</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SignatureMatches:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFF-hul</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MIMEtype: image/tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Profile: Baseline bilevel (Class B), TIFF/IT-BP (ISO 12639:1998), TIFF/IT-BP/P1 (ISO 12639:1998), TIFF/IT-BP/P2 (ISO 12639:1998), TIFF/IT-MP (ISO 12639:1998)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFFMetadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ByteOrder: little-endian</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> IFDs:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Number: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> IFD:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Offset: 38</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Type: TIFF</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Entries:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> NisoImageMetadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ByteOrder: little_endian</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> CompressionScheme: uncompressed</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageWidth: 20</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageHeight: 10</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ColorSpace: white is zero</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Orientation: normal</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SamplingFrequencyUnit: inch</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XSamplingFrequency: 376,193</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> YSamplingFrequency: 376,193</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerSample: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerSampleUnit: integer</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SamplesPerPixel: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> NewSubfileType: 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> SampleFormat: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MinSampleValue: 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MaxSampleValue: 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b><span style="color: #cc0000;">Threshholding: 1</span></b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TIFFITProperties:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BackgroundColorIndicator: background not defined</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ImageColorIndicator: image not defined</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> TransparencyIndicator: no transparency</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> PixelIntensityRange: 0, 1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RasterPadding: 1 byte</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerRunLength: 8</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> BitsPerExtendedRunLength: 16</span></blockquote>
However, <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> does not throw an error while validating the same sample file, even though there's <b>no</b> whitelist rule for that tag in the config file:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> checkit_tiff example_configs/cit_tiff6_baseline_SLUB.cfg tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">'./build/checkit_tiff' version: development_v0.4.0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> revision: 408</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">licensed under conditions of libtiff (see http://libtiff.maptools.org/misc.html)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">cfg_file=example_configs/cit_tiff6_baseline_SLUB.cfg</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tiff file/dir=tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">file: tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> TIFF should have just one IFD, (lineno: 12)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All tag offsets should be word aligned, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All offsets may only be used once, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All tag offsets should be greater than zero, (lineno: 14)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> All IFDs should be word aligned, (lineno: 15)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) general --> Tags should be sorted in ascending order, (lineno: 15)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 256 (ImageWidth) --> Tag should have a value in a range of (lineno: 23)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 257 (ImageLength) --> Tag should have a value in a range of (lineno: 25)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 258 (BitsPerSample) --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 30)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 259 (Compression) --> Tag should have one exact value. (lineno: 36)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 262 (Photometric) --> Tag should have a value in a range of (lineno: 40)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 273 (StripOffsets) --> TIFF should contain this tag. (lineno: 45)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 277 (SamplesPerPixel) --> Tag should have one exact value. (lineno: 52)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 278 (RowsPerStrip) --> Tag should have a value in a range of (lineno: 55)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 279 (StripByteCounts) --> TIFF should contain this tag. (lineno: 60)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 282 (XResolution) --> Tag should have a value in a range of (lineno: 63)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 283 (YResolution) --> Tag should have a value in a range of (lineno: 66)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 296 (ResolutionUnit) --> Tag should have one exact value. (lineno: 69)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 254 (SubFileType) --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 77)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 274 (Orientation) --> Tag should have one exact value. (lineno: 113)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./) tag 284 (PlanarConfig) --> Tag should have one exact value. (lineno: 122)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(./)</span><br />
<span style="color: #cc0000; font-family: "courier new" , "courier" , monospace;"><b>(./)Yes, the given tif is valid :)</b></span></blockquote>
I was shocked at first, because I was sure that <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> works as expected and that I had checked everything carefully. To be on the safe side, I cross-checked its output against the output of the <i>tiffdump</i> tool from <a href="http://www.simplesystems.org/libtiff/"><i>libtiff</i></a>:<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> tiffdump tiffs_should_pass/minimal_valid_baseline.tiff</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tiffs_should_pass/minimal_valid_baseline.tiff:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Directory 0: offset 38 (0x26) next 0 (0)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SubFileType (254) LONG (4) 1<0></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ImageWidth (256) SHORT (3) 1<20></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ImageLength (257) SHORT (3) 1<10></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">BitsPerSample (258) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Compression (259) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Photometric (262) SHORT (3) 1<0></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">StripOffsets (273) LONG (4) 1<8></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Orientation (274) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SamplesPerPixel (277) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">RowsPerStrip (278) SHORT (3) 1<64></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts (279) LONG (4) 1<30></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">XResolution (282) RATIONAL (5) 1<376.193></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">YResolution (283) RATIONAL (5) 1<376.193></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">PlanarConfig (284) SHORT (3) 1<1></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ResolutionUnit (296) SHORT (3) 1<2></span></blockquote>
Well, <i>tiffdump</i> was on my side there. So what is the reason for the discrepancy? First, let's have a look at the <a href="https://archive.org/details/TIFF6">TIFF 6.0 specification</a>. On page 41, the specification states the following about the Threshholding tag:<br />
<blockquote class="tr_bq">
<i>For black and white TIFF files that represent shades of gray, the technique used to</i><br />
<i>convert from gray to black and white pixels.</i><br />
<i>Tag = 263 (107.H)</i><br />
<i>Type = SHORT</i><br />
<i>N = 1</i><br />
<i>1 = No dithering or halftoning has been applied to the image data.</i><br />
<i>2 = An ordered dither or halftone technique has been applied to the image data.</i><br />
<i>3 = A randomized process such as error diffusion has been applied to the image data.</i><br />
<i>Default is Threshholding = 1. See also CellWidth, CellLength.</i></blockquote>
Okay. Looking at the sample TIFF we used above, it's true that it's a black-and-white image and does <b>not</b> contain tag 263. Hence, a reader assumes the default value of 1.<br />
<br />
Apparently, <a href="http://jhove.openpreservation.org/" style="font-style: italic;">Jhove</a> presents the metadata of a TIFF file the way a TIFF reader would interpret it, including default values. The tools <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> and <i>tiffdump</i>, however, show which TIFF tags are actually explicitly encoded in the file and what values they have.<br />
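The difference between the two views can be illustrated with a small sketch. The tag numbers and default values are taken from the TIFF 6.0 specification; the function names and the dictionary layout are hypothetical and do not reflect how Jhove, checkit_tiff or tiffdump are implemented.<br />

```python
# Defaults defined by the TIFF 6.0 specification for a few baseline tags.
TIFF_DEFAULTS = {
    263: 1,  # Threshholding: no dithering or halftoning
    274: 1,  # Orientation
    277: 1,  # SamplesPerPixel
    284: 1,  # PlanarConfiguration
    296: 2,  # ResolutionUnit: inch
}


def explicit_value(ifd, tag):
    """Return only what is really stored in the file (None if the tag is absent)."""
    return ifd.get(tag)


def interpreted_value(ifd, tag):
    """Return what a reader works with: the stored value or the specification default."""
    return ifd.get(tag, TIFF_DEFAULTS.get(tag))


# Tags of the sample file as listed by tiffdump above (RATIONAL tags omitted).
sample_ifd = {254: 0, 256: 20, 257: 10, 258: 1, 259: 1, 262: 0, 273: 8,
              274: 1, 277: 1, 278: 64, 279: 30, 284: 1, 296: 2}

print(explicit_value(sample_ifd, 263))     # tag 263 is not stored in the file
print(interpreted_value(sample_ifd, 263))  # a reader falls back to the default
```

Both answers are legitimate; they just answer different questions, which is why a report should say which of the two it shows.<br />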
<h2>
Wrap-up</h2>
Know your tools! When a tool reports a default value for a tag that is not stored in the file, this needs to be clearly marked. Otherwise, the origin of such results might not be apparent to the average user.<br />
I have learned my lesson and will include this question in the <a href="https://github.com/SLUB-digitalpreservation/checkit_tiff/"><i>checkit_tiff</i></a> FAQ.<br />
<br />
<h1>Valid TIFFs need love, too.</h1>
<i>2018-02-26</i> <i>(english version below)</i><br />
<br />
<div class="MsoPlainText">
A colleague passed an interesting TIFF on to us. It had passed all validations and showed no structural errors in tiffinfo/tiffdump, yet it could not be displayed in the preview of the workflow tool. It was also about three times as large as all the other scans from the same job. He asked us to examine the TIFF.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Unlike him, I had no trouble opening the TIFF at all; the Windows image viewer, IrfanView, MS Paint, Paint.NET and XnViewMP all displayed the image. However, it was heavily stretched horizontally, i.e. considerably wider than expected. Large parts of the image content (a scanned magazine page) were missing, and the right edge was not visible. </div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpAGD9-osG9ldEU3cU-_3Yn7mfkHJvjIQpdISwYZ1mgBBpEayYGzS7KE45tJJPgL0YUfOHpQQj1gxEtsRhBU3tbEFKtzh3YC26T24atPzgQ7BIQdJXeMI3-JOQHcTA40RzeqtIIuGRXK4/s1600/before_211027_00000001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="372" data-original-width="1600" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpAGD9-osG9ldEU3cU-_3Yn7mfkHJvjIQpdISwYZ1mgBBpEayYGzS7KE45tJJPgL0YUfOHpQQj1gxEtsRhBU3tbEFKtzh3YC26T24atPzgQ7BIQdJXeMI3-JOQHcTA40RzeqtIIuGRXK4/s320/before_211027_00000001.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">broken display of the TIFF</td></tr>
</tbody></table>
<br />
<span id="goog_49935563"></span></div>
<div class="MsoPlainText">
In tiffinfo we saw that the TIFF is a grayscale image:</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Bits/Sample: 8<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 1</span><o:p></o:p></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
It was striking that the entries in StripByteCounts were larger than the ImageWidth by exactly a factor of 3 (4302 * 3 = 12906); this explained the stretching of the image in the X direction. We also saw that the StripOffsets increased in steps of 12906 bytes; presumably that is why the viewer was able to display any image at all. The ImageLength matched the number of entries in StripByteCounts (6020), so there was no distortion in the vertical direction.</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Image Width: 4302<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Image Length: 6020<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts (279) LONG (4) 6020<12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 ...> StripOffsets (273) LONG (4) 6020<8 12914 25820 38726 ...></span><o:p></o:p></div>
<div class="MsoPlainText">
<br /></div>
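<div class="MsoPlainText">
The arithmetic behind this observation can be written down as a short sketch. It assumes uncompressed data and RowsPerStrip = 1 (inferred from 6020 strips for 6020 rows; the tag itself is not shown in the excerpt above). The variable names are only illustrative.</div>

```python
# Values taken from the tiffinfo/tiffdump excerpts above.
image_width = 4302
bits_per_sample = 8
samples_per_pixel = 1
strip_byte_count = 12906
strip_offsets = [8, 12914, 25820, 38726]  # first four entries only

# Assumption: one row per strip, because there are 6020 strips for 6020 rows.
rows_per_strip = 1

# Bytes an uncompressed strip should occupy according to the tags.
expected_bytes_per_strip = (
    rows_per_strip * image_width * samples_per_pixel * bits_per_sample // 8
)
ratio = strip_byte_count / expected_bytes_per_strip
offset_steps = {b - a for a, b in zip(strip_offsets, strip_offsets[1:])}

print(expected_bytes_per_strip, ratio, offset_steps)
# prints: 4302 3.0 {12906}
```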
<div class="MsoPlainText">
In Okteta we could see that the image data for each pixel was stored three times in identical form. This matches our colleague's statement that the image was about three times larger than all other scans in the same job. We also saw that IFD0 was located at the end of the file and contained traces of processing with IrfanView.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ3f4nC2cTE_Iuo0UPsKiQ37XbIRvRet_TUPyzvCE9fnqcfpNPV5JSkquoqnIDkWWM3A83V7rD602Gy33STnbAjZUjDs-QpIrPKG97Kb671sJoShMd9V-zGOO5T0ZrQ90Q0V3d_6hi3YM_/s1600/RGB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="213" data-original-width="1092" height="62" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ3f4nC2cTE_Iuo0UPsKiQ37XbIRvRet_TUPyzvCE9fnqcfpNPV5JSkquoqnIDkWWM3A83V7rD602Gy33STnbAjZUjDs-QpIrPKG97Kb671sJoShMd9V-zGOO5T0ZrQ90Q0V3d_6hi3YM_/s320/RGB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">normal RGB TIFF</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2KfQbWn4BUcZwg70WRz44dcIY-MxwCBdCi9tIajSWABI-ByIYNj9nkbWh4BNhP16o0ey4rfY4_go6nmRJldqX0s05mmRs_grex7VYeL4fb2I6q-JhaCcpsGbExWLERniu6pRuh98OB338/s1600/gray_redundant.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="366" data-original-width="1124" height="104" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2KfQbWn4BUcZwg70WRz44dcIY-MxwCBdCi9tIajSWABI-ByIYNj9nkbWh4BNhP16o0ey4rfY4_go6nmRJldqX0s05mmRs_grex7VYeL4fb2I6q-JhaCcpsGbExWLERniu6pRuh98OB338/s320/gray_redundant.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">defective TIFF with two bytes of redundant grayscale data per pixel</td></tr>
</tbody></table>
<br /></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Once we had understood the problem, we discussed ways of repairing the file:</div>
<div class="MsoPlainText">
- One could remove the redundancy in the pixel data and adjust the StripOffsets (and probably other offsets as well). This would probably be the cleaner solution, but it would definitely require software support.</div>
<div class="MsoPlainText">
- One could set SamplesPerPixel to "3" so that the three duplicated bytes per pixel are interpreted as RGB channels, combining three bytes into one pixel of the image. That is what we did, and it worked; at least the image could be displayed, was not distorted, and was not tinted in unusual colors.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
We now had two theories about the cause of the error:</div>
<div class="MsoPlainText">
- There might have been a bit flip that damaged SamplesPerPixel: it is not far from "<span style="font-family: "courier new" , "courier" , monospace;">00 11</span>"B ("0 3" D) to "<span style="font-family: "courier new" , "courier" , monospace;">00 01</span>"B ("0 1" D), and this would explain the symptoms.</div>
<div class="MsoPlainText">
- There might have been an error while converting an RGB scan of a grayscale original, in which the surplus bytes per pixel were not removed. In that case, the SamplesPerPixel tag would have been set correctly and deliberately.</div>
<div class="MsoPlainText">
<br /></div>
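<div class="MsoPlainText">
How close the two values are in the bit flip theory can be checked with a few lines; the helper name is only illustrative.</div>

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


# SamplesPerPixel = 3 (binary 11) versus SamplesPerPixel = 1 (binary 01)
print(format(3, "02b"), format(1, "02b"), hamming_distance(3, 1))
# prints: 11 01 1
```

<div class="MsoPlainText">
A single flipped bit would therefore be enough to turn 3 into 1, which makes the first theory plausible, although it does not prove it.</div>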
<div class="MsoPlainText">
As a first step, we set SamplesPerPixel to "3" in a hex editor to instruct the TIFF viewer to interpret the image data as an RGB image. This small change alone was enough for the image to be displayed without errors. The fact that the image was unusually large (we had expected it to be about as large as the other scans from the same magazine) remained unexplained for the time being.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzMNUPu_LhBcsQo2gmLQ9QWg9P17arAtamzYNot4AkTuI-89vZw8xqjzVH6yj7oNLU3LOWwP_LVPCh-ZI86DB8GQQ6RWEVO6ehrhhIfCBFEdQy-tluuTI2wovJGaEitd7-t_E_M3JRXpPk/s1600/gray_SamplesPerPixel3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="216" data-original-width="1095" height="63" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzMNUPu_LhBcsQo2gmLQ9QWg9P17arAtamzYNot4AkTuI-89vZw8xqjzVH6yj7oNLU3LOWwP_LVPCh-ZI86DB8GQQ6RWEVO6ehrhhIfCBFEdQy-tluuTI2wovJGaEitd7-t_E_M3JRXpPk/s320/gray_SamplesPerPixel3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">defective grayscale TIFF, interpreted as RGB</td></tr>
</tbody></table>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGYZptudG0x8tmjT4P-yKTaCBY3dgoqBfVWcYMJiPnCFRDDRb-KQxXUX-jlADfPvuu3ZRlwBkzvTobXor8gyd4j7F1AHFnqQbMigz72ihjqNcaPpkWIzuiWzrW-nhnjEabbG7ARzhiL8t_/s1600/after_211027_00000001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="332" data-original-width="1600" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGYZptudG0x8tmjT4P-yKTaCBY3dgoqBfVWcYMJiPnCFRDDRb-KQxXUX-jlADfPvuu3ZRlwBkzvTobXor8gyd4j7F1AHFnqQbMigz72ihjqNcaPpkWIzuiWzrW-nhnjEabbG7ARzhiL8t_/s320/after_211027_00000001.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">correct display of the TIFF</td></tr>
</tbody></table>
<br /></div>
<div class="MsoPlainText">
We are considering implementing a plausibility check for this type of error in checkit_tiff, provided one assumes that all strips within an image have the same length. The check uses the formula "<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts / SamplesPerPixel / RowsPerStrip = ImageWidth</span>". This works most easily for TIFFs with RowsPerStrip = 1; otherwise more complex checks are needed as well, because no padding is added to multi-row strips whose byte length is not evenly divisible by the number of rows. This can result in rows that are shorter than the preceding rows of a strip. </div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Further conceivable plausibility checks would be:</div>
<div class="MsoPlainText">
- The height of the image equals the product of RowsPerStrip and the number of strips: <span style="font-family: "courier new" , "courier" , monospace;">ImageLength = RowsPerStrip * StripOffsets.Count</span></div>
<div class="MsoPlainText">
- Each StripByteCounts entry must equal the difference between the corresponding StripOffsets: <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[0] = StripOffsets[1] - StripOffsets[0]</span> (or, more generally, <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[n] = StripOffsets[n+1] - StripOffsets[n]</span>)</div>
<div class="MsoPlainText">
- All strips must have the same length: <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[0] = StripByteCounts[1] = StripByteCounts[2] = ... = StripByteCounts[n]</span></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
We discussed these options in a larger group, which made Andreas curious. He extended his new tool for finding possible former IFDs in TIFFs with a few soft search criteria and used it to find IFDs from earlier versions of the file. He also wrote an entirely new tool that reads a TIFF file and an address in hex notation and interprets the content at that address as if an IFD were stored there. In this way we were able to identify a total of six former IFDs that point to older versions of the file, and to inspect their contents. The tools are available as source code at <a href="https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools">https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools</a>; they are part of the well-known tool fixit_tiff.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-4yRM_VmNVya1lV60qxf5bAL5M-Tft0QRigS-RwYWVHfx4pHMk8YDhhQDDffQKlZp8Ae1CGtKnMgn06WEJ0HupB2vJ7BdVkd5POSfRJ3EF-LA_H61GdauHyqdmYpxymcLopEDAGtwuZ9z/s1600/1st_IFD0_pointer.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="167" data-original-width="700" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-4yRM_VmNVya1lV60qxf5bAL5M-Tft0QRigS-RwYWVHfx4pHMk8YDhhQDDffQKlZp8Ae1CGtKnMgn06WEJ0HupB2vJ7BdVkd5POSfRJ3EF-LA_H61GdauHyqdmYpxymcLopEDAGtwuZ9z/s320/1st_IFD0_pointer.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">Pointer to the original IFD0 as it was stored in the first version of the file</td></tr>
</tbody></table>
</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
The output of possible IFD addresses looks like this:<br />
<span style="font-family: "courier new" , "courier" , monospace;"># adress,weight,is_sorted,has_required_baseline</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a184b0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a241aa,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a2fea4,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a3bbb0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a478d0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a535ea,2,y,y</span><br />
<br />
Using a hex editor, we entered these IFD addresses into the TIFF file as the IFD0 offset and, in a kind of TIFF archaeology, restored the old versions of the file step by step. This confirmed the assumption that the scan had originally been saved as RGB. Afterwards, a faulty grayscale conversion was apparently carried out in which only the tags PhotometricInterpretation (min-is-black) and SamplesPerPixel (1) were changed. Whether the image data itself was changed in the process can no longer be reconstructed exactly.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
In what is presumably the first version of IFD0, tiffinfo still shows the settings of the RGB image:</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Photometric Interpretation: RGB color</span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 3</span></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
The later versions, in contrast, contain the values:</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Photometric Interpretation: min-is-black</span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 1</span></div>
<div class="MsoPlainText">
<br />
In addition, several further versions of the TIFF were created in which a few other tags were changed, added or removed (Make, Model and Software).<br />
<br /></div>
<div class="MsoPlainText">
The error was only noticed at all because there was a manual (intellectual) review and the operator noticed the display error (and then actually reported it!). Moreover, because the MD5 checksums are only generated at the end of processing, no checksum existed at the time of the error, so it would not have been detected through a fixity mismatch. The only clean solution will probably be to rescan the page. Still, it is very impressive to see what possibilities the TIFF format offers for restoring broken files.<br />
<br />
Earlier articles on this topic (also available in English):<br />
<br />
<ul>
<li><a href="https://kulturreste.blogspot.de/2018/02/restaurierung-von-kaputten-tiff-dateien.html">Restaurierung von kaputten TIFF-Dateien</a></li>
<li><a href="https://kulturreste.blogspot.de/2017/01/repairing-tiff-images-preliminary-report.html">repairing TIFF images - a preliminary report</a></li>
<li><a href="https://kulturreste.blogspot.de/2016/11/some-thoughts-about-risks-in-tiff-file.html">Some thoughts about risks in TIFF file format</a></li>
<li><a href="https://kulturreste.blogspot.de/2016/08/image-file-directories-reparieren.html">Image File Directories reparieren</a></li>
</ul>
<br />
<br />
-------------------------------------------------------------------------------------------------------------------<br />
<br />
<i>english version</i><br />
<br />
<div class="MsoPlainText">
A few days ago, a colleague gave us an interesting TIFF. It had passed all validations and didn't show any signs of structural issues in tiffinfo/tiffdump. However, it was not possible to display the image in the preview of the workflow tool used. Also, it was about three times the size of the other scans in the same intellectual entity. Our colleague asked us to have a closer look at that TIFF, so we went at it.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
In contrast to our colleague, I didn't have any problem opening the TIFF at all; the Windows Image Viewer, IrfanView, MS Paint, Paint.NET and XnViewMP all displayed the image. However, it was significantly stretched horizontally, which means that it was a lot wider than expected. Large parts of the scanned newspaper page were missing, and the rightmost part of the image was not visible.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpAGD9-osG9ldEU3cU-_3Yn7mfkHJvjIQpdISwYZ1mgBBpEayYGzS7KE45tJJPgL0YUfOHpQQj1gxEtsRhBU3tbEFKtzh3YC26T24atPzgQ7BIQdJXeMI3-JOQHcTA40RzeqtIIuGRXK4/s1600/before_211027_00000001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="372" data-original-width="1600" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWpAGD9-osG9ldEU3cU-_3Yn7mfkHJvjIQpdISwYZ1mgBBpEayYGzS7KE45tJJPgL0YUfOHpQQj1gxEtsRhBU3tbEFKtzh3YC26T24atPzgQ7BIQdJXeMI3-JOQHcTA40RzeqtIIuGRXK4/s320/before_211027_00000001.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">broken display of the TIFF</td></tr>
</tbody></table>
<br />
<span id="goog_49935563"></span></div>
<div class="MsoPlainText">
In tiffinfo, we saw that the TIFF is a grayscale image:<o:p></o:p></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Bits/Sample: 8<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 1</span><o:p></o:p></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Particularly striking was the fact that the list entries for StripByteCounts were larger than the ImageWidth by exactly a factor of 3 (4302 * 3 = 12906), which explained the stretch we saw in the image. Also, you could see that the StripOffsets grew in steps of 12906 bytes; presumably that's why the viewer was able to display a picture in the first place, regardless of the final quality. The ImageLength matched the number of entries in StripByteCounts (6020), which is why there was no stretch in the vertical direction.<br />
<o:p></o:p></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Image Width: 4302<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Image Length: 6020<o:p></o:p></span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts (279) LONG (4) 6020<12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 ...> StripOffsets (273) LONG (4) 6020<8 12914 25820 38726 ...></span><o:p></o:p></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
We could see in Okteta that the image data for each pixel was saved identically three times in a row. That matches our colleague's observation that the file was three times larger than the other files in that IE. Also, we noticed that the IFD0 was written to the end of the file and contained information about an editing step in IrfanView.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ3f4nC2cTE_Iuo0UPsKiQ37XbIRvRet_TUPyzvCE9fnqcfpNPV5JSkquoqnIDkWWM3A83V7rD602Gy33STnbAjZUjDs-QpIrPKG97Kb671sJoShMd9V-zGOO5T0ZrQ90Q0V3d_6hi3YM_/s1600/RGB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="213" data-original-width="1092" height="62" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ3f4nC2cTE_Iuo0UPsKiQ37XbIRvRet_TUPyzvCE9fnqcfpNPV5JSkquoqnIDkWWM3A83V7rD602Gy33STnbAjZUjDs-QpIrPKG97Kb671sJoShMd9V-zGOO5T0ZrQ90Q0V3d_6hi3YM_/s320/RGB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">normal RGB-TIFF</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2KfQbWn4BUcZwg70WRz44dcIY-MxwCBdCi9tIajSWABI-ByIYNj9nkbWh4BNhP16o0ey4rfY4_go6nmRJldqX0s05mmRs_grex7VYeL4fb2I6q-JhaCcpsGbExWLERniu6pRuh98OB338/s1600/gray_redundant.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="366" data-original-width="1124" height="104" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2KfQbWn4BUcZwg70WRz44dcIY-MxwCBdCi9tIajSWABI-ByIYNj9nkbWh4BNhP16o0ey4rfY4_go6nmRJldqX0s05mmRs_grex7VYeL4fb2I6q-JhaCcpsGbExWLERniu6pRuh98OB338/s320/gray_redundant.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">defective TIFF with two bytes of redundant grayscale data per pixel</td></tr>
</tbody></table>
<br />
</div>
<div class="MsoPlainText">
After having understood the problem, we discussed possible ways to repair the file:<o:p></o:p></div>
<div class="MsoPlainText">
- We could remove the redundant bytes from the pixel data and adapt the StripOffsets (and quite possibly other offsets in that file). While this is the cleaner solution, software support for this kind of work would be imperative.</div>
<div class="MsoPlainText">
- We could set SamplesPerPixel to "3" to interpret the three duplicated bytes per pixel as three RGB channels, thus combining three bytes into one pixel. We actually did that, and it worked like a charm; at least we could display the image without getting any stretching or funky colors.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Now we had two theories about the origin of this error:<o:p></o:p></div>
<div class="MsoPlainText">
- There might have been a bit flip that damaged SamplesPerPixel. It's not a long way to go from "<span style="font-family: "courier new" , "courier" , monospace;">00 11</span>"B ("0 3" D) to "<span style="font-family: "courier new" , "courier" , monospace;">00 01</span>"B ("0 1" D), and it would explain the error we're seeing.</div>
<div class="MsoPlainText">
- There could have been an error during the conversion of an RGB scan that was made from an analog grayscale original, during which the surplus bytes per pixel were not removed. During this conversion, the SamplesPerPixel tag would have been set to its new value correctly and deliberately.</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
In a first test we set SamplesPerPixel to "3" using a hex editor in order to instruct the TIFF viewer to interpret the image data as RGB. This little change alone caused the image to be displayed without any errors. The puzzle, however, that the image was uncommonly large (we expected it to be about as big as the other scans from the same newspaper) remained unsolved.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzMNUPu_LhBcsQo2gmLQ9QWg9P17arAtamzYNot4AkTuI-89vZw8xqjzVH6yj7oNLU3LOWwP_LVPCh-ZI86DB8GQQ6RWEVO6ehrhhIfCBFEdQy-tluuTI2wovJGaEitd7-t_E_M3JRXpPk/s1600/gray_SamplesPerPixel3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="216" data-original-width="1095" height="63" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzMNUPu_LhBcsQo2gmLQ9QWg9P17arAtamzYNot4AkTuI-89vZw8xqjzVH6yj7oNLU3LOWwP_LVPCh-ZI86DB8GQQ6RWEVO6ehrhhIfCBFEdQy-tluuTI2wovJGaEitd7-t_E_M3JRXpPk/s320/gray_SamplesPerPixel3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">defective grayscale TIFF, interpreted as RGB</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGYZptudG0x8tmjT4P-yKTaCBY3dgoqBfVWcYMJiPnCFRDDRb-KQxXUX-jlADfPvuu3ZRlwBkzvTobXor8gyd4j7F1AHFnqQbMigz72ihjqNcaPpkWIzuiWzrW-nhnjEabbG7ARzhiL8t_/s1600/after_211027_00000001.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="332" data-original-width="1600" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGYZptudG0x8tmjT4P-yKTaCBY3dgoqBfVWcYMJiPnCFRDDRb-KQxXUX-jlADfPvuu3ZRlwBkzvTobXor8gyd4j7F1AHFnqQbMigz72ihjqNcaPpkWIzuiWzrW-nhnjEabbG7ARzhiL8t_/s320/after_211027_00000001.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">TIFF displayed <span style="font-size: 12.8px;">correctly</span></td></tr>
</tbody></table>
<br /></div>
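<div class="MsoPlainText">
The hex editor step can also be scripted. The following is a minimal sketch, assuming a classic (non-BigTIFF) file whose tag is stored in IFD0 as a single SHORT; the function name <span style="font-family: "courier new" , "courier" , monospace;">set_short_tag</span> is hypothetical and not part of checkit_tiff or fixit_tiff. It only touches the one tag, just like our manual edit; a strictly consistent RGB IFD would also need matching BitsPerSample and PhotometricInterpretation entries.</div>

```python
import struct

SAMPLES_PER_PIXEL = 277  # tag number from the TIFF 6.0 specification
TYPE_SHORT = 3           # TIFF field type SHORT (16-bit unsigned)


def set_short_tag(data, tag, value):
    """Overwrite a single-SHORT tag in IFD0 of a classic TIFF, in place.

    `data` must be a bytearray holding the whole file. Returns the old value.
    """
    endian = {b"II": "<", b"MM": ">"}[bytes(data[:2])]
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # every IFD entry is 12 bytes long
        found, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        if found == tag:
            if field_type != TYPE_SHORT or count != 1:
                raise ValueError("tag is not stored as a single SHORT")
            # A single SHORT sits left-justified in the 4-byte value field.
            (old,) = struct.unpack_from(endian + "H", data, entry + 8)
            struct.pack_into(endian + "H", data, entry + 8, value)
            return old
    raise KeyError(tag)


# Usage on a copy of the file (file names are placeholders):
# data = bytearray(open("copy_of_scan.tif", "rb").read())
# set_short_tag(data, SAMPLES_PER_PIXEL, 3)
# open("copy_of_scan_rgb.tif", "wb").write(data)
```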
<div class="MsoPlainText">
We contemplated implementing plausibility checks for this type of error in checkit_tiff, which would be easily feasible assuming that all strips in an image are of the same length. The following formula could be used: "<span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts / SamplesPerPixel / RowsPerStrip = ImageWidth</span>". This works best for TIFFs with RowsPerStrip = 1; other TIFFs would have to undergo more complex checks, because multi-row strips whose byte count is not evenly divisible by the number of rows do not contain any padding. Due to this, there may be rows that are shorter than the previous rows in the same strip. </div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Other possible plausibility checks include:<o:p></o:p></div>
<div class="MsoPlainText">
- The image height equals the product of RowsPerStrip and the number of strips: <span style="font-family: "courier new" , "courier" , monospace;">ImageLength = RowsPerStrip * StripOffsets.Count</span></div>
<div class="MsoPlainText">
- Each StripByteCounts entry must equal the difference between the neighboring StripOffsets: <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[0] = StripOffsets[1] - StripOffsets[0]</span> (or, more generally, <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[n] = StripOffsets[n+1] - StripOffsets[n]</span>)</div>
<div class="MsoPlainText">
- Each Strip needs to be equally long: <span style="font-family: "courier new" , "courier" , monospace;">StripByteCounts[0] = StripByteCounts[1] = StripByteCounts[2] = ... = StripByteCounts[n]</span><o:p></o:p></div>
<div class="MsoPlainText">
<br /></div>
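<div class="MsoPlainText">
The checks listed above could be sketched as follows. This is not code from checkit_tiff; the function name and the wording of the findings are made up. The sketch assumes uncompressed data whose strips are stored contiguously in file order. It also deviates slightly from the formulas above: the number of strips is computed as ceil(ImageLength / RowsPerStrip), as defined in the TIFF 6.0 specification, so a shorter last strip is tolerated instead of requiring all strips to be equally long.</div>

```python
import math


def check_strips(image_width, image_length, samples_per_pixel, bits_per_sample,
                 rows_per_strip, strip_offsets, strip_byte_counts):
    """Return a list of findings; an empty list means the strip layout is plausible."""
    findings = []

    # Number of strips per image as defined in the TIFF 6.0 specification.
    expected_strips = math.ceil(image_length / rows_per_strip)
    if (len(strip_offsets) != expected_strips
            or len(strip_byte_counts) != expected_strips):
        findings.append("unexpected number of strips")

    # Uncompressed data: each row occupies a whole number of bytes.
    bytes_per_row = math.ceil(image_width * samples_per_pixel * bits_per_sample / 8)
    for n, count in enumerate(strip_byte_counts):
        rows = max(0, min(rows_per_strip, image_length - n * rows_per_strip))
        if count != rows * bytes_per_row:
            findings.append(
                f"strip {n}: {count} bytes stored, {rows * bytes_per_row} expected")
            break  # report the first mismatch only

    # Contiguous strips: each strip ends where the next one begins.
    for n, (start, nxt) in enumerate(zip(strip_offsets, strip_offsets[1:])):
        if n < len(strip_byte_counts) and nxt - start != strip_byte_counts[n]:
            findings.append(f"strip {n}: offsets and byte count disagree")
            break

    return findings


# The values of the defective file described in this post:
offsets = [8 + 12906 * n for n in range(6020)]
counts = [12906] * 6020
print(check_strips(4302, 6020, 1, 8, 1, offsets, counts))  # one finding
print(check_strips(4302, 6020, 3, 8, 1, offsets, counts))  # no findings
```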
<div class="MsoPlainText">
We discussed these possibilities in a larger group, which made Andreas curious, so he sat down and extended his tool for finding candidates for former IFDs in TIFFs with some soft search criteria. Furthermore, he created an entirely new tool that reads a TIFF and an address in hex notation and interprets the contents at that address as if an IFD were stored there. This way, we were able to identify six former IFDs that point to older versions of this file and to inspect these IFDs a little further. The tools are available as source code at <a href="https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools">https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools</a>; they are part of the established tool fixit_tiff.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-4yRM_VmNVya1lV60qxf5bAL5M-Tft0QRigS-RwYWVHfx4pHMk8YDhhQDDffQKlZp8Ae1CGtKnMgn06WEJ0HupB2vJ7BdVkd5POSfRJ3EF-LA_H61GdauHyqdmYpxymcLopEDAGtwuZ9z/s1600/1st_IFD0_pointer.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="167" data-original-width="700" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-4yRM_VmNVya1lV60qxf5bAL5M-Tft0QRigS-RwYWVHfx4pHMk8YDhhQDDffQKlZp8Ae1CGtKnMgn06WEJ0HupB2vJ7BdVkd5POSfRJ3EF-LA_H61GdauHyqdmYpxymcLopEDAGtwuZ9z/s320/1st_IFD0_pointer.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">Pointer to the original IFD0, just like it was stored in the 1st file version</td></tr>
</tbody></table>
</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
The list of possible IFD addresses as given by our tools looks like this:<br />
<span style="font-family: "courier new" , "courier" , monospace;"># adress,weight,is_sorted,has_required_baseline</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a184b0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a241aa,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a2fea4,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a3bbb0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a478d0,2,y,y</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0x4a535ea,2,y,y</span><br />
<br />
We inserted these IFD addresses into the file's IFD0 offset pointer using a Hex Editor. Step by step, using this method, we were able to recreate older versions of the file in an archaeology style of work. In the course of the work we could confirm that the scan was originally saved in RGB. Later, there must have been an error in a grayscale conversion where only the tags PhotometricInterpretation (min-is-black) and BitsPerSample (1) were changed. We were not able to find out if the image data had been altered as well.</div>
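<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
The same edit could also be scripted instead of being done by hand in a hex editor. The following Python sketch is not part of fixit_tiff; <span style="font-family: "courier new" , "courier" , monospace;">patch_ifd0_offset</span> is a hypothetical helper. It relies only on the TIFF header layout (2 bytes byte order, 2 bytes magic number 42, 4 bytes offset of the first IFD) and returns a patched copy, so the original file stays untouched.</div>

```python
import struct

def patch_ifd0_offset(data: bytes, new_offset: int) -> bytes:
    """Return a copy of a classic TIFF whose header points to new_offset."""
    # Bytes 0-1 select the byte order: "II" little endian, "MM" big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or len(data) < 8:
        raise ValueError("no TIFF header found")
    (magic,) = struct.unpack_from(order + "H", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    # An IFD has to start on an even address inside the file.
    if new_offset % 2 or not 8 <= new_offset < len(data):
        raise ValueError("implausible IFD offset")
    # Bytes 4-7 hold the offset of the first IFD (IFD0).
    return data[:4] + struct.pack(order + "I", new_offset) + data[8:]
```

<div class="MsoPlainText">
Calling it once per candidate address from the list above and writing each result to a new file would give one reconstructed file version per former IFD.</div>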
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
tiffinfo shows this information for the presumable first IFD0 version of the RGB image:</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Photometric Interpretation: RGB color</span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 3</span></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Later versions, however, contain the values:</div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Photometric Interpretation: min-is-black</span></div>
<div class="MsoPlainText">
<span style="font-family: "courier new" , "courier" , monospace;">Samples/Pixel: 1</span></div>
<div class="MsoPlainText">
<br />
Also, there have been later file versions in which some other tags were added, altered or deleted (Make, Model and Software).<br />
<br /></div>
<div class="MsoPlainText">
The error was only discovered because intellectual checks were in place and the human operator noticed the error when displaying the TIFF (and because she decided to inform our colleague of this oddity!). Also, because checksums are only generated after the processing workflow is completed, we wouldn't have noticed the error through a fixity mismatch. We simply didn't have any checksums yet to compare the image against. In the end, the only proper solution will be a rescan of that newspaper page. However, it's still impressive to see the possibilities that TIFF offers to repair seemingly broken images.<br />
<br />
Former articles on this subject (also available in English):<br />
<br />
<ul>
<li><a href="https://kulturreste.blogspot.de/2018/02/restaurierung-von-kaputten-tiff-dateien.html">Restaurierung von kaputten TIFF-Dateien</a></li>
<li><a href="https://kulturreste.blogspot.de/2017/01/repairing-tiff-images-preliminary-report.html">repairing TIFF images - a preliminary report</a></li>
<li><a href="https://kulturreste.blogspot.de/2016/11/some-thoughts-about-risks-in-tiff-file.html">Some thoughts about risks in TIFF file format</a></li>
<li><a href="https://kulturreste.blogspot.de/2016/08/image-file-directories-reparieren.html">Image File Directories reparieren</a></li>
</ul>
</div>
</div>
Jörg Sachsehttp://www.blogger.com/profile/17097541683565972324noreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-65497376330981615722018-02-02T04:17:00.003-08:002018-02-02T05:56:49.015-08:00Restaurierung von kaputten TIFF-Dateien(English version below)<br />
<h2>
Kaputtes TIFF, erste Analyse </h2>
<br />
Ein Kollege schickte uns dieser Tage eine TIFF-Datei, die sich nicht öffnen liess. <b><a href="https://www.imagemagick.org/script/index.php">ImageMagick</a></b> meldete:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">display-im6.q16: Can not read TIFF directory count. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/564.<br />display-im6.q16: Failed to read directory at offset 27934990. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/564.</span></blockquote>
<br />
Das Tool <b><span style="font-family: inherit;">tiffinfo</span></b> gab diese Fehlermeldung zurück:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">TIFFFetchDirectory: Can not read TIFF directory count.<br />TIFFReadDirectory: Failed to read directory at offset 27934990.</span></blockquote>
<br />
Ein Blick mit dem Hexeditor <b><a href="https://www.kde.org/applications/utilities/okteta/">Okteta</a></b> und aktiviertem TIFF-Profil (welches im Übrigen unter <a href="https://github.com/art1pirat/okteta_tiff">https://github.com/art1pirat/okteta_tiff</a> zu finden ist) zeigt, dass das der Offset-Zeiger, der auf das erste ImageFileDirectory (IFD) verweisen sollte, eine Adresse außerhalb der Datei enthält:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWZtvflrc2WRUtYcOWyRnW93t3YR6MiPVUAmwtTmO8ivqRqfDWOu3RynforiFi5szBWsD-tCt37Suh9kgUi-l5ioQsRTd2xXgczEaA3xjqkfnaIBaRjAtW-RbbnxfMS93JF7cmzDOuxe_A/s1600/okteta_kaputterZeigeraufIFD.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="406" data-original-width="1492" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWZtvflrc2WRUtYcOWyRnW93t3YR6MiPVUAmwtTmO8ivqRqfDWOu3RynforiFi5szBWsD-tCt37Suh9kgUi-l5ioQsRTd2xXgczEaA3xjqkfnaIBaRjAtW-RbbnxfMS93JF7cmzDOuxe_A/s400/okteta_kaputterZeigeraufIFD.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Screenshot Okteta, TIFF mit defektem Verweis auf erstes IFD</td><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
Faktisch ist das TIFF damit kaputt. Doch bestimmte Eigenschaften dieses Dateiformates erlauben es, eine Restaurierung zu versuchen.<br />
<br />
<h3>
Nebeneinschub </h3>
Für eine gut lesbare Einführung in den Aufbau von TIFF-Dateien sei auf den Blogeintrag "<a href="http://art1pirat.blogspot.de/2013/07/baseline-tiff.html">baseline TIFF</a>" verwiesen. In "<a href="http://art1pirat.blogspot.de/2013/08/baseline-tiff-versuch-einer.html">baseline TIFF - Versuch einer Rekonstruktion</a>" wird auf einige manuelle Plausibilitätsprüfungen eingegangen.<br />
<br />
Einen kurzen Überblick liefert auch "nestor Thema: Das Dateiformat TIFF" (zu finden auf <a href="http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html">http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html</a>)<br />
<br />
<h2>
Finden von IFDs</h2>
<br />
TIFF bringt ein paar Eigenschaften mit, die den Versuch einer Restaurierung erleichtern. So müssen laut Spezifikation Offsets immer auf gerade Adressen verweisen. Damit halbiert sich schon einmal der Suchraum.<br />
<br />
Desweiteren können wir annehmen, dass ein IFD mindestens 4 Tags (oft deutlich mehr) enthält, in der Regel Subfiletype (<i>0x00fe</i>), ImageWidth (<i>0x0100</i>), ImageLength (<i>0x0101</i>) und BitsPerSample (<i>0x0102</i>).<br />
<br />
Da ein IFD nach den Tags als letzten Eintrag ein NextIFD Feld enthält, welches entweder auf 0 gesetzt ist oder auf ein weiteres IFD verweist, haben wir bereits einiges an wertvollen Hinweisen zusammen.<br />
<br />
Auch die Tageinträge innerhalb des IFD selber folgen einer Struktur. Jeder Eintrag besteht aus 2 Bytes TagId, 2 Bytes FieldType, sowie 4 Bytes Count und 4 Bytes ValueOrOffset (sh. <a href="http://3.bp.blogspot.com/-wKhvsq4CFHE/UfkFxyy9owI/AAAAAAAAAeI/FuH02zP7vUI/s1600/tiff_vortrag__3.png">Tag-Aufbau, Artikel "baseline TIFF" auf http://art1pirat.blogspot.de</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
In der TIFF-Spezifikation sind für FieldType 12 mögliche Werte definiert, die <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://en.wikipedia.org/wiki/Libtiff">libtiff</a></span> kennt 18 Werte. Wir können also für jedes angenommene Tag prüfen, ob die Werte im Bereich 1-18 liegen.<br />
<br />
Neben diesen harten Kriterien könnten wir, falls die Notwendigkeit besteht, noch weitere hinzuziehen, zum Beispiel:<br />
<br />
<ul>
<li>Prüfe, ob bestimmte Pflicht-Tags vorhanden sind</li>
<li>Prüfe, ob alle Tags, wie von der Spezifikation gefordert, aufsteigend sortiert sind und keine Dubletten enthalten</li>
<li>Prüfe, ob ValueOrOffset ein Offset sein könnte und damit auf eine gerade Adresse verweist</li>
</ul>
<br />
Sicherlich ließen sich noch weitere Kriterien finden, doch in der Praxis zeigt sich, dass die og. harten Kriterien in der Regel schon ausreichen.<br />
<br />
Um die Suche nach diesen nicht händisch vornehmen zu müssen, besitzt das Tool <b><span style="font-family: inherit;"><a href="https://github.com/SLUB-digitalpreservation/fixit_tiff">fixit_tiff</a></span></b> seit kurzem das Programm "<b><span style="font-family: inherit;">find_potential_IFD_offsets</span></b>".<br />
<br />
Wenn man es mit:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> ./find_potential_IFD_offsets test.tiff test.out.txt</span></blockquote>
<br />
aufruft, spuckt es in der Datei "<span style="font-family: "courier new" , "courier" , monospace;">test.out.txt</span>" eine Liste von Adressen aus, die potentiell ein IFD sein könnten. Für unsere Datei lieferte es den Wert "<span style="font-family: "courier new" , "courier" , monospace;">0x0008</span>", sprich: das IFD müsste an Adresse 8 anfangen.<br />
<br />
Mit <b>Okteta</b> die Datei geladen und geändert, voila!, es sieht gut aus:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyZD00bXSVwM21ZCbrZeeFybEDrgXDWx0JbzvVqjMzR00q-2y8pNLavfsBGEqqWZ7rZKKa8mgggldCNyNI6l9G4xqa1xups_hmxIitBgRHMKE0hKrwJv4YdvonB54rOZ6f7aJ1n0z6tLQE/s1600/okteta_repariert.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="481" data-original-width="1473" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyZD00bXSVwM21ZCbrZeeFybEDrgXDWx0JbzvVqjMzR00q-2y8pNLavfsBGEqqWZ7rZKKa8mgggldCNyNI6l9G4xqa1xups_hmxIitBgRHMKE0hKrwJv4YdvonB54rOZ6f7aJ1n0z6tLQE/s400/okteta_repariert.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Screenshot Okteta, TIFF mit repariertem Verweis auf erstes IFD</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br />
Auch <b>tiffinfo</b> ist jetzt etwas glücklicher:<br />
<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">TIFFReadDirectory: Warning, Bogus "StripByteCounts" field, ignoring and calculating from imagelength.<br />TIFF Directory at offset 0x8 (8)<br /> Subfile Type: (0 = 0x0)<br /> Image Width: 4506 Image Length: 6101<br /> Resolution: 300, 300 pixels/inch<br /> Bits/Sample: 8<br /> Compression Scheme: None<br /> Photometric Interpretation: min-is-black<br /> FillOrder: msb-to-lsb<br /> Orientation: row 0 top, col 0 lhs<br /> Samples/Pixel: 1<br /> Rows/Strip: 6101<br /> Planar Configuration: single image plane<br /> Color Map: (present)<br /> Software: Quantum Process V 1.04.73</span></blockquote>
<br />
<br />
Und <b>ImageMagick</b> zeigt sich nun gnädiger:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCiZghpD6cGDTnS-XzXjSfypeProR2EodqBN0QRqqf4QX7Hk8z3HRbUDy6QJ4RmySY2NEiw7S7AVd0AH0TFySOLIlPl1j0tnhbN3SQmlQgyKgrrPIIgYVdAlemfAyjElWBBXUCaSDomKWt/s1600/repaired.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1524" data-original-width="1126" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCiZghpD6cGDTnS-XzXjSfypeProR2EodqBN0QRqqf4QX7Hk8z3HRbUDy6QJ4RmySY2NEiw7S7AVd0AH0TFySOLIlPl1j0tnhbN3SQmlQgyKgrrPIIgYVdAlemfAyjElWBBXUCaSDomKWt/s400/repaired.png" width="295" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Ansicht des TIFFs mit repariertem Offset auf IFD</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br />
Wie man sieht, ist noch nicht alles repariert, schliesslich meldet auch <b>ImageMagick</b> noch Probleme:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">display-im6.q16: Bogus "StripByteCounts" field, ignoring and calculating from imagelength. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/912.<br />display-im6.q16: Read error on strip 4075; got 2706 bytes, expected 4506. `TIFFFillStrip' @ error/tiff.c/TIFFErrors/564.</span></blockquote>
<br />
Doch sollte vorliegend gezeigt werden, dass eine Restaurierung von kaputten TIFF-Dateien durchaus möglich ist.<br />
<br />
---------------------------------------------------------------------<br />
<br />
<h2>
Broken TIFF, a first analysis</h2>
<br />
A colleague recently sent us a TIFF file that he couldn't open. <b><a href="https://www.imagemagick.org/script/index.php">ImageMagick</a></b> reported:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">display-im6.q16: Can not read TIFF directory count. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/564.<br />display-im6.q16: Failed to read directory at offset 27934990. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/564.</span></blockquote>
<br />
The tool <b>tiffinfo</b> returned the following error:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">TIFFFetchDirectory: Can not read TIFF directory count.<br />TIFFReadDirectory: Failed to read directory at offset 27934990.</span></blockquote>
<br />
A quick investigation in the Hex editor <b><a href="https://www.kde.org/applications/utilities/okteta/">Okteta</a></b> with the TIFF profile activated (to be found at <a href="https://github.com/art1pirat/okteta_tiff">https://github.com/art1pirat/okteta_tiff</a>) revealed that the offset pointer, which should be pointing to the first ImageFileDirectory (IFD), points to an address that is beyond the end of the file:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWZtvflrc2WRUtYcOWyRnW93t3YR6MiPVUAmwtTmO8ivqRqfDWOu3RynforiFi5szBWsD-tCt37Suh9kgUi-l5ioQsRTd2xXgczEaA3xjqkfnaIBaRjAtW-RbbnxfMS93JF7cmzDOuxe_A/s1600/okteta_kaputterZeigeraufIFD.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="406" data-original-width="1492" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWZtvflrc2WRUtYcOWyRnW93t3YR6MiPVUAmwtTmO8ivqRqfDWOu3RynforiFi5szBWsD-tCt37Suh9kgUi-l5ioQsRTd2xXgczEaA3xjqkfnaIBaRjAtW-RbbnxfMS93JF7cmzDOuxe_A/s400/okteta_kaputterZeigeraufIFD.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">screenshot Okteta, TIFF with defective pointer to the 1st IFD</td><td class="tr-caption" style="font-size: 12.8px;"><br /></td></tr>
</tbody></table>
Given that, the TIFF is de facto broken. However, we can leverage certain properties of this file format to try a restoration.<br />
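<br />
The diagnosis itself can be reproduced without a hex editor. The following Python sketch is only an illustration (the function name <span style="font-family: "courier new" , "courier" , monospace;">ifd0_pointer_status</span> is made up here): it reads the 4-byte pointer to the first IFD from the TIFF header and reports whether it still points inside the file.<br />

```python
import struct

def ifd0_pointer_status(data: bytes):
    """Return (offset, is_inside_file) for the first-IFD pointer of a TIFF."""
    # Bytes 0-1 select the byte order: "II" little endian, "MM" big endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or len(data) < 8:
        raise ValueError("no TIFF header found")
    # Bytes 4-7 hold the offset of the first IFD.
    (offset,) = struct.unpack_from(order + "I", data, 4)
    # An IFD needs at least its 2-byte entry count to fit into the file.
    return offset, offset + 2 <= len(data)
```

For a file like the one above, the second element of the result would be false, which matches the "Failed to read directory at offset" messages.<br />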
<br />
<h3>
Side note</h3>
For a readable introduction to the structure of TIFF files, please refer to the blog post "<a href="http://art1pirat.blogspot.de/2013/07/baseline-tiff.html">baseline TIFF</a>". The article "<a href="http://art1pirat.blogspot.de/2013/08/baseline-tiff-versuch-einer.html">baseline TIFF - Versuch einer Rekonstruktion</a>" describes some manual plausibility checks.<br />
<br />
Another short overview is provided by "nestor Thema: Das Dateiformat TIFF" (to be found at <a href="http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html">http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html</a>)<br />
<br />
<h2>
Finding IFDs</h2>
<br />
TIFF comes with a few properties that facilitate restoration attempts. According to the specification, offsets must point to even addresses, which already cuts the search space in half.<br />
<br />
Also, we can assume that an IFD contains at least four tags (often significantly more), usually Subfiletype (<i>0x00fe</i>), ImageWidth (<i>0x0100</i>), ImageLength (<i>0x0101</i>) and BitsPerSample (<i>0x0102</i>).<br />
<br />
As an IFD's last entry after all the tags is a pointer to the NextIFD, which is either set to 0 or points to another IFD, we already have some useful hints to work with.<br />
<br />
The tag entries inside the IFD follow a strict structure as well. Each entry consists of 2 bytes TagId, 2 bytes FieldType, 4 bytes Count and 4 bytes ValueOrOffset (also see <a href="http://3.bp.blogspot.com/-wKhvsq4CFHE/UfkFxyy9owI/AAAAAAAAAeI/FuH02zP7vUI/s1600/tiff_vortrag__3.png">tag structure, article "baseline TIFF" on http://art1pirat.blogspot.de</a>).<br />
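<br />
As a small illustration of this 12-byte layout, here is a Python sketch that unpacks a single entry. The names <span style="font-family: "courier new" , "courier" , monospace;">IfdEntry</span> and <span style="font-family: "courier new" , "courier" , monospace;">parse_entry</span> are invented for this example and do not come from fixit_tiff; ValueOrOffset is returned as the raw 32-bit number without deciding whether it is a value or an offset.<br />

```python
import struct
from typing import NamedTuple

class IfdEntry(NamedTuple):
    tag_id: int           # 2 bytes
    field_type: int       # 2 bytes
    count: int            # 4 bytes
    value_or_offset: int  # 4 bytes, raw and uninterpreted

def parse_entry(raw: bytes, little_endian: bool = True) -> IfdEntry:
    """Unpack one 12-byte IFD entry of a classic TIFF."""
    if len(raw) != 12:
        raise ValueError("an IFD entry is exactly 12 bytes long")
    fmt = ("<" if little_endian else ">") + "HHII"
    return IfdEntry(*struct.unpack(fmt, raw))
```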
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The TIFF specification defines 12 possible values for the FieldType, while <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://en.wikipedia.org/wiki/Libtiff">libtiff</a></span> knows 18 values. We can therefore check, for each chunk of bytes that might be a tag entry, whether its FieldType value lies between 1 and 18.<br />
<br />
Additionally, we could add some soft criteria to these hard criteria that we already have:<br />
<br />
<ul>
<li>check if certain mandatory tags can be found</li>
<li>check if all tags are sorted in an ascending order and don't contain any duplicates as required by the specification</li>
<li>check if ValueOrOffset can be an actual offset by checking if it points to an even address</li>
</ul>
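<br />
To make these criteria more tangible, here is a minimal Python sketch of such a search. It is not the implementation of find_potential_IFD_offsets; the function name, the parameters and the returned tuples are assumptions made for this example. It applies the hard criteria (even address, at least four tags, FieldType between 1 and 18, plausible NextIFD pointer) and records one soft criterion, namely whether the tags are sorted in ascending order without duplicates.<br />

```python
import struct

def find_potential_ifd_offsets(data: bytes, little_endian: bool = True,
                               min_tags: int = 4, max_field_type: int = 18):
    """Return (address, is_sorted) tuples for addresses that could hold an IFD."""
    order = "<" if little_endian else ">"
    candidates = []
    # Offsets must be even, and the first 8 bytes belong to the TIFF header.
    for pos in range(8, len(data) - 1, 2):
        (n,) = struct.unpack_from(order + "H", data, pos)
        end = pos + 2 + 12 * n            # address of the NextIFD field
        if n < min_tags or end + 4 > len(data):
            continue
        tags = []
        for i in range(n):
            tag_id, field_type, _count, _value = struct.unpack_from(
                order + "HHII", data, pos + 2 + 12 * i)
            if not 1 <= field_type <= max_field_type:
                break                     # hard criterion violated
            tags.append(tag_id)
        else:
            # NextIFD is either 0 or an even address inside the file.
            (next_ifd,) = struct.unpack_from(order + "I", data, end)
            if next_ifd != 0 and (next_ifd % 2 or next_ifd >= len(data)):
                continue
            # Soft criterion: ascending tag order without duplicates.
            is_sorted = all(a < b for a, b in zip(tags, tags[1:]))
            candidates.append((pos, is_sorted))
    return candidates
```

Keeping the soft criterion as a flag instead of a filter means that a damaged but genuine IFD still shows up in the list and can be judged by a human.<br />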
<br />
We could think up even more criteria, but practical experience shows that the hard criteria are already sufficient for most of the cases.<br />
<br />
In order to avoid having to search for potential IFDs in the files manually, the tool <b><span style="font-family: inherit;"><a href="https://github.com/SLUB-digitalpreservation/fixit_tiff">fixit_tiff</a></span></b> now comes with the program "<b>find_potential_IFD_offsets</b>".<br />
<br />
If it is invoked like:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">$> ./find_potential_IFD_offsets test.tiff test.out.txt</span></blockquote>
<br />
it will spew out a list of addresses to the file "<span style="font-family: "courier new" , "courier" , monospace;">test.out.txt</span>" that might potentially mark the beginning of an IFD. For the file from our colleague, it gave us only one value, which was "<span style="font-family: "courier new" , "courier" , monospace;">0x0008</span>". In other words, the IFD should start at address 8.<br />
<br />
Now load the file in <b>Okteta</b>, change the pointer to the first IFD right after the TIFF header to the correct address, et voilà, it looks good:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyZD00bXSVwM21ZCbrZeeFybEDrgXDWx0JbzvVqjMzR00q-2y8pNLavfsBGEqqWZ7rZKKa8mgggldCNyNI6l9G4xqa1xups_hmxIitBgRHMKE0hKrwJv4YdvonB54rOZ6f7aJ1n0z6tLQE/s1600/okteta_repariert.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="481" data-original-width="1473" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyZD00bXSVwM21ZCbrZeeFybEDrgXDWx0JbzvVqjMzR00q-2y8pNLavfsBGEqqWZ7rZKKa8mgggldCNyNI6l9G4xqa1xups_hmxIitBgRHMKE0hKrwJv4YdvonB54rOZ6f7aJ1n0z6tLQE/s400/okteta_repariert.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">screenshot Okteta, TIFF with repaired pointer to 1st IFD</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br />
<b>tiffinfo</b> is now a little happier as well:<br />
<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">TIFFReadDirectory: Warning, Bogus "StripByteCounts" field, ignoring and calculating from imagelength.<br />TIFF Directory at offset 0x8 (8)<br /> Subfile Type: (0 = 0x0)<br /> Image Width: 4506 Image Length: 6101<br /> Resolution: 300, 300 pixels/inch<br /> Bits/Sample: 8<br /> Compression Scheme: None<br /> Photometric Interpretation: min-is-black<br /> FillOrder: msb-to-lsb<br /> Orientation: row 0 top, col 0 lhs<br /> Samples/Pixel: 1<br /> Rows/Strip: 6101<br /> Planar Configuration: single image plane<br /> Color Map: (present)<br /> Software: Quantum Process V 1.04.73</span></blockquote>
<br />
<br />
And even <b>ImageMagick</b> is now a little more gracious:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCiZghpD6cGDTnS-XzXjSfypeProR2EodqBN0QRqqf4QX7Hk8z3HRbUDy6QJ4RmySY2NEiw7S7AVd0AH0TFySOLIlPl1j0tnhbN3SQmlQgyKgrrPIIgYVdAlemfAyjElWBBXUCaSDomKWt/s1600/repaired.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1524" data-original-width="1126" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCiZghpD6cGDTnS-XzXjSfypeProR2EodqBN0QRqqf4QX7Hk8z3HRbUDy6QJ4RmySY2NEiw7S7AVd0AH0TFySOLIlPl1j0tnhbN3SQmlQgyKgrrPIIgYVdAlemfAyjElWBBXUCaSDomKWt/s400/repaired.png" width="295" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">View of the TIFF with repaired offset to the IFD</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br />
As you can see, not everything has been repaired yet, and <b>ImageMagick</b> is still reporting some problems:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">display-im6.q16: Bogus "StripByteCounts" field, ignoring and calculating from imagelength. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/912.<br />display-im6.q16: Read error on strip 4075; got 2706 bytes, expected 4506. `TIFFFillStrip' @ error/tiff.c/TIFFErrors/564.</span></blockquote>
<br />
However, we were able to show that a restoration of broken TIFFs is indeed feasible, and even though some of the data is lost, we can still see part of what was once a magazine scan.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-62022192870437680032017-09-06T08:10:00.001-07:002017-09-06T08:10:51.344-07:00Pointer to an interesting interview about FFV1A highly interesting interview by Jürgen Keiper with Peter Bubestinger on the origins of and the motivation behind Matroska/FFV1 as a data format for audiovisual media that is suitable for long-term archiving.<br />
<br />
It is particularly interesting for archivists who want to know why FFV1/Matroska can solve their problems. Peter manages to explain things simply and vividly and gets by (almost) without technical vocabulary.<br />
<br />
Verdict: worth watching!<br />
<br />
Here is the link to the video:<br />
<br />
<a href="https://www.memento-movie.de/2017/08/die-geschichte-eines-codecs-ffv1-in-der-archivwelt/">https://www.memento-movie.de/2017/08/die-geschichte-eines-codecs-ffv1-in-der-archivwelt/</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-9052940887756266577.post-28468586849447652112017-05-30T13:41:00.002-07:002017-05-30T13:41:23.643-07:00Bibtag - and a little something learnedToday I made a side trip to the<a href="http://bibliothekartag2017.de/"> Bibliothekartag 2017</a> in Frankfurt am Main. For one thing, I wanted to meet quite a few former fellow students; for another, I was interested in the <a href="http://www.professionalabstracts.com/api/iplanner/?conf=dbt2017&model=sessions&method=get&params[sids]=119&params[pids]=241&params[format]=pdf">workshop</a> by <a href="http://www.zbw.eu/de/ueber-uns/arbeitsschwerpunkte/langzeitarchivierung/yvonne-friese/">Yvonne Tunnat of the ZBW</a> on format identification.<br />
<br />
Yvonne has a wonderful, pragmatic way of explaining complicated matters. If you would like to meet her, the <a href="http://www.langzeitarchivierung.de/Subsites/nestor/DE/Veranstaltungen/TermineNestor/praktikertag2017.html">nestor-Praktikertag 2017</a> on format validation still has places available. <br />
<br />
Two things I am taking away. First, I did not yet know the tool <a href="http://eternal-todo.com/tools/peepdf-pdf-analysis-tool"><i>peepdf</i></a>. It is a CLI program for dissecting a PDF file and originally comes from the forensics corner.<br />
<br />
Second, with <a href="http://coptr.digipres.org/Bad_Peggy">Bad Peggy</a> there is a validation tool for analyzing JPEGs.<br />
<br />
One discussion that keeps coming up is how to deal with unknown file formats. IMHO these are not fit for archiving and are to be regarded as binary garbage. However, this calls for a longer post and a closer analysis of whether and under which conditions such files are negligible, or whether the long tail strikes.<br />
<br />
BTW, anyone who is still at the Bibtag on Wednesday should drop by the <a href="http://www.professionalabstracts.com/api/iplanner/?conf=dbt2017&model=sessions&method=get&params[sids]=175&params[pids]=165,166,167,168&params[format]=pdf">talk</a> by our colleague Sabine on the results of the PDF/A validation.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-9052940887756266577.post-54558693991555109812017-05-16T02:13:00.000-07:002017-05-17T01:41:22.701-07:00On the idea of wanting to measure a long-term archive <table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://openclipart.org/detail/10430/meter" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="OpenClipart by yves_guillou, see link" height="253" src="https://openclipart.org/download/10430/yves-guillou-meter.svg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">OpenClipart by yves_guillou, see link in the image</td></tr>
</tbody></table>
<div dir="ltr">
At some point in any organization you meet people who have devoted themselves to numbers: people who work as mathematicians, financial accountants or controllers. That is okay, because invoices need to be paid, resources planned and funds provided.</div>
<div dir="ltr">
<br /></div>
<h3 dir="ltr">
Omnimetry </h3>
<div dir="ltr">
<br /></div>
<div dir="ltr">
The encounter with numbers people becomes problematic when they determine how the organization is steered. When everything is only about key figures, about throughput, about measurable performance, about omnimetry.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Gunter Dueck already wrote in Wild Duck¹: <i>"In our knowledge and service society there are more and more activities that so far cannot be measured in meters, kilograms or megabytes, because they have a 'higher', in the broadest sense an artistic touch. The working world has so far failed at standardizing higher principles."</i></div>
<div dir="ltr">
</div>
<h3 dir="ltr">
Numbers don't lie </h3>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Let us look specifically at a digital long-term archive. Demands for collecting key figures such as:<br />
<ul>
<li>the number of files that move into the archive per month, </li>
<li>or the number of Submission Information Packages (SIPs) that originate from certain workflows, </li>
</ul>
demotivate a committed archive team. </div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Because these numbers say nothing. Even with automated workflows, digital long-term archives sit at the end of the exploitation chain. It would be roughly like trying to measure sausage sales by the number of visitors to the customer restroom.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
In practice, intellectual entities (IEs) that are to be archived for the long term are sorted according to their degree of archivability and their conformance with the archive's own format policies.<br />
<br />
Those IEs that are considered valid move into long-term storage, wrapped in Archival Information Packages (AIPs). The IEs that are not fit for archiving end up in quarantine, and a Technical Analyst (TA) works out a solution or rejects the transfer packages (SIPs) containing these IEs.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
If we consider a largely homogeneous workflow, such as the long-term archiving of retro-digitized material, the largest share of the IEs should be able to land in long-term storage without problems. In that case it is tempting to simply measure the number of IEs and the number and size of the associated files in order to make a statement about the throughput of the long-term archive and the performance of the long-term archiving (LZA) team.<br />
</div>
<h3 dir="ltr">
The standard case as the exception</h3>
<div dir="ltr">
</div>
<div dir="ltr">
<br />
But this view ignores that it is not the standard case, where IEs move into the archive system homogenized and automated, that is time-consuming, but the individual case in which the TA has to work out why the IE is structured differently and how to find a suitable solution for it.<br />
</div>
<h3 dir="ltr">
Format knowledge</h3>
<div dir="ltr">
</div>
<div dir="ltr">
<br />
What the simple throughput view also ignores is that the archive team has to build up format knowledge for data and metadata formats that were previously unknown or known only in general terms. This learning process depends heavily on how well the formats are already documented and how complex their internal structures are.<br />
</div>
<h3 dir="ltr">
Organizational process</h3>
<div dir="ltr">
</div>
<div dir="ltr">
<br />
A third point that management by the method of omnimetry (Omnimetrie) ignores is the insight, already formulated in the nestor handbook², that digital long-term preservation must be an organizational process.</div>
<div dir="ltr">
<br />
If, as in many memory institutions that produce retrodigitized material, digitization was done for the stockpile and the preservation team receives the resulting digital images only one or two years later, then in case of errors the team can hardly feed anything back to the producer of the digitized material. The frequent project-based handling of digitization tasks by external service providers aggravates the problem further. What one would measure in that case is in truth not underperformance by the preservation team, but an expression of the organizational failure to consider the long-term availability of the digitized material from the start.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Of course it makes sense to accompany the development of the archive with metrics. Storage has to be procured in time, and bandwidth has to be provided. Here too, a sense of proportion and common sense apply.</div>
<br />
<div dir="ltr">
<u>¹ Gunter Dueck, Wild Duck -- Empirische Philosophie der Mensch-Computer-Vernetzung, Springer-Verlag Berlin-Heidelberg, (c)2008, 4th edition, p. 71 </u></div>
<div dir="ltr">
<u>² <a href="http://nestor.sub.uni-goettingen.de/handbuch/artikel/nestor_handbuch_artikel_74.pdf">Nestor Handbuch</a> -- </u>Eine kleine Enzyklopädie der digitalen Langzeitarchivierung, Dr. Heike Neuroth et al., chapter 8 "Vertrauenswürdigkeit von digitalen Langzeitarchiven" by Susanne Dobratz and Astrid Schoger, http://nestor.sub.uni-goettingen.de/handbuch/artikel/text_84.pdf, p. 3</div>
<div dir="ltr">
</div>
<h2>
FFV1 - some compression results</h2>
Published 2017-04-29, last updated 2017-06-29.<br />
<br />
In a pilot we received some retrodigitized films and videos in Matroska/FFV1 format. The following tables summarize the results:<br />
<br />
<br />
<table border="1" cellpadding="1" cellspacing="0"><colgroup><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col></colgroup><tbody>
<tr><th>film/video</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th></tr>
<tr><th>description</th><td>8mm, positive, b/w</td><td>8mm, positive, b/w</td><td>16mm, positive, b/w</td><td>35mm, combined, color</td><td>35mm, combined, color</td></tr>
<tr><th>width</th><td>2500</td><td>2500</td><td>2048</td><td>4096</td><td>4096</td></tr>
<tr><th>height</th><td>1524</td><td>1524</td><td>1520</td><td>3460</td><td>2976</td></tr>
<tr><th>bits per pixel</th><td>48</td><td>48</td><td>48</td><td>48</td><td>48</td></tr>
<tr><th>pxfmt</th><td>gbrp16le</td><td>gbrp16le</td><td>gbrp16le</td><td>gbrp16le</td><td>gbrp16le</td></tr>
<tr><th>duration in s</th><td>12</td><td>12</td><td>11.459</td><td>2.5</td><td>2.5</td></tr>
<tr><th>fps</th><td>24</td><td>24</td><td>24</td><td>24</td><td>24</td></tr>
<tr><th>frames</th><td>288</td><td>288</td><td>275</td><td>60</td><td>60</td></tr>
<tr><th>original size (bytes)</th><td>6583680000</td><td>6583680000</td><td>5136682844.16</td><td>5101977600</td><td>4388290560</td></tr>
<tr><th>compressed size (bytes)</th><td>3861943880</td><td>3790690517</td><td>3680779719</td><td>3908475344</td><td>3576745774</td></tr>
<tr><th>compression ratio</th><td>1.704</td><td>1.736</td><td>1.395</td><td>1.305</td><td>1.226</td></tr>
<tr><th>(DPX size)</th><td>6584159232</td><td>6584159232</td><td>5136841600</td><td>5102077440</td><td>4388390400</td></tr>
<tr><th>(h264 lossless)</th><td>n/a </td><td>n/a </td><td>n/a </td><td>n/a </td><td>n/a </td></tr>
<tr><th>(h265 lossless)</th><td>3573420309</td><td>3559442475</td><td>2756504247</td><td>3015053822</td><td>2992764833</td></tr>
<tr><th>(jp2k lossless)</th><td>4589886341</td><td>4534014321</td><td>3732555539</td><td>3869665916</td><td>3514687046</td></tr>
<tr><th>with audio</th><td>n</td><td>n</td><td>n</td><td>y</td><td>n</td></tr>
</tbody></table>
<br />
<br />
<table border="1" cellpadding="1" cellspacing="0"><colgroup><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col><col></col></colgroup><tbody>
<tr><th>film/video</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th></tr>
<tr><th>description</th><td>35mm, combined, color</td><td>vhs, color</td><td>betacam, color</td><td>betacam, color</td><td>Digi-beta, color</td></tr>
<tr><th>width</th><td>4096</td><td>720</td><td>720</td><td>720</td><td>720</td></tr>
<tr><th>height</th><td>3200</td><td>576</td><td>576</td><td>576</td><td>576</td></tr>
<tr><th>bits per pixel</th><td>48</td><td>20</td><td>20</td><td>20</td><td>20</td></tr>
<tr><th>pxfmt</th><td>gbrp16le</td><td>yuv422p10le</td><td>yuv422p10le</td><td>yuv422p10le</td><td>yuv422p10le</td></tr>
<tr><th>duration in s</th><td>1088.042</td><td>280</td><td>280</td><td>280</td><td>280</td></tr>
<tr><th>fps</th><td>24</td><td>25</td><td>25</td><td>25</td><td>25</td></tr>
<tr><th>frames</th><td>26113</td><td>7000</td><td>7000</td><td>7000</td><td>7000</td></tr>
<tr><th>original size (bytes)</th><td>2053610510746</td><td>7257600000</td><td>7257600000</td><td>7257600000</td><td>7257600000</td></tr>
<tr><th>compressed size (bytes)</th><td>1575415175611</td><td>3565437155</td><td>3838500934</td><td>3449372280</td><td>4451325952</td></tr>
<tr><th>compression ratio</th><td>1.303</td><td>2.035</td><td>1.890</td><td>2.104</td><td>1.630</td></tr>
<tr><th>(DPX size)</th><td>2053653333632</td><td>17472217728</td><td>17429888000</td><td>17429888000</td><td>17429888000</td></tr>
<tr><th>(h264 lossless)</th><td>n/a</td><td>n/a </td><td>n/a </td><td>n/a </td><td>n/a </td></tr>
<tr><th>(h265 lossless)</th><td>1248031292634</td><td>3659828688</td><td>3772522257</td><td>3442739259</td><td>4323623225</td></tr>
<tr><th>(jp2k lossless)</th><td>1517117560575</td><td>3300899483</td><td>3470434177</td><td>3150727081</td><td>4022908822</td></tr>
<tr><th>with audio</th><td>n</td><td>y</td><td>y</td><td>y</td><td>y</td></tr>
</tbody></table>
<br />
All files are encoded with FFV1 version 3 with slices, slice CRC and GOP=1. If audio exists (linear PCM, 48 kHz, 16 bit), it is included in the compressed size but not in the original size, because the original size is calculated as width*height*bits_per_pixel*frames/8 (in bytes), whereas the compressed size is simply the file size. The frame count is calculated from the duration value of the MKV files. Files 1 to 5 and 7 to 10 are the first parts of the movies (each split into 4 GB parts).<br />
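<br />
The size arithmetic described above can be sketched in a few lines of Python. This is only an illustration and not the script used for the tables; the function names <i>raw_video_bytes</i> and <i>compression_ratio</i> are made up here, and the division by 8 assumes that the sizes in the tables are bytes (which matches the listed values).<br />

```python
def raw_video_bytes(width: int, height: int, bits_per_pixel: int, frames: int) -> int:
    """Uncompressed video payload in bytes (no container overhead, no audio)."""
    total_bits = width * height * bits_per_pixel * frames
    return total_bits // 8


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Ratio of uncompressed size to the size of the compressed file."""
    return original_bytes / compressed_bytes


# Input values for film 1 and video 7, copied from the tables above.
film1 = raw_video_bytes(2500, 1524, 48, 288)
video7 = raw_video_bytes(720, 576, 20, 7000)

print("film 1:", film1, compression_ratio(film1, 3861943880))
print("video 7:", video7, compression_ratio(video7, 3565437155))
```

Because audio is part of the compressed file but not of the computed original size, the ratios for files with audio are slightly pessimistic.<br />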
<br />
Note: <span lang="en">Once the project is completed, rights must be clarified. If possible, I will publish the sources.</span><br />
<br />
<b>Update 2017-06-09</b><br />
<br />
<ul>
<li>added file size for DPX after using "<i>ffmpeg -i input.mkv DPX/frame_%06d.dpx</i>"</li>
<li>added file size for h264 after using "<i>ffmpeg -i input.mkv -c:v libx264 -g 1 -qp 0 -crf 0 output.mkv</i>" (RGB without lossy conversion to YUV not supported yet)
</li>
<li>added file size for h265 after using "<i>ffmpeg -i input.mkv -c:v libx265 -preset veryslow -x265-params lossless=1 output.mkv</i>"
</li>
<li>added file size for openjpeg2000 after using "<i>ffmpeg -i input.mkv -c:v libopenjpeg output.mkv</i>" </li>
</ul>
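For comparison with the commands listed above, an FFV1 encode with the settings described in this post (version 3, slices, slice CRC, GOP=1) could be assembled roughly as follows. This is a sketch under assumptions, not the command line of the digitization service provider: the slice count of 16 and copying the audio stream unchanged are my own choices, and <i>ffv1_command</i> is a hypothetical helper name. The function only builds the argument list and does not run ffmpeg.<br />

```python
import shlex


def ffv1_command(src: str, dst: str, slices: int = 16) -> list[str]:
    """Build an ffmpeg argument list for FFV1 v3, intra-only, with slice CRCs."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "ffv1",
        "-level", "3",            # FFV1 version 3
        "-slices", str(slices),   # assumption: slice count is not given in this post
        "-slicecrc", "1",         # one CRC per slice
        "-g", "1",                # GOP size 1, every frame is a keyframe
        "-c:a", "copy",           # assumption: keep the audio stream as it is
        dst,
    ]


print(shlex.join(ffv1_command("input.mkv", "output.mkv")))
```

The list can be passed to <i>subprocess.run</i> if ffmpeg is installed.<br />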
<b>Update 2017-06-29</b><br />
<br />
<ul>
<li>added sizes for film no 6</li>
<li>in general, the processing time of h265 and jp2k is an order of magnitude greater than for ffv1 </li>
</ul>
<ul>
</ul>
<h3>
Interpretation</h3>
<br />
Files 1-3 are all originally b/w. It seems that the codec does not decorrelate the color channels. Also, the material in files 1-6 is retrodigitized from film and is noisy. File 1 is a special case: when decoding it, FFV1 produces a very high CPU load (eight cores at 100%). Most of the decoding time is spent in the function get_rac(). The original film has the highest noise level of all the files.<br />
<br />
I think the difference in compression ratio between the video files and the film files
comes from the different pixel formats. A ratio between 1.5 and 2 was
expected, but 1.3 is a surprise. <br />
<br />
<b>Update 2017-06-09</b><br />
<br />
The reason for the high CPU load was that the digitization service provider had created a file with a frame rate of 1000 fps, while the scanner had delivered 24 or 25 fps. Therefore, about 40 to 42 identical frames were encoded in a row for each scanned frame.<br />
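<br />
The number of repeated frames follows directly from the two frame rates. A minimal sketch of that arithmetic (the function name <i>repeats_per_source_frame</i> is only illustrative):<br />

```python
def repeats_per_source_frame(container_fps: float, scanner_fps: float) -> float:
    """How many container frames are written for one scanned frame, on average."""
    return container_fps / scanner_fps


for scanner_fps in (24, 25):
    print(scanner_fps, "fps ->", repeats_per_source_frame(1000, scanner_fps))
```

At 25 fps this gives exactly 40 repeats, at 24 fps between 41 and 42.<br />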
<br />
<br />
<br />
<h2>
nestor - DIN - Workshop "Digitale Langzeitarchivierung" (digital long-term preservation), a review</h2>
Published 2017-03-30.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgetf2V37RU5ZwGEaMKTKPQG8OYXixu1m2lXTyA5ioDkqsutxXMNTSvWnRUHySumBPZeZxMTHxA1wgQUm-SE6rRUMUwtT6ZzYF14arRXkCU6Px8hrVxDOIHoU2lBLMSpIBKbEURmSc6wE2k/s1600/C8EjNX6WkAADZys.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgetf2V37RU5ZwGEaMKTKPQG8OYXixu1m2lXTyA5ioDkqsutxXMNTSvWnRUHySumBPZeZxMTHxA1wgQUm-SE6rRUMUwtT6ZzYF14arRXkCU6Px8hrVxDOIHoU2lBLMSpIBKbEURmSc6wE2k/s320/C8EjNX6WkAADZys.jpg" width="240" /></a></div>
Yesterday a workshop of nestor, the competence network for digital long-term preservation, and DIN took place at the premises of DIN e.V. This is only a short summary for those who stayed at home and makes no claim to be an objective, let alone complete, record :)<br />
If you find errors, please send us an email with corrections ;)<br />
<br />
<h2>
Work of the NID 15 committee</h2>
<br />
At its core, the workshop dealt with the question: which standard do we want to have in digital long-term preservation in the next 5 to 8 years, and how do we get there?<br />
<br />
Prof. Keitel opened the workshop with this question and then outlined the starting position of 2005.<br />
<br />
<ul>
<li>abstract topic "digital archiving"</li>
<li>DIN 31646/31644/31645 from the nestor orbit</li>
<li>DIN 31647 "Beweiserhaltung kryptograf. signierter Dokumente" (preservation of evidence for cryptographically signed documents)</li>
<li>feedback on whether the standards are used in practice is hard to obtain</li>
<li>they refer to OAIS (ISO 14721)</li>
<li>they show whether one is still operating within the scope of digital long-term preservation.</li>
</ul>
<br />
Today, practical experience complements these early theoretical considerations. The question is therefore whether there are areas where the original assumptions have become outdated.<br />
<br />
According to Prof. Keitel, the task is to<br />
<ul>
<li> identify focus areas that are suitable for standardization</li>
<li> find contributors who want to get involved in standardization work in the new fields</li>
</ul>
<br />
Whether one is suited for standardization work can be determined, tongue in cheek, by the following criteria (quote):<br />
<ul>
<li>sitting on a chair for a long time</li>
<li>I like to improve what other people have written</li>
<li>when it comes to precise terminological definitions I am deadly serious and make no compromises</li>
<li>I like reading documents with titles such as...</li>
</ul>
Afterwards, the difficulty of obtaining feedback on existing DIN standards was discussed.<br />
<br />
<h2>
PDF standardization </h2>
<br />
Olaf Drümmer of callas software GmbH began by outlining the history of PDF and pointed to the new version 2:<br />
<br />
<ul>
<li>1993-2006 Adobe PDF 1.0 -> 1.7</li>
<li>2008 ISO: PDF 1.7 as ISO 32000-1</li>
<li>2017 ISO: PDF 2.0 as ISO 32000-2 (in the next quarter, >1000 pages)</li>
<ul>
<li>new cryptographic methods</li>
<li>tagging revised</li>
<li>colors were a problem area in the standardization process</li>
<li>namespaces were introduced, e.g. to include tags from HTML 5</li>
</ul>
</ul>
He then turned to the PDF specializations:<br />
<br />
<ul>
<li>2001 PDF/X, transfer of print-ready files</li>
<li>2005 PDF/A, archiving, ISO 19005 series</li>
<ul>
<li>arose from needs of the US Courts and the Library of Congress</li>
</ul>
<li>2008 PDF/E ISO 24517, engineering (CAD), not yet widely adopted, 3D models as well by the end of the year</li>
<li>2010 PDF/VT ISO 16612-2 + PDF/VCR ISO 16612-3, variable data printing (high-volume invoices, form letters)</li>
<li>2012 PDF/UA ISO 14289 series, accessibility</li>
</ul>
According to Drümmer, the importance of standardization already follows from the sheer prevalence of PDF documents:<br />
<ul>
<li>number of PDF documents worldwide: at least trillions (10¹²), 6 million of them at the US Courts alone</li>
<li>life expectancy per PDF: hours to years</li>
</ul>
He went on to the challenge posed by the variety of variants:<br />
<ul>
<li>PDF/X: 8 parts, 12 conformance levels in total</li>
<li>PDF/A series: 3 parts, 8 conformance levels in total</li>
<li>confusing, lack of clear distinction?</li>
<li>flexibility and power</li>
<li>open character</li>
<li>broad coverage</li>
</ul>
He then outlined how standardization is to continue from 2017 on:<br />
<ul>
<li>PDF 2.0 is largely backward compatible, no validation is planned at publication</li>
<li>the project "Camelot2" is meant to bring together the classic PDF document world and the Open Web Platform; more information at PDF Days Europe 2017, Berlin, 15-16 May 2017</li>
<li>goal for PDF/A-4: no conformance levels</li>
<li>PDF/E allows interactive elements (JS); PDF/E-2 is meant to become more of an archival flavor and less of a working-document flavor</li>
<li>XMP can be attached at *all* places in a PDF, so that sources or, e.g., <a href="https://de.wikipedia.org/wiki/Universally_Unique_Identifier">UUIDs</a> for them can be stored there</li>
<li>PDF/A-3 can also record an alternative link to the content of arbitrary files; problem: this is not mandatory and has to be regulated by policy</li>
</ul>
<br />
<h2>
nestor </h2>
<br />
Prof. Keitel briefly outlined the work of nestor:<br />
<br />
<ul>
<li> …is in any case a cooperation network</li>
<li>presented the working groups</li>
</ul>
<br />
Trustworthy archives<br />
<br />
<ul>
<li>2004-2008 nestor catalogue of criteria</li>
<li>2008-2012 DIN 31644</li>
<li>2013-… nestor Seal</li>
</ul>
<br />
<h2>
Submission Information Packages - revising the ingest standards </h2>
<br />
In a short talk, Dr. Sina Westphal and Dr. Sebastian Gleixner (Bundesarchiv, the German Federal Archives) proposed standardizing the ingest process and the SIPs.<br />
<br />
<ul>
<li>Bundesarchiv: growth of 4 PB/year</li>
<li>incentive for a gradual alignment of systems</li>
<li>unified metadata</li>
<li>improved data exchange</li>
<li>unified interfaces</li>
</ul>
Consequences:<br />
<ul>
<li>unification of existing SIPs (possibly also AIPs/DIPs)</li>
<li>unification of existing digital archive systems</li>
</ul>
<br />
Two sub-areas:<br />
<ul>
<li>standardization of the SIP (concrete)</li>
<ul>
<li>structure</li>
<li>metadata</li>
<li>primary data</li>
<li>cf. E-ARK, e-CH, EMEA</li>
</ul>
<li>standardization of the ingest process (abstract)</li>
<ul>
<li>connection to the archival description tool</li>
<li>validation</li>
<li>ingest</li>
<li>handling of primary data</li>
</ul>
</ul>
<br />
Questions:<br />
<ul>
<li>Is unification possible?</li>
<li>Is standardization of AIPs/DIPs and the associated processes necessary?</li>
</ul>
<br />
This was followed by a discussion about scope and concrete exchange procedures, with the following result:<br />
<br />
<ul>
<li>the trend is towards an abstract description of modules</li>
<li>a conceptual framework is desired</li>
<li>definition of which modules are mandatory and which are optional</li>
<li>recommended entry point for automation</li>
</ul>
<br />
<h2>
Video archiving as a new challenge: long-term preservation of audiovisual media beyond film and television </h2>
<br />
In this short talk, Alfred Werner of HUK Coburg outlined the problems of long-term preservation of videos.<br />
<br />
<ul>
<li>bandwidth at branch offices: 5-15 Mbit/s</li>
<li>they convert to monochrome multipage TIFF (small files) and to JPG, </li>
<li>videos are wanted,</li>
<ul>
<li>2011: 5 videos/day</li>
<li>2016: 20 videos/day (compared with 10,000 claims per day)</li>
<li>2021: 100? 1000? videos/day</li>
</ul>
<li>dashcam videos have been permitted since this year</li>
</ul>
<br />
Problem: a wide variety of formats, and growing; it is not getting better (3D, HDR, 4k, dual lenses, special sensors)<br />
<br />
Possible solution: conversion to a long-term archival format for videos<br />
<br />
Requirements:<br />
<ul>
<li>a standard for the next 50 years</li>
<li>license-free</li>
<li>best possible quality</li>
<li>low storage requirements</li>
<li>good response times even at low bandwidth</li>
</ul>
<br />
plus functions for case handlers, such as:<br />
zooming, setting jump marks, extracting single frames, redacting, extracting scenes.<br />
<br />
The subsequent discussion made the problem clear: one is caught between robustness and faithful reproduction on the one hand and resource requirements (storage, bandwidth, processing time) on the other.<br />
<br />
Note: a supplementary contribution on this topic was posted to the nestor mailing list.<br />
<br />
<br />
<h2>
Digital Curation </h2>
<br />
Here too, Prof. Keitel gave a short introductory talk. I hope I can reproduce the content correctly:<br />
<br />
Difference between data curation and long-term preservation according to OAIS: we no longer talk about institutions/organizations, but about techniques. That is, organizational responsibility is missing.<br />
<br />
OAIS goes records management, i.e. how can the requirements of digital long-term preservation be brought to the producers (through digital curation); the AIP effectively resides with the producer.<br />
How do the preservation functions named by OAIS/PREMIS harmonize with the rules of records management? Which elements/groups do we have to distinguish for preservation reasons?<br />
<br />
Keitel: "So far we always assumed a caretaker who preserves things permanently. Digital curation starts earlier, at the producer."<br />
<br />
<h2>
Summary</h2>
<br />
In our view, an attempt should be made to standardize ingest better. Only then would it be possible to give producers tools that are not archive-specific. The road there is steep, especially since the paths taken by archives and libraries already differ widely.<br />
<br />
PDF is and unfortunately remains a minefield. PDF 2.0 neither resolved existing ambiguities nor simplified the standard. The lack of official validation is likely to prove a particular disadvantage. In addition, the format zoo around PDF keeps growing and mixed forms of documents are possible, i.e. a PDF can be both PDF/E and PDF/A.<br />
<br />
There is a need for video formats suitable for long-term preservation. Standardization could help to push vendor support. The topic of video made clear that digital long-term preservation causes costs that are not easy to communicate. Data compression, especially lossy compression, leads to a higher risk of damage from bit errors. The discussion about the tension between robustness/quality and cost has to be held in the community, but belongs outside standardization efforts.<br />
<br />
Data curation is a topic of its own. There are gaps that arise when documents have life cycles of several decades. My gut feeling tells me that this can likewise be subsumed under long-term availability, since in long-term preservation we want to keep documents usable for an indefinite time. Data curation therefore seems to me to be nothing other than the special case in which producer and archive coincide as roles.<br />
<br />
<h2>
Where have all the standards gone? A singalong for archivists.</h2>
Published 2017-02-13.<br />
<br />
Recently, we noticed that the specification for the TIFF 6 file format has vanished from Adobe's website, where it was last hosted. As you might know, Adobe owns TIFF 6 due to legal circumstances created by the acquisition of Aldus in 1994.<br />
<br />
Up until now, we used to rely on the fact that TIFF is publicly specified by the document that was always available. However, since Adobe has taken down the document, all we have left are the local copies on our workstations, and we only have those out of pure luck. The link to <a href="http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf">http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf</a> has been dead for several months now.<br />
<br />
This made us think about the standards and specifications themselves. We've always, half jokingly, said that we would have to preserve the standard documents in our repositories as well if we wanted to do our jobs right. We also thought that this would never actually be necessary. Boy, were we wrong.<br />
<br />
We're now gathering all the standard and specification documents for the file formats that we are using and that we are planning to use. These documents will then be ingested into the repository using separate workflows to keep our documents apart from the actual repository content. That way, we hope to have all documents at hand even if they vanished from the web.<br />
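<br />
Before such documents go into the repository, it helps to record a checksum for every local copy, so that later copies can be compared against the ones we saved. The following is only a sketch of that idea and not our actual ingest workflow; the function names and the flat directory of specification files are assumptions made for this example.<br />

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(spec_dir: Path) -> dict[str, str]:
    """Map each file below spec_dir (relative path) to its SHA-256 digest."""
    return {
        p.relative_to(spec_dir).as_posix(): sha256_of(p)
        for p in sorted(spec_dir.rglob("*"))
        if p.is_file()
    }
```

Calling <i>build_manifest(Path("specs"))</i> on a folder that holds, for example, a local TIFF6.pdf returns a dictionary that can be written next to the documents as a simple fixity record.<br />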
<br />
From our new perspective, we urge all digital repositories to take care of not only their digital assets, but also of the standard documents they are using.<br />
<br />
The TIFF user community recently had to take a major hit when the domain owners of <a href="http://www.remotesensing.org/libtiff/">http://www.remotesensing.org/libtiff/</a> lost control of their domain, making LibTIFF and the infrastructure around it unavailable for several weeks. Even though LibTIFF is now available again at its new home (<a href="http://libtiff.maptools.org/">http://libtiff.maptools.org</a>), we need to be aware that even widely available material might become unavailable from one day to the next.<br />
<br />
<br />
<a name='more'></a>(German version)<br />
<br />
<h2>
Sag mir, wo die Standards sind. Ein Mitsinglied für Archivare.</h2>
Vor einiger Zeit haben wir festgestellt, dass die Spezifikationsdokumente für TIFF6 von Adobes Website verschwunden sind, wo sie bisher gehostet wurden. Wie Sie vielleicht wissen, besitzt Adobe TIFF6 auf Grund von juristischen Umständen, die durch den Kauf von Aldus im Jahre 1994 entstanden sind.<br />
<br />
Bisher haben wir uns immer darauf verlassen, dass TIFF durch ein Dokument spezifiziert wird, das jederzeit verfügbar war. Da aber Adobe nun das Dokument von seiner Website entfernt hat, haben wir nur noch die lokalen Kopien auf unseren PCs, und auch die haben wir nur durch puren Zufall. Der Link auf <a href="http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf">http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf</a> ist nun schon seit mehreren Monaten offline.<br />
<br />
Das zwang uns, uns mehr Gedanken über die Standarddokumente selbst zu machen. Bisher hatten wir immer halb im Scherz gesagt, dass wir doch eigentlich auch die Standarddokumente selbst in unserem Archiv aufbewahren müssten, wenn wir unseren Job richtig machen wollten. Wir dachten auch, dass das niemals tatsächlich nötig werden würde. Da lagen wir wohl ziemlich falsch.<br />
<br />
Wir sind jetzt dabei, alle Standards, die wir zur Zeit oder in Zukunft verwenden wollen, einzusammeln. Diese Dokumente werden dann über einen separaten Workflow in unser Repository eingeliefert, damit sie von unseren tatsächlichen Archivinhalten getrennt aufbewahrt werden können. Wir hoffen, dadurch alle Dokumente auch dann zur Hand zu haben, wenn sie aus dem Internet verschwunden sein werden.<br />
<br />
Aus unserer neu gewonnenen Sicht können wir nur allen Archivbetreibern anraten, nicht nur ihre Nutzdaten, sondern auch ihre Standarddokumente aufzubewahren.<br />
<br />
Die TIFF-Community musste erst vor Kurzem einen größeren Schlag verkraften. Die Domaininhaber von <a href="http://www.remotesensing.org/libtiff/">http://www.remotesensing.org/libtiff/</a> hatten die Kontrolle über die Domain verloren, so dass die LibTIFF und die zugehörige Infrastruktur für mehrere Wochen nicht verfügbar war. Auch wenn die LibTIFF inzwischen unter ihrer neuen Adresse <a href="http://libtiff.maptools.org/">http://libtiff.maptools.org</a> wieder erreichbar ist, müssen wir uns bewusst sein, dass auch weit verbreitete Materialien von einem Tag auf den anderen nicht mehr verfügbar sein können.<br />
<br />
Jörg Sachse