Tuesday, 7 March 2023

Some thoughts about a minimalistic Archival Information System, part 2

 In the last post I explained some basic terms. Now it's time for the real thing. 

Choosing the right format

The first question is: what should the information packages (SIP, AIP, DIP) look like? It is important that they are easy to process, easy to understand and easy to extend. Fortunately, RFC 8493 has a solution ready for us: BagIt.


 A bag consists of two areas: (1) the tag files at the top level of the bag, where we store the metadata, and (2) the data/ directory, which has space for our payload. BagIt is simple: it defines a directory structure and a few files that take on specific functions. What makes it interesting for us is that the digital objects we want to store go into the BagIt payload, and we can carry this area over completely unchanged when processing a SIP and creating the AIP. The same is possible later when creating the DIPs from the AIPs. BagIt gives a lot of freedom. To limit ourselves, we choose UTF-8 for all metadata and text files, and we don't use bags with a fetch.txt file. Since BagIt is now standardized, we use version 1.0.
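To make the structure tangible, here is a minimal sketch in Python that writes a tiny bag and checks it again. It only covers the bag declaration, a flat payload and a SHA-256 payload manifest; the function names create_bag and verify_bag are made up for this illustration and are not part of any BagIt library.

```python
import hashlib
from pathlib import Path


def create_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal bag: declaration, payload files and payload manifest.

    `payload` maps plain file names (no subdirectories) to their bytes.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # Bag declaration, see RFC 8493 section 2.1.1
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> bool:
    """Recompute every checksum listed in the payload manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, path = line.split(maxsplit=1)
        if hashlib.sha256((bag_dir / path).read_bytes()).hexdigest() != digest:
            return False
    return True
```

Because the payload lives entirely under data/, a SIP-to-AIP step can copy that directory as it is and only has to rewrite the tag files.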

 

Metadata and AIP update considerations

 
Many archival information systems are insufficiently prepared for metadata and AIP updates. In my experience, it is important to think about how and which data is updated and what the consequences are. To enable the producer to submit supplements, these must be clearly assignable to an existing AIP. One option is to give the producer an ID in return for the first submission. This is not a good choice, because it couples the processes tightly and exposes internals to the outside world. In addition, if a producer wants to switch to a different AIS, there can be collisions. A better choice is to have the producers choose a unique ID for their data themselves and transmit it in their SIPs. Internally, we then use this ID to look up the matching AIPs. The ID is called "ExternalID" and is the basis for our internal MAIS-AIP-ID. More on that later.

In the last post I already mentioned that we have to think about versioning AIPs, not only because of metadata or AIP updates, but also because of Preservation Planning and Action (PP&A), i.e. format migration. A simple idea is to introduce linked lists.

This allows us to easily implement the functionality of rolling back an AIP version as well.

A new AIP points to its predecessor through a reference entry in its "bag-info.txt":

 

  • 'MAIS-previous-AIP' - contains the AIP-ID (MAIS-AIP-ID) of the predecessor AIP in the current AIS
  • 'MAIS-migrated-AIP' - contains the AIP-ID in the previous AIS, if the AIP was migrated from there
  • 'MAIS-origin-AIS' - contains the identifier of the previous AIS from which the AIP was migrated

 The last two keys are optional and only required when an AIP-to-AIP transfer moves digital objects from one archival information system to another.
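How such a linked list can be followed is sketched below in Python. This is an illustration under assumptions: the AIP-IDs, the in-memory dictionary that stands in for the archive storage, and the function names are all hypothetical, and the bag-info.txt parser ignores continuation lines.

```python
def parse_bag_info(text: str) -> dict:
    """Parse simple 'Key: value' lines of a bag-info.txt (no continuation lines)."""
    tags = {}
    for line in text.splitlines():
        if line and not line[0].isspace() and ":" in line:
            key, value = line.split(":", 1)
            tags[key.strip()] = value.strip()
    return tags


def version_chain(bag_infos: dict, aip_id: str) -> list:
    """Follow MAIS-previous-AIP from the given AIP back to the first version.

    `bag_infos` maps an AIP-ID to the text of that AIP's bag-info.txt.
    """
    chain = []
    current = aip_id
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle in version chain at {current}")
        chain.append(current)
        current = parse_bag_info(bag_infos[current]).get("MAIS-previous-AIP")
    return chain


# Hypothetical archive content: three versions of one object
store = {
    "aip-1": "External-Identifier: obj-42\n",
    "aip-2": "External-Identifier: obj-42\nMAIS-previous-AIP: aip-1\n",
    "aip-3": "External-Identifier: obj-42\nMAIS-previous-AIP: aip-2\n",
}
history = version_chain(store, "aip-3")
```

The list runs from the newest version to the oldest, so a rollback only needs to pick an entry further down the list.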

Thursday, 2 March 2023

Some thoughts about a minimalistic Archival Information System, part 1

Many of those who are dealing with the digital preservation of objects for the first time and who work in small memory organizations are often helpless in the face of the vast range of functions and requirements of current archival information systems.
Students of library or archival science often appear to be similarly overwhelmed when they are supposed to learn what constitutes archival software.

This has motivated me to write down thoughts on a minimalist archive information system. Because it really doesn't need much.

The basic terms

An archive essentially has three roles: the submitter, called the producer, the user, also called the consumer, and the problem solver who maintains the archive, also called the technical analyst.

 

When digital objects are transferred to the archive, it is called the ingest process. When they are requested from the archive, then this is the access process. 

The digital objects to be preserved are provided with all the necessary information for the archive ingest and are packaged in a predefined structure. This is called a Submission Information Package (SIP). You can actually imagine this just like in real life. For example, if you want to store a vase, you put it in a box, label it and put it on a shelf.

In the archive it is checked whether (allegorically) the vase is in the box and intact, and whether there is a stamp and signature saying that the content of the package is indeed a vase. A file number and a storage location are assigned, and the box goes on the shelf, sealed and neatly labeled. The "box" is called Archival Information Package (AIP). With the seal, the archive takes responsibility.

At some point, when the user would like to see the vase from the archive again, the archive would process the request and send the vase and accompanying information to the user. This is then called a Dissemination Information Package (DIP).

In addition to this simple "I store something safely and retrieve it again at some point" approach, an archive fulfills another task that is not so obvious: it ensures that objects entrusted to it are kept usable. 

What does that mean in the digital world?
Even if it is possible in principle to store a digital object securely and bit-accurately over a very long period of time (bitstream archival), the object can still age, because the environment needed to use it is no longer available.

There are essentially three concepts for keeping digital objects usable (content preservation): hardware museum, emulation or format migration.

Hardware museums (e.g. a slot machine museum) try to keep old equipment running in a controlled environment. To do that, they have to build up a stock of devices in good time and acquire the knowledge needed to maintain and repair them.

With emulation, I try to recreate the environment for the digital object so that it feels at home and doesn't notice any difference from the previous, real world. A very good example of an emulator is MAME, but there are various others that emulate retro computers such as the Amiga or C64, so that old programs from their time can run on them. Here, too, I need knowledge about what the environment to be emulated looks like and how I can recreate it with today's means.

With format migration, I try to find, in good time, a new format that retains the essential properties (significant properties), and to transfer the files of a digital object to that newer data format.

From this point onwards, it is assumed that format migration is the preferred way of maintaining usability.

It follows that an Archival Information System (AIS) must be able to support this process of format migration. The process (also called Preservation Planning and Action) results in a new version of the Archival Information Package being created. The AIS must be able to manage this.

That would basically be all there is to Archival Information Systems if it hadn't been for the librarians.

Unlike archivists, for whom a record is complete and closed, librarians know the concepts of supplements and metadata updates. A page that had fallen out turns up here, a letter is discovered in an estate there, or it dawns on someone that there is now money for costly in-depth indexing. Ergo, librarians expect people to think about how to handle metadata and data updates on existing AIPs (called metadata update and AIP update). This is not trivial, since some AIPs are very large and you want to avoid pointless copying. For such an update, we also need a good way for producers to tell the archive which AIP is to be supplemented or updated.

However, AIPs are already versioned in the case of format migration, so the same mechanism can be used here as well. Any change to an AIP creates a new version of it. And so that you can't accidentally break anything, you should always be able to go back to an old version. Since rolling back is itself error-prone, the result of the rollback process is simply another new version.
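The idea that a rollback never deletes anything but only appends can be sketched as a small append-only history. This is a toy model in Python, not the MAIS implementation; the class name and the representation of an AIP as a mapping from file path to checksum are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AipHistory:
    """Append-only list of AIP versions; each version maps file path -> checksum."""

    versions: list = field(default_factory=list)

    def add_version(self, files: dict) -> int:
        """Store a copy of `files` as the newest version and return its 1-based number."""
        self.versions.append(dict(files))
        return len(self.versions)

    def rollback_to(self, version: int) -> int:
        """Roll back by appending a copy of an old version as a new version."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        return self.add_version(self.versions[version - 1])


history = AipHistory()
history.add_version({"data/scan.tif": "aaa"})   # version 1: ingest
history.add_version({"data/scan.jp2": "bbb"})   # version 2: format migration
newest = history.rollback_to(1)                 # version 3: same content as version 1
```

Nothing is ever overwritten, so a mistaken rollback can itself be undone by rolling back to version 2.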

 That's it. It's nothing more. Easy, isn't it?

 

Thursday, 12 May 2022

Draft for a differential BagIt

The problem

BagIt (RFC 8493) forms the basis for Submission Information Packages (SIP) and Archival Information Packages (AIP) in many digital archives.
Especially in the library environment, it is necessary to support supplemental submissions in the Archival Information System (AIS) software. Supplements may be limited to metadata or may add new files, remove existing files, or replace existing files.
Unfortunately, there is no way to implement a differential SIP cleanly and easily in the BagIt specification.

The constraints

A design of a differential BagIt (dBagIt) should meet the following conditions:

1. existing BagIt should not be touched
2. it should be based on the BagIt structure so that the conversion effort is minimal
3. it should be easy to implement
4. it should support the "add" and "delete" operations
5. the checksum protection should be guaranteed
6. the referenced bag should be specified explicitly

The proposal of dBagIt

The basis is the structure of BagIt. The following are the changes that are mandatory.

Bag Declaration: dbagit.txt

In contrast to Section 2.1.1 of RFC 8493, the filename is dbagit.txt.

Payload Manifest

In contrast to 2.1.3 of RFC8493 each line of a payload manifest file MUST be of the form

   sign checksum filepath

where sign is either + for adding a file or - for deleting a file. 

The replacement of files is simulated by one entry each for deleting and adding.
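A parser for such a manifest could look like the following Python sketch. It assumes that the three fields are separated by whitespace, as in a regular BagIt manifest; the names ManifestEntry and parse_dbagit_manifest are made up for this example.

```python
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    sign: str       # "+" for add, "-" for delete
    checksum: str
    filepath: str


def parse_dbagit_manifest(text: str) -> list:
    """Parse lines of the form 'sign checksum filepath'."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate empty lines in this sketch
        parts = line.split(maxsplit=2)
        if len(parts) != 3 or parts[0] not in ("+", "-"):
            raise ValueError(f"line {number}: expected 'sign checksum filepath'")
        entries.append(ManifestEntry(*parts))
    return entries


# A replacement of data/a.txt is expressed as one delete plus one add
manifest = "- 1111 data/a.txt\n+ 2222 data/a.txt\n+ 3333 data/new file.txt\n"
entries = parse_dbagit_manifest(manifest)
```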

Bag Metadata: bag-info.txt

In addition to RFC 8493, the key Updates-External-Identifier becomes mandatory. It references the original data object that this dBagIt updates.

Optional Tag Manifest

The Tag Manifest is similar to RFC8493. 

Although tag manifest files in BagIt could be used to describe additional proprietary subdirectories of a bag that are not specified in the RFC, dBagIt does not define add and delete semantics for them as it does for the payload manifest in the previous section. This simplifies the creation and processing of dBagIts.

Implementation of the behavior

The implementation must ensure that:

  1. the target object referenced by the key Updates-External-Identifier exists
  2. the dBagIt is valid
  3. the add/delete operations are atomic and can be rolled back
  4. the checksums of files to be added are correct and the files are part of the payload of the current dBagIt
  5. the checksums of files to be deleted match the checksums of the corresponding files in the referenced digital object
  6. the files in tag manifests are handled correctly if proprietary extensions are used
  7. the metadata in bag-info.txt completely replaces the previous version in the referenced object
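Points 3 to 5 can be illustrated with a small Python sketch that works on checksums only. It is a simplified model, not a reference implementation: the function name apply_dbagit is hypothetical, files are represented as a mapping from path to checksum, and rejecting an add for a path that still exists is my own assumption, since the draft does not say what should happen in that case.

```python
def apply_dbagit(current: dict, entries: list, delivered: dict) -> dict:
    """Apply add/delete entries and return the new path -> checksum mapping.

    current   -- files of the referenced object (path -> checksum)
    entries   -- (sign, checksum, path) tuples from the dBagIt payload manifest
    delivered -- checksums computed over the dBagIt payload (path -> checksum)
    Raises ValueError on any inconsistency; `current` is never modified.
    """
    result = dict(current)  # work on a copy, so a failure changes nothing
    for sign, _checksum, path in entries:
        if sign not in ("+", "-"):
            raise ValueError(f"unknown sign {sign!r} for {path}")
    # Deletions first, so a replacement (delete + add) works in any line order
    for sign, checksum, path in entries:
        if sign == "-":
            if result.get(path) != checksum:
                raise ValueError(f"cannot delete {path}: missing or checksum mismatch")
            del result[path]
    for sign, checksum, path in entries:
        if sign == "+":
            if delivered.get(path) != checksum:
                raise ValueError(f"cannot add {path}: not delivered or checksum mismatch")
            if path in result:  # assumption: an add must not silently overwrite
                raise ValueError(f"cannot add {path}: file exists, delete it first")
            result[path] = checksum
    return result


old = {"data/a.txt": "1111", "data/b.txt": "2222"}
ops = [("-", "1111", "data/a.txt"), ("+", "3333", "data/a.txt"), ("-", "2222", "data/b.txt")]
new = apply_dbagit(old, ops, delivered={"data/a.txt": "3333"})
```

Because all checks run against a copy, the caller can simply discard the result when an exception is raised, which gives the all-or-nothing behaviour demanded above.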

Future

If there is interest, I would be happy to receive feedback via art1pirat ATgmail.com. Maybe a new RFC can grow out of it.

Alternate consideration

A very simple solution could also be the use of unified 'diff'. This also allows partial changes in files, but would hardly bring any advantages with binary data and is not quite as intuitive for users who are not familiar with IT.

FAQ (Update 2022-05-18)

  1. What if "delete" references a non-existing file? All operations of a differential BagIt should be applied atomically and consistently. In this case the operations are rolled back and aborted with an error. This ensures that no unintended updates are applied.
  2. Wouldn't it be nice to allow a simple rename instead of a replace, to avoid transferring files? This would be worth considering. However, a secure rename requires the checksum, the old filename, and the new filename. That makes it complicated again. Since the case would probably not be too frequent, this could be specified later if needed.
  3. If several files have the same checksum, how is it ensured that the wrong file is not deleted or replaced? Since "delete" must specify both the checksum and the path of the existing file, a mix-up is impossible.
  4. Is it correct that when I pass metadata in bag-info.txt, it overwrites the metadata in the referenced object? If yes, why? Yes, that is so. It simplifies the design to focus only on the payload. By the way, the purpose of differential BagIt is to avoid the cost of transferring all files again for supplement deliveries. And most of the costs are usually incurred in the transfer of the payload.

Friday, 3 December 2021

Detectorist - Part two "A crumb of knowledge"

A crumb of knowledge


In the first part I described how I found out how to read the floppy disks (using KryoFlux). Now I would like to give an interim report on the floppy disk format of the Panasonic typewriter, in the quiet hope that someone can uncover the last secret.


I found the most important clue while researching a successor model, the Panasonic KX-W1000. I stumbled across the following old blog post: https://surrey.lug.org.uk/panasonic-kx-w1000.

My findings

Even if it didn't lead to a full success, there were some interesting insights. The floppy image is strongly related to FAT12.

Here is my summary.

The filesystem is based on FAT12 with proprietary extensions. 

Header / MBR

The first bytes are: 0x00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 20, which corresponds to the string "KX-W1510 1.00" starting at the fourth byte (offset 3).

The first 256 bytes are very similar to the boot sector of old DOS floppies:

0000:0000 | 00 00 00 4B  58 2D 57 31  35 31 30 20  31 2E 30 30 | ...KX-W1510 1.00
0000:0010 | 20 F9 00 00  00 00 00 00  00 00 00 00  00 00 00 00 |  ù..............
0000:0020 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0030 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0040 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0050 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0060 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0070 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0080 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0090 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00A0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00B0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00C0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00D0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00E0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00F0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
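To recognise such an image automatically, the label can be read from the header. A small Python sketch, assuming that the label always sits at offset 3 and is at most 14 bytes long, as in the dump above; the function name is made up for this example.

```python
def read_header_label(image: bytes) -> str:
    """Return the ASCII label stored after the three leading zero bytes."""
    return image[3:17].decode("ascii", errors="replace").rstrip()


# First 18 bytes of the dump above
header = bytes.fromhex("00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 20 F9")
label = read_header_label(header)
```

For the sample image this yields "KX-W1510 1.00"; images of other models would presumably carry a different label.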

FATs

There are two identical blocks, which probably represent FATs. The first is at address 0x200:

0000:0200 | F9 FF FF 03  40 00 05 B0  00 07 80 00  09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0210 | FF FF 0D E0  00 0F 00 01  FF 8F 01 13  40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0220 | 01 17 F0 FF  19 90 02 1B  10 02 1D E0  01 1F 00 02 | ..ðÿ.......à....
0000:0230 | FF 2F 02 23  F0 FF 25 60  02 2D 80 02  2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0240 | F0 FF FF EF  02 35 00 03  31 20 03 33  F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0250 | FF 37 80 03  39 F0 FF 3B  C0 03 3D E0  03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0260 | 41 20 04 43  F0 FF 45 60  04 47 80 04  49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0270 | C0 04 4D E0  04 4F F0 FF  51 20 05 53  40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0280 | FF 57 80 05  59 A0 05 5B  F0 FF 5D E0  05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0290 | 7F 20 06 63  40 06 65 F0  FF 6E 80 06  6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:02A0 | CF 06 6D F0  FF 7E 00 07  71 20 07 73  40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:02B0 | 07 77 80 07  79 A0 07 FF  CF 07 7D 60  08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:02C0 | 81 20 08 FF  4F 08 85 80  08 87 F0 FF  FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:02D0 | F0 08 8D E0  08 90 20 09  91 40 09 93  F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:02E0 | 09 A2 F0 09  99 A0 09 9B  C0 09 9D F0  FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:02F0 | B2 60 0A A3  40 0A A5 F0  FF AC 80 0A  A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0300 | F0 FF AD E0  0A AF 00 0B  B1 B0 0B BA  40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0310 | 0C B7 80 0B  B9 60 0C C3  C0 0B BD E0  0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0320 | C1 20 0C CA  40 0C C5 80  0C C7 90 0C  E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0330 | F0 FF CD E0  0C CF B0 0D  D1 50 0D D3  40 0D DE F0 | ðÿÍà.ϰ.ÑP.Ó@.Þð
0000:0340 | FF D7 E0 0E  D9 70 0E E2  C0 0D DD F0  FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0350 | E1 50 0E E3  40 0E E6 90  0E FB C0 0E  FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0360 | F0 FF ED 60  0F EF 00 0F  F1 20 0F F3  40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0370 | FF F7 80 0F  F9 A0 0F FF  CF 0F FD E0  0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....

The second is at 0x800:

0000:0800 | F9 FF FF 03  40 00 05 B0  00 07 80 00  09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0810 | FF FF 0D E0  00 0F 00 01  FF 8F 01 13  40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0820 | 01 17 F0 FF  19 90 02 1B  10 02 1D E0  01 1F 00 02 | ..ðÿ.......à....
0000:0830 | FF 2F 02 23  F0 FF 25 60  02 2D 80 02  2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0840 | F0 FF FF EF  02 35 00 03  31 20 03 33  F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0850 | FF 37 80 03  39 F0 FF 3B  C0 03 3D E0  03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0860 | 41 20 04 43  F0 FF 45 60  04 47 80 04  49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0870 | C0 04 4D E0  04 4F F0 FF  51 20 05 53  40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0880 | FF 57 80 05  59 A0 05 5B  F0 FF 5D E0  05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0890 | 7F 20 06 63  40 06 65 F0  FF 6E 80 06  6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:08A0 | CF 06 6D F0  FF 7E 00 07  71 20 07 73  40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:08B0 | 07 77 80 07  79 A0 07 FF  CF 07 7D 60  08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:08C0 | 81 20 08 FF  4F 08 85 80  08 87 F0 FF  FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:08D0 | F0 08 8D E0  08 90 20 09  91 40 09 93  F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:08E0 | 09 A2 F0 09  99 A0 09 9B  C0 09 9D F0  FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:08F0 | B2 60 0A A3  40 0A A5 F0  FF AC 80 0A  A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0900 | F0 FF AD E0  0A AF 00 0B  B1 B0 0B BA  40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0910 | 0C B7 80 0B  B9 60 0C C3  C0 0B BD E0  0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0920 | C1 20 0C CA  40 0C C5 80  0C C7 90 0C  E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0930 | F0 FF CD E0  0C CF B0 0D  D1 50 0D D3  40 0D DE F0 | ðÿÍà.ϰ.ÑP.Ó@.Þð
0000:0940 | FF D7 E0 0E  D9 70 0E E2  C0 0D DD F0  FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0950 | E1 50 0E E3  40 0E E6 90  0E FB C0 0E  FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0960 | F0 FF ED 60  0F EF 00 0F  F1 20 0F F3  40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0970 | FF F7 80 0F  F9 A0 0F FF  CF 0F FD E0  0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....
0000:0980 | 00 00 00 00  00 00 00 00  00 00 00 00  FF 0F 00 00 | ............ÿ... 
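For comparison, this is how a standard FAT12 table would be unpacked: every three bytes hold two 12-bit entries, and each entry names the next cluster of a file. The Python sketch below applies exactly that standard packing to the first bytes of the dump. Whether the typewriter really uses this layout is an assumption, and the function names are made up for this example.

```python
def decode_fat12(fat: bytes) -> list:
    """Unpack 12-bit FAT entries using the standard FAT12 packing."""
    entries = []
    for i in range(0, len(fat) - 2, 3):
        b0, b1, b2 = fat[i], fat[i + 1], fat[i + 2]
        entries.append(b0 | ((b1 & 0x0F) << 8))   # even entry: low 12 bits
        entries.append((b1 >> 4) | (b2 << 4))     # odd entry: high 12 bits
    return entries


def cluster_chain(entries: list, start: int) -> list:
    """Follow the chain from `start` until an end marker or an invalid entry."""
    chain = []
    current = start
    while 2 <= current < len(entries) and current < 0xFF8 and current not in chain:
        chain.append(current)
        current = entries[current]
    return chain


# First 18 bytes of the block at 0x200
fat = bytes.fromhex("F9 FF FF 03 40 00 05 B0 00 07 80 00 09 A0 00 FF FF FF")
entries = decode_fat12(fat)
first = cluster_chain(entries, 2)
second = cluster_chain(entries, 6)
```

Under this assumption the first bytes decode to the chains 2, 3, 4, 5, 11 and 6, 7, 8, 9, 10, both ending in 0xFFF. Whether these really are cluster chains, and how clusters map to offsets in the image, is not settled by this.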

Directory

The main directory always starts from address 0xe00:

0000:0E00 | 20 20 20 20  20 20 44 49  5B 54 20 FF  00 00 00 00 |       DI[T ÿ....
0000:0E10 | 00 00 00 00  00 00 06 00  21 00 02 00  F5 13 00 00 | ........!...õ...
0000:0E20 | 20 20 20 20  20 20 41 46  46 45 20 FF  00 00 00 00 |       AFFE ÿ....
0000:0E30 | 00 00 00 00  00 00 06 00  21 00 06 00  F6 13 00 00 | ........!...ö...
0000:0E40 | 20 20 20 54  52 5D 46 46  45 4C 20 FF  00 00 00 00 |    TR]FFEL ÿ....
0000:0E50 | 00 00 00 00  00 00 06 00  21 00 0C 00  C4 13 00 00 | ........!...Ä...
0000:0E60 | 20 20 45 52  42 50 52 49  4E 5A 20 FF  00 00 00 00 |   ERBPRINZ ÿ....
0000:0E70 | 00 00 00 00  00 00 06 00  21 00 11 00  17 14 00 00 | ........!.......
0000:0E80 | 20 20 20 20  42 49 53 54  52 4F 20 FF  00 00 00 00 |     BISTRO ÿ....
0000:0E90 | 00 00 00 00  00 00 06 00  21 00 12 00  61 14 00 00 | ........!...a...
0000:0EA0 | 20 20 20 48  55 48 4E 20  49 49 20 FF  00 00 00 00 |    HUHN II ÿ....
0000:0EB0 | 00 00 00 00  00 00 06 00  21 00 1C 00  CC 13 00 00 | ........!...Ì...
0000:0EC0 | 20 20 20 20  57 41 43 48  41 55 20 FF  00 00 00 00 |     WACHAU ÿ....
0000:0ED0 | 00 00 00 00  00 00 06 00  21 00 1A 00  C0 13 00 00 | ........!...À...
0000:0EE0 | 20 20 20 20  20 4B 41 4B  41 4F 20 FF  00 00 00 00 |      KAKAO ÿ....
0000:0EF0 | 00 00 00 00  00 00 06 00  21 00 24 00  1D 14 00 00 | ........!.$.....
0000:0F00 | 20 20 20 20  20 20 4D 5D  4C 4C 20 FF  00 00 00 00 |       M]LL ÿ....
0000:0F10 | 00 00 00 00  00 00 06 00  21 00 27 00  22 0B 00 00 | ........!.'."...
0000:0F20 | 20 46 52 41  55 20 4D 4F  44 45 20 FF  00 00 00 00 |  FRAU MODE ÿ....
0000:0F30 | 00 00 00 00  00 00 06 00  21 00 2F 00  AC 13 00 00 | ........!./.¬...
0000:0F40 | 20 20 20 20  53 55 50 50  45 4E 20 FF  00 00 00 00 |     SUPPEN ÿ....
0000:0F50 | 00 00 00 00  00 00 06 00  21 00 34 00  C7 13 00 00 | ........!.4.Ç...
0000:0F60 | 55 4E 53 45  52 20 42 52  4F 54 20 FF  00 00 00 00 | UNSER BROT ÿ....
0000:0F70 | 00 00 00 00  00 00 06 00  21 00 3A 00  B8 13 00 00 | ........!.:.¸...
0000:0F80 | 20 20 20 20  20 20 31 39  39 34 20 FF  00 00 00 00 |       1994 ÿ....
0000:0F90 | 00 00 00 00  00 00 06 00  21 00 3F 00  AA 13 00 00 | ........!.?.ª...
0000:0FA0 | 20 20 20 20  20 4B 5D 43  48 45 20 FF  00 00 00 00 |      K]CHE ÿ....
0000:0FB0 | 00 00 00 00  00 00 06 00  21 00 44 00  3D 14 00 00 | ........!.D.=...
0000:0FC0 | 20 55 43 4B  45 52 4D 41  52 4B 20 FF  00 00 00 00 |  UCKERMARK ÿ....
0000:0FD0 | 00 00 00 00  00 00 06 00  21 00 4A 00  2B 14 00 00 | ........!.J.+...
0000:0FE0 | 20 20 52 49  45 53 4C 49  4E 47 20 FF  00 00 00 00 |   RIESLING ÿ....
0000:0FF0 | 00 00 00 00  00 00 06 00  21 00 50 00  31 14 00 00 | ........!.P.1...
0000:1000 | 43 48 49 4E  41 54 52 5D  46 46 20 FF  00 00 00 00 | CHINATR]FF ÿ....
0000:1010 | 00 00 00 00  00 00 06 00  21 00 56 00  25 14 00 00 | ........!.V.%...
0000:1020 | 20 4B 5B 53  45 52 45 53  54 45 20 FF  00 00 00 00 |  K[SERESTE ÿ....
0000:1030 | 00 00 00 00  00 00 06 00  21 00 5C 00  E3 12 00 00 | ........!.\.ã...
0000:1040 | 4B 41 54 5A  45 4E 46 55  54 54 20 FF  00 00 00 00 | KATZENFUTT ÿ....
0000:1050 | 00 00 00 00  00 00 06 00  21 00 61 00  CC 12 00 00 | ........!.a.Ì...
0000:1060 | 20 20 52 4F  42 55 43 48  4F 4E 20 FF  00 00 00 00 |   ROBUCHON ÿ....
0000:1070 | 00 00 00 00  00 00 06 00  21 00 5F 00  55 14 00 00 | ........!._.U...
0000:1080 | 20 20 20 4D  41 4E 41 47  45 52 20 FF  00 00 00 00 |    MANAGER ÿ....
0000:1090 | 00 00 00 00  00 00 06 00  21 00 67 00  FC 13 00 00 | ........!.g.ü...
0000:10A0 | 20 20 4D 49  43 48 45 4C  49 4E 20 FF  00 00 00 00 |   MICHELIN ÿ....
0000:10B0 | 00 00 00 00  00 00 06 00  21 00 6F 00  8C 14 00 00 | ........!.o.....
0000:10C0 | 20 20 50 49  4D 45 4E 54  4F 53 20 FF  00 00 00 00 |   PIMENTOS ÿ....
0000:10D0 | 00 00 00 00  00 00 06 00  21 00 75 00  14 14 00 00 | ........!.u.....
0000:10E0 | 54 48 4F 4D  41 53 4D 41  4E 4E 20 FF  00 00 00 00 | THOMASMANN ÿ....
0000:10F0 | 00 00 00 00  00 00 06 00  21 00 66 00  20 14 00 00 | ........!.f. ...
0000:1100 | 20 20 38 2D  4D 41 49 2D  34 35 20 FF  00 00 00 00 |   8-MAI-45 ÿ....
0000:1110 | 00 00 00 00  00 00 06 00  21 00 60 00  2A 14 00 00 | ........!.`.*...
0000:1120 | 20 20 43 4F  51 41 55 56  49 4E 20 FF  00 00 00 00 |   COQAUVIN ÿ....
0000:1130 | 00 00 00 00  00 00 06 00  21 00 89 00  0B 14 00 00 | ........!.......
0000:1140 | 20 47 55 44  45 20 53 54  55 42 20 FF  00 00 00 00 |  GUDE STUB ÿ....
0000:1150 | 00 00 00 00  00 00 06 00  21 00 8C 00  A0 14 00 00 | ........!... ...
0000:1160 | 20 20 4D 4F  4E 54 43 41  55 44 20 FF  00 00 00 00 |   MONTCAUD ÿ....
0000:1170 | 00 00 00 00  00 00 06 00  21 00 95 00  63 15 00 00 | ........!...c...
0000:1180 | 20 53 50 41  52 47 45 4C  45 49 20 FF  00 00 00 00 |  SPARGELEI ÿ....
0000:1190 | 00 00 00 00  00 00 06 00  21 00 98 00  BD 14 00 00 | ........!...½...
0000:11A0 | 53 45 4D 49  42 45 4C 47  49 45 20 FF  00 00 00 00 | SEMIBELGIE ÿ....
0000:11B0 | 00 00 00 00  00 00 06 00  21 00 97 00  B3 25 00 00 | ........!...³%..
0000:11C0 | 20 53 45 4D  49 4E 41 52  39 35 20 FF  00 00 00 00 |  SEMINAR95 ÿ....
0000:11D0 | 00 00 00 00  00 00 06 00  21 00 9E 00  C9 4B 00 00 | ........!...ÉK..
0000:11E0 | 20 20 54 41  4E 54 41 4C  55 53 20 FF  00 00 00 00 |   TANTALUS ÿ....
0000:11F0 | 00 00 00 00  00 00 06 00  21 00 A7 00  DC 12 00 00 | ........!.§.Ü...

In contrast to FAT12, each directory entry uses 10 bytes for the file name, left-padded with spaces. Umlauts in file names are possible (see below). There is no file name suffix. This matches the information in the typewriter manual.

Sometimes there is a special directory at offset 0x100, which could hold the address lists or dictionaries:

0000:0100 | 20 20 20 20  57 41 53 53  45 52 20 FF  00 00 00 00 |     WASSER ÿ....
0000:0110 | 00 00 00 00  00 00 06 00  21 00 48 00  36 0A 00 00 | ........!.H.6...
0000:0120 | 20 20 20 20  20 4B 5D 43  48 45 20 FF  00 00 00 00 |      K]CHE ÿ....
0000:0130 | 00 00 00 00  00 00 06 00  21 00 49 00  3D 14 00 00 | ........!.I.=...
0000:0140 | 20 20 20 41  55 53 54 45  52 4E 20 FF  00 00 00 00 |    AUSTERN ÿ....
0000:0150 | 00 00 00 00  00 00 06 00  21 00 4E 00  D2 0A 00 00 | ........!.N.Ò...
0000:0160 | 20 20 20 20  54 52 5B 55  4D 45 20 FF  00 00 00 00 |     TR[UME ÿ....
0000:0170 | 00 00 00 00  00 00 06 00  21 00 50 00  59 25 00 00 | ........!.P.Y%..
0000:0180 | 20 20 52 45  43 48 4E 55  4E 47 20 FF  00 00 00 00 |   RECHNUNG ÿ....
0000:0190 | 00 00 00 00  00 00 06 00  21 00 54 00  C3 08 00 00 | ........!.T.Ã...
0000:01A0 | 20 20 20 20  20 48 45 4E  52 59 20 FF  00 00 00 00 |      HENRY ÿ....
0000:01B0 | 00 00 00 00  00 00 06 00  21 00 59 00  66 11 00 00 | ........!.Y.f...
0000:01C0 | 53 43 48 57  41 52 5A 41  44 4C 20 FF  00 00 00 00 | SCHWARZADL ÿ....
0000:01D0 | 00 00 00 00  00 00 06 00  21 00 57 00  94 0A 00 00 | ........!.W.....
0000:01E0 | 20 50 4C 41  43 48 55 54  54 41 20 FF  00 00 00 00 |  PLACHUTTA ÿ....
0000:01F0 | 00 00 00 00  00 00 06 00  21 00 5C 00  49 09 00 00 | ........!.\.I...

But sometimes there are text fragments at this offset instead (from another floppy):

0000:0100 | 64 20 73 63  68 E9 64 6C  69 63 68 21  C9 20 20 20 | d schédlich!É   
0000:0110 | 20 20 20 20  20 20 20 20  20 20 20 20  20 20 20 20 |                 
0000:0120 | 20 20 20 20  20 20 20 20  20 20 20 20  20 20 20 20 |                 
0000:0130 | 20 20 20 20  20 20 20 20  20 20 20 20  20 20 20 20 |                 
0000:0140 | 55 64 6F 20  50 6F 6C 6C  6D 65 72 2C  20 65 69 6E | Udo Pollmer, ein
0000:0150 | 20 4C 65 62  65 6E 73 6D  69 74 74 65  6C 63 68 65 |  Lebensmittelche
0000:0160 | 6D 69 6B 65  72 20 75 6E  64 20 65 72  66 6F 6C 67 | miker und erfolg
0000:0170 | 72 65 69 63  68 65 72 20  20 20 20 20  20 20 20 20 | reicher         
0000:0180 | 20 20 20 20  20 20 20 20  20 20 20 20  20 20 20 20 |                 
0000:0190 | 46 61 63 68  62 75 63 68  61 75 74 6F  72 20 68 61 | Fachbuchautor ha
0000:01A0 | 74 20 69 6E  20 65 69 6E  65 6D 20 5A  65 69 74 75 | t in einem Zeitu
0000:01B0 | 6E 67 73 69  6E 74 65 72  76 69 65 77  20 65 72 6B | ngsinterview erk
0000:01C0 | 6C E9 72 74  3A 20 20 20  20 20 20 20  20 20 20 20 | lért:           
0000:01D0 | 20 20 20 20  20 20 20 20  20 20 20 20  20 20 20 20 |                 
0000:01E0 | 22 44 69 E9  74 65 6E 20  6D 61 63 68  65 6E 20 64 | "Diéten machen d
0000:01F0 | 69 63 6B 22  2E 20 57 65  69 6C 20 64  65 72 20 4B | ick". Weil der K

Umlauts and special characters

Umlauts and special characters are mapped as follows:

ä → 0x7b
ö → 0x7c
ü → 0x7d
Ä → 0x5b
Ö → 0x5c
Ü → 0x5d
ß → 0x85
hyphen → 0xbc
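With this table, the names in the directory dumps can be made readable. A small Python sketch, applied here to the 10-byte file names only; the names KX_CHARS and decode_kx_name are made up, and bytes that are neither in the table nor printable ASCII are shown as "?".

```python
# Mapping taken from the table above
KX_CHARS = {
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü",
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x85: "ß", 0xBC: "-",
}


def decode_kx_name(raw: bytes) -> str:
    """Decode a left-padded file name from a directory entry."""
    chars = []
    for byte in raw:
        if byte in KX_CHARS:
            chars.append(KX_CHARS[byte])
        elif 0x20 <= byte < 0x7F:
            chars.append(chr(byte))
        else:
            chars.append("?")  # unknown code, kept visible
    return "".join(chars).strip()


name1 = decode_kx_name(bytes.fromhex("20 20 20 20 20 20 44 49 5B 54"))
name2 = decode_kx_name(bytes.fromhex("20 20 20 54 52 5D 46 46 45 4C"))
```

The first two affected entries of the main directory then read "DIÄT" and "TRÜFFEL".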

Open Questions

What is still completely unclear is how the FATs are constructed. They do look like FAT12 entries: the first bytes 0xF9 0xFF 0xFF 0x03... and the frequently occurring 0xFF suggest this. Yet there seems to be no connection between the addresses of the text fragments in the image and the FAT byte sequences.


In the directory entries, everything suggests that byte 26 indicates the start cluster and bytes 28-29 the file size. However, I have not yet been able to decipher the connection with the FAT and the actual offset (or cluster) of the data.
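For experiments, the suspected fields can be pulled out of a 32-byte directory entry like this. The Python sketch assumes, as in a regular FAT directory entry, a little-endian 16-bit start cluster at offset 26 and a little-endian 32-bit size at offset 28; both widths are guesses, and the function name is made up.

```python
import struct


def parse_dir_entry(entry: bytes) -> tuple:
    """Split a 32-byte directory entry into (raw name, start cluster, size)."""
    if len(entry) != 32:
        raise ValueError("a directory entry must be 32 bytes long")
    raw_name = entry[:10]
    # "<HI": little-endian, 2-byte start cluster followed by 4-byte size
    start_cluster, size = struct.unpack_from("<HI", entry, 26)
    return raw_name, start_cluster, size


# First entry of the main directory at 0xE00
entry = bytes.fromhex(
    "20 20 20 20 20 20 44 49 5B 54 20 FF 00 00 00 00"
    "00 00 00 00 00 00 06 00 21 00 02 00 F5 13 00 00"
)
raw_name, start_cluster, size = parse_dir_entry(entry)
```

For this entry the sketch returns start cluster 2 and a size of 5109 (0x13F5), which can then be compared with the length of the text actually found in the image.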

The meaning of offset 0x100 is unclear. 

If you have any ideas how to read the FATs, or how to interpret the bytes 26, 28-29 of the directory entries, or what the cluster size should be, feel free to write me.

If you are the owner of such an old typewriter, it would be helpful to have a clean-room floppy copy, i.e. a freshly formatted floppy with a small test text, so that I can reverse engineer the data format even better.

Just contact me at art1piratatgoogledotcom 

 

Supportive Links 

https://archive.org/details/MSXTechnicalDataBook/page/n269/mode/2up

https://github.com/Konamiman/MSX2-Technical-Handbook/blob/master/md/Chapter3.md#3--structure-of-disk-files

https://manualsbrain.com/ja/products/panasonic-kx-w1510/

Thanks

my thanks goes to 

Sunday, 28 November 2021

Detectorist - Part one "First indications"

First indications

In an estate there were several CD-ROMs, DVDs and, above all, floppy disks. We were able to read most of them with Linux, including the floppy disks. Only the last eight floppy disks gave us a hard time.

On one of the floppy disks a small inscription peeked out, referring to a Panasonic electronic typewriter. There was none in the estate, no other information was available. 

Eight floppy disks, 3.5", double density, not readable. 

Time passed, constantly haunted by the voice in my mind: "There's something on the disks, only what?" 

We managed to purchase a KryoFlux controller (see https://kryoflux.com/; there are also free open-source alternatives). This is a special disk controller that allows you to record the magnetic flux as the read heads move over the disk.

 

After the first attempts, I was able to create an image file with the following command:

./dtc -fIMAGEFILE -dd1 -g2 -i4

The options mean:

  • "-dd1" - double density
  • "-g2" - double sided
  • "-i4" - MFM sector image 40/80+ tracks

 

A look at the image in a hex editor confirmed my hunch. After the first three zero bytes, the string "KX-W1510 1.00" followed (and, to my happy surprise, a lot of readable text fragments). 

Yep, there is exactly one electronic typewriter series from Panasonic.

Disillusionment

I was able to find a manual at https://manualsbrain.com/en/manuals/1814281/. And yes, the machine used 3.5" floppy disks, double sided, double density with a capacity of 713,000 characters, but unfortunately without an exact description of the disk format and the file system.

I then contacted Panasonic support - no success. I started researching patent databases in Japan, the USA and Germany - nothing. I wrote to the Panasonic museum in Japan, but unfortunately they could not help me.


A proprietary disk format, which was forgotten after 30 years.


In the next part I will report what I was able to find out about the disk format of the Panasonic KX-W1510 typewriter, and where I am (still) failing...


Thursday, 1 April 2021

Backup is digital long-term preservation!

Exponential growth

https://www.statista.com/chart/17727/global-data-creation-forecasts/

An important observation is that the number of files produced each year continues to increase worldwide (see https://en.wikipedia.org/wiki/Information_explosion). The number of digital objects for which we must decide "keep or throw away?" increases to the same degree. 

The truth is, the discard scenario becomes the more likely one with each passing year.

Magnificent diversity

 Another observation is that about 90 new file formats are added every year.
And the file formats that are being dropped are already in place. 

 

The truth is, no one can build up format knowledge for all of this anymore.

 

A fuzzy concept

When talking to colleagues, the topic of validation does not play a role. For one thing, no one is clear about what "valid" means. Valid against a specification? Valid against a profile? Valid because it can be opened by programs? On the other hand, nothing happens after that. If a file is broken, it is still archived. If it is not broken, fine. 

The truth is, validation is useless.

 

Success factors

Do you know how the success of digital preservation is measured? I'll tell you, in terabytes per year. If the numbers go up, that's a good thing to sell to politicians. Whether it was difficult to prepare digital objects for long-term availability doesn't matter. Whether born-digitals are more at risk, never mind. 

Is that the truth?

Overrated

It used to be said that long-term digital archiving could only be handled by organizations with a minimum of resources. Look around and you'll find dozens of one-man orchestras and part-time archives. And do you think that as the amount of data increases, so do the human resources? Oh, come on! 

 You know the truth!

That's too exhausting

If you've ever heard of format migration as a principle of long-term preservation, you've read in textbooks phrases like 

To ensure format migration, the significant properties of groups of objects that must be preserved must be determined. 

Have you ever seen an archive that has actually determined and documented significant properties?

The truth is, significant properties are determined after the fact from technical metadata.

Summary

So what is digital long-term preservation? Only an expensive backup.

Wednesday, January 27, 2021

Impossible - or how I learned to read data storage media at the speed of light and what it's good for


When I receive data carriers from an estate, I want to get a quick overview of what is on the floppy disk, the CD-ROM, the USB stick or the hard disk drive so that I can look at the interesting things first.

But I only know what is there when I read the media, right? A typical chicken and egg problem. 

https://openclipart.org/detail/212857/sci-fi-scanner-device
I discovered the crucial clue to the solution in a 2014 talk by Simson Garfinkel, "Digital Forensics Innovation: Searching A Terabyte of Data in 10 minutes" (http://simson.net/ref/2014/2014-02-21_RPI_Forensics_Innovation.pdf).

What is Random Sampling?

Random sampling is nothing more than looking at only a small, randomly chosen part of a total set and inferring the big picture from it.

To find out what is on a medium, it would be sufficient to look at random blocks and determine for them, based on their byte structure, whether they fall into the categories "empty", "random", "text", "video" or "undef".
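To make the idea concrete, here is a minimal Python sketch of the sampling step. It is not the code of the Perl module; the function name sample_offsets, the 512-byte sector size and the treatment of the sample size as a plain fraction are my assumptions for illustration.

```python
import random

SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def sample_offsets(image_size, fraction, seed=None):
    """Return sorted byte offsets of randomly chosen sectors.

    image_size is the size of the medium in bytes, fraction is the
    share of all sectors to look at (hypothetical parameter).
    """
    total_sectors = image_size // SECTOR_SIZE
    # Always look at one sector at least, even for tiny fractions.
    count = max(1, int(total_sectors * fraction))
    rng = random.Random(seed)
    # Sampling from a range object works without building a huge list.
    sectors = rng.sample(range(total_sectors), count)
    # Sorted offsets keep the read head moving in one direction.
    return sorted(sector * SECTOR_SIZE for sector in sectors)
```

For an image of 728982618112 bytes and a fraction of 0.000001 this picks 1423 sectors, so only about 712 KiB have to be read from the medium.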

Exactly this approach is implemented in the Perl module File::FormatIdentification::RandomSampling, which can be found on CPAN under https://metacpan.org/pod/File::FormatIdentification::RandomSampling.

The category "empty" is dominated by sequences of zero bytes. In the category "random" the byte values are almost equally distributed. In the category "text", values for the characters "a-z" from the ASCII character set appear frequently. "video" contains frequent byte sequences resulting from the basic structure of MPEG. And under "undef" everything else is subsumed.
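A rough Python sketch of such a block classifier could look like the following. The thresholds (90% zero bytes, entropy above 7 bits, 50% lowercase letters) are values I made up for illustration and are not the heuristics of the Perl module. The "video" category is left out because it needs knowledge of MPEG byte patterns.

```python
import math
from collections import Counter


def classify_block(block):
    """Sort a block of bytes into 'empty', 'random', 'text' or 'undef'.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if not block:
        return "undef"
    size = len(block)
    counts = Counter(block)  # byte value -> number of occurrences

    # "empty": dominated by zero bytes
    if counts[0] / size > 0.9:
        return "empty"

    # "random": byte values almost equally distributed, i.e. the
    # Shannon entropy is close to the maximum of 8 bits per byte
    entropy = -sum(c / size * math.log2(c / size) for c in counts.values())
    if entropy > 7.0:
        return "random"

    # "text": the ASCII characters a-z appear frequently
    lowercase = sum(counts[b] for b in range(ord("a"), ord("z") + 1))
    if lowercase / size > 0.5:
        return "text"

    return "undef"
```

Compressed and encrypted data are hard to tell apart from random data with such a test, since both have a nearly flat byte distribution.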

Example

The above Perl module contains the program crazy_fast_image_scan.pl. The following simple call:

perl -I lib bin/crazy_fast_image_scan.pl --percent=0.000001 --image=/dev/mapper/laptop--vg-home

provides the following output:

Scanning Image /dev/mapper/laptop--vg-home with size 728982618112, checking 1423 sectors
scanning [...]   
Estimate, that the image '/dev/mapper/laptop--vg-home'
has percent of following data types:
    44.6% random/encrypted/compressed
    35.6% undef
    11.0% empty
     5.4% video/audio
     3.5% text

The complete output is even more extensive. It is important to note that the examined partition was about 679 GiB (728982618112 bytes) in size and was scanned in just 15s.

Limits

Importantly, the output provides only a rough estimate of what might be on the media. The choice of the sample size (here: via the --percent parameter) determines the informative value of the estimate, as well as the duration until a result can be delivered.
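How rough the estimate is can be approximated with the usual margin of error for a sampled proportion. The following Python sketch is my own back-of-the-envelope calculation, not part of the module; it assumes independently drawn sectors and uses the normal approximation with z = 1.96 for a 95% interval.

```python
import math


def margin_of_error(share, samples, z=1.96):
    """Approximate margin of error for an estimated share (0..1).

    Assumes independent random samples and the normal approximation;
    z = 1.96 corresponds to a 95% confidence interval.
    """
    return z * math.sqrt(share * (1.0 - share) / samples)


# The example above reports 44.6% "random" from 1423 sectors.
error = margin_of_error(0.446, 1423)
print(f"44.6% +/- {error * 100:.1f} percentage points")
```

For the example above this gives roughly 2.6 percentage points in either direction. Since the error only shrinks with the square root of the sample size, halving it requires four times as many sectors and thus roughly four times the scan time.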

More ideas

In the above module, I have implemented an experimental output of the MIME-Types potentially present on the media. This is not very stable yet and needs more work, but it can help to estimate even better whether the files on a disk are interesting enough to prioritize it. Here is an example output:

The next mimetype estimation is experimental and needs further work:
    87.9% unknown
     3.5% application/pdf
     1.1% video/quicktime
     0.8% image/gif
     0.8% text/java
     0.7% application/msword
     0.6% text/markdown
     0.6% application/vnd.openxmlformats-officedocument.wordprocessingml.document
     0.6% application/xml
     0.4% application/msaccess
     0.4% application/navimap
     0.4% application/rtf
     0.3% image/png
     0.2% application/arj
     0.1% application/vnd.ms-powerpoint
     0.1% text/html

The approach is to determine the MIME-Type of the files for a test corpus using other tools, determine typical bytegram values and pass the whole thing to a decision tree learner. If you are interested, you are welcome to contribute to the module. 

Happy scanning!