The problem
BagIt (RFC 8493) forms the basis for Submission Information Packages (SIP) and Archival Information Packages (AIP) in many digital archives.
Especially in the library environment, it is necessary to support supplemental submissions in the Archival Information System (AIS) software. Supplements may be limited to metadata or may add new files, remove existing files, or replace existing files.
Unfortunately, the BagIt specification offers no clean and easy way to implement a differential SIP.
The constraints
A design of a differential BagIt (dBagIt) should meet the following conditions:
1. existing BagIt should not be touched
2. it should be based on the BagIt structure so that the conversion effort is minimal
3. it should be easy to implement
4. it should support the "add" and "delete" operations
5. the checksum protection should be guaranteed
6. the referenced bag should be specified explicitly
The proposal of dBagIt
dBagIt builds on the structure of BagIt. The following changes are mandatory.
Bag Declaration: dbagit.txt
In contrast to Section 2.1.1 of RFC 8493, the bag declaration file is named dbagit.txt.
Payload Manifest
In contrast to Section 2.1.3 of RFC 8493, each line of a payload manifest file MUST be of the form
sign checksum filepath
where sign is either + for adding a file or - for deleting a file.
Replacing a file is expressed by two entries for the same path: one that deletes the old file and one that adds the new file.
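To illustrate the line format, here is a minimal Python sketch of a parser for one manifest line. The names ManifestEntry and parse_dbagit_manifest_line are hypothetical, the checksums are shortened placeholders, and splitting on arbitrary whitespace is an assumption carried over from how BagIt manifests are usually read; the proposal itself does not prescribe any of this.

```python
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    sign: str       # "+" to add a file, "-" to delete a file
    checksum: str   # hex digest; the algorithm follows from the manifest file name
    filepath: str   # payload path, e.g. "data/report.pdf"


def parse_dbagit_manifest_line(line: str) -> ManifestEntry:
    """Parse one 'sign checksum filepath' line of a dBagIt payload manifest."""
    # Split at most twice so that file paths containing spaces stay intact.
    parts = line.rstrip("\r\n").split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'sign checksum filepath', got: {line!r}")
    sign, checksum, filepath = parts
    if sign not in ("+", "-"):
        raise ValueError(f"sign must be '+' or '-', got: {sign!r}")
    return ManifestEntry(sign, checksum, filepath)


# A replacement: delete the old version, then add the new one under the same path.
# The checksums are placeholders, not real digests.
example = [
    "- 0a1b2c data/report.pdf",
    "+ 3d4e5f data/report.pdf",
]
entries = [parse_dbagit_manifest_line(line) for line in example]
```

Anything that is not exactly a sign, a checksum, and a path is rejected, so a plain BagIt manifest line without a sign fails early instead of being misread.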
Bag Metadata: bag-info.txt
In addition to RFC 8493, the key Updates-External-Identifier becomes mandatory. It references the original data object that this dBagIt updates.
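A receiving system could check for the mandatory key roughly as follows. This is only a sketch: the function name read_update_target and the identifier values are made up, and matching the label case-insensitively is an assumption, not something the proposal states.

```python
def read_update_target(bag_info_text: str) -> str:
    """Return the value of Updates-External-Identifier, or raise if it is missing."""
    for line in bag_info_text.splitlines():
        # Lines starting with whitespace continue the previous value; skip them.
        if not line or line[0] in " \t":
            continue
        label, sep, value = line.partition(":")
        # Assumption: the label is compared case-insensitively.
        if sep and label.strip().lower() == "updates-external-identifier":
            if value.strip():
                return value.strip()
    raise ValueError("dBagIt requires Updates-External-Identifier in bag-info.txt")


# Hypothetical bag-info.txt content of a supplement delivery.
sample = (
    "External-Identifier: supplement-0001\n"
    "Updates-External-Identifier: object-1234\n"
)
target_id = read_update_target(sample)
```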
Optional Tag Manifest
The Tag Manifest is similar to RFC 8493.
Although tag manifest files in BagIt could be used to describe additional proprietary subdirectories of a bag that the RFC does not specify, dBagIt deliberately does not define add/delete entries for them as it does for the payload manifest in the previous section. This keeps the creation and processing of dBagIts simple.
Implementation of the behavior
The implementation must ensure that:
- the target object referenced by the key Updates-External-Identifier exists
- the dBagIt is valid
- the add/delete operations are atomic and can be rolled back
- the checksums of files to be added are correct and the files are part of the current payload
- the checksums of files to be deleted match the checksums of the corresponding files in the referenced digital object
- the files in tag manifests are handled correctly if proprietary extensions are used
- the metadata in bag-info.txt completely replaces the previous version in the referenced object
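The checksum and rollback requirements above could be met, for example, by applying all operations to a staging copy and swapping it in only after every check has passed. The following Python sketch assumes that both the referenced object and the dBagIt are plain directories on the same file system; apply_dbagit and file_digest are hypothetical names. It leaves out updating the manifests and bag-info.txt of the target and the validation of paths, and the final swap consists of two renames, so it is only close to atomic.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hex digest of a file, read in chunks."""
    h = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def apply_dbagit(target: Path, dbag: Path, entries, algorithm: str = "sha256") -> None:
    """Apply (sign, checksum, filepath) entries from `dbag` to the object at `target`.

    All changes are made on a staging copy; on any error the copy is discarded
    and `target` stays as it was.
    """
    staging = Path(tempfile.mkdtemp(prefix=target.name + ".staging-", dir=target.parent))
    try:
        work = staging / "bag"
        shutil.copytree(target, work)
        # Deletes first, so that a replacement (delete + add of one path) works.
        for sign, checksum, filepath in entries:
            if sign != "-":
                continue
            victim = work / filepath
            if not victim.is_file():
                raise ValueError(f"cannot delete missing file: {filepath}")
            if file_digest(victim, algorithm) != checksum.lower():
                raise ValueError(f"checksum mismatch on delete: {filepath}")
            victim.unlink()
        for sign, checksum, filepath in entries:
            if sign != "+":
                continue
            source, dest = dbag / filepath, work / filepath
            if not source.is_file():
                raise ValueError(f"file to add is not in the dBagIt payload: {filepath}")
            if file_digest(source, algorithm) != checksum.lower():
                raise ValueError(f"checksum mismatch on add: {filepath}")
            if dest.exists():
                raise ValueError(f"add would overwrite existing file: {filepath}")
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, dest)
        # Swap: move the old version aside, then put the new version in place.
        old = staging / "old"
        os.rename(target, old)
        try:
            os.rename(work, target)
        except OSError:
            os.rename(old, target)  # restore the previous version
            raise
    finally:
        # Removes the staging copy on failure, or the old version on success.
        shutil.rmtree(staging, ignore_errors=True)
```

Because a failed check raises before the swap, a "delete" that names a missing file or carries the wrong checksum leaves the referenced object untouched.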
Future
If there is interest, I would be happy to receive feedback via art1pirat ATgmail.com. Maybe a new RFC can grow out of it.
Alternate consideration
A very simple alternative would be to use a unified 'diff'. It would also allow partial changes within files, but it brings hardly any advantage for binary data and is less intuitive for users who are not familiar with IT.
FAQ (Update 2022-05-18)
- What if "delete" references a non-existing file? All operations of a differential BagIt should be atomic and consistent. In this case the operations are rolled back and aborted with an error. This ensures that no unintended updates are applied.
- Wouldn't it be nice to allow a simple rename instead of a replace, to avoid transferring files? This would be worth considering. However, a secure rename requires the checksum, the old filename, and the new filename, which makes it complicated again. Since the case would probably not be too frequent, this could be specified later if needed.
- How is it ensured that, among several files with the same checksum, the wrong file is not deleted or replaced? Since "delete" requires both the checksum and the path of the already existing file, a mix-up is impossible.
- Is it correct that metadata passed in bag-info.txt overwrites the metadata in the referenced object? If so, why? Yes, that is so. It simplifies the design to focus only on the payload. The purpose of differential BagIt is to avoid the cost of transferring all files again for supplement deliveries, and most of that cost is usually incurred in transferring the payload.