Tuesday, 22 November 2016

Some thoughts about risks in the TIFF file format

Introduction

In general, TIFF is a very simple file format. It starts with a constant header entry, which identifies the file as a TIFF and indicates how it is encoded (byte order).
The header contains an offset entry which points to the first image file directory (IFD). Each IFD has a field which counts the number of associated tags, followed by an array of these tags and an offset entry to the next IFD or to zero, which means there is no further IFD.
Each tag in the array is 12 bytes long. The first 2 bytes identify the tag itself, the next 2 bytes declare the value type, followed by 4 bytes counting the values. The last 4 bytes either hold the values themselves, if they fit, or an offset to the place where the values are stored.
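The layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used by checkit_tiff: the function name parse_ifd0 and the hand-built sample file are assumptions made for this example, it reads only the first IFD of a classic (non-BigTIFF) file, and it follows the TIFF 6.0 field widths (2-byte tag, 2-byte type, 4-byte count, 4-byte value or offset).

```python
import struct

def parse_ifd0(data):
    """Return (byte order, entries, next IFD offset) for the first IFD."""
    order = {b"II": "<", b"MM": ">"}[bytes(data[:2])]  # KeyError: unknown byte order
    magic, ifd0 = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("magic number is not 42")
    (count,) = struct.unpack_from(order + "H", data, ifd0)
    # Each 12-byte entry: tag (2), type (2), value count (4), value or offset (4).
    entries = [struct.unpack_from(order + "HHI4s", data, ifd0 + 2 + 12 * i)
               for i in range(count)]
    (next_ifd,) = struct.unpack_from(order + "I", data, ifd0 + 2 + 12 * count)
    return order, entries, next_ifd

# Hand-built sample (hypothetical): header, then one IFD at offset 8 holding
# ImageWidth (256) and ImageLength (257), both of type SHORT with value 1.
sample = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 2)
          + struct.pack("<HHIHH", 256, 3, 1, 1, 0)
          + struct.pack("<HHIHH", 257, 3, 1, 1, 0)
          + struct.pack("<I", 0))

order, entries, next_ifd = parse_ifd0(sample)
print([(tag, typ, n) for tag, typ, n, _ in entries], next_ifd)
```

The value field is kept as 4 raw bytes on purpose, because whether it holds values or an offset depends on the type and count of the tag.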

What makes a TIFF robust?

In the TIFF specification, there are some hints which help us to repair broken TIFFs.
The first hint is that all offset addresses must be even. The second important rule is that the tags in an IFD must be sorted in ascending order.
Finally, the TIFF spec defines different areas in the tag range. This is to guarantee that the important values are well defined.
If we can be sure that a valid TIFF was stored originally, there is a good chance of detecting and repairing damage using these three hints.
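As a rough sketch, the three hints translate into very small checks. The function names check_hints and is_private are made up for this example; the only tag-range boundary used here is the one for private tags (32768 and higher in TIFF 6.0), and treating duplicate tags as a violation is an assumption of this sketch.

```python
def check_hints(tags, offsets):
    """Return a list of findings for one IFD; an empty list means no hint is violated."""
    # Hint 1: every offset address must be even.
    findings = [f"odd offset {o}" for o in offsets if o % 2]
    # Hint 2: tags must be in ascending order (duplicates are flagged too).
    for a, b in zip(tags, tags[1:]):
        if a >= b:
            findings.append(f"tag {b} follows tag {a}: not ascending")
    return findings

def is_private(tag):
    # Hint 3 (partial): TIFF 6.0 reserves tag numbers 32768 and higher
    # for private fields.
    return tag >= 32768

print(check_hints([256, 257, 273], [8, 134]))   # nothing to report
print(check_hints([256, 273, 257], [8, 135]))   # one odd offset, one misplaced tag
```

A repair tool could run such checks on every candidate reading of a damaged file and discard the readings that violate a hint.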

What are the caveats of TIFF?

As a proof of concept, this repository also provides a tool, "checkit_tiff_risk". With it, users can analyze the layout of any baseline TIFF file.
The riskiest byte ranges are the offsets. If a bit flip occurs in one of them, the correct value could in principle lie anywhere in the 4 GB range that a 32-bit offset can address. In practice, TIFF files are smaller, so the file size bounds the search space for offsets.
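To give an idea of what searching this space could look like, here is a hedged sketch that scans a file for places where an IFD might plausibly start after the IFD0 offset has been damaged. It is not how checkit_tiff works; the function name candidate_ifd_offsets, the limit of 64 tags, and the plausibility rules (even offset, strictly ascending tags, value types 1 to 12) are assumptions of this example.

```python
import struct

def candidate_ifd_offsets(data, order="<", max_tags=64):
    """Yield even offsets at which a plausible IFD seems to start."""
    for off in range(8, len(data) - 1, 2):      # skip the 8-byte header, even offsets only
        (n,) = struct.unpack_from(order + "H", data, off)
        end = off + 2 + 12 * n + 4              # tag count + entries + next-IFD offset
        if not 0 < n <= max_tags or end > len(data):
            continue
        tags = [struct.unpack_from(order + "H", data, off + 2 + 12 * i)[0]
                for i in range(n)]
        types = [struct.unpack_from(order + "H", data, off + 4 + 12 * i)[0]
                 for i in range(n)]
        if all(a < b for a, b in zip(tags, tags[1:])) and all(1 <= t <= 12 for t in types):
            yield off

# Hypothetical damaged file: the IFD really starts at offset 16 (0x0010),
# but a bit flip turned the header's IFD0 offset into 0x4010.
ifd = (struct.pack("<H", 2)
       + struct.pack("<HHIHH", 256, 3, 1, 1, 0)
       + struct.pack("<HHIHH", 257, 3, 1, 1, 0)
       + struct.pack("<I", 0))
damaged = b"II" + struct.pack("<HI", 42, 0x4010) + bytes(8) + ifd

candidates = list(candidate_ifd_offsets(damaged))
print(candidates)
```

Such a scan can return false positives alongside the true offset, so every candidate still needs further checks, and the effort grows with the file size.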
The riskiest offsets are the indirect ones, that is, the IFD0 offset and the StripOffsets tag (code 273).
Here is an example of a potentially complex StripOffsets encoding:
The problem in this example is that the pixel data stream itself does not say how many bytes belong to it. The only record is the StripByteCounts tag, which stores the number of bytes in each strip after compression and is just as exposed to bit flips.
This makes the StripOffsets tag very fragile. If a bit flip changes the offset stored in the StripOffsets tag, the whole pixel information might be lost.
Likewise, if a bit flip occurs in the offset array that the StripOffsets tag points to, part of the pixel data, namely the affected strip, is lost.
If compression is used, the risk of losing the whole picture is even higher, because decoding does not rely on an end symbol in the data. Instead, the buffer sizes stored in the StripByteCounts tag are used. Therefore, a bit error in the Compression tag, the StripOffsets tag, the StripByteCounts tag, or the memory area that StripOffsets points to could destroy the picture information.
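The difference between a direct and an indirect StripOffsets encoding can be sketched as follows. The helper name strip_offsets and the sample bytes are hypothetical; the rule it applies is the general TIFF one that values are stored inside the tag only if they fit into its 4-byte value field.

```python
import struct

# StripOffsets may be of type SHORT (3) or LONG (4): struct code and size in bytes.
TYPE_CODES = {3: ("H", 2), 4: ("I", 4)}

def strip_offsets(data, order, typ, count, raw):
    """Resolve the strip offsets from a tag's 4 raw value bytes."""
    code, size = TYPE_CODES[typ]
    if count * size <= 4:
        # Direct: the values sit in the tag itself (left-justified).
        return list(struct.unpack_from(f"{order}{count}{code}", raw))
    # Indirect: the 4 bytes are an offset to an array of strip offsets.
    (array_at,) = struct.unpack(order + "I", raw)
    return list(struct.unpack_from(f"{order}{count}{code}", data, array_at))

# Hypothetical file fragment: an array of three strip offsets stored at byte 8.
data = bytes(8) + struct.pack("<3I", 100, 200, 300)

print(strip_offsets(data, "<", 4, 1, struct.pack("<I", 100)))  # one strip, direct
print(strip_offsets(data, "<", 4, 3, struct.pack("<I", 8)))    # three strips, indirect
```

In the indirect case two levels have to be intact before any pixel data can be found: the offset to the array and every entry of the array.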

Upcoming next…


In upcoming versions of checkit_tiff, we will provide a tool to analyze the distribution of risky offsets in given TIFF files. This should put the discussion about robust file formats vs. compression on a more objective footing.

Here is a short preview:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid.tiff

It reports statistics of this kind:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01238
[02], type=                             ifd, bytes=       130, ratio=0.40248
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01238
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01238
[05], type= ifd_embedded_standardized_value, bytes=        52, ratio=0.16099
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        12, ratio=0.03715
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=         0, ratio=0.00000
[12], type=               stripoffset_value, bytes=        30, ratio=0.09288
[13], type=              standardized_value, bytes=        87, ratio=0.26935
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 323 bytes, size: 323 bytes


In this example the StripOffsets value is encoded directly (there is only one strip). The problematic bytes are the offset addresses (20 of the 323 bytes are affected).

In contrast to this example, here is a special file that uses multiple strips:

$>  ./checkit_tiff_risk ../tiffs_should_pass/minimal_valid_multiple_stripoffsets.tiff

It reports statistics of this kind:

[00], type=                  unused/unknown, bytes=         0, ratio=0.00000
[01], type=                        constant, bytes=         4, ratio=0.01250
[02], type=                             ifd, bytes=       122, ratio=0.38125
[03], type=                  offset_to_ifd0, bytes=         4, ratio=0.01250
[04], type=                   offset_to_ifd, bytes=         4, ratio=0.01250
[05], type= ifd_embedded_standardized_value, bytes=        44, ratio=0.13750
[06], type=   ifd_embedded_registered_value, bytes=         0, ratio=0.00000
[07], type=      ifd_embedded_private_value, bytes=         0, ratio=0.00000
[08], type=ifd_offset_to_standardized_value, bytes=        16, ratio=0.05000
[09], type=  ifd_offset_to_registered_value, bytes=         0, ratio=0.00000
[10], type=     ifd_offset_to_private_value, bytes=         0, ratio=0.00000
[11], type=      ifd_offset_to_stripoffsets, bytes=        40, ratio=0.12500
[12], type=               stripoffset_value, bytes=        30, ratio=0.09375
[13], type=              standardized_value, bytes=        56, ratio=0.17500
[14], type=                registered_value, bytes=         0, ratio=0.00000
[15], type=                   private_value, bytes=         0, ratio=0.00000
counted: 320 bytes, size: 320 bytes


Here you can see type 11, where StripOffsets points to an array of offset addresses at which the pixel data can be found. This is similar to the diagram above. In this case we have 40 bytes with a high bit-flip risk.
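The byte counts from the two reports can be recomputed with a few lines of Python. The numbers are copied from the tool output above (rows with 0 bytes are left out); the function name offset_share and the rule that every type containing "offset_to_" counts as a risky offset are assumptions of this sketch, chosen because that rule reproduces the 20 bytes quoted for the first file.

```python
def offset_share(report):
    """Return (risky bytes, share of file) for one report given as {type: bytes}."""
    total = sum(report.values())
    risky = sum(b for name, b in report.items() if "offset_to_" in name)
    return risky, risky / total

single_strip = {
    "constant": 4, "ifd": 130, "offset_to_ifd0": 4, "offset_to_ifd": 4,
    "ifd_embedded_standardized_value": 52, "ifd_offset_to_standardized_value": 12,
    "stripoffset_value": 30, "standardized_value": 87,
}
multiple_strips = {
    "constant": 4, "ifd": 122, "offset_to_ifd0": 4, "offset_to_ifd": 4,
    "ifd_embedded_standardized_value": 44, "ifd_offset_to_standardized_value": 16,
    "ifd_offset_to_stripoffsets": 40, "stripoffset_value": 30, "standardized_value": 56,
}

for name, report in (("single strip", single_strip), ("multiple strips", multiple_strips)):
    risky, share = offset_share(report)
    print(f"{name}: {risky} of {sum(report.values())} bytes, ratio={share:.5f}")
```

Under this counting rule the second file has 64 offset bytes in total: the 40 bytes of type 11 plus the 24 bytes of the other offset types.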