Friday, 5 October 2018

Here's a tool, make it work!

You may already have noticed it in the last post: to analyze the hits of DROID signatures, I wrote a small Perl script. It converts DROID signatures into Perl regular expressions and writes the matches into tag files for the hex editor wxHexEditor, so that you can see which signature matched where in a file.
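To give an idea of what such a conversion involves, here is a minimal sketch. It is written in Python rather than Perl and is not the code of the script itself. It assumes a small subset of the DROID byte sequence syntax (hex pairs, `??`, `*`, `{n}`, `{n-m}`, `{n-*}`); alternatives, byte ranges and offsets are left out, and the names `droid_to_regex` and `find_hits` are made up for this illustration. Writing the hits into a tag file is not shown.

```python
import re

# One token of the assumed subset: {n}, {n-m}, {n-*}, ??, * or a hex pair.
TOKEN = re.compile(r"\{(\d+)(?:-(\d+|\*))?\}|(\?\?)|(\*)|([0-9A-Fa-f]{2})")


def droid_to_regex(sequence: str) -> str:
    """Translate a simplified DROID byte sequence into regex source text."""
    parts = []
    pos = 0
    while pos < len(sequence):
        m = TOKEN.match(sequence, pos)
        if m is None:
            # Anything outside the assumed subset is rejected, not guessed.
            raise ValueError(f"unsupported syntax at offset {pos}")
        low, high, any_byte, any_run, hex_pair = m.groups()
        if hex_pair is not None:
            parts.append(f"\\x{hex_pair.upper()}")   # literal byte
        elif any_byte is not None:
            parts.append(".")                        # ?? = exactly one byte
        elif any_run is not None:
            parts.append(".*")                       # *  = any number of bytes
        elif high is None:
            parts.append(f".{{{low}}}")              # {n}
        elif high == "*":
            parts.append(f".{{{low},}}")             # {n-*}
        else:
            parts.append(f".{{{low},{high}}}")       # {n-m}
        pos = m.end()
    return "".join(parts)


def find_hits(sequence: str, data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of every match in data."""
    # DOTALL so that "." also matches the byte 0x0A.
    pattern = re.compile(droid_to_regex(sequence).encode("ascii"), re.DOTALL)
    return [(m.start(), m.end()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    print(droid_to_regex("255044462D"))
    print(find_hits("255044462D", b"xx%PDF-1.4"))
```

The offsets returned by `find_hits` are the kind of information a hex editor needs in order to highlight where a signature matched.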

This small script grew into a larger Perl module called "File::FormatIdentification::Pronom". It is not meant to replace DROID, Fido or Siegfried. It only helps to analyze which patterns could be optimized, and it provides statistics that may help to improve the PRONOM database in the future.
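As a rough idea of how such statistics could be derived from a signature file, here is a small Python sketch. It is not the module's code. It assumes the usual layout of a DROID signature file (`InternalSignature` elements with an `ID` attribute, `FileFormat` elements with a `PUID` attribute and `InternalSignatureID` and `Extension` children), and it assumes that an "orphaned" internal ID is a signature that no file format refers to. The embedded sample is hand-made and the PUIDs in it are invented.

```python
import xml.etree.ElementTree as ET

# Namespace used by DROID signature files.
NS = {"p": "http://www.nationalarchives.gov.uk/pronom/SignatureFile"}

# Hand-made miniature sample, not taken from a real signature file.
SAMPLE = """
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="0">
  <InternalSignatureCollection>
    <InternalSignature ID="1"/>
    <InternalSignature ID="2"/>
    <InternalSignature ID="3"/>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Example A" PUID="x-fmt/0001">
      <InternalSignatureID>1</InternalSignatureID>
      <Extension>aaa</Extension>
    </FileFormat>
    <FileFormat ID="11" Name="Example B" PUID="x-fmt/0002">
      <Extension>bbb</Extension>
    </FileFormat>
    <FileFormat ID="12" Name="Example C" PUID="x-fmt/0003">
      <InternalSignatureID>3</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
"""


def signature_stats(xml_text: str) -> dict:
    """Count formats, signatures, extension-only formats and orphaned IDs."""
    root = ET.fromstring(xml_text.strip())
    defined = {int(s.get("ID")) for s in root.iterfind(".//p:InternalSignature", NS)}
    referenced = set()
    extension_only = []
    formats = list(root.iterfind(".//p:FileFormat", NS))
    for fmt in formats:
        ids = {int(e.text) for e in fmt.iterfind("p:InternalSignatureID", NS)}
        referenced |= ids
        # A format without any signature but with at least one extension.
        if not ids and fmt.find("p:Extension", NS) is not None:
            extension_only.append(fmt.get("PUID"))
    return {
        "puids": len(formats),
        "internal_ids": len(defined),
        "extension_only": extension_only,
        # Assumption: orphaned = defined but never referenced by a format.
        "orphaned": sorted(defined - referenced),
    }


if __name__ == "__main__":
    print(signature_stats(SAMPLE))
```

For a real signature file you would read the XML text from disk and pass it to `signature_stats`; whether the module counts in exactly this way is not confirmed here.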
The following statistics for the current DROID signature file give a feeling for what is possible:
perl -I lib/ bin/pronom_statistics.pl ../DROID_SignatureFile_V94.xml
Statistics of file ../DROID_SignatureFile_V94.xml
=======================================

Countings
---------------------------------------
Count of PUIDs:                        1670
         internal IDs:                 1441
         regular expressions:          1730
         file endings:                 1167
         PUIDs with file endings only: 503
         (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
         orphaned internal IDs:        20
         (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)

Quality of internal IDs
---------------------------------------
1-best quality internal ID (PUID, name):       110 (fmt/75, Drawing Interchange File Format (ASCII)) -> 4.882;3.135
        combined regex: (?=((\x0A)|(\x0D\x0A)(0))SECTION((\x0A)|(\x0D\x0A)(\x20\x202)((\x0A)|(\x0D\x0A)(HEADER)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(9))\$ACADVER((\x0A)|(\x0D\x0A)(\x20\x201)((\x0A)|(\x0D\x0A)(AC1009)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(0))ENDSEC((\x0A)|(\x0D\x0A)))(?=(((\x0A)|(\x0D\x0A)(0))EOF((\x0A)|(\x0D\x0A)))\Z)
2-best quality internal ID (PUID, name):       105 (fmt/70, Drawing Interchange File Format (ASCII)) -> 4.736;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC((1001)|(2\x2E21)|(2\x2E22)(\x0D\x0A))0
ENDSEC
)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
3-best quality internal ID (PUID, name):       104 (fmt/69, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC2\x2E10\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
4-best quality internal ID (PUID, name):       103 (fmt/68, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E50\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
5-best quality internal ID (PUID, name):       102 (fmt/67, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E40\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)

1-worst quality internal ID (PUID, name):       1299 (fmt/950, MIME Email) -> -1.993;-2.91;-2.776;-2.776;-2.29
        combined regex: (?=\A.{0,16384}(((V)|(v)(\x2D)((IME)|(ime)(M)))ersion: 1\.0))(?=\A.{0,16384}(To\x3A\x20))(?=\A.{0,16384}(From\x3A\x20))(?=\A.{0,16384}(Date\x3A\x20))(?=\A.{0,16384}(Content\x2DType\x3A\x20))
2-worst quality internal ID (PUID, name):       527 (fmt/358, Internet Data Query File) -> -2.806;-2.743;-2.629;-2.981
        combined regex: (?=\A.{0,3424}(\x5BQuery\x5D).*(((S)|(s)(i)((C)|(c)))cope=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((C)|(c)(i)((C)|(c)))olumns=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((T)|(t)(i)((C)|(c)))emplate=\/))(?=\A.{0,3424}(\x5BQuery\x5D).*(((R)|(r)(i)((C)|(c)))estriction=.?(\x25)))
3-worst quality internal ID (PUID, name):       532 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
        combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
4-worst quality internal ID (PUID, name):       533 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
        combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
5-worst quality internal ID (PUID, name):       835 (fmt/532, Drawing Interchange File Format (ASCII)) -> -3.614;-3.842
        combined regex: (?=\A.{1,3}((0).{1,2}SECTION.{1,2}(\x20\x202).{1,2}(HEADER)).+((9).{1,2}\$ACADVER.{1,2}(\x20\x201).{1,2}(AC1027)).+((0).{1,2}ENDSEC))(?=((0).{1,2}EOF).{1,3}\Z)


Regular expressions
---------------------------------------
Count of multiple used regular expressions: 67
         common regex group no 0:
            regex='(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z'
            internal IDs: 111,112,113
[…]
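The combined regexes in the output above chain several lookaheads of the form `(?=\A.{0,N}(...))`. Because each lookahead is zero-width and anchored at the start of the file, the combined expression only matches if every part is found within its window, in any order. The following Python sketch illustrates this with a shortened pattern modeled on the MIME Email entry. The function name `combine` and the sample data are assumptions made for the illustration, and Python's regex dialect stands in for Perl's.

```python
import re

# Three of the header patterns from the MIME Email entry, as byte patterns.
PARTS = [rb"To\x3A\x20", rb"From\x3A\x20", rb"Date\x3A\x20"]
WINDOW = 16384  # each pattern must start within this many bytes after \A


def combine(parts, window):
    """Join patterns into one regex of start-anchored lookaheads."""
    lookaheads = [rb"(?=\A.{0,%d}(%s))" % (window, part) for part in parts]
    # DOTALL so that "." also crosses line breaks inside the window.
    return re.compile(b"".join(lookaheads), re.DOTALL)


COMBINED = combine(PARTS, WINDOW)

# Invented sample data: the headers appear in a different order than in PARTS.
complete = b"From: a@example.org\r\nTo: b@example.org\r\nDate: Fri, 5 Oct 2018\r\n\r\nHi"
incomplete = b"From: a@example.org\r\nTo: b@example.org\r\n\r\nHi"

if __name__ == "__main__":
    print(COMBINED.match(complete) is not None)    # all three parts found
    print(COMBINED.match(incomplete) is not None)  # "Date: " is missing
```

A successful match is empty (it consumes no bytes), since lookaheads only assert that something can be found. To locate the individual hits, each part would have to be searched on its own.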



I would welcome feedback. The code is available at http://andreas-romeyke.de/software.html#_file_formatidentification_pronom .

Have fun!