From this small script, a larger Perl module called "File::FormatIdentification::Pronom" was created. It is not meant to replace Droid, Fido or Siegfried. It serves only to analyze which patterns can be optimized and to provide statistics on how the Pronom database could be improved in the future.
Below are the statistics for the current Droid signature file, to give you a feeling for what is possible.
perl -I lib/ bin/pronom_statistics.pl ../DROID_SignatureFile_V94.xml

Statistics of file ../DROID_SignatureFile_V94.xml
=======================================

Countings
---------------------------------------
Count of PUIDs: 1670
    internal IDs: 1441
    regular expressions: 1730
    file endings: 1167
    PUIDs with file endings only: 503 (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
    orphaned internal IDs: 20 (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)

Quality of internal IDs
---------------------------------------
1-best quality internal ID (PUID, name): 110 (fmt/75, Drawing Interchange File Format (ASCII)) -> 4.882;3.135
    combined regex: (?=((\x0A)|(\x0D\x0A)(0))SECTION((\x0A)|(\x0D\x0A)(\x20\x202)((\x0A)|(\x0D\x0A)(HEADER)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(9))\$ACADVER((\x0A)|(\x0D\x0A)(\x20\x201)((\x0A)|(\x0D\x0A)(AC1009)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(0))ENDSEC((\x0A)|(\x0D\x0A)))(?=(((\x0A)|(\x0D\x0A)(0))EOF((\x0A)|(\x0D\x0A)))\Z)
2-best quality internal ID (PUID, name): 105 (fmt/70, Drawing Interchange File Format (ASCII)) -> 4.736;2.833
    combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC((1001)|(2\x2E21)|(2\x2E22)(\x0D\x0A))0 ENDSEC )(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
3-best quality internal ID (PUID, name): 104 (fmt/69, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
    combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC2\x2E10\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
4-best quality internal ID (PUID, name): 103 (fmt/68, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
    combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E50\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
5-best quality internal ID (PUID, name): 102 (fmt/67, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
    combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E40\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
1-worst quality internal ID (PUID, name): 1299 (fmt/950, MIME Email) -> -1.993;-2.91;-2.776;-2.776;-2.29
    combined regex: (?=\A.{0,16384}(((V)|(v)(\x2D)((IME)|(ime)(M)))ersion: 1\.0))(?=\A.{0,16384}(To\x3A\x20))(?=\A.{0,16384}(From\x3A\x20))(?=\A.{0,16384}(Date\x3A\x20))(?=\A.{0,16384}(Content\x2DType\x3A\x20))
2-worst quality internal ID (PUID, name): 527 (fmt/358, Internet Data Query File) -> -2.806;-2.743;-2.629;-2.981
    combined regex: (?=\A.{0,3424}(\x5BQuery\x5D).*(((S)|(s)(i)((C)|(c)))cope=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((C)|(c)(i)((C)|(c)))olumns=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((T)|(t)(i)((C)|(c)))emplate=\/))(?=\A.{0,3424}(\x5BQuery\x5D).*(((R)|(r)(i)((C)|(c)))estriction=.?(\x25)))
3-worst quality internal ID (PUID, name): 532 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
    combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
4-worst quality internal ID (PUID, name): 533 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
    combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
5-worst quality internal ID (PUID, name): 835 (fmt/532, Drawing Interchange File Format (ASCII)) -> -3.614;-3.842
    combined regex: (?=\A.{1,3}((0).{1,2}SECTION.{1,2}(\x20\x202).{1,2}(HEADER)).+((9).{1,2}\$ACADVER.{1,2}(\x20\x201).{1,2}(AC1027)).+((0).{1,2}ENDSEC))(?=((0).{1,2}EOF).{1,3}\Z)

Regular expressions
---------------------------------------
Count of multiple used regular expressions: 67
common regex group no 0: regex='(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z'
    internal IDs: 111,112,113[…]
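
To make the "combined regex" lines easier to read, here is a minimal sketch of how a chain of lookaheads behaves when it is applied to file content. It is written in Python rather than the module's Perl, it uses only three of the five lookaheads printed for internal ID 1299 (fmt/950, MIME Email), and the helper name `looks_like_mime` is made up for illustration. Whether the module evaluates its patterns in exactly this way is an assumption.

```python
import re

# Three of the five lookaheads printed for internal ID 1299 (fmt/950, MIME Email).
# Each one is anchored at the start of the data (\A) and then searches the
# first 16384 bytes for one header name.
LOOKAHEADS = [
    rb"(?=\A.{0,16384}(To\x3A\x20))",
    rb"(?=\A.{0,16384}(From\x3A\x20))",
    rb"(?=\A.{0,16384}(Date\x3A\x20))",
]

# Concatenated lookaheads consume no input, so all of them must succeed
# at offset 0, in any order of appearance within the data.
# re.DOTALL lets "." match newline bytes too, since the content is binary.
COMBINED = re.compile(b"".join(LOOKAHEADS), re.DOTALL)


def looks_like_mime(data: bytes) -> bool:
    """Hypothetical helper: True if every lookahead matches at offset 0."""
    return COMBINED.match(data) is not None


if __name__ == "__main__":
    sample = b"From: a@example.org\r\nTo: b@example.org\r\nDate: Mon\r\n\r\nbody"
    print(looks_like_mime(sample))
    print(looks_like_mime(b"From: a@example.org\r\n"))
```

The sketch also hints at why this entry ranks among the worst: each lookahead restarts from the beginning and may scan up to 16384 bytes, and short strings such as "To: " can occur in many files that are not emails.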
Feedback is welcome. The code is available at http://andreas-romeyke.de/software.html#_file_formatidentification_pronom .
Have fun!