This small script has grown into a larger Perl module called "File::FormatIdentification::Pronom". It is not meant to replace Droid, Fido or Siegfried. It only serves to analyze which signature patterns can be optimized, and it produces statistics that may help to improve the Pronom database in the future.
Below are the statistics for the current Droid signature file, to give you a feeling for what is possible.
perl -I lib/ bin/pronom_statistics.pl ../DROID_SignatureFile_V94.xml
Statistics of file ../DROID_SignatureFile_V94.xml
=======================================
Countings
---------------------------------------
Count of PUIDs: 1670
internal IDs: 1441
regular expressions: 1730
file endings: 1167
PUIDs with file endings only: 503
(56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
orphaned internal IDs: 20
(56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
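Counts like the ones above can in principle be derived directly from the signature file. The following is a minimal Python sketch for illustration only, not the module's Perl code. It assumes the usual layout of a Droid signature file (elements `InternalSignature`, `FileFormat`, `InternalSignatureID` and `Extension` in the namespace shown), and it assumes that "orphaned" means an internal signature that no file format refers to and that "file endings only" means a format with extensions but no internal signature. The sample XML and its PUIDs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from the Droid signature file layout.
NS = {"p": "http://www.nationalarchives.gov.uk/pronom/SignatureFile"}

# Tiny made-up sample; a real signature file is far larger.
SAMPLE = """<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="0">
  <InternalSignatureCollection>
    <InternalSignature ID="1" Specificity="Specific"/>
    <InternalSignature ID="2" Specificity="Specific"/>
    <InternalSignature ID="3" Specificity="Specific"/>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Example A" PUID="x-fmt/1">
      <InternalSignatureID>1</InternalSignatureID>
      <Extension>aaa</Extension>
    </FileFormat>
    <FileFormat ID="11" Name="Example B" PUID="x-fmt/2">
      <Extension>bbb</Extension>
    </FileFormat>
    <FileFormat ID="12" Name="Example C" PUID="x-fmt/3">
      <InternalSignatureID>2</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
"""


def signature_counts(xml_text):
    """Return simple counts for a signature file given as XML text."""
    root = ET.fromstring(xml_text)
    signature_ids = {s.get("ID") for s in root.iterfind(".//p:InternalSignature", NS)}
    formats = root.findall(".//p:FileFormat", NS)

    referenced = set()
    extension_only = []
    for fmt in formats:
        refs = [e.text for e in fmt.findall("p:InternalSignatureID", NS)]
        extensions = fmt.findall("p:Extension", NS)
        referenced.update(refs)
        # Assumed meaning of "file endings only": extensions, but no signature.
        if extensions and not refs:
            extension_only.append(fmt.get("PUID"))

    return {
        "puids": len({fmt.get("PUID") for fmt in formats}),
        "internal_ids": len(signature_ids),
        "extension_only": extension_only,
        # Assumed meaning of "orphaned": defined, but never referenced.
        "orphaned": sorted(signature_ids - referenced, key=int),
    }


stats = signature_counts(SAMPLE)
print(stats)
```

For the made-up sample this reports three PUIDs, three internal IDs, one extension-only format (`x-fmt/2`) and one orphaned internal ID (`3`).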
Quality of internal IDs
---------------------------------------
1-best quality internal ID (PUID, name): 110 (fmt/75, Drawing Interchange File Format (ASCII)) -> 4.882;3.135
combined regex: (?=((\x0A)|(\x0D\x0A)(0))SECTION((\x0A)|(\x0D\x0A)(\x20\x202)((\x0A)|(\x0D\x0A)(HEADER)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(9))\$ACADVER((\x0A)|(\x0D\x0A)(\x20\x201)((\x0A)|(\x0D\x0A)(AC1009)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(0))ENDSEC((\x0A)|(\x0D\x0A)))(?=(((\x0A)|(\x0D\x0A)(0))EOF((\x0A)|(\x0D\x0A)))\Z)
2-best quality internal ID (PUID, name): 105 (fmt/70, Drawing Interchange File Format (ASCII)) -> 4.736;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC((1001)|(2\x2E21)|(2\x2E22)(\x0D\x0A))0
ENDSEC
)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
3-best quality internal ID (PUID, name): 104 (fmt/69, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC2\x2E10\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
4-best quality internal ID (PUID, name): 103 (fmt/68, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E50\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
5-best quality internal ID (PUID, name): 102 (fmt/67, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E40\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
1-worst quality internal ID (PUID, name): 1299 (fmt/950, MIME Email) -> -1.993;-2.91;-2.776;-2.776;-2.29
combined regex: (?=\A.{0,16384}(((V)|(v)(\x2D)((IME)|(ime)(M)))ersion: 1\.0))(?=\A.{0,16384}(To\x3A\x20))(?=\A.{0,16384}(From\x3A\x20))(?=\A.{0,16384}(Date\x3A\x20))(?=\A.{0,16384}(Content\x2DType\x3A\x20))
2-worst quality internal ID (PUID, name): 527 (fmt/358, Internet Data Query File) -> -2.806;-2.743;-2.629;-2.981
combined regex: (?=\A.{0,3424}(\x5BQuery\x5D).*(((S)|(s)(i)((C)|(c)))cope=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((C)|(c)(i)((C)|(c)))olumns=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((T)|(t)(i)((C)|(c)))emplate=\/))(?=\A.{0,3424}(\x5BQuery\x5D).*(((R)|(r)(i)((C)|(c)))estriction=.?(\x25)))
3-worst quality internal ID (PUID, name): 532 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
4-worst quality internal ID (PUID, name): 533 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
5-worst quality internal ID (PUID, name): 835 (fmt/532, Drawing Interchange File Format (ASCII)) -> -3.614;-3.842
combined regex: (?=\A.{1,3}((0).{1,2}SECTION.{1,2}(\x20\x202).{1,2}(HEADER)).+((9).{1,2}\$ACADVER.{1,2}(\x20\x201).{1,2}(AC1027)).+((0).{1,2}ENDSEC))(?=((0).{1,2}EOF).{1,3}\Z)
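Several of the combined regexes listed above are built from sub-patterns that are each wrapped in a lookahead of the form `(?=\A.{0,N}(...))`, so that every sub-pattern must occur somewhere within the first N bytes. The following minimal Python sketch illustrates that combination idea. It is not the module's Perl code: the helper name `combine_lookaheads` is hypothetical, the three sub-patterns are taken from the MIME Email entry, the sample message is made up, and letting `.` match line breaks (`re.DOTALL`) is an assumption.

```python
import re


def combine_lookaheads(subpatterns, max_offset):
    """Join byte-regex fragments into one regex of anchored lookaheads.

    Each fragment must match somewhere after at most max_offset bytes
    from the start of the data; the order of the fragments is irrelevant.
    """
    parts = [rb"(?=\A.{0,%d}(%s))" % (max_offset, sub) for sub in subpatterns]
    # DOTALL is an assumption: "." should also match line breaks here.
    return re.compile(b"".join(parts), re.DOTALL)


# Sub-patterns copied from the MIME Email entry in the listing.
SUBPATTERNS = [rb"To\x3A\x20", rb"From\x3A\x20", rb"Date\x3A\x20"]
combined = combine_lookaheads(SUBPATTERNS, 16384)

# Made-up sample data.
message = (
    b"From: a@example.org\r\n"
    b"To: b@example.org\r\n"
    b"Date: Mon, 01 Jan 2018 00:00:00 +0000\r\n"
    b"\r\n"
    b"Hello"
)

print(combined.match(message) is not None)
print(combined.match(b"To: b@example.org\r\n") is not None)
```

The first call prints `True` because all three header names are present. The second prints `False` because only one of them is. Since each lookahead scans a window of up to N bytes on its own, patterns of this shape tend to be less specific and more expensive than a fixed byte sequence at a fixed offset.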
Regular expressions
---------------------------------------
Count of multiple used regular expressions: 67
common regex group no 0:
regex='(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z'
internal IDs: 111,112,113
[…]
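Finding regular expressions that are used by more than one internal ID amounts to grouping the IDs by regex text. A minimal Python sketch of that grouping follows, for illustration only and not the module's Perl code. The function name `common_regex_groups` is hypothetical. The first regex and the IDs 111 to 113 are taken from the output above, while ID 999 and its regex are made up.

```python
from collections import defaultdict


def common_regex_groups(pairs):
    """Group internal IDs by regex text and keep regexes used more than once.

    pairs is an iterable of (internal_id, regex_text) tuples.
    """
    groups = defaultdict(set)
    for internal_id, regex_text in pairs:
        groups[regex_text].add(internal_id)
    return {rx: sorted(ids) for rx, ids in groups.items() if len(ids) > 1}


EOF_REGEX = r"(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z"

pairs = [
    (111, EOF_REGEX),
    (112, EOF_REGEX),
    (113, EOF_REGEX),
    (999, r"\Amade-up"),  # hypothetical entry that is used only once
]

for number, (regex_text, ids) in enumerate(common_regex_groups(pairs).items()):
    print(f"common regex group no {number}:")
    print(f"regex='{regex_text}'")
    print("internal IDs: " + ",".join(str(i) for i in ids))
```

Comparing the regex text literally only finds exact duplicates. Patterns that are equivalent but written differently would need some normalization first.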
I would be glad to receive feedback. The code is available at http://andreas-romeyke.de/software.html#_file_formatidentification_pronom .
Have fun!