Identify digital derivatives - Georgetown-University-Libraries/File-Analyzer GitHub Wiki

Given the following set of digital objects, find the complete sets of derivative objects.

  • "GTW_brosnan_b106_e0008_1.tif"
  • "GTW_brosnan_b106_e0008_2.jpg"
  • "GTW_brosnan_b106_e0008_3.jpg"
  • "GTW_brosnan_b106_e0054_1.tif"
  • "GTW_brosnan_b106_e0054_2.jpg"
  • "GTW_brosnan_b106_e0054_3.jpg"
  • "GTW_brosnan_b106_e0055_1.tif"
  • "GTW_brosnan_b106_e0055_2.jpg"
  • "GTW_brosnan_b106_e0055_3.jpg"
  • "GTW_brosnan_b106_e0065_1.tif"
  • "GTW_brosnan_b106_e0065_2.jpg"
  • "GTW_brosnan_b106_e0065_3.jpg"
  • "GTW_brosnan_b106_e0n01_1.tif"
  • "GTW_brosnan_b106_e0n01_2.jpg"
  • "GTW_brosnan_b106_e0n01_3.jpg"
  • "Thumbs.db"

The Digital Derivatives rule will help perform this function.

screen-shot

Default Match Rule

In order to make this rule flexible, the user must enter a regular expression that identifies common sets of items. The default pattern is shown here.

Match: ^(.*?)\.[^.]+$

  • ^ - beginning of the name
  • (.*?) - grab any characters; the parentheses make this group #1
  • \.[^.]+ - grab the final period and everything following it
  • $ - end of the name

Replacement:

  • pull the contents that match the first parentheses group

File Extensions Required: .tif,.jpg

  • look for tif files and jpg files

Note that the following results are not very helpful because the derivatives are not grouped together.

screen-shot
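To see why the default rule does not group the derivatives, the match can be tried outside the tool. The following is a minimal Python sketch, not File Analyzer's own code: the helper name `default_key` is hypothetical, and the period is written as `\.` so that it matches a literal dot.

```python
import re

# Default match rule, with the period escaped so it matches a literal dot.
DEFAULT_MATCH = re.compile(r"^(.*?)\.[^.]+$")


def default_key(name):
    """Return group #1 (the name without its final extension), or None."""
    m = DEFAULT_MATCH.match(name)
    return m.group(1) if m else None


names = [
    "GTW_brosnan_b106_e0008_1.tif",
    "GTW_brosnan_b106_e0008_2.jpg",
    "GTW_brosnan_b106_e0008_3.jpg",
]

# Each file keeps its _1, _2 or _3 suffix, so every file gets its own key.
keys = [default_key(n) for n in names]
for name, key in zip(names, keys):
    print(name, "->", key)
```

Because the `_1`, `_2` and `_3` suffixes remain part of group #1, the three files that belong together end up under three different keys.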

Customized Match Rule

screen-shot

Match: ^(.*?)(_\d)?\.[^.]+$

  • ^ - beginning of the name
  • (.*?) - grab any characters; the parentheses make this group #1
  • Note that the *? makes this a non-greedy match, so group #1 stops before the optional _ + digit
  • (_\d)? - look for an optional _ + a digit preceding the file extension
  • \.[^.]+ - grab the final period and everything following it
  • $ - end of the name

Replacement: $1

  • pull the contents that match the first parentheses group

File Extensions Required: .tif,.jpg

  • look for tif files and jpg files

The new results are an improvement.

screen-shot

Note that duplicate items were found for .jpg: each set contains two .jpg files (_2.jpg and _3.jpg).
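The grouping produced by the customized rule, and the duplicate .jpg entries, can be reproduced with a short Python sketch. This is an illustration under assumptions rather than the tool's implementation: `group_by_key` and `extensions` are hypothetical helpers, the period is escaped as `\.`, and the replacement `$1` is expressed as `m.group(1)`.

```python
import re
from collections import defaultdict

# Customized match rule: group #1 excludes the optional "_<digit>" suffix.
CUSTOM_MATCH = re.compile(r"^(.*?)(_\d)?\.[^.]+$")


def group_by_key(names):
    """Group file names by the text captured in group #1."""
    groups = defaultdict(list)
    for name in names:
        m = CUSTOM_MATCH.match(name)
        if m:
            groups[m.group(1)].append(name)
    return dict(groups)


def extensions(names):
    """Return the sorted final extensions (including the period)."""
    return sorted(n[n.rfind("."):] for n in names)


names = [
    "GTW_brosnan_b106_e0008_1.tif",
    "GTW_brosnan_b106_e0008_2.jpg",
    "GTW_brosnan_b106_e0008_3.jpg",
    "GTW_brosnan_b106_e0n01_1.tif",
    "GTW_brosnan_b106_e0n01_2.jpg",
    "GTW_brosnan_b106_e0n01_3.jpg",
    "Thumbs.db",
]

groups = group_by_key(names)
for key, members in groups.items():
    # Two ".jpg" entries in one group correspond to the duplicates noted above.
    print(key, extensions(members))
```

Each `GTW_...` key now collects one .tif and two .jpg files, which is why plain extensions such as `.jpg` are ambiguous within a set.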

One additional refinement

screen-shot

Note the results.

screen-shot

Note the results if the required file extensions are modified

Required extensions: _1.tif,_2.jpg

screen-shot

Add a required extension that does not exist

Required extensions: _1.tif,_2.jpg,_4.bmp

screen-shot

Make the new extension optional

Required extensions: _1.tif,_2.jpg,_3.jpg

Optional extensions: _4.bmp

screen-shot
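The three required/optional configurations above can be compared with a small Python sketch. It assumes that each entry such as `_1.tif` is treated as a suffix appended to the group key; that is an assumption made for illustration, and File Analyzer may evaluate the settings differently. The function name `check_derivative_set` and the result keys are hypothetical.

```python
def check_derivative_set(key, names, required, optional=()):
    """Report which required and optional suffixes exist for one group key.

    Assumption: each entry such as "_1.tif" is appended to the key and
    compared against the file names; the real tool may work differently.
    """
    present = set(names)
    missing = [suffix for suffix in required if key + suffix not in present]
    extras = [suffix for suffix in optional if key + suffix in present]
    return {"complete": not missing, "missing": missing, "optional_found": extras}


names = [
    "GTW_brosnan_b106_e0008_1.tif",
    "GTW_brosnan_b106_e0008_2.jpg",
    "GTW_brosnan_b106_e0008_3.jpg",
]
key = "GTW_brosnan_b106_e0008"

# Required: _1.tif,_2.jpg
two_required = check_derivative_set(key, names, ["_1.tif", "_2.jpg"])

# Required: _1.tif,_2.jpg,_4.bmp (the .bmp file does not exist)
with_bmp_required = check_derivative_set(key, names, ["_1.tif", "_2.jpg", "_4.bmp"])

# Required: _1.tif,_2.jpg,_3.jpg with _4.bmp optional
with_bmp_optional = check_derivative_set(
    key, names, ["_1.tif", "_2.jpg", "_3.jpg"], optional=["_4.bmp"]
)

for label, result in [
    ("two required", two_required),
    ("bmp required", with_bmp_required),
    ("bmp optional", with_bmp_optional),
]:
    print(label, result)
```

Under this assumption, listing `_4.bmp` as required marks the set incomplete, while listing it as optional leaves the set complete and simply records that no .bmp was found.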