Identify digital derivatives - Georgetown-University-Libraries/File-Analyzer GitHub Wiki
Given the following set of digital objects, find the complete sets of derivative objects.
- "GTW_brosnan_b106_e0008_1.tif"
- "GTW_brosnan_b106_e0008_2.jpg"
- "GTW_brosnan_b106_e0008_3.jpg"
- "GTW_brosnan_b106_e0054_1.tif"
- "GTW_brosnan_b106_e0054_2.jpg"
- "GTW_brosnan_b106_e0054_3.jpg"
- "GTW_brosnan_b106_e0055_1.tif"
- "GTW_brosnan_b106_e0055_2.jpg"
- "GTW_brosnan_b106_e0055_3.jpg"
- "GTW_brosnan_b106_e0065_1.tif"
- "GTW_brosnan_b106_e0065_2.jpg"
- "GTW_brosnan_b106_e0065_3.jpg"
- "GTW_brosnan_b106_e0n01_1.tif"
- "GTW_brosnan_b106_e0n01_2.jpg"
- "GTW_brosnan_b106_e0n01_3.jpg"
- "Thumbs.db"
The Digital Derivatives rule will help perform this function.
Default Match Rule
To keep this rule flexible, the user must enter a regular expression match that identifies common sets of items. The default pattern is shown here.
Match: ^(.*?)\.[^.]+$
- ^ - beginning of name
- (.*?) - grab any characters; the parentheses make this group #1
- \.[^.]+ - grab everything following the final period (including the period)
- $ - end of the name
Replacement:
- pull the contents that match the first parentheses group
File Extensions Required: .tif,.jpg
- look for tif files and jpg files
Note that the results of this rule are not very helpful: the _1, _2, and _3 suffixes remain part of group #1, so the derivatives are not grouped together.
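The effect of the default rule can be sketched outside the tool. The snippet below uses Python's `re` module as a stand-in for the tool's own matcher (an assumption, not the File Analyzer implementation), with the period before the extension escaped as `\.` so that it matches a literal period. The helper name `default_key` is hypothetical.

```python
import re

# Sample file names from the listing above.
ids = ["e0008", "e0054", "e0055", "e0065", "e0n01"]
suffixes = ["_1.tif", "_2.jpg", "_3.jpg"]
files = [f"GTW_brosnan_b106_{i}{s}" for i in ids for s in suffixes] + ["Thumbs.db"]

# Default match rule; the escaped period is an assumption about the intended pattern.
DEFAULT_MATCH = re.compile(r"^(.*?)\.[^.]+$")

def default_key(name):
    """Return group #1 (the name without its final extension), or None if no match."""
    m = DEFAULT_MATCH.match(name)
    return m.group(1) if m else None

keys = {name: default_key(name) for name in files}
for name, key in keys.items():
    print(f"{name} -> {key}")

# Every file ends up with its own key, so nothing is grouped.
print(len(set(keys.values())), "keys for", len(files), "files")
```

Because the `_1`, `_2`, and `_3` suffixes stay inside group #1, each file name produces a distinct key.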
Customized Match Rule
Match: ^(.*?)(_\d)?\.[^.]+$
- ^ - beginning of name
- (.*?) - grab any characters; the parentheses make this group #1
- Note the *? makes this a non-greedy rule, so group #1 stops before the optional _ + digit
- (_\d)? - look for _ + a digit preceding the file extension
- \.[^.]+ - grab everything following the final period (including the period)
- $ - end of the name
Replacement: $1
- pull the contents that match the first parentheses group
File Extensions Required: .tif,.jpg
- look for tif files and jpg files
The new results are an improvement.
Note that duplicate items were found for .jpg: the _2.jpg and _3.jpg files of each object now share the same group name.
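The grouping produced by the customized rule, and the resulting .jpg duplicates, can be illustrated with a short sketch. As before, Python's `re` module stands in for the tool's matcher, the period is escaped as `\.` (assumed intent), and the `$1` replacement is modeled by reading group 1 of the match. The variable names are hypothetical.

```python
import re
from collections import defaultdict
from os.path import splitext

# Sample file names from the listing above.
ids = ["e0008", "e0054", "e0055", "e0065", "e0n01"]
suffixes = ["_1.tif", "_2.jpg", "_3.jpg"]
files = [f"GTW_brosnan_b106_{i}{s}" for i in ids for s in suffixes] + ["Thumbs.db"]

# Customized match rule: an optional "_" + digit is kept out of group #1.
CUSTOM_MATCH = re.compile(r"^(.*?)(_\d)?\.[^.]+$")

# Group the extension of every file under its group #1 key.
groups = defaultdict(list)
for name in files:
    m = CUSTOM_MATCH.match(name)
    if m:
        groups[m.group(1)].append(splitext(name)[1])

# Report extensions that occur more than once for the same key.
duplicates = {
    key: sorted({ext for ext in exts if exts.count(ext) > 1})
    for key, exts in groups.items()
}

for key, exts in groups.items():
    print(key, exts, "duplicates:", duplicates[key])
```

Each object now collects one .tif and two .jpg entries under a single key, which is why .jpg is reported as a duplicate; `Thumbs.db` forms a group of its own with neither required extension.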
One additional refinement
Note the results
Note the results if the required file extensions are modified
Required extensions: _1.tif,_2.jpg
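One way to picture suffix-style required extensions is to treat each entry as a file name ending and to check, per object, which endings are missing. This is only a sketch under that assumption; it does not reproduce how the File Analyzer evaluates the rule, and `missing_by_key` is a hypothetical helper.

```python
# Sample file names from the listing above.
ids = ["e0008", "e0054", "e0055", "e0065", "e0n01"]
suffixes = ["_1.tif", "_2.jpg", "_3.jpg"]
files = [f"GTW_brosnan_b106_{i}{s}" for i in ids for s in suffixes] + ["Thumbs.db"]

def missing_by_key(names, required):
    """Map each object key to the set of required suffixes it lacks.

    A name contributes only if it ends with one of the required suffixes;
    the key is the name with that suffix removed.
    """
    found = {}
    for name in names:
        for suffix in required:
            if name.endswith(suffix):
                found.setdefault(name[: -len(suffix)], set()).add(suffix)
                break
    return {key: set(required) - present for key, present in found.items()}

missing = missing_by_key(files, ["_1.tif", "_2.jpg"])
for key, lacking in missing.items():
    status = "complete" if not lacking else f"missing {sorted(lacking)}"
    print(key, status)
```

With `_1.tif` and `_2.jpg` required, the `_3.jpg` files and `Thumbs.db` match no required suffix and are ignored, so every object is reported as complete and the .jpg duplicates disappear.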
Add a required extension that does not exist
Required extensions: _1.tif,_2.jpg,_4.bmp
Make the new extension optional
Required extensions: _1.tif,_2.jpg,_3.jpg
Optional extensions: _4.bmp
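The difference between requiring `_4.bmp` and making it optional can be sketched as follows. The snippet assumes that required and optional entries are compared as file name endings; that is an illustration, not the tool's documented behavior, and `classify` is a hypothetical helper.

```python
# Sample file names from the listing above.
ids = ["e0008", "e0054", "e0055", "e0065", "e0n01"]
suffixes = ["_1.tif", "_2.jpg", "_3.jpg"]
files = [f"GTW_brosnan_b106_{i}{s}" for i in ids for s in suffixes] + ["Thumbs.db"]

def classify(names, required, optional=()):
    """Map each object key to (missing required suffixes, optional suffixes present)."""
    present = {}
    for name in names:
        for suffix in (*required, *optional):
            if name.endswith(suffix):
                present.setdefault(name[: -len(suffix)], set()).add(suffix)
                break
    return {
        key: (sorted(set(required) - have), sorted(set(optional) & have))
        for key, have in present.items()
    }

# _4.bmp required: no object has it, so every set is incomplete.
strict = classify(files, ["_1.tif", "_2.jpg", "_4.bmp"])

# _4.bmp optional: its absence no longer makes a set incomplete.
relaxed = classify(files, ["_1.tif", "_2.jpg", "_3.jpg"], ["_4.bmp"])

for key in strict:
    print(key, "strict missing:", strict[key][0], "relaxed missing:", relaxed[key][0])
```

Under the strict setting every object is missing `_4.bmp`; once the entry is moved to the optional list, all five objects have their full required set.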