File Analyzer Training Code4Lib 2014 - Georgetown-University-Libraries/File-Analyzer GitHub Wiki
- Install and build the File Analyzer (required): Installation instructions
- Send Terry a quick note confirming that you were able to complete the installs. At the end of the pre-conference session, we will code a custom File Analyzer rule. In your email, indicate your level of experience/comfort programming in Java. This portion of the session will be tailored to the experience of the audience.
- A Java IDE is recommended for the last portion of the pre-conference. If you do not already have a Java IDE available, consider installing the Eclipse Standard Edition: https://www.eclipse.org/downloads/
- File Analyzer Overview
- Try it yourself
- Demonstration of highly customized File Analyzer Rules
- Your ideas for future customizations
- Coding a File Analyzer rule
User documentation is available at the link listed above.
- User interface - searching the file system
- User interface - viewing results
- Sorting results
- Filtering results
- Exporting results
- User interface - importing records from a file
- User interface - merging and comparing results
Sample data files corresponding to these exercises will be provided at the start of the pre-conference session. Download the exercise test files from GitHub. Extract the contents of the zip file after you download it.
Run "Count Files by Type" on the "01_Flash Drive Inventory" folder.
- Sort the results from highest count to lowest count. What file type occurs most frequently?
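For readers who want to see the idea behind this exercise in code, here is a minimal stand-alone sketch in Java. It is not the File Analyzer's own implementation: the class name `FileTypeCounter` and its methods are hypothetical, and it assumes that "type" means the lower-cased file extension.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not part of the File Analyzer code base.
class FileTypeCounter {

    // Return the lower-cased extension of a file name, or "" when there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Walk the folder recursively and count regular files per extension.
    static Map<String, Long> countByType(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .map(p -> extensionOf(p.getFileName().toString()))
                    .collect(Collectors.groupingBy(ext -> ext, TreeMap::new, Collectors.counting()));
        }
    }

    // Order the entries from highest count to lowest count.
    static List<Map.Entry<String, Long>> sortedByCountDesc(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // The folder to scan is taken from the first argument; "." is only a fallback.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : sortedByCountDesc(countByType(root))) {
            String label = e.getKey().isEmpty() ? "(none)" : e.getKey();
            System.out.println(label + "\t" + e.getValue());
        }
    }
}
```

Run against the extracted "01_Flash Drive Inventory" folder, the first line printed would be the most frequent extension under these assumptions.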
Run "Match by Name" on the "01_Flash Drive Inventory" folder.
- Which file names have been duplicated?
- Remove your open tabs
Run "Match by Base Name"
- on the PDFs folder
- run it again on the Word Docs folder
- Which Word document does not have a corresponding PDF?
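The comparison behind this exercise can be sketched in a few lines of Java. This is an illustration only, not the File Analyzer's rule: the class `BaseNameMatcher`, its methods, and the file names in `main` are all hypothetical, and the sketch assumes a case-sensitive match on the name with its final extension removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of the File Analyzer code base.
class BaseNameMatcher {

    // Strip the final extension: "minutes.docx" becomes "minutes".
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot <= 0 ? fileName : fileName.substring(0, dot);
    }

    // Return the candidates whose base name does not occur among the reference names.
    static List<String> unmatched(Collection<String> candidates, Collection<String> reference) {
        Set<String> referenceBases = new HashSet<>();
        for (String name : reference) {
            referenceBases.add(baseName(name));
        }
        List<String> result = new ArrayList<>();
        for (String name : candidates) {
            if (!referenceBases.contains(baseName(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up file names standing in for the Word Docs and PDFs folders.
        List<String> wordDocs = Arrays.asList("agenda.docx", "budget.docx", "minutes.docx");
        List<String> pdfs = Arrays.asList("agenda.pdf", "minutes.pdf");
        System.out.println(unmatched(wordDocs, pdfs)); // prints [budget.docx]
    }
}
```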
Remove the tabs from all of your prior tests.
Run "Sort by Checksum" looking only at image files
- on the Checksum Tests folder.
- run it again on the Checksum Tests2 folder.
- Which files are not identical between the two folders?
- Remove the tab for your test on the Checksum Tests2 folder.
- Export the results from your first "Sort by Checksum" task as a tab-delimited file. Export only the key and data fields.
- Import your checksum results using "Import Delimited File"
- Use the merge tool to compare your imported file to the results from your checksum test
- No differences should exist
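As a rough illustration of what a checksum comparison between two folders involves, the following Java sketch uses the standard `java.security.MessageDigest` class. It is not the File Analyzer's implementation: the class `ChecksumCompare` and its methods are hypothetical, the algorithm name is passed in by the caller, and, unlike the exercise, the sketch does not restrict itself to image files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch; not part of the File Analyzer code base.
class ChecksumCompare {

    // Render a digest as lower-case hexadecimal.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Stream the file through the digest so large files are not loaded into memory.
    static String checksum(Path file, String algorithm) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    // Names of files in folderA that are missing from folderB or differ in content.
    static List<String> differing(Path folderA, Path folderB, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> listing = Files.list(folderA)) {
            listing.filter(Files::isRegularFile).forEach(files::add);
        }
        List<String> result = new ArrayList<>();
        for (Path a : files) {
            Path b = folderB.resolve(a.getFileName());
            if (!Files.isRegularFile(b) || !checksum(a, algorithm).equals(checksum(b, algorithm))) {
                result.add(a.getFileName().toString());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java ChecksumCompare <folderA> <folderB>");
            return;
        }
        // SHA-256 is an arbitrary choice here; MD5 and SHA-1 are also standard names.
        System.out.println(differing(Paths.get(args[0]), Paths.get(args[1]), "SHA-256"));
    }
}
```

Matching files by name and then comparing digests mirrors the exercise: two files with the same name but different checksums are the ones that are "not identical".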
- Sample text: https://www.nga.gov/collection/anA5.htm
- Regex:
^([^,]+), ([^\t]+)\t([^,]+).*(\d\d\d\d).*(\d\d\d\d).*$
- Sample text: http://en.wikipedia.org/wiki/Internet_media_type - save source as text
- Regex:
^.*<code>(.*)</code>:.*$
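To see what the two regular expressions above capture, they can be tried outside the File Analyzer with `java.util.regex`. The sketch below is only an illustration: the class `ImportRegexDemo` is hypothetical, and both sample lines are made up to resemble the kind of text the patterns expect (a "Last, First", tab, nationality, and two four-digit years for the first pattern; a `<code>` element followed by a colon for the second). The real pages may be laid out differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of the File Analyzer code base.
class ImportRegexDemo {

    // The two patterns from the exercises, escaped for a Java string literal.
    static final Pattern ARTIST =
            Pattern.compile("^([^,]+), ([^\\t]+)\\t([^,]+).*(\\d\\d\\d\\d).*(\\d\\d\\d\\d).*$");
    static final Pattern MEDIA_TYPE =
            Pattern.compile("^.*<code>(.*)</code>:.*$");

    // Return the captured groups for a line, or null when the line does not match.
    static String[] groups(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] captured = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            captured[i - 1] = m.group(i);
        }
        return captured;
    }

    public static void main(String[] args) {
        // Made-up input lines; each captured group would become one column on import.
        String artistLine = "Doe, Jane\tAmerican, 1900 - 1980";
        String mediaLine = "<li><code>application/json</code>: JSON data</li>";

        System.out.println(String.join(" | ", groups(ARTIST, artistLine)));
        // prints: Doe | Jane | American | 1900 | 1980
        System.out.println(String.join(" | ", groups(MEDIA_TYPE, mediaLine)));
        // prints: application/json
    }
}
```

Because `.*` is greedy, the second pattern captures from the last `<code>` on a line, so a line containing several `<code>` elements yields only the final one.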
- http://catalog.data.gov/dataset/public-library-survey-pls-2011 (US Public libraries, 2011)
- Key column 1,2,8
- Image Properties
- Page Count
- COUNTER-compliant report validation
- Output to Bursar processing*
- Invoice processing*
- Identify digital derivatives
- ETD Processing for DSpace ingest
*institution-specific solution
The project to be implemented will be determined by the interest of the group.
Parse MARC records and validate custom business logic
- MARC-File-Analyzer
- Sample MARC files: https://archive.org/details/unc_catalog_marc
Analyze Digital Image Properties
- Enhance the Image Properties Task
- Some sample images: http://commons.wikimedia.org/wiki/Libraries (click through and download a handful)
PDF Introspection
- Enhance the Page Count Task
- Some sample PDFs: http://code4lib.org/conference/2009/schedule (download a handful from the bottom of the page)