December 01, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

  • Demo Post-Processing
  • Issues Review
  • To Database or Not to Database

Meeting Notes

  • Post-Processing Demo from Amy
    • loads the domain output, the Twitter output, and the scope file to perform matching
    • the "referring record id" field contains the IDs of all entities that mention a given record
    • when run on the Twitter output, issues arise from newlines embedded in tweet text
      • quotation marks around a string containing a newline let the CSV reader parse it correctly, but the quotation marks aren't consistently present
      • incorporate a check for quotation marks and add them where they are missing?
      • check the original repository to see whether it already has a fix for this
      • for content that has already been scraped, we will need to check by domain and apply a regex replacement (see the sketch after this list)
    • To Dos
      • feed individual CSVs to the post-processor
      • fix the formatting in already-scraped content
      • incorporate the formatting fix into the processor
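
As a rough illustration of the regex repair discussed above, here is a minimal sketch. It assumes, hypothetically, that every well-formed record line begins with a numeric ID followed by a comma; any line that does not is treated as a continuation of the previous record's text field. The real Twitter output may need a different record-start pattern.

```python
import csv
import re
from pathlib import Path

# Hypothetical assumption: a well-formed record line starts with a numeric
# ID followed by a comma; anything else is a continuation of the previous
# record's text field, broken by an unquoted newline.
RECORD_START = re.compile(r"^\d+,")

def repair_twitter_csv(src: Path, dst: Path) -> None:
    """Re-join broken lines, then rewrite the CSV with consistent quoting."""
    merged: list[str] = []
    for line in src.read_text(encoding="utf-8").splitlines():
        if RECORD_START.match(line) or not merged:
            merged.append(line)
        else:
            # Fold the stray continuation back in, replacing the raw
            # newline with a space.
            merged[-1] += " " + line
    with dst.open("w", newline="", encoding="utf-8") as out:
        csv.writer(out, quoting=csv.QUOTE_ALL).writerows(csv.reader(merged))

repair_twitter_csv(Path("twitter_output.csv"), Path("twitter_output_fixed.csv"))
```

Rewriting with QUOTE_ALL means every field is quoted on output, so downstream readers no longer depend on the inconsistent quoting in the source files.
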
  • Domain crawler & monitoring its progress
    • by inspecting the results directory, which contains the JSON files from the crawl, we can report the number of links found (NOT domains) - see the sketch after this list
      • monitoring the results directory, rather than debug.log, may take less time
    • failed_links_list.json lists referred-to links that are out of scope, sorted by domain
      • this file is created at the end of the crawl - should it be written incrementally instead?
      • we want the total referral count included alongside each out-of-scope domain
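
A minimal sketch of the link count described above. The "Results" directory name and the "found_urls" key are assumptions; substitute whatever the crawler actually writes.

```python
import json
from pathlib import Path

def count_found_links(results_dir: str = "Results") -> int:
    """Tally links recorded across all crawl-result JSON files.

    Assumes each result file stores its outgoing links under a
    "found_urls" key (hypothetical; match the crawler's actual schema).
    """
    total = 0
    for path in Path(results_dir).glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        total += len(record.get("found_urls", []))
    return total

print(f"links found so far: {count_found_links()}")
```
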
  • Accepting a .csv file from the parser to populate the initial queue
    • completed, but still needs to be tested (one possible shape is sketched below)
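
For reference, a sketch of seeding the crawl queue from a CSV. The "Source" column name is an assumption about the parser's output; match the actual header.

```python
import csv
from collections import deque

def load_initial_queue(csv_path: str) -> "deque[str]":
    """Seed the crawl queue from the parser's CSV output.

    Assumes one entry per row with the URL in a "Source" column
    (hypothetical column name).
    """
    queue: deque[str] = deque()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("Source") or "").strip()
            if url:
                queue.append(url)
    return queue
```
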
  • Scope parser validation
    • URL validation is complete; a pull request still needs to be created (one possible check is sketched below)
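
One way the URL check might look; this is a sketch, not necessarily the validation in the pending pull request.

```python
from urllib.parse import urlparse

def is_valid_url(candidate: str) -> bool:
    """Accept a scope entry only if it has an http(s) scheme and a host."""
    parsed = urlparse(candidate.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

assert is_valid_url("https://example.com/news")
assert not is_valid_url("not a url")
```
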
  • New repository with API-based code
    • need to verify that the functions of the "pre-post processor" work with this API version of the crawler
  • Pre-processor issue closed
  • Metascraper crawl
    • from the output directory of JSONs, the metascraper will read each file and either write the new information back to it OR create new files, while monitoring progress
    • challenges
      • keeping track of which files have been metascraped
      • would using a reserved character to track this be valid (e.g. a file whose UUID starts as 123 is renamed to m123 once scraped)? see the sketch below
    • a database approach is also to be researched
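
A sketch of the reserved-character idea. Since UUID filenames are hexadecimal, no legitimate name begins with "m", so the marker is unambiguous; the "Results" directory name is an assumption.

```python
from pathlib import Path

SCRAPED_MARKER = "m"  # reserved leading character marking a processed file

def is_scraped(path: Path) -> bool:
    # Hex UUID filenames never begin with "m", so the prefix is unambiguous.
    return path.name.startswith(SCRAPED_MARKER)

def mark_scraped(path: Path) -> Path:
    """Rename 123.json -> m123.json once its metadata has been written."""
    return path.rename(path.with_name(SCRAPED_MARKER + path.name))

def unscraped_files(results_dir: str = "Results"):
    """Yield only the result files that still need metascraping."""
    return (p for p in Path(results_dir).glob("*.json") if not is_scraped(p))
```

One trade-off to weigh against the database approach: a rename survives crashes and restarts without extra state, but it changes filenames that other tools (e.g. the post-processor) may expect to match the original UUIDs.
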