December 01, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki
- Demo Post-Processing
- Issues Review
- To Database or Not to Database
Meeting Notes
- Post-Processing Demo from Amy
- loads in the domain output, the Twitter output, and the scope to perform matching
- the "referring record id" field contains the IDs of all entities that mention the given record
- when run on the Twitter output, newlines inside tweet text caused problems
- quotation marks around a string containing a newline allow it to be read correctly, but the quotation marks aren't consistently present
- incorporate a check for quotation marks and add them where they do not appear?
- check original repository to see if there is a resolver for it
- for information that is already scraped, will need to check by domain and do a regex replacement
- To Dos
- feeding individual CSVs to post-processor
- fix formatting in already-scraped content
- incorporate fix for formatting into processor
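
The quoting issue above can be illustrated with a minimal sketch, assuming the outputs are plain CSV files handled with Python's standard `csv` module (the notes do not say which tooling the post-processor uses). The helper names `write_rows` and `read_rows` are hypothetical. Quoting every field on write means a field with an embedded newline is read back as one value rather than being split into two rows.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows to CSV text, quoting every field (hypothetical helper)."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse CSV text back into rows; quoted newlines stay inside their field."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Example data is made up for illustration.
rows = [["id", "text"], ["1", "first line\nsecond line"]]
round_tripped = read_rows(write_rows(rows))
```

Content that was already scraped without consistent quoting would still need a separate repair pass, since a reader cannot tell an unquoted embedded newline from a row boundary.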
- Domain crawler & monitoring its progress
- by looking at the results directory, which contains the JSONs of the crawl, the number of links found (NOT domains) can be output
- monitoring the results directory instead of debug.log may take less time
- failed_links_list.json outputs referred-to links that are not in-scope, sorted by domain
- this is created at the end of the crawl - should this be made incrementally instead?
- want the total number of referrals included alongside each out-of-scope domain
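
A rough sketch of the two monitoring ideas above, under stated assumptions: each crawl result is a JSON object, and the links it found sit in a list under a key called `found_urls`. That key name and both function names are hypothetical; the real output schema is not described in these notes.

```python
import json
from collections import Counter
from pathlib import Path
from urllib.parse import urlsplit


def count_links(results_dir, key="found_urls"):
    """Sum the links recorded across all JSON files in results_dir.

    `key` is an assumed field name, not the crawler's confirmed schema.
    """
    total = 0
    for path in Path(results_dir).glob("*.json"):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        total += len(record.get(key, []))
    return total


def referral_counts(links):
    """Count referrals per domain for a list of (out-of-scope) links."""
    counts = Counter()
    for link in links:
        host = urlsplit(link).hostname
        if host:  # skip strings that have no parseable host
            counts[host] += 1
    return counts
```

Because `count_links` only reads whatever files exist at the moment it runs, it can be called repeatedly during a crawl. `referral_counts(...).most_common()` gives the domains ordered by referral total.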
- Accepting a .csv file from the parser to populate initial queue
- completed, needs to be tested
- Scope parser validation
- URL validation is complete, need to create pull request
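
The notes do not list which checks the scope parser applies. As an illustration only, a minimal URL check might require an http(s) scheme and a host. The function name `is_valid_url` is hypothetical and is not taken from the project's code.

```python
from urllib.parse import urlsplit


def is_valid_url(url):
    """Return True if url has an http/https scheme and a host (assumed rule)."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. a broken IPv6 host
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```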
- New repository with API based code
- need to verify that functions of the "pre-post processor" work on this API version of the crawler
- Pre-processor issue closed
- Metascraper crawl
- from output directory of JSONs, will read and write new information to files OR create new files, while monitoring progress
- challenges
- keeping track of which files have been metascraped
- would using a reserved prefix character to track this be valid (e.g. a UUID starts as 123 and becomes m123 once scraped)?
- database approach to also be researched
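
The reserved-character idea can be sketched as follows, assuming one JSON file per record, named by its ID, and assuming the marker is applied to the file name. The notes leave open whether the marker would go on the file name or on the ID stored inside the file. All names here (`MARK`, `pending`, `mark_scraped`) are hypothetical.

```python
from pathlib import Path

MARK = "m"  # assumed marker; "m" is not a hex digit, so it cannot begin a standard UUID


def is_scraped(path):
    """A file counts as metascraped once its name carries the marker prefix."""
    return path.name.startswith(MARK)


def pending(results_dir):
    """List JSON files in results_dir that have not been marked yet."""
    return sorted(p for p in Path(results_dir).glob("*.json") if not is_scraped(p))


def mark_scraped(path):
    """Rename e.g. 123.json to m123.json and return the new path."""
    target = path.with_name(MARK + path.name)
    path.rename(target)
    return target
```

With this scheme the directory listing itself is the progress record, so an interrupted run can resume by calling `pending` again. The trade-off is that anything referring to a file by its original name must know about the prefix, which is one reason the database approach is also worth comparing.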