December 01, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

  • Demo Post-Processing
  • Issues Review
  • To Database or Not to Database

Meeting Notes

  • Post-Processing Demo from Amy
    • loads the domain output, the Twitter output, and the scope file to perform matching
    • the "referring record id" field contains the IDs of all entities that mention a given record
    • when run on the Twitter output, issues arise from newlines embedded in tweet text
      • quotation marks around a string containing a newline let the CSV reader parse it correctly, but the quotation marks aren't consistently present
      • incorporate a check for quotation marks and add them where they are missing?
      • check the original repository to see whether it already has a fix for this
      • for content that has already been scraped, we will need to check by domain and apply a regex replacement (see the sketch after this list)
    • To Dos
      • feed individual CSVs to the post-processor
      • fix the formatting in already-scraped content
      • incorporate the formatting fix into the processor
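
As a rough illustration of the regex repair discussed above, here is a minimal sketch. It assumes, hypothetically, that every well-formed record line begins with a numeric ID followed by a comma; any line that does not is treated as a continuation of the previous record's text field. The real Twitter output may need a different record-start pattern.

```python
import csv
import re
from pathlib import Path

# Hypothetical assumption: a well-formed record line starts with a numeric
# ID followed by a comma; anything else is a continuation of the previous
# record's text field, broken by an unquoted newline.
RECORD_START = re.compile(r"^\d+,")

def repair_twitter_csv(src: Path, dst: Path) -> None:
    """Re-join broken lines, then rewrite the CSV with consistent quoting."""
    merged: list[str] = []
    for line in src.read_text(encoding="utf-8").splitlines():
        if RECORD_START.match(line) or not merged:
            merged.append(line)
        else:
            # Fold the stray continuation back in, replacing the raw
            # newline with a space.
            merged[-1] += " " + line
    with dst.open("w", newline="", encoding="utf-8") as out:
        csv.writer(out, quoting=csv.QUOTE_ALL).writerows(csv.reader(merged))

repair_twitter_csv(Path("twitter_output.csv"), Path("twitter_output_fixed.csv"))
```

Rewriting with QUOTE_ALL means every field is quoted on output, so downstream readers no longer depend on the inconsistent quoting in the source files.
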
  • Domain crawler & monitoring its progress
    • by inspecting the results directory, which contains the JSON files from the crawl, we can report the number of links found (NOT domains) - see the sketch after this list
      • monitoring the results directory, rather than debug.log, may take less time
    • failed_links_list.json lists referred-to links that are out of scope, sorted by domain
      • this file is created at the end of the crawl - should it be written incrementally instead?
      • we want the total referral count included alongside each out-of-scope domain
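
A minimal sketch of the link count described above. The "Results" directory name and the "found_urls" key are assumptions; substitute whatever the crawler actually writes.

```python
import json
from pathlib import Path

def count_found_links(results_dir: str = "Results") -> int:
    """Tally links recorded across all crawl-result JSON files.

    Assumes each result file stores its outgoing links under a
    "found_urls" key (hypothetical; match the crawler's actual schema).
    """
    total = 0
    for path in Path(results_dir).glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        total += len(record.get("found_urls", []))
    return total

print(f"links found so far: {count_found_links()}")
```
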
  • Accepting a .csv file from the parser to populate the initial queue
    • completed, but still needs to be tested (one possible shape is sketched below)
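
For reference, a sketch of seeding the crawl queue from a CSV. The "Source" column name is an assumption about the parser's output; match the actual header.

```python
import csv
from collections import deque

def load_initial_queue(csv_path: str) -> "deque[str]":
    """Seed the crawl queue from the parser's CSV output.

    Assumes one entry per row with the URL in a "Source" column
    (hypothetical column name).
    """
    queue: deque[str] = deque()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("Source") or "").strip()
            if url:
                queue.append(url)
    return queue
```
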
  • Scope parser validation
    • URL validation is complete; a pull request still needs to be created (one possible check is sketched below)
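
One way the URL check might look; this is a sketch, not necessarily the validation in the pending pull request.

```python
from urllib.parse import urlparse

def is_valid_url(candidate: str) -> bool:
    """Accept a scope entry only if it has an http(s) scheme and a host."""
    parsed = urlparse(candidate.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

assert is_valid_url("https://example.com/news")
assert not is_valid_url("not a url")
```
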
  • New repository with API-based code
    • need to verify that the functions of the "pre-post processor" work with this API version of the crawler
  • Pre-processor issue closed
  • Metascraper crawl
    • from the output directory of JSONs, the metascraper will read each file and either write the new information back to it OR create new files, while monitoring progress
    • challenges
      • keeping track of which files have been metascraped
      • would using a reserved character to track this be valid (e.g. a file whose UUID starts as 123 is renamed to m123 once scraped)? see the sketch below
    • a database approach is also to be researched
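
A sketch of the reserved-character idea. Since UUID filenames are hexadecimal, no legitimate name begins with "m", so the marker is unambiguous; the "Results" directory name is an assumption.

```python
from pathlib import Path

SCRAPED_MARKER = "m"  # reserved leading character marking a processed file

def is_scraped(path: Path) -> bool:
    # Hex UUID filenames never begin with "m", so the prefix is unambiguous.
    return path.name.startswith(SCRAPED_MARKER)

def mark_scraped(path: Path) -> Path:
    """Rename 123.json -> m123.json once its metadata has been written."""
    return path.rename(path.with_name(SCRAPED_MARKER + path.name))

def unscraped_files(results_dir: str = "Results"):
    """Yield only the result files that still need metascraping."""
    return (p for p in Path(results_dir).glob("*.json") if not is_scraped(p))
```

One trade-off to weigh against the database approach: a rename survives crashes and restarts without extra state, but it changes filenames that other tools (e.g. the post-processor) may expect to match the original UUIDs.
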