June 9, 2022

Agenda:

  1. Shensong to write documentation on the following: the crawler numbers for registering errors and pausing the crawler, breaking when the queue goes to 0, and the Apify crawl in rounds.
  2. Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
  3. Shensong to comment back to the Apify developers so they are aware of the limitations of their error reporting.
  4. Shensong to continue working on the post-processor refactoring.
  5. Shensong to send Kirsta and Alejandro info re: data structure.

postprocessor refactoring

  • old postprocessor: over a thousand lines
  • now 4 parts
    • 1: input: loads the scope CSV into a dictionary, saved in /saved/ as JSON (easier to debug); a separate Twitter scope is built for the postprocessor
      • results: saved in Parquet format
    • 2: postprocessor: first postprocesses the Twitter and domain data separately to find citation aliases, propagate tags, names, etc.; then cross-references the domain and Twitter data
    • 3: post-postprocessor:
    • 4: post-utils: helper files - writing to files given a dictionary, and a row parser
  • everything is now written in Dask with DataFrame partitions; everything is imported up front (see the sketch after this list)
  • Dask allows the task graph to be visualized
  • metascraper now saves to CSV - working fine
  • benchmark on 40,000 KPP records: 1 min to load, 1 min 3 s to postprocess, and a few seconds to create output
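
A minimal sketch of how the Dask-based parts might fit together; the file paths and column names here (scope.csv, saved/scope.json, crawl.parquet, domain) are illustrative assumptions, not the project's actual layout:

```python
import csv
import json
import dask.dataframe as dd

def load_scope(path="scope.csv"):
    """Part 1 (input): load the scope CSV into a dict and save it
    as JSON for easier debugging. Column names are hypothetical."""
    scope = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scope[row["url"]] = {"tags": row.get("tags", "")}
    with open("saved/scope.json", "w") as out:
        json.dump(scope, out, indent=2)
    return scope

def postprocess(scope, results_path="crawl.parquet"):
    """Part 2 (postprocessor): crawl results live in Parquet and are
    processed as a partitioned Dask DataFrame."""
    df = dd.read_parquet(results_path)
    in_scope = df[df["domain"].isin(list(scope))]
    # Dask builds a lazy task graph; .visualize() can render it
    # (requires graphviz), which is the graphing mentioned above.
    # in_scope.visualize(filename="graph.svg")
    return in_scope.compute()
```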

data structure

  • same structure as before, but adding Twitter counts (retweets/likes, etc.); a rough sketch follows
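
As an illustration, a single output record might look like this with the new counts; every field name below is hypothetical, not the project's actual schema:

```python
# Same record structure as before, plus Twitter engagement counts.
# All field names are illustrative.
record = {
    "url": "https://example.com/article",
    "citations": ["https://twitter.com/someuser/status/123"],
    "tags": ["politics"],
    # newly added Twitter counts
    "retweet_count": 42,
    "like_count": 310,
    "reply_count": 7,
}
```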

Crawl updates

  • small-domain crawl still running - 308,000 URLs crawled thus far
  • NYT politics archive - done; will postprocess with the new postprocessor and benchmark each part
  • theguardian - still going - about 800,000 URLs crawled over 2 weeks with a few breaks; need to slow it down

documentation

  • stealth mode documentation is done

action items:

  • documentation and new repo for new postprocessor
  • Twitter: embedded tweet issue
  • testing the new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies

Backburner

  • Apify pre-navigation: probably needs a blacklist for each domain, but we could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or Twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now (a rough sketch of the filters appears after this list).
  • what to do with htz.li
  • the language-finding function
  • image_reference function
  • dealing with embedded versus cited tweets
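
If the backburnered link filters are ever picked up, the matching logic might look roughly like this; the regexes and function name are assumptions, and the non-scope check is omitted for brevity:

```python
import re

# Hypothetical filters for the two backburnered output additions.
CO_IL_RE = re.compile(r"^https?://(?:[\w-]+\.)+co\.il(?:/|$)", re.IGNORECASE)
TWEET_OR_HANDLE_RE = re.compile(
    r"^https?://(?:www\.)?twitter\.com/\w+(?:/status/\d+)?", re.IGNORECASE
)

def classify_link(url):
    """Return which backburnered rule (if any) a hyperlink matches."""
    if CO_IL_RE.match(url):
        return "domain ending in .co.il"
    if TWEET_OR_HANDLE_RE.match(url):
        return "tweet or Twitter handle"
    return None
```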