June 9, 2022

Agenda:

  1. Shensong to write documentation on the following: the crawler numbers for registering errors and pausing the crawler, breaking when the queue goes to 0, and the Apify crawl in rounds.
  2. Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
  3. Shensong to comment back to the Apify developers so they are aware of the limitations of their error reporting.
  4. Shensong to continue working on the post-processor refactoring.
  5. Shensong to send Kirsta and Alejandro info re: data structure.

postprocessor refactoring

  • old postprocessor: over a thousand lines
  • now 4 parts
    • 1: input: loads the scope CSV into a dictionary, saved in /saved/ as JSON (easier to debug); a separate Twitter scope is built for the postprocessor
      • results: saved in Parquet format
    • 2: postprocessor: first postprocesses the Twitter and domain data separately to find citation aliases, propagate tags, names, etc.; then cross-references the domain and Twitter data
    • 3: post-postprocessor:
    • 4: post-utils: helper files - writing to files given a dictionary, and a row parser
  • everything is now written in Dask with DataFrame partitions; everything is imported up front (see the sketch after this list)
  • Dask allows the task graph to be visualized
  • metascraper now saves to CSV - working fine
  • benchmark on 40,000 KPP records: 1 min to load, 1 min 3 s to postprocess, and a few seconds to create output
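
A minimal sketch of how the Dask-based parts might fit together; the file paths and column names here (scope.csv, saved/scope.json, crawl.parquet, domain) are illustrative assumptions, not the project's actual layout:

```python
import csv
import json
import dask.dataframe as dd

def load_scope(path="scope.csv"):
    """Part 1 (input): load the scope CSV into a dict and save it
    as JSON for easier debugging. Column names are hypothetical."""
    scope = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scope[row["url"]] = {"tags": row.get("tags", "")}
    with open("saved/scope.json", "w") as out:
        json.dump(scope, out, indent=2)
    return scope

def postprocess(scope, results_path="crawl.parquet"):
    """Part 2 (postprocessor): crawl results live in Parquet and are
    processed as a partitioned Dask DataFrame."""
    df = dd.read_parquet(results_path)
    in_scope = df[df["domain"].isin(list(scope))]
    # Dask builds a lazy task graph; .visualize() can render it
    # (requires graphviz), which is the graphing mentioned above.
    # in_scope.visualize(filename="graph.svg")
    return in_scope.compute()
```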

data structure

  • same structure as before, but adding Twitter counts (retweets/likes, etc.); a rough sketch follows
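
As an illustration, a single output record might look like this with the new counts; every field name below is hypothetical, not the project's actual schema:

```python
# Same record structure as before, plus Twitter engagement counts.
# All field names are illustrative.
record = {
    "url": "https://example.com/article",
    "citations": ["https://twitter.com/someuser/status/123"],
    "tags": ["politics"],
    # newly added Twitter counts
    "retweet_count": 42,
    "like_count": 310,
    "reply_count": 7,
}
```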

Crawl updates

  • small-domain crawl still running - 308,000 URLs crawled thus far
  • NYT politics archive - done; will postprocess with the new postprocessor and benchmark each part
  • theguardian - still going - about 800,000 URLs crawled over 2 weeks with a few breaks; need to slow it down

documentation

  • stealth mode documentation is done

action items:

  • documentation and new repo for new postprocessor
  • Twitter: embedded tweet issue
  • testing the new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies

Backburner

  • Apify pre-navigation: probably needs a blacklist for each domain, but we could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or Twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now (a rough sketch of the filters appears after this list).
  • what to do with htz.li
  • the language-finding function
  • image_reference function
  • dealing with embedded versus cited tweets
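
If the backburnered link filters are ever picked up, the matching logic might look roughly like this; the regexes and function name are assumptions, and the non-scope check is omitted for brevity:

```python
import re

# Hypothetical filters for the two backburnered output additions.
CO_IL_RE = re.compile(r"^https?://(?:[\w-]+\.)+co\.il(?:/|$)", re.IGNORECASE)
TWEET_OR_HANDLE_RE = re.compile(
    r"^https?://(?:www\.)?twitter\.com/\w+(?:/status/\d+)?", re.IGNORECASE
)

def classify_link(url):
    """Return which backburnered rule (if any) a hyperlink matches."""
    if CO_IL_RE.match(url):
        return "domain ending in .co.il"
    if TWEET_OR_HANDLE_RE.match(url):
        return "tweet or Twitter handle"
    return None
```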