June 9, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
Shensong to write documentation on the following: crawler numbers for error registering and pausing the crawler, brake when queue goes to 0, and Apify crawl in rounds.
Shensong to send NYT archive politics crawl to Alejandro after postprocessing.
Shensong to comment back to Apify developers so they are aware of limitations of error reporting.
Shensong to continue working on the post-processor refactoring.
Shensong to send Kirsta and Alejandro info re: data structure.
postprocessor refactoring
old postprocessor: over a thousand lines
now 4 parts
1: input: load the scope CSV into a dictionary; saved in /saved/ as JSON (easier to debug); another scope for Twitter: for postprocessor
results: saved in parquet format
2: postprocessor: first postprocess Twitter and domain data separately to find citation aliases, propagate tags, names, etc.; then cross-reference domain and Twitter data
3: post-postprocessor:
4: post-utils: helper files - write to files, given dictionary and row parser
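As a rough illustration of part 1, the sketch below loads a scope CSV into a dictionary and saves it as JSON so it is easy to inspect while debugging. The column names (`source`, `name`, `tags`), the `|` tag separator, the file names, and the `load_scope` / `save_scope` helpers are all assumptions made for illustration, not the actual scope schema or postprocessor code. It uses only the Python standard library rather than dask.

```python
import csv
import json
import os
import tempfile


def load_scope(csv_path):
    """Read a scope CSV into a dict keyed by the (hypothetical) 'source' column."""
    scope = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            scope[row["source"]] = {
                "name": row["name"],
                # Tags are assumed to be '|'-separated in this sketch.
                "tags": [t for t in row["tags"].split("|") if t],
            }
    return scope


def save_scope(scope, json_path):
    """Write the scope dictionary as indented JSON so it is easy to read."""
    os.makedirs(os.path.dirname(json_path), exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(scope, f, indent=2, sort_keys=True)


# Demo with a tiny made-up scope file.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "scope.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    f.write("source,name,tags\n")
    f.write("https://example.com,Example News,news|demo\n")
    f.write("@example,Example Handle,\n")

scope = load_scope(csv_path)
json_path = os.path.join(workdir, "saved", "scope.json")
save_scope(scope, json_path)
```

Keying the dictionary by source makes later lookups (for example, matching a cited URL or handle against the scope) a single dictionary access.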
everything is now written in dask, dataframe partitions; everything is imported before
dask allows for visualizations and graphs
metascraper now saves to CSV - working fine
benchmark: 40,000 KPP data: 1 min to load, 1 min 3 sec to postprocess, and a few seconds to create output
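One way to benchmark each part separately is a small timing helper like the sketch below. The `timed` context manager and the stage names are hypothetical, and the stage bodies are placeholders standing in for the real load, postprocess, and output steps.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of a stage under its name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Placeholder stages; the real load/postprocess/output steps would go here.
with timed("load"):
    records = [{"id": i} for i in range(1000)]

with timed("postprocess"):
    processed = [dict(r, seen=True) for r in records]

with timed("output"):
    lines = [str(r) for r in processed]

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f} s")
```

Note that with lazy frameworks such as dask, the timed block would need to include the step that actually triggers computation, otherwise only graph construction is measured.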
data structure
same structure but adding Twitter counts (retweets, likes, etc.)
Crawl updates
small domain still crawling - 308,000 crawled thus far
NYT politics archive - done, will postprocess with new postprocessor - benchmark each part
theguardian - still going - about 800,000 URLs crawled - 2 weeks with a few breaks and need to slow down
documentation
done on stealth mode
action items:
documentation and new repo for new postprocessor
Twitter: embedded tweet issue
testing new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies
Backburner
Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
using crawler proxies
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or Twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
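If this is picked up later, the two link rules above could be sketched roughly as follows. The `classify_link` function, the label strings, and the `scope_domains` set are hypothetical names for illustration, and the Twitter path pattern is a simplifying assumption: it would also match non-handle pages such as `/home` or `/search`, which a real implementation would need to exclude.

```python
import re
from urllib.parse import urlparse

TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Matches /<handle> or /<handle>/status/<tweet id>.
TWITTER_PATH = re.compile(
    r"^/(?P<handle>[A-Za-z0-9_]{1,15})(?:/status/(?P<tweet_id>\d+))?/?$"
)


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il', or None for a hyperlink."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        match = TWITTER_PATH.match(parsed.path)
        if match:
            return "tweet" if match.group("tweet_id") else "twitter_handle"
        return None
    # Compare domains without a leading 'www.'.
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "co_il"
    return None


scope_domains = {"inscope.co.il"}  # made-up in-scope domain
results = [
    classify_link("https://twitter.com/example/status/12345", scope_domains),
    classify_link("https://twitter.com/example", scope_domains),
    classify_link("https://www.example.co.il/article", scope_domains),
    classify_link("https://www.inscope.co.il/article", scope_domains),
    classify_link("https://example.com/page", scope_domains),
]
```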