May 12, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • look at the 403 errors: verify that the problem is not caused by the way we crawl
  • re-send KPP data with tags
  • postprocess the small domain crawl, excluding the domains that didn't work
  • look at using dask for postprocessing
  • start crawl of cnn.com

Domain Crawler

  • 403 issue
    • our IP address was blocked -- another instance, but crawl speed is probably the issue
    • Apify stealth mode: changes the browser fingerprint (a combination of data points such as browser, IP, etc.)
      • maybe try this with mondoweiss and middleeasteye with a random wait time of 1-2 sec
    • is it possible to get a block of IPs, or proxies that use different IPs?
    • an additional problem is getting blocked on the postprocessor call; possible fix: stringify the HTML in the crawler so that metascraper doesn't make an additional crawl
    • another possibility: crawl a lot of domains and sequence the calls, but this requires customizing Apify
  • question: adding text aliases and re-running scope?
  • CNN.com crawl?
    • like NYT, stuck crawling a lot of less useful stuff, less than 3000
    • could it be that we are blocked without getting a 403?
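
The slower, sequenced crawl discussed under the 403 issue could be paced roughly as follows. This is a minimal sketch in Python for illustration only, not the Apify crawler's actual code: `random_delay`, `crawl_sequentially`, and the `fetch` callable are hypothetical names, and the 1-2 second range is the one suggested in the notes.

```python
import random
import time
from typing import Callable, Dict, Iterable


def random_delay(min_s: float = 1.0, max_s: float = 2.0) -> float:
    """Return a wait time in seconds drawn uniformly from [min_s, max_s]."""
    return random.uniform(min_s, max_s)


def crawl_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch URLs one at a time, pausing a random 1-2 s between requests.

    `fetch` is a hypothetical stand-in for whatever performs the request;
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, only between requests.
            sleep(random_delay())
        pages[url] = fetch(url)
    return pages
```

Passing a recording function as `sleep` (for example `delays.append`) makes it easy to confirm the waits fall in the intended range before pointing the crawl at a real site.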

Twitter Crawler

  • one example from KPP data about embedded tweets -- not urgent

Postprocessor

  • dask?
  • small domain crawl postprocessor?
  • the postprocessor is very messy: it includes many different data structures and old code that is no longer useful

Action Items

  • try slower crawl with single call procedure (as discussed above)
  • Alejandro: look at proxies for crawling: https://www.blackdown.org/best-datacenter-proxies/
  • Monday meeting:
    • finish documenting where different data are on our server
    • question: adding text aliases and re-running scope?
    • one example from KPP data about embedded tweets -- not urgent
  • postprocessor refactoring -- to check back next week

Backburner

  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
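
The first backburner item (flagging non-scope `.co.il` links and links to tweets or Twitter handles) could be prototyped along these lines. This is a rough sketch under stated assumptions, not the postprocessor's existing code: `classify_link`, the label strings, and the host and path patterns are all hypothetical, and scope membership is checked by exact domain match only.

```python
import re
from urllib.parse import urlparse

# Assumed set of Twitter hostnames; extend as needed.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# /<handle>/status/<id> is treated as a tweet, /<handle> as a handle.
TWEET_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")
HANDLE_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/?$")


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il', or None for a hyperlink.

    `scope_domains` is a set of bare domains (no 'www.') already in scope.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in TWITTER_HOSTS:
        if TWEET_PATH.match(parsed.path):
            return "tweet"
        if HANDLE_PATH.match(parsed.path):
            return "twitter_handle"
        return None
    # Strip a leading 'www.' before comparing against the scope list.
    bare = host[4:] if host.startswith("www.") else host
    if host.endswith(".co.il") and bare not in scope_domains:
        return "co_il"
    return None
```

One known limitation of this sketch: single-segment Twitter paths that are not handles (such as `/home` or `/search`) would also be labelled `twitter_handle`, so a small exclusion list would likely be needed.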