May 19, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • try slower crawl with single call procedure (as discussed above)
  • Alejandro: look at proxies for crawling: https://www.blackdown.org/best-datacenter-proxies/
  • Monday meeting:
    • finish documenting where different data are on our server
    • question: adding text aliases and re-running scope?
    • one example from KPP data about embedded tweets -- not urgent
  • postprocessor refactoring -- to check back next week

Crawl Strategies

  • insights from crawling small sites & The Guardian
    • stealthy mode works well - no 403
    • using 2 threads also worked - no 403, even on middleeasteye
      • wait time of 3-4 seconds
      • still got ~100,000 pages per day with body HTML
      • problem is that the crawl speed can't be controlled directly
      • 1 thread with a 3-4 second wait is very slow and runs into problems with slow page loads, e.g. loading video, which can take 1-3 minutes at times
      • 2 threads let one thread keep going while the other is stuck on a slow page
    • using the single call to get body html doesn't really slow down the crawler
    • try crawling multiple domains with 2 threads each
  • possibility of a brake?
    • email in util crawl: send an email to the user and pause the crawler if it receives 10 x 403 errors or 10 x 429 errors
  • worth re-crawling nytimes.com?
    • probably not
  • use proxies? - later
  • start NYT Archive/keyword crawl on politics (sent by email) -- will set up
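
The 2-thread strategy above could be sketched roughly as follows. This is a minimal illustration, not the actual crawler: `fetch` is a placeholder for the real page fetch, and the pool size and wait range just mirror the numbers from the tests above.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Placeholder for the real page fetch (assumed to return body HTML)."""
    return f"<html>{url}</html>"


def crawl(urls, workers=2, wait_range=(3, 4)):
    """Crawl with a small worker pool and a polite per-request wait.

    Two workers let one thread keep going while the other is stuck on a
    slow page load; the 3-4 s wait matches what avoided 403s in testing.
    """
    def worker(url):
        time.sleep(random.uniform(*wait_range))  # polite delay before each request
        return url, fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(worker, urls))
```

Crawling multiple domains with 2 threads each would then just mean running one such pool per domain.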
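
The proposed brake could look roughly like this minimal sketch; the threshold, pause length, and `notify` hook (which would send the email) are assumptions, not existing util-crawl code.

```python
import time
from collections import Counter


class CrawlBrake:
    """Pause the crawl and notify the user after repeated block responses.

    Hypothetical sketch: in real use, `notify` would be a function that
    sends an email, and `pause_seconds` would be tuned to the site.
    """

    def __init__(self, threshold=10, pause_seconds=600,
                 notify=print, sleep=time.sleep):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self.notify = notify
        self.sleep = sleep
        self.counts = Counter()

    def record(self, status_code):
        """Record one HTTP status; brake on 10 x 403 or 10 x 429."""
        if status_code not in (403, 429):
            return  # successful responses leave the counters untouched
        self.counts[status_code] += 1
        if self.counts[status_code] >= self.threshold:
            self.notify(f"Crawler paused: {self.threshold} x {status_code} responses")
            self.sleep(self.pause_seconds)   # pause before resuming
            self.counts[status_code] = 0     # reset after the pause
```

The crawler would call `brake.record(response.status)` after every request.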

postprocessor refactoring - start now?

  • all methods and all code are in a single file
    • really hard to find things when debugging
    • recommendation: divide into multiple modules: input, finding citation refs, and others
    • also: more object oriented approach
    • re: data structures: data frame operations will help -- load the full data frame, then operate on columns/rows
      • start with a dask data frame; built-in functions handle sorting, etc.
      • common functions: load scope (small), load domain data (JSON -- currently looping and creating a dictionary), load Twitter data
        • instead of a dictionary, create a data frame
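
The "load domain data" idea above could be sketched like this (pandas shown for brevity; `dask.dataframe` exposes a largely pandas-compatible API for data that doesn't fit in memory). The record fields here are hypothetical placeholders for whatever the crawler actually emits.

```python
import json

import pandas as pd


def load_domain_data(path):
    """Load crawled domain records into a data frame instead of a dict.

    Sketch only: assumes the JSON file holds a list of per-article dicts;
    the field names ('url', 'domain', ...) are placeholders.
    """
    with open(path) as f:
        records = json.load(f)
    df = pd.DataFrame.from_records(records)
    # With a data frame, common postprocessor steps become one-liners:
    # df.sort_values("domain"), df.groupby("domain").size(), etc.
    return df
```

The same pattern would apply to the scope and Twitter loaders, replacing the loop-and-build-a-dictionary code with one `from_records` call.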

server

  • finalized deletion of different old datasets - done

Twitter issue: embedded tweet

  • not a hurry

Action Items:

  • Alejandro: finish updating server documentation
  • Alejandro: send sites to crawl
  • design crawl brake (pause and email)
  • test the new method above (2 threads, etc.) on a small list of domains
  • main item is postprocessor refactoring along lines stated above
  • NYT archive politics crawl

Backburner

  • Twitter: embedded tweet issue
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
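
If the two postprocessor filters above are ever picked up, they could be as small as the following sketch; the tweet/handle URL pattern is an assumption, not anything in the existing postprocessor.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for links to a tweet or a Twitter handle.
TWEET_RE = re.compile(r"^https?://(www\.)?twitter\.com/[^/]+(/status/\d+)?/?$")


def is_co_il_link(url):
    """True for any hyperlink whose host ends in .co.il."""
    host = urlparse(url).hostname or ""
    return host.endswith(".co.il")


def is_tweet_or_handle_link(url):
    """True for links to a tweet or a Twitter handle."""
    return bool(TWEET_RE.match(url))
```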