March 3, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • re-do al-monitor.com crawl & benchmarking speed
  • Twitter API
  • testing new puppeteer filter code on 50 domains before documenting and committing
  • creating a second version of the postprocessor and making it master, to preserve Amy's version
  • adjust domain crawler set up script to add a variable for heap memory
    • document in mediacat domain crawler
  • Alejandro will update 2 tickets: readability & JS Heap memory (couldn't find them)

Twitter API

  • Shengsong managed to get the Twitter API crawler working
  • will meet with Alejandro to go over which keys to include in the output
  • it may take a few days to coordinate the postprocessor for a combined Twitter and domain crawler output
  • 6,000 tweets retrieved in 1-2 minutes, and it is possible to increase speed with multi-processing, as the Twitter API allows each handle to be processed separately
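Since each handle can be processed independently, the per-handle fan-out could be sketched roughly as below. This is a minimal illustration, not the project's actual crawler: `fetch_tweets_for_handle` is a hypothetical placeholder for the real Twitter API call, and a thread pool stands in for the multi-processing the notes mention (API calls are I/O-bound, so either works).

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_tweets_for_handle(handle):
    """Hypothetical stand-in for the real Twitter API call.

    The actual crawler would page through the API for one handle;
    here we return placeholder records so the sketch is runnable.
    """
    return [{"handle": handle, "tweet_id": i} for i in range(3)]

def crawl_handles(handles):
    # Handles are independent, so they can be fanned out across
    # workers and the per-handle batches merged at the end.
    with ThreadPoolExecutor() as pool:
        batches = pool.map(fetch_tweets_for_handle, handles)
    return [tweet for batch in batches for tweet in batch]
```

The merged list would then feed the combined Twitter/domain-crawler postprocessing step discussed above.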

Re-Do al-monitor.com & Benchmarking

  • 60,000 per day
  • JS heap memory issues were resolved for this crawl

testing new puppeteer filter code

  • filter code: selects HTML content to extract plain text
  • tested on 20 domains; it worked without issue on 18, and with a small adjustment Shengsong got the crawler working on the other 2
  • Google Chrome: inspect the HTML content to see what the selector is

change to processing as a result: the crawler grabs raw_content, and the postprocessor will determine plain text & hyperlinks
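That split could look roughly like the sketch below: the crawler stores the raw HTML, and a postprocessing pass pulls out visible text and hyperlinks. The function and field names (`postprocess`, `plain_text`, `hyperlinks`) are assumptions for illustration, not the project's actual schema; the standard-library `HTMLParser` stands in for whatever parser the real postprocessor uses.

```python
from html.parser import HTMLParser

class RawContentParser(HTMLParser):
    """Walk raw HTML, collecting visible text and <a href> targets."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.hyperlinks = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hyperlinks.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def postprocess(raw_content):
    # Hypothetical output shape: one plain-text string plus a link list.
    parser = RawContentParser()
    parser.feed(raw_content)
    return {"plain_text": " ".join(parser.text_parts),
            "hyperlinks": parser.hyperlinks}
```

For example, `postprocess('<p>Hello <a href="http://example.com">link</a></p>')` would yield the text `Hello link` and the single link `http://example.com`.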

making v2 of postprocessor

  • done
  • will need an update to reflect the new plain-text extraction function

adjust domain crawler set-up script to add a variable for heap memory, and document:

  • done
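For context, Node's V8 heap limit is normally raised with the `--max-old-space-size` flag, so the set-up script's new variable might look something like the fragment below. The variable name `HEAP_MB` and the 4096 MB value are assumptions for illustration; the real script's names and default may differ, so see the mediacat-domain-crawler documentation.

```shell
# Hypothetical set-up script variable (real name/default may differ):
HEAP_MB=4096
# Raise the V8 heap limit for the crawler's Node process:
export NODE_OPTIONS="--max-old-space-size=${HEAP_MB}"
```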

Tickets:

  • readability: done
  • JS Heap Memory: done

Action Items

  • finalize Twitter API crawl
  • need a list of the JSON outputted from the crawler, with keys, and other documentation as mentioned in Padlet
  • get started on the Puppeteer update from 1.5 to 2.2; will take 2-3 days
  • once readability is stripped from domain crawler and domain crawler is updated, run small domain crawl
    • Alejandro will provide domain URLs for 5 smaller domains
  • move plain-text extraction to the postprocessor (as described above)

Backburner

  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function