October 06, 2020

Agenda

  • Discuss sample output and ensure it is properly structured (with June).
  • Jacqueline & Amy presenting research on infrastructure technologies for the project
  • Demo of Parser (Danhua)
  • Demo of Crawler (Raiyan)
  • Twitter Crawler - what comes next
  • Parsing sample data from crawler (backend component - Postprocessor)
  • Ticket Update

Notes

  • Relation for the sample scope clarified: the "referring IDs" for a given article are the IDs of the entities that mention the article or link to it

  • Jacqueline: Node.js with Express outperforms Python with Flask for input/output-heavy operations (when communicating with the crawlers and the application); previous developers on the team also suggested that the Node.js Express framework would be optimal

  • Danhua's demo: the processing script notebook walk-through shows the following (a rough sketch follows this list)

    • inserting unique IDs
    • creating new rows for associated Twitter handles; this links related sources together through a matching unique ID (e.g. the Twitter accounts debkafile and debka_english)
    • the new records with linked IDs form the processed version of the dataframe
    • sorting items into a "News Source" object and a "Twitter Handle" object
    • test checks
      • CSV input
      • check for a valid Twitter handle (rows that hit an error are flagged)
      • check for a valid type (e.g. "Twitter Handle" or "News Source")
      • check for a valid URL
      • check for Unicode text
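A minimal Python/pandas sketch of the processing steps described above. The column names ("Name", "Type", "URL", "Associated Twitter Handle") and the helper logic are illustrative assumptions, not the actual notebook code:

```python
import re
import uuid
from urllib.parse import urlparse

import pandas as pd

HANDLE_RE = re.compile(r"^@?\w{1,15}$")  # rough Twitter-handle pattern (assumption)

def process_scope(csv_path):
    df = pd.read_csv(csv_path)                         # CSV input
    df["ID"] = [str(uuid.uuid4()) for _ in df.index]   # insert a unique ID per row

    # Create new rows for associated Twitter handles; each new row shares the
    # parent source's ID, so related sources (e.g. debkafile and debka_english)
    # are linked through a matching unique ID.
    extra = []
    for _, row in df.iterrows():
        handle = row.get("Associated Twitter Handle")
        if pd.notna(handle):
            extra.append({"Name": handle, "Type": "Twitter Handle", "ID": row["ID"]})
    processed = pd.concat([df, pd.DataFrame(extra)], ignore_index=True)

    # Checks: valid type, valid Twitter handle, valid URL; offending rows are flagged.
    for i, row in processed.iterrows():
        if row["Type"] not in ("News Source", "Twitter Handle"):
            print(f"row {i}: unexpected type {row['Type']!r}")
        if row["Type"] == "Twitter Handle" and not HANDLE_RE.match(str(row["Name"])):
            print(f"row {i}: invalid Twitter handle {row['Name']!r}")
        if row["Type"] == "News Source" and not urlparse(str(row.get("URL"))).netloc:
            print(f"row {i}: invalid URL {row.get('URL')!r}")

    # Sort items into the two object groups.
    news_sources = processed[processed["Type"] == "News Source"]
    twitter_handles = processed[processed["Type"] == "Twitter Handle"]
    return news_sources, twitter_handles
```
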
  • Recommendations for Danhua

    • explore existing functions for checking that a URL is valid, or write a custom URL check (see the sketch below)
    • continue splitting the logic of the processing script into individual functions
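
One possible shape for that URL check, as a hedged sketch: a hand-rolled validator built on urllib.parse, plus a note on an existing third-party alternative (the validators package, which the team has not committed to):

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Accept a URL only if it parses with an http(s) scheme and a host."""
    try:
        parsed = urlparse(url)
    except (TypeError, ValueError, AttributeError):
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# An existing alternative (third-party "validators" package, illustrative only):
#   import validators
#   validators.url("https://www.nytimes.com")  # truthy if the URL is well-formed
```
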
  • Raiyan's demo of the crawler

    • using Readability.js to extract an article's metadata from its HTML (tested with a NYTimes article)
      • the extracted text sometimes includes photo descriptions
      • hyperlinks are identified in the HTML; open question: should links be taken from the whole webpage or only from the article itself?
      • article bylines may list multiple authors
      • downside: the date of publication is not retrieved by Readability.js
    • different handling/filters can be built in for crawling a news source homepage versus a news source article URL
    • run separate crawls for each domain; new domains may be discovered across the different sites
    • MediaCloud's sitemap parser and feed seeker may be useful tools for extracting the links of a given domain (a sketch follows this list)
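
A minimal sketch of how those two tools could be wired up, assuming they refer to MediaCloud's ultimate-sitemap-parser (usp) and the feed_seeker package; the example domain is a placeholder:

```python
from usp.tree import sitemap_tree_for_homepage  # MediaCloud's ultimate-sitemap-parser
from feed_seeker import generate_feed_urls      # MediaCloud's feed seeker

domain = "https://www.nytimes.com/"  # placeholder domain

# Walk the domain's sitemap(s) and collect every page URL they list.
tree = sitemap_tree_for_homepage(domain)
page_urls = [page.url for page in tree.all_pages()]
print(f"{len(page_urls)} URLs found via sitemaps")

# Discover RSS/Atom feeds advertised by the homepage.
for feed_url in generate_feed_urls(domain):
    print("feed:", feed_url)
```
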
  • Twitter Crawler - what comes next

    • Selenium or Puppeteer?
    • since Puppeteer is already being used, the priority is to explore that tool
    • Puppeteer resources: here & here
  • Amy is exploring date recognition, with help from Jacqueline

    • Date Guesser from MediaCloud (see the sketch below)
    • a Google lookup may be able to retrieve the date
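
A minimal sketch of MediaCloud's date_guesser package, assuming that is the Date Guesser referred to above; the article URL is a placeholder:

```python
import requests
from date_guesser import guess_date

url = "https://www.nytimes.com/2017/10/13/us/puerto-rico-trump-fema.html"  # placeholder article
html = requests.get(url, timeout=30).text

guess = guess_date(url=url, html=html)
print(guess.date)      # best-guess publication datetime, or None if nothing was found
print(guess.accuracy)  # how precise the guess is (e.g. date vs. partial date)
print(guess.method)    # how the date was found (URL pattern, page metadata, etc.)
```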

Tickets Update

  • Constructing the framework for the application - Jacqueline
  • Domain crawler - Raiyan is conducting a sample crawl; Alex and Raiyan to meet and discuss splitting up the work as well as the rules for crawling (filters for within-domain vs. out-of-domain links)
  • Design database/datastore approach - on hold pending further infrastructure design
  • Twitter ticket - Danhua will look into Puppeteer as a way to retrieve user tweets