December 2, 2021

Agenda

Action Items from last week

  • update on Twitter crawler?
  • John to work on issue of slow processing for Found URLs (utility script populating JSON)
  • John will look into getting the domain from the link in the Twitter crawler, to resolve the issue of missing URLs
  • John to run some scripts on the output from the al-monitor crawl after stopping the domain crawl, to see if there are duplicates. If there are no duplicates, run it through the URL-finding utility script and then start the post-processor on this data.
  • John to run the new, database-free metascraper script against 10 JSON files from the NYT crawl that have the date in the URL, and report back on the success rate. Share these JSON files with Colin, who will run htmldate against the same sample so we can compare the success rates.
  • John to delete old metascraper and commit new metascraper to utils folder for the post-processor.
  • Colin to document and commit his .csv processing script to a utils folder in the Front-End repository
  • Colin to look at the work involved in running this Python crawler again: https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/Crawler.py (particularly the underlying library). What would it take to refactor or redevelop this crawler, and what benefits would it bring to the project? One key piece is understanding the domains supported by the library and how many of those appear in Alejandro's scope. See documentation here: https://newspaper.readthedocs.io/en/latest/

Twint issue:

Slow processing of Found URLs:

  • the utility that fetches page titles was hanging on some URLs; John added a timeout, and if it can't get the title, it leaves it blank (see the sketch below)
  • John will commit this to the utils folder once he's done testing, and add documentation to the README
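
A minimal sketch of that timeout behaviour (the function name and the requests/BeautifulSoup stack are assumptions, not necessarily what the actual utility uses):

```python
# Hypothetical sketch of the title utility's timeout behaviour.
import requests
from bs4 import BeautifulSoup

def get_title(url, timeout=10):
    """Fetch a page title; return an empty string on timeout or error."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        if soup.title and soup.title.string:
            return soup.title.string.strip()
        return ""
    except requests.RequestException:
        # Leave the title blank rather than hang on a slow or broken URL.
        return ""
```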

Getting the domain from links in the Twitter crawler:

  • John added this functionality to the post-processor, and once the post-processor is done, he'll push it together with the other changes

Al-Monitor Crawl:

  • the crawl produced 117,000 files, about 30,000 of them duplicates; that's about the same as the second crawl, so we'll delete the first crawl

current order of steps before the post-processor:

  • for the domain crawl:
    • run the script to extract all URLs in each article (Found URL function)
    • run the date scraper (either metascraper or htmldate)
    • this output can then go into the post-processor
  • for the Twitter crawl:
    • the utility to expand short URLs is built into the crawler
    • if it was not used, run the utility after the crawl (a sketch follows this list)
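
If the built-in expansion wasn't used, a standalone pass over the crawl output could look roughly like this (a minimal sketch; the function name is hypothetical and the real utility may differ):

```python
# Hypothetical sketch of the short-URL expansion utility (e.g., t.co links).
import requests

def expand_short_url(url, timeout=10):
    """Follow redirects to resolve a shortened URL to its final destination."""
    try:
        # A HEAD request usually follows the redirect chain without
        # downloading the page body.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.url
    except requests.RequestException:
        # If resolution fails, keep the original short URL.
        return url
```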

date extraction:

  • metascraper (JavaScript) vs htmldate (Python): on 10 nytimes.com articles, metascraper got only 4 dates right, though it may be picking up the article's updated date rather than the original publication date; htmldate got all 10 right
  • we will test both on 10 articles from 972mag.com to see how they do (a sketch of the htmldate side follows this list)
  • John removed the old metascraper and updated his new script
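
The htmldate side of that comparison can be run in a few lines (a sketch; find_date is htmldate's documented entry point, and the test URL below is a placeholder):

```python
# Sketch: checking publication dates with the htmldate library.
from htmldate import find_date

test_urls = [
    "https://www.972mag.com/example-article/",  # placeholder URL
]

for url in test_urls:
    # original_date=True prefers the original publication date over a
    # later "updated" date, the failure mode suspected for metascraper above.
    print(url, find_date(url, original_date=True))
```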

.csv processing script:

  • Colin made a pull request, and John will review it

post-processor Twitter issue:

  • currently, tweets with relevant citations are put into interest.output because the URL of the tweet itself is out of scope
  • proposed: change the post-processor to put the tweet itself in the main output (as opposed to interest.output); not yet decided

Python crawler:

  • the Python crawler by itself needs work on a number of filters
  • Colin installed the main-branch version; running it on 972mag.com, it was not able to find links
  • probably not worth it, but Colin will try it on the old-fashioned aljazeerah.info
  • there it found links, but then mostly failed due to an empty queue; it kept getting caught on the WordPress admin page
    • it takes a strange crawl path
    • it might need filters to avoid getting stuck on some of these URLs (see the sketch after this list)
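
For context on the library question in the action item above, the newspaper package that the Voyage crawler builds on can be probed directly to see what it discovers for a domain (a sketch using newspaper3k's documented build/Article API; 972mag.com is just the example from the discussion):

```python
# Sketch: probing what the newspaper library finds for a domain,
# independent of the Voyage crawler built around it.
import newspaper

source = newspaper.build("https://972mag.com", memoize_articles=False)
print(f"{source.size()} articles discovered")

for article in source.articles[:5]:
    print(article.url)

# Parsing one article shows the metadata the library exposes,
# including the publish date relevant to date extraction.
if source.articles:
    art = newspaper.Article(source.articles[0].url)
    art.download()
    art.parse()
    print(art.title, art.publish_date)
```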

Action Items

  • John to commit the title utility
  • John to run the post-processor on the al-monitor.com crawl
  • test metascraper vs htmldate on 972mag.com and al-monitor.com articles
  • Colin will look at the Python crawler with different domains, including aljazeerah.info
  • Colin will produce a CSV with all the tweets with in-scope citations from the NYT Twitter data: headings will include name of source, date, found_url (if included), name of the Twitter user (from the NYTimes Twitter handle spreadsheet tab), and any Twitter counts (like likes or retweets); a sketch follows this list
  • John will look at whether it's possible to include as many data points as possible in the output (e.g., like counts, retweet counts, etc.)
  • John will look at whether it's possible to change the post-processor to place relevant tweets directly in the output (as opposed to interest.output)
  • Alejandro will write an email to Kirsta et al about possibility of multiprocessing
  • Alejandro will send a list of sites for domain crawling to John
  • Alejandro will talk to Shengsong about meeting up with team or with John in prep for next semester.
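
A hypothetical sketch of that CSV (field names follow the headings listed in the action item and are not final; the sample row is illustrative only):

```python
# Hypothetical sketch of the in-scope-tweets CSV; column names and the
# sample row are placeholders, not the agreed-upon schema.
import csv

FIELDS = ["source_name", "date", "found_url", "twitter_user", "likes", "retweets"]

rows = [
    {
        "source_name": "nytimes",
        "date": "2021-11-30",
        "found_url": "https://example.com/cited-article",
        "twitter_user": "@nytimes",
        "likes": 120,
        "retweets": 45,
    },
]

with open("nyt_in_scope_tweets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```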