December 2, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Action Items from last week
update on Twitter crawler?
John to work on issue of slow processing for Found URLs (utility script populating JSON)
John will look into getting the domain from link for twitter crawler to resolve issue of missing URLs
John to run some scripts on the output from the al-monitor crawl after stopping the domain crawl to see if there are duplicates. If there are no duplicates, run the data through the URL-finding utility script and then start the post-processor on it.
John to run the new, database-free metascraper script against 10 JSON files from the NYT crawl that have the date in the URL, and report back on the success rate. Share these JSON files with Colin, who will run htmldate against the same sample so we can compare success rates.
John to delete old metascraper and commit new metascraper to utils folder for the post-processor.
Colin to document and commit his .csv processing script to a utils folder in the Front-End repository
Colin to look at the work involved in running this python crawler again: https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/Crawler.py, particularly the underlying library: what would it take to refactor or redevelop this crawler, and what benefits would it bring to the project? One key piece is understanding which domains the library supports and how many of those appear in Alejandro's scope. See documentation here: https://newspaper.readthedocs.io/en/latest/
utility to get title was hanging on some URLs -- a time-out was added; if the title cannot be retrieved, it is left blank
John will commit this to the utils folder once he's done testing, and add documentation to the README
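The time-out behaviour described above might look like this minimal sketch; the function names, the 10-second default, and the read limit are assumptions, not the project's actual code:

```python
# Hypothetical sketch of the title utility: fetch a page's <title> with a
# time-out, leaving the title blank on failure instead of hanging.
import re
import urllib.request


def extract_title(html: str) -> str:
    """Pull the <title> text out of an HTML string, or return ""."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def get_title(url: str, timeout: float = 10.0) -> str:
    """Fetch the page and return its title; on time-out or any other
    error, leave the title blank rather than hang the pipeline."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return extract_title(resp.read(65536).decode("utf-8",
                                                         errors="replace"))
    except Exception:
        return ""
```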
Getting link from twitter crawler:
John added this functionality to the postprocessor, and once postprocessor is done, he'll push it together with other changes
Al-Monitor Crawl:
117,000 files, about 30,000 duplicates, which is about the same as the second crawl, so we'll delete the first crawl
current order of pre-postprocessor:
for domain crawl
script to extract all urls in article (found URL function)
run date scraper (either metascraper or htmldate)
then this output can go into the postprocessor
for twitter crawl
have utility to expand short url built into crawler
if not used, then use utility after crawl
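The short-URL expansion utility mentioned above could be sketched roughly like this; the function name and the fall-back behaviour are assumptions:

```python
# Minimal sketch of short-URL expansion (e.g. t.co links): follow HTTP
# redirects and return the final resolved URL, keeping the short form on
# any error rather than dropping the link.
import urllib.request


def expand_url(short_url: str, timeout: float = 10.0) -> str:
    """Follow redirects and return the final URL for a shortened link."""
    try:
        with urllib.request.urlopen(short_url, timeout=timeout) as resp:
            return resp.geturl()  # URL after all redirects
    except Exception:
        return short_url
```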
date extraction:
metascraper (JavaScript) vs htmldate: on the 10 NYT articles, metascraper got only 4 right (though it may be picking up the updated-article date); htmldate got all 10 right
we will test on 10 articles from 972mag.com to see how they do.
John removed all the old metascraper code and updated his version
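A small harness like the following could score either date extractor against a hand-labelled sample. The toy `meta_date` extractor (a single meta-tag regex) is an illustrative stand-in, not the real metascraper or htmldate code, and the sample data is made up:

```python
# Sketch of a scoring harness for comparing date extractors on a
# labelled sample of articles.
import re


def meta_date(html: str) -> str:
    """Toy extractor: read a YYYY-MM-DD date from article:published_time."""
    m = re.search(
        r'property="article:published_time"\s+content="(\d{4}-\d{2}-\d{2})',
        html)
    return m.group(1) if m else ""


def score_extractor(extract, samples) -> float:
    """samples is a list of (html, expected_date); returns fraction correct."""
    hits = sum(1 for html, expected in samples if extract(html) == expected)
    return hits / len(samples)
```

Running both extractors through `score_extractor` on the same 10-article sample would give directly comparable success rates.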
.csv processing script:
Colin made a pull request, and John will review it
postprocessor Twitter issue:
currently, tweets with relevant citations are put into interest.output because the URL of the tweet itself is outside of scope
change the postprocessor to put the tweet itself in the main output (as opposed to interest.output) -- not yet decided
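The proposed routing change might be sketched like this; the field name `found_urls` and the scope set are assumptions about the postprocessor's data, not its actual API:

```python
# Hedged sketch: route a tweet to the main output if any URL it cites is
# in scope, even when the tweet's own URL is not.
from urllib.parse import urlparse

IN_SCOPE_DOMAINS = {"972mag.com", "al-monitor.com"}  # illustrative scope list


def route_tweet(tweet: dict) -> str:
    """Return "output" for tweets citing an in-scope domain,
    else "interest.output" (the current behaviour for everything)."""
    for url in tweet.get("found_urls", []):
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in IN_SCOPE_DOMAINS:
            return "output"
    return "interest.output"
```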
Python crawler:
python crawler by itself needs work on a bunch of filters
Colin installed the main branch version, and was not able to find links when running it on 972mag.com
probably not worth it, but Colin will try on old-fashioned aljazeerah.info
found links, but then failed mostly due to an empty queue; it kept getting caught on the WordPress admin page
takes a strange path
might need filters to avoid getting stuck on some of these urls
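The kind of URL filter discussed could look like this minimal sketch; the skip patterns are illustrative assumptions about what trapped the crawler:

```python
# Minimal sketch of a crawl-queue URL filter to keep the crawler from
# getting stuck on admin/login pages and similar URL traps.
import re

SKIP_PATTERNS = [
    r"/wp-admin",          # WordPress admin pages that trapped the crawler
    r"/wp-login",
    r"[?&]replytocom=",    # comment-reply URL permutations
]


def should_crawl(url: str) -> bool:
    """Return False for URLs matching any skip pattern."""
    return not any(re.search(p, url) for p in SKIP_PATTERNS)
```

A filter like this would run before a URL is added to the queue, so dead-end pages never crowd out real articles.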
Action Items
John commit utility for title
John run postprocessor on al-monitor.com crawl
test metascraper vs htmldate on 972mag.com and al-monitor.com articles
Colin will look at the python crawler with different domains, including aljazeerah.info
Colin will produce a CSV with all the tweets with in-scope citations from NYT twitter data: some headings will be name of source, date, found_url (if included), name of twitter user (from NYTimes Twitter handle spreadsheet tab), and then any twitter counts (like likes or retweets).
John will look at whether it's possible to include as many data points as possible in the output (e.g., like counts, retweet counts, etc.)
John will look at whether it's possible to change the postprocessor to place relevant tweets directly in the output (as opposed to interest.output)
Alejandro will write an email to Kirsta et al about possibility of multiprocessing
Alejandro will send a list of sites for domain crawling to John
Alejandro will talk to Shengsong about meeting up with team or with John in prep for next semester.
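The CSV described in the action items above could be produced with a sketch like this. The column names follow the headings listed there; the tweet dicts' field names are assumptions about the crawler output, not its actual schema:

```python
# Sketch of the in-scope-tweets CSV: one row per tweet, with source,
# date, found URL, Twitter user, and engagement counts.
import csv

FIELDS = ["name_of_source", "date", "found_url",
          "twitter_user", "like_count", "retweet_count"]


def write_tweet_csv(tweets, path):
    """Write one row per in-scope tweet, ignoring any extra fields."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(tweets)
```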