December 2, 2021

Agenda

Action Items from last week

  • update on Twitter crawler?
  • John to work on issue of slow processing for Found URLs (utility script populating JSON)
  • John will look into getting the domain from the link in the Twitter crawler, to resolve the issue of missing URLs
  • John to run some scripts on the output from the al-monitor crawl after stopping the domain crawl, to see if there are duplicates. If there are no duplicates, run it through the URL-finding utility script and then start the post-processor on this data.
  • John to run the new, database-free metascraper script against 10 JSON files from the NYT crawl that have the date in the URL, and report back on the success rate. Share these JSON files with Colin, who will run htmldate against the same sample so we can compare the success rates.
  • John to delete old metascraper and commit new metascraper to utils folder for the post-processor.
  • Colin to document and commit his .csv processing script to a utils folder in the Front-End repository
  • Colin to look at the work involved in running this Python crawler again: https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/Crawler.py (particularly the underlying library). What would it take to refactor or redevelop this crawler, and what benefits would it bring to the project? One key piece is understanding the domains supported by the library and how many of those appear in Alejandro's scope. See documentation here: https://newspaper.readthedocs.io/en/latest/

Twint issue:

Slow processing of Found URLs:

  • the utility that fetches page titles was hanging on some URLs; John added a timeout, and if it can't get the title, it leaves it blank (see the sketch below)
  • John will commit this to the utils folder once he's done testing, and add documentation to the README
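
A minimal sketch of that timeout behaviour (the function name and the requests/BeautifulSoup stack are assumptions, not necessarily what the actual utility uses):

```python
# Hypothetical sketch of the title utility's timeout behaviour.
import requests
from bs4 import BeautifulSoup

def get_title(url, timeout=10):
    """Fetch a page title; return an empty string on timeout or error."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        if soup.title and soup.title.string:
            return soup.title.string.strip()
        return ""
    except requests.RequestException:
        # Leave the title blank rather than hang on a slow or broken URL.
        return ""
```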

Getting the domain from links in the Twitter crawler:

  • John added this functionality to the post-processor, and once the post-processor is done, he'll push it together with the other changes

Al-Monitor Crawl:

  • the crawl produced 117,000 files, about 30,000 of them duplicates; that's about the same as the second crawl, so we'll delete the first crawl

current order of steps before the post-processor:

  • for the domain crawl:
    • run the script to extract all URLs in each article (Found URL function)
    • run the date scraper (either metascraper or htmldate)
    • this output can then go into the post-processor
  • for the Twitter crawl:
    • the utility to expand short URLs is built into the crawler
    • if it was not used, run the utility after the crawl (a sketch follows this list)
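
If the built-in expansion wasn't used, a standalone pass over the crawl output could look roughly like this (a minimal sketch; the function name is hypothetical and the real utility may differ):

```python
# Hypothetical sketch of the short-URL expansion utility (e.g., t.co links).
import requests

def expand_short_url(url, timeout=10):
    """Follow redirects to resolve a shortened URL to its final destination."""
    try:
        # A HEAD request usually follows the redirect chain without
        # downloading the page body.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.url
    except requests.RequestException:
        # If resolution fails, keep the original short URL.
        return url
```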

date extraction:

  • metascraper (JavaScript) vs htmldate (Python): on 10 nytimes.com articles, metascraper got only 4 dates right, though it may be picking up the article's updated date rather than the original publication date; htmldate got all 10 right
  • we will test both on 10 articles from 972mag.com to see how they do (a sketch of the htmldate side follows this list)
  • John removed the old metascraper and updated his new script
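
The htmldate side of that comparison can be run in a few lines (a sketch; find_date is htmldate's documented entry point, and the test URL below is a placeholder):

```python
# Sketch: checking publication dates with the htmldate library.
from htmldate import find_date

test_urls = [
    "https://www.972mag.com/example-article/",  # placeholder URL
]

for url in test_urls:
    # original_date=True prefers the original publication date over a
    # later "updated" date, the failure mode suspected for metascraper above.
    print(url, find_date(url, original_date=True))
```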

.csv processing script:

  • Colin made a pull request, and John will review it

post-processor Twitter issue:

  • currently, tweets with relevant citations are put into interest.output because the URL of the tweet itself is out of scope
  • proposed: change the post-processor to put the tweet itself in the main output (as opposed to interest.output); not yet decided

Python crawler:

  • the Python crawler by itself needs work on a number of filters
  • Colin installed the main-branch version; running it on 972mag.com, it was not able to find links
  • probably not worth it, but Colin will try it on the old-fashioned aljazeerah.info
  • there it found links, but then mostly failed due to an empty queue; it kept getting caught on the WordPress admin page
    • it takes a strange crawl path
    • it might need filters to avoid getting stuck on some of these URLs (see the sketch after this list)
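
For context on the library question in the action item above, the newspaper package that the Voyage crawler builds on can be probed directly to see what it discovers for a domain (a sketch using newspaper3k's documented build/Article API; 972mag.com is just the example from the discussion):

```python
# Sketch: probing what the newspaper library finds for a domain,
# independent of the Voyage crawler built around it.
import newspaper

source = newspaper.build("https://972mag.com", memoize_articles=False)
print(f"{source.size()} articles discovered")

for article in source.articles[:5]:
    print(article.url)

# Parsing one article shows the metadata the library exposes,
# including the publish date relevant to date extraction.
if source.articles:
    art = newspaper.Article(source.articles[0].url)
    art.download()
    art.parse()
    print(art.title, art.publish_date)
```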

Action Items

  • John to commit the title utility
  • John to run the post-processor on the al-monitor.com crawl
  • test metascraper vs htmldate on 972mag.com and al-monitor.com articles
  • Colin will look at the Python crawler with different domains, including aljazeerah.info
  • Colin will produce a CSV with all the tweets with in-scope citations from the NYT Twitter data: headings will include name of source, date, found_url (if included), name of the Twitter user (from the NYTimes Twitter handle spreadsheet tab), and any Twitter counts (like likes or retweets); a sketch follows this list
  • John will look at whether it's possible to include as many data points as possible in the output (e.g., like counts, retweet counts, etc.)
  • John will look at whether it's possible to change the post-processor to place relevant tweets directly in the output (as opposed to interest.output)
  • Alejandro will write an email to Kirsta et al about possibility of multiprocessing
  • Alejandro will send a list of sites for domain crawling to John
  • Alejandro will talk to Shengsong about meeting up with team or with John in prep for next semester.
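
A hypothetical sketch of that CSV (field names follow the headings listed in the action item and are not final; the sample row is illustrative only):

```python
# Hypothetical sketch of the in-scope-tweets CSV; column names and the
# sample row are placeholders, not the agreed-upon schema.
import csv

FIELDS = ["source_name", "date", "found_url", "twitter_user", "likes", "retweets"]

rows = [
    {
        "source_name": "nytimes",
        "date": "2021-11-30",
        "found_url": "https://example.com/cited-article",
        "twitter_user": "@nytimes",
        "likes": 120,
        "retweets": 45,
    },
]

with open("nyt_in_scope_tweets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```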