December 9, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • question about doubled-entries
  • John to commit the title utility
  • John to run the postprocessor on the almonitor.com crawl
  • test metascraper vs html-date on 972mag.com almonitor.com articles
  • Colin will look at the Python crawler with different domains, including aljazeerah.info
  • Colin will produce a CSV with all the tweets with in-scope citations from NYT twitter data: some headings will be name of source, date, found_url (if included), name of twitter user (from NYTimes Twitter handle spreadsheet tab), and then any twitter counts (like likes or retweets).
  • John will look at whether it's possible to include as many data points as possible in output (e.g., like counts, re-tweet counts, etc).
  • John will look at whether it's possible to change the postprocessor to place relevant tweets directly in output (as opposed to interest.output)

Notes

  • We confirmed meeting time for the new year and moved to Alejandro's Zoom room

  • question about doubled-entries: Recommend managing duplicates in Excel, using Excel's built-in features for filtering duplicates and for conditionally formatting (highlighting) duplicates
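If duplicate handling ever needs to move out of Excel and into the pipeline, a minimal keep-first deduplication could look like the sketch below. The field names are hypothetical, not taken from the actual output schema:

```python
def dedupe_rows(rows, key_fields):
    """Keep only the first occurrence of each row, keyed on the given fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"source": "almonitor.com", "found_url": "https://example.com/a"},
    {"source": "almonitor.com", "found_url": "https://example.com/a"},
    {"source": "972mag.com", "found_url": "https://example.com/b"},
]
print(len(dedupe_rows(rows, ["source", "found_url"])))  # → 2
```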

  • John committed the title utility

  • John ran the postprocessor on the almonitor.com crawl and sent the output to Colin. Processing is taking some time, and Colin will be reviewing the results.

  • test metascraper vs htmldate on 972mag.com and almonitor.com articles. We tested both on NY Times articles: metascraper was wrong 60% of the time, while htmldate did much better, but it relies on the URL. So we tried both on domains that don't include the date in the URL. On 972mag.com, metascraper got 3 dates exactly right and 7 within two days, while htmldate got about 90% correct; htmldate was then wrong on all of the almonitor.com articles. Kirsta suggests we look into https://newsapi.org/s/google-news-api to see whether the Google News API is a viable way to bring back dates.
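Both tools look at on-page date metadata among other signals. As a rough stdlib-only illustration of that kind of fallback (not the actual logic of either library, and the meta tags checked here are just common OpenGraph-style conventions):

```python
import re

def date_from_meta(html):
    """Pull a publication date from common <meta> tags, if present."""
    for prop in ("article:published_time", "og:published_time"):
        pattern = (r'<meta[^>]+(?:property|name)=["\']' + re.escape(prop) +
                   r'["\'][^>]+content=["\']([^"\']+)')
        m = re.search(pattern, html, re.IGNORECASE)
        if m:
            return m.group(1)[:10]  # keep the YYYY-MM-DD part
    return None

html = '<meta property="article:published_time" content="2021-12-09T10:00:00Z">'
print(date_from_meta(html))  # → 2021-12-09
```

A date found this way is only as trustworthy as the publisher's markup, which may explain why accuracy varies so much by domain.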

  • Colin will look at the Python crawler with different domains, including aljazeerah.info. It ran and looks promising. He will see if he can run it on 972mag.com to check whether it works and is faster than the JavaScript crawler.

  • Colin will produce a CSV with all the tweets with in-scope citations from NYT twitter data: some headings will be name of source, date, found_url (if included), name of twitter user (from NYTimes Twitter handle spreadsheet tab), and then any twitter counts (like likes or retweets).
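A minimal sketch of how that CSV could be written with Python's csv module. The column names and tweet fields below are assumptions based on the headings listed above, not the final schema:

```python
import csv

# Column headings follow the meeting notes; exact names are hypothetical.
FIELDS = ["name_of_source", "date", "found_url", "twitter_user",
          "like_count", "retweet_count"]

def write_citations_csv(path, tweets):
    """Write one row per in-scope tweet citation, ignoring extra fields."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for tweet in tweets:
            writer.writerow(tweet)

tweets = [{
    "name_of_source": "almonitor.com",
    "date": "2021-12-09",
    "found_url": "https://example.com/article",
    "twitter_user": "nytimes",
    "like_count": 12,
    "retweet_count": 3,
}]
write_citations_csv("citations.csv", tweets)
```

`extrasaction="ignore"` lets the same tweet dicts carry extra data points (per the next item) without breaking the CSV writer.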

  • John will look at whether it's possible to include as many data points as possible in output (e.g., like counts, re-tweet counts, etc).

  • John will look at whether it's possible to change the postprocessor to place relevant tweets directly in output (as opposed to interest.output)
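One possible shape for that change is to attach matching tweets directly to each article record rather than writing them to a separate interest.output file. This is only a sketch; all key names here are hypothetical:

```python
def attach_tweets(articles, tweets):
    """Return article records with their matching tweets embedded."""
    merged = []
    for article in articles:
        matches = [t for t in tweets
                   if t.get("found_url") == article.get("url")]
        merged.append({**article, "tweets": matches})
    return merged

articles = [{"url": "https://example.com/a"}]
tweets = [{"found_url": "https://example.com/a", "text": "see this"}]
print(attach_tweets(articles, tweets)[0]["tweets"][0]["text"])  # → see this
```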

Action Items

  • John is going to monitor crawls and put them through the post-processor. He will also clean up the post-processor, since there are multiple versions on Graham, and make sure it is well documented. John to see what is involved in running multiple instances of the crawler concurrently (or how best to implement that within our resources).

  • Colin to generate a .csv for Al-Monitor and look into a viable Python crawler, and see if he can find a pattern to the failures.