November 03, 2020

  • Ticket Review
  • What should we do with all this extra Twitter Data?
  • Beginning the Crawl on SciNet!

Notes

  • We are delayed on finalizing the scope; the deadline is the end of this week

  • Twitter crawler code

    • Additional columns to be incorporated in the output, e.g. date, hashtag, language, mentions of other Twitter users, retweets_count, likes_count
    • the mentions and URLs have to be treated as links and references
    • we need a way to handle errors and track the crawler's progress; right now, when an error occurs, the thread simply exits
    • in old Mediacat, even if a link was faulty or raised an exception, it was bypassed so the crawl could still continue
    • an exit signal may cause the domain crawler to exit as well; we need a way of keeping track of which handles have been crawled (see the sketch below this list)
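
A minimal sketch of the per-handle error handling and progress tracking discussed above, written in TypeScript for illustration (the Twitter crawler itself may use a different language); `crawlHandle`, the handle list, and the progress file name are all hypothetical.

```typescript
import { promises as fs } from "fs";

// Hypothetical per-handle crawl call; stands in for whatever the crawler actually runs.
declare function crawlHandle(handle: string): Promise<void>;

// Crawl each handle, record which ones succeeded or failed, and never let a
// single failure terminate the whole run.
async function crawlAll(handles: string[]): Promise<void> {
  const progress: Record<string, string> = {};
  for (const handle of handles) {
    try {
      await crawlHandle(handle);
      progress[handle] = "done";
    } catch (err) {
      // Log the failure and move on instead of letting the thread exit.
      progress[handle] = `error: ${(err as Error).message}`;
    }
    // Persist progress after every handle so an interrupted crawl can be resumed.
    await fs.writeFile("crawl_progress.json", JSON.stringify(progress, null, 2));
  }
}
```
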
  • Mediacat domain crawler

    • implemented different regexes for different domains to detect the relevant links on each crawled site
    • accepting CSVs as input
    • test crawl on a sample of 20 pages: grabbed links from each page, filtered out links outside of the domain, and wrote out-of-scope data to a separate JSON file (see the sketch below)
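
A rough sketch of the in-scope vs. out-of-scope link filtering described above, assuming the scope domains come from the CSV input; the generic URL pattern and the example domains are illustrative, not the crawler's actual regexes.

```typescript
// Hypothetical in-scope domains; in the real crawler these come from the scope CSV.
const scopeDomains = new Set(["example.com", "example.org"]);

// Pull absolute URLs out of a page's HTML with a generic pattern, then split
// them by whether their hostname is in scope.
function splitLinks(html: string): { inScope: string[]; outOfScope: string[] } {
  const links = html.match(/https?:\/\/[^\s"'<>]+/g) ?? [];
  const inScope: string[] = [];
  const outOfScope: string[] = [];
  for (const link of links) {
    try {
      const host = new URL(link).hostname.replace(/^www\./, "");
      (scopeDomains.has(host) ? inScope : outOfScope).push(link);
    } catch {
      outOfScope.push(link); // unparsable URLs are treated as out of scope
    }
  }
  return { inScope, outOfScope };
}
```
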
  • Integrate PDF capture

    • completed, with UUIDs generated from the URL of each page (see the sketch below)
    • a PDF representation of pages is more costly, so this may need to be implemented as a separate service optionally run after post-processing (when only interlinked items in scope remain)
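
A sketch of how a URL-derived UUID and an optional PDF pass could fit together, assuming Puppeteer and the uuid package; the actual crawler may use different tooling, and the output directory is hypothetical.

```typescript
import { v5 as uuidv5 } from "uuid";
import puppeteer from "puppeteer";

// The same URL always maps to the same UUID, so the PDF file name is reproducible.
function idForUrl(url: string): string {
  return uuidv5(url, uuidv5.URL);
}

// Optional PDF-capture pass, run separately after post-processing: render the
// page and save it under its UUID. Assumes the pdfs/ directory already exists.
async function capturePdf(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    const path = `pdfs/${idForUrl(url)}.pdf`;
    await page.pdf({ path, format: "A4" });
    return path;
  } finally {
    await browser.close();
  }
}
```
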
  • Review integration of metascraper into domain crawler

    • Causes an issue with async, so it will need to be run as a separate crawl (see the sketch below)
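
One way to keep metascraper out of the main crawl loop is to run it as its own pass over already-crawled URLs, roughly as below; the `got` HTTP client and the specific metascraper plugins shown are assumptions.

```typescript
// Separate metadata pass, kept out of the main crawl to avoid the async issue.
const metascraper = require("metascraper")([
  require("metascraper-date")(),
  require("metascraper-title")(),
]);
const got = require("got");

async function scrapeMetadata(urls: string[]): Promise<Record<string, unknown>> {
  const results: Record<string, unknown> = {};
  for (const url of urls) {
    const { body: html } = await got(url); // fetch the page HTML
    results[url] = await metascraper({ html, url }); // extract date, title, etc.
  }
  return results;
}
```
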
  • Integrate date detection into crawler

    • The resolve/reject issue has been resolved, but detection hangs after receiving many requests (unhandled promise rejection); see the sketch below this list
    • Jacqueline will work with Alex to examine this issue
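
If the hang really is caused by firing too many requests at once, one possible fix is to cap concurrency and catch every rejection, sketched below with the p-limit package; `detectDate` and the limit of 5 are placeholders.

```typescript
import pLimit from "p-limit";

// Hypothetical date-detection call; stands in for the crawler's detector.
declare function detectDate(url: string): Promise<string | null>;

// Cap the number of in-flight detections and catch every rejection so a bad
// request can neither hang the run nor surface as an unhandled promise rejection.
async function detectDates(urls: string[]): Promise<Record<string, string | null>> {
  const limit = pLimit(5); // at most 5 concurrent requests (placeholder value)
  const entries = await Promise.all(
    urls.map((url) =>
      limit(async (): Promise<[string, string | null]> => {
        try {
          return [url, await detectDate(url)];
        } catch {
          return [url, null]; // record "no date found" instead of rejecting
        }
      })
    )
  );
  return Object.fromEntries(entries);
}
```
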
  • Add twitter_crawler compiler

    • Travis tests are not passing (this was before the project was made private)
    • waiting for a response before merging
    • Travis is only free on public repositories, so these tests will not work in the future
  • Create post-processor framework

    • regex completed for text aliases and Twitter handles in the domain crawler output
    • waiting on Danhua's output, as well as extraction of Twitter handles using the @ pattern, in order to do the following (see the sketch after this list):
      • create linkages between references in the Twitter output and the domain output data
      • create JSON & CSV output that contains the top out-of-scope Twitter handles and domains
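
A sketch of the @-pattern handle extraction and the top-handles report described above; the article shape, the in-scope handle set, and the output file names are assumed, not the post-processor's real formats, and the same counting idea would extend to out-of-scope domains.

```typescript
import { promises as fs } from "fs";

// Extract @handles from article text (Twitter handles are 1-15 word characters).
function extractHandles(text: string): string[] {
  return (text.match(/@(\w{1,15})/g) ?? []).map((h) => h.slice(1).toLowerCase());
}

// Count out-of-scope handles across crawled articles and write the top N to
// JSON and CSV.
async function reportTopHandles(
  articles: { text: string }[],
  inScopeHandles: Set<string>,
  topN = 20
): Promise<void> {
  const counts = new Map<string, number>();
  for (const article of articles) {
    for (const handle of extractHandles(article.text)) {
      if (!inScopeHandles.has(handle)) {
        counts.set(handle, (counts.get(handle) ?? 0) + 1);
      }
    }
  }
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topN);
  await fs.writeFile("top_handles.json", JSON.stringify(Object.fromEntries(top), null, 2));
  const csv = ["handle,count", ...top.map(([h, c]) => `${h},${c}`)].join("\n");
  await fs.writeFile("top_handles.csv", csv);
}
```
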
  • Modification of crawler to gather plain text of crawled articles

    • for each link crawled, the URL of the page it was found on is stored with it, so every URL can be traced back to the page where it was discovered
    • regex implemented to store the domain name rather than the full URL (see the sketch below)
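
A small sketch of the found-on tracking and domain extraction described above, using Node's URL parser instead of a regex for the domain step; the record shape is illustrative.

```typescript
// Each discovered link keeps a pointer back to the page it was found on, and
// only the domain of the link is stored rather than the full URL.
interface LinkRecord {
  foundOn: string; // URL of the page the link was discovered on
  domain: string;  // domain of the linked URL
}

function recordLinks(pageUrl: string, links: string[]): LinkRecord[] {
  return links.flatMap((link) => {
    try {
      const domain = new URL(link).hostname.replace(/^www\./, "");
      return [{ foundOn: pageUrl, domain }];
    } catch {
      return []; // skip malformed URLs
    }
  });
}
```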