November 03, 2020

  • Ticket Review
  • What should we do with all this extra Twitter Data?
  • Beginning the Crawl on SciNet!

Notes

  • We are delayed on finalizing the scope; the deadline is the end of this week

  • Twitter crawler code

    • Additional columns to be incorporated in the output, e.g. date, hashtag, language, mentions of other Twitter users, retweets_count, likes_count
    • the mentions and URLs have to be treated as links and references
    • we need a way to handle errors and track the crawler's progress; right now, when an error occurs, the thread simply exits
    • in old Mediacat, even if a link was faulty or raised an exception, it was bypassed so the crawl could still continue
    • an exit signal may cause the domain crawler to exit as well; we need a way of keeping track of which handles have been crawled (see the sketch below this list)
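
A minimal sketch of the per-handle error handling and progress tracking discussed above, written in TypeScript for illustration (the Twitter crawler itself may use a different language); `crawlHandle`, the handle list, and the progress file name are all hypothetical.

```typescript
import { promises as fs } from "fs";

// Hypothetical per-handle crawl call; stands in for whatever the crawler actually runs.
declare function crawlHandle(handle: string): Promise<void>;

// Crawl each handle, record which ones succeeded or failed, and never let a
// single failure terminate the whole run.
async function crawlAll(handles: string[]): Promise<void> {
  const progress: Record<string, string> = {};
  for (const handle of handles) {
    try {
      await crawlHandle(handle);
      progress[handle] = "done";
    } catch (err) {
      // Log the failure and move on instead of letting the thread exit.
      progress[handle] = `error: ${(err as Error).message}`;
    }
    // Persist progress after every handle so an interrupted crawl can be resumed.
    await fs.writeFile("crawl_progress.json", JSON.stringify(progress, null, 2));
  }
}
```
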
  • Mediacat domain crawler

    • implemented different regexes for different domains to detect the relevant links on each crawled site
    • accepting CSVs as input
    • test crawl on a sample of 20 pages: grabbed links from each page, filtered out links outside of the domain, and wrote out-of-scope data to a separate JSON file (see the sketch below)
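
A rough sketch of the in-scope vs. out-of-scope link filtering described above, assuming the scope domains come from the CSV input; the generic URL pattern and the example domains are illustrative, not the crawler's actual regexes.

```typescript
// Hypothetical in-scope domains; in the real crawler these come from the scope CSV.
const scopeDomains = new Set(["example.com", "example.org"]);

// Pull absolute URLs out of a page's HTML with a generic pattern, then split
// them by whether their hostname is in scope.
function splitLinks(html: string): { inScope: string[]; outOfScope: string[] } {
  const links = html.match(/https?:\/\/[^\s"'<>]+/g) ?? [];
  const inScope: string[] = [];
  const outOfScope: string[] = [];
  for (const link of links) {
    try {
      const host = new URL(link).hostname.replace(/^www\./, "");
      (scopeDomains.has(host) ? inScope : outOfScope).push(link);
    } catch {
      outOfScope.push(link); // unparsable URLs are treated as out of scope
    }
  }
  return { inScope, outOfScope };
}
```
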
  • Integrate PDF capture

    • completed, with UUIDs generated from the URL of each page (see the sketch below)
    • a PDF representation of pages is more costly, so this may need to be implemented as a separate service optionally run after post-processing (when only interlinked items in scope remain)
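
A sketch of how a URL-derived UUID and an optional PDF pass could fit together, assuming Puppeteer and the uuid package; the actual crawler may use different tooling, and the output directory is hypothetical.

```typescript
import { v5 as uuidv5 } from "uuid";
import puppeteer from "puppeteer";

// The same URL always maps to the same UUID, so the PDF file name is reproducible.
function idForUrl(url: string): string {
  return uuidv5(url, uuidv5.URL);
}

// Optional PDF-capture pass, run separately after post-processing: render the
// page and save it under its UUID. Assumes the pdfs/ directory already exists.
async function capturePdf(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    const path = `pdfs/${idForUrl(url)}.pdf`;
    await page.pdf({ path, format: "A4" });
    return path;
  } finally {
    await browser.close();
  }
}
```
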
  • Review integration of metascraper into domain crawler

    • Causes an issue with async, so it will need to be run as a separate crawl (see the sketch below)
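
One way to keep metascraper out of the main crawl loop is to run it as its own pass over already-crawled URLs, roughly as below; the `got` HTTP client and the specific metascraper plugins shown are assumptions.

```typescript
// Separate metadata pass, kept out of the main crawl to avoid the async issue.
const metascraper = require("metascraper")([
  require("metascraper-date")(),
  require("metascraper-title")(),
]);
const got = require("got");

async function scrapeMetadata(urls: string[]): Promise<Record<string, unknown>> {
  const results: Record<string, unknown> = {};
  for (const url of urls) {
    const { body: html } = await got(url); // fetch the page HTML
    results[url] = await metascraper({ html, url }); // extract date, title, etc.
  }
  return results;
}
```
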
  • Integrate date detection into crawler

    • The resolve/reject issue has been resolved, but detection hangs after receiving many requests (unhandled promise rejection); see the sketch below this list
    • Jacqueline will work with Alex to examine this issue
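
If the hang really is caused by firing too many requests at once, one possible fix is to cap concurrency and catch every rejection, sketched below with the p-limit package; `detectDate` and the limit of 5 are placeholders.

```typescript
import pLimit from "p-limit";

// Hypothetical date-detection call; stands in for the crawler's detector.
declare function detectDate(url: string): Promise<string | null>;

// Cap the number of in-flight detections and catch every rejection so a bad
// request can neither hang the run nor surface as an unhandled promise rejection.
async function detectDates(urls: string[]): Promise<Record<string, string | null>> {
  const limit = pLimit(5); // at most 5 concurrent requests (placeholder value)
  const entries = await Promise.all(
    urls.map((url) =>
      limit(async (): Promise<[string, string | null]> => {
        try {
          return [url, await detectDate(url)];
        } catch {
          return [url, null]; // record "no date found" instead of rejecting
        }
      })
    )
  );
  return Object.fromEntries(entries);
}
```
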
  • Add twitter_crawler compiler

    • Travis tests are not passing (this was before the project was made private)
    • waiting for a response before merging
    • Travis is only free on public repositories, so these tests will not work in the future
  • Create post-processor framework

    • regex completed for text aliases and Twitter handles in the domain crawler output
    • waiting on Danhua's output, as well as extraction of Twitter handles using the @ pattern, in order to do the following (see the sketch after this list):
      • create linkages between references in the Twitter output and the domain output data
      • create JSON & CSV output that contains the top out-of-scope Twitter handles and domains
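
A sketch of the @-pattern handle extraction and the top-handles report described above; the article shape, the in-scope handle set, and the output file names are assumed, not the post-processor's real formats, and the same counting idea would extend to out-of-scope domains.

```typescript
import { promises as fs } from "fs";

// Extract @handles from article text (Twitter handles are 1-15 word characters).
function extractHandles(text: string): string[] {
  return (text.match(/@(\w{1,15})/g) ?? []).map((h) => h.slice(1).toLowerCase());
}

// Count out-of-scope handles across crawled articles and write the top N to
// JSON and CSV.
async function reportTopHandles(
  articles: { text: string }[],
  inScopeHandles: Set<string>,
  topN = 20
): Promise<void> {
  const counts = new Map<string, number>();
  for (const article of articles) {
    for (const handle of extractHandles(article.text)) {
      if (!inScopeHandles.has(handle)) {
        counts.set(handle, (counts.get(handle) ?? 0) + 1);
      }
    }
  }
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topN);
  await fs.writeFile("top_handles.json", JSON.stringify(Object.fromEntries(top), null, 2));
  const csv = ["handle,count", ...top.map(([h, c]) => `${h},${c}`)].join("\n");
  await fs.writeFile("top_handles.csv", csv);
}
```
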
  • Modification of crawler to gather plain text of crawled articles

    • for each link crawled, the URL of the page it was found on is stored with it, so every URL can be traced back to the page where it was discovered
    • regex implemented to store the domain name rather than the full URL (see the sketch below)
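
A small sketch of the found-on tracking and domain extraction described above, using Node's URL parser instead of a regex for the domain step; the record shape is illustrative.

```typescript
// Each discovered link keeps a pointer back to the page it was found on, and
// only the domain of the link is stored rather than the full URL.
interface LinkRecord {
  foundOn: string; // URL of the page the link was discovered on
  domain: string;  // domain of the linked URL
}

function recordLinks(pageUrl: string, links: string[]): LinkRecord[] {
  return links.flatMap((link) => {
    try {
      const domain = new URL(link).hostname.replace(/^www\./, "");
      return [{ foundOn: pageUrl, domain }];
    } catch {
      return []; // skip malformed URLs
    }
  });
}
```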