Architecture
The following diagrams describe the C4 model building blocks of unlp-dbd-newsler.
In the System Context diagram, two external systems and one user interact with the system: the goal is to collect information from news portals and correlate it with Twitter messages related to a specific news title.
Twitter is accessed through its REST API (free-tier access), which is constrained by Twitter's rate limits.
News sites will be scraped using single-threaded spiders that capture the main news titles, descriptions and share links.
At the Container level, four subsystems can be identified:

- News-crawler will generate single-threaded spiders to access news portals and extract news information.
- Twitter-crawler will access the Twitter API and fetch tweets related to the scraped news.
- Logstash will collect the output of both crawlers and store it in the Elasticsearch database (a handoff sketch follows this list).
- Kibana will help end users browse the information, perform searches and generate reports.
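The wiki does not show how the crawlers hand their items to Logstash. A minimal sketch, assuming a Logstash TCP input with a JSON-lines codec, could look like the following; the host, port and field names are illustrative placeholders, not taken from the repository:

```python
import json
import socket

# Hypothetical Logstash endpoint, assuming a tcp input with a json_lines codec
LOGSTASH_HOST = "logstash"
LOGSTASH_PORT = 5000


def ship_to_logstash(item: dict) -> None:
    """Send a single crawled item to Logstash as one JSON line."""
    payload = (json.dumps(item) + "\n").encode("utf-8")
    with socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT)) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Illustrative item shape; the real fields depend on the crawlers' pipelines
    ship_to_logstash({
        "source": "news-crawler",
        "title": "Example headline",
        "description": "Example description",
        "link": "https://example.com/article",
    })
```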
At a lower level, the Twitter-crawler component can access Twitter in two different ways:
- By periodically polling the latest tweets that match a specific query.
- By streaming tweets in real time for specific keywords (see the sketch after this list).
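A minimal sketch of both access modes, assuming Tweepy 3.x as the REST client (the page does not name a library, and the credentials, query and keywords are placeholders). `wait_on_rate_limit=True` makes the client back off when the rate limits mentioned above are hit:

```python
import tweepy

# Placeholder credentials; real values come from the Twitter developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # respect Twitter rate limits


def poll_tweets(query: str, count: int = 50):
    """Polling mode: fetch the latest tweets matching a news title."""
    return api.search(q=query, count=count, tweet_mode="extended")


class KeywordListener(tweepy.StreamListener):
    """Streaming mode: tweets matching the tracked keywords arrive here."""

    def on_status(self, status):
        print(status.id, status.text)


def stream_tweets(keywords):
    """Open a real-time stream filtered by keywords (blocks the caller)."""
    stream = tweepy.Stream(auth=api.auth, listener=KeywordListener())
    stream.filter(track=keywords)
```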
Sentiment analysis is performed on each tweet using the Natural Language Toolkit (NLTK) package, in order to obtain polarity, subjectivity and a final label (positive, negative or neutral).
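A sketch of one possible NLTK approach is shown below, using the VADER analyzer. The choice of analyzer and the label thresholds are assumptions, not confirmed by this page, and VADER itself does not report subjectivity, so only polarity and the label are derived here:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

_sia = SentimentIntensityAnalyzer()


def analyze_tweet(text: str) -> dict:
    """Return polarity scores and a coarse sentiment label for a tweet."""
    scores = _sia.polarity_scores(text)  # keys: 'neg', 'neu', 'pos', 'compound'
    compound = scores["compound"]
    # Thresholds are illustrative; the project may use different cut-offs
    if compound >= 0.05:
        label = "positive"
    elif compound <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    return {"polarity": compound, "label": label, **scores}
```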
News-crawler consists of several spiders that access news sites and extract the main news information, using XPath expressions to locate titles, descriptions and links. Scrapy is used to launch the spiders and collect all the necessary information from the news portals.
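A minimal sketch of one such spider follows; the spider name, start URL and XPath expressions are placeholders rather than the actual selectors used in the repository:

```python
import scrapy


class ExampleNewsSpider(scrapy.Spider):
    """Illustrative spider; name, start_urls and XPaths are placeholders."""

    name = "example_news"
    start_urls = ["https://www.example-news-site.com/"]

    def parse(self, response):
        # Locate each article block with XPath, then extract the item fields
        for article in response.xpath("//article"):
            yield {
                "title": article.xpath(".//h2/a/text()").get(),
                "description": article.xpath(".//p/text()").get(),
                "link": response.urljoin(article.xpath(".//h2/a/@href").get()),
            }
```

A spider like this would be launched with `scrapy crawl example_news`, and the yielded items would then pass through the crawler's item pipelines before being shipped to Logstash.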