# Crawler poster
- Uses the Twitter REST API
- Twitter's rate limit is 150 queries per hour, so the crawler is designed to minimize the number of queries it makes.
- We also rotate requests through proxies to work around the limit, at the cost of slower crawling (see the proxy sketch after this list).
- Data is fetched as XML and processed using SAX, which is fast and memory-efficient (see the parser sketch after this list).
- Designed to run in multiple threads or on multiple machines.
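
A minimal sketch of the proxy idea, assuming a round-robin rotation over a pool of plain HTTP proxies; the class name and the rotation policy are illustrative, not taken from the project:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: cycles through a pool of HTTP proxies so that each
 * request is charged against a different proxy's 150-queries/hour budget.
 */
public class ProxyRotator {
    private final List<Proxy> pool = new ArrayList<>();
    private int next = 0;

    public void add(String host, int port) {
        pool.add(new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port)));
    }

    /** Opens the given URL through the next proxy in round-robin order. */
    public synchronized HttpURLConnection open(URL url) throws IOException {
        Proxy proxy = pool.get(next);
        next = (next + 1) % pool.size();
        return (HttpURLConnection) url.openConnection(proxy);
    }
}
```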
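
And a minimal SAX sketch of the parsing step, assuming the old Twitter REST v1 XML layout where each tweet body sits in a `<text>` element; the mention/hashtag regexes are deliberately simplified:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Streams a <statuses> timeline and collects the @mentions and #hashtags
 * found in each <text> element, without building a DOM in memory.
 */
public class TweetHandler extends DefaultHandler {
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    private final StringBuilder buffer = new StringBuilder();
    private boolean inText = false;
    public final List<String> mentions = new ArrayList<>();
    public final List<String> hashtags = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("text".equals(qName)) {
            inText = true;
            buffer.setLength(0); // start collecting a new tweet body
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("text".equals(qName)) {
            inText = false;
            Matcher m = MENTION.matcher(buffer);
            while (m.find()) mentions.add(m.group(1));
            m = HASHTAG.matcher(buffer);
            while (m.find()) hashtags.add(m.group(1));
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TweetHandler handler = new TweetHandler();
        parser.parse(new File(args[0]), handler); // e.g. a saved timeline dump
        System.out.println("mentions: " + handler.mentions);
        System.out.println("hashtags: " + handler.hashtags);
    }
}
```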
- How does it work? (sketched in code below)
  1. Start with a set of seed users and put them in a queue.
  2. Take the first user off the queue and crawl their latest tweets and their friends.
  3. Parse the tweets, looking for mentions and hashtags.
  4. Add the friends, the mentioned users, and the user itself back to the queue.
  5. Send the data to the Solr and Ranker modules.
  6. GOTO 2.
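
Put together, the loop looks roughly like this; `Tweet`, `TwitterApi`, and `Sink` are hypothetical stand-ins for the project's actual REST client, Solr feeder, and Ranker interface:

```java
import java.util.List;
import java.util.Queue;

/** A minimal sketch of the crawl loop above, under the assumptions stated. */
public class CrawlLoop {
    record Tweet(long author, String text, List<Long> mentions) {}

    interface TwitterApi {
        List<Tweet> latestTweets(long userId);
        List<Long> friends(long userId);
    }

    interface Sink { // stands in for both the Solr and Ranker modules
        void accept(long userId, List<Tweet> tweets, List<Long> friends);
    }

    public static void crawl(Queue<Long> queue, TwitterApi api, Sink... sinks) {
        while (!queue.isEmpty()) {
            long user = queue.poll();                    // 2. next user in queue
            List<Tweet> tweets = api.latestTweets(user); //    crawl latest tweets
            List<Long> friends = api.friends(user);      //    and friends
            for (Tweet t : tweets)
                queue.addAll(t.mentions());              // 3-4. enqueue mentioned users
            queue.addAll(friends);                       // 4. enqueue friends
            queue.add(user);                             // 4. re-enqueue the user itself,
                                                         //    so crawling is continuous
            for (Sink s : sinks)
                s.accept(user, tweets, friends);         // 5. Solr + Ranker
        }                                                // 6. GOTO 2
    }
}
```

Because the current user is always re-enqueued, the queue never drains and the loop runs indefinitely; a production crawler would presumably also deduplicate queue entries, which is omitted here for brevity.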