Shengsong managed to get the Twitter API crawler working
will meet with Alejandro to go over which keys to include in output
it may take a few days to coordinate the postprocessor for a combined Twitter and domain crawler output
6,000 tweets retrieved in 1-2 minutes; it is possible to increase speed with multi-processing, since the Twitter API allows each handle to be processed separately (see the sketch below)
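A minimal sketch of what per-handle parallelism could look like, assuming a Node/TypeScript crawler; fetchTimeline, the Tweet shape, and the concurrency limit are placeholders for whatever the crawler already does, not the actual implementation.

```ts
// Placeholder for the existing per-handle Twitter API request; the real call
// and tweet shape live in the crawler, so this is only a stand-in.
interface Tweet {
  id: string;
  text: string;
  created_at: string;
}

async function fetchTimeline(handle: string): Promise<Tweet[]> {
  // ...existing Twitter API request for `handle` goes here...
  return [];
}

// Each handle is independent, so timelines can be fetched concurrently.
// A small worker pool keeps the request rate inside the API limits.
async function crawlHandles(
  handles: string[],
  concurrency = 4,
): Promise<Record<string, Tweet[]>> {
  const results: Record<string, Tweet[]> = {};
  const queue = [...handles];

  const worker = async () => {
    for (let handle = queue.shift(); handle !== undefined; handle = queue.shift()) {
      results[handle] = await fetchTimeline(handle);
    }
  };

  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}

// Usage: crawlHandles(["handle_a", "handle_b"]).then(console.log);
```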
Re-Do al-monitor.com & Benchmarking
60,000 per day
JS heap memory issues were resolved for this crawl
testing new Puppeteer filter code
filter code: selects HTML content to find plain text
tested on 20 domains; it worked without issue on 18, and with a little adjustment Shengsong was able to get the crawler working on the other 2
Google Chrome: inspect the HTML content to see what the selector is
change to processing as a result: the crawler grabs raw_content, and the postprocessor will determine plain text & hyperlinks
Shengsong will delete the code in the domain crawler that attempts to find plain text from the collected HTML, and create a new key called raw_content that will be part of the domain crawler output
Shengsong will add the function of finding plain text from the raw_content key to the postprocessor (see the sketch below)
making v2 of postprocessor
done
will need an update to reflect the new plain-text extraction function
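A rough sketch of the split, assuming a Node/TypeScript pipeline: Puppeteer's page.content() supplies raw_content on the crawler side, and an HTML parser (cheerio here, as one option) derives plain text and hyperlinks in the postprocessor. The record keys other than raw_content, and the library choice, are assumptions rather than the agreed design.

```ts
import puppeteer from "puppeteer";
import * as cheerio from "cheerio";

// Crawler side: keep the rendered HTML under a raw_content key instead of
// extracting plain text during the crawl.
async function crawlPage(url: string) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const raw_content = await page.content(); // full rendered HTML
  await browser.close();
  // retrieved_at is an assumed extra key, not part of the agreed output yet
  return { url, raw_content, retrieved_at: new Date().toISOString() };
}

// Postprocessor side: derive plain text and hyperlinks from raw_content.
function postprocess(record: { url: string; raw_content: string }) {
  const $ = cheerio.load(record.raw_content);
  $("script, style, noscript").remove(); // drop non-text elements
  const plain_text = $("body").text().replace(/\s+/g, " ").trim();
  const hyperlinks = $("a[href]")
    .map((_, el) => $(el).attr("href"))
    .get();
  return { ...record, plain_text, hyperlinks };
}
```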
adjust the domain crawler setup script to add a variable for heap memory, and document it (see the sketch below):
done
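One way the heap variable could be wired in, assuming the setup script launches a Node process; HEAP_MB, the 4096 default, and crawler.js are hypothetical names, while --max-old-space-size is the real Node flag that sets the JS heap ceiling.

```ts
import { spawn } from "child_process";

// Hypothetical wiring for the new heap-memory variable: HEAP_MB and
// crawler.js are made-up names for this sketch.
const heapMb = process.env.HEAP_MB ?? "4096"; // assumed default of 4 GB
spawn("node", [`--max-old-space-size=${heapMb}`, "crawler.js"], {
  stdio: "inherit",
});
```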
Tickets:
readability: done
JS Heap Memory: done
Action Items
Twitter API: finalize crawl
need a list of the JSON outputted from the crawler, with keys, and other documentation as mentioned in the Padlet (see the example sketch after this list)
get started on the Puppeteer update: target is 2.2, we have 1.5; will take 2-3 days
once readability is stripped from the domain crawler and the crawler is updated, run a small domain crawl
Alejandro will provide domain URLs for 5 smaller domains
move plain-text extraction to the postprocessor (as described above)
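A hypothetical shape for one crawler output record, just to anchor the documentation task; apart from raw_content, the key names below are placeholders until the list is agreed with Alejandro.

```ts
// Hypothetical shape for one crawler output record; apart from raw_content,
// the keys are placeholders, not the agreed schema.
interface CrawlRecord {
  url: string;           // page that was crawled
  raw_content: string;   // rendered HTML captured by the domain crawler
  retrieved_at: string;  // ISO timestamp of the crawl
  plain_text?: string;   // added by the postprocessor
  hyperlinks?: string[]; // added by the postprocessor
}
```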
Backburner
Benchmarking
finish documenting where different data are on our server