Meeting: Thursday, July 2nd, 2015 - UTMediaCAT/projectdocs GitHub Wiki

discussion:

how to deal with duplicate articles - comparison isn't easy
parse URL; tokenize on dashes, underscores, and slashes; compare the path part
look at duplicates from same domain or across different domains?
site count
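
A rough sketch of the URL comparison idea noted above, assuming Python's standard library only. The function names and the use of Jaccard overlap as the similarity score are illustrative assumptions, not something decided at the meeting.

```python
import re
from urllib.parse import urlparse


def path_tokens(url):
    """Split the path part of a URL on slashes, dashes and underscores."""
    path = urlparse(url).path.lower()
    return {tok for tok in re.split(r"[/\-_]+", path) if tok}


def path_similarity(url_a, url_b):
    """Jaccard overlap of the two path token sets, from 0.0 to 1.0."""
    a, b = path_tokens(url_a), path_tokens(url_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_domain(url_a, url_b):
    """True when both URLs share a host (case-insensitive)."""
    return urlparse(url_a).netloc.lower() == urlparse(url_b).netloc.lower()
```

Whether to compare only within one domain or across domains could then be a matter of checking `same_domain` before calling `path_similarity`. The cutoff score for calling two articles duplicates would still need to be chosen.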
It would be wise to submit a bug report to the Newspaper project highlighting the issue we have been facing with their library.
We discussed building a warning system to let the user know when Newspaper has found very few articles and that the plan B crawler should be run.
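
One possible shape for such a warning, as a minimal sketch. The threshold value, the logger name, and the function name are all hypothetical; the article count is assumed to be supplied by whatever code calls Newspaper.

```python
import logging

logger = logging.getLogger("crawler")  # hypothetical logger name

MIN_ARTICLES = 10  # hypothetical threshold, to be tuned per site


def check_article_count(site_url, article_count, minimum=MIN_ARTICLES):
    """Return True if the count looks healthy; otherwise warn and return False."""
    if article_count < minimum:
        logger.warning(
            "Newspaper found only %d article(s) on %s; "
            "consider running the plan B crawler.",
            article_count,
            site_url,
        )
        return False
    return True
```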
We discussed how URL shortening and query strings affect our crawler. Since there is no way to tell that two different URLs point to the same article, the system may save the same article more than once.
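
A minimal sketch of one way to reduce duplicates caused by query strings, assuming that only tracking-style parameters (here, anything starting with `utm_`) are safe to drop. That prefix list and the function name are assumptions. Shortened URLs are not handled here, since resolving them would require following redirects over the network.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)  # assumption: parameters safe to ignore


def canonical_url(url):
    """Normalize a URL so trivially different forms compare equal."""
    parts = urlparse(url)
    # Keep only query parameters that may identify the article.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query.sort()  # parameter order should not matter
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; lowercase the scheme and host.
    return urlunparse(
        (parts.scheme.lower(), parts.netloc.lower(), path, "", urlencode(query), "")
    )
```

The crawler could store `canonical_url(url)` alongside each article and skip any URL whose canonical form has already been saved.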