Meeting: Thursday, July 2nd, 2015 - UTMediaCAT/projectdocs GitHub Wiki

discussion:

  • how to deal with duplicate articles - comparison isn't easy

  • parse the URL, tokenize on dashes, underscores, and slashes, and compare the path part

  • look at duplicates from same domain or across different domains?

  • site count

  • It would be wise to submit a bug report to Newspaper highlighting the issue we have been facing with their library

  • We discussed building a warning system to let the user know when Newspaper found very few articles and that the plan B crawler should be run.

  • We discussed how URL shortening and query strings affect our crawler. Since the system has no way of telling that two different URLs point to the same article, it may save the same article more than once.
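
A rough sketch of the URL comparison idea discussed above, not a settled design. It assumes Python's standard `urllib.parse`; the function names, the `TRACKING_PARAMS` list, and the use of Jaccard similarity on path tokens are all illustrative assumptions and were not decided in the meeting.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical set of query parameters treated as tracking noise (assumption).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content"}


def normalize_url(url):
    """Return a comparison key for a URL: lowercased host without 'www.',
    path without a trailing slash, and sorted non-tracking query parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in TRACKING_PARAMS)
    query = urlencode(kept)
    return host + path + ("?" + query if query else "")


def path_tokens(url):
    """Split the path part on dashes, underscores, and slashes."""
    path = urlsplit(url).path.lower()
    return {token for token in re.split(r"[-_/]+", path) if token}


def path_similarity(url_a, url_b):
    """Jaccard similarity of the two path token sets (0.0 to 1.0)."""
    a, b = path_tokens(url_a), path_tokens(url_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

`normalize_url` could be used as the key when checking whether an article is already stored, while `path_similarity` only looks at the path and so could flag possible duplicates across different domains. Shortened URLs would still need their redirects followed before either function is applied, since the short form carries no information about the target.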