Crawling - couragesuper/couragesuper-ds GitHub Wiki

Crawling

Frameworks

  • Common Libraries

  • crawler base

  • txt file writer

  • txt file reader

  • txt preprocessor

  • selenium driver

Supported Sites

  • bookcosmos: obtains PDFs
  • joins: keywords
  • YTN
  • Naver ranking news

Crawler details

APIs

  • init : creates the Selenium web driver, the logger, and the txt file writer
  • createlogger : initializes the logger
  • createTxt : initializes the txt file
  • setTxtColumn : sets the list of columns
  • run : starts crawling
  • close : closes the crawling module
  • openpage : opens a page with the web driver
  • login : logs in to the site, if needed
  • makeCateLinks : builds the list of URLs from the site map
  • naviSites : navigates the site using the URL list built from the site map
  • navigate : opens a category, iterates over its pages, and opens the individual articles on each page
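
The API list above can be sketched as a small base class. This is only an illustration of how the methods might fit together, not the repository's actual implementation. The method names come from the list. Everything else is an assumption: the constructor parameters, the tab-separated output format, the `FakeDriver` stand-in (used so the sketch runs without a browser), the `DemoCrawler` subclass, and the `example.com` URLs.

```python
import csv
import logging
from typing import List


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; it only records URLs."""

    def __init__(self) -> None:
        self.visited: List[str] = []

    def get(self, url: str) -> None:
        self.visited.append(url)

    def quit(self) -> None:
        pass


class CrawlerBase:
    def __init__(self, driver, txt_path: str, columns: List[str]) -> None:
        # init: set up the web driver, the logger, and the txt file writer
        self.driver = driver
        self.logger = self.createlogger()
        self.columns: List[str] = []
        self.setTxtColumn(columns)
        self.txt_file, self.writer = self.createTxt(txt_path)

    def createlogger(self) -> logging.Logger:
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(type(self).__name__)

    def setTxtColumn(self, columns: List[str]) -> None:
        self.columns = list(columns)

    def createTxt(self, path: str):
        # Assumed format: tab-separated text with a header row
        f = open(path, "w", encoding="utf-8", newline="")
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(self.columns)
        return f, writer

    def openpage(self, url: str) -> None:
        self.logger.info("open %s", url)
        self.driver.get(url)

    def login(self) -> None:
        """Override for sites that need a login; does nothing by default."""

    def makeCateLinks(self) -> List[str]:
        raise NotImplementedError  # site-specific

    def navigate(self, category_url: str) -> None:
        raise NotImplementedError  # site-specific

    def naviSites(self, links: List[str]) -> None:
        for link in links:
            self.navigate(link)

    def run(self) -> None:
        self.login()
        self.naviSites(self.makeCateLinks())

    def close(self) -> None:
        self.txt_file.close()
        self.driver.quit()


class DemoCrawler(CrawlerBase):
    """Hypothetical site crawler: 2 categories, 2 pages each, 2 articles per page."""

    def makeCateLinks(self) -> List[str]:
        return ["https://example.com/cat/a", "https://example.com/cat/b"]

    def navigate(self, category_url: str) -> None:
        for page in (1, 2):
            page_url = f"{category_url}?page={page}"
            self.openpage(page_url)
            for article in (1, 2):
                article_url = f"{page_url}#article-{article}"
                self.openpage(article_url)
                self.writer.writerow([article_url, "title placeholder"])
```

To try it, construct `DemoCrawler(FakeDriver(), "out.txt", ["url", "title"])`, call `run()`, then `close()`. With a real browser, a Selenium driver such as `webdriver.Chrome()` could replace `FakeDriver`, since the sketch only relies on the driver's `get` and `quit` methods.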