Tutorials - ez-apis/ez-web-scraping GitHub Wiki

1. Installation

  • Go to GitHub and copy/paste (or download) EzWebScraping.py from the release page. Then do the same for requirements.txt.
  • Open a terminal in the folder that contains requirements.txt, and run:
pip install -r requirements.txt
  • Now that the required libraries are installed, put EzWebScraping.py in your project and import it:
from EzWebScraping import EzWebScraping

2. Scrape a website without authentication

  • Create an instance of EzWebScraping:
ez_scraper = EzWebScraping()
  • Now connect to a website of your choice (connect() returns False if it failed):
ez_scraper.connect('https://github.com/')
  • You can then parse the content of the page with your favorite library (mine is BeautifulSoup):
from bs4 import BeautifulSoup
bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")
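Because connect() signals failure through its return value, it can help to check that value before parsing anything. The sketch below is one possible way to do that. It only assumes that the scraper object exposes connect(url) and get_html_page() as used in this tutorial. The names fetch_html, ConnectionFailed and FakeScraper are hypothetical and not part of EzWebScraping. FakeScraper is just a stand-in so the example runs offline.

```python
class ConnectionFailed(RuntimeError):
    """Raised when the scraper reports that connect() failed."""


def fetch_html(scraper, url):
    """Connect to `url` and return the page HTML, or raise on failure.

    `scraper` is any object exposing connect(url) and get_html_page(),
    such as an EzWebScraping instance (assumption based on this tutorial).
    """
    if not scraper.connect(url):
        raise ConnectionFailed(f"Could not connect to {url}")
    return scraper.get_html_page()


class FakeScraper:
    """Hypothetical stand-in for EzWebScraping, so the sketch runs offline."""

    def __init__(self, pages):
        self._pages = pages  # maps URL -> HTML string
        self._html = None

    def connect(self, url):
        # Mimic the documented behaviour: falsy result when the page is unavailable.
        self._html = self._pages.get(url)
        return self._html is not None

    def get_html_page(self):
        return self._html


if __name__ == "__main__":
    scraper = FakeScraper({"https://example.com/": "<h1>Hello</h1>"})
    print(fetch_html(scraper, "https://example.com/"))
```

With the real class, the call would look like fetch_html(ez_scraper, 'https://github.com/'), and the returned string can be handed to BeautifulSoup as before.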

3. Scrape a website with authentication

Before doing anything, you need to find some specific information on the website's login page. For this tutorial, we will use https://github.com/. The screenshots show a browser set to French, which does not affect the steps.

illustration

  • Right-click on the username field and click "Inspect". This should open the browser's developer tools.

illustration

  • A few things in the <form> tag interest us:
    • (1) The action attribute of <form>,
    • (2) The method attribute of <form>,
    • (3) The name attribute of the token input, typically something like "authenticity_token" or "csrfmiddlewaretoken" (the exact name depends on the website's implementation). Its value is a random string of letters and numbers. Some websites do not require a token, in which case you can try to authenticate without the auth_token_name keyword,
    • (4) The name attribute of the username input,
    • (5) The name attribute of the password input.

illustration
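The same five pieces of information can also be read programmatically instead of through the developer tools. The following is a minimal sketch using only the standard library's html.parser. The class name LoginFormParser and the SAMPLE_LOGIN_PAGE markup are hypothetical: the sample is a simplified form written for illustration, not GitHub's actual page source.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action, method and input names of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.inputs = {}        # input name -> (type, value)
        self._in_form = False
        self._done = False      # True once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done and not self._in_form:
            self._in_form = True
            self.action = attrs.get("action")
            # HTML forms default to GET when no method is given.
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and "name" in attrs:
            self.inputs[attrs["name"]] = (attrs.get("type", "text"),
                                          attrs.get("value"))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical, simplified login form used only for this example.
SAMPLE_LOGIN_PAGE = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

parser = LoginFormParser()
parser.feed(SAMPLE_LOGIN_PAGE)
print("action:", parser.action)    # item (1)
print("method:", parser.method)    # item (2)
for name, (kind, value) in parser.inputs.items():
    print(name, kind, value)       # items (3), (4) and (5) appear here
```

On a real page, the HTML string would come from ez_scraper.get_html_page() after connecting to the login page, and the names you need are the ones attached to the token, username and password inputs.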

  • With (4) and (5), we can create a payload variable (which is a dictionary):
payload = {
    "login": "my_login",      # <username-input-name>: <your-username>,
    "password": "my_password" # <password-input-name>: <your-password>
}
  • With (1), we can replace the login page URL with the URL where the login is actually performed. In this case, it is https://github.com/session instead of https://github.com/login, because the form has action="/session".

  • With (2) and (3), we can finally call the connect() function with the right values (it returns False if it failed):

ez_scraper.connect('https://github.com/session',
                   payload=payload,
                   auth_token_name="authenticity_token")
  • You can then parse the content of the page with your favorite library (mine is BeautifulSoup):
bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")
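The step that turns the form's action attribute into the URL passed to connect() follows the usual rules for resolving a relative URL against the page it appears on. As a small sketch, the standard library's urllib.parse.urljoin does this resolution. The helper name resolve_login_url is hypothetical and is not part of EzWebScraping.

```python
from urllib.parse import urljoin


def resolve_login_url(login_page_url, form_action):
    """Resolve a form's action attribute against the page that contains it.

    An empty or missing action means the form submits to the page itself.
    """
    return urljoin(login_page_url, form_action or "")


# The GitHub case from this tutorial: action="/session" on the /login page.
print(resolve_login_url("https://github.com/login", "/session"))
```

The result is the URL to use as the first argument of connect(), alongside the payload and, if the site needs it, auth_token_name.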

4. Things to know

  • After logging in, you may have been redirected. To find out the current URL, use ez_scraper.get_url().
  • If something does not work, use a logger to troubleshoot the issue. Here is an example:
import logging

def main():
    # Your code...
    pass

if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s ' +
                                  '-- %(levelname)s ' +
                                  '-- [%(filename)s:%(lineno)s ' +
                                  '-- %(funcName)s() ] ' +
                                  '-- %(message)s')
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.DEBUG)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    main()
  • If you are experienced, you can access the session object or the response object from the last connect() call, using ez_scraper.get_session() and ez_scraper.get_response() respectively.
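For the redirection point in this list, one simple way to detect a redirect is to compare the URL you requested with the one reported afterwards. The helper below is only a sketch: was_redirected is a hypothetical name, and it assumes ez_scraper.get_url() returns the current URL as a string. It compares host and path and ignores a trailing slash.

```python
from urllib.parse import urlsplit


def was_redirected(requested_url, current_url):
    """Return True when the final URL differs from the requested one.

    Only the host and path are compared, and a trailing slash is ignored.
    """
    req = urlsplit(requested_url)
    cur = urlsplit(current_url)
    return (req.netloc.lower(), req.path.rstrip("/")) != \
           (cur.netloc.lower(), cur.path.rstrip("/"))


# With the real class, current_url would come from ez_scraper.get_url().
print(was_redirected("https://github.com/session", "https://github.com/"))
```

Whether a redirect means success or failure depends on the website, so treat the result as a hint to inspect the page rather than as proof that the login worked.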

Some other advanced features are not described here because they are only useful in specific cases. Take a look at the EzWebScraping class section above for details.
