Tutorials - ez-apis/ez-web-scraping GitHub Wiki

0. Table of contents

Installation
Scrape website without authentication
Scrape website with authentication
Things to know

1. Installation

Go to GitHub and copy (or download) EzWebScraping.py from the release page, then do the same for requirements.txt.
Open a terminal in the folder containing requirements.txt and run:

pip install -r requirements.txt

Now that the required libraries are installed, put EzWebScraping.py in your project and import it:

from EzWebScraping import EzWebScraping

2. Scrape website without authentication

Create an instance of EzWebScraping:

ez_scraper = EzWebScraping()

Now you can connect to a website of your choice (connect() returns False on failure):

ez_scraper.connect('https://github.com/')

You can then scrape the page content with your favorite library (mine is BeautifulSoup):

from bs4 import BeautifulSoup

bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")
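
Because connect() signals failure through its return value, it can help to wrap both calls in a small helper. The sketch below is only an illustration: fetch_html is a hypothetical name, and it assumes nothing about EzWebScraping beyond the connect() and get_html_page() methods used in this section.

```python
def fetch_html(scraper, url):
    """Connect to `url` and return the page HTML.

    `scraper` is expected to be an EzWebScraping instance. This hypothetical
    helper only relies on its connect() and get_html_page() methods.
    """
    # connect() is described as returning a falsy value on failure.
    if not scraper.connect(url):
        raise ConnectionError(f"Could not connect to {url}")
    return scraper.get_html_page()
```

With the library installed, it could be called as fetch_html(EzWebScraping(), 'https://github.com/') and the result passed to BeautifulSoup.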

3. Scrape website with authentication

Before doing anything, you will need to find some specific information on your website's login page. For this tutorial, we will use https://github.com/. My browser's display language is French, but that does not affect the steps.

First of all, find the login page. For GitHub, it is https://github.com/login.

Now right-click the username field and choose "Inspect". This should open the browser's developer tools.

A few things in the <form> tag interest us:
- (1) The action attribute of <form>,
- (2) The method attribute of <form>,
- (3) The name attribute of the token input, something like "authenticity_token" or "csrfmiddlewaretoken" (the exact name depends on the website's implementation). Its value is a random string of letters and numbers. Some websites do not require it, in which case you can try to authenticate without the auth_token_name keyword,
- (4) The name attribute of the username input,
- (5) The name attribute of the password input.
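
If you prefer not to read these values by eye, the five items above can also be pulled out of the login page's HTML. The following is a minimal sketch using only the standard library; LoginFormInspector and SAMPLE_HTML are hypothetical names, and the sample form is a made-up simplification, not GitHub's real markup.

```python
from html.parser import HTMLParser


class LoginFormInspector(HTMLParser):
    """Collect the action, method and input names of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.input_names = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")                      # item (1)
            self.method = (attrs.get("method") or "get").lower()   # item (2)
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.input_names.append(attrs["name"])                 # items (3) to (5)

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


# Simplified, made-up login page used only for illustration.
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

inspector = LoginFormInspector()
inspector.feed(SAMPLE_HTML)
print(inspector.action, inspector.method, inspector.input_names)
```

On a real page, the HTML passed to feed() would be the source of the login page itself.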

With (4) and (5), we can create a payload variable (which is a dictionary):

payload = {
    "login": "my_login",      # <username-input-name>: <your-username>,
    "password": "my_password" # <password-input-name>: <your-password>
}

With (1), we can replace the login page URL with the URL where the login is actually performed. In this case, it is https://github.com/session instead of https://github.com/login, because the form has action="/session".
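
Resolving the action attribute against the login page URL can be done with the standard library, as in this small sketch. The variable names are illustrative only.

```python
from urllib.parse import urljoin

login_page_url = "https://github.com/login"
form_action = "/session"  # value of the form's action attribute, item (1)

# urljoin resolves a relative action against the page URL,
# the same way a browser does when the form is submitted.
session_url = urljoin(login_page_url, form_action)
print(session_url)  # https://github.com/session
```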
With (2) and (3), we can finally call connect() with the right values (it returns False on failure):

ez_scraper.connect('https://github.com/session',
                   payload=payload,
                   auth_token_name="authenticity_token")
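
This tutorial does not describe what connect() does with auth_token_name internally. A common pattern for token-protected forms is to read the token's value from the login page and send it along with the credentials. The sketch below shows that general pattern with the standard library only; it is an assumption about the technique, not the library's actual implementation, and TokenFinder and add_auth_token are hypothetical names.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Remember the value of the first <input> with the wanted name."""

    def __init__(self, token_name):
        super().__init__()
        self.token_name = token_name
        self.token_value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "input" and self.token_value is None
                and attrs.get("name") == self.token_name):
            self.token_value = attrs.get("value")


def add_auth_token(login_page_html, payload, auth_token_name):
    """Return a copy of `payload` extended with the page's token, if found."""
    finder = TokenFinder(auth_token_name)
    finder.feed(login_page_html)
    full_payload = dict(payload)  # leave the caller's dictionary untouched
    if finder.token_value is not None:
        full_payload[auth_token_name] = finder.token_value
    return full_payload


# Made-up login form, for illustration only.
html = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
payload = {"login": "my_login", "password": "my_password"}
full_payload = add_auth_token(html, payload, "authenticity_token")
print(full_payload)
```

The token usually changes on every page load, so it has to be read from a freshly fetched login page rather than hard-coded.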

You can then scrape the page content with your favorite library (mine is BeautifulSoup):

bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")

4. Things to know

After logging in, you may have been redirected. To find out the current URL, use ez_scraper.get_url().
If something does not work, use a logger to troubleshoot the issue. Here is an example:

import time
import logging

def main():
    # Your code...
    pass

if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s ' +
                                  '-- %(levelname)s ' +
                                  '-- [%(filename)s:%(lineno)s ' +
                                  '-- %(funcName)s() ] ' +
                                  '-- %(message)s')
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.DEBUG)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    main()

If you are experienced, you can access the session object, or even the response object from the last connect() call, with ez_scraper.get_session() and ez_scraper.get_response() respectively.
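
As a troubleshooting aid, these accessors can be combined into a short summary of the last request. This is only a sketch: describe_last_request is a hypothetical helper, and it assumes the response object exposes a status_code attribute in the style of requests.Response, which this tutorial does not confirm.

```python
def describe_last_request(scraper):
    """Summarize the last connect() call of an EzWebScraping-like object."""
    response = scraper.get_response()
    return {
        "url": scraper.get_url(),
        # Assumption: the response has a status_code attribute.
        # getattr falls back to None if it does not.
        "status_code": getattr(response, "status_code", None),
    }
```

Logging the returned dictionary after each connect() call makes unexpected redirects easier to spot.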

Some other advanced features are not described here because they are only useful in specific cases. Take a look at the EzWebScraping class section for details.