Tutorials - ez-apis/ez-web-scraping GitHub Wiki
- Installation
- Scrape a website without authentication
- Scrape a website with authentication
- Things to know
- Go to GitHub and copy/paste (or download) EzWebScraping.py from the release page. Then do the same for requirements.txt.
- Open a terminal in the folder where requirements.txt is located, and type:
pip install -r requirements.txt
- Now that you have the required libraries, you can put EzWebScraping.py in your project and import it:
from EzWebScraping import EzWebScraping
- Create an instance of EzWebScraping:
ez_scraper = EzWebScraping()
- Now, you can connect to a website of your choice (it returns False if it fails):
ez_scraper.connect('https://github.com/')
- You can scrape the content of the website with your favorite library (mine is BeautifulSoup):
from bs4 import BeautifulSoup
bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")
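Once you have the page's HTML, extracting data from it no longer depends on EzWebScraping. As a rough illustration, here is a minimal sketch that collects every link from an HTML string using only the standard library's html.parser (swapped in for BeautifulSoup so that the snippet has no dependencies). The names LinkCollector, extract_links and SAMPLE_HTML are made up for this example, and it assumes that get_html_page() returns the HTML as a string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the href values of all <a> tags in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample standing in for the string returned by get_html_page()
# (assumption: that method returns the page as a str).
SAMPLE_HTML = """
<html><body>
  <a href="/login">Sign in</a>
  <a href="https://github.com/features">Features</a>
  <a>No link here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
```

In a real script, you would pass the page's HTML instead of SAMPLE_HTML; with BeautifulSoup, the equivalent is a loop over bs_scraper.find_all("a").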
Before doing anything, you need to find some specific information on the website you want to scrape. For this tutorial, we will use https://github.com/. My browser's display language is French; sorry about that (it does not really matter, though).
- First of all, find the login page. For GitHub, it is https://github.com/login.
- Now, right-click on the username field and click on "Inspect". It should open the developer tools window.
- A few things interest us in the <form> tag:
  - (1) The action attribute of <form>,
  - (2) The method attribute of <form>,
  - (3) The name attribute of something like "authenticity_token" or "csrfmiddlewaretoken" (it could be different, depending on the website's implementation). Its value is a random set of letters and numbers. However, some websites do not require it (in that case, you can try to authenticate without the auth_token_name keyword),
  - (4) The name attribute of the username input,
  - (5) The name attribute of the password input.
- With (4) and (5), we can create a payload variable (which is a dictionary):
payload = {
    "login": "my_login",        # <username-input-name>: <your-username>,
    "password": "my_password"   # <password-input-name>: <your-password>
}
- With (1), we can replace the login URL with the URL where the login is actually performed. In this case, it is https://github.com/session instead of https://github.com/login, because action="/session".
- With (2) and (3), we can finally call the connect() function with the right values (it returns False if it fails):
ez_scraper.connect('https://github.com/session',
                   payload=payload,
                   auth_token_name="authenticity_token")
- You can scrape the content of the website with your favorite library (mine is BeautifulSoup):
bs_scraper = BeautifulSoup(ez_scraper.get_html_page(), "html.parser")
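The manual inspection in steps (1) to (5) can also be approximated in code. The following is only a sketch, written with the standard library rather than EzWebScraping: it reads the first <form> of a login page, resolves its action against the login page URL, and lists the input names. LoginFormInspector, inspect_login_form and SAMPLE_LOGIN_HTML are hypothetical names, and the sample markup is a simplified stand-in, not GitHub's actual login page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormInspector(HTMLParser):
    """Record the first <form>'s action/method and the inputs inside it."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.inputs = {}  # input name -> input type
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            # HTML forms default to GET when no method is given.
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.inputs[attrs["name"]] = (attrs.get("type") or "text").lower()

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def inspect_login_form(login_page_url, html):
    """Return the pieces of information numbered (1) to (5) in the tutorial."""
    inspector = LoginFormInspector()
    inspector.feed(html)
    inspector.close()
    names = inspector.inputs
    return {
        # (1) action, resolved against the URL of the login page
        "post_url": urljoin(login_page_url, inspector.action),
        # (2) method
        "method": inspector.method,
        # candidates for (3): hidden inputs such as a CSRF token
        "hidden_fields": [n for n, t in names.items() if t == "hidden"],
        # candidates for (4) and (5): fields the user fills in
        "visible_fields": [n for n, t in names.items()
                           if t not in ("hidden", "submit")],
    }


# Simplified, made-up markup (NOT the real GitHub login page).
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

info = inspect_login_form("https://github.com/login", SAMPLE_LOGIN_HTML)
```

Here info["post_url"] is the URL to pass to connect(), and the visible field names are the keys to use in the payload dictionary. Real pages can contain several forms or build them with JavaScript, so checking the result against the developer tools is still a good idea.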
- After logging in, you may have been redirected. To find out the current URL, you can use ez_scraper.get_url().
- If, for some reason, it does not work, use a logger to troubleshoot the issue. Here is an example:
import time
import logging
def main():
    # Your code...
    pass  # placeholder so that the example runs as-is
if __name__ == '__main__':
    # Configure the root logger to print every message, DEBUG and above.
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s ' +
                                  '-- %(levelname)s ' +
                                  '-- [%(filename)s:%(lineno)s ' +
                                  '-- %(funcName)s() ] ' +
                                  '-- %(message)s')
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.DEBUG)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
    main()
- If you are experienced, you can access the session object, or even the response object, from the last connect() call. To do so, use ez_scraper.get_session() and ez_scraper.get_response() respectively.
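For reference, the logger setup from the troubleshooting tip can also be written more compactly with logging.basicConfig. This is just one possible sketch: configure_debug_logging is a made-up helper name, the log message is invented, and the StringIO buffer is only there to show what a formatted line looks like (by default the output goes to stderr).

```python
import io
import logging

# Same format string as in the example above, written as one literal.
LOG_FORMAT = ("%(asctime)s -- %(levelname)s "
              "-- [%(filename)s:%(lineno)s -- %(funcName)s() ] "
              "-- %(message)s")


def configure_debug_logging(stream=None):
    """Send DEBUG-and-above records from every logger to the given stream.

    With stream=None, records go to stderr, as with a bare StreamHandler.
    """
    # force=True (Python 3.8+) removes handlers left by earlier configuration.
    logging.basicConfig(level=logging.DEBUG, format=LOG_FORMAT,
                        stream=stream, force=True)


# Demonstration: capture one record in memory instead of printing it.
buffer = io.StringIO()
configure_debug_logging(stream=buffer)
logging.getLogger("demo").debug("connect() returned False")
captured = buffer.getvalue()
```

In a script, calling configure_debug_logging() once at the top of the `if __name__ == '__main__':` block, before main(), is enough.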
There are some other advanced features that are not described here because they are only useful in specific cases. Take a look at the EzWebScraping class section above for details.