How I Approached Building This App Using Streamlit and Curler

Introduction

In this documentation, I will guide you through the process of creating a web scraping application using Python and the Streamlit library. The goal of this project is to scrape real estate data from 99acres.com using their API. I'll share my experiences and the steps I took to create this application, from identifying the API to overcoming challenges like IP blocking. Please note that some specifics, such as the full details of replicating API requests, have been omitted for brevity but may be added in future updates to this documentation.

Table of Contents

  1. Identifying the 99acres API
  2. Replicating API Requests
  3. Data Filtering and Selection
  4. Creating a Streamlit Web App
  5. Overcoming IP Address Blocking
  6. Curl Command Parsing with Curler
  7. Storing and Retrieving Curl Commands
  8. Handling API Limitations
  9. Conclusion

1. Identifying the 99acres API

To begin, I discovered that 99acres.com loads its property listings through an API, which I found in the "Network" tab of the browser's Developer Tools while inspecting the website. This API was instrumental in gathering property data for the project.

2. Replicating API Requests

Once the API was identified, my next challenge was to replicate the API requests using Python. By examining the network traffic in the browser's Developer Tools, I found the GET requests responsible for fetching new property data. Although I can't provide all the details here, I followed a specific pattern to replicate these requests programmatically.
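Below is a minimal sketch of what such a request looks like. The endpoint URL, query parameters, and city code are placeholders I've invented for illustration; the real values come from the request captured in the Network tab.

```python
import requests

# Placeholder endpoint and parameters for illustration only; the real
# values come from the request captured in the browser's Network tab.
API_URL = "https://www.99acres.com/api/v1/srp/search"  # hypothetical URL

params = {
    "city": 8,          # hypothetical city code (e.g. Gurgaon)
    "page": 1,
    "page_size": 50,
}
headers = {
    # Mimic a real browser so the request looks like ordinary traffic.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.99acres.com/",
}

response = requests.get(API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()  # the property listings arrive as JSON
```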

3. Data Filtering and Selection

The API provides a wealth of data, but not all of it is relevant to our project. Therefore, the next step was to filter and select the most important data for our application. This data will be used to display real estate listings in the Streamlit web app.
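As a rough sketch, the filtering boils down to keeping a whitelist of keys from each raw record. The field names below are assumptions; the actual keys depend on the JSON the 99acres API returns.

```python
# Field names are illustrative; the actual keys depend on the API's JSON.
WANTED_FIELDS = ["PROP_ID", "PROPERTY_TYPE", "CITY", "PRICE", "AREA"]

def filter_properties(raw_properties: list[dict]) -> list[dict]:
    """Keep only the fields we care about from each raw property record."""
    return [
        {field: prop.get(field) for field in WANTED_FIELDS}
        for prop in raw_properties
    ]
```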

4. Creating a Streamlit Web App

To present the scraped data in an accessible and user-friendly manner, I opted for the Streamlit library. Streamlit simplifies the process of creating a web application using Python. With Streamlit, I designed a web interface that allows users to make multiple requests for property data.
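The sketch below shows the general shape of such an app rather than the project's actual code; scrape_pages() is a hypothetical stand-in for the fetching logic covered in sections 2 and 3.

```python
import pandas as pd
import streamlit as st

def scrape_pages(n: int) -> list[dict]:
    """Hypothetical stand-in for the real fetching logic (sections 2-3)."""
    return [{"page": i, "note": "replace with real API data"} for i in range(1, n + 1)]

st.title("99acres Property Scraper")
num_pages = st.number_input("Pages to scrape", min_value=1, max_value=100, value=5)

if st.button("Scrape"):
    df = pd.DataFrame(scrape_pages(int(num_pages)))
    st.dataframe(df)
    st.download_button("Download CSV", df.to_csv(index=False), "properties.csv")
```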

5. Overcoming IP Address Blocking

While developing the app, I encountered a significant challenge: IP address blocking. Initially, I could fetch data for approximately 10,000 properties in Gurgaon, but subsequent requests were blocked. To address this, I experimented with various workarounds, including adding a user-agent and additional headers like referer, auth_keys, and cookies, but these fixes proved to be only temporary.
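For reference, here is the kind of request decoration I mean. All values below are placeholders; in practice they were copied from a live browser session, which is also why the fix stopped working once the session expired.

```python
import requests

API_URL = "https://www.99acres.com/api/v1/srp/search"  # hypothetical URL

# All values below are placeholders; session cookies and auth keys expire,
# which is why this approach was only a temporary fix.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Referer": "https://www.99acres.com/",
    "Accept": "application/json",
}
cookies = {
    "sessionid": "<copied-from-browser>",
}

response = requests.get(API_URL, headers=headers, cookies=cookies, timeout=30)
```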

6. Curl Command Parsing with Curler

In search of a more robust solution, I looked for a library to parse cURL commands, which can be copied straight from the browser's Developer Tools ("Copy as cURL"). Unfortunately, I found no reliable and lightweight library for this purpose. To resolve this, I forked the uncurl package, refactored it, and published it on PyPI as curler. This library allowed me to parse cURL commands and use them in Python.
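As a rough sketch of the workflow (the exact function and attribute names are my assumption based on curler's uncurl lineage; check curler's README for the actual API):

```python
import requests
from curler import parse_curl  # assumed import; see curler's README

# A (truncated) command copied from the browser via "Copy as cURL".
curl_command = r"""curl 'https://www.99acres.com/...' \
  -H 'user-agent: Mozilla/5.0' \
  -H 'referer: https://www.99acres.com/'"""

parsed = parse_curl(curl_command)

# Re-issue the same request from Python using the parsed pieces; the
# attribute names (method, url, headers, cookies) mirror uncurl's parsed
# context and are assumptions here.
response = requests.request(
    parsed.method,
    parsed.url,
    headers=parsed.headers,
    cookies=parsed.cookies,
)
```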

7. Storing and Retrieving Curl Commands

With the curler library in place, I created a page where users could submit the latest cURL command. This command was then stored in a JSON file, from which request parameters could be loaded for making multiple requests. This approach helped bypass IP address blocking.
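A minimal sketch of that store-and-load cycle, assuming a simple single-key JSON file (the file name and shape are illustrative, not the project's actual layout):

```python
import json
from pathlib import Path

CURL_STORE = Path("curl_command.json")  # illustrative file name

def save_curl_command(command: str) -> None:
    """Persist the user-submitted cURL command for later requests."""
    CURL_STORE.write_text(json.dumps({"curl_command": command}))

def load_curl_command() -> str:
    """Load the most recently stored cURL command."""
    return json.loads(CURL_STORE.read_text())["curl_command"]
```

In the Streamlit page itself, the submitted command would be passed to save_curl_command(), and each scraping run would start with load_curl_command() followed by a parse.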

8. Handling API Limitations

During the development process, a peculiar challenge emerged: the API limits how many properties can be fetched in a single request. Initially, I could fetch up to 1,200 properties with one request, but this limit was later reduced to 50 properties per request. Despite this limitation, the application remains functional and capable of scraping large amounts of data without getting blocked by the website.
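Working within the cap simply means paginating. A sketch, where the endpoint, parameter names, and response key are all assumptions:

```python
import requests

API_URL = "https://www.99acres.com/api/v1/srp/search"  # hypothetical URL

def fetch_all(total_wanted: int, page_size: int = 50) -> list[dict]:
    """Fetch `total_wanted` properties in chunks of `page_size` (the API cap)."""
    properties: list[dict] = []
    page = 1
    while len(properties) < total_wanted:
        resp = requests.get(
            API_URL,
            # Parameter names are assumptions for illustration.
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("properties", [])  # response key is an assumption
        if not batch:
            break  # no more results available
        properties.extend(batch)
        page += 1
    return properties[:total_wanted]
```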

9. Conclusion

In conclusion, this project involved several challenges, from identifying the API and replicating requests to overcoming IP address blocking and handling API limitations. With Streamlit and the curler library, I was able to build a web scraping application that efficiently gathers real estate data from 99acres.com. Feel free to refer to this documentation as you embark on your own web scraping adventures.