Getting started - StackUnderflowProject/Scraper GitHub Wiki

Dependencies

  • Skrape{it} Skrapeit » 1.1.5

    • A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered).
    • It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL.
    • First and foremost it aims to be a testing lib, but it can also be used to scrape websites in a convenient fashion.
  • Skrape{it} HTTP Fetcher » 1.1.5

    • Used in combination with Skrape{it} to fetch and scrape server-side rendered (SSR) web pages.
  • Selenium Java » 4.20.0

    • Selenium provides support for the automation of web browsers.
    • It provides extensions to emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification.
    • Used to fetch and scrape dynamic web pages.
  • MongoDB Driver » 5.1.0

    • The MongoDB synchronous driver.
    • Used for its ObjectId implementation.
  • Gson » 2.10.1

    • Gson is a Java library that can be used to convert Java Objects into their JSON representation.
    • It can also be used to convert a JSON string to an equivalent Java object.
  • Google geocoding API

    • Used to convert addresses to geographic coordinates.
    • Requires your own API key.
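
As a rough illustration of the address-to-coordinates lookup, the sketch below queries the Geocoding API's JSON endpoint with curl. The function name `geocode` and the `GOOGLE_API_KEY` environment variable are assumptions made for this example; how the scraper itself reads the key is not described here.

```shell
# Hypothetical helper: look up an address with the Google Geocoding API.
# GOOGLE_API_KEY is an assumed variable name, not one defined by the project.
geocode() {
    if [ -z "${GOOGLE_API_KEY:-}" ]; then
        echo "GOOGLE_API_KEY is not set" >&2
        return 1
    fi
    # -G sends the url-encoded data as a query string on a GET request.
    curl -sG "https://maps.googleapis.com/maps/api/geocode/json" \
        --data-urlencode "address=$1" \
        --data-urlencode "key=$GOOGLE_API_KEY"
}

# Usage: geocode "some street address"
```

On success, the coordinates are in the `results[0].geometry.location` object of the JSON response, under `lat` and `lng`.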

Follow the steps below to set up and run the project on your local machine.

Tools

Prerequisites

Before you begin, ensure you have the following installed on your system:
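
One way to check this from a terminal is sketched below. It assumes the relevant tools are git (used in step 1) and a Java runtime (which the Gradle wrapper in step 3 needs to run); `require_tools` is a hypothetical helper name, not part of the project.

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
# Returns 0 if all tools were found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "not found: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: require_tools git java
```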

Steps

1. Clone the Repository

First, clone the repository from GitHub to your local machine using the following command:

git clone https://github.com/StackUnderflowProject/Scraper.git

2. Navigate to the Project Directory

Change into the project directory:

cd Scraper

3. Build the Project

Use Gradle to build the project. Run the following command in the project directory:

./gradlew build

4. Run the Project

Once the build succeeds, run the project with Gradle. For example, to scrape all football data from PLT for the 2024 season:

./gradlew run --args='PLT 2024'

This outputs four .json files (teams, matches, stadiums, standings) containing the requested data.
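
To confirm that a run produced its output, a small check like the following may help. It is only a sketch: the file names (`teams.json`, `matches.json`, `stadiums.json`, `standings.json`) and their location in the given directory are assumptions based on the four categories named above, and `check_outputs` is a hypothetical helper.

```shell
# Hypothetical helper: verify that the four expected files exist and are non-empty.
# File names are assumed from the categories above; adjust them to the real output.
check_outputs() {
    dir="${1:-.}"
    missing=0
    for name in teams matches stadiums standings; do
        file="$dir/$name.json"
        if [ -s "$file" ]; then
            echo "ok: $file"
        else
            echo "missing or empty: $file"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing files (0 means all present).
    return "$missing"
}

# Usage: check_outputs .
```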