Cleansing Tool

Deployment

To set up the cleansing tool, Docker and Docker Compose must be installed. Then clone the corresponding repository and cd into it. There you will notice a .env file, which holds the environment variables needed during container startup. Those variables are listed below (an example .env sketch follows the list):

  • MONGO_INITDB_ROOT_USERNAME: The username used to access the MongoDB instance (MongoDB stores the rules configuration). Set it to a value of your choice, or keep the default.
  • MONGO_INITDB_ROOT_PASSWORD: The password of the MongoDB instance.
  • MONGO_PORT: The port at which MongoDB will be running on the host machine.
  • SECRET_KEY: A secret to be used for securing Flask sessions. Set it to a long alphanumeric value.
  • JWT_SECRET_KEY: A secret which is used to sign the JWT created by the application. Set it to a long alphanumeric value.
  • FLASK_ENV: The environment in which the Flask application will run. Possible values are dev, test and prod; prod is recommended.
  • APP_PORT: The port on which the application will run on the host machine.
  • MONGO_URI: The full MongoDB URI which the application will use to connect to the MongoDB instance.
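
For reference, a filled-in .env might look like the following sketch. All values are illustrative, and the MongoDB hostname inside MONGO_URI is an assumption: it depends on the service name defined in the compose file.

    MONGO_INITDB_ROOT_USERNAME=admin
    MONGO_INITDB_ROOT_PASSWORD=change-me
    MONGO_PORT=27017
    SECRET_KEY=3f9c2b7e8d1a4c6f9e0b5a7d2c8f1e4a
    JWT_SECRET_KEY=9a1d4f7c2e8b5a0d6c3f9e1b7a4d8c2f
    FLASK_ENV=prod
    APP_PORT=5000
    MONGO_URI=mongodb://admin:change-me@mongo:27017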

After the environment variables are set, run docker-compose -f docker-compose-prod.yml up --build -d. This will take a couple of minutes. Afterwards, visit http://{IP}:{PORT} and you will see the following landing page:
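
Putting the deployment steps together (the repository URL is left as a placeholder, since it is not named on this page):

    git clone <repository-url>
    cd <repository-directory>
    # edit the .env file as described above, then:
    docker-compose -f docker-compose-prod.yml up --build -d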

[Screenshot: Main Page]

Usage

Step 1 (Login)

In order to use the features of the cleansing tool, one must log in first.

[Screenshot: Login]

The default credentials are admin/adminadmin123.

Step 2 (Dataset Declaration)

Before setting up cleaning rules, the datasets of interest and their variables must first be registered. The tool assumes the following hierarchy: at the top level, we define the providers. A provider (or data owner) is the company, organization, or individual who possesses the data. Each provider has a set of datasets, and each dataset a set of variables. Note that it is not necessary to register all datasets/variables, only the ones you are interested in cleaning.

The tool offers an easy-to-use UI for creating the aforementioned structure.
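
As a purely hypothetical illustration (all names below are invented), a registered hierarchy could look like:

    Provider: ACME Energy
        Dataset: smart_meter_readings
            Variables: timestamp, meter_id, consumption_kwh
        Dataset: weather_observations
            Variables: timestamp, temperature, humidity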

[Screenshot: Dataset Registration]

Step 3 (Rule Registration)

Next, we create the cleaning rules of interest.

[Screenshot: Rules]

Rules fall under three different categories:

  1. Validation Rules: Define constraints that should be checked (e.g. whether the values of a column fall within a desired range).

[Screenshot: Validation Rules]

  2. Cleaning Rules: Define actions to be taken when a specific validation rule is violated (e.g. if values of a column fall outside the desired range, replace them with a predefined value).

[Screenshot: Cleaning Rules]

  3. Missing Values Rules: Define actions to be taken when a column contains empty values.

[Screenshot: Missing Rules]
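
As a purely illustrative example (the exact options available depend on the tool's UI), the three categories could combine on a hypothetical temperature column as follows:

    Validation rule:     values of temperature must lie within [-30, 50]
    Cleaning rule:       if the validation rule is violated, replace the value with a predefined placeholder (e.g. -999)
    Missing values rule: fill empty temperature cells with the column mean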

Step 4 (Clean Data)

Having registered the necessary rules, you can proceed to the cleaning process. The tool offers a simple UI for choosing the relevant provider and dataset and then uploading a file to clean. The only supported file types are CSV and XLSX, up to 500MB in size. For larger datasets, it is advised to use the tool's API instead (see the API section below).

[Screenshot: Clean]

Once the process completes, a new cleaned file will be returned.

Step 5 (Check Logs)

The cleansing tool lets you inspect, in detail, all the actions that took place during the cleaning process. Actions are stored as log files.

[Screenshot: Logs]

By opening a specific log file, you can see a detailed explanation of the actions that took place, along with a dashboard that displays some useful statistics.

[Screenshot: Logs Details]

[Screenshot: Logs Dashboard]

API

The tool exposes several API endpoints. They are documented using Swagger, which can be accessed at http://{IP}:{PORT}/cleaner/api/v1/docs.
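
As a sketch only (the endpoint path, form field, and authentication header below are assumptions, not confirmed by this page; verify them against the Swagger documentation), an API-based cleaning request might look like:

    # Hypothetical endpoint; confirm the exact path and form fields in Swagger
    curl -X POST "http://{IP}:{PORT}/cleaner/api/v1/clean" \
         -H "Authorization: Bearer <JWT>" \
         -F "file=@dataset.csv"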