# How it works
As mentioned previously, this project consists of a simple Dagster pipeline that will fetch data from an API and save it as a CSV file.
Before we dive into the implementation, it's important that we go over a few concepts from Dagster. If you're already familiar with them, feel free to jump to the implementation section.
Everything implemented here has docstrings for better comprehension.
## Dagster concepts

In this section, I'll go over a few core concepts of Dagster. I won't dwell on them for long, and I do recommend the Dagster docs; they're really clear and succinct.
From their own homepage:

> Dagster is a data orchestrator for machine learning, analytics, and ETL
Using my own words, Dagster is a Python framework for writing, executing and debugging data pipelines based on DAGs (directed acyclic graphs). It's similar to other frameworks such as Airflow, Luigi and Prefect but, of course, has its own concepts. In the following subsections, I'll talk about them.
### Solids

The definition of solids in the Dagster documentation is very clear: "Solids are the functional unit of work in Dagster. A solid's responsibility is to read its inputs, perform an action, and emit outputs". You can think of a solid as a task; that analogy really works for me.
Besides inputs, a solid can have a configuration, which is specified by a schema.
Think of a solid that will greet newcomers to this repository:
- Should it say "Hi
<name>
"? - Should it say "Welcome to this repository,
<name>
!"? - What if I wanted to use this solid for another repository? Should I rewrite it?
Configurations allow solids to be more flexible and reusable. Sticking with this example, we could give this solid a configurable `welcome_message` string and a `name` input. This would allow it to be used in any environment by changing only the configuration and the input, without touching the code.
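A minimal sketch of what that could look like, assuming the legacy (pre-1.0) Dagster API that this project uses; the `greet` name and the message template are just my illustration:

```python
from dagster import solid


@solid(config_schema={"welcome_message": str})
def greet(context, name: str) -> str:
    """Greets a newcomer with a configurable message template."""
    # context.solid_config holds the values supplied for the config schema
    return context.solid_config["welcome_message"].format(name=name)
```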
Every solid, when executed, returns a `SolidExecutionResult`. This object can tell you whether the execution succeeded, give you the output data from the solid, and much more.
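For example, executing the hypothetical `greet` solid above in isolation could look like this sketch; `execute_solid` and the `run_config` layout follow the same legacy API:

```python
from dagster import execute_solid

result = execute_solid(
    greet,
    run_config={
        "solids": {
            "greet": {"config": {"welcome_message": "Welcome to this repository, {name}!"}}
        }
    },
    input_values={"name": "gabriel"},
)
assert result.success  # did the execution succeed?
print(result.output_value())  # Welcome to this repository, gabriel!
```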
### Pipelines

As expected, solids can be connected to each other to build a pipeline, which is just a set of solids with explicit dependencies on each other. This is where the DAG concept comes in: a pipeline is nothing but a DAG. Of course, it has awesome features implemented by Dagster on top, but you get the idea.
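As a sketch, wiring the hypothetical `greet` solid into a tiny pipeline could look like this; passing one solid's output as another's input is what declares a DAG edge:

```python
from dagster import pipeline, solid


@solid
def get_newcomer(context) -> str:
    """A hypothetical solid that produces the name of a newcomer."""
    return "gabriel"


@pipeline
def welcome_pipeline():
    # Calling greet with get_newcomer's output declares the dependency
    greet(get_newcomer())
```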
## Implementation

In this section, I'll talk about the implementation of all solids, the test cases, and the pipeline itself.
### fetch_json_data

- Configuration schema:

  ```python
  {
      "api_endpoint": str,
      "api_path": str,
  }
  ```

- Inputs: `None`
- Outputs: `dict` or `list`
- Raised exceptions: when the HTTP response is not OK (status code other than 200)
The following steps are executed (a sketch of this solid follows the list):

- Joins `api_endpoint` and `api_path` into a single URL (example: `https://api.github.com/` + `/users/gabriel-milan` = `https://api.github.com/users/gabriel-milan`)
- Makes a GET request to the URL
- If the response is OK (status code 200), parses the JSON data into a Python dictionary and returns it
- If the response is not OK, logs the status code and response message and raises an exception
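A minimal sketch of such a solid, assuming the legacy API; the exact URL joining and error handling are my assumptions, not necessarily the repository's code:

```python
import requests
from dagster import Failure, solid


@solid(config_schema={"api_endpoint": str, "api_path": str})
def fetch_json_data(context):
    """Fetches JSON data from the configured API endpoint and path."""
    # Join endpoint and path, avoiding a doubled slash between them
    url = context.solid_config["api_endpoint"].rstrip("/") + context.solid_config["api_path"]
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()  # dict or list, depending on the payload
    context.log.error(f"Status {response.status_code}: {response.text}")
    raise Failure(f"GET {url} failed with status {response.status_code}")
```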
### generate_dataframe

- Configuration schema:

  ```python
  {
      "data_key": str,
  }
  ```

- Inputs: `data` (`dict` or `list`)
- Outputs: `pd.DataFrame`
- Raised exceptions: only for unexpected behavior
The following steps are executed (a sketch follows the list):

- Gets `data_key` from the configuration
- If `data_key` is an empty string, assumes that `data` is a list of items
- If `data_key` is not empty, extracts the list of items from `data` at key `data_key`
- Builds a Pandas DataFrame from the list (either from step 2 or 3) and returns it
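A sketch of these steps, again assuming the legacy API; the details are my reading of the list above:

```python
import pandas as pd
from dagster import solid


@solid(config_schema={"data_key": str})
def generate_dataframe(context, data):
    """Builds a DataFrame from data, optionally unwrapping it at data_key."""
    data_key = context.solid_config["data_key"]
    # An empty data_key means data is already a list of items;
    # otherwise, the list of items lives under data[data_key]
    items = data if data_key == "" else data[data_key]
    return pd.DataFrame(items)
```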
### save_dataframe_to_csv

- Configuration schema:

  ```python
  {
      "output_filename": str,
  }
  ```

- Inputs: `df` (`pd.DataFrame`)
- Outputs: `None`
- Raised exceptions: only for unexpected behavior
The following steps are executed (a sketch follows the list):

- Gets `output_filename` from the configuration
- Checks whether `output_filename` ends with `.csv`; if it doesn't, appends it
- Exports the Pandas DataFrame to the `output_filename` CSV file
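A sketch of these steps (the `index=False` export option is my assumption):

```python
import pandas as pd
from dagster import solid


@solid(config_schema={"output_filename": str})
def save_dataframe_to_csv(context, df: pd.DataFrame):
    """Exports the DataFrame to the configured CSV file."""
    filename = context.solid_config["output_filename"]
    # Append the .csv extension if it's missing
    if not filename.endswith(".csv"):
        filename += ".csv"
    df.to_csv(filename, index=False)  # index=False is my assumption
```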
### The pipeline

This is the only pipeline implemented; its functionality is described on the introduction page of this wiki.

Solids (connected to each other in this order):

1. `fetch_json_data`
2. `generate_dataframe`
3. `save_dataframe_to_csv`
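In the legacy Dagster API, that wiring could look like the sketch below; the page doesn't give the pipeline's actual name, so `br_pipeline` is a placeholder:

```python
from dagster import pipeline


@pipeline
def br_pipeline():  # placeholder name; the real one may differ
    # Chaining the calls declares the execution order (the DAG edges)
    save_dataframe_to_csv(generate_dataframe(fetch_json_data()))
```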
### Test cases

Test cases were split into two files: `test_solids.py` and `test_pipelines.py`. Inside them, test names follow this logic: `test_<module>`, where `<module>` can be any solid or pipeline previously described. I'll now describe what each of them tests, one by one.
#### test_fetch_json_data

- Configuration:

  ```python
  {
      "api_endpoint": "http://webapibrt.rio.rj.gov.br/api/v1",
      "api_path": "/brt"
  }
  ```

- Ensures the solid executes
- Ensures the output is an instance of `SolidExecutionResult`
- Ensures it ran without exceptions
- Ensures the output type is either a list or a dictionary
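Putting those checks together, this test could look like the sketch below; the other solid tests follow the same pattern, and the exact assertions are my assumptions:

```python
from dagster import SolidExecutionResult, execute_solid


def test_fetch_json_data():
    result = execute_solid(
        fetch_json_data,
        run_config={
            "solids": {
                "fetch_json_data": {
                    "config": {
                        "api_endpoint": "http://webapibrt.rio.rj.gov.br/api/v1",
                        "api_path": "/brt",
                    }
                }
            }
        },
    )
    assert isinstance(result, SolidExecutionResult)  # expected result type
    assert result.success  # ran without exceptions
    assert isinstance(result.output_value(), (list, dict))  # expected output type
```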
#### test_generate_dataframe

- Configuration:

  ```python
  {
      "data_key": "asd"
  }
  ```

- Input `data`:

  ```python
  {
      "asd": [
          {"a": 123},
          {"b": 456}
      ]
  }
  ```

- Ensures the solid executes
- Ensures the output is an instance of `SolidExecutionResult`
- Ensures it ran without exceptions
- Ensures the output type is a Pandas DataFrame
- Ensures the DataFrame matches the expected `pd.DataFrame([{"a": 123}, {"b": 456}])`
#### test_save_dataframe_to_csv

- Configuration:

  ```python
  {
      "output_filename": "test.csv"
  }
  ```

- Input `df`: `pd.DataFrame([{"a": 1}, {"a": 2}])`
- Ensures the solid executes
- Ensures the output is an instance of `SolidExecutionResult`
- Ensures it ran without exceptions
- Ensures the output is `None`
- Loads the saved file and ensures its content matches the input
#### Pipeline test

- Configuration:

  ```python
  {
      "fetch_json_data": {
          "config": {
              "api_endpoint": "http://webapibrt.rio.rj.gov.br/api/v1",
              "api_path": "/brt"
          }
      },
      "generate_dataframe": {
          "config": {
              "data_key": "veiculos"
          }
      },
      "save_dataframe_to_csv": {
          "config": {
              "output_filename": "veiculos.csv"
          }
      }
  }
  ```

- Ensures the pipeline executes
- Ensures the output is an instance of `PipelineExecutionResult`
- Ensures it ran without exceptions
- Ensures the type of each solid's output matches its expected type
- Ensures the `fetch_json_data` output matches what `generate_dataframe` expects
- Ensures the `generate_dataframe` output matches what `save_dataframe_to_csv` expects
- Ensures the saved CSV file contains the same number of rows as the `fetch_json_data` list
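A sketch of how such a pipeline test might look; again this assumes the legacy API, `br_pipeline` is the placeholder name from the sketch above, and the row-count check is my reading of the last bullet:

```python
import pandas as pd
from dagster import PipelineExecutionResult, execute_pipeline


def test_pipeline():
    result = execute_pipeline(
        br_pipeline,  # placeholder name
        run_config={
            "solids": {
                "fetch_json_data": {
                    "config": {
                        "api_endpoint": "http://webapibrt.rio.rj.gov.br/api/v1",
                        "api_path": "/brt",
                    }
                },
                "generate_dataframe": {"config": {"data_key": "veiculos"}},
                "save_dataframe_to_csv": {"config": {"output_filename": "veiculos.csv"}},
            }
        },
    )
    assert isinstance(result, PipelineExecutionResult)  # expected result type
    assert result.success  # ran without exceptions
    # The saved CSV should have as many rows as the fetched "veiculos" list
    fetched = result.output_for_solid("fetch_json_data")
    assert len(pd.read_csv("veiculos.csv")) == len(fetched["veiculos"])
```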