Tutorial - nokia/minifold GitHub Wiki

Tutorial

This tutorial is split in two parts.

  1. The first part explains how to use minifold primitives on a list of dictionaries. This part is especially useful if you are not familiar with SQL.
  2. The second part illustrates how to build minifold pipelines using connectors. Such pipelines separate the user's needs (the Query) from the processing required to obtain the dictionaries. For instance, through a single pipeline, you can query several endpoints and aggregate the results they return for a given query.

Your first commands

This section starts with a simple example to present minifold primitives. You can run ipython3 and copy/paste the following lines of code to try them yourself.

Each minifold primitive processes a list of dictionaries, which are expected to share the same set of keys, and returns a list of dictionaries.

users = [
    {
        "firstname" : "John",
        "lastname" : "Doe"
    }, {
        "firstname" : "John",
        "lastname" : "Connor"
    }, {
        "firstname" : "Peter",
        "lastname" : "Parker"
    }
]

Now, let's see some minifold primitives.

select: fetch a subset of keys

Suppose you want to fetch last names. Run:

from pprint          import pprint
from minifold.select import select

pprint(select(users, ["lastname"]))

Result:

[{'lastname': 'Doe'}, {'lastname': 'Connor'}, {'lastname': 'Parker'}]

Similarly, you could get firstnames as follows:

pprint(select(users, ["firstname"]))

Result:

[{'firstname': 'John'}, {'firstname': 'John'}, {'firstname': 'Peter'}]
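
To make the behaviour above concrete, here is a rough plain-Python equivalent. It is only a sketch of what select returns on this example, not minifold's implementation: the helper name select_sketch is hypothetical, and silently skipping a missing key is an assumption.

```python
def select_sketch(entries: list, attributes: list) -> list:
    """Keep only the requested keys of each dictionary (hypothetical helper)."""
    return [
        # Assumption: a key absent from an entry is simply skipped.
        {key: entry[key] for key in attributes if key in entry}
        for entry in entries
    ]

users = [
    {"firstname": "John", "lastname": "Doe"},
    {"firstname": "John", "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"},
]

lastnames = select_sketch(users, ["lastname"])
print(lastnames)
```

Note that the input list is left untouched: the sketch builds new dictionaries rather than deleting keys in place.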

unique: get dictionaries distinct according to a subset of keys

Suppose you want to get only distinct first names:

from pprint          import pprint
from minifold.select import select
from minifold.unique import unique

pprint(
    unique(
      ["firstname"],
      select(users, ["firstname"])
    )
)

Result:

[{'firstname': 'John'}, {'firstname': 'Peter'}]
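
The deduplication above can be approximated in plain Python as follows. This is a sketch under stated assumptions, not minifold's code: unique_sketch is a hypothetical helper, it keeps the first entry seen for each combination of values, and it requires those values to be hashable. The argument order mirrors the call shown above.

```python
def unique_sketch(attributes: list, entries: list) -> list:
    """Keep the first entry for each distinct tuple of values (hypothetical helper)."""
    seen = set()
    ret = []
    for entry in entries:
        # The tuple of values under the chosen keys identifies a duplicate.
        signature = tuple(entry.get(key) for key in attributes)
        if signature not in seen:
            seen.add(signature)
            ret.append(entry)
    return ret

firstnames = [{"firstname": "John"}, {"firstname": "John"}, {"firstname": "Peter"}]
distinct = unique_sketch(["firstname"], firstnames)
print(distinct)
```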

where: filter entries

Suppose you only want to keep users having the firstname "John":

  • Using a dedicated function:
from minifold.where  import where

def my_filter(user :dict) -> bool:
    return user["firstname"] == "John"

pprint(where(users, my_filter))
  • Using a lambda function:
from minifold.where  import where

pprint(where(users, lambda user: user["firstname"] == "John"))

Result:

[{'firstname': 'John', 'lastname': 'Doe'},
 {'firstname': 'John', 'lastname': 'Connor'}]
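
For readers who think in list comprehensions, the filtering above behaves like the following sketch. The helper where_sketch is hypothetical and only reproduces the result of this example; it says nothing about how minifold implements where.

```python
def where_sketch(entries: list, predicate) -> list:
    """Keep the entries for which predicate(entry) is true (hypothetical helper)."""
    return [entry for entry in entries if predicate(entry)]

users = [
    {"firstname": "John", "lastname": "Doe"},
    {"firstname": "John", "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"},
]

johns = where_sketch(users, lambda user: user["firstname"] == "John")
print(johns)
```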

lambdas: enrich entries

Suppose you want to add a key is_spiderman to each dictionary, whose value is True if and only if the record describes Peter Parker.

from minifold.lambdas import lambdas

pprint(
    lambdas(
        {
            "is_spiderman" : lambda user: user["firstname"] == "Peter" \
                                      and user["lastname"]  == "Parker"
        },
        users
    )
)

Result:

[{'firstname': 'John', 'is_spiderman': False, 'lastname': 'Doe'},
 {'firstname': 'John', 'is_spiderman': False, 'lastname': 'Connor'},
 {'firstname': 'Peter', 'is_spiderman': True, 'lastname': 'Parker'}]
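
The enrichment above amounts to computing one new value per entry and per function. The sketch below shows that idea in plain Python. The helper lambdas_sketch is hypothetical, and copying each entry instead of modifying it in place is an assumption of the sketch, not a statement about minifold.

```python
def lambdas_sketch(functions: dict, entries: list) -> list:
    """Add one computed key per function to a copy of each entry (hypothetical helper)."""
    ret = []
    for entry in entries:
        enriched = dict(entry)  # Assumption: work on a copy of the entry.
        for key, function in functions.items():
            enriched[key] = function(entry)
        ret.append(enriched)
    return ret

users = [
    {"firstname": "John", "lastname": "Doe"},
    {"firstname": "John", "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"},
]

enriched_users = lambdas_sketch(
    {
        "is_spiderman": lambda user: user["firstname"] == "Peter"
                                 and user["lastname"] == "Parker"
    },
    users,
)
print(enriched_users)
```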

To go further

To discover other primitives, visit the Framework page.

Your first connector and your first queries

We will start with the simplest connector: EntriesConnector. This is just a wrapper around a collection of dictionaries. Let's start from the previous example:

from minifold.entries_connector import EntriesConnector

users = [
    {
        "firstname" : "John",
        "lastname" : "Doe"
    }, {
        "firstname" : "John",
        "lastname" : "Connor"
    }, {
        "firstname" : "Peter",
        "lastname" : "Parker"
    }
]

connector = EntriesConnector(users)

Simple query

You can now query this connector using the query method. As usual, it returns a list of dictionaries. By default, a Query fetches everything.

from pprint         import pprint
from minifold.query import Query

q = Query()
entries = connector.query(q)
pprint(entries)

Result:

[{'firstname': 'John', 'lastname': 'Doe'},
 {'firstname': 'John', 'lastname': 'Connor'},
 {'firstname': 'Peter', 'lastname': 'Parker'}]

Refined queries

A Query object can carry "instructions" indicating which dictionaries you are interested in. For example, you can fetch a subset of keys (using the attributes parameter) or only the dictionaries matching some constraints (using the filters parameter).

from pprint         import pprint
from minifold.query import Query

q = Query(
    attributes = ["lastname"],
    filters = lambda user: user["firstname"] == "John"
)
entries = connector.query(q)
pprint(entries)

Result:

[{'lastname': 'Doe'}, {'lastname': 'Connor'}]
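
Notice that the filter reads firstname although only lastname is returned, which suggests that filtering happens before the projection. The sketch below reproduces that order in plain Python. QuerySketch and run_query_sketch are hypothetical names, and using None to mean "everything" is an assumption of the sketch rather than minifold's convention.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class QuerySketch:
    """Hypothetical stand-in for a query: which keys to return, which entries to keep."""
    attributes: Optional[list] = None                  # None: return every key
    filters: Optional[Callable[[dict], bool]] = None   # None: keep every entry

def run_query_sketch(entries: list, query: QuerySketch) -> list:
    # Step 1: filter on the full entries, so the predicate can read any key.
    kept = [e for e in entries if query.filters is None or query.filters(e)]
    # Step 2: project onto the requested keys.
    if query.attributes is None:
        return [dict(e) for e in kept]
    return [{k: e[k] for k in query.attributes if k in e} for e in kept]

users = [
    {"firstname": "John", "lastname": "Doe"},
    {"firstname": "John", "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"},
]

q = QuerySketch(
    attributes=["lastname"],
    filters=lambda user: user["firstname"] == "John",
)
entries = run_query_sketch(users, q)
print(entries)
```

Swapping the two steps would fail on this query, because firstname would already be gone when the filter runs.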

Your first minifold pipelines

First example: aggregating streams of dictionaries

Suppose we now want to build a pipeline in charge of returning the set of distinct firstnames appearing in two collections. To this end:

  • We need to wrap those two collections, using EntriesConnector.
  • We need to merge them, using UnionConnector.
  • Optionally, we can keep only first names, using SelectConnector, depending on whether we want to keep last names.
  • We can remove duplicates, using UniqueConnector.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pprint                     import pprint
from minifold.entries_connector import EntriesConnector
from minifold.query             import Query
from minifold.union             import UnionConnector
from minifold.unique            import UniqueConnector

boys = [
    {
        "firstname" : "John",
        "lastname" : "Doe"
    }, {
        "firstname" : "John",
        "lastname" : "Connor"
    }, {
        "firstname" : "Peter",
        "lastname" : "Parker"
    }
]

girls = [
    {
        "firstname" : "Sarah",
        "lastname" : "Connor"
    }, {
        "firstname" : "Jane",
        "lastname" : "Doe"
    }
]

pipeline = UniqueConnector(
    ["firstname"],
    UnionConnector([
        EntriesConnector(boys),
        EntriesConnector(girls)
    ])
)

Let's run a simple query:

q = Query()
entries = pipeline.query(q)
pprint(entries)

Results:

[{'firstname': 'John', 'lastname': 'Doe'},
 {'firstname': 'Peter', 'lastname': 'Parker'},
 {'firstname': 'Sarah', 'lastname': 'Connor'},
 {'firstname': 'Jane', 'lastname': 'Doe'}]

Let's run a more advanced query:

q = Query(attributes = ["firstname"])
entries = pipeline.query(q)
pprint(entries)

Results:

[{'firstname': 'John'},
 {'firstname': 'Peter'},
 {'firstname': 'Sarah'},
 {'firstname': 'Jane'}]
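
To see what this pipeline computes, the two results above can be reproduced without any connector. The sketch below concatenates the collections, removes duplicate first names, then projects onto the requested keys. All helper names are hypothetical, and the order of the steps is inferred from the results shown here, not taken from minifold's source.

```python
boys = [
    {"firstname": "John", "lastname": "Doe"},
    {"firstname": "John", "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"},
]
girls = [
    {"firstname": "Sarah", "lastname": "Connor"},
    {"firstname": "Jane", "lastname": "Doe"},
]

def union_sketch(collections: list) -> list:
    """Concatenate several lists of entries, preserving their order."""
    return [entry for entries in collections for entry in entries]

def unique_sketch(attributes: list, entries: list) -> list:
    """Keep the first entry for each distinct tuple of values."""
    seen, ret = set(), []
    for entry in entries:
        signature = tuple(entry.get(key) for key in attributes)
        if signature not in seen:
            seen.add(signature)
            ret.append(entry)
    return ret

def pipeline_sketch(attributes=None) -> list:
    """Union, then unique on firstname, then optional projection."""
    entries = unique_sketch(["firstname"], union_sketch([boys, girls]))
    if attributes is None:
        return entries
    return [{key: entry[key] for key in attributes} for entry in entries]

print(pipeline_sketch())
print(pipeline_sketch(["firstname"]))
```

John Connor disappears from both results because John Doe comes first and shares the same first name.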

Second example: enriching the stream of dictionaries

Now, suppose you want to build another pipeline that adds a gender key on top of those two collections. This can be done using LambdasConnector. Here, we assume that boys and girls are stored in separate collections:

from minifold.lambdas import LambdasConnector

pipeline = UnionConnector([
    LambdasConnector(
        {"gender" : lambda boy: "male"},
        EntriesConnector(boys)
    ),
    LambdasConnector(
        {"gender" : lambda girl: "female"},
        EntriesConnector(girls)
    )
])

q = Query()
entries = pipeline.query(q)
pprint(entries)

Results:

[{'firstname': 'John', 'gender': 'male', 'lastname': 'Doe'},
 {'firstname': 'John', 'gender': 'male', 'lastname': 'Connor'},
 {'firstname': 'Peter', 'gender': 'male', 'lastname': 'Parker'},
 {'firstname': 'Sarah', 'gender': 'female', 'lastname': 'Connor'},
 {'firstname': 'Jane', 'gender': 'female', 'lastname': 'Doe'}]

Of course, if your collections mix men and women, you need a more elaborate function:

from minifold.lambdas import LambdasConnector

def gender(user :dict) -> str:
    return "male"   if user["firstname"] in {"John", "Peter"} else \
           "female" if user["firstname"] in {"Jane", "Sarah"} else \
           "?"

pipeline = UnionConnector([
    LambdasConnector(
        {"gender" : gender},
        EntriesConnector(boys)
    ),
    LambdasConnector(
        {"gender" : gender},
        EntriesConnector(girls)
    )
])

q = Query()
entries = pipeline.query(q)
pprint(entries)

Results:

[{'firstname': 'John', 'gender': 'male', 'lastname': 'Doe'},
 {'firstname': 'John', 'gender': 'male', 'lastname': 'Connor'},
 {'firstname': 'Peter', 'gender': 'male', 'lastname': 'Parker'},
 {'firstname': 'Sarah', 'gender': 'female', 'lastname': 'Connor'},
 {'firstname': 'Jane', 'gender': 'female', 'lastname': 'Doe'}]

Playing with heterogeneous data sources

The principle remains the same. Instead of using EntriesConnector, you rely on other connectors, depending on the nature of the data source.

Creating your own connectors

  1. Start by reading a simple connector, e.g. EntriesConnector, to see a minimal example.
  2. Then, as an exercise, copy this file and try to reimplement JsonConnector using the json package.
  3. Once you're satisfied, compare your implementation with the minifold one. If everything is clear, feel free to see how more complex connectors have been implemented.
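
As a warm-up for the exercise above, here is a self-contained toy that mimics the shape of a connector: an object built from a data source that answers queries with a list of dictionaries. It deliberately does not inherit from minifold's base class, and ToyQuery, ToyJsonConnector and their parameters are hypothetical names chosen for illustration. Check the EntriesConnector source for the real interface before porting the idea.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToyQuery:
    """Hypothetical query: None means 'no restriction'."""
    attributes: Optional[list] = None
    filters: Optional[Callable[[dict], bool]] = None

class ToyJsonConnector:
    """Toy connector over a JSON array of objects (not minifold's JsonConnector)."""

    def __init__(self, json_text: str):
        entries = json.loads(json_text)
        if not isinstance(entries, list):
            raise ValueError("expected a JSON array of objects")
        self.entries = entries

    def query(self, query: ToyQuery) -> list:
        # Filter first, so the predicate can read keys that are not returned.
        kept = [e for e in self.entries
                if query.filters is None or query.filters(e)]
        if query.attributes is None:
            return [dict(e) for e in kept]
        return [{k: e[k] for k in query.attributes if k in e} for e in kept]

connector = ToyJsonConnector("""
[
    {"firstname": "John",  "lastname": "Doe"},
    {"firstname": "John",  "lastname": "Connor"},
    {"firstname": "Peter", "lastname": "Parker"}
]
""")

result = connector.query(ToyQuery(
    attributes=["lastname"],
    filters=lambda user: user["firstname"] == "John",
))
print(result)
```

The point of the toy is the division of labour: the constructor knows how to load the data source, while query only knows how to answer a query from the loaded entries.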

Good luck!