Configuring and Automatically Creating Joins in PyBIRDAI - eclipse-efbt/efbt GitHub Wiki
Introduction
In PyBIRDAI, we use a computational model for executable transformations, meaning that we generate actionable steps or procedures from metadata to transform data according to BIRD's specifications. This approach ensures that transformations are consistent, repeatable, and aligned with the structured framework of BIRD.
Specifically, we use joins to create flat tables and apply filters to them.
BIRD uses the SMCubes information model to store its metadata, which is explained in detail here.
If you have used the BIRD website, you are already familiar with navigating many concepts from SMCubes, such as Domains, Variables, and Members.
Filters can be described in SMCubes using two concepts:
- Combinations
- Combination Items
These combinations can be visualized very neatly on the BIRD website, where users can navigate through a clear interface showing relationships between variables and their respective domains.
Joins can be represented using the SMCubes Information Model with two concepts:
- Cube Links
- Cube Structure Item Links
In SMCubes, the concept of "cubes" can be considered analogous to tables, which may be more familiar to some users.
The Input Layer of BIRD, or the Logical Data Model (LDM) of BIRD, is essentially a set of cubes with specific relationships, they can be navigated with the BIRD website.
BIRD also contains a set of Output Layers (e.g., one per FINREP report), which are also considered cubes. For a single report templatye they show the subset of columns that are specifically required as a source for filters for that specific report, for example CARRYNG_AMNT,INSTTNTNL_SCTR,HLD_SL_INDCR and some others are needed for the Finrep F05.01 report.
These cubes are all navigable on the BIRD website and can also be downloaded as CSV files from the Metadata Extracts section of the website. These CSV files adhere to the SMCubes Information Model.
The BIRD metamodel manual, which includes guidelines on implementation, demonstrates that there is flexibility in how we choose to join information from the Input Layer. Here, we can balance business and technical trade-offs. For example, creating a join for every report cell might isolate the joins to their absolute minimum set of required inputs but could slow down technical implementations or make lineage harder to visualize.
In PyBIRDAI, we join information from BIRD's Input Layer to create concise, narrow tables that follow the structure of the Output Layer. We then apply the filters of the combinations to calculate report cells.
We note that in PyBIRDAI, we do not calculate "total cells" on reports in this way. Instead, "total cells" are derived following EBA validation rules, which ensures consistency with regulatory requirements and provides banks with the flexibility to incorporate on-face manual adjustments as needed. This approach supports manual adjustments, which are commonly used in most banks.
How Many Joins Do We Make?
In PyBIRDAI, for template-based reports like FINREP, we make one join per product per report.
For each template, we look at the products it reports upon. For example, F01.01 reports on loans and advances, debt securities, goodwill, and many other products. In F05.01, we see that loans and advances are broken down into more detailed products like credit card debt, financial leases, other term loans, etc.
For each report, we make one join for each product used in the report. The result of each join follows the strict, concise structure of the output layer of the report, so they can be unioned together.
Why Do We Make This Many Joins?
We considered approaches where we make more joins (one per cell) or fewer (one per product, or one per regulation). We think we have found the correct balance.
By having joins per product and per report, we can use lineage to find the smaller subset of fields required in the input layer for banks that have a limited subset of reports and products. This is very useful for banks.
For example, we do not want to join information about FRTB requirements for a bank that does not file an FRTB report.
This approach also allows incremental ETL and testing. We can first choose just one report, such as FINREP F01.01, find the minimum set of columns from the input layer required, and perform the ETL and testing for that report. Once testing is stable, we can incrementally add new reports.
How Do We Visualize Those Joins Pre-Execution?
Each join has one product and one report associated with it. It is easy to visualize the metadata of this. We implemented this same methpdodolgy in a previous desktop version of Eclipse Free BIRD Tools written in Java, we show here how it visualised the join for the F05.01 report realted to the non-negotiable bonds product (Non Negotiable Bonds are treated as Loans in Finrep and so required to be treated in the Finrep report F05.01 as 'term loans'). We see that above it has all the columns from the output layer cube for F05.01 , and displays lines of lineage showing how to get each of these columns from the BIRD input layer tables.
How Do We Visualize Those Joins Post-Execution?
After executing joins, it is important to visualize and navigate the data lineage of the numbers. This data lineage is crucial for banks, as it helps answer questions like, "Why is the number in this report cell so high?" The ability to show data lineage has always been considered a primary requirement in PyBIRDAI and previous versions of Eclipse Free BIRD Tools. This requirement drives the model of computation. We can visualize both the attribute lineage (pre-runtime) and data lineage (post-execution) as demonstrated in this video. We call the results of each join a "slice" since they are added together. The video is slightly out of date, as it shows one slice for loans and advances, but we now consider it better to split loans and advances into separate products (e.g., credit card debt, financial leases, other term loans). https://www.youtube.com/watch?v=fN219efkOF4
How Do We Store the Joins?
We store the joins using Cube Links and Cube Structure Item Links.
This provides good lineage.
The technical implementation of each join in Python is auto-created for us in the correct file in the application but usually requires some mild adaptation. For example, the Python implementation of the joins may need additional details, such as "where clauses" to limit information about a product to a specific role, like financial asset instruments.
How Do We Automatically Create Those Joins from Output Layers and Input Layers?
PyBIRDAI has an automated join creation process which automatically identifies how the items required in the output layers can be found by joining specific items in the input layer (LDM). This automated process achieves about 90% accuracy but requires some checking and revising, which we can do in the PyBIRDAI user interface. This process automatically creates the Cube Links and Cube Structure Item Links.
How Do We Configure That Automatic Creation?
Configuring the automated join creation process includes setting up three files:
- In-scope reports
- Product specifications
- Product-to-Main Category Links
Because there is a close correlation between main categories (an EBA dimension found in EBA annotated report templates, such as FINREP), we can automatically identify the products used in each report. Again, this is over 90% accurate but requires some checking, as the automation may struggle with similar dimensions like main category or main category of guarantee.
Conclusion
PyBIRDAI automatically implements 95% of our joins and 100% of our filters based on SMCubes descriptions of BIRD. We still include a human in the loop to edit results, using productivity tools to facilitate editing SMCubes in a visual editor or Python code in a Python code editor. We also have the ability to edit the SMCubes content, test changes, and send them back to the ECB.