How to Research and Curate EVM Protocols
This guide documents how to 1) research EVM protocols and 2) build the Silver/Gold DBT model layers, from a generalized standpoint. For primitive-specific details (DEX swaps, bridging, lending, etc.), please see the other guides listed under Guides: General: DeFi.
Please read the Ethereum Developer Docs for more details on how EVM smart contracts, transactions, emitted events and functions work.
Research Phase
Summary:
Research for building protocol models is collected by triaging Protocol Documentation, community discussions, Block Explorer activity, existing data tables, and other sources such as community dashboards or DefiLlama. If you are unable to determine the right definitions, contracts, onchain events, or traces to analyze, you may need to test the protocol yourself. However, please exercise caution when interacting with smart contracts and proceed at your own risk.
Best Practices, Tips & Tricks:
- Protocol Versions:
  - There may be multiple versions of a protocol, each with the same or different contract addresses and event topics. It's crucial to identify each distinct version and determine applicable differences or upgrades to account for in the models.
  - If the structure of the events is exactly the same or similar enough, you may be able to combine the versions into one model. Break versions out into their own models or CTEs when dealing with versions that require a new set of columns, unique incremental logic, joins, etc.
- Contract Reads:
  - Certain models may require contract reads, typically in cases where data cannot be parsed or is not posted onchain. For example, the `token_address` column for bridging via Hop Protocol. Please see the documentation on Contract Reads for more details.
Implementation Steps, Variables & Notes:
- In order to properly analyze, query, or build EVM protocol models, we first need to know exactly what we're looking to curate. We do this by defining each action at its most atomic level.
  - Definitions are crucial in guiding the models, tables, and schemas we deploy, but are often the most difficult part of the curation process, as different protocols may have different meanings for similar onchain actions.
  - Definitions should include a 1-3 sentence explanation of what the intended action is and how it is generally represented onchain. E.g. "DEX Swap is a direct exchange of one token for another and occurs on a liquidity pool contract. When an address initiates a swap, one token is transferred to the pool and the other token is transferred back to the recipient address, alongside a change in the pool's balance."
- The next step is to identify the applicable contracts, event topics, or contract calls/traces that represent the actions you are looking to curate. In some situations, you may need to combine multiple contracts, events, or traces together in one query to output a table that represents the defined action.
  - This can be done by checking various sources, such as Block Explorers or the decoded data tables on Flipside, to visualize the transaction flow (see the exploratory sketch after this list). Be sure to properly collect and note each contract, the official name defined in the contract code, the applicable function calls or events emitted, and any nuances to account for.
- The final step is to write the query.
  - This is often a process of trial and error; the goal is to determine the underlying logic required to produce a table with records that represent the onchain actions or outputs. Depending on the source table(s) used, this may require manual decoding or multiple CTEs.
  - It is recommended to utilize the raw data tables (`silver.logs`, `silver.traces`) rather than the decoded data tables (`silver.decoded_logs`, `silver.decoded_traces`), as the decoding process relies on collected/submitted ABIs and therefore may not encapsulate 100% of records for a contract.
  - When writing the queries, please include all relevant columns necessary to get a complete picture of the onchain action. These queries will be directly converted to incremental DBT models.
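Before writing the final query against the raw tables, a quick exploratory query against the decoded tables can help confirm which events a contract actually emits. A minimal sketch, assuming `silver.decoded_logs` exposes an `event_name` column (verify against your deployment's schema):

```sql
-- Hypothetical exploration: which events does the Fraxswap v1 factory emit?
SELECT
    event_name,
    COUNT(*) AS event_count
FROM ethereum.silver.decoded_logs
WHERE contract_address = '0xb076b06f669e682609fb4a8c6646d2619717be4b' -- Fraxswap FactoryV1
GROUP BY 1
ORDER BY 2 DESC;
```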
Examples, References & Sources:
Example Research Submission
Protocol: Fraxswap
Action: Liquidity Pool Address/Pair Creation
Definition: <insert definition>
Contract:
* v1: '0xb076b06f669e682609fb4a8c6646d2619717be4b' --Frax Finance:Fraxswap FactoryV1
* v2: '0x43ec799eadd63848443e2347c49f5f52e8fe0f6f' --Frax Finance:Fraxswap FactoryV2
Event Topic 0:
* v1/v2: '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' --PairCreated
Example Txn:
* v1: https://etherscan.io/tx/0x817fc5df7d72992f2244f2ee521f4572d86ba79ba49bbdd8ad4067b586cf82e7
* v2: https://etherscan.io/tx/0xa8af5961e4f263c2c4c99387f55339174f20d1d5c001a6df5d8c56641b25d5d4
Notes: <include any notes, nuances, or edge cases worth calling out from the research>
Query:
```sql
SELECT
    block_number,
    block_timestamp,
    tx_hash,
    event_index,
    origin_function_signature,
    origin_from_address,
    origin_to_address,
    contract_address AS factory_address,
    -- split the hex payload into 32-byte (64-character) words
    regexp_substr_all(SUBSTR(DATA, 3, len(DATA)), '.{64}') AS segmented_data,
    -- the pair address is the first word of the data payload
    CONCAT('0x', SUBSTR(segmented_data [0] :: STRING, 25, 40)) AS pool_address,
    -- token0 / token1 are indexed parameters, emitted as topics 1 and 2
    CONCAT('0x', SUBSTR(topics [1] :: STRING, 27, 40)) AS token0,
    CONCAT('0x', SUBSTR(topics [2] :: STRING, 27, 40)) AS token1,
    TRY_TO_NUMBER(utils.udf_hex_to_int(
        segmented_data [1] :: STRING
    )) AS pool_id,
    _log_id,
    _inserted_timestamp
FROM ethereum.silver.logs
WHERE contract_address IN (
        '0xb076b06f669e682609fb4a8c6646d2619717be4b', -- v1 factory
        '0x43ec799eadd63848443e2347c49f5f52e8fe0f6f' -- v2 factory
    )
    AND topics [0] :: STRING = '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' -- PairCreated
    AND tx_status = 'SUCCESS'
LIMIT 10;
```
Silver Layer Modeling
Summary:
- Protocol Specific Models:
  - Built with `silver.logs` and/or `silver.traces`, these silver models are designed to capture essential information about various protocols at a granular level. Each model is specific to a set of contracts, events, or traces relevant to a specific version of a protocol for the category/primitive being analyzed. Typically, these Protocol Specific models will be rolled up and unioned together in a `complete` model, and for that reason require a minimum level of standardization that can be found in this guide.
- Complete Models:
  - The curated `complete` models are built with the intention of creating a centralized model combining all like models into one, by category/primitive. Not only does this help the debugging process, it also improves query performance within both curated and downstream models (see the sketch after this list). It achieves this by:
    - Materializing a table that unions all relevant models together in one place, rather than having all of this logic exist in the Gold View surfaced to users
    - Reducing the number of joins required on the contracts and prices tables
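A minimal sketch of this union pattern, reusing the Fraxswap example from above; `silver_dex__kyberswap_pools` is hypothetical, and the real set of unioned models will vary by chain and primitive:

```sql
WITH fraxswap AS (
    SELECT
        block_number, block_timestamp, tx_hash, pool_address,
        token0, token1, 'fraxswap' AS platform,
        _log_id, _inserted_timestamp
    FROM {{ ref('silver_dex__fraxswap_pools') }}
),
kyberswap AS (
    SELECT
        block_number, block_timestamp, tx_hash, pool_address,
        token0, token1, 'kyberswap' AS platform,
        _log_id, _inserted_timestamp
    FROM {{ ref('silver_dex__kyberswap_pools') }} -- hypothetical model name
)
-- standardized column names across protocol models make the UNION ALL trivial
SELECT * FROM fraxswap
UNION ALL
SELECT * FROM kyberswap
```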
Best Practices, Tips & Tricks:
- It's recommended to add a WHERE clause to each CTE selecting from the core tables, filtering out failed transactions, logs, or traces and only including successful ones (see the model skeleton after this list).
- Standardized column names are crucial. Each protocol table will be unioned together downstream in the `complete` models.
- Utilize CTEs (Common Table Expressions) as much as possible for improved readability and maintainability.
- Use `QUALIFY` clauses to handle duplicates and ensure data integrity.
- Modeling with `silver.decoded_logs` or `silver.decoded_traces` is an option, but it is not the recommended approach. Please reserve usage of these tables in your models for protocols/contracts that are too difficult or resource intensive to manually parse.
- Leverage Snowflake's `regexp_substr_all` function for efficient parsing of event data. See examples.
- For details on how to utilize specific UDFs and convert between data types, please reference the LiveQuery README.
- Heal Logic:
  - The `complete` models include `heal logic` that handles late-arriving or missing data. Please see the Incremental Heal Logic guide for more information.
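Putting these practices together, a skeleton for a protocol-specific silver model might look like the sketch below. This is illustrative, not a house template: the `silver__logs` ref name is an assumption (use whatever source or ref your project defines for raw logs), the event-decoding columns are elided, and incremental logic is covered in the next section.

```sql
-- Sketch of a protocol-specific silver model (incremental config omitted)
WITH pool_creation AS (
    SELECT
        block_number,
        block_timestamp,
        tx_hash,
        event_index,
        contract_address AS factory_address,
        -- manual decoding of topics / data goes here (see Manual Decoding below)
        _log_id,
        _inserted_timestamp
    FROM {{ ref('silver__logs') }} -- assumed ref name for silver.logs
    WHERE contract_address = '0xb076b06f669e682609fb4a8c6646d2619717be4b' -- Fraxswap FactoryV1
        AND topics [0] :: STRING = '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' -- PairCreated
        AND tx_status = 'SUCCESS' -- only include successful transactions
)
SELECT *
FROM pool_creation
-- deduplicate: keep the most recently inserted record per log
QUALIFY ROW_NUMBER() OVER (PARTITION BY _log_id ORDER BY _inserted_timestamp DESC) = 1
```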
Implementation Steps, Variables & Notes:
- Naming conventions:
  - Protocol Specific Models:
    - Each model must be named according to the protocol and action being curated. The name should be descriptive enough for any engineer to debug and easily understand what it represents, yet standardized, short, and condensed. This includes the schema + category/primitive, protocol name, action/event name, and version (where applicable).
      - `[schema]_[category (optional)]__[protocol name]_[action]_[version (optional)]`
      - e.g. `silver_dex__fraxswap_pools`, `silver_dex__fraxswap_swaps`
  - Complete Models:
    - Each model must be named according to the category/primitive it represents and contain `complete`.
      - `[schema]_[category (optional)]__[complete]_[action]`
      - e.g. `silver_dex__complete_dex_liquidity_pools.sql`, `silver_bridge__complete_bridge_activity.sql`
- Folder Structure:
  - Each DeFi Primitive should have its own folder, and each curated protocol should be nested within that. Additionally, curated protocol models are unioned together into a `complete` model, for performance / DBT best practices, which exists within the category/primitive folder but outside of the protocol-specific folders.
- Model Structure:
  - Each `.sql` model must have a corresponding `.yml` file for tests, table, and column descriptions. In the Silver Layer, table and column descriptions are not required, but recommended if there are nuances that need to be documented. Tests are required.
- Manual Decoding:
  - We leverage a combination of Snowflake functions and internal UDFs to manually parse and decode transaction data. Some examples include:
    - Parse and segment blockchain event data:
      - `regexp_substr_all(SUBSTR(DATA, 3, len(DATA)), '.{64}') AS segmented_data`
      - `SUBSTR(DATA, 3, len(DATA))`: Extracts a substring from the `DATA` field, starting from the 3rd character to the end, which removes the '0x' prefix from the hexadecimal data.
      - `regexp_substr_all(..., '.{64}')`: Applies a regular expression to split the substring into segments of 64 characters each. In Ethereum, each parameter in event data is typically 32 bytes (64 hexadecimal characters) long.
      - The result, `segmented_data`, is an array where each element is a 64-character segment of the original data. This makes it easier to extract individual parameters from the event data.
    - Extract and format an Ethereum address from the segmented event data:
      - `CONCAT('0x', SUBSTR(segmented_data [0] :: STRING, 25, 40))`
      - `segmented_data[0] :: STRING`: Accesses the first element of the `segmented_data` array (created by the `regexp_substr_all` function above) and casts it to a string. This element typically represents a 32-byte (64 hexadecimal characters) parameter from the event data.
      - `SUBSTR(..., 25, 40)`: Extracts a substring of 40 characters, starting from the 25th character of the string. Ethereum addresses are 20 bytes (40 hexadecimal characters) long, and they're often padded to 32 bytes in event data.
      - `CONCAT('0x', ...)`: Prepends '0x' to the extracted 40-character string. Ethereum addresses are conventionally represented with a '0x' prefix.
    - Extract and format an Ethereum address from event topics:
      - `CONCAT('0x', SUBSTR(topics [1] :: STRING, 27, 40))`
      - `topics[1] :: STRING`: Accesses the second topic (index 1) from the event log and casts it to a string. In many events, topics 1 and 2 contain an indexed address parameter.
      - `SUBSTR(..., 27, 40)`: Extracts a substring of 40 characters, starting from the 27th character of the topic string.
      - `CONCAT('0x', ...)`: Prepends '0x' to the extracted 40-character string.
    - Convert a hexadecimal value from event data into a numeric value:
      - `TRY_TO_NUMBER(utils.udf_hex_to_int(segmented_data[1] :: STRING))`
      - `utils.udf_hex_to_int(...)`: A custom user-defined function (UDF) that converts a hexadecimal string to an integer.
      - `TRY_TO_NUMBER(...)`: A Snowflake function that attempts to convert its argument to a number. If the conversion fails, it returns NULL instead of raising an error. A worked example follows this list.
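As a quick sanity check on the numeric conversion, a 32-byte data word holding the value 100 decodes as follows (the literal below is illustrative):

```sql
-- 0x64 = 100; event data words are left-padded with zeros to 64 hex characters
SELECT TRY_TO_NUMBER(utils.udf_hex_to_int(
    '0000000000000000000000000000000000000000000000000000000000000064'
)) AS decoded_value; -- returns 100
```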
- Incremental Logic:
  - We add short, hour-based lookbacks to our curated models to limit them to selecting only the last `X` hours of data. This ensures that the models run efficiently with a limit, but also lets them capture any late-arriving data from the past few hours. For models built on event logs and trace data, we typically use `12 hours`, but a shorter or longer window may be beneficial, depending on the model.
  - When the model runs in `incremental` mode, indicated by the `is_incremental()` jinja check, it filters data based on the `_inserted_timestamp`, selecting records from the last `X` hours since the model's last run.
  - The `{{ this }}` reference points to the current model table.
```sql
{% if is_incremental() %}
AND _inserted_timestamp >= (
    SELECT MAX(_inserted_timestamp) - INTERVAL '12 hours'
    FROM {{ this }}
)
{% endif %}
```
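For context, this filter typically sits inside a model configured as incremental. A minimal sketch, assuming `_log_id` serves as the unique key and `silver__logs` as the ref name (your project may set additional options such as clustering or merge behavior):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = '_log_id'
) }}

SELECT
    block_number,
    block_timestamp,
    tx_hash,
    _log_id,
    _inserted_timestamp
FROM {{ ref('silver__logs') }} -- assumed ref name for silver.logs
WHERE tx_status = 'SUCCESS'

{% if is_incremental() %}
AND _inserted_timestamp >= (
    SELECT MAX(_inserted_timestamp) - INTERVAL '12 hours'
    FROM {{ this }}
)
{% endif %}
```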
- Qualify Statement:
  - The `QUALIFY` clause is used to select the most recent record from the table, based on unique ids. We typically add this to the final SELECT statement in a model if there are duplicate values. Please test thoroughly, as this can be a resource-intensive clause and may add significant processing time.
  - It works by assigning a row number to each record within groups of the same unique id(s), designated by `PARTITION BY`. In this example, only one id is used, but multiple can be added. Within each group, it orders the records by the latest value, in this case `_inserted_timestamp`. Either `DESC` or `ASC` order may be used depending on the intended outcome. The `ROW_NUMBER() = 1` condition ensures that the model keeps only the first row in each group, which in this case is the most recently inserted record due to the descending order.

```sql
QUALIFY (ROW_NUMBER() OVER (PARTITION BY _log_id ORDER BY _inserted_timestamp DESC)) = 1
```
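Where a single id isn't unique on its own, multiple keys can be combined in the `PARTITION BY`; for example, deduplicating on the transaction hash and event index pair:

```sql
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tx_hash, event_index
    ORDER BY _inserted_timestamp DESC
) = 1
```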
Examples, References & Sources:
Example Silver Curated Models:
- Protocol Specific Models:
- Complete Models:
Example Folder Structure and Model Hierarchy:
> models/silver/defi
>> bridge
>> dex
>> balance
...
>> frax
- silver_dex__fraxswap_pools.sql
- silver_dex__fraxswap_pools.yml
- silver_dex__fraxswap_swaps.sql
- silver_dex__fraxswap_swaps.yml
>> kyberswap
...
- silver_dex__complete_dex_liquidity_pools.sql
- silver_dex__complete_dex_liquidity_pools.yml
- silver_dex__complete_dex_swaps.sql
- silver_dex__complete_dex_swaps.yml
>> lending
>> liquid_staking
Gold Layer Modeling
Summary:
- We utilize the Gold Layer to surface data to users and clients via the data studio, data shares, or API. Gold curated models are typically views that `SELECT` all columns from the `complete` curated models, and the structure of the model should be extremely simple.
Best Practices, Tips & Tricks:
- Unless specifically required, we do not recommend applying transformations in the Gold models. That should be handled at the Silver Layer.
- Please apply the majority of tests at the source, in the Silver Layer, unless new transformations or logic are added to the Gold model that may materially impact the model.
- Because these models are surfaced to users, it's recommended to order the columns in a way that is intuitive for analysts to access with ease. Please align the column order with other curated Gold models. The same goes for column names.
- Do not include any columns that begin with `_` (underscore) in the Gold models, e.g. `_log_id`, `_call_id`, `_inserted_timestamp`, etc.
- Gold curated models are typically materialized as views; however, materializing as an incremental table is an option if required for performance reasons.
Implementation Steps, Variables & Notes:
- Naming conventions:
  - Each model must be named in a standardized fashion, as these are the models that are surfaced to Flipside users. They must follow the Star Schema methodology (dim/fact/ez) and represent the category/primitive being curated.
    - `[schema]__[dim/fact/ez]_[action]`
    - e.g. `defi__dim_dex_liquidity_pools.sql`, `defi__ez_dex_swaps.sql`
- Folder Structure:
- Folders may be used to nest models categorically for organization purposes only. These folders are not surfaced to users in the Data Studio or in Snowflake.
- Model Structure:
  - Pull the `inserted_timestamp` and `modified_timestamp` columns through to Gold. Additionally, add a new `id` column that is structured as `<gold_model_name>_id` (see the sketch after this list).
    - e.g. `ez_dex_swaps_id`
  - Each `.sql` model must have a corresponding `.yml` file for tests, table, and column descriptions. In the Gold Layer, detailed table and column descriptions are required. Tests are not required unless there's new logic that may materially impact the model.
    - Each description should reference a `doc` file, for standardization purposes.
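A minimal sketch of a Gold view following these conventions. The `dbt_utils.generate_surrogate_key` macro is one common way to build the `id` column, but it is an assumption here, as is the exact column set pulled from the `complete` model; align both with your project's standards:

```sql
-- Sketch: defi__dim_dex_liquidity_pools.sql
SELECT
    block_number,
    block_timestamp,
    tx_hash,
    factory_address,
    pool_address,
    token0,
    token1,
    platform,
    -- new id column, named after the gold model
    {{ dbt_utils.generate_surrogate_key(['_log_id']) }} AS dim_dex_liquidity_pools_id,
    inserted_timestamp,
    modified_timestamp
FROM {{ ref('silver_dex__complete_dex_liquidity_pools') }}
```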
Examples, References & Sources:
Example Gold Curated Models:
Example Folder Structure and Model Hierarchy:
> models/gold/defi
>> lending
- defi__dim_dex_liquidity_pools.sql
- defi__dim_dex_liquidity_pools.yml
- defi__ez_bridge_activity.sql
- defi__ez_bridge_activity.yml
...