How to Research and Curate EVM Protocols
This guide documents how to 1) research EVM protocols and 2) build the Silver/Gold DBT model layers, from a generalized standpoint. For primitive-specific details (DEX swaps, bridging, lending, etc.), please see the other guides listed under Guides: General: DeFi.
Please read the Ethereum Developer Docs for more details on how EVM smart contracts, transactions, emitted events and functions work.
Research Phase
Summary:
Research for building protocol models is collected by triaging Protocol Documentation, community discussions, Block Explorer activity, existing data tables, and other sources such as community dashboards or DefiLlama. If you are unable to determine the right definitions, contracts, onchain events, or traces to analyze, you may need to test the protocol yourself. However, please exercise caution when interacting with smart contracts and proceed at your own risk.
Best Practices, Tips & Tricks:
- Protocol Versions:
  - There may be multiple versions of a protocol, each with the same or different contract addresses and event topics. It's crucial to identify each distinct version and determine applicable differences or upgrades to account for in the models.
  - If the structure of the events is exactly the same or similar enough, you may be able to combine the versions into one model. Break versions out into their own models or CTEs when dealing with versions that require a new set of columns, unique incremental logic, joins, etc.
- Contract Reads:
  - Certain models may require contract reads, typically in cases where data cannot be parsed or is not posted onchain. For example, the `token_address` column for bridging via Hop Protocol. Please see the documentation on Contract Reads for more details.
Implementation Steps, Variables & Notes:
- In order to properly analyze, query, or build EVM protocol models, we first need to know exactly what we're looking to curate. We do this by defining each action at its most atomic level.
  - Definitions are crucial in guiding the models, tables, and schemas we deploy, but are often the most difficult part of the curation process, as different protocols may have different meanings for similar onchain actions.
  - Definitions should include a 1-3 sentence explanation of what the intended action is and how it is generally represented onchain. E.g. "DEX Swap is a direct exchange of one token for another and occurs on a liquidity pool contract. When an address initiates a swap, one token is transferred to the pool and the other token is transferred back to the recipient address, alongside a change in the pool's balance."
- The next step is to identify the applicable contracts, event topics, or contract calls/traces that represent the actions you are looking to curate. In some situations, you may need to combine multiple contracts, events, or traces together in one query to output a table that represents the defined action.
  - This can be done by checking various sources, such as Block Explorers or the decoded data tables on Flipside, to visualize the transaction flow (see the exploratory sketch after this list). Be sure to properly collect and note each contract, the official name defined in the contract code, the applicable function calls or events emitted, and any nuances to account for.
- The final step is to write the query.
  - This is often a process of trial and error; the goal is to determine the underlying logic required to produce a table with records that represent the onchain actions or outputs. Depending on the source table(s) used, this may require manual decoding or multiple CTEs.
  - It is recommended to utilize the raw data tables (`silver.logs`, `silver.traces`) rather than the decoded data tables (`silver.decoded_logs`, `silver.decoded_traces`), as the decoding process relies on collected/submitted ABIs and therefore may not encapsulate 100% of records for a contract.
  - When writing the queries, please include all relevant columns necessary to get a complete picture of the onchain action. These queries will be directly converted to incremental DBT models.
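Before writing the final query against the raw tables, a quick exploratory query against the decoded tables can help confirm which events a contract actually emits. A minimal sketch, assuming `silver.decoded_logs` exposes an `event_name` column (verify against your deployment's schema):

```sql
-- Hypothetical exploration: which events does the Fraxswap v1 factory emit?
SELECT
    event_name,
    COUNT(*) AS event_count
FROM ethereum.silver.decoded_logs
WHERE contract_address = '0xb076b06f669e682609fb4a8c6646d2619717be4b' -- Fraxswap FactoryV1
GROUP BY 1
ORDER BY 2 DESC;
```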
Examples, References & Sources:
Example Research Submission
Protocol: Fraxswap
Action: Liquidity Pool Address/Pair Creation
Definition: <insert definition>
Contract:
* v1: '0xb076b06f669e682609fb4a8c6646d2619717be4b' --Frax Finance:Fraxswap FactoryV1
* v2: '0x43ec799eadd63848443e2347c49f5f52e8fe0f6f' --Frax Finance:Fraxswap FactoryV2
Event Topic 0:
* v1/v2: '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' --PairCreated
Example Txn:
* v1: https://etherscan.io/tx/0x817fc5df7d72992f2244f2ee521f4572d86ba79ba49bbdd8ad4067b586cf82e7
* v2: https://etherscan.io/tx/0xa8af5961e4f263c2c4c99387f55339174f20d1d5c001a6df5d8c56641b25d5d4
Notes: <include any notes, nuances, or edge cases worth calling out from the research>
Query:
```sql
SELECT
    block_number,
    block_timestamp,
    tx_hash,
    event_index,
    origin_function_signature,
    origin_from_address,
    origin_to_address,
    contract_address AS factory_address,
    -- split the hex payload into 32-byte (64-character) words
    regexp_substr_all(SUBSTR(DATA, 3, len(DATA)), '.{64}') AS segmented_data,
    -- the pair address is the first word of the data payload
    CONCAT('0x', SUBSTR(segmented_data [0] :: STRING, 25, 40)) AS pool_address,
    -- token0 / token1 are indexed parameters, emitted as topics 1 and 2
    CONCAT('0x', SUBSTR(topics [1] :: STRING, 27, 40)) AS token0,
    CONCAT('0x', SUBSTR(topics [2] :: STRING, 27, 40)) AS token1,
    TRY_TO_NUMBER(utils.udf_hex_to_int(
        segmented_data [1] :: STRING
    )) AS pool_id,
    _log_id,
    _inserted_timestamp
FROM ethereum.silver.logs
WHERE contract_address IN (
        '0xb076b06f669e682609fb4a8c6646d2619717be4b', -- v1 factory
        '0x43ec799eadd63848443e2347c49f5f52e8fe0f6f' -- v2 factory
    )
    AND topics [0] :: STRING = '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' -- PairCreated
    AND tx_status = 'SUCCESS'
LIMIT 10;
```
Silver Layer Modeling
Summary:
- Protocol Specific Models:
  - Built with `silver.logs` and/or `silver.traces`, these silver models are designed to capture essential information about various protocols at a granular level. Each model is specific to a set of contracts, events, or traces relevant to a specific version of a protocol for the category/primitive being analyzed. Typically, these Protocol Specific models will be rolled up and unioned together in a `complete` model, and for that reason require a minimum level of standardization that can be found in this guide.
- Complete Models:
  - The curated `complete` models are built with the intention of creating a centralized model combining all like models into one, by category/primitive. Not only does this help the debugging process, it also improves query performance within both curated and downstream models (see the sketch after this list). It achieves this by:
    - Materializing a table that unions all relevant models together in one place, rather than having all of this logic exist in the Gold View surfaced to users
    - Reducing the number of joins required on the contracts and prices tables
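A minimal sketch of this union pattern, reusing the Fraxswap example from above; `silver_dex__kyberswap_pools` is hypothetical, and the real set of unioned models will vary by chain and primitive:

```sql
WITH fraxswap AS (
    SELECT
        block_number, block_timestamp, tx_hash, pool_address,
        token0, token1, 'fraxswap' AS platform,
        _log_id, _inserted_timestamp
    FROM {{ ref('silver_dex__fraxswap_pools') }}
),
kyberswap AS (
    SELECT
        block_number, block_timestamp, tx_hash, pool_address,
        token0, token1, 'kyberswap' AS platform,
        _log_id, _inserted_timestamp
    FROM {{ ref('silver_dex__kyberswap_pools') }} -- hypothetical model name
)
-- standardized column names across protocol models make the UNION ALL trivial
SELECT * FROM fraxswap
UNION ALL
SELECT * FROM kyberswap
```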
Best Practices, Tips & Tricks:
- It's recommended to add a WHERE clause to each CTE selecting from the core tables, filtering out failed transactions, logs, or traces and only including successful ones (see the model skeleton after this list).
- Standardized column names are crucial. Each protocol table will be unioned together downstream in the `complete` models.
- Utilize CTEs (Common Table Expressions) as much as possible for improved readability and maintainability.
- Use `QUALIFY` clauses to handle duplicates and ensure data integrity.
- Modeling with `silver.decoded_logs` or `silver.decoded_traces` is an option, but it is not the recommended approach. Please reserve usage of these tables in your models for protocols/contracts that are too difficult or resource intensive to manually parse.
- Leverage Snowflake's `regexp_substr_all` function for efficient parsing of event data. See examples.
- For details on how to utilize specific UDFs and convert between data types, please reference the LiveQuery README.
- Heal Logic:
  - The `complete` models include `heal logic` that handles late-arriving or missing data. Please see the Incremental Heal Logic guide for more information.
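Putting these practices together, a skeleton for a protocol-specific silver model might look like the sketch below. This is illustrative, not a house template: the `silver__logs` ref name is an assumption (use whatever source or ref your project defines for raw logs), the event-decoding columns are elided, and incremental logic is covered in the next section.

```sql
-- Sketch of a protocol-specific silver model (incremental config omitted)
WITH pool_creation AS (
    SELECT
        block_number,
        block_timestamp,
        tx_hash,
        event_index,
        contract_address AS factory_address,
        -- manual decoding of topics / data goes here (see Manual Decoding below)
        _log_id,
        _inserted_timestamp
    FROM {{ ref('silver__logs') }} -- assumed ref name for silver.logs
    WHERE contract_address = '0xb076b06f669e682609fb4a8c6646d2619717be4b' -- Fraxswap FactoryV1
        AND topics [0] :: STRING = '0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9' -- PairCreated
        AND tx_status = 'SUCCESS' -- only include successful transactions
)
SELECT *
FROM pool_creation
-- deduplicate: keep the most recently inserted record per log
QUALIFY ROW_NUMBER() OVER (PARTITION BY _log_id ORDER BY _inserted_timestamp DESC) = 1
```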
Implementation Steps, Variables & Notes:
- Naming conventions:
  - Protocol Specific Models:
    - Each model must be named according to the protocol and action being curated. The name should be descriptive enough for any engineer to debug and easily understand what it represents, yet standardized, short, and condensed. This includes the schema + category/primitive, protocol name, action/event name, and version (where applicable).
      - `[schema]_[category (optional)]__[protocol name]_[action]_[version (optional)]`
      - e.g. `silver_dex__fraxswap_pools`, `silver_dex__fraxswap_swaps`
  - Complete Models:
    - Each model must be named according to the category/primitive it represents and contain `complete`.
      - `[schema]_[category (optional)]__[complete]_[action]`
      - e.g. `silver_dex__complete_dex_liquidity_pools.sql`, `silver_bridge__complete_bridge_activity.sql`
- Folder Structure:
  - Each DeFi Primitive should have its own folder, and each curated protocol should be nested within that. Additionally, curated protocol models are unioned together into a `complete` model, for performance / DBT best practices, which exists within the category/primitive folder but outside of the protocol-specific folders.
- Model Structure:
  - Each `.sql` model must have a corresponding `.yml` file for tests, table, and column descriptions. In the Silver Layer, table and column descriptions are not required, but recommended if there are nuances that need to be documented. Tests are required.
- Manual Decoding:
  - We leverage a combination of Snowflake functions and internal UDFs to manually parse and decode transaction data. Some examples include:
    - Parse and segment blockchain event data:
      - `regexp_substr_all(SUBSTR(DATA, 3, len(DATA)), '.{64}') AS segmented_data`
      - `SUBSTR(DATA, 3, len(DATA))`: Extracts a substring from the `DATA` field, starting from the 3rd character to the end, which removes the '0x' prefix from the hexadecimal data.
      - `regexp_substr_all(..., '.{64}')`: Applies a regular expression to split the substring into segments of 64 characters each. In Ethereum, each parameter in event data is typically 32 bytes (64 hexadecimal characters) long.
      - The result, `segmented_data`, is an array where each element is a 64-character segment of the original data. This makes it easier to extract individual parameters from the event data.
    - Extract and format an Ethereum address from the segmented event data:
      - `CONCAT('0x', SUBSTR(segmented_data [0] :: STRING, 25, 40))`
      - `segmented_data[0] :: STRING`: Accesses the first element of the `segmented_data` array (created by the `regexp_substr_all` function above) and casts it to a string. This element typically represents a 32-byte (64 hexadecimal characters) parameter from the event data.
      - `SUBSTR(..., 25, 40)`: Extracts a substring of 40 characters, starting from the 25th character of the string. Ethereum addresses are 20 bytes (40 hexadecimal characters) long, and they're often padded to 32 bytes in event data.
      - `CONCAT('0x', ...)`: Prepends '0x' to the extracted 40-character string. Ethereum addresses are conventionally represented with a '0x' prefix.
    - Extract and format an Ethereum address from event topics:
      - `CONCAT('0x', SUBSTR(topics [1] :: STRING, 27, 40))`
      - `topics[1] :: STRING`: Accesses the second topic (index 1) from the event log and casts it to a string. In many events, topics 1 and 2 contain an indexed address parameter.
      - `SUBSTR(..., 27, 40)`: Extracts a substring of 40 characters, starting from the 27th character of the topic string.
      - `CONCAT('0x', ...)`: Prepends '0x' to the extracted 40-character string.
    - Convert a hexadecimal value from event data into a numeric value:
      - `TRY_TO_NUMBER(utils.udf_hex_to_int(segmented_data[1] :: STRING))`
      - `utils.udf_hex_to_int(...)`: A custom user-defined function (UDF) that converts a hexadecimal string to an integer.
      - `TRY_TO_NUMBER(...)`: A Snowflake function that attempts to convert its argument to a number. If the conversion fails, it returns NULL instead of raising an error. A worked example follows this list.
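As a quick sanity check on the numeric conversion, a 32-byte data word holding the value 100 decodes as follows (the literal below is illustrative):

```sql
-- 0x64 = 100; event data words are left-padded with zeros to 64 hex characters
SELECT TRY_TO_NUMBER(utils.udf_hex_to_int(
    '0000000000000000000000000000000000000000000000000000000000000064'
)) AS decoded_value; -- returns 100
```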
- Incremental Logic:
  - We add short, hour-based lookbacks to our curated models to limit them to selecting only the last `X` hours of data. This ensures that the models run efficiently with a limit, but also lets them capture any late-arriving data from the past few hours. For models built on event logs and trace data, we typically use `12 hours`, but a shorter or longer window may be beneficial, depending on the model.
  - When the model runs in `incremental` mode, indicated by the `is_incremental()` jinja check, it filters data based on the `_inserted_timestamp`, selecting records from the last `X` hours since the model's last run.
  - The `{{ this }}` reference points to the current model table.
```sql
{% if is_incremental() %}
AND _inserted_timestamp >= (
    SELECT MAX(_inserted_timestamp) - INTERVAL '12 hours'
    FROM {{ this }}
)
{% endif %}
```
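For context, this filter typically sits inside a model configured as incremental. A minimal sketch, assuming `_log_id` serves as the unique key and `silver__logs` as the ref name (your project may set additional options such as clustering or merge behavior):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = '_log_id'
) }}

SELECT
    block_number,
    block_timestamp,
    tx_hash,
    _log_id,
    _inserted_timestamp
FROM {{ ref('silver__logs') }} -- assumed ref name for silver.logs
WHERE tx_status = 'SUCCESS'

{% if is_incremental() %}
AND _inserted_timestamp >= (
    SELECT MAX(_inserted_timestamp) - INTERVAL '12 hours'
    FROM {{ this }}
)
{% endif %}
```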
- Qualify Statement:
  - The `QUALIFY` clause is used to select the most recent record from the table, based on unique ids. We typically add this to the final SELECT statement in a model if there are duplicate values. Please test thoroughly, as this can be a resource-intensive clause and may add significant processing time.
  - It works by assigning a row number to each record within groups of the same unique id(s), designated by `PARTITION BY`. In this example, only one id is used, but multiple can be added. Within each group, it orders the records by the latest value, in this case `_inserted_timestamp`. Either `DESC` or `ASC` order may be used depending on the intended outcome. The `ROW_NUMBER() = 1` condition ensures that the model keeps only the first row in each group, which in this case is the most recently inserted record due to the descending order.

```sql
QUALIFY (ROW_NUMBER() OVER (PARTITION BY _log_id ORDER BY _inserted_timestamp DESC)) = 1
```
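Where a single id isn't unique on its own, multiple keys can be combined in the `PARTITION BY`; for example, deduplicating on the transaction hash and event index pair:

```sql
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY tx_hash, event_index
    ORDER BY _inserted_timestamp DESC
) = 1
```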
Examples, References & Sources:
Example Silver Curated Models:
- Protocol Specific Models:
- Complete Models:
Example Folder Structure and Model Hierarchy:
> models/silver/defi
>> bridge
>> dex
>> balance
...
>> frax
- silver_dex__fraxswap_pools.sql
- silver_dex__fraxswap_pools.yml
- silver_dex__fraxswap_swaps.sql
- silver_dex__fraxswap_swaps.yml
>> kyberswap
...
- silver_dex__complete_dex_liquidity_pools.sql
- silver_dex__complete_dex_liquidity_pools.yml
- silver_dex__complete_dex_swaps.sql
- silver_dex__complete_dex_swaps.yml
>> lending
>> liquid_staking
Gold Layer Modeling
Summary:
- We utilize the Gold Layer to surface data to users and clients via the data studio, data shares, or API. Gold curated models are typically views that `SELECT` all columns from the `complete` curated models, and the structure of the model should be extremely simple.
Best Practices, Tips & Tricks:
- Unless specifically required, we do not recommend applying transformations in the Gold models. That should be handled at the Silver Layer.
- Please apply the majority of tests at the source, in the Silver Layer, unless new transformations or logic are added to the Gold model that may materially impact the model.
- Because these models are surfaced to users, it's recommended to order the columns in a way that is intuitive for analysts to access with ease. Please align the column order with other curated Gold models. The same goes for column names.
- Do not include any columns that begin with `_` (underscore) in the Gold models, e.g. `_log_id`, `_call_id`, `_inserted_timestamp`, etc.
- Gold curated models are typically materialized as views; however, materializing as an incremental table is an option if required for performance reasons.
Implementation Steps, Variables & Notes:
- Naming conventions:
  - Each model must be named in a standardized fashion, as these are the models that are surfaced to Flipside users. They must follow the Star Schema methodology (dim/fact/ez) and represent the category/primitive being curated.
    - `[schema]__[dim/fact/ez]_[action]`
    - e.g. `defi__dim_dex_liquidity_pools.sql`, `defi__ez_dex_swaps.sql`
- Folder Structure:
- Folders may be used to nest models categorically for organization purposes only. These folders are not surfaced to users in the Data Studio or in Snowflake.
- Model Structure:
  - Pull the `inserted_timestamp` and `modified_timestamp` columns through to Gold. Additionally, add a new `id` column that is structured as `<gold_model_name>_id` (see the sketch after this list).
    - e.g. `ez_dex_swaps_id`
  - Each `.sql` model must have a corresponding `.yml` file for tests, table, and column descriptions. In the Gold Layer, detailed table and column descriptions are required. Tests are not required unless there's new logic that may materially impact the model.
    - Each description should reference a `doc` file, for standardization purposes.
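A minimal sketch of a Gold view following these conventions. The `dbt_utils.generate_surrogate_key` macro is one common way to build the `id` column, but it is an assumption here, as is the exact column set pulled from the `complete` model; align both with your project's standards:

```sql
-- Sketch: defi__dim_dex_liquidity_pools.sql
SELECT
    block_number,
    block_timestamp,
    tx_hash,
    factory_address,
    pool_address,
    token0,
    token1,
    platform,
    -- new id column, named after the gold model
    {{ dbt_utils.generate_surrogate_key(['_log_id']) }} AS dim_dex_liquidity_pools_id,
    inserted_timestamp,
    modified_timestamp
FROM {{ ref('silver_dex__complete_dex_liquidity_pools') }}
```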
Examples, References & Sources:
Example Gold Curated Models:
Example Folder Structure and Model Hierarchy:
> models/gold/defi
>> lending
- defi__dim_dex_liquidity_pools.sql
- defi__dim_dex_liquidity_pools.yml
- defi__ez_bridge_activity.sql
- defi__ez_bridge_activity.yml
...