GA4 ingestion pipeline - willsql4food/LakehouseToolkit GitHub Wiki

Ingesting Google BigQuery via ADF

Google BigQuery imposes a limit of approximately 1M rows returned by a single query. If your extract is expected to exceed that limit, you'll need logic to break it into smaller batches.

```mermaid
---
title: GBQ to Azure
---
flowchart LR
    
    classDef green fill:#44aa44,stroke:#008800,stroke-width:2px;
    classDef blue fill:#222288,stroke:#000088,stroke-width:2px;
    classDef mdblue fill:#5555aa,stroke:#0000aa,stroke-width:2px;
    classDef yellow fill:#888800,stroke:#444400,stroke-width:2px;
    classDef brightYellow fill:#BBBB00,stroke:#444400,stroke-width:2px,color:#0000AA;
    classDef red fill:#880000,stroke:#440000,stroke-width:2px;

    subgraph "Google BigQuery"
        obj[GBQ Objects]; class obj green
    end

    subgraph Azure
        lc[(LoadControl)]; class lc mdblue

        subgraph "Data Factory"
            rc[Get Row\nCounts]; class rc blue
            sq[Build\nSELECT\nstatement]; class sq blue

            over{Over\nLimit?}; class over yellow
            over2{Still over\nlimit?}; class over2 yellow
            logic{Know\nAlgorithm?}; class logic yellow

            where[[Simple\nWHERE clause]]; class where blue
            where2[[Compound\nWHERE clause]]; class where2 blue

            full[Load full\nObject]; class full blue
            batch[Load a batch]; class batch blue

            errmsg[[Send Error\nMessage]]; class errmsg red
        end

        lake([Azure\nLakehouse]); class lake mdblue
    end

    obj --Stage\nAvailable\nobjects--> lc
    obj --Metadata\nfor objects--> lc
    lc --> rc --> over
    lc --> sq -.-> full
    sq -.-> where
    sq -.-> where2
    rc --Audit\nTo Load --> lc
    over --No --> full --Audit\nLoaded --> lc

    over --Yes --> logic --Yes --> over2
    logic --No --> errmsg
    
    over2 --No --> where --> batch
    over2 --Yes --> where2 --> batch

    batch --Audit\nLoaded --> lc
    
    full --Write\nParquet--> lake
    full --Write\nAudit--> lake

    batch --Write\nParquet--> lake
    batch --Write\nAudit--> lake
```

Example: events

The daily `events_yyyymmdd` tables will likely contain several million records each. A natural batching column is `event_name`.

Generate a count of rows by `event_name` with a running total that resets each time it would exceed the limit. This way, several smaller data sets can be combined into a single copy activity to reduce cost:

| group_id | id | event_name | _count | running_total |
|---------:|---:|------------|-------:|--------------:|
| 1 | 1 | page_view | 563,382 | 563,382 |
| 2 | 2 | image_interactions | 462,519 | 462,519 |
| 2 | 3 | view_item_list | 436,537 | 899,056 |
| 3 | 4 | header_click | 263,528 | 263,528 |
| 3 | 5 | scroll_depth | 249,078 | 512,606 |
| 3 | 6 | user_engagement | 189,903 | 702,509 |
| 3 | 7 | view_item | 176,140 | 878,649 |
| 4 | 8 | session_start | 173,927 | 173,927 |
| 4 | 9 | first_visit | 104,025 | 277,952 |
| 4 | 10 | Roomvo | 73,700 | 351,652 |
| 4 | 11 | select_item | 56,965 | 408,617 |
| ... | ... | ... | ... | ... |
| 4 | 84 | link_click_export_services | 3 | 810,642 |
| 4 | 85 | screen_view | 1 | 810,643 |
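The grouping rule can also be sketched outside SQL. Here is a minimal Python illustration of the same greedy running total: walk the counts from largest to smallest and start a new group whenever adding the next event would reach the limit. The function name is mine, and the sample counts are the top rows of the table above.

```python
# Greedy batching: mirrors the recursive CTE's group_id / running_total.
LIMIT = 1_000_000

def batch_events(counts, limit=LIMIT):
    """counts: iterable of (event_name, row_count).
    Returns a list of groups, each a list of (event_name, row_count)."""
    groups, running = [], 0
    for name, n in sorted(counts, key=lambda c: -c[1]):
        # Start a new group when the running total would hit the limit.
        if not groups or running + n >= limit:
            groups.append([])
            running = 0
        groups[-1].append((name, n))
        running += n
    return groups

sample = [("page_view", 563_382), ("image_interactions", 462_519),
          ("view_item_list", 436_537), ("header_click", 263_528),
          ("scroll_depth", 249_078)]
for gid, grp in enumerate(batch_events(sample), start=1):
    print(gid, [e for e, _ in grp], sum(n for _, n in grp))
```

Running this on the sample reproduces the first three groups of the table: `page_view` alone, then `image_interactions` + `view_item_list` at 899,056 rows, then a third group starting with `header_click`.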

Sample SQL

```sql
with recursive cte as (
  /* Count and row_number by event_name, ordered with the highest row count first */
  select    event_name, count(*) _count, row_number() over (order by count(*) desc) id
  from      `fnd-cloud-project.analytics_250303278.events_20240112`
  group by  event_name 
), rt as (
  /* Conditional sum with prior row(s) until just before 1M, then start a new running total */
  select    id, event_name, _count, _count running_total, 1 group_id
  from      cte
  where     id = 1
  union all
  select    a.id, a.event_name, a._count,
            case    when a._count + b.running_total < 1000000 
                    then a._count + b.running_total 
                    else a._count 
            end running_total,
            case    when a._count + b.running_total < 1000000 
                    then b.group_id 
                    else b.group_id + 1 
            end group_id
  from      cte a
  join      rt b on b.id = a.id - 1
)

select      group_id, event_name, _count, running_total
from        rt
order by    group_id
```

You can generate a WHERE clause with a small modification to the above query:

```sql
...

select      group_id, 
            concat(' where event_name in (', 
                string_agg(
                    concat('\'', event_name, '\''), ', '), ')') events, 
            max(running_total) running_total
from        rt
group by    group_id
order by    group_id
```
| group_id | events | running_total |
|---------:|--------|--------------:|
| 1 | where event_name in ('page_view') | 563,382 |
| 2 | where event_name in ('view_item_list', 'image_interactions') | 899,056 |
| 3 | where event_name in ('user_engagement', 'view_item', 'scroll_depth', 'header_click') | 878,649 |
| 4 | where event_name in ('search_tracking', 'store_search_modal_interaction', 'empty_cart', 'slider_visualizer_over', 'button_click', 'remove_from_cart', 'add_payment_info', 'form_tracking', 'sort_by', 'more_information', 'add_over', 'live_chat', 'link_click_blog', 'initiated_add_to_my_project', 'session_start', 'store_link', 'blog_over_click', 'site_error', 'search_help_center', 'add_shipping_info', 'completed_add_to_my_project', 'link_click_my_account', 'footer_click', 'pdp_checkbox', 'link_click_quick_view', 'button_click_help_center', 'design_step', 'open_close_accordion', 'add_to_cart', 'click', 'install_mats_carousel', 'button_click_pro_services', 'installation_service_banner', 'link_click_export_services', 'store_detail_image_interactions', 'button_click_credit_center', 'select_item', 'checkout_customer_type_selection', 'link_click', 'clp_banner', 'screen_view', 'FAQs', 'design_customer_type', 'search_store', 'link_click_inspiration_center', 'select_schedule', 'search_interaction', 'image_interactions_click', 'view_cart', 'design_type_pro', 'button_click_gift_card', 'button_click_contact_us', 'link_click_help_center', 'button_click_workshop', 'videos', 'Roomvo', 'session_expiration', 'button_click_store', 'slider_visualizer_over_click', 'mpow_link_click', 'first_visit', 'pdp_radio_buttons', 'begin_checkout', 'link_click_pro_services', 'remove_over', 'plp_banner', 'select_store', 'file_download', 'button_click_store_detail', 'use_calculator', 'view_cart_button', 'clp_click_tracking', 'store_detail_link', 'content_slots', 'remove_all_over', 'purchase', 'interaction_accordion', 'button_click_social') | 810,643 |
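If the clause is assembled in orchestration code rather than in SQL (for example in a notebook feeding an ADF pipeline parameter), the same concatenation is trivial. A minimal Python sketch, with an illustrative helper name; GA4 event names are snake_case, so no quote escaping is assumed:

```python
# Build the per-batch WHERE clause from a group of event names,
# mirroring the concat/string_agg expression in the SQL above.
def where_clause(event_names):
    quoted = ", ".join(f"'{name}'" for name in event_names)
    return f" where event_name in ({quoted})"

print(where_clause(["view_item_list", "image_interactions"]))
# → " where event_name in ('view_item_list', 'image_interactions')"
```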

Secondary batching...

It's possible that a single `event_name` represents more rows than the limit allows; in that case, you will need a compound WHERE clause. A good secondary candidate is the `event_timestamp` field.

`event_timestamp` is a Unix timestamp in microseconds: the number of seconds since 1970-01-01 UTC carried to six digits of sub-second precision and stored as an integer.

Take the ceiling of the row count divided by 1M, then divide 86,400,000,000 (the number of microseconds in one day) by that result. Use the quotient to split the day's `event_timestamp` range into equal intervals, and apply each interval as a secondary bracket inside the `event_name` filter. This assumes events are spread reasonably evenly across the day.

```text
# Example:
    page_view has 2,659,247 rows
    ceiling(2659247 / 1000000) = 3
    86400000000 / 3 = 28800000000
    min(event_timestamp) = 1705381200152846
    1705381200152846 + 1 * 28800000000 = 1705410000152846
    1705381200152846 + 2 * 28800000000 = 1705438800152846

    filter1 = " where event_name in ('page_view') and event_timestamp <= 1705410000152846 "
    filter2 = " where event_name in ('page_view') and event_timestamp > 1705410000152846 and event_timestamp <= 1705438800152846 "
    filter3 = " where event_name in ('page_view') and event_timestamp > 1705438800152846 "
```
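The bracketing arithmetic can be sketched in Python as well. The constants below are the row count and minimum timestamp from the `page_view` example; the function name is illustrative, and the even split again assumes events are distributed roughly uniformly across the day.

```python
import math

DAY_MICROS = 86_400_000_000  # microseconds in one day
LIMIT = 1_000_000            # row limit per batch

def timestamp_filters(event_name, row_count, min_ts, limit=LIMIT):
    """Split one day's event_timestamp range for a single event_name
    into ceil(row_count / limit) contiguous brackets."""
    parts = math.ceil(row_count / limit)
    step = DAY_MICROS // parts
    filters = []
    for i in range(parts):
        lo = min_ts + i * step
        hi = min_ts + (i + 1) * step
        if i == 0:
            cond = f"event_timestamp <= {hi}"
        elif i == parts - 1:
            cond = f"event_timestamp > {lo}"
        else:
            cond = f"event_timestamp > {lo} and event_timestamp <= {hi}"
        filters.append(f" where event_name in ('{event_name}') and {cond} ")
    return filters

for f in timestamp_filters("page_view", 2_659_247, 1_705_381_200_152_846):
    print(f)
```

For the example inputs this yields three filters with boundaries at min + 28,800,000,000 and min + 57,600,000,000 microseconds.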