GA4 ingestion pipeline - willsql4food/LakehouseToolkit GitHub Wiki
Google BigQuery imposes a limit of approximately 1M rows returned by a single query. If your extract is expected to exceed that limit, you'll need logic to break it into smaller batches.
```mermaid
---
title: GBQ to Azure
---
flowchart LR
classDef green fill:#44aa44,stroke:#008800,stroke-width:2px;
classDef blue fill:#222288,stroke:#000088,stroke-width:2px;
classDef mdblue fill:#5555aa,stroke:#0000aa,stroke-width:2px;
classDef yellow fill:#888800,stroke:#444400,stroke-width:2px;
classDef brightYellow fill:#BBBB00,stroke:#444400,stroke-width:2px,color:#0000AA;
classDef red fill:#880000,stroke:#440000,stroke-width:2px;
subgraph "Google Big Query"
obj[GBQ Objects]; class obj green
end
subgraph Azure
lc[(LoadControl)]; class lc mdblue
subgraph "Data Factory"
rc[Get Row\nCounts]; class rc blue
sq[Build\nSELECT\nstatement]; class sq blue
over{Over\nLimit?}; class over yellow
over2{Still over\nlimit?}; class over2 yellow
logic{Know\nAlgorithm?}; class logic yellow
where[[Simple\nWHERE clause]]; class where blue
where2[[Compound\nWHERE clause]]; class where2 blue
full[Load full\nObject]; class full blue
batch[Load a batch]; class batch blue
errmsg[[Send Error\nMessage]]; class errmsg red
end
lake([Azure\nLakehouse]); class lake mdblue
end
obj --Stage\nAvailable\nobjects--> lc
obj --Metadata\nfor objects--> lc
lc --> rc --> over
lc --> sq -.-> full
sq -.-> where
sq -.-> where2
rc --Audit\nTo Load --> lc
over --No --> full --Audit\nLoaded --> lc
over --Yes --> logic --Yes --> over2
logic --No --> errmsg
over2 --No --> where --> batch
over2 --Yes --> where2 --> batch
batch --Audit\nLoaded --> lc
full --Write\nParquet--> lake
full --Write\nAudit--> lake
batch --Write\nParquet--> lake
batch --Write\nAudit--> lake
```
A daily `events_yyyymmdd` table will likely contain several million records; a natural batching column is `event_name`.

Generate a count of rows by `event_name` with a running total that resets every time it would exceed the limit. This way, several smaller data sets can be combined into one copy activity to reduce costs:
| group_id | id | event_name | _count | running_total |
|---|---|---|---|---|
| 1 | 1 | page_view | 563,382 | 563,382 |
| 2 | 2 | image_interactions | 462,519 | 462,519 |
| 2 | 3 | view_item_list | 436,537 | 899,056 |
| 3 | 4 | header_click | 263,528 | 263,528 |
| 3 | 5 | scroll_depth | 249,078 | 512,606 |
| 3 | 6 | user_engagement | 189,903 | 702,509 |
| 3 | 7 | view_item | 176,140 | 878,649 |
| 4 | 8 | session_start | 173,927 | 173,927 |
| 4 | 9 | first_visit | 104,025 | 277,952 |
| 4 | 10 | Roomvo | 73,700 | 351,652 |
| 4 | 11 | select_item | 56,965 | 408,617 |
| ... | ... | ... | ... | ... |
| 4 | 84 | link_click_export_services | 3 | 810,642 |
| 4 | 85 | screen_view | 1 | 810,643 |
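The grouping above can be sketched in Python. This is a minimal illustration of the same greedy logic, assuming a 1,000,000-row limit; `assign_groups` is a hypothetical helper name, and the counts below are the first four rows of the table, not real data.

```python
# Greedy running-total grouping: walk counts largest-first and start a
# new group whenever adding the next count would reach the limit.
LIMIT = 1_000_000

def assign_groups(counts, limit=LIMIT):
    """counts: iterable of (event_name, row_count).
    Returns (group_id, event_name, count, running_total) per row."""
    result = []
    group_id, running = 0, 0
    for name, n in sorted(counts, key=lambda item: -item[1]):
        if group_id == 0 or running + n >= limit:
            group_id += 1        # reset: start a new group
            running = n
        else:
            running += n         # still under the limit: same group
        result.append((group_id, name, n, running))
    return result

rows = [("page_view", 563_382), ("image_interactions", 462_519),
        ("view_item_list", 436_537), ("header_click", 263_528)]
for row in assign_groups(rows):
    print(row)
```

Running this against the sample counts reproduces the first four rows of the table above (groups 1, 2, 2, 3).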
```sql
with recursive cte as (
  /* Count and row_number by event_name, ordered with highest row count first */
  select event_name, count(*) _count, row_number() over (order by count(*) desc) id
  from `fnd-cloud-project.analytics_250303278.events_20240112`
  group by event_name
), rt as (
  /* Conditional sum with prior row(s) until just before 1M, then start a new running total */
  select id, event_name, _count, _count running_total, 1 group_id
  from cte
  where id = 1
  union all
  select a.id, a.event_name, a._count,
    case when a._count + b.running_total < 1000000
         then a._count + b.running_total
         else a._count
    end running_total,
    case when a._count + b.running_total < 1000000
         then b.group_id
         else b.group_id + 1
    end group_id
  from cte a
  join rt b on b.id = a.id - 1
)
select group_id, event_name, _count, running_total
from rt
order by group_id
```

You can generate a WHERE clause with a small modification to the above query:
```sql
...
select group_id,
  concat(' where event_name in (',
    string_agg(
      concat('\'', event_name, '\''), ', '), ')') events,
  max(running_total) running_total
from rt
group by group_id
order by group_id
```

| group_id | events | running_total |
|---|---|---|
| 1 | where event_name in ('page_view') | 563,382 |
| 2 | where event_name in ('view_item_list', 'image_interactions') | 899,056 |
| 3 | where event_name in ('user_engagement', 'view_item', 'scroll_depth', 'header_click') | 878,649 |
| 4 | where event_name in ('search_tracking', 'store_search_modal_interaction', 'empty_cart', 'slider_visualizer_over', 'button_click', 'remove_from_cart', 'add_payment_info', 'form_tracking', 'sort_by', 'more_information', 'add_over', 'live_chat', 'link_click_blog', 'initiated_add_to_my_project', 'session_start', 'store_link', 'blog_over_click', 'site_error', 'search_help_center', 'add_shipping_info', 'completed_add_to_my_project', 'link_click_my_account', 'footer_click', 'pdp_checkbox', 'link_click_quick_view', 'button_click_help_center', 'design_step', 'open_close_accordion', 'add_to_cart', 'click', 'install_mats_carousel', 'button_click_pro_services', 'installation_service_banner', 'link_click_export_services', 'store_detail_image_interactions', 'button_click_credit_center', 'select_item', 'checkout_customer_type_selection', 'link_click', 'clp_banner', 'screen_view', 'FAQs', 'design_customer_type', 'search_store', 'link_click_inspiration_center', 'select_schedule', 'search_interaction', 'image_interactions_click', 'view_cart', 'design_type_pro', 'button_click_gift_card', 'button_click_contact_us', 'link_click_help_center', 'button_click_workshop', 'videos', 'Roomvo', 'session_expiration', 'button_click_store', 'slider_visualizer_over_click', 'mpow_link_click', 'first_visit', 'pdp_radio_buttons', 'begin_checkout', 'link_click_pro_services', 'remove_over', 'plp_banner', 'select_store', 'file_download', 'button_click_store_detail', 'use_calculator', 'view_cart_button', 'clp_click_tracking', 'store_detail_link', 'content_slots', 'remove_all_over', 'purchase', 'interaction_accordion', 'button_click_social') | 810,643 |
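For reference, the clause assembly that `string_agg` performs can be sketched in Python. This is an illustrative equivalent, not part of the pipeline; `build_where_clauses` is a hypothetical helper name.

```python
# Assemble one "where event_name in (...)" clause per group, mirroring
# the concat/string_agg query above.
from collections import defaultdict

def build_where_clauses(pairs):
    """pairs: iterable of (group_id, event_name).
    Returns {group_id: where_clause_string}."""
    by_group = defaultdict(list)
    for gid, name in pairs:
        by_group[gid].append(name)
    return {gid: " where event_name in ({})".format(
                ", ".join("'{}'".format(n) for n in names))
            for gid, names in by_group.items()}

clauses = build_where_clauses([(1, "page_view"),
                               (2, "view_item_list"),
                               (2, "image_interactions")])
print(clauses[2])
# → " where event_name in ('view_item_list', 'image_interactions')"
```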
It's possible that a single `event_name` accounts for more rows than the limit allows; in that case you will need a compound WHERE clause. A good secondary candidate is the `event_timestamp` field.
This is a Unix timestamp in microseconds (microseconds elapsed since 1970-01-01 UTC), so one day spans 86,400,000,000 units.
Take the ceiling of the row count divided by 1M to get the number of batches, then divide 86,400,000,000 by that number. Use the result to split the day's `event_timestamp` range into equal slices, and apply each slice as a secondary bracket alongside the `event_name` filter.
Example:

```
page_view has 2,659,247 rows
ceiling(2659247 / 1000000) = 3                       -- three batches
86400000000 / 3 = 28800000000                        -- slice width in microseconds
min(event_timestamp) = 1705381200152846
1705381200152846 + 28800000000 = 1705410000152846    -- first boundary
1705381200152846 + 57600000000 = 1705438800152846    -- second boundary

filter1 = " where event_name in ('page_view') and event_timestamp <= 1705410000152846 "
filter2 = " where event_name in ('page_view') and event_timestamp > 1705410000152846 and event_timestamp <= 1705438800152846 "
filter3 = " where event_name in ('page_view') and event_timestamp > 1705438800152846 "
```
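The bracketing arithmetic can be sketched as a small Python function. This is an illustrative sketch under the method described above (equal time slices over one day of microsecond timestamps); `timestamp_filters` is a hypothetical helper name.

```python
# Split one over-limit event_name into equal event_timestamp slices.
import math

MICROS_PER_DAY = 86_400_000_000   # microseconds in one day
LIMIT = 1_000_000

def timestamp_filters(event_name, row_count, min_ts, limit=LIMIT):
    """Return one WHERE clause per batch, bracketing event_timestamp
    into ceil(row_count / limit) equal slices of the day."""
    batches = math.ceil(row_count / limit)
    step = MICROS_PER_DAY // batches
    filters = []
    for i in range(batches):
        clause = f" where event_name in ('{event_name}')"
        if i > 0:                  # lower bound for all but the first slice
            clause += f" and event_timestamp > {min_ts + i * step}"
        if i < batches - 1:        # upper bound for all but the last slice
            clause += f" and event_timestamp <= {min_ts + (i + 1) * step}"
        filters.append(clause)
    return filters

for f in timestamp_filters("page_view", 2_659_247, 1_705_381_200_152_846):
    print(f)
```

With the page_view numbers this yields three filters whose boundaries are 1705410000152846 and 1705438800152846.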