-
Notifications
You must be signed in to change notification settings - Fork 116
DO-2075 Added fenix and desktop city seen initial table - DO NOT MERGE (DRAFT) #7974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/schema.yaml
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/metadata.yaml
Outdated
Show resolved
Hide resolved
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/schema.yaml
Outdated
Show resolved
Hide resolved
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/schema.yaml
Outdated
Show resolved
Hide resolved
sql/moz-fx-data-shared-prod/firefox_desktop_derived/clients_city_seen_v1/query.sql
Outdated
Show resolved
Hide resolved
clients_city_first_seen_firefox_desktop AS ( | ||
SELECT | ||
client_id, | ||
first_seen_date AS first_seen_geo_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(See question above) - first_seen_date and the city captured on that date vs. first city seen and on which date.
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/metadata.yaml
Outdated
Show resolved
Hide resolved
sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1/schema.yaml
Outdated
Show resolved
Hide resolved
…/schema.yaml Co-authored-by: Lucia <[email protected]>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
64349bd
to
dc75f81
Compare
This comment has been minimized.
This comment has been minimized.
ce4fe26
to
ff85557
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Integration report for "Added table name to query"
|
@@ -0,0 +1,230 @@ | |||
-- Query generated via sql_generators.clients_city_seen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: update to sql_generators.baseline_clients_city_seen.
friendly_name: Baseline Clients City Seen | ||
description: |- | ||
This table stores the first-seen and last-seen geo attributes for each client_id. | ||
The table was initialized from stable tables (with ~2 years of retention), so the initial dates reflect the earliest/latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add note where initialization code would no longer be relevant once city and subdivision fields in stable tables are nullified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should remove all the generated files in sql/
before merging because generated sql doesn't need to be checked in.
The is_init query with a 100% sample might be too expensive run in a single query and timeout. This is fine for the POC but something to prepare for later
WHERE | ||
client_info.client_id IS NOT NULL | ||
AND sample_id = 0 | ||
AND DATE(submission_timestamp) <= "2025-08-25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current date is probably more appropriate for the init so it gets everything
AND DATE(submission_timestamp) <= "2025-08-25" | |
AND DATE(submission_timestamp) <= CURRENT_DATE() |
depends_on: | ||
- task_id: copy_deduplicate_all | ||
dag_name: copy_deduplicate | ||
execution_delta: 1h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed because copy_deduplicate
is to populate the stable tables and this uses live. Also execution_delta
would be 3h in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
You should remove all the generated files in sql/ before merging
because generated sql doesn't need to be checked in.
I plan to remove all the generated files in sql/ before merging. I've added them here just as reference/sharing with DS for now.
The is_init query with a 100% sample might be too expensive run in > a single query and timeout. This is fine for the POC but something > to prepare for later
How would you recommend to run this to ensure it would not time out?
This isn't needed because copy_deduplicate is to populate the
stable tables and this uses live. Also execution_delta would be 3h
in this case
can you explain more on why execeution_delta would be 3h? Which DAG should this depend on if not copy_deduplicate to ensure the data is available in the live tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- One way to initialize the tables would be per sample id. You could write a script that just runs the query per sample id
- copy_deduplicate is scheduled at 1am and bqetl_clients_city_seen is scheduled at 4am so the execution delta is 3 hours
- There's no dag for the live tables so this wouldn't depend on anything. It would just assume data is complete after some time past midnight UTC. This is what copy_deduplicate does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like if I add @sample_id then the table would be initialized in parallel per sample_id:
bigquery-etl/bigquery_etl/cli/query.py
Line 1460 in 59139fd
To run in parallel per sample_id, include a @sample_id parameter in the query. |
@@ -2490,3 +2490,20 @@ bqetl_bigeye_derived: | |||
retry_delay: 30m | |||
tags: | |||
- impact/tier_3 | |||
|
|||
bqetl_clients_city_seen: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: change to bqetl_baseline_clients_city_seen
table_type: client_level | ||
dag: bqetl_clients_city_seen | ||
scheduling: | ||
dag_name: bqetl_clients_city_seen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: change to bqetl_baseline_clients_city_seen
@@ -573,3 +573,5 @@ retention_exclusion_list: | |||
- sql/moz-fx-data-shared-prod/firefox_ios_derived/client_adclicks_history_v1 | |||
- sql/moz-fx-data-shared-prod/acoustic_external/suppression_list_v1 | |||
- sql/moz-fx-data-shared-prod/telemetry_derived/fx_accounts_linked_clients_v1 | |||
- sql/moz-fx-data-shared-prod/firefox_desktop_derived/clients_city_seen_v1 | |||
- sql/moz-fx-data-shared-prod/fenix_derived/clients_city_seen_v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: change to baseline_clients_city_seen_v1
Description
This PR is created as a draft for now and not ready for review. It is set up for initializing the city seen tables, pulling first and last city and subdivision fields from stable tables. The initialization code will not be applicable once the city and subdivision fields are nulled out in the stable tables.
Related Tickets & Documents
Reviewer, please follow this checklist