Xyninja/pipelines 5 (#2182)

xyn1nja · web-flow · commit de2876dda6d2 · 2025-10-24T16:52:12.000+08:00
* Added a new page on common SQL queries

* Added a link to common SQL queries page

* Updated titles and descriptions for common SQL queries

* Removed expanding `properties` JSON query
diff --git a/pages/docs/data-pipelines.mdx b/pages/docs/data-pipelines.mdx
@@ -109,7 +109,7 @@ Discrepancies between the event counts in Mixpanel and those exported to your de
 
 ### What timezone is the data exported in?
 
-The data is exported in UTC timezone. You’ll need to convert it to your project’s timezone when running queries in your warehouse.
+The data is exported in UTC. You’ll need to convert it to your project’s timezone when running queries in your warehouse. See [Common SQL Queries](/docs/data-pipelines/common-sql-queries) for examples, including timezone conversion.
 
 ### How can I count events exported by Mixpanel in the warehouse?
 
diff --git a/pages/docs/data-pipelines/_meta.ts b/pages/docs/data-pipelines/_meta.ts
@@ -1,5 +1,6 @@
 export default {
   "json-pipelines": "Json Pipelines",
+  "common-sql-queries": "Common SQL Queries",
   "integrations": "Integrations",
   "old-pipelines": "Older Version"
 }
diff --git a/pages/docs/data-pipelines/common-sql-queries.mdx b/pages/docs/data-pipelines/common-sql-queries.mdx
@@ -0,0 +1,88 @@
+# Common SQL Queries
+
+### Total daily count with deduplication logic and timezone adjustment 
+
+Events exported via pipelines (i.e., raw exports) can contain duplicates. Deduplicate on four event properties: `event_name`, `time`, `distinct_id`, and `insert_id` (see the [event deduplication docs](https://developer.mixpanel.com/reference/event-deduplication)). The following query returns a total daily event count, converted to a specific timezone and deduplicated.
+
+```sql
+SELECT
+  DATE(time, 'America/Los_Angeles') AS event_date,
+  COUNT(DISTINCT CONCAT(event_name, time, distinct_id, insert_id)) AS event_count
+FROM
+  `<your dataset>.mp_master_event`
+WHERE
+  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
+  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
+GROUP BY
+  1
+ORDER BY
+  1 ASC
+```
+
+### Unique user count with user ID resolution
+
+Raw events may contain the original `distinct_id` associated with the user at the time of the event instead of the final canonical `distinct_id` assigned to the user after authentication. The `mp_identity_mappings_data_view` maps each original `distinct_id` to its resolved (i.e., canonical) `distinct_id`. Joining against this mapping makes unique-user calculations account for ID management, so they are more accurate.
+
+```sql
+SELECT
+  DATE(time, 'America/Los_Angeles') AS event_date,
+  COUNT(DISTINCT resolved_user_id) AS unique_users
+FROM (
+  SELECT
+    time,
+    IFNULL(id_mappings.resolved_distinct_id, events.distinct_id) AS resolved_user_id
+  FROM
+    `<your dataset>.mp_master_event` AS events
+  LEFT JOIN
+    `<your dataset>.mp_identity_mappings_data_view` AS id_mappings
+  ON
+    events.distinct_id = id_mappings.distinct_id
+  WHERE
+    DATE(time, 'America/Los_Angeles') >= '2025-08-01'
+    AND DATE(time, 'America/Los_Angeles') < '2025-09-16' )
+GROUP BY
+  1
+ORDER BY
+  1 ASC
+```
+
+### Top 20 events by volume
+
+```sql
+SELECT
+  event_name,
+  COUNT(*) AS event_count
+FROM
+  `<your dataset>.mp_master_event`
+WHERE
+  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
+  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
+GROUP BY
+  1
+ORDER BY
+  2 DESC
+LIMIT
+  20
+```
+
+### Querying duplicate events 
+
+Raw exported events can contain duplicates. Four event properties identify a duplicate: `event_name`, `time`, `distinct_id`, and `insert_id` (see the [event deduplication docs](https://developer.mixpanel.com/reference/event-deduplication)). The following query returns every row in your raw data that shares all four values with at least one other row.
+
+```sql
+SELECT
+  *,
+  COUNT(*) OVER (PARTITION BY event_name, time, distinct_id, insert_id) AS dup_group_size
+FROM
+  `<your dataset>.mp_master_event`
+WHERE
+  DATE(time, 'America/Los_Angeles') >= '2025-08-01'
+  AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
+QUALIFY
+  dup_group_size > 1
+ORDER BY
+  DATE(time, 'America/Los_Angeles'),
+  event_name,
+  time
+```
+
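The daily count query in the new page deduplicates with `CONCAT`, which in BigQuery returns `NULL` when any argument is `NULL`; `COUNT(DISTINCT ...)` then skips those rows. If any of the four deduplication columns can be `NULL` in your export, a possible alternative is to deduplicate with `SELECT DISTINCT` in a subquery, which treats `NULL` values as equal and also avoids collisions from concatenating values without a delimiter. The sketch below assumes the same BigQuery table, column names, timezone, and date range as the queries in the diff; it is an illustrative variant rather than the query added in this change.

```sql
-- Daily event count, deduplicated on the four properties without CONCAT.
-- Assumes the same table and columns as the queries in the new page.
SELECT
  event_date,
  COUNT(*) AS event_count
FROM (
  -- SELECT DISTINCT keeps one row per unique combination of the listed columns.
  SELECT DISTINCT
    DATE(time, 'America/Los_Angeles') AS event_date,
    event_name,
    time,
    distinct_id,
    insert_id
  FROM
    `<your dataset>.mp_master_event`
  WHERE
    DATE(time, 'America/Los_Angeles') >= '2025-08-01'
    AND DATE(time, 'America/Los_Angeles') < '2025-09-16'
)
GROUP BY
  1
ORDER BY
  1 ASC
```

When none of the four columns is ever `NULL` and no two distinct events concatenate to the same string, both versions should return the same counts.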
