Skip to main content
Solved

Deduplicated View in BigQuery Destination

  • 17 July 2023
  • 4 replies
  • 243 views

Dear Amplitude Community,


I have set up a successful Amplitude to BigQuery Batch Pipeline (using Destinations). The data gets successfully uploaded.

I would like to resolve duplicate events in my database. According to https://www.docs.developers.amplitude.com/data/destinations/bigquery/, a deduplicated view should be created in the bigquery database.

However, in my database, only ‘EVENTS_12345’ table gets created. I do not see any ‘DEDUPLICATED_EVENTS_12345’.

 

Could you please help?
Thanks!

4 replies

Userlevel 5
Badge +9

@aela welcome and thanks for writing in! In Amplitude, we do provide a deduplicated view of the data. This is done by creating a table function within the provided dataset named deduplicated_<appid>. This function helps to manage potential duplicate events, especially if a backfill exists for a set of already-exported data.

Here's an example of how to use the deduplicate table function:

SELECT * FROM `testProject.testDataset.deduplicated_EVENTS_12345`('2022-01-01', '2022-02-10');

Please replace testProject.testDataset.deduplicated_EVENTS_12345 with your actual project and dataset name, and adjust the dates accordingly.

If you're not seeing this deduplicated view, it might be worth checking your setup and ensuring that the data sync is working correctly. 

Thanks @Esther Trapadoux for your help. Indeed, the normal view is working without any issues. However,, I do not see the deduplicated view and I have followed the Amplitude documentation for the BigQuery Destination setup. 
 

Hi @aela!

I have the same issue now. Please let me know if you managed to find out the cause of the problem? Сould you please share the solution?

Userlevel 3
Badge +5

Hi @aela @Mariia Kyselova sorry for the late response!

To resolve this issue, you can use the following statement to create the table function.
 
* Please replace APP ID with your project ID.

CREATE OR REPLACE TABLE FUNCTION `amplitude_bq_ingestion`.deduplicated_EVENTS_APP ID(start_date DATE, end_date DATE)
AS
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER (
PARTITION BY uuid
) rn
FROM `amplitude_bq_ingestion`.`EVENTS_APP ID`
WHERE DATE(event_time) >= start_date and DATE(event_time) <= end_date) t
WHERE rn = 1;

 I hope this helps. Please let me know if you have any questions.

Best regards,

Thao

Reply