Hi,

I have set up recurring syncs of our Amplitude event data to BigQuery through the Amplitude UI. It all works really well, except that I found a lot of duplicated events (identical event_id with identical amplitude_id). Here is an example of a user with an event with id 623, for which there are two rows (!):

amplitude_id | event_id | event_time | device_brand | device_manufacturer | device_model | is_attribution_event | processed_time | user_creation_time
83043144830 | 623 | 2023-04-20 02:54:16.468000 UTC | samsung | samsung | SM-G935F | FALSE | (empty) | 2019-02-12 22:50:15.990000 UTC
83043144830 | 623 | 2023-04-20 02:54:16.468000 UTC | (empty) | (empty) | (empty) | (empty) | 2023-05-02 21:26:50.277000 UTC | (empty)


The two rows are identical except for device_brand, device_manufacturer, device_model, is_attribution_event, processed_time and user_creation_time, which are populated in one row and empty in the other.

It is of course trivial to remove such duplicate rows, but I want to understand why we have, and are still getting, this duplication (and why the fields above sometimes have values and sometimes not).

We haven’t backfilled any data, as far as I understand. Are we missing some vital step here that we need to do in order to correctly ingest this data?

Any help or pointers would be greatly appreciated, as this is quite urgent for us.

The goal is simply that we have a dataset where each row is a unique event performed by an amplitude_id and that we have as much information about this event as possible.
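
To make that goal concrete, here is a minimal Python sketch of the merge I have in mind. It assumes the rows are available as dictionaries and that empty fields arrive as None or empty strings; both are assumptions for illustration, and merge_duplicate_events is a hypothetical helper name, not part of Amplitude or BigQuery. It only shows the desired end result and does not explain why the duplicates appear.

```python
def merge_duplicate_events(rows, key_fields=("amplitude_id", "event_id")):
    """Collapse rows sharing the same key into one row per event.

    For each column, the first non-empty value seen wins.
    Empty means None or "" (an assumption about how nulls are exported).
    """
    merged = {}  # key tuple -> merged row, in first-seen order
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        target = merged.setdefault(key, {})
        for column, value in row.items():
            current_is_empty = target.get(column) in (None, "")
            if current_is_empty and value not in (None, ""):
                # Fill a gap with the first real value we encounter.
                target[column] = value
            else:
                # Keep any existing value; otherwise record the column as-is.
                target.setdefault(column, value)
    return list(merged.values())


# The two rows from the example above, reduced to a few columns.
example_rows = [
    {"amplitude_id": 83043144830, "event_id": 623,
     "device_brand": "samsung", "processed_time": None},
    {"amplitude_id": 83043144830, "event_id": 623,
     "device_brand": None, "processed_time": "2023-05-02 21:26:50.277000 UTC"},
]
unique_events = merge_duplicate_events(example_rows)
```

With the two example rows, this yields a single row that carries both the device fields and the processed_time.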

Best regards,
Nils
 
