Duplicated events in BigQuery

Hi,

I have set up recurring syncs of our Amplitude event data to BigQuery through the Amplitude UI. It all works really well, except that I found a lot of duplicated events (identical event_id with identical amplitude_id). Here is an example of a user with an event with id 623, for which there are two rows (!):

amplitude_id	event_id	event_time	device_brand	device_manufacturer	device_model	is_attribution_event	processed_time	user_creation_time
83043144830	623	2023-04-20 02:54:16.468000 UTC	samsung	samsung	SM-G935F	FALSE		2019-02-12 22:50:15.990000 UTC
83043144830	623	2023-04-20 02:54:16.468000 UTC					2023-05-02 21:26:50.277000 UTC

The two rows are identical except for these fields: device_brand, device_manufacturer, device_model, is_attribution_event, processed_time and user_creation_time.

It is of course trivial to remove such duplicate rows, but I want to understand why we have been getting this duplication and still are (and why the fields above sometimes have values and sometimes not).

We haven’t backfilled any data, as far as I understand it. Are we missing some vital step here that we need to do in order to correctly ingest this data?

Any help or pointers would be greatly appreciated, as this is quite urgent for us.

The goal is simply that we have a dataset where each row is a unique event performed by an amplitude_id and that we have as much information about this event as possible.
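
For reference, here is roughly the cleanup I have in mind, as a minimal Python sketch and not anything Amplitude recommends. It collapses rows that share (amplitude_id, event_id) and keeps the first non-empty value seen for each column. The function name merge_duplicates is made up for illustration, and I am assuming that the lone timestamp in the second example row belongs to processed_time.

```python
# Hypothetical helper: merge rows that share the same (amplitude_id, event_id).
KEY = ("amplitude_id", "event_id")
EMPTY = (None, "")


def merge_duplicates(rows):
    """Return one dict per key, filling each column with the first non-empty value."""
    merged = {}
    for row in rows:
        key = tuple(row[k] for k in KEY)
        target = merged.setdefault(key, {})
        for col, val in row.items():
            if target.get(col) in EMPTY and val not in EMPTY:
                target[col] = val          # fill a gap with a real value
            else:
                target.setdefault(col, val)  # keep what we have, or record the empty
    return list(merged.values())


# The two example rows from above (column placement of row 2 is my assumption).
rows = [
    {"amplitude_id": 83043144830, "event_id": 623,
     "event_time": "2023-04-20 02:54:16.468000 UTC",
     "device_brand": "samsung", "device_manufacturer": "samsung",
     "device_model": "SM-G935F", "is_attribution_event": "FALSE",
     "processed_time": "",
     "user_creation_time": "2019-02-12 22:50:15.990000 UTC"},
    {"amplitude_id": 83043144830, "event_id": 623,
     "event_time": "2023-04-20 02:54:16.468000 UTC",
     "device_brand": "", "device_manufacturer": "",
     "device_model": "", "is_attribution_event": "",
     "processed_time": "2023-05-02 21:26:50.277000 UTC",
     "user_creation_time": ""},
]

unique = merge_duplicates(rows)
```

This gives me a single row for event 623 with both the device fields and the processed_time filled in, but it only treats the symptom. I would still like to know where the second row comes from.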

Best regards,
Nils
