Hi,
I have set up recurring syncs of our Amplitude event data to BigQuery through the Amplitude UI. It all works really well, except that I found a lot of duplicated events (identical event_id with identical amplitude_id). Here is an example of a user with an event with id 623, for which there are two rows (!):
amplitude_id | event_id | event_time | device_brand | device_manufacturer | device_model | is_attribution_event | processed_time | user_creation_time |
83043144830 | 623 | 2023-04-20 02:54:16.468000 UTC | samsung | samsung | SM-G935F | FALSE | 2019-02-12 22:50:15.990000 UTC | |
83043144830 | 623 | 2023-04-20 02:54:16.468000 UTC | 2023-05-02 21:26:50.277000 UTC |
The two rows are identical except for device_brand, device_manufacturer, device_model, is_attribution_event, processed_time and user_creation_time.
It is of course trivial to remove such duplicate rows, but I want to understand why we have been, and still are, getting this duplication (and why the fields above sometimes have values and sometimes do not).
As far as I understand, we haven’t backfilled any data. Are we missing some vital step that we need to do in order to ingest this data correctly?
Any help or pointers would be greatly appreciated, as this is quite urgent for us.
The goal is simply that we have a dataset where each row is a unique event performed by an amplitude_id and that we have as much information about this event as possible.
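To make that goal concrete, here is a minimal Python sketch of the merge I have in mind, not something Amplitude or BigQuery provides. The function name merge_duplicate_events and the sample rows are purely illustrative, and it assumes that when two duplicates disagree on a non-empty value, keeping the first one seen is acceptable:

```python
def merge_duplicate_events(rows):
    """Collapse rows that share (amplitude_id, event_id) into a single row.

    For each column, the first non-empty value seen is kept (assumption:
    duplicates never carry conflicting non-empty values that matter).
    """
    merged = {}
    for row in rows:
        key = (row["amplitude_id"], row["event_id"])
        target = merged.setdefault(key, {})
        for column, value in row.items():
            current_is_empty = target.get(column) in (None, "")
            if current_is_empty and value not in (None, ""):
                # Fill a gap with the value from this duplicate.
                target[column] = value
            else:
                # Keep what we already have; just make sure the column exists.
                target.setdefault(column, value)
    return list(merged.values())


# Illustrative rows only: one duplicate has the device fields, the other does not.
sample_rows = [
    {"amplitude_id": 83043144830, "event_id": 623,
     "device_brand": "samsung", "device_model": "SM-G935F"},
    {"amplitude_id": 83043144830, "event_id": 623,
     "device_brand": None, "device_model": None},
]

deduplicated = merge_duplicate_events(sample_rows)
print(deduplicated)
```

This only describes the end result we want; it does not explain why the duplicates appear in the first place, which is what I am really asking about.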
Best regards,
Nils