Solved

Export delayed events to Redshift

  • 7 September 2022
  • 6 replies
  • 168 views

Hello,

We have just started with data exports and we have a concern about delayed events.

The order_key in the table is event_time. As I read in the docs, this is client_event_time adjusted by the difference between server_received_time and client_upload_time, but I cannot understand how it handles the case where the export needs to load delayed data into Redshift after newer data has already been loaded.

We are worried that using event_time could lead to a loss of data in Redshift if there is a conflict between a new event that has just been received and an old event that still needs to be delivered, which Redshift would reject because of the order_key. A practical example would help us understand the process.

We would like to have an upload_timestamp as the order_key: a new column with a timestamp set when the event is loaded into Redshift rather than into Amplitude. We tried to add it to the table, but the export failed because of the changed table structure. However, if we can be sure that event_time handles this correctly, this column would not be necessary.

We have also been thinking about using the S3 export as a more flexible way to export the events.

 

Thank you.


Best answer by Denis Holmes 8 September 2022, 13:15


6 replies


Hi @Imploxion ,

 

You are free to use the S3 export here if you wish: https://www.docs.developers.amplitude.com/data/destinations/amazon-s3/#run-a-manual-export

Most people who use the Redshift export do not have an issue with the times of their events. However, I do see and understand your case, and if you do want to add that field to your data table, I think Snowflake would be a better option.

If the source table in the data warehouse had a timestamp field representing when the event arrived in the data warehouse, you could use this as the timestamp_column.

    • This would (in most cases) eliminate the missing data problem created by the delay between the original timestamp column value and the time the record is inserted into the data warehouse.

    • It guarantees (in most cases) the data will be successfully unloaded in some interval.

  • The insertion timestamp must be in UTC. For Snowflake, you must also convert the UTC timestamp to an NTZ timestamp (no time zone metadata) if it isn’t one already.

Creating an Insertion Timestamp Column in Snowflake

  • To add a snowflake_insertion_time column to a Snowflake table, you need to specify a column with a default value.

  • The Snowflake function that returns the current timestamp in UTC is SYSDATE(). You will use the SYSDATE() function as the expression in the column’s DEFAULT clause.

    • Note: SYSDATE() already returns an NTZ timestamp, which is required.

  • Unfortunately, today you cannot use ALTER TABLE to add a column that defaults to a non-deterministic expression. This means that the snowflake_insertion_time column has to be added to the table when it is created.

  • In your CREATE TABLE command, specify a column snowflake_insertion_time timestamp default SYSDATE() (a minimal sketch follows after this list).

    • Then, when a row is inserted into the table and no value is provided for the snowflake_insertion_time column, Snowflake will set the value to the output of SYSDATE().
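
A minimal sketch of such a CREATE TABLE, using a hypothetical imported_events table (the other columns here are placeholders, not your actual schema):

    CREATE TABLE imported_events (
        event_id                  VARCHAR,
        event_type                VARCHAR,
        event_time                TIMESTAMP_NTZ,
        -- Filled in by Snowflake at insert time; SYSDATE() returns the current UTC time as an NTZ timestamp
        snowflake_insertion_time  TIMESTAMP_NTZ DEFAULT SYSDATE()
    );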


If you do add an insertion timestamp, it is advised to create a new source (and possibly a new project, depending on how new it is). 

  • Then create a new project.

  • In the new project, create a new source with the same SQL as before, except that you also select the new insertion timestamp column you created in the source table and specify that column as the timestamp column (see the sketch after this list).

  • Test and enable the source. All existing data in the table should be imported in the first ingestion run of the source, and any data loaded after should be imported in successive job runs.
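
A rough sketch of what that source SQL might look like, reusing the placeholder imported_events table from the earlier sketch (your real query will of course select whatever event columns your import needs):

    SELECT
        event_id,
        event_type,
        event_time,
        snowflake_insertion_time   -- configure this column as the timestamp column for the source
    FROM imported_events;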

I hope this helps! 

 

Kind Regards,
Denis

Hello @Denis Holmes and thanks for your answer.

Actually, moving to Snowflake is not an option. I tried the same in Redshift and the export failed, but I don’t know exactly whether it was because of the COPY command used to send the data.

Wouldn’t there be any option to do the same, or something similar, with Redshift? Maybe I didn’t add this column correctly and we should load all the data with this column from the beginning (editing the DDL before starting to send data).

Thank you.


Hello @Imploxion ,

 

Thanks for the reply! With Redshift, I do not believe you are allowed to edit the export schema in any way, and I don’t believe we allow exceptions either. The main issue is that you have Redshift and this is a data source, correct? And sometimes not all the data will be available at the source at once. You want to have a column in Redshift that determines the time/order of the events? Please correct me if I am misunderstanding; I will then ask my team internally to clarify whether anything can be done.

Kind Regards,
Denis

 

Hi @Denis Holmes and thank you.

Yes, we are using Amplitude as a data source to get the event data into our Redshift cluster, and we are using your Redshift export tool. We have investigated this destination and also S3, because with S3 we can explore the possibility of a column with the insertion timestamp in Redshift. It would be a column in Redshift that indicates the time at which a given row is written to Redshift, in addition to the client_event_time and event_time that every event has.

We have also tried adding this column to the DDL of the Amplitude table in Redshift, but this made the export fail. It would be great if you could talk with your team about this.
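
To illustrate what we have in mind for the S3 route, a rough sketch with placeholder table names, bucket path and IAM role (not something we have running yet):

    -- Staging table matching the exported files
    CREATE TABLE amplitude_events_staging (
        event_id           VARCHAR(64),
        event_type         VARCHAR(256),
        client_event_time  TIMESTAMP,
        event_time         TIMESTAMP
    );

    -- Final table with the extra insertion-time column
    CREATE TABLE amplitude_events (
        event_id                  VARCHAR(64),
        event_type                VARCHAR(256),
        client_event_time         TIMESTAMP,
        event_time                TIMESTAMP,
        redshift_insertion_time   TIMESTAMP
    );

    -- Load the gzipped JSON export from S3 into the staging table
    COPY amplitude_events_staging
    FROM 's3://our-bucket/amplitude-export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/our-copy-role'
    JSON 'auto' GZIP;

    -- Stamp each row with the time it is written into Redshift
    INSERT INTO amplitude_events
    SELECT event_id, event_type, client_event_time, event_time, GETDATE()
    FROM amplitude_events_staging;

    TRUNCATE amplitude_events_staging;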

One thing that would help us is an example of the case where the export sends delayed data, because using event_time as the order key could cause it to get lost. The event_time description is:

Amplitude timestamp (UTC) which is the client_event_time adjusted by the difference between server_received_time and client_upload_time, specifically: event_time = client_event_time + (server_received_time - client_upload_time). Amplitude uses this timestamp to organize events on Amplitude charts. NOTE: If the difference between server_received_time and client_upload_time is less than 60 seconds, the event_time will not be adjusted and will equal the client_event_time.
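
So we think we understand the arithmetic itself. For example (made-up numbers on our side): if client_event_time is 10:00:00, client_upload_time is 10:00:05 and server_received_time is 10:02:05, the difference is 120 seconds, so event_time = 10:00:00 + 120 s = 10:02:00; if that difference were under 60 seconds, event_time would simply stay at 10:00:00.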

But this doesn’t clarify much for us about how event_time works as the order key.

Thank you.


Hi @Imploxion ,

 

Let me ask the team and I will get back to you :)

Kind Regards,
Denis


Hi @Imploxion ,

As confirmed by the engineers, you can refer to the server_upload_time column; it is the value we use for determining the export (we look back several hours to see which data are ready for export based on server_upload_time), and all of the data are organized based on server_upload_time within Amplitude. There shouldn’t be any issues with missing data. I hope that helps!
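
If you want to double-check this on your side, a quick query along these lines (assuming your export lands in a table called amplitude_events; adjust the name to your schema) will show how far behind event_time your events actually arrive:

    SELECT event_type,
           MAX(DATEDIFF(minute, event_time, server_upload_time)) AS max_delay_minutes
    FROM amplitude_events
    GROUP BY event_type
    ORDER BY max_delay_minutes DESC;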

 

Kind Regards,
Denis
