When it comes to Time Travel, let’s face it. We all have a lot of questions. If I go back in time and kill my father, do I kill myself, too? Or does it spawn a new timeline? Can I meet dinosaurs and if I step on a bug, will it end life as we know it? If I meet teenage me, what should I say? Should I say anything?
And maybe the biggest question of all:
If I want to see the oldest data in my Delta table, how far back can I go?
For this question, the answer is, of course, it depends.
Microsoft's documentation provides some answers here. The Table utility commands page for Azure Databricks offers an explanation, and the answers apply to Delta tables in general since Delta is a standard format, though Databricks may sometimes be ahead of other platforms on the version of Delta it uses. We can also find answers in the Delta.io documentation article Table batch reads and writes.
You Have to Have Both the Data Files and the Log Files to Time Travel
This is the most important caveat. You can travel only as far back as you still have both the data files and the log files for a version. If either is missing, there is no time travel to that version of the data; it is gone. The good news is that you have to get rid of the data yourself for it to be gone; it won't just disappear after a certain period of time.
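For context, here is a minimal sketch of what time traveling to an older version looks like, assuming a Delta table at a hypothetical path /path/to/table:

# Read the table as of a specific version number
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/table")

# Or read it as of a timestamp
df_old = (spark.read.format("delta")
          .option("timestampAsOf", "2023-01-01")
          .load("/path/to/table"))

Either read only works if the data files and log files for that version still exist.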
Data Files are Never Deleted Automatically – They Must be Vacuumed
The Good News: The vacuum command must be run to remove data files no longer referenced by a Delta table.
This command will not run automatically, so it has to be something you implement in your solution somewhere. Hence, if you don't want a table vacuumed, don't vacuum it. That is easier said than done, since support personnel may inadvertently run a vacuum, and there is no way to stop a vacuum from being run.
NOTE: the default retention for vacuum is 7 days.
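As a quick illustration (the table path is hypothetical), a vacuum might look like this; the DRY RUN variant only lists the files that would be deleted, without actually removing them:

# Preview which unreferenced data files would be removed (nothing is deleted)
spark.sql("VACUUM delta.`/path/to/table` DRY RUN")

# Actually remove data files older than the retention threshold (default 7 days)
spark.sql("VACUUM delta.`/path/to/table`")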
There are a couple of configurations you can set to protect your history in the event that a vacuum command is run.
delta.deletedFileRetentionDuration – How Long Files are Kept
delta.deletedFileRetentionDuration = "interval <interval>". The default is "interval 7 days".
The delta.deletedFileRetentionDuration property can be set, for example, to "interval 30 days". That means even if someone runs a vacuum command, it will not remove files that were deleted within the last 30 days. In other words, a file that was deleted 29 days ago can't be vacuumed yet.
Obviously, the more files you have in storage, the more expensive it will be. As with anything, it’s a balancing act to determine what your needs are, and will largely depend on use cases for each organization.
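To see what a table is currently configured with, you can inspect its metadata; this is a quick sketch, assuming the same hypothetical path, where the "properties" column shows any retention overrides you have set:

# Inspect the table's detail; "properties" contains settings like delta.deletedFileRetentionDuration
spark.sql("DESCRIBE DETAIL delta.`/path/to/table`").select("properties").show(truncate=False)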
Log Files are Deleted Automatically After Checkpoints
For more information on how the delta log works, check out this article by Databricks, Diving Into Delta Lake: Unpacking the Transaction Log.
Checkpoint?!
“Don’t act so surprised, your highness.”
A checkpoint is essentially a Parquet snapshot of the table's full state, written to the transaction log every ten commits by default. This enables Spark to work with new data more easily, since it can skip replaying everything before the latest checkpoint and get to the new commits more quickly.
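You can see this in the table's _delta_log folder: every commit is a JSON file, and roughly every tenth commit a Parquet checkpoint appears alongside them. A rough sketch of listing it, assuming a Databricks environment and a hypothetical path:

# List the transaction log; JSON files are individual commits,
# *.checkpoint.parquet files are the periodic full snapshots
for f in dbutils.fs.ls("/path/to/table/_delta_log"):
    print(f.name)

# Typical output mixes files like 00000000000000000009.json
# with files like 00000000000000000010.checkpoint.parquet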
delta.logRetentionDuration – How Long Log History is Kept
delta.logRetentionDuration = "interval <interval>". This controls how long the history for a table is kept. The default is "interval 30 days".
Keeping a multitude of log files will not cause contention with reading or writing data files, since log cleanup happens in parallel with those operations. Your biggest expense here, again, is storage.
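If you want to see how much history a table currently has available for time travel, DESCRIBE HISTORY will show it; a minimal sketch, again with a hypothetical path:

# Show the commit history that is still available for time travel
spark.sql("DESCRIBE HISTORY delta.`/path/to/table`").select("version", "timestamp", "operation").show()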
How to Configure Tables to Keep a Long History for Time Travel
First, do this when the tables are not being written to; schedule some downtime. Setting Delta table properties is itself a write operation, and it will conflict with other write operations.
Next, decide whether you want to use Python or SQL to do this, and then pick a very large number.
Finally, run the following commands as needed. They do the same thing whichever way you run them. The nice part of using the Python version, of course, is that it makes it easier to iterate through an entire list of tables to get this done quickly, as shown in the sketch after the code examples.
SQL Version
alter table delta.`/path/to/table` set TBLPROPERTIES ('delta.logRetentionDuration'='interval 999999 days', 'delta.deletedFileRetentionDuration'='interval 999999 days')
Python Version
spark.sql(f"""alter table delta.`/path/to/table` set TBLPROPERTIES ("
"'delta.logRetentionDuration'='interval 999999 days', "
"'delta.deletedFileRetentionDuration'='interval 999999 days');""")