A Quick PartitionBy Switch for Delta Files

Sometimes you need to make a change to how things are partitioned. And that’s okay! You should have the flexibility to change what your delta is partitioned by just like you can change how a table is indexed.

So, here’s the code you need!

A couple of notes on why we are overwriting to the same save_path:

  • Overwriting the delta file in place keeps its transaction log intact; if you deleted the directory and started over, you would lose that history along with your data
  • Because the transaction log survives, you can still restore the table to a previous version if needed
  • Deleting a directory of data also takes much longer than simply overwriting it
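The restore mentioned above works through Delta's time travel. As a minimal sketch: the helper names below (history_sql, restore_sql) and the version number are my own illustrations, not from the post, and the actual spark.sql calls are left commented so you can run them inside your own Spark session.

```python
def history_sql(path: str) -> str:
    # Lists the table's versions so you can pick one to restore
    return f"DESCRIBE HISTORY delta.`{path}`"

def restore_sql(path: str, version: int) -> str:
    # Rolls the Delta table back to an earlier version
    return f"RESTORE TABLE delta.`{path}` TO VERSION AS OF {version}"

# In a Spark session (version 1 is just an example):
# display(spark.sql(history_sql(save_path)))
# spark.sql(restore_sql(save_path, 1))
```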

A couple of notes on the code itself:

  • save_path is a variable that holds a path to your delta file
  • Note that in the displays, we are using save_path inside an f-string so that the path can be wrapped in the backticks ` that the delta.`path` syntax requires
  • We are using DESCRIBE EXTENDED to show the partitioning in place – it runs once after the path is read into a dataframe and again after the table is written, so you can confirm the change
  • In the write, it is important to have the mode("overwrite") option. The overwrite replaces the entire table, but since you just loaded all of your data into the dataframe, nothing is lost – and the full rewrite is what lets the new partitioning take effect
  • In the write, it is also important to have the option("overwriteSchema", "true") option. That setting allows Delta to replace the table's metadata – the portion of DESCRIBE EXTENDED that shows your current partitioning – with your new partitioning

save_path = "abfss://clnd@adls_account_name.dfs.core.windows.net/FolderName/DeltaName/"
# Load the existing table and show its current partitioning
df = spark.read.format("delta").load(save_path)
display(spark.sql(f"DESCRIBE EXTENDED delta.`{save_path}`"))
# Rewrite the table in place, partitioned by the new column
df.write.format("delta").partitionBy("PartitionColumn").mode("overwrite").option("overwriteSchema", "true").save(save_path)
# Confirm the new partitioning
display(spark.sql(f"DESCRIBE EXTENDED delta.`{save_path}`"))
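If you'd rather not scan the full DESCRIBE EXTENDED output, Delta's DESCRIBE DETAIL command returns the partition columns directly. A small sketch – the helper name is mine, and the commented line is what you would run in your Spark session:

```python
def detail_sql(path: str) -> str:
    # DESCRIBE DETAIL returns one row whose partitionColumns field
    # lists the table's current partition columns
    return f"DESCRIBE DETAIL delta.`{path}`"

# In a Spark session:
# spark.sql(detail_sql(save_path)).select("partitionColumns").show(truncate=False)
```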

A couple of notes on partitioning:

  • In Delta, there is a feature known as “Data Skipping” that, depending on the amount of data, may work better than partitioning
  • A delta shouldn’t be partitioned unless the total amount of data is at least 1 TB
  • After partitioning, each partition should hold a roughly similar amount of data; otherwise the data is skewed
  • Skewed data can take longer to process because the work is not spread evenly across the cluster’s nodes
  • Partitions should have about 100 GB of data each – if that’s not the case, you might want to review the fields you are using for partitioning
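Those sizing rules are easy to sanity-check with plain arithmetic. This sketch uses made-up partition sizes in GB and a simple max-to-average ratio as a skew signal – the helper and the example numbers are illustrative, not from any Spark API:

```python
def skew_ratio(partition_sizes_gb):
    """Ratio of the largest partition to the average; near 1.0 is balanced."""
    avg = sum(partition_sizes_gb) / len(partition_sizes_gb)
    return max(partition_sizes_gb) / avg

balanced = [95, 100, 105, 100]  # roughly the ~100 GB target per partition
skewed = [400, 20, 20, 20]      # one partition dominates the others

print(round(skew_ratio(balanced), 2))  # → 1.05
print(round(skew_ratio(skewed), 2))    # → 3.48
```

When the ratio drifts well above 1.0, that is the point at which to revisit the fields you are partitioning by.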
