Databricks: Fastest Way To Insert Data Into a Delta Table?

- 1 answer

I have a handful of tables, each only a few MB in size, that I want to capture as Delta tables. Inserting new data into them takes an extraordinarily long time (15+ minutes), which astonishes me.

The culprit, I am guessing, is that while the tables are very small, they each have over 300 columns.

I have tried the following methods, with the former being faster than the latter (unsurprisingly?): (1) INSERT INTO, (2) MERGE INTO.

Before inserting data into the Delta tables, I apply a handful of Spark functions to clean the data and then register the result as a temp table (e.g., INSERT INTO DELTA_TBL_OF_INTEREST (cols) SELECT * FROM tempTable).
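Roughly, the current flow looks like the sketch below (the source path, view name, and cleaning step are simplified placeholders standing in for my real pipeline):

from pyspark.sql import functions as F

# Placeholder source; the real tables are only a few MB each
raw_df = spark.read.format("parquet").load("/path/to/source")

# Stand-in for the handful of cleaning functions applied across the 300+ columns
cleaned_df = raw_df.select([F.trim(F.col(c)).alias(c) if t == "string" else F.col(c)
                            for c, t in raw_df.dtypes])

# Register the cleaned data as a temp view, then insert via SQL
cleaned_df.createOrReplaceTempView("tempTable")
spark.sql("INSERT INTO DELTA_TBL_OF_INTEREST SELECT * FROM tempTable")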

Any recommendations on speeding this process up for trivial data?

Answer

If you're already performing the data transformations with PySpark before loading the destination table, you don't need to drop down to the SQL level at all; you can just write the DataFrame in append mode.

If you're writing to a registered table:

df = ...  # transform source data
# Append the DataFrame directly to the managed Delta table
df.write.mode("append").format("delta").saveAsTable("table_name")

If you're writing to a file path:

df = ...  # transform source data
# Append the DataFrame to the Delta table stored at a specific path
df.write.mode("append").format("delta").save("path_to_delta")
source: stackoverflow.com