DataBricks: Fastest Way To Insert Data Into Delta Table?
I have a handful of tables, each only a few MB in file size, that I want to capture as Delta Tables. Inserting new data into them takes an extraordinarily long time, 15+ minutes, which astonishes me.
My guess is that the culprit is that, while the tables are very small, they have over 300 columns each.
I have tried the following methods, with the former being faster than the latter (unsurprisingly(?)): (1) INSERT INTO, (2)
Before inserting data into the Delta Tables, I apply a handful of Spark functions to clean the data and then register the result as a temp table, e.g.:

    INSERT INTO DELTA_TBL_OF_INTEREST (cols) SELECT * FROM tempTable
Any recommendations on speeding this process up for trivial data?
If you're performing data transformations using PySpark before putting the data into the destination table, then you don't need to drop down to the SQL level; you can write the data directly with the DataFrame API.
If you're using a registered table:

    df = ...  # transform source data
    df.write.mode("append").format("delta").saveAsTable("table_name")
If you're using a file path:

    df = ...  # transform source data
    df.write.mode("append").format("delta").save("path_to_delta")
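If the appends are still slow after switching to the DataFrame writer, one further thing to try on Databricks is Delta's auto-optimize table properties, which coalesce small files at write time. Whether they help for tables this small depends on the workload, so treat this as a sketch of something to test rather than a guaranteed fix (the table name is from the question; the properties are standard Databricks Delta settings):

    ALTER TABLE DELTA_TBL_OF_INTEREST SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact'   = 'true'
    );

After setting these, subsequent appends write fewer, larger files, which can reduce both write and later read overhead.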