Bulk Loading Data to Cloud Data Warehouses

What is Bulk Loading of data?

Bulk loading of data refers to the process by which huge volumes of data can be loaded into a database fast. Bulk loading allows data to be imported and exported from one destination to another, much faster than with usual data loading methods. Bulk loading on cloud platforms typically involves parallelization and the use of multiple nodes to speed up the process. Read about BryteFlow for bulk loading

Bulk loading loads data in big chunks

Bulk loading is a method to load data into a database in big chunks. Imagine loading details of purchase transactions in a certain period to your database. With usual methods you may enter one order at a time, but bulk loading will take files with similar information and load hundreds of thousands of records in a very short time. If you are taking data from one DBMS to another, you cannot expect to take across the data as it is to the new DB. You would need to dump the info to a file format that can be read and recognized by the new DB and then replicate the data to the new DB.

Bulk loading -why you need it and the difference from usual data loading

Loading data on indexed tables is usually optimized for inserting rows one at a time. When you are loading a large amount of data at one go, inserting rows one at a time will be time-consuming and inefficient. In fact, it will degrade the performance of the entire database. This calls for the need of bulk loading of data. Bulk loading or bulk insert of data relies on other more efficient processes of data loading.

Why is bulk loading faster than routine data loading?

Bulk loading operations are usually not logged and transactional integrity may not be ideal. Bulk loading operations often bypass triggers and integrity checks like constraints. Bypassing these time-consuming processes speeds up data loading performance to a great degree when you have large amounts of data to transfer.

Bulk loading is also referred to as Bulk Insert

Bulk loading is also called bulk insert. The term bulk insert was first used regarding SQL Server databases, but the principle is the same. Data specialists refer to ‘Oracle Bulk Insert’ for replicating large volumes of Oracle data or SQL Bulk Insert or SQL Server Bulk Insert in context to replication of huge volumes of SQL Server data.

All RDBMs and Data Warehouses have several approaches to load data and bulk data is uploaded through a variety of more efficient methods depending on the structure of the specific database.

Best practices to bulk insert data to popular Cloud Data Warehouses

We will elaborate on the best bulk insert approaches to follow for some popular destination Cloud Data Warehouses in the modern Big Data world i.e. Amazon Redshift, Snowflake, Azure Synapse SQL and Azure SQL DB. Each cloud data warehouse has its own unique technical architecture that we need to consider while bulk loading data.

Amazon Redshift Bulk Insert using COPY command

Bulk Insert using COPY command in Redshift

The most common method of bulk insert into Redshift, involves transferring the data from the on-premise system to cloud storage Amazon S3, and then using the COPY command to load data from S3 to Redshift. The COPY command makes optimal use of Amazon Redshift’s massively parallel processing (MPP) to read and load data in parallel to Redshift from files in an Amazon S3 bucket.

Split large files into multiple files for high performance bulk loading

Amazon Redshift is an MPP (massively parallel processing) data warehouse, where several compute nodes work in parallel to ingest the data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. The table below shows the different node types and the number of slices.

Node type Default Slices per node
Dense Compute DC2
dc2.large 2
dc2.8xlarge 16
Dense Storage DS2
ds2.xlarge 2
ds2.8xlarge 16
RA3 Nodes
ra3.4xlarge 4
ra3.16xlarge 16

Once you extract data into files, compress them and split a single file into multiple files (this should be a multiple of the number of slices in your Cluster). The files are loaded with the compute being distributed evenly across the slices on Redshift. The number of multiple files can be configured and set and depends on the Redshift node types. The COPY command in Redshift that ingests data is configured optimally for fast data loads to Redshift. The COPY command can load data from multiple files in parallel which makes for fast performance and leverages the MPP capability of Redshift. Create an AWS Data lake 10x faster

Automatic Creation of Tables, Default Dist Keys and Distribution Style

How data is distributed across the nodes in a Redshift cluster can make all the difference in increasing bulk load performance. The data distribution and query processing is handled by the Leader node while the other nodes contribute compute power. All the nodes process queries in parallel. The goal is to have the most even distribution of data across all slices in the nodes for fast performance. When data is loaded into a table, Redshift will distribute the rows across the node slices as per the table’s Distribution Style and Distribution Key. An even distribution of data enables Redshift to allocate the workload evenly to slices and maximizes the benefit of Massively Parallel Processing. This also helps for fast ingestion of the data.

About Redshift Distribution Keys

While creating a table you can assign any one among 4 distribution styles: AUTO, EVEN, KEY, or ALL. The AUTO style is default, if no distribution key is specified with the CREATE TABLE statement, AUTO distribution will be used by Redshift.

AUTO Distribution

Amazon Redshift uses the size of the table to assign the distribution style. It may begin with ALL but may change to EVEN as the table grows larger. This happens quite fast in the background. On setting DISTSTYLE to AUTO, Redshift may change the table data distribution to a KEY-based distribution style.

EVEN Distribution

The rows are distributed by the Leader node evenly in a round-robin style, irrespective of values in any specific column. EVEN distribution is suitable when a table does not have joins or when there isn’t a defined option between ALL distribution and KEY distribution.

KEY Distribution

In a KEY distribution, the distribution of rows happens according to the values of one column. The leader node puts values that match on the same node slice. In case there is distribution of a pair of tables on joining keys, the leader node will collocate rows on the slices as per the values of the joining columns. This ensures matching values from common columns are stored together physically.

ALL Distribution

The ALL Distribution style will distribute copies of the complete table to every node in the cluster as opposed to EVEN or KEY distribution which will place only a part of the table rows on each node. ALL distribution ensures every row is collocated for every join in the table. ALL distribution should only be used for tables that are not often changed or when changes are not extensive. ALL distribution demands a large amount of storage, multiplying the storage required by the number of nodes and it takes much more time to update, load and insert data into multiple tables.

Sort Keys used for optimum loads support efficient columnar storage

When a table is created, you can define one or more columns as Sort Keys. While loading data into the table, the rows are in the sorted sequence on the disk. Information is then provided to the query optimizer and this information is used to plan loading in an optimal manner, leveraging the way in which the data is sorted. The Sort Key specifies the order in which rows in a table are stored. The query performance is increased with an optimal use of Sort Keys since it allows the query optimizer to read a lesser quantity of data and filter out major bits. You can basically skip big chunks of data while processing queries leading to reduced processing time and faster performance.
The Amazon Redshift data warehouse stores data in a columnar manner which dramatically cuts down on the I/O on disks. The number of disk I/O requests is also reduced due to the columnar storage and a lesser amount of data is needed to be loaded into the memory to run a query. The query execution gets speeded up by the reduction in I/O and loading less data allows Redshift to perform in-memory processing to a greater degree, thus elevating performance. For efficient bulk loading using optimum Sort Keys is the best practice.

Fast Bulk Loading with Snowflake COPY INTO

Bulk Load with COPY INTO Snowflake command

The most practiced method for Bulk loading to Snowflake is with the COPY INTO Snowflake command. Bulk loading to Snowflake involves moving the data from the on-premise source to cloud storage and then loading the data using the COPY INTO Snowflake command. Before loading data, Snowflake does a check to see if the file has already been loaded. The simplest and most direct method of increasing loading performance is by restricting the COPY command to a specific directory. The code snippet shown displays a COPY INTO using a variety of options:

First approach: Slowest among all options as it scans the entire staging area

copy into final_table  from @Staging_data pattern=’.*[.]csv’;

Second approach: More flexible as it limits staging to a directory and involves a partial scan.

copy into final_table from @Staging_data/sales/transactions/2020/05  pattern=’.*[.]csv’;

Third approach: The fastest approach as a file name is specified, no scan done in the staging area.

copy into final_table from @Staging_data/final/transactions/2020/05/final_050.csv;

Splitting large files into smaller sized files for copying into Snowflake

The most efficient approach of loading data to Snowflake involves breaking up the single file into multiple smaller files. To optimize the number of parallel operations for a load, we recommend producing data files approx. 100-250 MB (or larger) in size (compressed). Loading very large files (e.g. 100 GB or larger) is not advisable. How to load data to Snowflake fast

Minimize Snowflake’s Processing Overheads for faster Bulk Loads

Reduce the processing overhead by aggregating smaller files. Split large files into several smaller files. This will help distribute the load between the servers of your active warehouse. The number of data files that can be processed in parallel depends on the number of servers in the warehouse and their capacities.  For example, if you need to load a 10GB file, create 100 x 100Mb files to leverage Snowflake’s automatic parallel execution and utilize the warehouse capacity to the utmost. It is preferable to split large files by line to prevent records that cover chunks. Note that the number of load operations running in parallel cannot be greater than the number of data files to be loaded. Build a Snowflake Data Lake or Data Warehouse

SQL Server Bulk Insert with BCP Utility

SQL bulk insert to MS SQL Server (all supported versions), Azure SQL Database, Azure SQL Managed Instance, Azure Synapse SQL and Parallel Data Warehouse

When you have to do a SQL Server bulk insert, SQL Server provides a number of approaches to import and export of large amounts of data in or out of SQL Server databases. Below are the most used and efficient methods to do so:

  • BCP Bulk Insert with BCP Utility
  • OPENROWSET (Bulk) Function
  • SQL Server Import/Export Wizard
  • BULK INSERT Statement
  • SELECT INTO Statement

In this article, we’ll go through the most efficient and high performant approach for Bulk SQL Insert.

Bulk SQL Insert with BCP utility

The BCP (Bulk Copy Program) utility is a default tool in SQL Server and can be used to do a Bulk SQL Insert between a SQL instance and data file via a special format file. BCP is a command line utility that can import a sizable number of rows into SQL Server or export SQL Server data to files. You should always create a format file to keep track of the data format. This is important because BCP files do not contain any schema or format information. If there are any glitches, you can refer to the format file to understand the error. The BCP bulk insert is considered the fastest of all bulk insert methods. It can migrate a fixed length file, or any delimited file using the BCP command.

SQL Server Bulk Insert with BCP: Points to consider

  • The way BCP input is logged is different from the usual INSERT statements. The facility used is the same as BULK INSERT and SqlBulkCopy classes.
  • The Bulk loading operation can instruct SQL Server to avoid traditional checks and balances on incoming data. It also logs less data than usual operations. All this combined, serves to make for faster bulk loading of data.
  • BCP is versatile and allows data loading from files to SQL tables or from SQL Queries to files with any delimiter options.
  • When importing data into a table, it is vital to use a format file created specifically for the table or to recognize the table structure and data types valid for the columns.
  • The BCP utility can make assumptions that can apply to every row of data, unlike a traditional SQL INSERT statement. It knows the internals well and is capable of mapping input data to those internals in a more direct manner. It can also avoid resource-intensive operations like logging, deferring indexes, parsing, transactions, optimization, isolation etc.

All in all, BCP is a simple but powerful tool for importing and exporting data between SQL Server tables and text files. It is easy to download and you can use it to export T-SQL queries to a file. BCP is very fast and used by BryteFlow for all extracts and loads to SQL Server sources and destinations, with all best practices in place. It’s one of the fastest solutions for all SQL server-based Extracts and Loads. Learn about SQL Server CDC

Bulk Insert to Azure SQL Synapse using PolyBase

PolyBase was a feature introduced in SQL Server 2016 but Microsoft enhanced it in 2019 allowing it to connect to more data sources than before. PolyBase can be used to connect to SQL Server instances and can be easily setup with a Wizard. It is a key feature in Azure Synapse and can be used to bulk insert data from flat files and relational and non-relational data sources such as other SQL Servers, Oracle, Teradata or MongoDB.

PolyBase provides efficient Bulk Inserts to Azure SQL

PolyBase works well for bulk inserts to Azure Synapse and you can bulk load data with high throughput. The large gain in throughput will be very apparent with PolyBase rather than the default BULKINSERT mechanism. PolyBase also enables import and export of data from Azure Data Lake Store and Azure Blob Storage to Azure Synapse Analytics.

If you’re using Azure Synapse Analytics as a source, you must choose a staging location for your PolyBase batch bulk load. PolyBase allows for batch loading in bulk instead of loading the data row-by-row. PolyBase drastically reduces the data loading time into Azure Synapse Analytics and parallelizes the process for large datasets.

Bulk Insert to Azure SQL with PolyBase: Points to Consider

If your data source is located in Azure Blob, Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2 and if the format is supported by PolyBase, you can use Copy activity to trigger PolyBase to enable Azure Synapse to extract the data from the source.
In case source data and format are not compatible with PolyBase you can use the Staged copy by using the PolyBase feature. The Staged copy feature automatically converts data into a PolyBase supported format, stores data in Azure Blob Storage and then loads data into Azure Synapse by calling PolyBase. The Staged copy feature also delivers higher throughput.

Bulk loading with PolyBase is an effective method to transfer data from Azure Storage to Azure Synapse but It involve going through some steps:

  • Creating Master Key for database
  • Creating Database Scoped Credential
  • Creating External Data Source
  • Creating External File Format
  • Creating External Table
  • Performing the load from External Table to Synapse Native Table.

PolyBase for bulk loading to Azure Synapse – file size, format and file types

  • PolyBase does best with Uncompressed Multiple Files than with Compressed Multiple Files and Uncompressed Single File and Compressed Single File.
  • For delimited uncompressed text files, PolyBase has the capability to do parallel processing at both individual file level as well as at the folder level, when there are multiple files in the folder. There is no real need to split the files but there may be a slight uptick in performance from splitting. It’s up to you to determine if the slight improvement in performance is worth the extra effort of splitting the files.
  • When splitting delimited uncompressed text files, the guidance is that as long as the file size is over 2 GB and count of file is not reaching high 100s( like 1000) it doesn’t really matter if the file size is 50 GB, 100 GB, etc.
  • If you have multiple files in a folder, create an external table at the folder level there is no need to do this at individual file level, PolyBase has parallelization automatically built-in.
  • Final step of PolyBase is to load from External Table to the Synapse Table, CTAS (CREATE TABLE AS SELECT) is observed to perform better than INSERT INTO.
  • PolyBase cannot parallelize processing of Compressed files like Gzip so if using a compressed file, it’s better to have multiple files for PolyBase to load.
  • PolyBase cannot parallelize processing of an individual Parquet file so having multiple Parquet files should deliver better performance.

Bulk Load to Azure SQL Synapse using COPY Command

Bulk loading to Azure Synapse using the COPY command offers a fast, flexible, and simple way for high throughput data ingestion for SQL workloads. The COPY command is the preferred loading utility as it allows for flexible and seamless loading of data with these advantages:

  • You can customize default values for target columns and specify data field sources to load into specified target columns
  • The COPY INTO command can do away with multiple steps in the data loading process and reduce the number of database objects required for the same.
  • With COPY INTO you do not require CONTROL access to the sink SQL DW like PolyBase but only need INSERT and ADMINISTER DATABASE BULK OPERATIONS permissions.
  • You can specify a custom row terminator for CSV files
  • Escape string, field, and row delimiters for CSV files
  • Use SQL Server Date formats for CSV files

COPY Command Syntax

COPY INTO [schema.]table_name
[(Column_list)] FROM '<external_location>' [,...n] WITH
(
[FILE_TYPE = {'CSV' | 'PARQUET' | 'ORC'} ] [,FILE_FORMAT = EXTERNAL FILE FORMAT OBJECT ] [,CREDENTIAL = (AZURE CREDENTIAL) ] [,ERRORFILE = '[http(s)://storageaccount/container]/errorfile_directory[/]]'
[,ERRORFILE_CREDENTIAL = (AZURE CREDENTIAL) ] [,MAXERRORS = max_errors ] [,COMPRESSION = { 'Gzip' | 'DefaultCodec'| 'Snappy'}] [,FIELDQUOTE = 'string_delimiter'] [,FIELDTERMINATOR =  'field_terminator'] [,ROWTERMINATOR = 'row_terminator'] [,FIRSTROW = first_row] [,DATEFORMAT = 'date_format'] [,ENCODING = {'UTF8'|'UTF16'}] [,IDENTITY_INSERT = {'ON' | 'OFF'}] )

Preferred file types:
 

1. Using Parquet files:

The preferred method of using the COPY INTO command for bulk loads of data would be to read Parquet (snappy compressed) files using snappyparquet as the defined File_Format.

Below is the COPY INTO SQL syntax for snappy parquet files:

COPY INTO [Tablename] FROM ' <file_path>'
WITH (
FILE_FORMAT = [snappyparquet],
CREDENTIAL = (IDENTITY='Managed Identity')
)

The snappy parquet file gets copied from ADLS Gen2 into an Azure Synapse table .

2. Using CSV files

Below is the COPY INTO SQL syntax for csv files for Azure Synapse.

COPY INTO [Table1] FROM '<filepath>'
WITH (
FILE_TYPE = 'CSV',
CREDENTIAL = (IDENTITY='Managed Identity'),
ENCODING = 'UTF8',
FIELDTERMINATOR = ','
)

  • Column Delimiter: Comma (,)
  • Row Delimiter: auto detect
  • Encoding: Default (UTF-8). This will need to be set for csv files. Alternatively, the Encoding can be specified in the COPY INTO command syntax.
  • Escape Characters: ” (NOTE that this setting will allow double quotes and commas in text fields)
  • Quote Characters: Double Quote (“) (NOTE that this setting will allow double quotes and commas in text fields)
  • NULL Value: @concat(”) (NOTE that this setting will allow NULL INT datatypes)

COPY command in Azure Synapse: Points to consider

  • When loading CSV, consider splitting your input into multiple files for best loading performance. This applies to gzip compressed files as well.
  • There are no restrictions on the number and size of files but in the interests of performance, it’s advisable to have files that are at least 4MB in size.
  • For Parquet and ORC files there is no need to split them because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be at least 256MB or bigger for best performance.

Bulk Load with BryteFlow, Our Real-time, Automated Data Ingestion Tool

BryteFlow XL Ingest, data ingestion tool created especially for Bulk Loads

All the methods we discussed before, can help you bulk load data to cloud data warehouses but there is an even faster way to achieve bulk inserts of data with ease, speed and without coding – BryteFlow. Our data ingestion tool BryteFlow uses a two-pronged approach for ingesting data fast. In case of bulk loads, you can use BryteFlow XL Ingest for the initial full ingest of data. The bulk loading of data is done at super-fast speed with automated, parallel, multi-threaded loading, smart partitioning, and compression.

Incremental Loads handled with log-based CDC (Change Data Capture)

After the initial full ingest, incremental loads are handled by BryteFlow Ingest using log-based CDC (Change Data Capture) delivering extremely high throughput (1,000,000 rows in 30 secs). BryteFlow provides the highest throughput for Oracle (6x faster than GoldenGate) and a unique BryteFlow SAP Data Lake Builder tool that extracts data directly from SAP applications. BryteFlow data ingestion is real-time, automated, and your data is ready to use at destination. BryteFlow also provides automated data reconciliation so data is also validated for completeness. The best part? Everything happens through an easy point-and click-interface and there is NO coding for any process! See how BryteFlow Data Replication Software works