Why you need a data warehouse
In today’s Big Data world, organizations have data coming from diverse sources. In order to integrate this data for analytical purposes or ML, companies build a data warehouse to optimize and benefit from the data they have. The data warehouse provides them a platform that can be used for business intelligence, data analysis, reporting and many other use cases. Learn how to load large volumes of data to Snowflake fast
The Snowflake Data Warehouse is an analytical cloud data warehouse, and is faster, easier to use, and more flexible than most data warehouses (on-premises or Cloud). However it can prove expensive if you don’t follow some simple rules.
What is the Snowflake Data Warehouse?
The Snowflake Data Warehouse is a data warehouse based on the Software-as-a-Service (SaaS) concept. Snowflake offers immense scalability, is extremely easy to use and offers more flexibility than traditional data warehouses. Every user and workload has access to instantly and infinitely scale compute resources with a platform that does not need management. Build a Snowflake Data Lake or Data Warehouse
Snowflake Data Warehouse Benefits
Snowflake makes it easy for organizations to incorporate data into business by integrating business applications and sharing live data with business partners and customers. Organizations can integrate and analyse previously siloed data to gain valuable insights while adhering to policies of compliance and governance. What makes Snowflake unique is that there is no infrastructure to be installed, configured, managed or maintained, no software or hardware to be installed. It is all handled by Snowflake that runs solely on cloud infrastructure. It uses virtual compute instances for compute requirements and a storage service to store data.
Snowflake offers various payment plans for their cloud data platform. ‘Snowflake on DemandTM’ is a usage based, per second Snowflake pricing plan that is fast and easy. Pre-purchased Snowflake capacity plans are also available. Compute usage is billed to users on a per second basis, minimum being 60 seconds. View Snowflake Pricing Details
Here’s how you can save on Snowflake costs with some simple tips:
- Understand the pricing model
Snowflake separates data warehousing into three distinct functions: Compute Resources, Data Storage, and Cloud Services. The costs associated with using Snowflake are based on your usage of each of these functions.
It’s always good to track down the usage which can be done by:
- Analysing usage data to see which workflows are the most expensive.
- Using Account Usage Views for tracking history, performance and cost
- Creating roles by business functions to track spending of various departments
- Using Resource Monitors to cut off DWs when you hit predefined credit amount limits
- Size efficiently to reduce Snowflake costs
Scale, create, and optimize your cloud compute resources. E.g. scale up or out appropriately. Turn on or off the warehouse for certain workloads or when not in use. Use warehouse monitoring to size and limit the cost per workload.
- When you have a queuing up of workloads, then add more compute clusters
- When the workload is slow without queuing, then increase warehouse size
- Query control parameters can help to control Snowflake compute costs
Query processing and concurrency can be controlled by setting certain parameters:
- Statement queued timeout in seconds
- Statement timeout in seconds
- Data Loading needs to be optimal
- Sort on ingestion: Data is automatically partitioned in Snowflake on an ingestion order. Sorting an S3 bucket before bulk load via COPY could be a lot faster than inserting with an ORDER BY
- CSV (Gzipped) is the best format for loading to Snowflake (2 to 3 times faster than Parquet or ORC)
- Use COPY INTO instead of INSERT because it uses the more efficient bulk loading processes.
- Efficient script writing lowers Snowflake compute costs
Poorly written code can cause severe bottlenecks in any database or warehouse and drive up costs. It’s worth your effort to write efficient SQL code rather than adding more and more compute.
- Get rid of temporary and transient tables when done
- Don’t use “CREATE TABLE AS”. Instead, use “CREATE OR REPLACE”
- Use COPY INTO not INSERT INTO
- Use staging tables to manage transformation of imported data
- Use ANSI Joins because they offer more efficient optimization
- Don’t use ORDER BY if you can. Sorting can prove expensive.
- Don’t handle duplicate data using DISTINCT or GROUP BY
- Storage practices for better ROI on Snowflake
Storage plays an important role with Snowflake costs. Optimize storage to keep costs low
- Opt for S3 buckets in the same geographic region
- Your bucket specifications be in line with how files are coming across (e.g. by date or application)
- Files should be kept between 60–100 MB to benefit from parallel loading
- Avoid using materialized views except in specific use cases (e.g. pre-aggregating)
Save big on Snowflake costs by loading data right
The most important cost savings by far, lie in how you load your data and how your queries are written. If you are replicating your data constantly with incremental loads, this becomes a very important factor to saving costs.
BryteFlow enables the lowest compute costs on Snowflake
Our customers tell us that BryteFlow uses the lowest compute when replicating data to Snowflake. This is because it uses all the best practices implicitly without any coding and hence is a massive return on investment. Not only does it get the data to Snowflake really fast, reducing development and implementation time to deliver quick time to value, but also reduces Snowflake compute costs on average by 30%.
The BryteFlow software has two modes of getting ready-to-use data into your Snowflake database:
1) Directly to the Snowflake data warehouse: In this method the source data is ready-to-use with Type2 SCD history or without history for Snowflake hosted on AWS, Azure or GCP. This uses the internal stage on Snowflake for optimization, and saves on setting additional security measures on the cloud platform.
2) Indirectly: Prepares and transforms data on Amazon S3 and then pushes it to your Snowflake database for data analytics. This can include real-time data transformations, after which the data is loaded directly to the Snowflake Data Warehouse. With this method you have the option of data storage on S3 (very economical) and pushing only the data you need for analytics to Snowflake, so in effect you pay only for the compute, reducing your overall Snowflake costs.
BryteFlow follows all recommended best practices when uploading to Snowflake out of the box.
- Provides support for CSV (Gzipped) file format
- Controls the number of files to be prepared for loading when using COPY command
- Multi-thread parallel loading capability to speed up the loading process
- With its Auto-recovery and Resume feature, you can stop-start your warehouses on-demand. No need to keep it up and running all the time. Turn it off when not in use and that alone can save you a ton of money.
Other important features of BryteFlow for data replication tool include:
- Transaction log-based Change Data Capture with high performance and zero impact on source besides other CDC options
- Real-time automated data replication and transformation with real-time access
- Very high throughput – 6x faster than Oracle Goldengate for Oracle sources
- Automated file merges on your Snowflake database with automated SCD Type2 history
- Automated optimisation for the Snowflake Data Warehouse
- Option for remote Oracle log mining
- Full extract and CDC – high performance and support for large volumes
- Automated reconciliation out-of-the-box for your Snowflake data to source (at column level)
- Automated, light weight Change Data Capture for Oracle, SAP, SQL Server, PostgreSQL and MYSQL
- Delivers ready-to-use data or can be used further in the pipeline for real-time data preparation
- Referential integrity, Metadata and Data lineage
- Cost control mechanisms to lower costs and high availability out-of -the-box
- Enterprise grade security using KMS, SSE, masking and tokenization for sensitive data
- Recover automatically from network drop outs and source dropouts with constant retry mechanism
- Customisable alerting and monitoring- Integration to CloudWatch Logs, Metrics and SNS
- Swap instances whenever required with configuration
- Automated dashboard displaying data latency, duration etc.
These are some simple options to reduce data costs on Snowflake. Or you can check out BryteFlow for data integration on Snowflake. BryteFlow automates the best practices you need to keep Snowflake costs well in check, so cost savings are also automated! Get a Free Trial of BryteFlow