In this blog we will talk about how a plain vanilla S3 Data Lake can function as an efficient Data Lakehouse, with the capability to sync and update data continuously using Change Data Capture, and to enable analytics to be carried out in the Data Lake itself, without needing to load data into a data warehouse. A Data Lakehouse on S3 cuts down on effort and expenses and can be implemented with tools like Apache Hudi, Delta Lake and our very own BryteFlow.
Quick Links
- About Data Lakes and Data Warehouses
- The Data Warehouse
- The Data Lake
- The Data Lakehouse: a combination of Data Lake and Data Warehouse
- Tools to convert the S3 Data Lake to an S3 Data Lakehouse
- The Delta Lake on AWS
- Apache Hudi on AWS
- BryteFlow on AWS S3
- Automated UPSERT on the S3 Data Lake: The Biggest Benefit
- S3 Data Lake into Data Lakehouse: Other BryteFlow Benefits
About Data Lakes and Data Warehouses
Data lakes and data warehouses have been around for a long time now. The story started with the data warehouse. Big corporations discovered their data was growing exponentially and could be harnessed profitably for decision support, business intelligence and analytics. The data, however, needed to be high quality and error free. Moreover, data integrity had to be maintained and it needed to be the right data – a challenge, since instances of the same data could be found across different applications with varying values.
The Data Warehouse
A data warehouse allows multi-source application data to be placed separately in another location, with specially designed infrastructure that allows the data to be used easily and securely for analytics and BI by all the business users in the organization. Data in the data warehouse is governed by analytics constructs like data lineage, data models, metadata, KPIs and summarization. The ETL (Extract Transform Load) process extracts application data, transforms it into consumable formats, and loads it into the warehouse for corporate querying.
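To make the ETL flow concrete, here is a minimal PySpark sketch of the extract–transform–load steps described above. It is illustrative only: the JDBC URLs, table names and credentials are hypothetical placeholders, not part of any specific product.

```python
# Minimal ETL sketch in PySpark (illustrative only; source and target are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: pull raw orders from a source relational database over JDBC
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/sales")   # hypothetical
          .option("dbtable", "public.orders")
          .option("user", "etl_user").option("password", "***")
          .load())

# Transform: cleanse and shape the data into an analytics-friendly form
daily_revenue = (orders
                 .filter(F.col("status") == "COMPLETE")
                 .groupBy(F.to_date("order_ts").alias("order_date"))
                 .agg(F.sum("amount").alias("revenue")))

# Load: write the transformed data to the warehouse (again over JDBC here)
(daily_revenue.write.format("jdbc")
 .option("url", "jdbc:postgresql://warehouse:5432/dw")               # hypothetical
 .option("dbtable", "analytics.daily_revenue")
 .option("user", "etl_user").option("password", "***")
 .mode("append")
 .save())
```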
Data Warehouse Limitations
However, the data warehouse mostly handles structured data from transactional databases for business intelligence and analytics. It cannot handle other varieties of data such as text, images, audio, video and IoT data. Machine Learning and AI workloads, which are on the rise, use algorithms that need direct access to data and are not SQL-based, so they cannot be handled effectively in a data warehouse – and attempting to do so would be prohibitively expensive.
The Data Lake
Data Lakes are repositories of all kinds of raw data, ranging from structured data from relational databases to unstructured data like text, IoT data, images, audio and video. Data Lakes like Amazon S3 typically have inexpensive storage and can store huge volumes of data. For example, Machine Learning needs copious amounts of data for data models, and this data is easily stored in a data lake.
How to Create an S3 Data Lake in Minutes
In a data lake, data cannot be analyzed easily, since the storage layer consists of files partitioned by time rather than indexed tables with primary and secondary keys. This means that when a record needs to be updated or deleted, the query tool has to scan all the records in the data lake – an extremely time-consuming and impractical task. Data in the data lake can still be queried with other analytics and data management software to transform it, analyze it, and gain insights.
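As a rough illustration of what querying a plain data lake looks like, here is a short PySpark sketch that reads time-partitioned Parquet files straight from S3. The bucket name, partition column and field names are hypothetical.

```python
# Illustrative sketch: reading time-partitioned Parquet files directly from S3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# The lake is just files laid out by ingestion date, e.g. s3://my-lake/events/dt=2023-01-15/
events = spark.read.parquet("s3://my-lake/events/")

# Partition pruning keeps a time-bounded query cheap...
recent = events.filter(F.col("dt") >= "2023-01-01")
recent.groupBy("event_type").count().show()

# ...but updating or deleting a single record still means scanning and rewriting
# whole files, because there are no keys or indexes to locate it.
```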
Data Lake Limitations
Data lakes have limitations when it comes to analyzing structured data from transactional databases. This data needs to be updated frequently, sometimes in real time using Change Data Capture, to stay in sync with changes in the operational databases – and data lakes are difficult to update. Hence data required for business analytics is transformed and loaded into data warehouses to be consumed by business users.
The Data Lakehouse: a combination of Data Lake and Data Warehouse
The Data Lakehouse enables you to have the best of both worlds – combining the data analytics strengths of the data warehouse with the scalability and inexpensive, huge storage of the data lake. A Data Lakehouse has data structures and data management features akin to a data warehouse, which sit on top of an inexpensive storage layer like that of a data lake. A single system combining these features can reduce storage costs, avoid duplication of data, and present a single point of access to data – much faster than having the data team access multiple systems to get to their data. Data Lakehouses have up-to-the-minute data which can be used by business users for BI and analytics, and by data science professionals for Machine Learning, AI, and data science purposes.
Data Lakehouse Advantages:
- Data is saved in open, standardized file formats like Parquet and ORC rather than proprietary formats, which keeps it easily accessible and avoids vendor lock-in.
- Support for different data types like audio, images, text, clickstream and IoT data, with the capability to store, cleanse, transform, access, and analyze them.
- Access to huge volumes of data, with efficient reads of non-SQL data for machine learning projects using Python and R libraries.
- Supports a range of tools, languages and engines for machine learning and business analytics.
- Supports a metadata layer for easy referencing and accessing of data.
- Access to data versioning and time-stamped data for Machine Learning and data science with the option to roll back or revert to earlier versions of data.
- Data warehouse schema architectures like star / snowflake schema are supported, making for better governance and auditing. Query performance is optimized with techniques like data compression, data clustering, caching and data skipping, as well as by newer query engines.
- Consistency of data is assured by using ACID transactions, since multiple users can read or write data at the same time, usually with SQL. ACID stands for Atomicity, Consistency, Isolation, and Durability – the must-have properties of a transaction. ACID transactions ensure an operation either completes fully or does not take place at all; a credit card transaction, for example, either succeeds or fails completely.
- Affordable storage is a given, considering the data lakehouse is built over data lake object storage like Amazon S3 or Azure Blob Storage.
- No need to transfer data to a data warehouse for analytics – you can analyze the data within the data lake itself, for example by using AWS Athena or Redshift Spectrum to query the data in the S3 data lake (see the sketch after this list).
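As a minimal sketch of querying S3 data in place, the snippet below starts an Athena query with boto3. The database, table and result bucket names are hypothetical and would need to exist in your Glue Data Catalog and S3 account.

```python
# Illustrative sketch: querying S3 data in place with Amazon Athena via boto3
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS events "
                "FROM lake_db.events WHERE dt >= '2023-01-01' "
                "GROUP BY event_type",
    QueryExecutionContext={"Database": "lake_db"},                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
print("Query started:", response["QueryExecutionId"])
```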
Tools to convert the S3 Data Lake to an S3 Data Lakehouse
Here we will examine some tools – Delta Lake, Apache Hudi and BryteFlow – that enable data upserts and high-powered analytics on the data in an Amazon S3 data lake, effectively turning it into a Data Lakehouse.
The Delta Lake on AWS
What is a Delta Lake?
Delta Lake is an open-source project launched by Databricks. A Delta Lake is the transactional layer applied on top of the data lake storage layer to get trustworthy data in cloud data lakes like Amazon S3 and ADLS Gen2. Delta Lake ensures consistent, reliable data with ACID transactions, built-in data versioning and control for concurrent reads and writes. It enables easy and reliable report reproduction.
How does AWS Delta Lake work?
AWS Delta Lake uses versioned Parquet files to store data in the data lake. It keeps a transaction log that records commits made to the tables and delivers ACID transactions. Think of AWS Delta Lake as a file system that stores batch and streaming data alongside the Delta metadata that maintains table structure and enforces schema in the S3 data lake.
Moving data into the S3 data lake is carried out with the Delta ACID API, and getting it out uses the Delta JDBC connector. A big limitation of Delta Lake is that the data is stored in the Delta format and cannot be queried directly with SQL query tools like AWS Athena, Redshift Spectrum, Presto and Spark SQL. Also, a good amount of coding with Apache Spark or PySpark may be required to set up AWS Delta Lake and implement it on Amazon S3, as sketched below. Learn more about Delta Lakes on their website.
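To give a feel for that Spark coding, here is a minimal PySpark sketch of a Delta Lake upsert using the open-source delta-spark package; the S3 paths and the join key are hypothetical, and the target path is assumed to already hold a Delta table.

```python
# Illustrative sketch of a Delta Lake upsert with PySpark and delta-spark
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-upsert")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Incoming changed records (hypothetical staging location)
updates = spark.read.parquet("s3://my-lake/staging/customer_changes/")

# Existing Delta table in the S3 data lake (hypothetical path)
target = DeltaTable.forPath(spark, "s3://my-lake/delta/customers/")

# MERGE applies inserts and updates atomically; the transaction log records the commit
(target.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```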
Apache Hudi on AWS
What is Apache Hudi?
Apache Hudi is an open-source data management framework that enables incremental data processing and the development of data pipelines. It helps organizations create and maintain data lakes at scale. Apache Hudi brings stream-style processing to batch data, delivering fresher data through an integrated serving layer with very low latency (minutes) using primitives like upserts and incremental pulls. Hudi allows for ACID (Atomicity, Consistency, Isolation and Durability) transactions on data lakes. Apache Hudi can run on cloud storage like Amazon S3 or on HDFS (Hadoop Distributed File System).
Apache Hudi on AWS – how does it work?
A data lake in Amazon S3 provides advantages in terms of scale, affordability, and reliability. Data in S3 can be processed and analyzed using Amazon EMR with open-source tools like Apache Spark, Hive, and Presto. However, incremental data processing and record-level inserts, updates, and deletes can still prove a challenge on Amazon S3. That’s where Apache Hudi comes in.
Apache Hudi on AWS enables you to carry out record-level updates, inserts, and deletes on data in S3. With Apache Hudi on S3 you can consume real-time data streams and change data captures, handle late-arriving data, track history, maintain data privacy compliance and perform rollbacks, all in a vendor-neutral format. Hudi stores data in S3 as Parquet and Avro files, and it integrates with multiple engines and services including Apache Spark, Hive, Presto and the AWS Glue Data Catalog. This allows you to query near real-time data with familiar tools.
Hudi Table Types: Copy on Write (CoW) and Merge on Read (MoR)
Apache Hudi datasets come in two table types: Copy on Write (CoW) and Merge on Read (MoR). A Copy on Write (CoW) table stores data in the columnar Parquet format, and every update creates a new version of the base files during a write commit. The CoW table type is suitable for read-heavy workloads on data that does not change often.
The Merge on Read (MoR) table type uses a combination of Parquet (columnar) and Avro (row-based) formats. Updates are written to row-level delta files and compacted as needed to create new versions of the columnar files. MoR tables are suited to write-heavy workloads on data that is read less often. Hudi provides three logical views for data access: read-optimized, incremental and real-time.
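A minimal sketch of what a Hudi upsert from Spark on EMR might look like is below; the table name, key fields and S3 paths are hypothetical, and the table type option is where you choose between CoW and MoR.

```python
# Illustrative sketch of a Hudi upsert from Spark (e.g. on Amazon EMR)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# Incoming changed records (hypothetical staging location)
changes = spark.read.parquet("s3://my-lake/staging/orders_changes/")

hudi_options = {
    "hoodie.table.name": "orders",                                 # hypothetical table
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
    # COPY_ON_WRITE for read-heavy data, MERGE_ON_READ for write-heavy data
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

# Upsert the changes into the Hudi dataset on S3 (hypothetical path)
(changes.write.format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("s3://my-lake/hudi/orders/"))
```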
AWS Athena can be used to query Apache Hudi datasets in the read-optimized view – basic steps (a short query sketch follows the list)
- Raw data is stored in the Amazon S3 data lake.
- Raw data is transformed to Apache Hudi CoW and MoR tables with Apache Spark on Amazon EMR.
- Tables can be queried and analyzed on Amazon Athena via a read-optimized view.
- When data is updated, Apache Hudi upserts the record into the existing dataset.
- The updated dataset can be queried and analyzed using Amazon Athena and updates are reflected in the results. Learn more about Apache Hudi on AWS in this AWS blog
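As an illustrative follow-up to these steps, the sketch below reads a Hudi dataset back with Spark on EMR and queries it with Spark SQL; the S3 path is hypothetical. (For a MoR table synced to the Glue/Hive catalog, the read-optimized view is typically exposed to Athena as a table with an "_ro" suffix.)

```python
# Illustrative sketch: reading a Hudi dataset back and querying it with Spark SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-read").getOrCreate()

# Load the Hudi dataset from S3 (hypothetical path)
orders = spark.read.format("hudi").load("s3://my-lake/hudi/orders/")
orders.createOrReplaceTempView("orders_snapshot")

# Query the snapshot with familiar SQL
spark.sql("""
    SELECT order_date, COUNT(*) AS orders
    FROM orders_snapshot
    GROUP BY order_date
    ORDER BY order_date
""").show()
```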
Apache Hudi and AWS DMS – Upserts on Amazon S3
AWS DMS (AWS Database Migration Service) can deliver data from relational databases to Amazon S3 in a continuous stream using Change Data Capture. However, to apply and merge those changes with the target tables (an upsert), Apache Hudi is needed on top of S3. The Hudi-on-EMR upsert in Amazon S3 is not automated, and a fair amount of coding is required to achieve it, in addition to expert developer resources and time.
Learn about AWS DMS Limitations for Oracle Replication
BryteFlow on AWS S3
How you can use your AWS S3 Data Lake as an S3 Data Lakehouse with BryteFlow
BryteFlow is a real-time, automated replication tool that ingests your data from sources such as SAP, Oracle, SQL Server, Postgres, and MySQL into Amazon S3, in real time or as changed data in batches (as per configuration), using log-based CDC. It provides automated upserts in S3 and delivers analytics-ready data without tedious data prep. It is technology-neutral and compatible with any third-party tool you may want to use.
Automated UPSERT on the S3 Data Lake: The biggest benefit according to our customers
What is an UPSERT?
An UPSERT combines two words: UPDATE and INSERT. It is an atomic, single-step operation that inserts rows into a database table if they don’t exist, or updates them if they do.
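A common way to express an upsert is a single MERGE statement. The illustrative sketch below runs one through Spark SQL against a table format that supports MERGE (such as Delta Lake); the table names are hypothetical.

```python
# Illustrative only: an upsert expressed as one MERGE statement via Spark SQL
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

spark.sql("""
    MERGE INTO customers AS t
    USING customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *     -- row exists: update it
    WHEN NOT MATCHED THEN INSERT *     -- row is new: insert it
""")
```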
Why Data Lakes aren’t good at UPSERTs
Data lakes have an underlying base structure of files and are optimized for appending data rather than changing existing data. Inserts, updates and deletes at the source create data changes that need to be merged with the existing data in the S3 data lake as upserts, so that analytics runs on current, reliable data.
Typically, users need to read the entire table (or a partitioned subset of it) and then overwrite it. Organizations try to handle the issue with manual coding and complex queries in SQL, Spark and so on, as sketched below.
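For illustration, this is roughly what the manual pattern looks like in PySpark on plain Parquet files; the paths and key column are hypothetical, and this read-merge-overwrite approach is exactly the cost that table formats and automated upserts avoid.

```python
# Illustrative sketch of the manual "upsert" pattern on plain Parquet files:
# read the affected data, merge the changes in memory, then rewrite the files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-upsert").getOrCreate()

existing = spark.read.parquet("s3://my-lake/customers/")            # hypothetical
changes = spark.read.parquet("s3://my-lake/staging/customer_changes/")

# Keep every existing row that has no incoming change, then append the changes
merged = (existing.join(changes.select("customer_id"), "customer_id", "left_anti")
          .unionByName(changes))

# Rewrite the merged result to a fresh location (overwriting the same path you
# are reading from in a single job is unsafe), then swap it in for the old data
merged.write.mode("overwrite").parquet("s3://my-lake/customers_new/")
```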
BryteFlow’s Automated Upserts on the Amazon S3 Data Lake
BryteFlow provides automated upserts in S3, so changes at source, including inserts, updates and deletes, are merged with the existing data – absolutely NO coding required. This is a huge benefit, since you do not need to install third-party tools like Apache Hudi, and you can avoid all the manual coding and integration you would otherwise need for Hudi upserts or even Delta Lake’s MERGE commands. These automated S3 upserts save you a lot of time, effort, and cost, and you get access to your data that much faster.
S3 Data Lake into Data Lakehouse: Other BryteFlow Benefits
- BryteFlow provides automated partitioning in the S3 data lake. Any column can be used for partitioning, not just dates. This means your queries can run faster, assuring you of reliable analytics performance. In case of an error, the replication process can restart from the partition being processed rather than from the very beginning.
- BryteFlow Ingest provides time-series data so it is possible to time travel and access the version you want for roll-backs or audits. This is a great feature for predictive analytics.
- BryteFlow replication software integrates seamlessly with Amazon Athena, Redshift Spectrum and the AWS Glue Data Catalog for running queries. It offers easy configuration of file formats and compression, e.g. Parquet-snappy.
- For complex querying you can prepare data automatically on S3 and load it to Amazon Redshift or Snowflake if required. This preserves the compute resources of the data warehouse and saves on storage costs since you load only the data you need for analytics to Redshift or Snowflake.
- BryteFlow leverages AWS services like Amazon Athena, AWS Glue, Amazon Redshift, Amazon SageMaker etc. and integrates them with your S3 data lake for seamless, optimal functioning.
- BryteFlow offers fast replication to Amazon S3, approx. 1,000,000 rows in 30 seconds (6x faster than GoldenGate). It has S3 data lake best practices baked in, to achieve high throughput, low latency and high security.
- Data in a BryteFlow-generated data lake is open to all technologies and can be used for various use cases including Analytics, AI and Machine Learning.
- S3 Bulk loads are easy and fast with parallel, multi-thread loading and partitioning by BryteFlow XL Ingest.
- BryteFlow data replication software automates DDL (Data Definition Language) creation in the S3 data lake and creates tables automatically with best practices for performance – no tedious data prep needed.
- BryteFlow provides data conversions out of the box like Typecasting and GUID data type conversion to ensure your data is analytics ready. It enables configuration of custom business logic to collect data from multiple applications for AI and Machine Learning models.
- You also get automated data reconciliation in your S3 data lake. BryteFlow TruData compares row counts and column checksums to verify data completeness and provides alerts for missing or incomplete data.