What is ETL? ETL stands for Extract, Transform and Load. Extract retrieves data from databases or other sources; Transform modifies the data to make it suitable for consumption; Load writes the data to the destination (in this case, on AWS). This allows data from disparate sources to be made available in a data lake or data warehouse in a consistent format, so that it can be easily used for reporting and analytics. AWS Glue is one of the AWS ETL tools.
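The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not any particular tool: the sample records, the transformation rule, and the in-memory "warehouse" are all hypothetical stand-ins for real sources and destinations.

```python
# Minimal sketch of the Extract-Transform-Load pattern.
# Records, rules, and the "warehouse" dict are hypothetical stand-ins.

def extract():
    """Extract: pull raw rows from a source (here, a hard-coded sample)."""
    return [
        {"id": 1, "name": " Alice ", "amount": "120.50"},
        {"id": 2, "name": "Bob", "amount": "75.00"},
    ]

def transform(rows):
    """Transform: clean strings and cast amounts so rows share one schema."""
    return [
        {"id": r["id"], "name": r["name"].strip(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: write transformed rows to the destination, keyed by id."""
    for r in rows:
        warehouse[r["id"]] = r
    return warehouse

warehouse = load(transform(extract()), {})
print(warehouse[1])  # {'id': 1, 'name': 'Alice', 'amount': 120.5}
```

Real pipelines differ mainly in scale and plumbing, not in this basic shape: sources become databases or S3 buckets, and the load step targets a warehouse such as Redshift or Snowflake.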
What is AWS Glue?
AWS Glue is an Extract Transform Load (ETL) service from AWS that helps customers prepare and load data for analytics. It is a completely managed AWS ETL tool, and you can create and execute an AWS Glue ETL job with a few clicks in the AWS Management Console. All you do is point AWS Glue at data stored on AWS; Glue will find your data and store the related metadata (table definition and schema) in the AWS Glue Data Catalog. Once catalogued in the Glue Data Catalog, your data is immediately searchable, queryable, and available for ETL in AWS.
How AWS Glue works as an AWS ETL tool
- Serverless – Behind the scenes, AWS Glue can use a Python shell or Spark. When AWS Glue ETL jobs use Spark, a Spark cluster is automatically spun up as soon as the job runs. Instead of you manually configuring and managing Spark clusters on EMR, Glue handles that seamlessly.
- Crawlers – The AWS Glue ETL process includes crawlers for discovering metadata. They scan through the S3 repository and create a Glue Data Catalog that can be indexed and queried. The Glue Data Catalog can act as a central repository for data about your data.
- Script Auto generation – AWS Glue can be used to auto-generate an ETL script. You can also write your own scripts in Python (PySpark) or Scala. Glue takes the input on where the data is stored. From there, Glue creates ETL scripts in Scala and Python for Apache Spark.
- Scheduler – AWS Glue ETL jobs can run on a schedule, on demand, or in response to a job event, and schedules accept cron expressions.
- Pay-as-you-go (PAYG) – you pay only for the resources consumed while AWS Glue is actively running.
AWS Glue Pricing
AWS Glue pricing involves an hourly rate, billed by the second, for crawlers (discovering data) and ETL jobs (processing and loading data). For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second. As with all AWS services, pricing differs slightly between regions.
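As a back-of-the-envelope illustration of the hourly-rate, per-second billing described above: the $0.44 per-DPU-hour rate and the one-minute billing minimum used here are assumptions based on a typical US region, so check the current pricing for your region before relying on the numbers.

```python
# Illustrative Glue ETL job cost: DPUs x billed hours x hourly rate,
# billed per second with a per-job minimum. The rate and the 60-second
# minimum are assumed example values; actual pricing varies by region.
RATE_PER_DPU_HOUR = 0.44  # assumed example rate in USD

def job_cost(dpus, runtime_seconds, minimum_seconds=60):
    """Return the job cost in USD for a given DPU count and runtime."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * RATE_PER_DPU_HOUR

# A job using 10 DPUs that runs for 15 minutes:
cost = job_cost(dpus=10, runtime_seconds=15 * 60)
print(f"${cost:.2f}")  # $1.10
```

Because billing is per second above the minimum, short delta-only jobs cost far less than the same workload run as a long full-load job.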
AWS Glue Limitations
- Learning Curve – The learning curve for AWS Glue is steep. Your team needs strong knowledge of Spark concepts, especially PySpark, when it comes to optimization. Knowing Python alone may not be enough.
- Real-time Change Data Capture and transformation for ETL in AWS – to achieve this, AWS Glue needs to interface with AWS DMS, another AWS service, to capture changed data from databases. The integration between the two requires coding and significant developer input. Deployment times are extended and demand a significant number of resources. Compare AWS DMS with BryteFlow
- Testing can be a challenge – There is no environment to test the transformations. You are forced to deploy your transformations on parts of real data, which is time-consuming.
- Delay in starting jobs – AWS Glue ETL jobs have a cold start time of approximately 15 minutes per job. The delay generally depends on the amount of resources you request and their availability on the AWS side. This can significantly impact the pipeline and the delivery of data.
- Lack of control for tables – AWS Glue does not give any control over individual table jobs; ETL applies to the complete database.
Real-time ETL in AWS with AWS DMS and AWS Glue
When data needs to be transformed in real time from databases, the changes need to be captured with AWS DMS and then passed to AWS Glue.
Limitations with this approach
- Performance – To process the changed (delta) data from DMS, a Kafka stream is usually introduced to keep the records in the right order for application on the destination. This not only introduces another integration and another point of failure, but also slows down processing, since a row-by-row approach must be undertaken before the data is handed to AWS Glue. For destinations like Snowflake, Redshift and S3, this can be a severe bottleneck.
- Significant coding – Once the order of the inserts, updates and deletes is available, it needs to be processed for several tables, with data type mappings and general reliability and robustness handled in code. Glue also has a learning curve for job optimization, and developers need to be upskilled. Replacing AWS Glue with other ETL software raises similar issues, with significant coding required.
- Control – It might appear that the user is in full control of the process, but this approach can lead to re-inventing the wheel.
- Maintenance overheads – Not only do teams need to manage AWS DMS and Glue, but they also need to build and maintain the integrations between them.
- Best practices – Each destination platform has its own best practices that must be known and, in some cases, implemented manually.
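The ordered, row-by-row apply that makes this approach a bottleneck can be sketched in a few lines. Each change event from the capture stream must be applied to the target in sequence; the event shape used here (op, key, row) is a hypothetical simplification of what a DMS-plus-Kafka pipeline would actually deliver.

```python
# Hedged sketch: applying an ordered stream of CDC events row by row.
# The event format (op, key, row) is a hypothetical simplification of
# what a DMS/Kafka pipeline would deliver; ordering must be preserved,
# which is why the per-row loop cannot easily be parallelized.

def apply_changes(target, events):
    """Apply insert/update/delete events to a keyed target, in order."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            target[key] = event["row"]  # upsert semantics
        elif op == "delete":
            target.pop(key, None)
    return target

events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2},
]
target = apply_changes({}, events)
print(target)  # {1: {'id': 1, 'status': 'shipped'}}
```

Reordering the update before the insert, or the delete before its insert, would corrupt the result, which is why an ordering layer such as Kafka gets introduced and why per-row application becomes the throughput bottleneck at volume.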
An AWS ETL option for real-time data integration – BryteFlow
A far more viable AWS ETL option is BryteFlow. With the BryteFlow AWS ETL tool, the Change Data Capture and data transformation are linked, and integration is seamless, so that data is transformed as it is ingested.
BryteFlow Ingest (for data replication) and BryteFlow Blend (for data transformation) offer a no-code experience compared with AWS DMS and AWS Glue when you want to process your data in real time and load it to Amazon S3, Redshift or Snowflake.
BryteFlow comes with several other must-have features for AWS ETL: BryteFlow XL Ingest for ingesting large volumes in minutes and BryteFlow TruData for automated data reconciliation.
Benefits with BryteFlow are:
- No-code real-time replication to AWS – BryteFlow Ingest replicates data in real time to S3 (and Athena), Redshift and Snowflake. For S3 replication, it loads the data as an initial sync, performs an "upsert" automatically, and provides data that is ready to use on S3 or Amazon Athena, and hence automatically in the Glue Data Catalog. You can configure partitions, file formats like Parquet, and compression formats like Snappy. BryteFlow uses best practices for Redshift, drastically lowers compute for Snowflake, and provides the fastest replication.
- Real-time data transformation – BryteFlow Blend handles the real-time data and transforms only the delta records, saving time and compute resources. Existing development resources with SQL skills can be used to manage pipelines.
- High performance – When large volumes are involved, a number of best practices need to be employed depending on the destination, e.g. partitions and file formats for S3 and Athena, and the right micro-batching for Snowflake and Redshift. BryteFlow handles this with simple point-and-click configuration.
- Reduce deployment times by 90% – An automated pipeline which employs best practices and allows configuration for different environments, speeds up deployment by 90%.
- BryteFlow is embedded in AWS – keeps customers firmly on the AWS path to receive more innovations as it is embedded in the AWS eco-system.
You can see now why BryteFlow is another AWS ETL option, especially if you want to save on developer time, programming and compute costs. Experience seamless data integration on S3, Redshift and Snowflake in real time with a 5-day free trial of BryteFlow and see just how easy AWS ETL can be.