How do you tune your Amazon Athena query performance?
It is important to understand how Amazon Athena works and the tweaks you can make, so that you can get the best performance and lower your costs. Let's first look at what Athena is and then dive into performance tuning.
Download our eBook: How to get siloed data to AWS Athena in just a few clicks.
What is Amazon Athena?
Need to query data on Amazon S3 directly? Amazon Athena is the interactive AWS service that makes it possible. You can query data on Amazon Simple Storage Service (Amazon S3) with Athena using standard SQL. Being a serverless service, Athena requires no infrastructure to set up or manage, and you pay only for the queries you run, which makes it extremely cost-effective. You can also run queries in parallel: Athena simply scales up without a fuss, and results are lightning-fast even with huge datasets.
Create an S3 Data Lake in Minutes (tutorial with videos)
Read about BryteFlow for AWS ETL
Things to know regarding user access on Amazon Athena
There are certain restrictions imposed by AWS on user access to Athena, which you should be aware of. 13 Tips for S3 Security Best Practices
There are also service limits imposed by AWS – refer to these before deciding whether Athena works for you. If you have several users who need to run interactive queries or use dashboards on data in Amazon S3, Athena may not be the solution for you. Automating your Data Pipeline
You may need to access data on S3 via an API, via Redshift Spectrum, or by other means. If you need to ingest or transform your data on S3, your AWS ETL options with AWS Glue are explained in our blog.
Please read our blog Face off: AWS Athena vs Redshift Spectrum – which service you should use and when.
Setting up Amazon Athena
Amazon Athena is easy to set up; it is a serverless service which can be accessed directly from the AWS Management Console with a few clicks. You simply point Athena to your data stored on Amazon S3 and you're good to go.
Build a Data Lakehouse on Amazon S3
How to tune your Amazon Athena query performance: 7 easy tips
Amazon Athena’s performance is strongly dependent on how data is organized in S3. You can improve the performance with these 7 tips:
Tip 1: Partition your data
By partitioning your data, you can divide tables based on column values like dates, timestamps etc. Partitions restrict each query to the data it actually needs and lower the volume of data that has to be scanned. This makes queries faster and cheaper. To start, you need to load the partitions into the table before you query the data. You can (see the sketch after this list):
- Use the ALTER TABLE statement for individual partitions.
- Use a single MSCK REPAIR TABLE statement to create all partitions. For this method, your object key names must follow the Hive-style key=value pattern.
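A minimal sketch of both options, assuming a hypothetical table named sales partitioned by year and month under s3://your-bucket/sales/:

ALTER TABLE sales ADD PARTITION (year = '2023', month = '01')
LOCATION 's3://your-bucket/sales/year=2023/month=01/';

-- or, if the object keys follow the Hive-style year=.../month=... pattern:
MSCK REPAIR TABLE sales;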
Did you know BryteFlow partitions data for you automatically as it loads to S3? (Use BryteFlow to create an S3 Data Lake in minutes)
Tip 2: Compression and splitting of files
Splitting a file means the Athena execution engine can use multiple readers to process it in parallel. A single, unsplittable file means only one reader can be deployed while the others remain idle. It is recommended to use data in Apache Parquet or Apache ORC formats, since they are splittable and also compress data by default. Make sure the file size on S3 is optimal – between 200 MB and 1 GB.
Files that are too small (less than 128 MB) mean the execution engine may spend a lot of time on S3 overhead: opening files, listing directories, reading file headers and compression dictionaries, accessing metadata, and setting up file transfers. If a file is overly large and cannot be split, processing of the query is halted until one reader has finished reading the entire file.
Did you know BryteFlow compresses the partitioned data for you automatically as it loads to S3? (Use BryteFlow to create an S3 Data Lake in minutes)
Tip 3: Optimization of Columnar Formats
Use columnar formats like Apache ORC or Apache Parquet to store your files on S3 for access by Athena. Apache ORC and Apache Parquet store data column-wise and are splittable. They enhance data storage with features such as column-wise compression, different encoding protocols, compression according to data type, and predicate filtering. High compression ratios and the ability to skip non-essential data blocks together ensure faster query performance.
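As a hedged sketch, here is how you might define an Athena table over a hypothetical sales dataset already stored as Parquet files on S3:

CREATE EXTERNAL TABLE sales (
  order_id string,
  customer_id string,
  amount double
)
STORED AS PARQUET
LOCATION 's3://your-bucket/sales-parquet/';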
BryteFlow can be configured for ORC and Parquet formats easily
We have found that our customers have obtained significant performance benefits from using ORC format with snappy compression, which is supported natively by BryteFlow.
Tip 4: Create Table as Select (CTAS)
Athena allows you to create a new table from the results of a SELECT query using a CREATE TABLE AS SELECT (CTAS) statement. You can use CTAS statements to create new tables from existing tables on a subset of the data, or a subset of the columns.
Where CTAS can be of value is in taking an existing unoptimized dataset, converting it to Apache ORC, and partitioning it to better optimize it for repeated queries.
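A minimal sketch, assuming a hypothetical raw table named sales_raw and a target S3 path (note that in a CTAS statement the partition columns must come last in the SELECT list):

CREATE TABLE sales_optimized
WITH (
  format = 'ORC',
  external_location = 's3://your-bucket/sales-orc/',
  partitioned_by = ARRAY['year']
) AS
SELECT order_id, amount, year
FROM sales_raw;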
Tip 5: Consider splitting the data set with BUCKETING
You can also specify BUCKETING with a CTAS (CREATE TABLE AS SELECT) statement. With bucketing, you specify one or more columns whose rows you want to group together, and those rows are put into multiple buckets. When the bucketed column's value is specified in a query, Athena reads only the bucket it needs, which can dramatically reduce the number of rows of data to read.
Recommended practice is to select a column for bucketing that has high cardinality (a large number of unique values) and that is frequently used to filter the data read at query time.
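As a hedged sketch, again using the hypothetical sales_raw table and bucketing on customer_id:

CREATE TABLE sales_bucketed
WITH (
  format = 'ORC',
  external_location = 's3://your-bucket/sales-bucketed/',
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 50
) AS
SELECT *
FROM sales_raw;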
Tip 6: Optimizing Athena performance by tweaking SQL Operations
Presto is the engine that powers Athena queries, and understanding how Presto works is key to optimizing them. Here are some tips to optimize operations:
“SELECT *” clause optimization
When you select all columns, the amount of data that needs to be processed through the entire query execution pipeline increases substantially, slowing down the query. This is especially true when you are querying tables that have large numbers of string-based columns, or tables that are used to perform multiple joins or aggregations.
What you should do:
When running your queries, limit the final SELECT statement to only the columns that you need instead of selecting all columns. Trimming the number of columns reduces the amount of data that needs to be processed through the entire query execution pipeline.
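A trivial example, assuming a hypothetical orders table:

-- instead of: SELECT * FROM orders;
SELECT order_id, order_date, total_amount
FROM orders;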
“ORDER BY” clause optimization
By default, Presto sends all data rows to a single worker for sorting. This requires a lot of memory, which can cause the query to fail or crawl along.
What you should do:
Use a LIMIT clause while using ORDER BY. This enables distributing the sorting and limiting to different workers instead of depending on one worker to do all the sorting.
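For example, to fetch the 100 largest orders from the hypothetical orders table:

SELECT order_id, total_amount
FROM orders
ORDER BY total_amount DESC
LIMIT 100;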
“JOIN” clause optimization
When you join two tables, keep the smaller table on the right side of the join and the larger table on the left side. Presto distributes the right-side table to the worker nodes and then streams the left-side table through them to conduct the join.
What you should do:
Keep the smaller table on the right, and let the larger table on the left drive the query. This needs less memory, making the query run faster.
Exception: when joining multiple tables, you run the risk of a cross join. Presto does not support join reordering and conducts joins from left to right, so specify your tables from largest to smallest, taking care that two tables are not specified together in a way that could result in a cross join.
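A minimal sketch, assuming a hypothetical large orders table and a small customers table:

SELECT o.order_id, c.customer_name
FROM orders o -- larger table on the left
JOIN customers c -- smaller table on the right
ON o.customer_id = c.customer_id;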
“GROUP BY” clause optimization
The GROUP BY operator distributes rows to worker nodes based on the GROUP BY columns, and the worker nodes hold the GROUP BY values in memory. As rows are processed, the GROUP BY columns are looked up in memory and compared; when the GROUP BY columns match, the values are aggregated together.
What you should do:
When your query has a GROUP BY clause, arrange the columns by cardinality, from highest to lowest. You can also use numbers instead of strings within the GROUP BY clause, and limit the number of columns within the SELECT statement.
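For example, assuming a hypothetical census table where state has far more unique values than gender:

SELECT state, gender, count(*)
FROM census
GROUP BY state, gender; -- highest-cardinality column first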
“LIKE” clause optimization
Avoid stringing together multiple LIKE clauses in a single query.
What you should do:
When filtering for multiple values on a string column, it is preferable to use regular expressions.
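A sketch, assuming a hypothetical logs table with a request_method string column:

-- instead of: WHERE request_method LIKE '%GET%' OR request_method LIKE '%POST%' OR request_method LIKE '%PUT%'
SELECT count(*)
FROM logs
WHERE regexp_like(request_method, 'GET|POST|PUT');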
Tip 7: Using Approximate Functions
When dealing with huge datasets, a common practice is to take a column and count its distinct values using COUNT(DISTINCT column). When an exact number is not required, approximate functions such as approx_distinct() save on memory usage by counting unique value hashes rather than complete strings. The trade-off is a standard error of 2.3%, if you can live with it.
SELECT approx_distinct(subscriber_no) FROM call_history;
BryteFlow offers automated partitioning and compression for data ingested and prepared on S3, so fast, flawless data ingestion is a given. The intuitive point-and-click interface involves no coding, so even business users can get up to speed fast with integrating their data and querying it for fast insights. Get a FREE Trial now
Should you choose Parquet, ORC or Avro as your data format? Read more.