Amazon Athena and Redshift Spectrum are both AWS services that can run queries on Amazon S3 data. Both are part of the AWS environment so it is quite natural to be a bit confused about which one you should use. Before we go into details, here is a quick rundown about both of them.
AWS Athena is a fast, serverless, interactive query service. It is simple to use since you can analyze data using standard SQL. You do not need to load data into Athena, you can use Athena to query your data on S3. There is no infrastructure to manage and you only pay for the queries you run – pretty cost-effective if you ask us. Athena supports various S3 file formats including CSV, JSON, PARQUET, ORC and AVRO and allows for partitioning of data which is key to smooth large volume data queries.
Athena is easy to use but requires some amount of data preparation
To use Athena, simply point to your data on S3, define schema and start querying using standard SQL. Query results are delivered extremely fast and you can avoid the nuisance of complex ETL jobs to make your data ready for analytics. Anyone with decent SQL skills can analyze big datasets. When you are ingesting sources near real-time to S3 from structured databases using Change Data Capture methodology, you do need to merge the real-time changes with the previous data sync – so there is some data preparation involved.
It gets data from the sources – real-time using log Change Data Capture, merges the deltas and prepares data that is ready to use. The initial sync and delta sync is merged continuously and automatically on S3 using a stateless EMR and file type and compression on S3 can be configured in the software. This means data can be automatically viewed in Athena with no extra effort. It also interfaces automatically with Glue Data Catalog and AWS Lake Formation. Automated data preparation means faster querying and insights. Build an S3 Data Lake in Minutes
Things to know regarding user access on Amazon Athena
There are certain restrictions imposed by AWS on user access to Athena, which you should be aware of. There are service limits imposed by AWS – you need to refer to these before deciding if this works for you. Hence, if you have several users needing to do interactive queries or using dashboards on data on Amazon S3, this may not be the solution for you. You may need to access data on S3 via an API or via Redshift Spectrum or other means. Data Integration on Amazon S3
Amazon Redshift Spectrum
Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to query data in S3 without needing to load the data into your Redshift data warehouse. The Redshift SQL Query Editor can be used to query exabytes of data in S3 as well as on Redshift cluster tables. The analytical power of Redshift can be expanded to go beyond the local disks of the warehouse to query huge volumes of unstructured data on the Amazon S3 data lake without spending time on laborious loading or data transformation.
When BryteFlow creates a data lake on S3, Redshift or both in tandem, the initial sync and delta sync is merged automatically on S3 using a stateless EMR and file type and compression on S3 can be configured in the software. Automation means faster querying and faster insights.
Let us consider AWS Athena vs Redshift Spectrum on the basis of different aspects:
Provisioning of resources
Both Athena and Redshift Spectrum are serverless. Athena is dependent on the combined resources AWS provides to compute query results while resources at the disposal of Redshift Spectrum depend on your Redshift cluster size. With Redshift Spectrum you have the freedom to increase the Redshift cluster size to boost compute power (in case of complex queries involving large datasets or needing faster response) which can be expensive. With Athena you use whatever resources AWS allocates to your query and this can vary quite a bit in peak usage times.
Configuration of tables
Redshift Spectrum and Athena both query data on S3 using virtual tables. The Glue Data Catalog is used for schema management. Athena works directly with the table metadata stored on the Glue Data Catalog while in the case of Redshift Spectrum you need to configure external tables as per each schema of the Glue Data Catalog. AWS ETL options: AWS Glue explained
Athena is agile- for basic table scans, small aggregations and adhoc queries it is the clear winner, while Redshift Spectrum performs better when it comes to complex joins and bigger aggregations.
Athena enables you to query data in its original format on S3 i.e. structured and unstructured data. However, it is preferable you store data in a partitioned, compressed or columnar format like Parquet or ORC for best results, since different storage formats can have varying results upon querying. Athena supports file formats like Parquet, ORC, AVRO and JSON.
Please read our earlier blog on How to choose between Parquet, ORC or Avro formats to decide between the file formats.
Athena’s feature called the Glue Classifier allows Athena to support custom file types. Athena has a tight integration with Glue Data Catalog and Athena table DDLs can be auto-generated using Glue crawlers too. Tip: BryteFlow does this for you automatically when it ingests the data to S3. In the case of Redshift, the Redshift data warehouse supports structured data only at the node level, though Redshift Spectrum tables also support other storage formats like Parquet, ORC, AVRO, TEXTFILE, SEQUENCEFILE, RCFILE, RegexSerDe, Grok, CSV, Ion, and JSON. Tip: BryteFlow creates Redshift Spectrum external tables automatically when it ingests the data to S3 and or Redshift.
Redshift Spectrum may provide a more stable querying experience than Athena. This is because Redshift Spectrum compute engine distributes queries across thousands of nodes, leading to a fast performance that is not derailed even by lots of parallel queries and high concurrency. Redshift Spectrum is a good choice for running complex queries on huge data while Athena’s forte is running adhoc queries fast.
Our blog How to get your Amazon Athena queries to run 5x faster explains how to tune your Athena queries.
Tip: BryteFlow Ingest takes the effort out of partitioning data since it compresses and partitions data for you automatically as it loads to S3 – leading to even faster queries.
AWS Athena and Amazon Redshift Spectrum are similar in the sense that they are both serverless and can be used to run queries on S3 using SQL. Spectrum is a feature of Redshift whereas Athena is a standalone service. Results of queries run on Athena can be stored on S3 and loaded to Redshift if needed. Spectrum can directly join tables stored on Redshift.
Athena and Redshift integrate data in different ways. Athena has connectors built in to load data from sources external to S3 and allow you to query data in Athena without needing to load data to S3 before. For a Redshift query, Redshift Federated Query enables you to query databases and data lakes and run the same query on data stored on S3 or Redshift. It can also query live data in Amazon RDS or Aurora.
You can also ingest data into Redshift using Federated Query. While querying operational databases, you can perform transformations through Federated Query and load Redshift tables directly with the data.
Tip:You may find it easier and more intuitive to load data to Redshift with an automated data ingestion tool like Bryteflow. Just a couple of clicks to set up and no coding needed at all.
Data security is handled by Redshift Spectrum through server side encryption (SSE-S3) and uses an S3 managed AES-256 encryption key. Server-side encryption is managed with keys from the AWS Key Management Service (SSE-KMS). The Redshift cluster is launched within a VPC (Virtual Private Cloud) for further security. There is no support for S3 client-side encryption. User permissions cannot be controlled for an external table with Redshift Spectrum but permissions can be granted or revoked for external schema. Read more about data security on S3
For AWS Athena data access security can be controlled in 3 ways:
- With AWS IAM (Identity and Access Management) policies
- Access Control Lists
- Amazon S3 bucket policies
IAM offers users fine-grained control over access to your S3 bucket. Security can be tweaked to enable different users to view different sets of data and to provide access to data of other users. You can also restrict users from querying data with Athena by controlling access to S3. Athena lets you easily query encrypted data on S3 and store the encrypted results in your S3 bucket. Both-client-side and server-side encryption are supported.
Other points of difference between Athena and Redshift Spectrum
SQL Syntax: Athena is derived from Presto while Redshift uses Postgres as a foundation. Presto allows querying of data where it resides and is an open source distributed SQL query engine that can run analytic queries against data sources of varying sizes.
Redshift Spectrum uses PostgreSQL also known as Postgres. This is a free and open-source relational database management system (RDBMS) that emphasizes extensibility and is SQL compliant. It can handle a range of workloads, from single machines to data warehouses or Web services with many concurrent users.
Data Types: AWS Athena offers support for complex data types like STRUCT, ARRAY, and MAP while Redshift Spectrum does not support these.
Connectivity: You can connect to Athena with standard API, JDBC or ODBC drivers, for Redshift however, many products offer out-of-the-box connections.
Scalability: With Athena, storage and compute are separated. You could store any amount of data on S3 and query only what you want on Athena with high performance – a practice which is highly economical. AWS allocates compute resources so Athena auto-scales according to the complexity of your query. For Redshift Spectrum to increase compute power, you will either need to add more nodes or spin up new clusters which is more expensive. Spectrum also cannot be used without a Redshift instance in place.
User defined functions: With Amazon Redshift you get support for User Defined Functions (UDFs). These are processes that define parameters, perform actions and return the result as a value. Athena does not support UDFs.
Setup time: AWS Athena allows you to be faster out of the gate since it is quicker and simpler to set up. You can point it to your data on S3 and begin querying immediately. With Redshift Spectrum, the process is longer since you need to prepare a cluster of computing resources and create tables to load your data into.
Additional requirements: A SQL client and a cluster are needed for Redshift Spectrum to run on which are provided by Redshift. Athena requires no additional setup of infrastructure. It creates external tables on Amazon S3 datasets.
Management: With Athena you do not need to actively manage the process since it is fully managed by AWS. It allocates resources and automatically scales them according to requirement. With Redshift Spectrum, as the name suggests, it works in tandem with a Redshift cluster – so this needs to be provisioned.
Storing results: Athena is read-only and does not change data on S3 but results of queries can be written to S3. Redshift Spectrum is also read-only and cannot perform operations like insert, delete or update on external tables. Users can view tables on Redshift as well as less frequently accessed tables on S3 created by Redshift Spectrum, providing a unified view of the data.
Location: Athena allows you to access data from a different region than from where the query was initiated. For Redshift Spectrum to access the data, the Amazon Redshift cluster and the S3 bucket need to be in the same AWS region.
So which one should you choose – AWS Athena or Redshift Spectrum?
A big factor that can influence your decision is where your data is stored. If you are already using Redshift, using Spectrum to query data on S3 might be a no-brainer – especially if you are running queries against very large datasets that include complex joins and bigger aggregations. Redshift Spectrum may also provide a more consistent querying performance since it does not depend on pooled resources provided by AWS. Of course, you might end up spending a bit more if you need additional nodes or have to spin up another cluster to increase compute power. On the other hand, if the majority of your data is on Amazon S3, and enabling queries for a small number of users is the objective, using Athena would be more intuitive, faster and cost-effective. We have customers that use Athena with their visualization software for real-time dashboards on the real-time data that BryteFlow syncs to S3. This is even more true if you are running adhoc queries that involve basic table scans and small aggregations.
Get a supported free trial of BryteFlow
Whichever service you decide to use, you can take the hard work (read manual coding) out of querying by going with a completely automated data ingestion and transformation tool like Bryteflow. Data scientists typically use upto 80% of their time preparing data for querying. BryteFlow allows you to free up all that time by providing automatic data ingestion, preparation and transformation – no coding required. BryteFlow Ingest and XL Ingest are our data ingestion tools while BryteFlow Blend takes care of data transformation.