Test-Driving S3 Tables

In December 2024, AWS re:Invent introduced major advancements across AI/ML, serverless computing, databases and storage that will shape the future of cloud technology. One of the most significant announcements was Amazon S3 Tables, a new feature purpose-built for storing tabular data and optimized for analytics workloads. S3 Tables integrate with services like Athena, Glue, EMR and Redshift, enabling faster analytical queries and more performant transactions.

Potential use cases include typical big data analytics such as storing and analyzing historical patient data and genomic data for medicine research (Healthcare and Life Sciences), IoT data analysis for quality control and data processing for predictive equipment maintenance (Manufacturing) and supply-chain analysis (Logistics and eCommerce).

S3 Tables offer several standout features:

Optimized storage: Purpose-built table buckets offer better analytical query performance and higher transactions-per-second (TPS) while maintaining S3’s durability and scalability.
Automated optimization: Automatically manages compaction, snapshot cleanup and orphan file deletion to enhance performance and reduce costs.
Apache Iceberg support: Enables schema and partition evolution, transactional consistency and time travel queries.

Loka’s engineers set out to benchmark S3 Tables by comparing performance, transaction capabilities and cost across three setups: one using the new S3 Tables feature, one using native Apache Iceberg tables with medium sized files and one using Iceberg tables with small files. Our goal was to learn which setup works best for ingestion and querying.

The following comparison focuses on key areas including analytical query performance, transaction speed and associated costs. The results will highlight the strengths of S3 Tables and Iceberg Tables, making it easier to determine which option best suits different scenarios.

Benchmark Environment

To compare the performance of S3 Tables and Iceberg Tables, we set up an environment in AWS. The setup includes S3 General Purpose buckets for data storage, an S3 table bucket, AWS Glue jobs for data ingestion and query execution, and Amazon Athena for running analytical queries.

S3
The raw data generated by the TPC-H data generation tool was initially stored in an S3 General Purpose bucket. Two additional S3 General Purpose buckets were used to store the Iceberg Tables and Iceberg Tables with small files, respectively.

S3 Tables
Table buckets store tabular data and metadata as objects for use in analytics workloads. For this benchmark we created a dedicated table bucket and enabled the "Integration with AWS analytics services" feature.

Glue Job
We configured all AWS Glue jobs using the same worker type: G.2X (8 vCPUs, 32 GB RAM) with four workers in total.
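For readers who want to reproduce a comparable setup, the worker configuration above can be captured as keyword arguments for boto3's `glue.create_job` call. This is only a sketch: the job name, IAM role ARN, script location and Glue version below are hypothetical placeholders, since the article does not state them. Only the worker type and worker count come from the benchmark description.

```python
def glue_job_args(job_name, role_arn, script_s3_path, glue_version="4.0"):
    """Build keyword arguments for boto3's glue.create_job.

    WorkerType and NumberOfWorkers mirror the benchmark setup.
    glue_version is an assumption; the article does not name a version.
    """
    return {
        "Name": job_name,
        "Role": role_arn,
        "GlueVersion": glue_version,
        "WorkerType": "G.2X",      # 8 vCPUs, 32 GB RAM per worker
        "NumberOfWorkers": 4,
        "Command": {
            "Name": "glueetl",     # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        # Enables Glue's built-in Iceberg support for the job.
        "DefaultArguments": {"--datalake-formats": "iceberg"},
    }


# Hypothetical names, for illustration only.
args = glue_job_args(
    job_name="tpch-ingest-example",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueRole",
    script_s3_path="s3://example-scripts-bucket/ingest.py",
)

# With boto3 installed and credentials configured, the job could be created with:
#   import boto3
#   boto3.client("glue").create_job(**args)
```

Keeping the arguments in one function makes it easy to create the three ingestion jobs with identical compute settings, so that timing differences come from the table setup rather than the cluster.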

All jobs were run using a similar configuration and were used to:

Load small Parquet files (~1 MB) into a native Iceberg Table
Load medium Parquet files (between 40 MB and 230 MB, depending on the table) into another native Iceberg Table
Load Parquet files into the S3 Tables table bucket (with a compaction target file size of 512 MB)

The same Glue jobs were also used for query execution, each running queries against its own table setup so that performance could be compared across S3 Tables and Iceberg Tables.
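The article does not show how the Spark sessions inside the Glue jobs were wired to each table type. One plausible arrangement, sketched below, registers a Glue-catalog-backed Iceberg catalog for the General Purpose buckets and a separate catalog backed by the S3 Tables catalog implementation for the table bucket. The catalog names, bucket name and ARN are hypothetical, and the S3 Tables catalog also requires its client library on the job's classpath.

```python
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def iceberg_glue_catalog_conf(catalog, warehouse_s3_uri):
    """Spark settings for Iceberg tables in a General Purpose bucket,
    tracked in the AWS Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse_s3_uri,
    }


def s3_tables_catalog_conf(catalog, table_bucket_arn):
    """Spark settings for tables stored in an S3 table bucket."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "software.amazon.s3tables.iceberg.S3TablesCatalog",
        f"{prefix}.warehouse": table_bucket_arn,
    }


# Hypothetical identifiers, for illustration only.
native_conf = iceberg_glue_catalog_conf(
    "glue_catalog", "s3://example-iceberg-bucket/warehouse/"
)
tables_conf = s3_tables_catalog_conf(
    "s3tablesbucket",
    "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",
)

# Inside a Spark job the settings could be applied like this:
#   builder = SparkSession.builder
#   for key, value in tables_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Because both setups expose standard Iceberg catalogs to Spark, the same ingestion and query code can target either one by changing only the catalog name.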

Athena
We used Amazon Athena to execute queries and to compare the performance between S3 Tables and Iceberg Tables.

Data
TPC-H is a decision support benchmark. We used its dataset and queries to evaluate and compare the performance of S3 Tables and Iceberg Tables.

For this benchmark we generated 104 GB of raw data using the TPC-H data generation tool. We stored the dataset in an S3 General Purpose bucket in Parquet format, with each table organized into a separate folder.

Benchmark Methodology
To evaluate the performance and cost-effectiveness of S3 Tables and Iceberg Tables, we designed a structured benchmark covering data ingestion, query execution and cost analysis.

To analyze analytical query performance, we selected a set of TPC-H queries, which are designed to evaluate decision support systems, especially for large-scale data processing and analytic workloads. These queries were executed against all three table setups using AWS Glue jobs with identical configurations. We tracked and compared the ingestion performance of the Glue jobs and captured Spark UI logs to analyze the execution time of each query. The queries used in our tests are listed in Diagram 1 with their corresponding query IDs.
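The timings in this benchmark were read from Spark UI logs. As an alternative for readers without access to those logs, a small wall-clock harness can wrap each query call. The sketch below is not the method used in the article; `time_query` and `summarize` are hypothetical helpers, and the lambda is a stand-in workload so the example runs anywhere.

```python
import statistics
import time


def time_query(run_query, repeats=3):
    """Call a zero-argument function `repeats` times and return the
    wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce repeated timings to min, median and max."""
    return {
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


# Stand-in workload. In a Spark job this could instead be
#   lambda: spark.sql(query_text).collect()
durations = time_query(lambda: sum(range(100_000)), repeats=3)
summary = summarize(durations)
print(summary)
```

Reporting the median over several runs reduces the influence of one-off effects such as cold starts, which matters when the differences between setups are small.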

As an additional step, we used Amazon Athena to run queries directly against the data. This allowed us to observe how Athena interacts with the different table formats and analyze query performance.
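Athena reports execution statistics for every query, so timings can be collected programmatically rather than read from the console. The sketch below assumes a boto3 Athena client is passed in; the function name, database, catalog and output location are hypothetical, and the article does not say how its Athena timings were gathered.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(athena, sql, database, output_location,
                     catalog=None, poll_seconds=1.0):
    """Start a query, wait for it to finish and return its statistics.

    `athena` is expected to behave like boto3.client("athena").
    """
    context = {"Database": database}
    if catalog is not None:
        context["Catalog"] = catalog

    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext=context,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")

    stats = execution["Statistics"]
    return {
        "query_id": query_id,
        "engine_seconds": stats["EngineExecutionTimeInMillis"] / 1000,
        "scanned_bytes": stats["DataScannedInBytes"],
    }


# Example call (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   result = run_athena_query(
#       boto3.client("athena"),
#       "SELECT count(*) FROM lineitem",
#       database="tpch",
#       output_location="s3://example-athena-results/",
#   )
```

Using the engine execution time rather than local wall-clock time excludes polling delay and network latency from the measurement.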

As a final step, we performed a cost analysis by referencing the AWS pricing documentation for both S3 table buckets and General Purpose buckets in the region where our data was stored.

The final benchmark results are presented in the Results section.

Benchmark KPI Definitions

When comparing the efficiency of S3 Tables and Iceberg Tables in a General Purpose bucket, we focused on three key evaluation metrics: analytical query performance, transactions per second and cost effectiveness.

Analytical Query Performance

Analytical query performance measures how quickly and efficiently data can be retrieved and processed. We evaluated this by running a set of TPC-H queries against the benchmark dataset and comparing their execution times. These queries represent real-world analytical scenarios. Faster query execution means better performance and responsiveness, which is important for efficient data analysis.

Transactions per Second

To approximate transactions per second (TPS), we measured how long it took to ingest the same TPC-H dataset into S3 Tables, Iceberg Tables and Iceberg Tables with small files. A shorter ingestion time implies a higher TPS, which is important when handling large volumes of data or frequent updates.
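The "times faster" figures reported later are ratios of elapsed times, and ingestion throughput is the amount of work divided by the elapsed time. A minimal sketch of both calculations follows. The function names are hypothetical and the numbers in the example are made up for illustration; they are not measurements from this benchmark.

```python
def speedup(baseline_minutes, candidate_minutes):
    """How many times faster the candidate is than the baseline."""
    if baseline_minutes <= 0 or candidate_minutes <= 0:
        raise ValueError("durations must be positive")
    return baseline_minutes / candidate_minutes


def rows_per_second(row_count, elapsed_minutes):
    """Average ingestion throughput over the whole job."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed time must be positive")
    return row_count / (elapsed_minutes * 60)


# Made-up example: a baseline job takes 12 minutes, the candidate takes 4.
print(speedup(12.0, 4.0))               # 3.0
print(rows_per_second(6_000_000, 2.0))  # 50000.0
```

A speedup of 3.0 means the candidate finished in one third of the baseline's time, which is how the ingestion and query comparisons in the Results section should be read.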

Cost Effectiveness

Cost effectiveness plays a key role in deciding between data storage options. For Iceberg Tables, we calculated the storage cost based on the S3 Standard General Purpose bucket used to store the data. For S3 Tables, we considered storage plus the additional maintenance costs of compaction and monitoring.

By evaluating these metrics, we analyzed the trade-offs between using S3 Tables and Iceberg Tables in a General Purpose bucket, comparing their differences in both cost and performance.

Results

Analytical Query Performance Comparison

Efficient query execution is crucial for optimizing data processing workflows, and the choice of table format plays a significant role in performance. In this analysis, we compare the execution times of queries across three different table formats: S3 Tables using Apache Iceberg format, Iceberg Tables in a General Purpose bucket and Iceberg tables with small files in a General Purpose bucket. The inclusion of Iceberg Tables with small files is important because in real-world scenarios poorly optimized data ingestion can lead to small file fragmentation, which may negatively impact query performance.
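To get a feel for why small files matter, it helps to estimate how many objects a query engine has to open at each file size. The arithmetic below is a rough sketch: it assumes the stored data is about the same size as the 104 GB raw dataset, which will not hold exactly once data is rewritten as Iceberg tables, and it uses the ~1 MB and 512 MB file sizes mentioned in the environment description.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def file_count(dataset_bytes, target_file_bytes):
    """Number of files needed if every file is filled to the target size."""
    return math.ceil(dataset_bytes / target_file_bytes)


dataset = 104 * GB  # assumption: stored size equals the raw dataset size

small_files = file_count(dataset, 1 * MB)
compacted_files = file_count(dataset, 512 * MB)

print(small_files)      # 106496
print(compacted_files)  # 208
```

Under these assumptions the small-file layout needs roughly 500 times as many objects as a layout compacted to 512 MB, and each object adds request and metadata overhead during planning and scanning.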

Execution times for S3 Tables were 4.6 times to 37.6 times faster than Iceberg Tables with small files, a significant difference that we expected. This gap shows how small files can slow down performance and reduce query efficiency.

On the other hand, when comparing S3 Tables to Iceberg Tables with medium sized files, the performance difference was smaller but still noticeable, with S3 Tables executing queries up to two times faster. The improvement is visible even for short-running queries (a few minutes in duration), but it matters most for longer-running queries that involve extensive computation or large data scans. There the same relative speedup removes far more absolute execution time, making S3 Tables a more efficient and scalable choice.

Diagram 1 compares query performance across S3 Tables, Iceberg Tables and Iceberg Tables with small files, listing query execution times in minutes for each case. Its last two columns highlight the performance difference between S3 Tables and each of the two Iceberg setups. The same results are shown as bar charts in Diagram 2 and Diagram 3, which show that S3 Tables outperform Iceberg Tables in query execution time.

Diagram 1. Query execution time comparison across different table formats using Glue Job (PySpark)

Diagram 2. Query execution time comparison graph showing Iceberg Tables with small files as the slowest, followed by Iceberg Tables and S3 Tables as the fastest

Diagram 3. Query execution time comparison between Iceberg Tables and S3 Tables, excluding Iceberg Tables with small files for clearer visualization

Additionally, we executed a subset of the queries using Athena to compare execution performance in a different query engine. The results in Athena differed from those observed when using Glue jobs. In most cases, queries executed on S3 Tables had longer execution times compared to those on Iceberg Tables. However, when compared to Iceberg Tables with small files, queries on S3 Tables still performed better, with shorter execution times. Diagram 4 provides a detailed comparison of query performance for S3 Tables, Iceberg Tables and Iceberg Tables with small files when executed in Athena.

It is also important to note that, according to the AWS documentation, the Athena integration for S3 Tables was still in preview and subject to change when we ran these queries. This could explain the results observed in Athena. Future improvements and updates may change performance and lead to different outcomes over time.

Diagram 4. Query execution time comparison across the three different table formats using Athena

Transactions per Second

Diagram 5 compares ingestion performance across S3 Tables, Iceberg Tables and Iceberg Tables with small files, based on Glue job execution time measured in minutes.

The last two columns of the diagram show how much faster S3 Tables ingested data than each of the two Iceberg setups.

Diagram 5. Ingestion time comparison across different table formats using Glue Job (PySpark)

The data in Diagram 5 clearly shows that S3 Tables consistently outperform both Iceberg Tables and Iceberg Tables with small files in ingestion speed, our proxy for transactions per second.

The difference is especially noticeable when comparing S3 Tables to Iceberg Tables with small files. In these cases, S3 Tables were 2.4 to 6.5 times faster, with the supplier table showing the largest improvement at 6.5 times faster.

Compared to Iceberg Tables, S3 Tables remain faster, by a factor of 1.7 to 3.1, with the customer table showing the largest improvement at 3.1 times faster.

The graph below visualizes ingestion performance across S3 Tables, Iceberg Tables and Iceberg Tables with small files.

Diagram 6. Ingestion execution time comparison showing Iceberg Tables with small files as the slowest, followed by Iceberg Tables, with S3 Tables as the fastest

Cost

To assess the cost-effectiveness of S3 Tables versus Iceberg Tables, we based our analysis on AWS pricing for storage in a specified region.

We compared the costs for two data volumes, 10 TB and 500 TB, to understand how the storage cost scales as data grows. The results are shown in Diagram 7, a line chart illustrating the cost difference between the two storage types.

For both the 10 TB and 500 TB cases, we found that storing data in S3 table buckets was approximately 17.5% more expensive than storing it in a General Purpose bucket. This difference was consistent across both scenarios, showing a clear trend that S3 Tables result in a higher cost. Note that the S3 Tables figure also includes the costs of monitoring and compaction.
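The structure of this comparison can be sketched as a small calculation: General Purpose storage is billed per GB-month, while a table bucket adds per-object monitoring and compaction charges on top of its own storage rate. The function names are hypothetical, and every rate and volume in the example is a placeholder to be replaced with the values from the AWS pricing page for your region and the object counts of your own tables. They are not the inputs used for the figures in this article.

```python
def general_purpose_monthly_cost(storage_gb, storage_price_per_gb):
    """Monthly storage cost for data in a General Purpose bucket."""
    return storage_gb * storage_price_per_gb


def table_bucket_monthly_cost(storage_gb, storage_price_per_gb,
                              object_count, monitoring_price_per_1000_objects,
                              compacted_objects, compaction_price_per_1000_objects,
                              compacted_gb, compaction_price_per_gb):
    """Monthly cost for a table bucket: storage + monitoring + compaction."""
    storage = storage_gb * storage_price_per_gb
    monitoring = object_count / 1000 * monitoring_price_per_1000_objects
    compaction = (compacted_objects / 1000 * compaction_price_per_1000_objects
                  + compacted_gb * compaction_price_per_gb)
    return storage + monitoring + compaction


def premium_percent(candidate_cost, baseline_cost):
    """How much more expensive the candidate is, as a percentage."""
    return (candidate_cost - baseline_cost) / baseline_cost * 100


# Placeholder inputs for a 10 TB table (all values are assumptions).
storage_gb = 10 * 1024
baseline = general_purpose_monthly_cost(storage_gb, storage_price_per_gb=0.023)
tables = table_bucket_monthly_cost(
    storage_gb, storage_price_per_gb=0.0265,
    object_count=20_000, monitoring_price_per_1000_objects=0.025,
    compacted_objects=10_000, compaction_price_per_1000_objects=0.004,
    compacted_gb=100, compaction_price_per_gb=0.05,
)
print(f"General Purpose: ${baseline:.2f}/month")
print(f"Table bucket:    ${tables:.2f}/month")
print(f"Premium:         {premium_percent(tables, baseline):.1f}%")
```

Separating the three cost components makes it clear that the premium depends on how many objects are monitored and how much data is compacted each month, not on the storage rate alone.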

Diagram 7. Cost comparison between data stored in an S3 table bucket and an S3 General Purpose bucket for various data volumes

The Verdict

S3 Tables consistently deliver better performance than Iceberg Tables in both TPS and analytical query performance.

One contributing factor is automatic table maintenance, which handles tasks such as data compaction and snapshot management in the background, reducing manual effort and improving efficiency. However, unlike Iceberg Tables, S3 Tables do not provide direct access to the underlying Parquet files, which limits file-level control.

When it comes to ingestion, S3 Tables are significantly faster: up to 3.1 times faster than Iceberg Tables and up to 6.5 times faster than Iceberg Tables with small files.

Beyond ingestion speed, S3 Tables also offer query performance up to two times faster than Iceberg Tables and up to 37.6 times faster than Iceberg Tables with small files. Their ability to retrieve data more efficiently makes them well-suited for frequent querying and high-throughput workloads.

This performance advantage comes with a tradeoff. Storage in S3 Tables is approximately 17.5% more expensive than for Iceberg Tables in a General Purpose bucket, making cost an important factor when dealing with large-scale data. When query execution speed and ingestion time are top priorities, S3 Tables may be a better choice despite the higher storage cost.

On the other hand, if workloads involve less frequent queries and transactions and cost efficiency is the primary concern, the higher storage cost of S3 Tables may outweigh their performance benefit, especially at very large data volumes. In the end, the decision depends on the specific needs of the workload and the balance between performance and cost.

Artificial Intelligence

March 20, 2025

Loka Staff

Henrique Silva

Head of Data at Loka

AND

Eva Spirovska

Data Engineer

AND

Bisera Stojmanovska

Data Engineer

AND

Piotr Bialka

Data Engineering Manager
