The Sinking Data Warehouse: Is Apache Iceberg the Next Step?

Is it time to move out of traditional data warehouses?

August 28, 2023


The open-source, high-performance Apache Iceberg table format has transformed data lake usage and data analytics for good, making traditional data warehouses less appealing, observes Jason Hughes of Dremio.

Amid ever-increasing volumes of data, it’s no secret that enterprises are struggling to get immediate value from that data, even as they attempt to put systems in place that can respond to its future uses. What’s on the horizon can be tough to predict. Data platforms must meet this twofold need, and core technology is driving their evolution to do so. Open-source Apache Iceberg, a high-performance format for analytic tables, is changing how businesses access data and put it to work, bringing fundamental flexibility to data analytics.

Iceberg enables unimpeded data warehousing performance on the data lake, at a time when traditional data warehouses have become more of an albatross than a lifeboat for businesses seeking cost-effective analytics. Iceberg originated in Netflix engineering, where it enabled the company to treat Amazon S3 as its data warehouse, and it has long been a production-ready open-source project driving data analytics at companies like Netflix, Adobe, Apple and many others. Its APIs have maintained compatibility in practice, and its 1.0 release late last year enshrined that compatibility as a guarantee, reinforcing its status for production-grade data warehousing and data science use cases. Iceberg has grown at a tremendous rate, with 1,559 pull requests merged in the last 12 months, and the software’s development via the Apache Software Foundation is currently supported by Amazon, Snowflake, Google, Tabular, and Dremio, among others.

A table format like Iceberg is a critical component of the new lakehouse architectures, which run analytical queries over vast volumes of data on cloud object stores like S3 and ADLS. Iceberg tables support data manipulation language (DML) operations directly on these object stores. They can be optimized in many ways, such as with partitioning, sorting, and indexes, to enable efficient data organization and processing at massive scale. At the same time, users get an easy experience because they don’t have to know the underlying details of a table to take advantage of the performance benefits.
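
To make the mechanics concrete, here is a minimal sketch of defining an Iceberg table directly on object storage and running DML against it from Spark SQL. It assumes a Spark deployment with the Iceberg runtime on the classpath; the catalog name (demo), warehouse path, schema and table names are illustrative, not a prescribed setup.

```python
# Minimal sketch, assuming Spark is launched with the Iceberg runtime JAR available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    # Enable Iceberg's SQL extensions (needed for MERGE, UPDATE, DELETE).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "demo" backed by a warehouse path on object storage.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")  # illustrative
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.analytics")

# Hidden partitioning: users filter on event_ts; Iceberg manages the daily partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.analytics.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# DML runs directly against the table sitting on object storage.
spark.sql("DELETE FROM demo.analytics.events WHERE user_id = 42")
```

Under a setup like this, the table’s data and metadata live in the object store itself, so any engine that understands the Iceberg specification can read or write the same table.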

Iceberg’s surge as the open table format standard behind lakehouses has shifted what it means to create and run modern data infrastructure. Ultimately, this new approach will sink data warehouses that require constant data movement, generate multiple copies of data, and lock companies into proprietary, often expensive solutions. When two solutions can support equivalent workloads, but one is closed and the other is open and less expensive in time, resources, and licensing, technology history has generally shown that the latter wins out.


A Storied But Outdated Analytics Paradigm

For decades, data warehouses have been instrumental for querying vast amounts of historical structured data from a variety of sources and for enabling analytical workloads to run quickly. They’ve offered effective data governance policies to ensure data availability, usability and security, and the technological capabilities to enable best practices like slowly changing dimensions and master data management. But data in a warehouse is hostage to a vendor-specific system that only the warehouse’s compute engine can use. Storage and compute in these systems are expensive (usually one, if not both), and that cost forces organizations into a tough choice: run all the workloads the business needs at a high cost, or run only some of them at a lower cost. Data warehouses also keep organizations from running the machine learning workloads they need, and they handle semi-structured and unstructured data poorly, if at all, even as those workloads become market expectations.

Data lake technology then spread because it offered cheap, massive storage for all types of data and the ability to run many different kinds of data science workloads. For many organizations, data lakes were implemented on a Hadoop ecosystem where analytics were initially enabled with frameworks like MapReduce. But those MapReduce jobs had to be written in Java, a nonstarter for the masses of SQL practitioners working with data. Facebook then built Apache Hive to convert SQL statements into MapReduce jobs, along with the Hive table format, which let analysts refer to directories of data files as tables in their SQL. As cloud object storage like S3 and ADLS became the preferred lake, Hadoop clusters and MapReduce lost favor and were replaced by many distributed, multi-language engines that could tackle data analytics, data science and machine learning. But the Hive table format remained the de facto standard.

It takes multiple technology components to provide analytics, regardless of whether you’re using a data lake or a data warehouse—including storage, file format, table format, a catalog, and engines. The data lake had all of that, but the very simplistic Hive table format was causing all sorts of problems.

And the engineering effort to optimize data lake components for analytics is significant, especially when performance and time matter. Data engineers must painstakingly configure all of the tools properly and ensure performance and ACID guarantees for high-priority or mission-critical work. Mistakes can be extremely costly. 

Giving the Engines What They Need

Query engines like Apache Spark, Apache Impala, Presto, Trino and others support analytics workloads and execute queries directly on the data lake. Because of the limitations of the Hive table format, these engines and the engineers using them ran into complexity when trying to do what a data warehouse does, like safely updating data in the data lake and delivering high performance.

Apache Iceberg has crushed the limitations that Hive tables imposed. As a vendor-agnostic, open-source table format, Iceberg makes it possible to run robust DML operations (insert, update, and delete) on its tables, provide structures that enable high performance, evolve a table’s schema and performance optimizations over time, and access historical data within any defined period, capabilities previously available only in data warehouses and other databases. Critical for compliance and governance, Iceberg’s time travel capabilities mean that companies can access and audit historical data. Iceberg has also opened up new functionality, including data versioning, transparent partitioning, and a way to ensure high performance on cloud object storage at any scale. This has empowered teams to run more workloads on the data lake with greater ease and flexibility.
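
A short continuation of the earlier sketch shows what these capabilities look like in practice against the hypothetical demo.analytics.events table; the staged updates view, the added column and the timestamp are all illustrative.

```python
# Continuing the earlier sketch; "updates" is an illustrative staged view of incoming rows.

# Upsert with MERGE INTO (enabled by Iceberg's Spark SQL extensions).
spark.sql("""
    MERGE INTO demo.analytics.events t
    USING updates u
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.analytics.events ADD COLUMN country STRING")

# Time travel: audit the table as it existed at an earlier point in time.
spark.sql("""
    SELECT count(*)
    FROM demo.analytics.events TIMESTAMP AS OF '2023-08-01 00:00:00'
""").show()

# Snapshot history is itself queryable as a metadata table.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.analytics.events.snapshots
""").show()
```

Each of these statements commits a new snapshot of table metadata rather than mutating files in place, which is what makes the time travel and audit queries possible.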

Community-driven Lakehouse Offers a Safe, Modern Harbor for Data

Iceberg is the pivotal component of a lakehouse architecture that ties together data warehouse functionality and data lake flexibility. All components of the lakehouse serve those ends and must do so cost-effectively. A commitment to Apache Iceberg and Apache Parquet and their community-driven standards calls for an open lakehouse that supports all of Iceberg’s SQL DML and DDL operations as well as streaming analytics. A diverse community of developers from different companies is a sign that the interests of any particular company will not dominate a project’s direction. A community essentially owned by one company can create conflicting incentives, where that company may be motivated to keep valuable features in proprietary parts of its stack rather than contributing them to the open-source project.

Iceberg’s vibrant innovation pairs naturally with a lakehouse that offers self-service SQL analytics, a unified view of data, and sub-second performance at a very low cost. It can even be wise to use Iceberg internally for non-Iceberg datasets in the data lake, to take advantage of its benefits. With Iceberg, users can build tables of any size: they can start small and grow to more than a trillion rows and petabytes of data, all while maintaining interactive query performance.

Part of the point of openness is that organizations can use any processing engine and eliminate vendor lock‑in. The Iceberg capabilities enable new use cases with massive datasets, like personalized healthcare, fraud detection, customer 360 data management, and clean energy development, among others. Ultimately, Iceberg is important to the data community—and any lakehouse communities pursuing efficiency and innovation—because it allows users to safely and creatively query and analyze their data at any scale in a performant, inexpensive way. 
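
As one hedged illustration of that engine independence, the same hypothetical table from the earlier sketches could be read without Spark at all, for example through the PyIceberg client library; the catalog type, URI, and filter below are assumptions made for the sake of the example.

```python
# Sketch: reading the same (hypothetical) Iceberg table without Spark, via PyIceberg.
from pyiceberg.catalog import load_catalog

# Illustrative catalog settings; in practice this could be a REST, Glue, or Hive catalog.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})

table = catalog.load_table("analytics.events")

# Scan with a filter and projection; Iceberg metadata prunes files before any data is read.
df = table.scan(
    row_filter="user_id = 42",
    selected_fields=("event_id", "event_ts"),
).to_pandas()

print(df.head())
```

Because the table format and its metadata are open, the choice of reader is incidental; the files on object storage stay put either way.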

If you’re looking to learn Iceberg from soup to nuts, three of my colleagues and I are writing Apache Iceberg: The Definitive Guide from O’Reilly, with the aim of continuing the conversation and educating the community on everything Iceberg.

New, profound technology is cementing data’s decision-driving role across industries and markets. The humble table format has come a long way thanks to the concerted efforts of hundreds of open-source contributors in the vibrant community around Apache Iceberg. That development is fundamentally changing modern data infrastructure, as we move away from decades-old data warehouse models and limited data lakes to the open possibilities of lakehouses. 




Jason Hughes

Director of Technical Advocacy, Dremio

Jason Hughes is the Director of Technical Advocacy at Dremio. Previously at Dremio, he's been a Product Director, Technical Director and a Senior Solutions Architect. He's been working in technology and data for over a decade, including roles as tech lead for the field at Dremio, the pre-sales and post-sales lead for Presto and QueryGrid for the Americas at Teradata, and leading the development, deployment, and management of a custom CRM system for multiple auto dealerships. He is passionate about making customers and individuals successful and self-sufficient. He lives in San Diego, California.