Apache Iceberg is a Terrible Database

Zhou Sun
July 21, 2025
Musings

There are two prominent tales in data.

1. Iceberg is the solution to all pains.

Up until 2024, this was the story told by BIG BIG DATA: Spark, Trino, Snowflake, Flink.

And more recently, a story for you and me:
Just last week, Supabase launched “Analytics Buckets” with built-in Iceberg support, and TigerData (formerly Timescale) unveiled Tiger Lake (https://www.tigerdata.com/blog/tiger-lake-a-new-architecture-for-real-time-analytical-systems-and-agents) to bridge PostgreSQL and the lakehouse.

Vendors are now doing the heavy lifting to integrate Iceberg into application data platforms. See, the core of Iceberg is undisputedly really, really cool:
1. Keep all data in object storage in open file formats like Parquet.
2. Fool any execution engine into thinking these files are a table and query them with SQL directly.

And with clever design you get:
1. Interoperability across your execution engines. You don’t need to load data into a specific system’s format to get great performance.
2. ACID transactions + schema evolution.
3. True separation of storage and compute.
4. Time travel + versioning.
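
To make that pitch concrete, here's a minimal sketch using pyiceberg. Everything in it (the catalog config, the bucket, the table name) is a placeholder, and it assumes a table whose schema matches the sample rows; treat it as an illustration, not a recipe.

```python
# A minimal interoperability sketch with pyiceberg. Catalog properties,
# bucket, and table name below are hypothetical placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",                                # hypothetical catalog name
    uri="http://localhost:8181",           # e.g. a REST catalog endpoint
    warehouse="s3://my-bucket/warehouse",  # hypothetical object-store path
)

# One client appends Parquet data files plus Iceberg metadata...
table = catalog.load_table("analytics.events")
table.append(pa.table({"user_id": [1, 2], "amount": [9.99, 4.50]}))

# ...and any other Iceberg-aware engine (Spark, Trino, Snowflake, DuckDB, ...)
# can query the exact same files. Here we just read it back as Arrow:
print(table.scan().to_arrow().num_rows)
```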

2. Iceberg is cool if you're willing to build your own Snowflake (BYOS).

Iceberg was built by big data teams at Netflix, Apple, et al. For them, it’s well worth the hundreds of engineers it takes to build their own Snowflake or Databricks on raw cloud storage.

It was designed for very large + slow-changing datasets. Think huge fact tables with heavy batch writes, not high-concurrency row updates.

This is why people who start on Iceberg and expect it to JUST work like Postgres or ClickHouse face:

1. Metadata overhead: Iceberg tracks snapshots, manifests, manifest lists, and table-metadata JSONs. A one-row update triggers four S3 writes (each hundreds of ms). A regular RDBMS on SSD handles this in under 1 ms.
The result of all these metadata updates is commonly known as the “small file explosion”. And it’s up to the user to manage the maintenance of these files. This is a WILD concept for users coming from the RDBMS world, where all of this is on ‘auto-pilot’.
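
Some quick, illustrative arithmetic (assumptions, not a benchmark) shows how fast that adds up once you try anything resembling streaming ingest:

```python
# Back-of-the-envelope cost of committing row by row, using the rough numbers
# above. These are illustrative assumptions, not measurements.
commits_per_second = 1       # a very modest "streaming" ingest rate
objects_per_commit = 4       # data file + manifest + manifest list + metadata JSON
s3_put_latency_s = 0.2       # "hundreds of ms" per object write

objects_per_day = commits_per_second * objects_per_commit * 86_400
print(f"new objects per day: {objects_per_day:,}")    # 345,600
# Worst case (sequential PUTs), the latency floor for a single tiny commit:
print(f"commit latency floor: {objects_per_commit * s3_put_latency_s:.1f} s")  # 0.8 s
# Meanwhile a row update on an SSD-backed RDBMS lands in well under 1 ms.
```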

2. Limited write concurrency: Iceberg uses optimistic concurrency. Each writer tries to commit new metadata, assuming no conflict. If two collide, one fails and retries. This guarantees serializability, but at the cost of write throughput.
To contrast: Iceberg handles a few commits per minute on the same table without collisions. An RDBMS can handle thousands of transactions per second, each with full isolation, indexing, and durability.
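
If you want to picture the commit protocol, here's a toy sketch (not Iceberg's actual implementation): every writer races to swap the catalog's metadata pointer from the version it read to the version it wrote, and the loser throws its attempt away and retries.

```python
# Toy model of optimistic concurrency on a single metadata pointer.
# Purely illustrative: real catalogs do this swap atomically (e.g. with
# conditional writes) and re-plan the conflicting work before retrying.
import random
import time


class CommitConflict(Exception):
    pass


class ToyCatalog:
    """A catalog reduced to one compare-and-swap'd pointer."""

    def __init__(self):
        self.current_metadata = "v0.metadata.json"

    def swap_pointer(self, expected: str, new: str) -> None:
        if self.current_metadata != expected:
            raise CommitConflict(f"expected {expected}, found {self.current_metadata}")
        self.current_metadata = new


def commit(catalog: ToyCatalog, writer_id: str, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        base = catalog.current_metadata          # table state this writer built on
        new = f"v{random.randint(1, 10**6)}-{writer_id}.metadata.json"
        # ... write data files, manifests, manifest list, new metadata JSON ...
        try:
            catalog.swap_pointer(base, new)      # succeeds only if nobody else committed
            return new
        except CommitConflict:
            time.sleep(0.1 * (attempt + 1))      # back off, re-read, try again
    raise RuntimeError(f"{writer_id} gave up after {max_retries} conflicts")
```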

3. Read / write performance tradeoffs: From the Snowflake Iceberg docs: “performance primarily depends on how efficiently the Parquet files are written”. If another engine writes sub-optimally, Snowflake’s query speed suffers.
Investing in Iceberg requires investing in partitioning, sorting, and compaction jobs to get good performance. The classic example is the “OPTIMIZE”-style operations (e.g., via Spark or Trino) that merge small files, at additional compute cost. This extra maintenance tax is well worth it at Netflix scale. But for you and me – we’re just used to a cloud DW that just works.
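
In practice that tax usually gets paid by a scheduled compaction job. A hedged sketch: this assumes a Spark session already configured with the Iceberg runtime and a catalog registered as my_catalog; the table name and target file size are placeholders.

```python
# Periodic small-file compaction via Iceberg's rewrite_data_files procedure.
# Assumes Spark is configured with the Iceberg extensions and a catalog
# named "my_catalog"; names and sizes here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table   => 'analytics.events',
        options => map('target-file-size-bytes', '536870912')  -- aim for ~512 MB files
    )
""")
```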

4. Batch + stale data: No real indexes, no streaming. Iceberg focuses on table storage mechanics (files, metadata, transactions) but leaves many higher-level features to other layers. "Real-time" workloads require a separate DBMS.

So, who’s fixing Iceberg, and how?

AWS / R2: “We’ll host your files. The rest is your problem.”
They give you CDC connectors, a basic catalog, and occasional compaction if you pay for it. That’s it. File explosion? Your bill. You want fast queries? Run your own optimization jobs.

Snowflake: “It’s Iceberg-compatible — as long as we own it.”
You can read external Iceberg tables. But if you want write access, performance, or optimizations? Convert it to a Snowflake-managed Iceberg table — which is basically Snowflake with an Iceberg API skin.
For example, Snowflake writes 16 MB Parquet files, which work perfectly with their storage nodes and disk cache. Imagine the performance without them. (The recommended size for other engines is usually 256 MB to 1 GB.)

DuckLake (DuckDB devs): “The spec is wrong. Let’s rethink metadata.”
Instead of dozens of JSON/Avro files in S3, store metadata in an actual embedded database (DuckDB, SQLite). Way faster planning, better concurrency. Essentially, it challenges Iceberg’s core design premise that an engine should be able to plan queries by reading the object store alone.
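
A deliberately simplified illustration of the idea (this is not DuckLake’s actual schema, just the shape of it): when metadata lives in an embedded SQL database, “planning a query” is one local, indexed lookup instead of a chain of object-store reads.

```python
# Sketch: table metadata in an embedded database instead of JSON/Avro on S3.
# Schema is invented for illustration; only the data files stay in object storage.
import sqlite3

meta = sqlite3.connect("lake_metadata.db")
meta.executescript("""
    CREATE TABLE IF NOT EXISTS snapshots (
        snapshot_id  INTEGER PRIMARY KEY,
        committed_at TEXT
    );
    CREATE TABLE IF NOT EXISTS data_files (
        snapshot_id     INTEGER,
        path            TEXT,      -- Parquet file sitting in object storage
        record_count    INTEGER,
        min_event_time  TEXT,      -- per-file column stats for pruning
        max_event_time  TEXT
    );
""")

# "Query planning": a single local query that prunes files by stats,
# instead of fetching metadata JSON, a manifest list, and N manifests from S3.
files_to_scan = meta.execute("""
    SELECT path FROM data_files
    WHERE snapshot_id = (SELECT MAX(snapshot_id) FROM snapshots)
      AND max_event_time >= '2025-07-01'
""").fetchall()
```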

Hudi / Paimon: “The whole Iceberg format is too broken. We built a new one.”
Event-stream-driven companies like Uber and Alibaba have different use cases, so they introduced these new formats. They bake compaction, indexing, and streaming ingest into the format. Hudi has table services, Paimon uses LSM trees.

Food for thought: Iceberg with normal database ergonomics.

To the outsider, database engineers and big data engineers might be ‘same-same’, but we look at the same spec very differently.

Big data people, the Iceberg OGs, see ‘table’ and think JSON metadata files, compaction pipelines, eventually consistent catalog state, maybe Spark will clean it up later.

As a database kernel person myself, I immediately think indexing, concurrency, transaction isolation, hot writes, query optimization. I keep wondering....
What if we just 'copied' the obvious stuff from real databases and layered it around Iceberg?
What if we stopped treating it like a format to worship, and started treating it like a backend storage layer for a proper RDBMS?

Let’s set some bare-minimum expectations, things even SQLite does without breaking a sweat:

1. Concurrent Writes + Streaming Ingest That Works: Stop the “single writer” nonsense. No more retry storms. No more fear of write conflicts on concurrent jobs.
Let multiple writers append, update, and even delete data — without overwriting each other or bloating S3 with metadata garbage.

2. Automatic Optimization That Never Conflicts with Writes: Background compaction should just happen. And it should never require a lock or cause a conflict with a user query or write.
Optimizing a table shouldn’t be a “maintenance window” — it should be invisible, online, constant.

3. Built-in Indexing and Caching, Not an Afterthought: Not everything needs to be a billion-row scan.
Let me define an index — zonemap, Bloom, bitmap, anything — and store it with the table, queryable by any engine.

The Crazy Part? It’s Actually Not That Hard.

This isn’t impossible. It’s not even particularly original. We’ve been doing this for decades in RDBMS land.
We can have a simple WAL to record committed writes.
We can have a primary key index to apply updates and deletes instead of full scans.
We can save the compaction operation’s state, and remap the deletes that land in between.
We can also index the Parquet files and write those indexes as Puffin blobs.
You keep the open format, the decoupled compute, the time travel, the engine interoperability.
But you get actual database ergonomics. You get performance without handholding. You get operations that make sense.
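
To make that less hand-wavy, here's a minimal sketch of the buffering idea. It is emphatically not Moonlink's implementation, and every name in it is made up: row-level writes are acknowledged once they hit a local WAL and a primary-key index, then get folded into the Iceberg table as one periodic commit in the background.

```python
# Sketch: database ergonomics layered in front of Iceberg.
# Illustrative only; the actual Iceberg commit step is elided.
import json
import time


class BufferedIcebergWriter:
    def __init__(self, wal_path: str, flush_interval_s: float = 1.0):
        self.wal = open(wal_path, "a")      # durable record of acknowledged writes
        self.pk_index = {}                  # pk -> latest row, or None if deleted
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.monotonic()

    def upsert(self, pk, row):
        self._log({"op": "upsert", "pk": pk, "row": row})
        self.pk_index[pk] = row             # resolve the row by key, no table scan

    def delete(self, pk):
        self._log({"op": "delete", "pk": pk})
        self.pk_index[pk] = None

    def _log(self, entry):
        self.wal.write(json.dumps(entry) + "\n")
        self.wal.flush()                    # the write is acknowledged once it's in the WAL
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush_to_iceberg()

    def flush_to_iceberg(self):
        # In a real system: write one Parquet data file and one delete file from
        # the buffered changes, commit a single Iceberg snapshot, then truncate
        # the WAL. One commit per interval instead of one commit per row.
        self.pk_index.clear()
        self.last_flush = time.monotonic()
```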

Iceberg doesn’t need to be replaced. It just needs help thinking like a database.

That's 🥮. Starting with Moonlink: sub-second Postgres-to-Iceberg sync.