Query one Apache Iceberg table from Trino, Spark, and DuckDB
One Iceberg table can look like three different systems depending on which engine you point at it. Trino wants a catalog and a SQL session. Spark wants a catalog plus Iceberg extensions. DuckDB can read the table directly from metadata, or attach a catalog when you need the fuller Iceberg feature set.
The table does not change. The access path does. That is the part people usually skip, which is how they end up with three copies of the same data and a fourth copy of the same confusion.
We use this pattern when teams want one open table format and more than one query engine. It is also why the Databasin model keeps storage and access control separated instead of making every tool act like the center of the universe. If you want the broader control-plane version of that idea, start with the single-platform lakehouse.
TL;DR Iceberg gives Trino, Spark, and DuckDB a shared table format, but each engine reaches it differently. Trino usually queries through a catalog, Spark uses an Iceberg catalog plus Spark extensions, and DuckDB can either scan a table directly from metadata or attach a REST catalog. The practical win is that the table stays in one place while you choose the engine that fits the job. The practical cost is that freshness, permissions, and write support are not identical across engines, so you still need to be explicit about catalogs and snapshots.
flowchart LR
subgraph Storage[Object storage]
Files[(Iceberg data files)]
Meta[(Iceberg metadata files)]
end
subgraph Catalog[Catalog]
Cat[(Hive Metastore / REST / Nessie / Glue)]
end
Trino[Trino 482]
Spark[Spark 4.1.2 + Iceberg 1.11.0]
DuckDB[DuckDB current Iceberg extension]
Cat --> Meta
Meta --> Files
Trino --> Cat
Spark --> Cat
DuckDB --> Meta
Trino -->|SQL query| Files
Spark -->|SQL or DataFrame query| Files
DuckDB -->|iceberg_scan or attached catalog| Files
What Iceberg changes
Apache Iceberg is an open table format for analytic datasets. The useful detail here is not the marketing sentence; it is the metadata model. Queries do not hunt through a warehouse directory tree looking for files. They read table metadata, resolve the current snapshot, and then scan the files that belong to that snapshot.
That gives you three things that matter when you want the same table available from multiple engines.
- A table name points at a snapshot, not a random pile of files.
- The snapshot can move forward atomically when new metadata is committed.
- Different engines can agree on the table layout as long as they agree on the catalog and file formats.
That last part is the reason Trino, Spark, and DuckDB can all read the same Iceberg table without format conversion. They are not decoding a proprietary warehouse layout. They are reading the same metadata contract.
The catch is that the contract is only as clean as the catalog and snapshot discipline around it. If one engine is looking at a stale catalog cache, or a local DuckDB session is pointed at an old metadata file, you can absolutely query the wrong snapshot while thinking everything is fine. Fun, in the way dry rot is fun.
Three ways to reach the same table
There are really only three access patterns in play here, even if the tooling makes them look more complicated.
| Access pattern | What it uses | Best for | Main downside |
|---|---|---|---|
| Catalog-backed SQL | Trino, Spark, DuckDB with an attached REST catalog | Shared team access, repeatable queries, writes | More session setup, more catalog dependency |
| Direct metadata scan | DuckDB iceberg_scan on a metadata file |
Local inspection, quick analysis, one-off reads | Read-only, snapshot must be explicit |
| Engine-specific job runtime | Spark with Iceberg extensions and a catalog | ETL, modeling, multi-step transforms | Heavier session and cluster overhead |
That table is the practical heart of the post. Iceberg is not trying to make every engine identical. It is trying to make the storage contract stable enough that each engine can use its own runtime without inventing a new table format.
For most teams, the useful decision is simple:
- If you need a shared SQL interface, use Trino.
- If you need a distributed job runtime, use Spark.
- If you need a fast local read, use DuckDB.
The failure mode is when those three roles blur together. A local investigation turns into a “temporary” pipeline. A Spark job becomes the only place anyone knows how to read the table. A Trino catalog cache quietly lags behind the snapshot a DuckDB session is reading directly. The point of Iceberg is that you do not have to build a new data copy every time you want a different answer path. The point is not that you can ignore the boundaries.
One more detail matters here: snapshot identity. When you query the same table in three engines, you are not just comparing SQL syntax. You are comparing what each engine believes the current snapshot is. If that sounds pedantic, it is not. Snapshot drift is exactly how people end up arguing about a result that is correct in one shell and wrong in another.
So before you blame the engine, check the table state. Is the catalog refreshed? Is the metadata file current? Are you reading through a catalog-backed path or a direct-read path? The answer usually explains the discrepancy faster than the query plan does.
Trino: query through a catalog
Trino is the easiest way to hit an Iceberg table from shared SQL when the data already lives in object storage and the catalog is in place. Trino’s Iceberg connector supports catalog-backed access, and the connector documentation is explicit about the requirements: Trino needs network access to the object store and a supported catalog such as Hive Metastore, Glue, JDBC, REST, Nessie, or Snowflake.
The query itself is normal SQL. The part you need to get right is the catalog reference.
-- Trino 482
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
If you want the shortest operational summary, it is this:
- Configure an Iceberg catalog for Trino.
- Point that catalog at the same warehouse or REST endpoint your other engines use.
- Query the table with a catalog-qualified name.
Trino is strong here because it stays close to the data and is good at interactive, shared SQL. It is not the best choice if your only goal is a local ad hoc read from a laptop, and it is not the most convenient engine if you need to run transformations and writes in the same job.
That trade-off matters. Trino is a query engine, not your storage strategy. Keep that straight and you will avoid a lot of accidental architecture.
If the broader question is why you should keep this kind of data on an open lakehouse at all, the single-platform lakehouse framing is the right place to start. If the answer is a warehouse-plus-bolt-ons stack, read the Databasin vs Snowflake comparison before you buy the pain twice.
Spark: query the same table from a Spark catalog
Spark is the right answer when querying is only part of the job. If you are also transforming data, building downstream tables, or integrating with a batch pipeline, Spark is usually the engine people already have on the hook.
Iceberg’s Spark docs separate the two steps clearly: you configure Spark to speak Iceberg, then you query the table through a Spark catalog. The table path does not change. The session configuration does.
# Spark 4.1.2 + Apache Iceberg 1.11.0
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.11.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.lakehouse.type=rest \
--conf spark.sql.catalog.lakehouse.uri=https://catalog.example.com
-- Spark SQL in Spark 4.1.2 with Iceberg 1.11.0
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
Spark is the least subtle of the three tools in this comparison. That is not a flaw. It is a reminder that Spark was built for distributed computation first and interactive exploration second. If you just need to inspect a table, Spark can feel heavy. If you need to read, transform, and write in the same place, the extra weight is often justified.
The downside is operational. Spark needs more session setup than DuckDB and more ceremony than Trino for a simple read. It is also easier to drift into a “while we are here, let’s build a job” mindset, which is how a simple query turns into a small platform project with a weekend soundtrack.
DuckDB: read directly, or attach a catalog
DuckDB is the most convenient way to inspect an Iceberg table locally. The Iceberg extension can read individual tables directly from storage with iceberg_scan, and it can also work with catalog-managed tables through an attached Iceberg REST catalog.
The direct-read path is the one most people want first, because it requires almost no setup.
-- DuckDB current Iceberg extension
INSTALL iceberg;
LOAD iceberg;
SELECT order_id, customer_id, total_amount
FROM iceberg_scan('s3://warehouse/analytics/orders/metadata/v1.metadata.json')
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
That example matters for one reason: DuckDB can query an Iceberg table without asking you to wire up a catalog first. But the docs also make the limitation clear. Direct reads are read-only, and they need a metadata file or version hint unless you deliberately opt into guessing. If you want writes or richer catalog behavior, attach a REST catalog instead.
-- DuckDB current Iceberg extension
CREATE SECRET iceberg_secret (
TYPE iceberg,
CLIENT_ID 'admin',
CLIENT_SECRET 'password',
OAUTH2_SERVER_URI 'https://catalog.example.com/v1/oauth/tokens'
);
ATTACH 'warehouse' AS lakehouse (
TYPE iceberg,
SECRET iceberg_secret,
ENDPOINT 'https://catalog.example.com'
);
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
DuckDB is the fastest path to “show me the table right now.” It is not the same thing as “this is my shared production query engine.” That distinction matters. The direct-read path is great for inspection and analysis; the catalog path is what you use when you want the same Iceberg semantics Trino and Spark are using.
Which engine should you use?
Use the engine that matches the job, not the one with the loudest fan club.
| Job | Best fit | Why |
|---|---|---|
| Shared SQL across a team | Trino | Low-friction SQL on top of a shared catalog |
| ETL, modeling, or batch transforms | Spark | Reads and writes fit the same distributed job |
| Local inspection or quick analysis | DuckDB | Minimal setup, direct reads from metadata |
| One engine for all three jobs | None of them alone | That is where the architecture, not the syntax, starts to matter |
The important question is not “can it query the table?” All three can. The better question is “what happens when the same table is queried by three engines, three caches, and two people who each assume the other one refreshed the catalog?”
That is where teams start paying for overlap. If your answer involves three separate systems doing adjacent jobs, compare that with the Databasin approach to an open lakehouse with fewer moving parts. If you want the blunt version of that comparison, the Databasin vs Snowflake page is the right sibling link.
Trade-offs and failure modes
Iceberg is the right abstraction for this pattern, but it does not erase the cost of operating multiple engines.
Catalog freshness is not free. Trino and Spark generally depend on catalog-backed reads. If caches or refresh behavior differ, two engines can briefly disagree about the latest snapshot.
DuckDB direct reads are intentionally limited. They are excellent for ad hoc analysis, but they are read-only and depend on explicit metadata paths or hints. That is a feature, not a bug.
Write support is not symmetric. Iceberg gives you a common table format, but each engine exposes a different subset of the table lifecycle. The docs for Trino, Spark, and DuckDB do not all promise the same write path.
Session setup still matters. “It is an Iceberg table” is not enough. You still need the correct catalog, file system access, and, in DuckDB’s direct-read mode, the right metadata file.
Engine-specific optimizations are real. Trino can be excellent for shared interactive SQL. Spark is still the workhorse for distributed processing. DuckDB is a very good local analyst’s knife. Pretending they are interchangeable is how you end up with awkward compromises.
If that sounds like a lot of plumbing for one table, yes, that is the point. Multiple engines buy flexibility, but only if you are disciplined about what each engine owns. Otherwise you get the bill and the complexity.
FAQ
Can Trino, Spark, and DuckDB query the same Iceberg snapshot?
Yes, if they point at the same catalog and metadata state. Iceberg’s snapshot model is what makes this possible. The risk is operational drift, not file format mismatch.
Do I need to copy data into each engine?
No. The whole point of Iceberg is that the table lives in object storage and the engines read the same files through the same metadata contract.
Can DuckDB write to the same Iceberg table?
Only through the catalog-managed Iceberg path. DuckDB’s direct table reads are read-only, which is exactly why they are so easy to use for inspection.
Why does DuckDB need a metadata file or version hint?
Because direct reads resolve the table from Iceberg metadata, not from a catalog lookup. The metadata file tells DuckDB which snapshot to read.
Which engine is best for interactive SQL?
Trino is usually the best default for shared interactive SQL against a catalog-backed Iceberg table. DuckDB is better when you want local analysis without setting up much infrastructure.
What is the main downside of using all three?
You now have three valid ways to reach the same table, which means three places to misconfigure access, caching, or snapshot freshness. Flexibility is real. So is the coordination tax.
Conclusion
The useful trick is not that Trino, Spark, and DuckDB can all read Iceberg. The useful trick is that they can all read the same table without forcing you to move the data around. Once you see that clearly, the rest is an engine-choice problem: Trino for shared SQL, Spark for distributed processing, DuckDB for local inspection.
If you want to keep the open-table-format benefits without turning the rest of the stack into a hobby, start with Start Free. If you are still deciding whether the open-lakehouse trade-off is worth it, read Databasin vs Snowflake and the single-platform lakehouse overview first.