When AWS Glue Jobs Quietly Burn $22k a Month

A Spark + Iceberg story about DPUs, partitions, and one tiny data type choice

Jun 30, 2026

Cloud costs don’t always explode because of scale.
Sometimes they explode because of details.

This is a story about a few AWS Glue jobs, Apache Iceberg, and how a single query pattern quietly pushed our Glue spend to $22k in 23 days, without anyone noticing at first.

Nothing was broken. Data was flowing.
And yet, money was leaking fast.

The Symptom: $22k MTD in AWS Glue

At first glance, the number alone didn’t tell us much:

Glue spend (MTD, 23 days): ~$22,000

No alerts. No failures.
Just… cost.

So we started where any FinOps forensic investigation starts: job-by-job attribution.
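
Job-by-job attribution can be sketched in a few lines. The snippet below is a minimal illustration, not our actual tooling. The run records and job names are hypothetical, and the $0.44 per DPU-hour rate is an assumption (check the Glue price for your region). In practice the records would come from billing data or the Glue job-run history.

```python
from collections import defaultdict

# Assumed price per DPU-hour; it varies by region, so treat it as a placeholder.
RATE_PER_DPU_HOUR = 0.44

# Hypothetical run records: (job name, DPUs allocated, billed seconds).
runs = [
    ("job_a", 4, 90),
    ("job_a", 4, 120),
    ("job_b", 32, 1500),
]

def cost_by_job(runs, rate=RATE_PER_DPU_HOUR):
    """Sum DPU-hours per job and convert them to dollars."""
    totals = defaultdict(float)
    for job, dpus, seconds in runs:
        totals[job] += dpus * seconds / 3600 * rate
    # Most expensive job first, so the outliers surface at the top.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for job, dollars in cost_by_job(runs):
    print(f"{job}: ${dollars:.2f}")
```

Sorting by cost rather than by run count is the point: a job that runs rarely on many DPUs can outweigh one that runs constantly on a few.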

Low-Hanging Fruit: DPUs Don’t Mean Performance

The first win was almost boring, and that’s the point.

Job: `append_to_iceberg_table`

Cost: ~$6k/month
Configuration: 4 × G.1X executors → 4 DPUs
Runtime: 1–2 minutes per run

After checking metrics and enabling Spark UI briefly, it was obvious.
This job was over-provisioned. We cut it to 2 × G.1X (2 DPUs).

Same runtime. Same output.
~50% cost reduction.

A reminder: more DPUs don’t automatically mean faster jobs.
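
The arithmetic behind that reminder is simple: Glue bills DPUs multiplied by time, so extra DPUs only pay for themselves if the runtime shrinks in proportion. A rough sketch follows. The function name is ours, and the 90-second runtime is an assumed midpoint of the 1–2 minute range.

```python
def relative_saving(old_dpus, old_seconds, new_dpus, new_seconds):
    """Fraction of DPU-seconds saved by moving from the old config to the new one."""
    old = old_dpus * old_seconds
    new = new_dpus * new_seconds
    return 1 - new / old

# Same runtime on half the DPUs: half the cost.
print(relative_saving(4, 90, 2, 90))   # 0.5

# If halving DPUs had doubled the runtime, there would be no saving at all.
print(relative_saving(4, 90, 2, 180))  # 0.0
```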

The Real Culprit: A Merge Job That Looked “Normal”

Then we hit the expensive one.

Job: `ter_merge_job`

8 × G.4X workers → 32 DPUs
Runtime: ~25 minutes
Cost: ~$5k MTD
Schedule: “every 7 minutes” (impossible with a ~25-minute runtime and max concurrency = 1)
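
To see why that schedule could never hold, here is a back-of-the-envelope sketch. It assumes that a trigger firing while a run is still active does not start a second run (which is what max concurrency = 1 implies), and it treats the ~25-minute runtime as exact. The function name is hypothetical.

```python
import math

def effective_interval(schedule_minutes, runtime_minutes):
    """Minutes between run starts when only one run may be active.

    Assumes a trigger that fires while a run is still active is skipped
    or rejected, so the next run starts at the first trigger at or after
    the moment the previous run finishes.
    """
    return math.ceil(runtime_minutes / schedule_minutes) * schedule_minutes

interval = effective_interval(7, 25)
print(interval)             # 28
print(24 * 60 // interval)  # 51 runs per day at most
```

Under these assumptions the job effectively ran back to back, starting every 28 minutes instead of every 7.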

This one deserved a deeper look.
So we enabled Spark UI and Job metrics.

What we saw immediately raised red flags 🚩

The Smoking Gun: 162 GB of Shuffle

Two stages stood out:

Stage 12 → 162 GB shuffle write
Stage 14 → 14.6 minutes

Both mapped back to a MERGE INTO query against Apache Iceberg tables.

Digging into Spark SQL / DataFrame metrics:

One table scan returned ~3.3B rows
Another scan returned ~5.9B rows
After filtering?
👉 ~29,000 rows

That’s not inefficient. That’s catastrophic!

“But the Table Is Partitioned…”

The table was partitioned, yes, by created_at.

And the query looked reasonable:

audit_created_at >= TIMESTAMP '{main_max_audit_created_at}'
AND audit_created_at <  TIMESTAMP '{audit_max_audit_created_at}'

So why was Spark scanning every single row?
To answer that, we stripped the job down.

Reproducing the Problem (On Purpose)

We cloned the job and ran isolated queries to test partition pruning behavior.

Same table.
Same data.
Different WHERE clauses.

The results were… unsettling.

Query 1: CAST to DATE (works)

WHERE CAST(created_at AS DATE) >= DATE '2026-04-26'
  AND CAST(created_at AS DATE) <  DATE '2026-04-28'
  AND created_at >= TIMESTAMP '2026-04-26 11:00:00'
  AND created_at <  TIMESTAMP '2026-04-27 07:00:00'

✅ Partition pruning
⏱ ~28 seconds
📉 Millions of rows scanned

Query 2: DATE only (works)

WHERE CAST(created_at AS DATE) >= DATE '2026-04-26'
  AND CAST(created_at AS DATE) <  DATE '2026-04-28'

✅ Partition pruning
⏱ ~28 seconds

Query 3: TIMESTAMP (disaster!)

WHERE created_at >= TIMESTAMP '2026-04-26 11:00:00'
  AND created_at <  TIMESTAMP '2026-04-27 07:00:00'

❌ No partition pruning
🤯 ~5.6 billion rows scanned
⏱ ~15 minutes

Same data.
Same logical filter.
Wildly different cost.

The Root Cause: TIMESTAMP vs TIMESTAMP_NTZ

The key detail was hiding in plain sight.

created_at in Iceberg was defined as TimestampType (microsecond precision, no timezone)
Spark queries were using TIMESTAMP literals, which Spark treats as timezone-aware by default
With the types mismatched, Spark couldn’t push the predicate down → no partition pruning
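
Python's datetime has the same split, which makes for a handy analogy (only an analogy, not how Spark is implemented). A naive datetime behaves like TIMESTAMP_NTZ: a wall-clock value with no zone. An aware datetime behaves like Spark's default TIMESTAMP: an instant whose meaning depends on a time zone. The two fixed offsets below are arbitrary assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

wall_clock = datetime(2026, 4, 26, 11, 0, 0)  # no time zone: like TIMESTAMP_NTZ

# The same wall-clock text read as an instant in two assumed session time zones.
as_utc = wall_clock.replace(tzinfo=timezone.utc)
as_plus_2 = wall_clock.replace(tzinfo=timezone(timedelta(hours=2)))

# Same text, different instants: they are two hours apart.
print(as_utc - as_plus_2)  # 2:00:00

# Python refuses to compare the two kinds directly.
try:
    wall_clock < as_utc
    comparable = True
except TypeError:
    comparable = False
print(comparable)  # False
```

Spark does not raise an error in this situation; it reconciles the two types itself. Our working assumption is that this reconciliation is what kept the filter from reaching the Iceberg scan as a plain comparison on the partition column.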

When we rewrote the query using TIMESTAMP_NTZ:

WHERE created_at >= TIMESTAMP_NTZ '2026-04-26 00:00:00'
  AND created_at <  TIMESTAMP_NTZ '2026-04-28 00:00:00'

Suddenly:

Filters appeared in the BatchScan
Partition pruning kicked in
Runtime dropped from ~15 minutes to ~30 seconds
Rows scanned dropped from billions to hundreds of thousands, which sharply reduced our S3 GetObject costs

Figure: Evolution of the daily GetObject API operation cost for the data lake S3 bucket

Same logic.
Different type.
Orders of magnitude difference.
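
Because the job builds its filters from templates, one way to keep the literal type explicit is a small helper that renders the bounds. This is a sketch under our own naming (ntz_literal and created_at_range are hypothetical helpers, not part of Spark or Iceberg), and it deliberately rejects timezone-aware values so the two kinds cannot be mixed by accident.

```python
from datetime import datetime

def ntz_literal(ts: datetime) -> str:
    """Render a naive datetime as a Spark SQL TIMESTAMP_NTZ literal."""
    if ts.tzinfo is not None:
        raise ValueError("expected a naive datetime (no tzinfo)")
    return f"TIMESTAMP_NTZ '{ts:%Y-%m-%d %H:%M:%S}'"

def created_at_range(start: datetime, end: datetime) -> str:
    """Build a half-open [start, end) filter on created_at."""
    return (
        f"created_at >= {ntz_literal(start)}\n"
        f"  AND created_at <  {ntz_literal(end)}"
    )

print(created_at_range(datetime(2026, 4, 26), datetime(2026, 4, 28)))
```

For the dates used above, this prints the same two-line filter as the rewritten query.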

The FinOps Lesson (Not a Spark One)

This wasn’t a Glue problem or an Iceberg problem.
It was a FinOps visibility problem.

A single query pattern multiplied cost
DPUs hid inefficiency
Partitioning existed but wasn’t used

What We Changed

Immediately:

Reduced DPUs where metrics showed overprovisioning
Fixed query filters to use TIMESTAMP_NTZ
Validated partition pruning via Spark UI before deploying
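
The pruning check can also be scripted rather than eyeballed in the Spark UI. The sketch below scans plan text for BatchScan lines and reports whether any filters were pushed. It assumes the scan node prints them as `filters=...]`, and that layout differs between Spark and Iceberg versions, so the pattern would need adjusting to your output. The two sample plans are hand-written stand-ins, not captured from a real job.

```python
import re

def pushed_filters(plan_text):
    """Return the non-empty filter strings found on BatchScan lines."""
    found = []
    for line in plan_text.splitlines():
        if "BatchScan" not in line:
            continue
        match = re.search(r"filters=([^\]]*)\]", line)
        if match and match.group(1).strip():
            found.append(match.group(1).strip())
    return found

# Hand-written stand-ins for plan output (illustrative only).
pruned = (
    "+- BatchScan db.events [filters=created_at >= 2026-04-26 00:00:00, "
    "created_at < 2026-04-28 00:00:00]"
)
unpruned = "+- BatchScan db.events [filters=]"

print(bool(pushed_filters(pruned)))    # True
print(bool(pushed_filters(unpruned)))  # False
```

A check like this could run against the output of an EXPLAIN statement before a query change ships, failing the deployment when the scan shows no pushed filters.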

Going forward:

Cost anomalies trigger engineering reviews, not just finance checks
Glue cost isn’t driven by job count; it’s driven by query behavior

Final Thought

Cloud cost optimization isn’t about discounts.
It’s about understanding how your systems actually behave.

Sometimes the most expensive thing in your data platform isn’t scale.

It’s a single keyword in a WHERE clause.

🤔 If one of your Glue jobs scanned 5 billion rows today…
Would you know or would you just see the bill tomorrow?

FinOps: Zero to Hero
