
We migrated our database layer from a PostgREST-based SDK to Drizzle ORM. As part of that, connections started going through a transaction-mode connection pooler.
Normal queries worked. SELECT, INSERT, UPDATE, JOINs, CTEs, JSONB operations. All fine.
Custom PostgreSQL function calls did not.
We had 16 methods calling custom Postgres functions via SELECT function(...). Every single one broke. Silently. No crash, no clear error message. Just a generic database error that got caught, retried 3 times, and sent to a dead letter queue.
Over 6 days, that DLQ accumulated 1M+ messages.
Nobody noticed because we didn't have an alarm on queue depth.
Why we had Postgres functions in the first place
Our original backend used PostgREST, which exposes tables as REST endpoints. Simple CRUD works well enough, but anything more complex (multi-table updates, conditional logic, JSONB manipulation) has to live in a PostgreSQL function invoked via .rpc().
Over time we accumulated 16 of these. When we migrated to Drizzle, we kept them as-is:
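Roughly, each kept method looked like this (a sketch; the method, function, and column names are hypothetical, but the raw-SQL escape hatch is the point):

```typescript
// Sketch of the pattern we preserved (names hypothetical). The method body
// is a single raw call into a Postgres function; all the real logic lives
// server-side, exactly as it did in the PostgREST/.rpc() days.
type Db = { execute(query: string, params?: unknown[]): Promise<{ rows: any[] }> };

async function claimNextJob(db: Db, workerId: string) {
  // One round trip -- but this SELECT function(...) form is exactly what
  // broke behind the transaction-mode pooler.
  const { rows } = await db.execute("SELECT claim_next_job($1) AS job", [workerId]);
  return rows[0]?.job;
}
```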
This saved time during the migration. It also meant 16 methods were one infrastructure change away from breaking.
What transaction-mode pooling actually does
In transaction mode, a pooler like PgBouncer shares database connections between clients: when your transaction finishes, the connection goes back to the pool and the next client gets it.
This is useful for serverless. You don't exhaust your connection limit with 200 concurrent Lambdas.
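For reference, this is the setting in question; a minimal pgbouncer.ini fragment (values illustrative):

```ini
; pgbouncer.ini (fragment). In transaction mode, a server connection is
; leased to a client only for the duration of a transaction, then handed
; to the next client; session state does not survive between transactions.
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```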
The tradeoff: some Postgres features stop working. Prepared statements are the well-known one (set prepare: false and move on). Custom function calls are less documented.
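With Drizzle over the postgres.js driver, the prepared-statement half of this is a one-line connection option. A sketch, assuming that driver (the connection string is a placeholder):

```typescript
// Sketch: disabling prepared statements for a transaction-mode pooler.
// This is the documented postgres.js option -- and it does nothing for
// the custom-function-call problem described in this post.
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";

const client = postgres(process.env.DATABASE_URL!, { prepare: false });
export const db = drizzle(client);
```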
The error you get back isn't "function calls not supported." It's a generic database error. Looks like any other query failure. You will not figure this out from the error message alone.
Why it took 6 days
Three things compounded:
1. The failure was silent. The Lambda didn't crash. It processed the message, hit a database error, retried, failed again, and sent it to the DLQ. This is correct behavior. That's the problem. The system was working exactly as designed, just against broken queries.
2. A louder incident masked it. The same ORM migration also caused timestamp serialization bugs (new Date(undefined) crashes Drizzle but PostgREST handled it fine). That incident was obvious: calls stuck, errors spiking, 24+ hours of firefighting. The silent function call failures were happening at the same time but invisible against the noise.
3. The degradation was gradual. The system didn't stop. Some code paths used standard queries and still worked. Call volume dropped but didn't hit zero. Ops noticed "fewer calls than usual" but couldn't pinpoint why because the system appeared to be functioning.
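An aside on the masking incident: the timestamp crash is easy to reproduce, because in JavaScript new Date(undefined) yields an Invalid Date, which blows up when an ORM tries to serialize it. A guard of roughly this shape (a sketch, not our exact code) stops it at the boundary:

```typescript
// new Date(undefined) is an Invalid Date (its getTime() is NaN), and an
// Invalid Date fails serialization in the query layer. Normalize
// questionable inputs to null before they reach the ORM.
function toDateOrNull(value: string | number | Date | null | undefined): Date | null {
  if (value == null) return null;
  const d = new Date(value);
  return Number.isNaN(d.getTime()) ? null : d;
}
```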
The DLQ was the signal the whole time. It went from 0 to 1M. A simple alarm on ApproximateNumberOfMessagesVisible > 1000 would have caught this on day 1. We didn't have that alarm.
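If your infrastructure is in CDK, that alarm is a few lines. A sketch (construct names and threshold illustrative):

```typescript
// CDK sketch (names hypothetical): alarm when the DLQ is no longer empty.
// ApproximateNumberOfMessagesVisible is the standard SQS depth metric.
import { Duration } from "aws-cdk-lib";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { Construct } from "constructs";

declare const scope: Construct;  // your stack
declare const dlq: sqs.Queue;    // the existing dead letter queue

const alarm = dlq
  .metricApproximateNumberOfMessagesVisible({ period: Duration.minutes(5) })
  .createAlarm(scope, "DlqDepthAlarm", {
    threshold: 1000,
    evaluationPeriods: 1,
  });
// Wire alarm actions (SNS -> pager) to taste.
```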
The fix
Rewrite every SELECT function(...) as a normal query.
Simple case:
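A sketch of the simple rewrite (table and function names hypothetical): the read was hidden behind a Postgres function; afterwards it is an ordinary parameterized SELECT, which the pooler handles fine.

```typescript
type Db = { execute(query: string, params?: unknown[]): Promise<{ rows: any[] }> };

// Before -- broke silently behind the transaction-mode pooler:
const getSettingsBefore = (db: Db, userId: string) =>
  db.execute("SELECT get_user_settings($1)", [userId]);

// After -- the same read as a plain query:
const getSettingsAfter = (db: Db, userId: string) =>
  db.execute("SELECT * FROM user_settings WHERE user_id = $1", [userId]);
```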
Complex case, where the function had multi-step logic:
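A sketch of the complex rewrite (names hypothetical): the Postgres function did a conditional multi-table update, so the replacement inlines those steps in one explicit transaction, which still runs on a single pooled connection.

```typescript
type Tx = { execute(query: string, params?: unknown[]): Promise<{ rows: any[] }> };
type Db = Tx & { transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T> };

async function completeCall(db: Db, callId: string) {
  // Before: await db.execute("SELECT complete_call($1)", [callId]);
  // After: each step of the function body as its own plain statement.
  return db.transaction(async (tx) => {
    const { rows } = await tx.execute(
      "UPDATE calls SET status = 'completed' WHERE id = $1 RETURNING agent_id",
      [callId],
    );
    if (rows.length > 0) {
      await tx.execute(
        "UPDATE agents SET active_calls = active_calls - 1 WHERE id = $1",
        [rows[0].agent_id],
      );
    }
    return rows[0] ?? null;
  });
}
```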
16 methods across 7 files. The critical 5 were deployed within 1.5 hours of detection. The other 11 were done the same day.
What we learned
Connection poolers are not transparent. They change what SQL you can run. prepare: false is not the end of the story. If you're adopting a transaction-mode pooler, grep your codebase for every db.execute and check what it's actually doing.
Put alarms on your dead letter queues. This is the single highest-value change from this incident. DLQs are where silent failures accumulate. If you're not monitoring them, you're missing your best signal.
When you finish a big incident, look for the second one. We spent 24 hours fixing timestamp bugs from the same migration. The function call failures were right there the whole time, growing in the DLQ. After resolving a major incident, actively sweep for related breakage.
Don't preserve old patterns in new systems. We kept SELECT function(...) inside Drizzle to reduce migration scope. Every shortcut like that is a bet that the old pattern still works in the new environment. This time it didn't.
Date: February 24, 2026
Category: AI Collection
Reading: 6 min
