We (partially) broke production for 6 days and didn't notice

How migrating to a connection pooler silently broke 16 PostgreSQL function calls, accumulated 1M+ dead letter queue messages, and what we missed.

We migrated our database layer from a PostgREST-based SDK to Drizzle ORM. As part of that, connections started going through a transaction-mode connection pooler.

Normal queries worked. SELECT, INSERT, UPDATE, JOINs, CTEs, JSONB operations. All fine.

Custom PostgreSQL function calls did not.

-- this works through a transaction-mode pooler
UPDATE records SET status = 'done' WHERE id = $1;

-- this does not
SELECT my_custom_function($1, $2);

We had 16 methods calling custom Postgres functions via SELECT function(...). Every single one broke. Silently. No crash, no clear error message. Just a generic database error that got caught, retried 3 times, and sent to a dead letter queue.

Over 6 days, that DLQ accumulated 1M+ messages.

Nobody noticed because we didn't have an alarm on queue depth.
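For context, the retry-then-DLQ path described above is standard SQS redrive behavior. A minimal sketch of the configuration shape (queue name and ARN here are illustrative, not ours):

```typescript
// Standard SQS redrive policy: after maxReceiveCount failed receives,
// SQS moves the message to the dead letter queue. The ARN is made up.
const redrivePolicy = {
  deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:worker-dlq",
  maxReceiveCount: 3,
};

// SQS expects the policy as a JSON string in the RedrivePolicy attribute.
const queueAttributes = { RedrivePolicy: JSON.stringify(redrivePolicy) };
```

With `maxReceiveCount: 3`, a message that fails three receives lands in the DLQ without any error surfacing in the consumer, which is exactly how 1M+ messages accumulated unseen.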

Why we had Postgres functions in the first place

Our original backend used PostgREST, which exposes tables as REST endpoints. Simple CRUD works well enough, but anything more complex (multi-table updates, conditional logic, JSONB manipulation) has to live in a PostgreSQL function invoked via .rpc().

// PostgREST pattern: complex logic lives in the database
const { data } = await client.rpc('apply_payment', {
  agreement_id: id,
  amount: 500
})

Over time we accumulated 16 of these. When we migrated to Drizzle, we kept them as-is:

// we changed the transport, not the pattern
const result = await db.execute(
  sql`SELECT apply_payment(${id}, ${amount})`
)

This saved time during the migration. It also meant 16 methods were one infrastructure change away from breaking.

What transaction-mode pooling actually does

Tools like PgBouncer share database connections between clients in transaction mode. When your transaction finishes, the connection goes back to the pool and some other client gets it.

This is useful for serverless. You don't exhaust your connection limit with 200 concurrent Lambdas.

The tradeoff: some Postgres features stop working. Prepared statements are the well-known one (set prepare: false and move on). Custom function calls are less documented.
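For reference, with the postgres.js driver that Drizzle commonly sits on, disabling prepared statements is a one-line driver option. A sketch (the connection string and setup are illustrative):

```typescript
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";

// Transaction-mode poolers hand your connection to another client after
// each transaction, so per-session prepared statements can't be reused.
// Disable them at the driver level.
const client = postgres(process.env.DATABASE_URL!, { prepare: false });
const db = drizzle(client);
```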

The error you get back isn't "function calls not supported." It's a generic database error. Looks like any other query failure. You will not figure this out from the error message alone.

Why it took 6 days

Three things compounded:

1. The failure was silent. The Lambda didn't crash. It processed the message, hit a database error, retried, failed again, and sent it to the DLQ. This is correct behavior. That's the problem. The system was working exactly as designed, just against broken queries.

2. A louder incident masked it. The same ORM migration also caused timestamp serialization bugs (new Date(undefined) crashes Drizzle but PostgREST handled it fine). That incident was obvious: calls stuck, errors spiking, 24+ hours of firefighting. The silent function call failures were happening at the same time but invisible against the noise.

3. The degradation was gradual. The system didn't stop. Some code paths used standard queries and still worked. Call volume dropped but didn't hit zero. Ops noticed "fewer calls than usual" but couldn't pinpoint why because the system appeared to be functioning.

The DLQ was the signal the whole time. It went from 0 to 1M. A simple alarm on ApproximateNumberOfMessagesVisible > 1000 would have caught this on day 1. We didn't have that alarm.
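The check itself is trivial once you read the attribute back (SQS `GetQueueAttributes` returns values as strings). A sketch, with a helper name of our own invention:

```typescript
// Decide whether DLQ depth warrants paging, given the raw attribute map
// returned by SQS GetQueueAttributes (all values arrive as strings).
function dlqDepthExceeds(
  attributes: Partial<Record<string, string>>,
  threshold: number
): boolean {
  const depth = Number.parseInt(
    attributes["ApproximateNumberOfMessagesVisible"] ?? "0",
    10
  );
  return Number.isFinite(depth) && depth > threshold;
}
```

In practice you would put a CloudWatch alarm on the ApproximateNumberOfMessagesVisible metric rather than polling it yourself; the point is that the threshold logic is this simple and we still didn't have it.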

The fix

Rewrite every SELECT function(...) as a normal query.

Simple case:

// before (broken)
await db.execute(sql`SELECT update_status(${id}, ${status})`)

// after
await db
  .update(records)
  .set({ status, updated_at: new Date() })
  .where(eq(records.id, id))
  .returning()

Complex case, where the function had multi-step logic:

// before (broken)
await db.execute(sql`SELECT apply_payment(${id}, ${amount})`)

// after: move the logic to application code in a transaction
await executeTransaction(async (tx) => {
  // .limit(1) still returns an array, so take the first row
  const [row] = await tx.select().from(agreements)
    .where(eq(agreements.id, id)).limit(1)

  const newAmount = Math.max(0, row.amount_due - amount)
  const newStatus = newAmount === 0 ? 'fulfilled' : 'pending'

  return await tx.update(agreements)
    .set({ amount_due: newAmount, status: newStatus })
    .where(eq(agreements.id, id))
    .returning()
})
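A side benefit of moving the logic out of the database: the arithmetic becomes a pure function you can unit-test. This helper is a hypothetical extraction for illustration, not code from our repo:

```typescript
// Pure calculation from the transaction above: clamp the remaining
// balance at zero and derive the agreement status from it.
function applyPaymentCalc(
  amountDue: number,
  payment: number
): { amountDue: number; status: "fulfilled" | "pending" } {
  const remaining = Math.max(0, amountDue - payment);
  return {
    amountDue: remaining,
    status: remaining === 0 ? "fulfilled" : "pending",
  };
}
```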

16 methods across 7 files. The critical 5 were deployed within 1.5 hours of detection. The other 11 were done the same day.

What we learned

Connection poolers are not transparent. They change what SQL you can run. prepare: false is not the end of the story. If you're adopting a transaction-mode pooler, grep your codebase for every db.execute and check what it's actually doing.
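That grep can be scripted. The regex below is a rough heuristic we're sketching here, not what we shipped: it flags raw SQL of the shape SELECT some_function(...) and will have false positives on built-ins like count():

```typescript
// Flag raw SQL statements that invoke a function via SELECT fn(...),
// the shape that broke for us under transaction-mode pooling.
function looksLikeFunctionCall(sqlText: string): boolean {
  return /^\s*SELECT\s+[a-z_][a-z0-9_]*\s*\(/i.test(sqlText);
}
```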

Put alarms on your dead letter queues. This is the single highest-value change from this incident. DLQs are where silent failures accumulate. If you're not monitoring them, you're missing your best signal.

When you finish a big incident, look for the second one. We spent 24 hours fixing timestamp bugs from the same migration. The function call failures were right there the whole time, growing in the DLQ. After resolving a major incident, actively sweep for related breakage.

Don't preserve old patterns in new systems. We kept SELECT function(...) inside Drizzle to reduce migration scope. Every shortcut like that is a bet that the old pattern still works in the new environment. This time it didn't.

Details


Date

February 24, 2026

Category

AI Collection

Reading

6 min

Author


Kai Takami


CTO

Leading engineering team



We’re building the next generation of engagement technology: intelligent, automated, and compliant. Our mission is to empower financial institutions to orchestrate every stage of the servicing lifecycle with dignity and unprecedented efficiency.

Supported by

Y Combinator

AWS

Microsoft

Copyright © 2025 Domu Communications LLC. All rights reserved.
