Engineering

Ubicloud Postgres - why I'm paying attention to this (deep dive)


Scaling pains. Domu is growing 3x month-over-month, and our original database schema can’t keep up. With millions of calls now hitting a system built for a fraction of that load, we’re re-architecting on the fly to stop the timeouts and survive the surge.


I've been spending a lot of time thinking about databases lately.

At Domu we've been growing fast. ~3x month over month kind of fast. And like most startups that grew faster than expected, our database schema is... not great. It was built for a different scale, different access patterns, different everything. Now we're processing millions of calls daily and every query that used to take 200ms is suddenly taking 20 seconds. Or timing out completely.

I've spent time optimizing Postgres: creating composite indexes, rewriting queries, figuring out why EXPLAIN ANALYZE shows 96 seconds on production when it was 8 seconds on staging (cold cache lol). We also brought in ClickHouse for our analytics workload because Postgres could no longer handle the aggregations (shoutout to Dre, an engineer on our team who did the whole implementation by himself).

Which is how I came across Ubicloud's managed Postgres offering, thanks to a friend's referral to Umur, one of the founders. And honestly, the more I dug into their architecture, the more interesting it got.

This is the first article in what I'm planning to be a series of technical deep dives on infrastructure, DBs, and services I find interesting. I want to understand how things actually work under the hood.


What I'll Cover

Before diving in, here's what this article goes through:

  • The storage architecture bet and why local NVMe changes everything

  • What actually goes into running Postgres for others (it's way more than I thought)

  • WAL handling details that most managed services get wrong

  • The fencing problem in high availability setups

  • Security isolation and the COPY FROM PROGRAM escape hatch

  • Where I'm considering using this for our own infrastructure


The Storage Bet That Changes Everything

Here's the thing about cloud databases that I didn't fully appreciate until recently: the storage architecture matters way more than I thought.

When AWS built RDS back in the day, they made a reasonable decision for 2010: move storage off the server onto the network. Hard drives failed all the time, single disk I/O was terrible, and centralizing storage let you spread load and replicate data transparently.

But hardware moved on. The software architecture didn't.

Consider this: a $600 NVMe SSD delivers 2.5 million IOPS. To get that same throughput through Aurora, you'd pay around $1.3M per month. That's not a typo. Three orders of magnitude difference.

Ubicloud made a different bet. Every managed Postgres instance runs on local NVMe storage. No network hop. No EBS. No shared storage fabric.


Their benchmarks are pretty wild:

TPC-C Transactions/sec: Ubicloud (873), Aurora (636), RDS (188)

TPC-C p99 Latency (ms): Ubicloud (314), Aurora (601), RDS (2,406)

TPC-H Mean Query Time: Ubicloud (1x), Aurora (2.42x), RDS (2.96x)


4.6x more transactions than RDS. 7.7x lower tail latency. And it costs roughly 3x less.


Running Postgres For Others Is Hard

I used to think managed Postgres was basically "spin up a VM, install Postgres, hand back a connection string." Maybe add some monitoring.

Well, turns out that's like saying building a car is just "put an engine in a metal box."

Ubicloud's team (they built Heroku Postgres and Citus before this) wrote about what actually goes into a managed Postgres service. The dependency graph is gnarly:

You need database provisioning (the obvious part), DNS records (so connection strings survive failovers), TLS certificates (signed, rotated, validated), extension installation (roughly 80 extensions, many requiring compilation), configuration tuning (parameters matched to instance size), backup infrastructure (WAL archiving, full backups, PITR), health monitoring (detecting failures before customers do), and failover orchestration (the really hard part).

Each of these depends on the others. You can't do HA without monitoring. You can't do read replicas without backups. You can't do point in time restore without WAL archiving.

The naive approach takes 10 minutes per database. Ubicloud got it down to 20-30 seconds through baked images, parallel operations, and pre-provisioned instance pools.


The WAL details

Here's something I learned recently that most managed Postgres services get subtly wrong:

Postgres generates a new WAL file when it hits 16MB. If your database isn't busy, that might take hours. Which means your "point in time restore to any minute" claim is quietly false. You can only restore to whenever the last WAL file completed.
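You can see this for yourself on any Postgres instance (10 or newer). This is a minimal sanity check, not anything Ubicloud-specific:

```sql
-- Show the WAL segment file Postgres is currently writing into.
SELECT pg_walfile_name(pg_current_wal_lsn());

-- Force a segment switch. Until this happens (or 16MB of writes fill
-- the segment), the current segment is never archived, so PITR can't
-- reach anything newer than the last completed segment.
SELECT pg_switch_wal();
```

Run the first query before and after some writes and you'll see the same filename until the segment fills or is switched.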

The fix is setting archive_timeout to force a new WAL file every minute. But there's a second edge case: if there's zero activity, Postgres won't generate a WAL file even with archive_timeout set.

Ubicloud handles this by generating synthetic write activity via pg_current_xact_id(). But it's the kind of detail that separates "managed Postgres" from "actually managed Postgres."
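As a rough sketch, the two-part fix might look like this. The archive_timeout setting is standard Postgres; the synthetic-write query is my guess at what their control plane runs, based on their mention of pg_current_xact_id() (Postgres 13+):

```sql
-- Force a WAL segment switch at least every 60 seconds, so archived
-- WAL never lags a restore target by more than a minute.
ALTER SYSTEM SET archive_timeout = '60s';
SELECT pg_reload_conf();

-- Synthetic activity: allocating a transaction ID produces a WAL
-- record, so archive_timeout has something to archive even when the
-- database is otherwise completely idle.
SELECT pg_current_xact_id();
```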


The fencing problem

HA sounds simple in theory. Monitor the primary, failover to the standby if it dies, provision a new standby.

The hard part is fencing.

When your health check fails, you don't actually know if the primary is dead. Maybe the network between your control plane and the database is partitioned. Maybe the primary will come back in 30 seconds.

If you promote the standby while the primary is still accepting writes, you have split brain. Some clients connect to the new primary, some to the old one. Data diverges. You lose writes. Catastrophic.

Ubicloud's fencing is aggressive and parallel. If the server is reachable they kill the Postgres process and prevent restart. They detach the network interface so no new connections arrive. And they deprovision the VM entirely. All three execute simultaneously.

They explicitly call out that DNS updates are not sufficient for fencing. DNS TTLs mean clients can still reach the old primary for the cache duration.


The COPY FROM PROGRAM Escape Hatch

There's a fun trick that most Postgres users don't know about:

CREATE TABLE command_results (line text);
COPY command_results FROM PROGRAM 'ls';

With superuser access, you can execute arbitrary shell commands from inside Postgres. Most managed services respond to this by denying superuser access entirely.

Ubicloud takes a different approach. They give you superuser, but the postgres OS user is isolated to only accessing database files. Each database runs in its own VM. VMs are network isolated from each other. Even if you escape to the OS, you can't reach other tenants.

This is the kind of security model that only works when you control the full stack from bare metal up.


What This Means For Me

I'm not switching Domu's production database tomorrow. But in a couple of months I hope to evaluate migrating away from [redacted by compliance], which is currently causing us a lot of performance issues.


Honorable Mention: PlanetScale (❤️)

I can't write about database options without mentioning PlanetScale. They're my other serious consideration for future projects.

PlanetScale is built on Vitess, the database clustering system YouTube created to scale MySQL. The architecture is completely different from what Ubicloud is doing. Instead of betting on local NVMe storage (they are using it for their Postgres offering though), they're betting on horizontal sharding. You can distribute your data across thousands of nodes, all presenting as a single database connection.

What I love about PlanetScale:

  • Non-blocking schema changes. You can deploy schema changes without locking tables or causing downtime. For a company doing continuous deployment, this is huge.

  • Database branching. Create a branch of your database just like you'd branch code. Test migrations safely before applying to production.

  • Proven at insane scale. Vitess powers Slack, GitHub, Square. It's not theoretical.

  • They also use local NVMe now. Their Metal product runs on locally attached NVMe drives, similar philosophy to Ubicloud.


The difference is that PlanetScale is MySQL (or Postgres now; they added that recently, though I haven't tested it yet), and they're focused on horizontal scaling as the primary value prop. Ubicloud is pure Postgres, focused on the price/performance of local storage.

For Domu, if we ever outgrow single-node Postgres and need true horizontal scaling, PlanetScale would be at the top of my list.


What's Next

I'm going to keep digging into their architecture. The control plane is written in Ruby (Roda, Sequel, Rodauth) which is an unusual choice that I find interesting. Their "Strand" system for managing long running operations like provisioning and failover looks well thought out. And the fact that everything is open source means I can actually read the code instead of guessing at what's happening behind the API.

This is the first of hopefully many deep dives. I want to do similar writeups on other tools I'm evaluating or using. Things like:

  • How ClickHouse actually handles our analytics workload

  • What makes OrbStack so fast compared to Docker Desktop

  • How Hookdeck handles webhook infrastructure at scale

If you're building something similar or have thoughts on any of this, feel free to hmu.

This article comes from my personal research, done in my free time, so not everything may be correct; I haven't checked in with the Ubicloud team for verification. Either way, I hope you found something useful in it.


Links

Ubicloud

PlanetScale

Details


Date

January 18, 2026

Category

Engineering

Reading

6 min

Author


Kai Takami


CTO


Leading engineering team



We’re building the next generation of engagement technology: intelligent, automated, and compliant. Our mission is to empower financial institutions to orchestrate every stage of the servicing lifecycle with dignity and unprecedented efficiency.

Supported by

Y Combinator

AWS

Microsoft

Copyright © 2025 Domu Communications LLC. All rights reserved.
