On a dark and stormy night, your phone buzzes. It’s 3 am, and you’re still awake, at wit’s end after your datacenter’s team told you, “We don’t know why the system is unavailable, but we’re working on it.” You tremble with helplessness as you pick up your phone and read, “The database crashed and it’s not coming back up.”
After long enough, we all have horror stories: the calls we field at awful hours, the system failures that confound us, the vendors that let us down. But often the worst times can yield the best lessons.
If you’re American, you may have followed the media circus around the healthcare.gov launch:
- Service outages at datacenters run by Verizon Terremark made healthcare.gov periodically unavailable.
- Insufficient hardware brought the site to a crawl even under typical usage.
- Various Oracle products in healthcare.gov’s stack failed miserably, causing downtime.
- The developers who built healthcare.gov didn’t understand their core database, MarkLogic.
In essence, all kinds of vendors failed the team through service outages and product failures. Even when products weren’t breaking, the coders building the site didn’t understand their stack, and built sketchy software on top of it. As a result, even when Oracle’s products worked and Verizon’s datacenters weren’t on fire, the site could still barely handle demand.
The quick fix, which they executed, was to throw more hardware at the problem. But with this approach, the code still consumes more hardware than budgeted for, a problem that only gets worse and more expensive as the project scales.
So, MarkLogic threw a bunch of their own engineers at the problem, optimized queries, and reduced response times by as much as four-fifths. Pages that took 1.5 seconds to load suddenly took only 300 ms. Not bad. So what can we learn from healthcare.gov’s mistakes?
Understand Your Stack
Coders who build on a NoSQL system like MarkLogic with the same assumptions they’d hold for a SQL setup will run into trouble. In the absence of JOINs, for example, you need to think about how you’ll aggregate your data effectively and efficiently. orc-denorm is one solution for NoSQL systems like Orchestrate. Without similar tools, you should put in the time to grok the tech you rely on. Tech debt will hurt more than bad tech ever could.
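To make the JOIN problem concrete, here is a minimal sketch of denormalization in plain Python. It is not orc-denorm's API or anything specific to MarkLogic or Orchestrate; the `users` and `orders` collections, their fields, and the `denormalize` helper are all made up for illustration. The idea is to do the "join" once, ahead of time, and store the combined document so a read is a single lookup.

```python
from collections import defaultdict

# Hypothetical data: two "collections" as they might sit in a document store.
users = [
    {"id": "u1", "name": "Ada"},
    {"id": "u2", "name": "Grace"},
    {"id": "u3", "name": "Edsger"},
]
orders = [
    {"id": "o1", "user_id": "u1", "total": 40},
    {"id": "o2", "user_id": "u1", "total": 15},
    {"id": "o3", "user_id": "u2", "total": 99},
]


def denormalize(users, orders):
    """Embed each user's orders in the user document (a precomputed JOIN)."""
    # Group orders by the user they belong to.
    orders_by_user = defaultdict(list)
    for order in orders:
        orders_by_user[order["user_id"]].append(
            {"id": order["id"], "total": order["total"]}
        )

    # Build one self-contained document per user, keyed by user id.
    profiles = {}
    for user in users:
        embedded = orders_by_user.get(user["id"], [])
        profiles[user["id"]] = {
            **user,
            "orders": embedded,
            "order_total": sum(o["total"] for o in embedded),
        }
    return profiles


profiles = denormalize(users, orders)

# A read is now one key lookup, with no JOIN at query time.
print(profiles["u1"])
```

The trade-off is the usual one: reads get cheap, but every write to `orders` means the embedded copy has to be refreshed, which is the kind of upkeep you would want a tool or a well-understood process to handle.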
Use Vendors You Can Trust
I have made much of my career helping folks migrate off Oracle tech. Everyone who has worked in enterprise has horror stories about them. Verizon, too, has been a little spooky. Do your research, no matter what Forbes tells you. Google “[vendor]+[critical verb of your choice]” and see what spine-tingling tales come up around the campfire.
Truth is, most tech will do the job. healthcare.gov could work on PostgreSQL as well as MarkLogic, but it would take different expertise. If you don’t want to develop that expertise, seek out and evaluate expert vendors you can rely on.
You Don’t Have to Be Afraid
No matter the system, it will break. If you don’t understand it, it will confound you, fail you, and wake you up in a cold sweat at 3 am when your phone buzzes with a message reading, “Something broke.”
Only expertise will prevent nightmares. If you need to focus elsewhere, delegate to experts you can trust and rely on. Without that kind of delegation, we’d all be making our own underwear and administering clusters we built by hand.
What a nightmare that would be.