Crafting the Pearl Health Incident Policy

What is an incident policy, and why do we need one?

At Pearl Health, we strive to build rock-solid technology, with excellent uptime and quality. But we all know that technology (and humans) are imperfect. Inevitably, something will go wrong. We’ll suffer an outage, or a key feature will misbehave. The important part is how we react to those problems.

Our incident policy is a tool that we use to make us better at reacting to critical issues. It answers two questions:

In our early stage, we didn’t need formal structure for this. We got by with Slack posts and quick huddles. Everyone knew which problems were urgent, and our process was simple: all hands on deck!

But Pearl Health is growing. We have more engineers, more customers, and our product is more complex. Eventually, we’ll outgrow our ability to handle problems by instinct. We decided to create an incident policy now, before we need it. This gives us a chance to think about our approach, practice it, and correct the mistakes while we’re still small.

What it looks like

Our policy is as simple as possible — the minimum amount of structure that we need, and no more. Specifically:

Incident criteria

The incident criteria list is intentionally short, for two reasons. First, when something goes wrong, we don’t want to spend a lot of time wondering, “Is this an incident?” A shorter list means it’s easier to make that decision and move to the next steps. Second, we only want to page the on-call engineer for the most critical issues.

We built the list by asking ourselves, “Would this problem put our customers or our business at risk?” That was a powerful question, and it helped us filter down to only four specific scenarios:

Regarding PHI leaks: it’s important to state that we take the utmost precautions to keep patient data safe. Pearl Health is HIPAA-compliant, and we’re pursuing HITRUST certification. This is one of those processes that we hope we never need to use. But it would be irresponsible not to define the process in the first place.

SLAs

The “Application is unavailable” criterion was surprisingly hard to define. What counts as “unavailable”? Does it mean the system is completely offline? What if it’s just really slow? What counts as unacceptably slow? What if it works for some users, but not others?

We didn’t want to leave these questions as an exercise for the reader. When something is broken, we need a quick decision about whether to begin the incident process. At the same time, we knew that we couldn’t possibly list every failure scenario in advance.

For now, our approach is to find a middle ground.

First, we set specific application conditions that must always be true. These service-level agreements (SLAs) are based on metrics (such as error rates and the number of customers affected) and thresholds. If any of our SLAs is broken, then it’s an incident.

Second, we included this “escape valve” guidance:

  1. Use your best judgment.
  2. When in doubt, escalate and ask for a second opinion!

Choosing SLAs is hard, but it’s an extremely valuable process. In a future blog post, we’ll explore how Pearl Health handled it.
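
To make the metric-and-threshold idea concrete, here is a minimal sketch of how such a check might look. Everything in it is an assumption for illustration: the `Sla` structure, the metric names, and the threshold values are hypothetical and are not Pearl Health's actual SLAs or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """One condition that must always hold (hypothetical structure)."""
    name: str
    metric: str       # key into the metrics snapshot
    threshold: float  # highest value still considered healthy


# Illustrative values only; these are NOT real Pearl Health SLAs.
EXAMPLE_SLAS = [
    Sla("Error rate", "error_rate", 0.05),
    Sla("Customers affected", "customers_affected", 0),
]


def broken_slas(metrics: dict[str, float], slas: list[Sla]) -> list[str]:
    """Return the names of SLAs whose metric exceeds its threshold.

    In this sketch a missing metric is treated as 0.0 (healthy); a real
    system might prefer to treat missing data as a problem in itself.
    """
    return [s.name for s in slas if metrics.get(s.metric, 0.0) > s.threshold]


def is_incident(metrics: dict[str, float], slas: list[Sla]) -> bool:
    """Any broken SLA means an incident."""
    return bool(broken_slas(metrics, slas))
```

For example, `is_incident({"error_rate": 0.10, "customers_affected": 0}, EXAMPLE_SLAS)` would report an incident because the error rate is above its illustrative threshold. A check like this only covers the first half of the approach. The "use your best judgment" guidance still applies to anything the thresholds miss.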

Incident process

The incident process is simple:
  1. Post in Slack. We have a dedicated #incident channel, including the entire Engineering, Product and Customer Success teams.
  2. Get help immediately! If someone responds right away, great. Otherwise, page the on-call engineer.
  3. Phone or text the CTO. If the CTO doesn’t respond, try the Chief Product Officer, and if she doesn’t respond, try the VP of Product.

Why the last step? If we’re all in the same Slack channel, then why does the CTO need a separate phone call? It’s so that the CTO can handle all the other calls we need to make. For example, our executives will have important questions like, “Which customers are affected? What’s the impact? When will it be fixed?” We want senior leaders to handle those questions, so that the Engineering team can focus on diagnosing and fixing the issue. And in order to do that, senior leaders need to know about the problem right away, even if they’re not looking at Slack.
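
The fallback order in step 3 can be sketched as a simple escalation chain. This is only an illustration under stated assumptions: the `escalate` function and the `contact` callback are hypothetical names, and the callback stands in for whatever phone or text mechanism is actually used.

```python
from typing import Callable, Optional

# Escalation order taken from step 3 of the incident process.
ESCALATION_CHAIN = ["CTO", "Chief Product Officer", "VP of Product"]


def escalate(chain: list[str], contact: Callable[[str], bool]) -> Optional[str]:
    """Try each role in order and return the first one that responds.

    `contact` is a hypothetical callback that phones or texts the given
    role and returns True if that person responded. Returns None if
    nobody in the chain could be reached.
    """
    for role in chain:
        if contact(role):
            return role
    return None
```

With a stub such as `escalate(ESCALATION_CHAIN, lambda role: role != "CTO")`, the call falls through to the Chief Product Officer, mirroring the case where the CTO doesn’t respond.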

Conclusion

Our team of provider-enablement, risk-bearing, and technology experts is thoughtfully building a values-based team to democratize access to healthcare risk, align incentives with patient outcomes to deliver higher-quality care at a lower cost, and make our healthcare system more sustainable.

Interested in learning more? We are hiring across functions, including engineering, product, sales, marketing, customer success, and finance. Check out opportunities on our Careers Page.

Matt Solnit

Chief Technology Officer, Pearl Health