Hoist the Jolly Roger: Setting Sail with Django Waffle
As developers, we’re all too familiar with the challenges of building applications across multiple environments, each with different settings and configurations. Managing feature development can become difficult under such circumstances, but fortunately we have a powerful tool at our disposal: feature flags!
By allowing us to toggle features on and off in different environments based on specific conditions, such as a value stored in a database, these flags help us ensure that users aren’t affected by works-in-progress. Once a feature is ready, it can then be seamlessly enabled in production environments with the push of a button. Awesome!
Here at Pearl Health, we use feature flags in our Django application [1] and leverage the popular Django Waffle library [2] to manage them. This allows us to utilize the built-in Django Admin app as a UI to interact with flags that we store in our database. Our technology platform runs in multiple environments, and we use a different database instance for each environment.
Altogether, this provides us with a simple way to add and update flags:
[Screenshots: Django Waffle UI; Django Waffle flag selection; Django Waffle enable/disable flag]
Simple and safe, right? Wrong! As with many powerful tools, we recently learned the hard lesson that feature flags can unintentionally steer our engineering efforts toward turbulent waters. In this post, I’ll review the timeline of events that led to a breakdown of our feature release due to flag mismanagement (read: I messed up), the steps we took to safely and quickly return to calmer seas, and practical strategies we implemented to avoid making this kind of error moving forward.
Blue Skies and Smooth Sailing
In 2022, more than 800 primary care providers from coast to coast partnered with Pearl Health, resulting in more than 10x year-over-year growth and expanding our network from 10 to 29 states. With this growth came the need to support a 10x demand increase on our technology platform, which in turn required developing several new features that would enable the platform to remain performant while handling larger amounts of traffic and data. Our average practice size grew, too, which meant we needed enhancements to our Panel View feature to handle larger data sets. Finally, we introduced some brand-new features that gave our clients new ways to leverage their patient data.
Our workflow for managing the development of a new feature typically involves the following steps:
- Create a new flag entry in our database, defaulting to “off” for everyone.
- Build the feature in our codebase “behind” a feature flag. When we build behind a flag, we can safely merge to our main branch and deploy, because we leave the flag disabled in production until the feature is ready. Here’s a tiny example of what that might look like in a Django view:
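    # api_view comes from Django REST Framework (assumed from the decorator's signature)
    from rest_framework.decorators import api_view
    from waffle import flag_is_active

    @api_view(["GET"])
    def some_view(request):
        if flag_is_active(request, 'awesome-new-feature-v1.0.0'):
            # your awesome new feature goes here
            ...
        else:
            # your old feature or other application code goes here
            ...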
- Turn on the flag in our development environments and test the feature before release.
- When the feature is production-ready, turn on the flag in our production environment. One of the great things about Waffle is that you can toggle flags on a per-user (or per-role) basis. We often turn the flag on for Pearl Health staff accounts before enabling it for everyone, depending on the complexity of the feature and associated risk.
- Refactor our codebase to remove the check for an active flag. If there’s any deprecated code that we don’t need anymore (for the “old” behavior), then this is where we’ll delete that, too.
- Delete the feature flag from Django.
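The staged rollout in step four can be pictured with a small, self-contained model. This is a simplified stand-in, not Waffle's implementation: `User`, `FlagState`, and `is_active_for` are hypothetical names, and only two of a flag's settings are modeled (the tri-state "everyone" override and the staff option).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """Hypothetical stand-in for a Django user."""
    username: str
    is_staff: bool = False


@dataclass
class FlagState:
    """Hypothetical, simplified model of a flag's rollout settings."""
    everyone: Optional[bool] = None  # True/False overrides everything; None defers to other rules
    staff: bool = False              # when True, staff accounts see the feature


def is_active_for(flag: FlagState, user: User) -> bool:
    # An explicit "everyone" value wins over any narrower rule.
    if flag.everyone is not None:
        return flag.everyone
    # Otherwise, fall back to the staff-only rule.
    return flag.staff and user.is_staff


clinician = User("clinician")
staffer = User("pearl-staff", is_staff=True)

created = FlagState(everyone=False)  # step one: off for everyone
staff_only = FlagState(staff=True)   # step four, first stage: staff accounts only
released = FlagState(everyone=True)  # step four, final stage: on for everyone
```

Moving a flag from `created` to `staff_only` to `released` mirrors the progression described above: nobody sees the feature, then only staff accounts, then every user.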
In developing and releasing features to support our 2022 growth, we set a goal of giving both our existing and new clients a seamless onboarding experience. This required releasing all our new features simultaneously, in sync with our Customer Success team’s product engagement efforts. In short, we required coordinated development and release across multiple independent Pearl teams working on different features.
Thanks to steps one through four of the feature flag process described above, our teams were able to navigate the perilous straits of interconnected projects smoothly, and the new features launched successfully! Our users responded positively and engagement was high.
Shiver Me Timbers: Recovering from a Feature Flag Blunder
About a week after the features were released, we were ready to complete step five of the process, where we remove the check for the feature flag and clean up the now-unused code. However, as we embarked upon this final stretch of our voyage, the skies suddenly darkened. Here’s how things went wrong, and how our crew sprang into action.
At around 6:00 PM, a clinician signed into the Pearl Platform to review their patient panel. Although they’d been using the platform successfully for several days, suddenly they couldn’t view their data! Instead, all they saw was this splash screen:
[Screenshot: splash screen visible to users in the Pearl Platform]
This didn’t make sense, because we’d already imported their data into production and turned on the “real” application.
The practice’s staff reached out to their trusted Pearl Customer Success contact, who immediately posted in our company-wide #bug Slack channel to increase visibility and kick off an investigation. The Customer Success team quickly uncovered that the bug seemed to affect all of our newly enrolled 2023 practices. Our Engineering team jumped in right away to provide technical support.
As a first investigative step, the responding engineer determined that the bug was unlikely to have been caused by changes to our codebase, because no changes had gone out for several hours. Next, the engineer used our database audit log to look for changes to our database (shout-out to Django Easy Audit [3]) and noticed that a key feature flag had been deleted from the database right before the problem started! This meant that when Django checked the database to see if the flag was enabled, instead of finding a “true” or “false” value, it simply couldn’t find the flag at all. And so, of course, it couldn’t verify that the flag was enabled, and the feature behind it was treated as off. It turned out that this was the flag governing whether we showed the “Coming Soon” splash screen or the real user interface.
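The failure mode is easy to reproduce in miniature. Waffle's documentation describes a `WAFFLE_FLAG_DEFAULT` setting, `False` unless overridden, as the value used for a flag that is not defined in the database. The sketch below is a simplified stand-in that uses a plain dictionary in place of the database; `lookup_flag`, `choose_screen`, and the flag name `show-real-app` are made up for illustration.

```python
from typing import Dict

FLAG_DEFAULT = False  # mirrors the documented default of WAFFLE_FLAG_DEFAULT


def lookup_flag(flags: Dict[str, bool], name: str) -> bool:
    # Once the default is applied, a missing row looks exactly like "off".
    return flags.get(name, FLAG_DEFAULT)


def choose_screen(flags: Dict[str, bool]) -> str:
    # "show-real-app" is a hypothetical flag name.
    return "real app" if lookup_flag(flags, "show-real-app") else "coming soon splash"


production_flags = {"show-real-app": True}
before = choose_screen(production_flags)

del production_flags["show-real-app"]  # the accidental deletion
after = choose_screen(production_flags)
```

Nothing raises an error when the flag disappears, which is why the symptom showed up as a changed screen rather than a crash.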
By 6:15 PM ET, just 15 minutes after the incident surfaced, we had restored the missing flag, verified that features worked as expected, and made it through the storm unscathed. Our all-hands-on-deck rapid response meant that our customers could continue using the Pearl Platform without any meaningful interruption, allowing us to not only maintain, but actually build, trust in our ability to deliver a best-in-class technology experience.
Digging for Pearls of Wisdom
With the crisis averted, we turned to the next critical task: understanding what happened and how to avoid similar situations in the future. We learned that there had been a mistake during steps five and six of the process (refactoring the code and deleting the flag from Django). While testing the removal of the feature flag check from our code, I had accidentally deleted the flag from our production database instead of the development database, disabling our new features in production. ‘Twas I who set the platform ablaze.
Lucky for me, the team at Pearl is not only adept at immediately identifying and addressing technical issues to avoid negatively impacting customers, but we also practice a blameless culture that takes mistakes as a chance to learn and improve. After the incident was resolved, the responding engineer and I asked ourselves a series of questions: What went wrong? What could we have done differently to prevent it? What processes could we put in place to ensure this doesn’t happen again?
Rather than pointing fingers, we took a collaborative approach to discuss the root cause of the issue and chart a course to avoid future mistakes. We ultimately recommended two actions:
- Modify Django Admin templates to use a different color scheme in production. This would provide a clear and unambiguous warning that any changes made will affect the production platform, giving developers more visibility into the impact of their actions.
Here’s the best part: after we presented the idea in Slack, one of our engineers took it upon himself to spin up a pull request. This really underscored the supportive and agile culture at Pearl. Now, when one of our employees logs into the production Admin site, they are presented with a clear and unambiguous warning:
And furthermore, each page is adorned with a matching header:
Nice job, team!
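One way to drive this kind of environment-specific styling is a Django context processor, which is just a function that takes the request and returns a dictionary of template variables. The sketch below is an assumption about how such a change could be wired, not the pull request described above: the `APP_ENVIRONMENT` variable, the `admin_environment` function, and the colors are all hypothetical.

```python
import os

# Hypothetical palette: a loud color for production, a calm one elsewhere.
HEADER_COLORS = {"production": "#b00020", "development": "#2e7d32"}


def admin_environment(request):
    """Context processor: expose the environment name and header color to templates."""
    env = os.environ.get("APP_ENVIRONMENT", "development")  # hypothetical variable name
    return {
        "environment_name": env,
        "is_production": env == "production",
        # Unknown environments fall back to the calm development color.
        "admin_header_color": HEADER_COLORS.get(env, HEADER_COLORS["development"]),
    }
```

Once the function is listed under the `context_processors` option of the `TEMPLATES` setting, an overridden admin base template could read `admin_header_color` and show a warning banner whenever `is_production` is true.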
- Add monitoring and alerting to ensure that flags are present where and only where they are intended. By monitoring and setting alerts for feature flag presence using tools like Datadog and Sentry, we can be notified immediately if an expected flag is identified as missing or an unexpected flag is present. That way, we can proactively respond to and mitigate potential issues before any end users are impacted.
While the team hasn’t yet needed to implement this recommendation, we will consider it for our more complicated cross-functional projects moving forward, such as the one that led to the feature flag fiasco described above.
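The core of such a check is a comparison between the flags the code expects and the flags that actually exist. Here is a minimal sketch of that comparison, assuming the expected names are kept in a list alongside the code; `FlagDrift`, `check_flags`, and every flag name other than the one from the earlier view example are hypothetical, and the step that forwards the result to an alerting tool is left out.

```python
from typing import Iterable, List, NamedTuple


class FlagDrift(NamedTuple):
    missing: List[str]     # expected by the code but absent from the database
    unexpected: List[str]  # present in the database but not expected


def check_flags(expected: Iterable[str], actual: Iterable[str]) -> FlagDrift:
    """Compare the flags the code relies on with the flags that actually exist."""
    expected_set, actual_set = set(expected), set(actual)
    return FlagDrift(
        missing=sorted(expected_set - actual_set),
        unexpected=sorted(actual_set - expected_set),
    )


# In a Django project, `actual` might come from a query such as
# Flag.objects.values_list("name", flat=True); here it is a plain list.
drift = check_flags(
    expected=["awesome-new-feature-v1.0.0", "panel-view-v2"],
    actual=["panel-view-v2", "old-experiment"],
)
if drift.missing or drift.unexpected:
    print(f"Flag drift detected: {drift}")  # a real check would raise an alert instead
```

Run on a schedule in each environment, a check like this would have reported the deleted flag as `missing` shortly after it disappeared.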
A World-Class Crew
This incident certainly underscored that simple human errors can cause major disruptions, and even with the best planning, mistakes still occur. However, while careful and attentive work is critical, Pearl’s response also demonstrated the importance of an engineering culture which acknowledges that humans are prone to error, quickly remedies any issues, and continuously invests in measures to minimize risks. I’m glad to work within such a culture; it makes us more resilient and leads to a stronger engineering team.
I hope these suggestions will help you avoid making the same mistake I did and keep your own applications sailing smoothly!
Interested in learning more about Engineering at Pearl? Keep an eye out for opportunities on the Pearl Health Careers Page!
Our Technology
1. Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. For more information, see djangoproject.com
2. Waffle is a feature flipper for Django: you can define the conditions under which a flag should be active and use it in a number of ways. For more information, see Django-Waffle 4.0.0 Documentation
3. Django Easy Audit is an audit log app that allows developers to keep track of every action taken by an application’s users. For more information, see GitHub: Django Easy Audit