Django Zero Downtime Deployment: Best Practices and Implementation Guide
Imagine you're browsing your favorite online store, ready to make a purchase, when suddenly the website crashes. Frustrating, right? In today's fast-paced digital world, users expect websites to be available 24/7 and to load in no more than two seconds; forty percent of users will abandon a site that takes more than three seconds to load. Any downtime, even for a few seconds, can lead to lost sales, damaged reputation, and frustrated customers.
That's where zero downtime deployments come in. This strategy ensures your product stays online and accessible while you're making updates, adding new features, or migrating to a new system. In this blog post, we will discuss how to achieve it in the context of Django applications.
Understanding the Different Types of Downtime
Downtime isn't always a complete blackout. There are two main types:
- Hard Downtime: This is the worst-case scenario. Your product becomes completely inaccessible, displaying error messages or failing to respond at all;
- Soft Downtime: This is more subtle. Your product might be up, but it's slow, unresponsive, or some features are unavailable. This can still frustrate users and drive them away.
Why is Downtime Such a Big Deal?
Think of downtime as a silent killer for your online business. It can have a devastating impact on:
- Your Bottom Line: You lose potential revenue every minute your product is down. Studies indicate that the average cost of downtime is a staggering $9,000 per minute;
- Customer Trust: Frequent outages erode customer confidence. People rely on your product to be there when they need it. If it's constantly unavailable, they'll start looking elsewhere;
- Your Reputation: Downtime can tarnish your brand image. It portrays unreliability and incompetence, making it harder to attract new customers and retain existing ones.
When deployment is done correctly, it can also positively affect:
- Development Velocity: Deployments become less stressful and more frequent, boosting team agility.
- DORA Metrics: Zero downtime directly and positively impacts key DevOps metrics like deployment frequency and change lead time.
Building a Zero Downtime Architecture: The Key Architectural Components
Achieving zero downtime requires a holistic approach, encompassing various layers of your application:
- Load Balancing: Distribute traffic across multiple servers, ensuring that others can handle the load if one server goes down;
- Redundancy: Have backup servers and systems to take over if primary systems fail;
- Feature Flags: Use feature flags to enable or disable functionalities remotely, allowing for gradual rollouts and minimizing risks;
- Health Checks and Monitoring: Monitor your application and infrastructure for any issues;
- Immutable Infrastructure: Treat infrastructure components as immutable, deploying new instances with updates instead of modifying existing ones;
- Circuit Breakers, Caches, Read Replicas: These patterns enhance resilience and performance, helping your application withstand unexpected events;
- Continuous Deployment/Delivery: Automate your deployment processes for faster and more frequent updates, reducing downtime and improving agility.
All these architectural layers are very important to make your deployment flow safer, and several strategies exist for deploying updates without interrupting service. Still, the most common cause of downtime is database migrations, so that's what we'll focus on in the following sections.
Why are database migrations without downtime challenging?
Database migrations come with a few inherent challenges, no matter which migration system you use. Some of these are:
Unexpected table locks
In most databases, some operations require an exclusive lock on the table. An exclusive lock prevents data modification (DML) operations, such as `UPDATE`, `INSERT`, and `DELETE`, while the operation is running. It is important to build your migrations in a way that avoids these locks.
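You can't always avoid a lock entirely, but you can bound how long a migration is allowed to wait for one. Below is a minimal sketch of that idea as a hand-written Django migration; the `shop` app, `shop_product` table, and `discount` column are hypothetical, and the same guard is automated by the `django-pg-zero-downtime-migrations` package covered later:

from django.db import migrations

class Migration(migrations.Migration):
    # Run without a wrapping transaction so SET applies to the
    # following statements on the same connection.
    atomic = False

    dependencies = [("shop", "0001_initial")]

    operations = [
        # Fail fast after 2 seconds instead of queueing behind
        # long-running queries while blocking new reads and writes.
        migrations.RunSQL("SET lock_timeout = '2s';", reverse_sql=migrations.RunSQL.noop),
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD COLUMN "discount" integer NULL;',
            reverse_sql='ALTER TABLE "shop_product" DROP COLUMN "discount";',
        ),
        migrations.RunSQL("SET lock_timeout = DEFAULT;", reverse_sql=migrations.RunSQL.noop),
    ]

If the timeout fires, the migration errors out cleanly and can simply be retried at a quieter moment.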
Synchrony with the Application version
Another common issue is database synchrony with the application version. Usually, we run migrations before deploying, which means the tables will be updated before the new code is in place. That may cause a temporary problem where the old version references database resources that were renamed or that no longer exist.
Long migrations that block the release
When you're doing data migrations or schema migrations with default values in giant tables, your migration may take a long time. This might not only temporarily overload your database but also block the release. While the migration is running, you won't be able to release hotfixes, for instance.
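One way to keep a long data migration from blocking releases is to run the backfill in small batches, each committing on its own. Here is a minimal sketch, assuming a hypothetical `shop` app whose `Product` model gained a nullable `new_field` column in an earlier migration:

from django.db import migrations

BATCH_SIZE = 1000

def backfill_in_batches(apps, schema_editor):
    """Populate new_field in small chunks so no single transaction
    holds row locks on a large slice of the table."""
    Product = apps.get_model("shop", "Product")
    unfilled = Product.objects.filter(new_field__isnull=True)
    while True:
        # The queryset is re-evaluated on each pass and shrinks as rows are filled.
        pks = list(unfilled.values_list("pk", flat=True)[:BATCH_SIZE])
        if not pks:
            break
        Product.objects.filter(pk__in=pks).update(new_field="default-value")

class Migration(migrations.Migration):
    # Let each batch commit separately instead of one giant transaction.
    atomic = False

    dependencies = [("shop", "0002_add_new_field")]

    operations = [
        migrations.RunPython(backfill_in_batches, migrations.RunPython.noop),
    ]

For truly huge tables, consider moving the backfill out of the migration history entirely, into a management command or background job, so it can never block a release.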
Reverse migration
Reverse migrations can be particularly challenging. When a migration is applied, you need a reliable way to roll back changes if something goes wrong. This is complicated when:
- Data dependencies exist between old and new schemas, making it difficult to revert without losing data or breaking application logic;
- Complex operations (like renaming columns or changing types) do not have straightforward rollback paths. Implementing comprehensive migration and reverse migration logic is essential to avoid data loss and ensure application integrity.
Migration conflict due to parallel work
Another significant challenge arises from parallel work on migrations. When multiple developers or teams work on migrations simultaneously, conflicts can occur, leading to:
- Race conditions where two migrations try to alter the same table or column, causing errors or data corruption;
- Versioning issues if one migration is applied while another is still in development, potentially leading to mismatches between the codebase and the database schema.
To mitigate these issues, teams should:
- Coordinate migration efforts and establish clear protocols for managing and reviewing migrations.
- Use features like migration squashing to consolidate multiple migrations into one, reducing complexity and conflict likelihood.
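For example, Django's built-in `squashmigrations` command collapses a range of migrations into a single file (the app label and migration numbers here are hypothetical):

python manage.py squashmigrations shop 0004 0012

The squashed migration stays compatible with databases that already applied the originals, so teammates and environments don't need to be rebuilt.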
By addressing these challenges proactively, teams can navigate the complexities of database migrations and maintain seamless operations.
Writing safe database migrations
Updating your database schema in production takes careful planning. You need to ensure data integrity and avoid any interruptions to your application.
Here are some strategies for safe database migrations:
- Safe Operations: Use database tools and techniques that minimize the risk of data corruption during the migration process;
- Turning Unsafe Operations into Safe Ones: Carefully plan your migration scripts and use techniques like data replication, transactional operations, and breaking the unsafe operation into multiple steps to ensure data consistency;
- Coupled vs. Decoupled Migrations: Based on the complexity and risks involved, decide whether to perform database migrations alongside application deployments or separately before or after the code change.
Why is this incredibly challenging in Django?
The Django migration system generates migrations automatically based on your models. It has built-in support for generating reverse migrations and detecting conflicting ones, together with tooling to merge and squash them.
The generated migrations are usually accurate for simple cases, but their magic is a bit limited. Occasionally, we need to tweak them to achieve the expected results. Understanding the implications of each migration operation is crucial to ensuring that migrations are safe and can be applied with zero downtime.
Here’s a short guide to handling various types of migrations safely.
Understand what you're doing at DB-level
Before running migrations, you should run `django-admin sqlmigrate` to make sure you understand what will happen. The SQL output might give you the insight to validate that you're not introducing unnecessary downtime. By running `python manage.py sqlmigrate {app_label} {expected_migration_number}` you can see the SQL that Django will run to synchronize from the current database state to the expected one.
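For instance, inspecting the second migration of a hypothetical `shop` app looks like this; the output below is illustrative of the shape `sqlmigrate` produces, not taken from a real project:

$ python manage.py sqlmigrate shop 0002
BEGIN;
--
-- Add field discount to product
--
ALTER TABLE "shop_product" ADD COLUMN "discount" integer NULL;
COMMIT;

A plain nullable ADD COLUMN like this is quick. If you instead see a table rewrite, an index build, or a NOT NULL validation, that's your cue to restructure the migration using the guidelines below.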
With that in hand, it's easier to understand whether the operations will cause downtime. The following sections will give you some guidelines to infer that.
Safe Migrations
Certain operations are inherently safe, as they do not interfere with the current data or application logic during migration:
- CREATE SEQUENCE: This operation is safe since business logic shouldn't rely on new sequences during migration. It should run before the code is deployed;
- DROP SEQUENCE: Similar to `CREATE SEQUENCE`, this is also safe, as your logic shouldn't use the dropped sequence. It should run after the code is deployed;
- CREATE TABLE: Safe because your application logic will only interact with this table once the migration is complete. It should run before the code is deployed;
- DROP TABLE: Safe if your logic doesn't interact with the table during the migration. It should run after the code is deployed;
- ALTER TABLE ADD COLUMN: Safe if the new column does not have constraints like `NOT NULL` or `UNIQUE`. It should run before the code is deployed;
- ALTER TABLE DROP COLUMN: This is generally safe; however, ensure that dependencies (e.g., constraints and indexes) are handled beforehand. It should run after the code is deployed.
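As a reference point, the safe form of ADD COLUMN in Django is simply an `AddField` with `null=True`; the app and model names below are hypothetical:

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("shop", "0001_initial")]

    operations = [
        migrations.AddField(
            model_name="product",
            name="discount",
            # Nullable, no default, no unique constraint: a quick
            # catalog change with no table rewrite or full-table scan.
            field=models.IntegerField(null=True),
        ),
    ]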
Risky Migrations and Alternatives
Some migrations pose risks that could lead to application errors if mishandled. Here’s how to mitigate those risks:
- ALTER TABLE RENAME TO: Renaming tables can disrupt business logic, as an active version of the application might have references to the old name while the database has already been updated. To safely rename, use `SeparateDatabaseAndState` to rename the table and create an updatable view under the old name. This allows old and new code to function simultaneously until the old code is phased out (see the sketch after this list);
- ALTER TABLE SET TABLESPACE: This operation can be unsafe. Instead of directly changing the tablespace, consider creating a new table in the desired tablespace and copying the data;
- ALTER TABLE ADD COLUMN SET NOT NULL: Postgres checks that all existing records comply with the NOT NULL constraint, which can be time-consuming. Instead:
  - First, add the column with a `DEFAULT` value.
  - Populate the column with the correct data.
  - Then, set the column to `NOT NULL`.
- ALTER TABLE ADD CONSTRAINT UNIQUE / PRIMARY KEY: These operations are unsafe due to the time required to create indexes. A safer approach:
  - Add the column without constraints.
  - Create the index concurrently.
  - Finally, add the constraint using the newly created index.
- ALTER TABLE ALTER COLUMN TYPE: Changing a column type can be risky. For significant changes, I prefer creating a new column and copying data into it. Some type changes, like widening a varchar, are safe.
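Here is the rename sketch referenced above, assuming a hypothetical `shop` app whose `Product` model is renamed to `CatalogItem`. Django's migration state records the rename without emitting SQL, while the real rename plus a compatibility view run as raw SQL:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("shop", "0003_previous")]

    operations = [
        migrations.SeparateDatabaseAndState(
            # Update Django's idea of the schema only.
            state_operations=[
                migrations.RenameModel(old_name="Product", new_name="CatalogItem"),
            ],
            # Do the actual rename, then keep the old name alive as a view.
            database_operations=[
                migrations.RunSQL(
                    sql=[
                        'ALTER TABLE "shop_product" RENAME TO "shop_catalogitem";',
                        'CREATE VIEW "shop_product" AS SELECT * FROM "shop_catalogitem";',
                    ],
                    reverse_sql=[
                        'DROP VIEW "shop_product";',
                        'ALTER TABLE "shop_catalogitem" RENAME TO "shop_product";',
                    ],
                ),
            ],
        ),
    ]

In Postgres, a simple single-table view like this is automatically updatable, so instances still running the old code can keep reading and writing through the old name; once they're gone, a follow-up migration drops the view.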
Handling NOT NULL Constraints
To make a column `NOT NULL` safely:
- Add a `CHECK` constraint to ensure values are `NOT NULL`, but keep it as `NOT VALID`.
- Validate the constraint, ensuring all records are compliant.
- Alter the column to set it as `NOT NULL`, utilizing the existing `CHECK` constraint to bypass full-table validation.
- Finally, drop the `CHECK` constraint if desired.
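Expressed as a Django migration, the sequence looks roughly like this. Table, column, and constraint names are hypothetical, and the third step relies on Postgres 12+, which uses a validated `CHECK` constraint to skip the full-table scan on `SET NOT NULL`:

from django.db import migrations

class Migration(migrations.Migration):
    # Each statement commits on its own; no single long transaction.
    atomic = False

    dependencies = [("shop", "0004_backfill_discount")]

    operations = [
        # 1. Register the constraint without scanning existing rows.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD CONSTRAINT "discount_not_null" '
            'CHECK ("discount" IS NOT NULL) NOT VALID;',
            reverse_sql='ALTER TABLE "shop_product" DROP CONSTRAINT IF EXISTS "discount_not_null";',
        ),
        # 2. Validation scans the table but takes only a weak lock,
        #    so reads and writes continue.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" VALIDATE CONSTRAINT "discount_not_null";',
            reverse_sql=migrations.RunSQL.noop,
        ),
        # 3. Postgres sees the validated CHECK and skips its own scan.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ALTER COLUMN "discount" SET NOT NULL;',
            reverse_sql='ALTER TABLE "shop_product" ALTER COLUMN "discount" DROP NOT NULL;',
        ),
        # 4. The CHECK constraint is now redundant.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" DROP CONSTRAINT "discount_not_null";',
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]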
Managing Unique Constraints
When enforcing uniqueness:
- Use `CREATE UNIQUE INDEX CONCURRENTLY` instead of adding a unique constraint directly, so reads and writes can continue while the index builds.
- If you must drop the constraint, consider dropping the index concurrently to avoid locks.
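A sketch of that pattern as a Django migration, again with hypothetical names. `CREATE INDEX CONCURRENTLY` refuses to run inside a transaction, hence `atomic = False`:

from django.db import migrations

class Migration(migrations.Migration):
    atomic = False  # CONCURRENTLY is not allowed inside a transaction

    dependencies = [("shop", "0005_previous")]

    operations = [
        # Build the unique index without blocking writes.
        migrations.RunSQL(
            'CREATE UNIQUE INDEX CONCURRENTLY "product_sku_uniq" '
            'ON "shop_product" ("sku");',
            reverse_sql='DROP INDEX CONCURRENTLY IF EXISTS "product_sku_uniq";',
        ),
        # Optionally promote the index to a named constraint; this is a
        # fast, metadata-only change because the index already exists.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD CONSTRAINT "product_sku_uniq" '
            'UNIQUE USING INDEX "product_sku_uniq";',
            reverse_sql='ALTER TABLE "shop_product" DROP CONSTRAINT "product_sku_uniq";',
        ),
    ]

For plain (non-unique) indexes, Django ships `AddIndexConcurrently` and `RemoveIndexConcurrently` in `django.contrib.postgres.operations`, which wrap the same idea.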
Final Considerations
- Always test and time the migrations in a staging environment that mirrors production.
- Use Django's built-in tools to check migrations for potential issues (`python manage.py makemigrations --check`).
- Consider implementing a phased approach for more complex migrations: apply code changes first, followed by the corresponding database migrations.
- Add columns as nullable on large tables.
- When adding `NOT NULL` columns, add a default at the database level manually.
- Keep migrations small.
- Use `django-admin sqlmigrate` to understand what happens at the DB level.
- Temporarily shut down batch processing jobs that operate on tables about to be changed.
- Update rows of large tables in smaller transactional batches.
By understanding these principles and employing the right strategies, you can ensure that your Django migrations are executed safely, minimizing downtime and preventing disruption to your application.
Tools for writing safe migrations in Django
Now that you understand the risks and the possible solutions for writing safe migrations in Django, here is a short guide on tools that help mitigate these issues and how to integrate them into your Django application.
Zero Downtime lib
`django-pg-zero-downtime-migrations` is a powerful tool that modifies the Django PostgreSQL backend to apply migrations with a focus on minimizing locks, enabling zero downtime during schema changes.
Benefits
It avoids table locks, enabling concurrent operations during migrations.
Installation
To install the library, use pip:
pip install django-pg-zero-downtime-migrations
Usage
To enable zero downtime migrations for PostgreSQL, set up the Django backend provided by this package and configure the following settings in your `settings.py`:
DATABASES = {
'default': {
'ENGINE': 'django_zero_downtime_migrations.backends.postgres',
}
}
ZERO_DOWNTIME_MIGRATIONS_LOCK_TIMEOUT = '2s'
ZERO_DOWNTIME_MIGRATIONS_STATEMENT_TIMEOUT = '2s'
ZERO_DOWNTIME_MIGRATIONS_FLEXIBLE_STATEMENT_TIMEOUT = True
ZERO_DOWNTIME_MIGRATIONS_RAISE_FOR_UNSAFE = True
NOTE: This backend brings zero downtime improvements for migrations (schema and `RunSQL` operations, but not for `RunPython` operations). For other purposes, it functions like the standard Django backend.
Differences with Standard Django Backend
This backend provides the same final state as the standard backend, but it uses different mechanisms to avoid table locks. Importantly, it does not use transactions for migrations (except for `RunPython` operations). This design choice helps prevent deadlocks during complex migrations. If a migration fails, you'll need to address the database state manually. It's advisable to keep migration modules as small as possible to facilitate this.
The setting `ZERO_DOWNTIME_MIGRATIONS_IDEMPOTENT_SQL = True` can help automate manual database state fixing by allowing failed migrations to be rerun after issues are addressed.
Settings Overview
- ZERO_DOWNTIME_MIGRATIONS_LOCK_TIMEOUT: Sets a `lock_timeout` for SQL statements that require `ACCESS EXCLUSIVE` locks, so they fail fast instead of queueing behind other queries. The default is `None`.
- ZERO_DOWNTIME_MIGRATIONS_STATEMENT_TIMEOUT: Sets a `statement_timeout` for SQL statements requiring `ACCESS EXCLUSIVE` locks. The default is `None`.
- ZERO_DOWNTIME_MIGRATIONS_FLEXIBLE_STATEMENT_TIMEOUT: Allows `statement_timeout` to be set to `0ms` for long-running operations like index creation. The default is `False`.
- ZERO_DOWNTIME_MIGRATIONS_RAISE_FOR_UNSAFE: If enabled, it prevents the execution of potentially unsafe migrations. The default is `False`.
- ZERO_DOWNTIME_DEFERRED_SQL: Defines how to apply deferred SQL. The default is `True`.
- ZERO_DOWNTIME_MIGRATIONS_IDEMPOTENT_SQL: Enables idempotent mode to skip already applied SQL migrations. The default is `False`.
- ZERO_DOWNTIME_MIGRATIONS_EXPLICIT_CONSTRAINTS_DROP: Determines whether to explicitly drop foreign key and unique constraints before dropping tables or columns. The default is `True`.
- ZERO_DOWNTIME_MIGRATIONS_KEEP_DEFAULT: Controls whether to keep or drop code defaults at the database level when adding a new column. The default is `False` (only applies to Django < 5.0).
Django Safemigrate
`django-safemigrate` enhances Django by adding a `safemigrate` command, which provides finer control over migration execution, allowing you to mark migrations as safe to run before or after code deployment.
Benefits
Prevents accidental execution of unsafe migrations during deployments.
Installation
To use `django-safemigrate`, first install the package:
pip install django-safemigrate
And add it to your `INSTALLED_APPS` in `settings.py`:
INSTALLED_APPS = [
# ...
"django_safemigrate",
]
Marking Migrations
You can designate migrations as safe to run during specific stages of deployment. For example, to mark a migration that adds a column as safe to run before deploying code, you can do the following:
from django.db import migrations
from django_safemigrate import Safe

class Migration(migrations.Migration):
    safe = Safe.before_deploy()
Once marked, you can run the `safemigrate` command, executing only those designated migrations. If there are dependencies on unsafe migrations, the command will fail, preventing potentially dangerous changes.
Running Migrations
After deploying your code, you can run the regular `migrate` command. This setup is ideal for deployment processes like those on Heroku, where safe migrations can be executed automatically upon release promotion.
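On Heroku, that typically means a release-phase entry in the Procfile; here is a minimal sketch (the `myproject` module and the choice of gunicorn are placeholders):

release: python manage.py safemigrate
web: gunicorn myproject.wsgi

The release phase runs after the new build is created but before it starts receiving traffic, which matches the pre-code-deployment slot that `safemigrate` targets; post-deployment migrations still need a separate `migrate` step once the release is live.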
Safety Options
There are three options for the `safe` property:
- `Safe.before_deploy()`: Safe to run before the code is deployed (e.g., adding a new field).
- `Safe.after_deploy(delay=None)`: Safe to run after the code is deployed. You can specify a delay, such as `timedelta(days=7)`, to control when the migration can be executed.
- `Safe.always()`: Safe to run both before and after deployment. This is the default option.
Nonstrict Mode
In development, you may encounter a buildup of migrations between team members. To allow the `safemigrate` command to run without raising errors due to dependencies, enable nonstrict mode by adding the following setting:
SAFEMIGRATE = "nonstrict"
In this mode, `safemigrate` will execute all non-blocked migrations, allowing for flexibility during development.
Disabled Mode
To completely disable the protections of `safemigrate`, use:
SAFEMIGRATE = "disabled"
In this mode, migrations will run as if using the normal `migrate` command, bypassing the safety checks entirely.
Django Deprecate fields
`django-deprecate-fields` is a useful package for safely managing the removal of fields in Django models. It marks them as deprecated before full removal, allowing for gradual codebase updates.
Benefits
Ensures smoother transitions when removing model fields without breaking existing functionality.
Installation
To install the package, run:
pip install django-deprecate-fields
Usage
Consider the following simple model:
from django.db import models
class MyModel(models.Model):
    field1 = models.CharField(max_length=100)  # max_length is required by CharField
    field2 = models.CharField(max_length=100)
To safely remove `field1`, first mark it as deprecated:
from django.db import models
from django_deprecate_fields import deprecate_field
class MyModel(models.Model):
    field1 = deprecate_field(models.CharField(max_length=100))
    field2 = models.CharField(max_length=100)
After marking it, run `makemigrations`. This will change `field1` to be nullable, and any references to it in your code will return `None` (or you can specify a different return value using the `return_instead` argument).
Finally, once the changes are deployed and any lingering references have been addressed, you can safely remove `field1` from the model and run `makemigrations` again to complete the process. If you are using the `safemigrate` lib, you can also execute the two migrations in the same deployment (the first step running pre-code-deployment and the second one post-code-deployment).
This approach helps ensure a smooth transition without breaking existing functionality.
Suggested deployment flow
Zero downtime deployment requires several conditions:
- Multiple application instances are available; the application must remain operational even when one instance is restarted;
- A load balancer is placed in front of the instances;
- The application should work correctly before, during, and after migrations;
- The application should function correctly before, during, and after instance updates;
- Communicate each step of your deployment.
Deployment Steps:
- Do your code checks (lint, build, check migrations, code security checks);
- Apply the pre-code-deployment migrations (`safemigrate` command); roll back the migration and abort the deployment if it fails;
- Deploy the code using a rolling release strategy;
- Apply the post-code-deployment migrations (`migrate` command); roll back the migration if it fails.
CircleCI example:
version: 2.1
orbs:
slack: circleci/slack@4.1.0 # Pin the Slack orb to a version you've tested
commands: # a reusable command with parameters
slack-notification:
parameters:
text:
default: ""
type: string
channel:
default: "#" # Specify your channel
type: string
event:
type: enum
enum: ["fail", "pass", "always"]
default: always
steps:
- slack/notify:
channel: "<<parameters.channel>>"
event: <<parameters.event>>
custom: |
{
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "<<parameters.text>>"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*Author*: $CIRCLE_USERNAME"
}
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {
"type": "plain_text",
"text": "View Workflow"
},
"url": "${CIRCLE_BUILD_URL}"
}
]
}
]
}
jobs:
deploy:
docker:
- image: circleci/python:3.9
steps:
- checkout
# Notify start of code checks
- slack-notification:
channel: "#your-channel"
text: "Starting code checks..."
# Code checks
- run:
name: Lint Code
command: |
flake8 .
- run:
name: Run Tests
command: |
pytest
- run:
name: Check Migrations
command: |
python manage.py makemigrations --check --dry-run
- run:
name: Code Security Checks
command: |
bandit -r .
# Notify start of pre-code-deployment migrations
- slack-notification:
channel: "#your-channel"
text: "Applying pre-code-deployment migrations..."
# Pre-code-deployment migrations
- run:
name: Apply Pre-Code-Deployment Migrations
command: |
python manage.py safemigrate
- slack-notification:
event: fail
channel: "#your-channel"
text: "Pre-code-deployment migration failed."
- run:
when: on_fail
name: Rollback Pre-Code-Deployment
command: |
echo "Rolling back pre-code-deployment migration."
# Notify start of code deployment
- slack-notification:
channel: "#your-channel"
text: "Deploying code..."
# Deploy the code using rolling release strategy
- run:
name: Deploy Code
command: |
git push https://heroku:$HEROKU_API_KEY@git.heroku.com/your-app-name.git HEAD:main
- slack-notification:
event: fail
channel: "#your-channel"
text: "Deployment failed."
- run:
when: on_fail
name: Rollback Deployment
command: |
echo "Rolling back deployment."
# Notify start of post-code-deployment migrations
- slack-notification:
channel: "#your-channel"
text: "Applying post-code-deployment migrations..."
# Post-code-deployment migrations
- run:
name: Apply Post-Code-Deployment Migrations
command: |
python manage.py migrate
- slack-notification:
event: fail
channel: "#your-channel"
text: "Post-code-deployment migration failed."
- run:
when: on_fail
name: Rollback Post-Code-Deployment
command: |
echo "Rolling back post-code-deployment migration."
#Notify start of application restart
- slack-notification:
channel: "#your-channel"
text: "Restarting application instances..."
# Optional: Restart application instances if necessary
- run:
name: Restart Application Instances
command: |
heroku ps:restart -a your-app-name
# Notify successful deployment
- slack-notification:
channel: "#your-channel"
text: "Deployment successful!"
workflows:
version: 2
deploy:
jobs:
- deploy:
context: your-context
If your deployment doesn't satisfy these conditions, consider breaking it into smaller, manageable deployments. You can also use other code release strategies; just remember to allow migrations to be applied both before and after the new code is running in the production environment.
Conclusion: Keeping Your Product Alive and Thriving
Zero downtime is not just a technical goal; it's a strategic imperative. It ensures a positive user experience, builds trust, and allows your business to thrive in today's competitive digital landscape. By embracing these strategies and best practices, you can create a robust and resilient product, deliver a seamless and uninterrupted user experience, and achieve high maturity in your development workflow.