Django Zero Downtime Deployment: Best Practices and Implementation Guide
Imagine you're browsing your favorite online store, ready to make a purchase, when suddenly the website crashes. Frustrating, right? In today's fast-paced digital world, users expect websites to be available 24/7 and to load in no more than two seconds; forty percent of users will abandon a site that takes more than three seconds to load. Any downtime, even for a few seconds, can lead to lost sales, damaged reputation, and frustrated customers.
That's where zero downtime deployments come in. This strategy ensures your product stays online and accessible while you're making updates, adding new features, or migrating to a new system. In this blog post, we will discuss how to achieve it in the context of Django applications.
Understanding the Different Types of Downtime
Downtime isn't always a complete blackout. There are two main types:
- Hard Downtime: This is the worst-case scenario. Your product becomes completely inaccessible, displaying error messages or failing to respond at all;
- Soft Downtime: This is more subtle. Your product might be up, but it's slow, unresponsive, or some features are unavailable. This can still frustrate users and drive them away.
Why is Downtime Such a Big Deal?
Think of downtime as a silent killer for your online business. It can have a devastating impact on:
- Your Bottom Line: You lose potential revenue every minute your product is down. Studies indicate that the average cost of downtime is a staggering $9,000 per minute;
- Customer Trust: Frequent outages erode customer confidence. People rely on your product to be there when they need it. If it's constantly unavailable, they'll start looking elsewhere;
- Your Reputation: Downtime can tarnish your brand image. It portrays unreliability and incompetence, making it harder to attract new customers and retain existing ones.
When deployment is done correctly, it can also positively affect:
- Development Velocity: Deployments become less stressful and more frequent, boosting team agility.
- DORA Metrics: Zero downtime directly and positively impacts key DevOps metrics like deployment frequency and change lead time.
Building a Zero Downtime Architecture: The Key Architectural Components
Achieving zero downtime requires a holistic approach, encompassing various layers of your application:
- Load Balancing: Distribute traffic across multiple servers, ensuring that others can handle the load if one server goes down;
- Redundancy: Have backup servers and systems to take over if primary systems fail;
- Feature Flags: Use feature flags to enable or disable functionalities remotely, allowing for gradual rollouts and minimizing risks;
- Health Checks and Monitoring: Monitor your application and infrastructure for any issues;
- Immutable Infrastructure: Treat infrastructure components as immutable, deploying new instances with updates instead of modifying existing ones;
- Circuit Breakers, Caches, Read Replicas: These patterns enhance resilience and performance, helping your application withstand unexpected events;
- Continuous Deployment/Delivery: Automate your deployment processes for faster and more frequent updates, reducing downtime and improving agility.
All these architectural layers are very important to make your deployment flow safer, and several strategies exist for deploying updates without interrupting service. Still, the most common cause of downtime is database migrations, so that's what we'll focus on in the following sections.
Why are database migrations without downtime challenging?
Database migrations come with a few inherent challenges, no matter which migration system you use. Some of these are:
Unexpected table locks
In most databases, some operations require an exclusive lock on the table. An exclusive lock prevents data modification (DML) operations, such as `UPDATE`, `INSERT`, and `DELETE`, while the operation is running. It is important to build your migrations in a way that avoids these locks.
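You can't always avoid a lock entirely, but you can bound how long a migration is allowed to wait for one. Below is a minimal sketch of that idea as a hand-written Django migration; the `shop` app, `shop_product` table, and `discount` column are hypothetical, and the same guard is automated by the `django-pg-zero-downtime-migrations` package covered later:

from django.db import migrations

class Migration(migrations.Migration):
    # Run without a wrapping transaction so SET applies to the
    # following statements on the same connection.
    atomic = False

    dependencies = [("shop", "0001_initial")]

    operations = [
        # Fail fast after 2 seconds instead of queueing behind
        # long-running queries while blocking new reads and writes.
        migrations.RunSQL("SET lock_timeout = '2s';", reverse_sql=migrations.RunSQL.noop),
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD COLUMN "discount" integer NULL;',
            reverse_sql='ALTER TABLE "shop_product" DROP COLUMN "discount";',
        ),
        migrations.RunSQL("SET lock_timeout = DEFAULT;", reverse_sql=migrations.RunSQL.noop),
    ]

If the timeout fires, the migration errors out cleanly and can simply be retried at a quieter moment.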
Synchrony with the Application version
Another common issue is database synchrony with the application version. Usually, we run migrations before deploying, which means the tables will be updated before the new code is in place. That may cause a temporary problem where the old version references database resources that were renamed or that no longer exist.
Long migrations that block the release
When you're doing data migrations or schema migrations with default values in giant tables, your migration may take a long time. This might not only temporarily overload your database but also block the release. While the migration is running, you won't be able to release hotfixes, for instance.
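One way to keep a long data migration from blocking releases is to run the backfill in small batches, each committing on its own. Here is a minimal sketch, assuming a hypothetical `shop` app whose `Product` model gained a nullable `new_field` column in an earlier migration:

from django.db import migrations

BATCH_SIZE = 1000

def backfill_in_batches(apps, schema_editor):
    """Populate new_field in small chunks so no single transaction
    holds row locks on a large slice of the table."""
    Product = apps.get_model("shop", "Product")
    unfilled = Product.objects.filter(new_field__isnull=True)
    while True:
        # The queryset is re-evaluated on each pass and shrinks as rows are filled.
        pks = list(unfilled.values_list("pk", flat=True)[:BATCH_SIZE])
        if not pks:
            break
        Product.objects.filter(pk__in=pks).update(new_field="default-value")

class Migration(migrations.Migration):
    # Let each batch commit separately instead of one giant transaction.
    atomic = False

    dependencies = [("shop", "0002_add_new_field")]

    operations = [
        migrations.RunPython(backfill_in_batches, migrations.RunPython.noop),
    ]

For truly huge tables, consider moving the backfill out of the migration history entirely, into a management command or background job, so it can never block a release.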
Reverse migration
Reverse migrations can be particularly challenging. When a migration is applied, you need a reliable way to roll back changes if something goes wrong. This is complicated when:
- Data dependencies exist between old and new schemas, making it difficult to revert without losing data or breaking application logic;
- Complex operations (like renaming columns or changing types) do not have straightforward rollback paths. Implementing comprehensive migration and reverse migration logic is essential to avoid data loss and ensure application integrity.
Migration conflict due to parallel work
Another significant challenge arises from parallel work on migrations. When multiple developers or teams work on migrations simultaneously, conflicts can occur, leading to:
- Race conditions where two migrations try to alter the same table or column, causing errors or data corruption;
- Versioning issues if one migration is applied while another is still in development, potentially leading to mismatches between the codebase and the database schema.
To mitigate these issues, teams should:
- Coordinate migration efforts and establish clear protocols for managing and reviewing migrations.
- Use features like migration squashing to consolidate multiple migrations into one, reducing complexity and conflict likelihood.
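For example, Django's built-in `squashmigrations` command collapses a range of migrations into a single file (the app label and migration numbers here are hypothetical):

python manage.py squashmigrations shop 0004 0012

The squashed migration stays compatible with databases that already applied the originals, so teammates and environments don't need to be rebuilt.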
By addressing these challenges proactively, teams can navigate the complexities of database migrations and maintain seamless operations.
Writing safe database migrations
Updating your database schema in production takes careful planning. You need to ensure data integrity and avoid any interruptions to your application.
Here are some strategies for safe database migrations:
- Safe Operations: Use database tools and techniques that minimize the risk of data corruption during the migration process;
- Turning Unsafe Operations into Safe Ones: Carefully plan your migration scripts and use techniques like data replication, transactional operations, and breaking the unsafe operation into multiple steps to ensure data consistency;
- Coupled vs. Decoupled Migrations: Based on the complexity and risks involved, decide whether to perform database migrations alongside application deployments or separately before or after the code change.
Why is this incredibly challenging in Django?
The Django migration system generates migrations automatically based on your models. It has built-in support for generating reverse migrations and detecting conflicting ones, together with tooling to merge and squash them.
The generated migrations are usually accurate for simple cases, but their magic is a bit limited. Occasionally, we need to tweak them to achieve the expected results. Understanding the implications of each migration operation is crucial to ensuring that migrations are safe and can be applied with zero downtime.
Here’s a short guide to handling various types of migrations safely.
Understand what you're doing at DB-level
Before running migrations, you should run `django-admin sqlmigrate` to make sure you understand what will happen. The SQL output might give you the insight to validate that you're not introducing unnecessary downtime. By running `python manage.py sqlmigrate {app_label} {expected_migration_number}` you can see the SQL that Django will run to synchronize from the current database state to the expected one.
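For instance, inspecting the second migration of a hypothetical `shop` app looks like this; the output below is illustrative of the shape `sqlmigrate` produces, not taken from a real project:

$ python manage.py sqlmigrate shop 0002
BEGIN;
--
-- Add field discount to product
--
ALTER TABLE "shop_product" ADD COLUMN "discount" integer NULL;
COMMIT;

A plain nullable ADD COLUMN like this is quick. If you instead see a table rewrite, an index build, or a NOT NULL validation, that's your cue to restructure the migration using the guidelines below.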
With that in hand, it's easier to understand whether the operations will cause downtime. The following sections will give you some guidelines to infer that.
Safe Migrations
Certain operations are inherently safe, as they do not interfere with the current data or application logic during migration:
- CREATE SEQUENCE: This operation is safe since business logic shouldn't rely on new sequences during migration. It should run before the code is deployed;
- DROP SEQUENCE: Similar to `CREATE SEQUENCE`, this is also safe, as your logic shouldn't use the dropped sequence. It should run after the code is deployed;
- CREATE TABLE: Safe because your application logic will only interact with this table once the migration is complete. It should run before the code is deployed;
- DROP TABLE: Safe if your logic doesn't interact with the table during the migration. It should run after the code is deployed;
- ALTER TABLE ADD COLUMN: Safe if the new column does not have constraints like `NOT NULL` or `UNIQUE`. It should run before the code is deployed;
- ALTER TABLE DROP COLUMN: This is generally safe; however, ensure that dependencies (e.g., constraints and indexes) are handled beforehand. It should run after the code is deployed.
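As a reference point, the safe form of ADD COLUMN in Django is simply an `AddField` with `null=True`; the app and model names below are hypothetical:

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("shop", "0001_initial")]

    operations = [
        migrations.AddField(
            model_name="product",
            name="discount",
            # Nullable, no default, no unique constraint: a quick
            # catalog change with no table rewrite or full-table scan.
            field=models.IntegerField(null=True),
        ),
    ]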
Risky Migrations and Alternatives
Some migrations pose risks that could lead to application errors if mishandled. Here’s how to mitigate those risks:
- ALTER TABLE RENAME TO: Renaming tables can disrupt business logic, as an active version of the application might have references to the old name while the database has already been updated. To safely rename, use `SeparateDatabaseAndState` to rename the table and create an updatable view under the old name. This allows old and new code to function simultaneously until the old code is phased out (see the sketch after this list);
- ALTER TABLE SET TABLESPACE: This operation can be unsafe. Instead of directly changing the tablespace, consider creating a new table in the desired tablespace and copying the data;
- ALTER TABLE ADD COLUMN SET NOT NULL: Postgres checks that all existing records comply with the NOT NULL constraint, which can be time-consuming. Instead:
  - First, add the column with a `DEFAULT` value.
  - Populate the column with the correct data.
  - Then, set the column to `NOT NULL`.
- ALTER TABLE ADD CONSTRAINT UNIQUE / PRIMARY KEY: These operations are unsafe due to the time required to create indexes. A safer approach:
  - Add the column without constraints.
  - Create the index concurrently.
  - Finally, add the constraint using the newly created index.
- ALTER TABLE ALTER COLUMN TYPE: Changing a column type can be risky. For significant changes, I prefer creating a new column and copying data into it. Some type changes, like widening a varchar, are safe.
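Here is the rename sketch referenced above, assuming a hypothetical `shop` app whose `Product` model is renamed to `CatalogItem`. Django's migration state records the rename without emitting SQL, while the real rename plus a compatibility view run as raw SQL:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("shop", "0003_previous")]

    operations = [
        migrations.SeparateDatabaseAndState(
            # Update Django's idea of the schema only.
            state_operations=[
                migrations.RenameModel(old_name="Product", new_name="CatalogItem"),
            ],
            # Do the actual rename, then keep the old name alive as a view.
            database_operations=[
                migrations.RunSQL(
                    sql=[
                        'ALTER TABLE "shop_product" RENAME TO "shop_catalogitem";',
                        'CREATE VIEW "shop_product" AS SELECT * FROM "shop_catalogitem";',
                    ],
                    reverse_sql=[
                        'DROP VIEW "shop_product";',
                        'ALTER TABLE "shop_catalogitem" RENAME TO "shop_product";',
                    ],
                ),
            ],
        ),
    ]

In Postgres, a simple single-table view like this is automatically updatable, so instances still running the old code can keep reading and writing through the old name; once they're gone, a follow-up migration drops the view.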
Handling NOT NULL Constraints
To make a column `NOT NULL` safely:
- Add a `CHECK` constraint to ensure values are `NOT NULL`, but keep it as `NOT VALID`.
- Validate the constraint, ensuring all records are compliant.
- Alter the column to set it as `NOT NULL`, utilizing the existing `CHECK` constraint to bypass full-table validation.
- Finally, drop the `CHECK` constraint if desired.
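Expressed as a Django migration, the sequence looks roughly like this. Table, column, and constraint names are hypothetical, and the third step relies on Postgres 12+, which uses a validated `CHECK` constraint to skip the full-table scan on `SET NOT NULL`:

from django.db import migrations

class Migration(migrations.Migration):
    # Each statement commits on its own; no single long transaction.
    atomic = False

    dependencies = [("shop", "0004_backfill_discount")]

    operations = [
        # 1. Register the constraint without scanning existing rows.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD CONSTRAINT "discount_not_null" '
            'CHECK ("discount" IS NOT NULL) NOT VALID;',
            reverse_sql='ALTER TABLE "shop_product" DROP CONSTRAINT IF EXISTS "discount_not_null";',
        ),
        # 2. Validation scans the table but takes only a weak lock,
        #    so reads and writes continue.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" VALIDATE CONSTRAINT "discount_not_null";',
            reverse_sql=migrations.RunSQL.noop,
        ),
        # 3. Postgres sees the validated CHECK and skips its own scan.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ALTER COLUMN "discount" SET NOT NULL;',
            reverse_sql='ALTER TABLE "shop_product" ALTER COLUMN "discount" DROP NOT NULL;',
        ),
        # 4. The CHECK constraint is now redundant.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" DROP CONSTRAINT "discount_not_null";',
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]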
Managing Unique Constraints
When enforcing uniqueness:
- Use `CREATE UNIQUE INDEX CONCURRENTLY` instead of adding a unique constraint directly, so reads and writes can continue while the index builds.
- If you must drop the constraint, consider dropping the index concurrently to avoid locks.
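A sketch of that pattern as a Django migration, again with hypothetical names. `CREATE INDEX CONCURRENTLY` refuses to run inside a transaction, hence `atomic = False`:

from django.db import migrations

class Migration(migrations.Migration):
    atomic = False  # CONCURRENTLY is not allowed inside a transaction

    dependencies = [("shop", "0005_previous")]

    operations = [
        # Build the unique index without blocking writes.
        migrations.RunSQL(
            'CREATE UNIQUE INDEX CONCURRENTLY "product_sku_uniq" '
            'ON "shop_product" ("sku");',
            reverse_sql='DROP INDEX CONCURRENTLY IF EXISTS "product_sku_uniq";',
        ),
        # Optionally promote the index to a named constraint; this is a
        # fast, metadata-only change because the index already exists.
        migrations.RunSQL(
            'ALTER TABLE "shop_product" ADD CONSTRAINT "product_sku_uniq" '
            'UNIQUE USING INDEX "product_sku_uniq";',
            reverse_sql='ALTER TABLE "shop_product" DROP CONSTRAINT "product_sku_uniq";',
        ),
    ]

For plain (non-unique) indexes, Django ships `AddIndexConcurrently` and `RemoveIndexConcurrently` in `django.contrib.postgres.operations`, which wrap the same idea.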
Final Considerations
- Always test and time the migrations in a staging environment that mirrors production.
- Use Django's built-in tools to check migrations for potential issues (`python manage.py makemigrations --check`).
- Consider implementing a phased approach for more complex migrations: apply code changes first, followed by the corresponding database migrations.
- Add columns as nullable on large tables.
- When adding `NOT NULL` columns, add a default at the database level manually.
- Keep migrations small.
- Use `django-admin sqlmigrate` to understand what happens at the DB level.
- Temporarily shut down batch processing jobs that operate on tables about to be changed.
- Update rows of large tables in smaller transactional batches.
By understanding these principles and employing the right strategies, you can ensure that your Django migrations are executed safely, minimizing downtime and preventing disruption to your application.
Tools for writing safe migrations in Django
Now that you understand the risks and the possible solutions for writing safe migrations in Django, here is a short guide on tools that help mitigate these issues and how to integrate them into your Django application.
Zero Downtime lib
`django-pg-zero-downtime-migrations` is a powerful tool that modifies the Django PostgreSQL backend to apply migrations with a focus on minimizing locks, enabling zero downtime during schema changes.
Benefits
It avoids table locks, enabling concurrent operations during migrations.
Installation
To install the library, use pip:
pip install django-pg-zero-downtime-migrations
Usage
To enable zero downtime migrations for PostgreSQL, set up the Django backend provided by this package and configure the following settings in your `settings.py`:
DATABASES = {
'default': {
'ENGINE': 'django_zero_downtime_migrations.backends.postgres',
}
}
ZERO_DOWNTIME_MIGRATIONS_LOCK_TIMEOUT = '2s'
ZERO_DOWNTIME_MIGRATIONS_STATEMENT_TIMEOUT = '2s'
ZERO_DOWNTIME_MIGRATIONS_FLEXIBLE_STATEMENT_TIMEOUT = True
ZERO_DOWNTIME_MIGRATIONS_RAISE_FOR_UNSAFE = True
NOTE: This backend brings zero downtime improvements for migrations (schema and `RunSQL` operations, but not for `RunPython` operations). For other purposes, it functions like the standard Django backend.
Differences with Standard Django Backend
This backend provides the same final state as the standard backend, but it uses different mechanisms to avoid table locks. Importantly, it does not use transactions for migrations (except for `RunPython` operations). This design choice helps prevent deadlocks during complex migrations. If a migration fails, you'll need to address the database state manually. It's advisable to keep migration modules as small as possible to facilitate this.
The setting `ZERO_DOWNTIME_MIGRATIONS_IDEMPOTENT_SQL = True` can help automate manual database state fixing by allowing failed migrations to be rerun after issues are addressed.
Settings Overview
- ZERO_DOWNTIME_MIGRATIONS_LOCK_TIMEOUT: Sets a `lock_timeout` for SQL statements that require `ACCESS EXCLUSIVE` locks, so they fail fast instead of queueing behind other queries. The default is `None`.
- ZERO_DOWNTIME_MIGRATIONS_STATEMENT_TIMEOUT: Sets a `statement_timeout` for SQL statements requiring `ACCESS EXCLUSIVE` locks. The default is `None`.
- ZERO_DOWNTIME_MIGRATIONS_FLEXIBLE_STATEMENT_TIMEOUT: Allows `statement_timeout` to be set to `0ms` for long-running operations like index creation. The default is `False`.
- ZERO_DOWNTIME_MIGRATIONS_RAISE_FOR_UNSAFE: If enabled, it prevents the execution of potentially unsafe migrations. The default is `False`.
- ZERO_DOWNTIME_DEFERRED_SQL: Defines how to apply deferred SQL. The default is `True`.
- ZERO_DOWNTIME_MIGRATIONS_IDEMPOTENT_SQL: Enables idempotent mode to skip already applied SQL migrations. The default is `False`.
- ZERO_DOWNTIME_MIGRATIONS_EXPLICIT_CONSTRAINTS_DROP: Determines whether to explicitly drop foreign key and unique constraints before dropping tables or columns. The default is `True`.
- ZERO_DOWNTIME_MIGRATIONS_KEEP_DEFAULT: Controls whether to keep or drop code defaults at the database level when adding a new column. The default is `False` (only applies to Django < 5.0).
Django Safemigrate
`django-safemigrate` enhances Django by adding a `safemigrate` command, which provides finer control over migration execution, allowing you to mark migrations as safe to run before or after code deployment.
Benefits
Prevents accidental execution of unsafe migrations during deployments.
Installation
To use `django-safemigrate`, first install the package:
pip install django-safemigrate
And add it to your `INSTALLED_APPS` in `settings.py`:
INSTALLED_APPS = [
# ...
"django_safemigrate",
]
Marking Migrations
You can designate migrations as safe to run during specific stages of deployment. For example, to mark a migration that adds a column as safe to run before deploying code, you can do the following:
from django.db import migrations
from django_safemigrate import Safe

class Migration(migrations.Migration):
    safe = Safe.before_deploy()
Once marked, you can run the `safemigrate` command, executing only those designated migrations. If there are dependencies on unsafe migrations, the command will fail, preventing potentially dangerous changes.
Running Migrations
After deploying your code, you can run the regular `migrate` command. This setup is ideal for deployment processes like those on Heroku, where safe migrations can be executed automatically upon release promotion.
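On Heroku, that typically means a release-phase entry in the Procfile; here is a minimal sketch (the `myproject` module and the choice of gunicorn are placeholders):

release: python manage.py safemigrate
web: gunicorn myproject.wsgi

The release phase runs after the new build is created but before it starts receiving traffic, which matches the pre-code-deployment slot that `safemigrate` targets; post-deployment migrations still need a separate `migrate` step once the release is live.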
Safety Options
There are three options for the `safe` property:
- `Safe.before_deploy()`: Safe to run before the code is deployed (e.g., adding a new field).
- `Safe.after_deploy(delay=None)`: Safe to run after the code is deployed. You can specify a delay, such as `timedelta(days=7)`, to control when the migration can be executed.
- `Safe.always()`: Safe to run both before and after deployment. This is the default option.
Nonstrict Mode
In development, you may encounter a buildup of migrations between team members. To allow the `safemigrate` command to run without raising errors due to dependencies, enable nonstrict mode by adding the following setting:
SAFEMIGRATE = "nonstrict"
In this mode, `safemigrate` will execute all non-blocked migrations, allowing for flexibility during development.
Disabled Mode
To completely disable the protections of `safemigrate`, use:
SAFEMIGRATE = "disabled"
In this mode, migrations will run as if using the normal `migrate` command, bypassing the safety checks entirely.
Django Deprecate fields
`django-deprecate-fields` is a useful package for safely managing the removal of fields in Django models. It marks them as deprecated before full removal, allowing for gradual codebase updates.
Benefits
Ensures smoother transitions when removing model fields without breaking existing functionality.
Installation
To install the package, run:
pip install django-deprecate-fields
Usage
Consider the following simple model:
from django.db import models
class MyModel(models.Model):
    field1 = models.CharField(max_length=100)  # max_length is required by CharField
    field2 = models.CharField(max_length=100)
To safely remove `field1`, first mark it as deprecated:
from django.db import models
from django_deprecate_fields import deprecate_field
class MyModel(models.Model):
    field1 = deprecate_field(models.CharField(max_length=100))
    field2 = models.CharField(max_length=100)
After marking it, run `makemigrations`. This will change `field1` to be nullable, and any references to it in your code will return `None` (or you can specify a different return value using the `return_instead` argument).
Finally, once the changes are deployed and any lingering references have been addressed, you can safely remove `field1` from the model and run `makemigrations` again to complete the process. If you are using the `safemigrate` lib, you can also execute the two migrations in the same deployment (the first step running pre-code-deployment and the second one post-code-deployment).
This approach helps ensure a smooth transition without breaking existing functionality.
Suggested deployment flow
Zero downtime deployment requires several conditions:
- Multiple application instances are available; the application must remain operational even when one instance is restarted;
- A load balancer is placed in front of the instances;
- The application should work correctly before, during, and after migrations;
- The application should function correctly before, during, and after instance updates;
- Communicate each step of your deployment.
Deployment Steps:
- Do your code checks (lint, build, check migrations, code security checks);
- Apply the pre-code-deployment migrations (`safemigrate` command); roll back the migration and abort the deployment if it fails;
- Deploy the code using a rolling release strategy;
- Apply the post-code-deployment migrations (`migrate` command); roll back the migration if it fails.
CircleCI example:
version: 2.1
orbs:
slack: circleci/slack@4.1.0 # Pin the Slack orb to a version you've tested
commands: # a reusable command with parameters
slack-notification:
parameters:
text:
default: ""
type: string
channel:
default: "#" # Specify your channel
type: string
event:
type: enum
enum: ["fail", "pass", "always"]
default: always
steps:
- slack/notify:
channel: "<<parameters.channel>>"
event: <<parameters.event>>
custom: |
{
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "<<parameters.text>>"
}
},
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*Author*: $CIRCLE_USERNAME"
}
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {
"type": "plain_text",
"text": "View Workflow"
},
"url": "${CIRCLE_BUILD_URL}"
}
]
}
]
}
jobs:
deploy:
docker:
- image: circleci/python:3.9
steps:
- checkout
# Notify start of code checks
- slack-notification:
channel: "#your-channel"
text: "Starting code checks..."
# Code checks
- run:
name: Lint Code
command: |
flake8 .
- run:
name: Run Tests
command: |
pytest
- run:
name: Check Migrations
command: |
python manage.py makemigrations --check --dry-run
- run:
name: Code Security Checks
command: |
bandit -r .
# Notify start of pre-code-deployment migrations
- slack-notification:
channel: "#your-channel"
text: "Applying pre-code-deployment migrations..."
# Pre-code-deployment migrations
- run:
name: Apply Pre-Code-Deployment Migrations
command: |
python manage.py safemigrate
- slack-notification:
event: fail
channel: "#your-channel"
text: "Pre-code-deployment migration failed."
- run:
when: on_fail
name: Rollback Pre-Code-Deployment
command: |
echo "Rolling back pre-code-deployment migration."
# Notify start of code deployment
- slack-notification:
channel: "#your-channel"
text: "Deploying code..."
# Deploy the code using rolling release strategy
- run:
name: Deploy Code
command: |
git push https://heroku:$HEROKU_API_KEY@git.heroku.com/your-app-name.git HEAD:main
- slack-notification:
event: fail
channel: "#your-channel"
text: "Deployment failed."
- run:
when: on_fail
name: Rollback Deployment
command: |
echo "Rolling back deployment."
# Notify start of post-code-deployment migrations
- slack-notification:
channel: "#your-channel"
text: "Applying post-code-deployment migrations..."
# Post-code-deployment migrations
- run:
name: Apply Post-Code-Deployment Migrations
command: |
python manage.py migrate
- slack-notification:
event: fail
channel: "#your-channel"
text: "Post-code-deployment migration failed."
- run:
when: on_fail
name: Rollback Post-Code-Deployment
command: |
echo "Rolling back post-code-deployment migration."
#Notify start of application restart
- slack-notification:
channel: "#your-channel"
text: "Restarting application instances..."
# Optional: Restart application instances if necessary
- run:
name: Restart Application Instances
command: |
heroku ps:restart -a your-app-name
# Notify successful deployment
- slack-notification:
channel: "#your-channel"
text: "Deployment successful!"
workflows:
version: 2
deploy:
jobs:
- deploy:
context: your-context
If your deployment doesn't satisfy these conditions, consider breaking it into smaller, manageable deployments. You can also use other code release strategies; just remember to allow migrations to be applied both before and after the new code is running in the production environment.
Conclusion: Keeping Your Product Alive and Thriving
Zero downtime is not just a technical goal; it's a strategic imperative. It ensures a positive user experience, builds trust, and allows your business to thrive in today's competitive digital landscape. By embracing these strategies and best practices, you can create a robust and resilient product, deliver a seamless and uninterrupted user experience, and achieve high maturity in your development workflow.