Running a complex query using Window, When, Case and SQCount, and more
The context
In a project we're currently working on, one of the features is used to manage ads campaigns and their posts. Posts are the ads itself, that are aired on various platforms (such as Google or Facebook). These posts are grouped under campaigns, that can be paid or not.
The core models of this part of the system are the following:
class Campaign(models.Model):
...
class Post(models.Model):
...
post_template = models.ForeignKey("social.PostTemplate", on_delete=models.CASCADE)
...
class PostTemplate(models.Model):
...
campaign = models.OneToOneField(Campaign, null=False, on_delete=models.CASCADE)
...
class PaidPostTemplate(PostTemplate):
...
total_budget = models.DecimalField(max_digits=10, decimal_places=2)
...
Post
objects have an 1:N relationship (foreign key) with PostTemplate
objects, meaning that a template may have multiple posts under its umbrella, but a given post can only be related to a single template. Since PostTemplate
has an 1:1 relationship with the Campaign
model, it’s safe to say that all posts that share a same template are part of the same campaign.
data:image/s3,"s3://crabby-images/7a5ef/7a5efbbac33ff80ed6cf1d310f080231a7455980" alt=""
As said before, campaigns can be paid or not. For handling paid campaigns, we have the PaidPostTemplate
model extending the base PostTemplate
and adding a total_budget
field to represent how much is available for a given campaign.
The challenge
Our task consisted of adding a new visualization to the dashboard where users can manage campaigns and posts. This visualization consisted of adding a field representing how much of budget was available for a given set of posts or campaigns. This info should show up individually for each entity (posts/campaigns), but also on the metrics panel that would contain data from the selected set, inside a given time range.
The data for all these visualizations is calculated in a single source, which receives a list of posts and then aggregate them all to calculate the metrics for that given set.
For this post, we’ll only consider the challenge of displaying the budgets of the posts of a single selected campaign.
The wrong approach
aggregates = (
posts.aggregate(
total_budget=Sum("post_template__paidposttemplate__total_budget"),
)
)
We initially thought that the solution would be as simple as to sum the budgets of all posts from a campaign, getting then through the relationship with the post template.
However, that proven to not being the correct solution. Actually, the total_budget
field on a PostTemplate
represented the budget of the whole campaign, so we’re effectively doing some miracle of budget multiplication there. We had to divide the budget that came from a post by the number of “sibling” posts of the same campaign.
The naive (and the “don’t do it”) approach
Before pointing out the naive approach, let’s see what we were trying to do inside the aggregation to divide the budget values.
aggregates = (
posts.annotate(
same_post_template_post_count=Subquery(
Post.objects.filter(post_template_id=OuterRef("post_template_id")).count()
),
).aggregate(
total_budget=Sum(
F("post_template__paidposttemplate__total_budget") / F("same_post_template_post_count")
),
)
)
With this approach, we wanted to check, for each post, how many “siblings” it had through the annotation, and then in the aggregation we would obtain the budget per post and sum it back in the accumulated value. However, this approach was raising SQL errors, with the code complaining that we were trying to use an OuterRef
outside a Subquery
, which seemed illogical.
We then decided to try something else. Instead of calculating this information in the aggregation, we could calculate it through Python and then add it back to the aggregation dictionary. The implementation for it was actually correct (at least in the automated tests), but when testing against the local database, we stopped waiting for the results after the request was still hanging after 2 minutes.
This was a also a good example of correct solutions (in terms of output) not being acceptable solutions (in terms of non-functional requirements).
After this, we found out what was going on with the subquery error we mentioned before. Django’s queryset count
method was actually causing trouble there, as it was immediately running the query, making the code think that the OuterRef
was not inside a subquery anymore. The solution for this issue was to use a custom SQCount
operator.
class SQCount(models.Subquery):
template = "(SELECT COUNT(*) FROM (%(subquery)s) %(count_variable_name)s)"
output_field = models.PositiveIntegerField()
def __init__(self, subquery, *args, **kwargs):
count_variable_name = kwargs.pop("count_variable_name", "_count")
super().__init__(subquery.values_list("id", flat=True), *args, **kwargs)
if hasattr(self, "extra") and isinstance(self.extra, dict):
self.extra["count_variable_name"] = count_variable_name
So, instead of doing Subquery(Post.objects.filter(post_template_id=OuterRef("post_template_id")).count())
, we would instead run SQCount(Post.objects.filter(post_template_id=OuterRef("post_template_id"))
. That actually worked and the tests passed. However, it felt in the same pitfall as the “Python-only” approach, as the request was taking a long time to return something in the manual testing.
Since we’re basically running an extra query for every post in the initial set, we’re incurring into the N+1 problem.
The “it looked right, but it wasn’t” approach
We then remembered that PostgreSQL offers us something called “window function”. It’s an useful tool to count how many objects in a query result are related to others in some way (i.e.: through the same post template id), by creating partitions. This way, we could discover how many items were related without having to rely on extra queries. We had used this approach before in other project, so we knew it could work for that case and it should be resource efficient.
aggregates = (
posts.annotate(
same_post_template_post_count=Window(
Count("post_template_id"), partition_by=[F("post_template_id")]
),
).aggregate(
total_budget=Sum(
F("post_template__paidposttemplate__total_budget") / F("same_post_template_post_count")
),
)
)
This approach not only passed the tests, but also didn’t impact performance at all. The results from our manual tests also were correct. We’ve cracked the case! We’ve sent this solution to our customer and moved on to work on the next priority. However, there was a hidden issue (actually, not very hidden) in this solution…
As said before, this approach was used not only for aggregating data from a single campaign, but also from any random set of posts that could come from any campaign. This meant that, i.e.: if “Campaign A” had 3 posts, not necessarily the queryset would have all 3 posts. As the window function only inspected the queryset, we wouldn’t be getting the correct number of “sibling” posts all the time. In fact, we’d be always getting the full campaign budget everytime, even if only one of its posts were present. So we had to scrap everything and get back to the drawing board.
The approach that finally worked
After thinking a lot, a possible solution came up. What if we combined everything we tried before and check if it’d work? It actually worked, with the implementation we made being in the snippet below.
aggregates = (
posts.annotate(
visible_posts_count=Window(
Count("post_template_id"), partition_by=[F("post_template_id")]
),
)
.annotate(
visible_campaign_budget=Case(
When(
pk=Window(Max("id"), partition_by=[F("post_template_id")]),
then=(
(
F("post_template__paidposttemplate__total_budget")
* F("visible_posts_count")
)
/ SQCount(
Post.objects.filter(post_template_id=OuterRef("post_template_id"))
)
),
),
default=Value(None),
),
)
.aggregate(
total_budget=Sum("visible_campaign_budget"),
)
)
Ok, what’s going on here?
The first part of this statement uses the same window function from before, but now we correctly label it as the number of posts that are related through their campaign that are in the queryset. The second part of this query is the tricky one.
The inner part of this annotation is pretty much the same as the one we tried before (but had to skip due to it taking long to complete), with the difference here being how many times this query will be executed. Since the major blocker we’re facing was that the number of times it was executed being responsible for degrading performance, the idea here was to run this query once per campaign. In order to achieve that, we had to combine window functions, conditionals, subqueries and aggregation functions.
Since we are running this only once per campaign, we had to “simulate” that the query is running for every post in the queryset. To achieve that, we had to multiply the result of the budget / campaign_post_count
expression by the number of visible related posts in the queryset (which we obtained in the first annotation).
In order to run only once per campaign, we used PostgreSQL conditionals, in the form of When/Case
. This behaves similarly to the switch statements from languages such as C++, where we define a set of conditions that we’ll try to match against some data and then run some procedure, and also specify a default procedure in case no condition matched (in the end it’s pretty much an if-else block).
For our scenario, we would have a condition that would check if the id of a post was the maximum value inside a window that consists of the ids of posts from the same campaign. If it matched that condition, it’d then return the result of the (budget * visible_campaign_post_count) / campaign_post_count
. If it didn’t match, it would return a null value that wouldn’t impact in the final aggregate sum.