Deduplicating rows in multi-table JOINs

I recently came across a problem in which I needed to retrieve data from multiple joined tables. While I was using PostgreSQL, the problem applies to relational databases in general. Joining tables results in duplication of data in the resultset, and one needs to extract the data manually or use an ORM.

Background

PostgreSQL has powerful JSON features that make it possible to get results as JSON, which can be fed to a JSON parser. JSON supports nested structures, unlike database resultsets. This means that one can write queries like the following, to get results as a JSON object:

1
SELECT ROW_TO_JSON(tbl) FROM tbl

It's also easy to get data from multiple tables in a query:

1
2
3
SELECT ROW_TO_JSON(parent), JSONB_AGG(child_tbl)
FROM parent
LEFT JOIN child_tbl ON child_tbl.parent_id = parent.id

The result for this query has one column per table, and the child tables belonging to each parent row are combined into JSON arrays.

Problem

Recently I encountered a situation where one parent table was being joined to multiple child tables:

1
2
3
4
SELECT ROW_TO_JSON(parent), JSONB_AGG(child_1), JSONB_AGG(child_2)
FROM parent
LEFT JOIN child_1 ON child_1.parent_id = parent.id
LEFT JOIN child_2 ON child_2.parent_id = parent.id

This caused duplication inside the generated JSON array, if child_1 had multiple rows for a given parent. Additionally, a WHERE clause was being used to select only a small fraction of the rows from parent, so queries that did sequential scans on the child tables were very slow.

Options

Here is a rundown of the things I tried. To go straight to the solution, feel free to skip this section.

Option 1 (aggregate inside JOIN)

This applies the aggregate function before doing the JOIN, so that the JOIN is always done with a relation that has one row per parent. This simplifies the relations being joined, by doing the grouping before the JOIN. However, because the WHERE clause isn't inside the JOIN clause, the planner isn't able to only extract the relevant rows fnom the child tables.

One way around this is to repeat the WHERE clause inside the JOIN clause, but this leads to duplication within the query, and will complicate the query if the filtering is being done on the contents of the child tables.

This is what the query looks like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
SELECT
  ROW_TO_JSON(parent) AS parent,
  COALESCE(child1, '[]') AS child1,
  COALESCE(child2, '[]') AS child2
FROM parent
LEFT JOIN (
  SELECT
    parent_id,
    JSONB_AGG(child1) FILTER (WHERE child1.id IS NOT NULL) AS child1
  FROM child1 GROUP BY parent_id
) child1 ON child1.parent_id=parent.id
LEFT JOIN (
  SELECT
    parent_id,
    JSONB_AGG(child2) FILTER (WHERE child2.id IS NOT NULL) AS child2
  FROM child2 GROUP BY parent_id
) child2 ON child2.parent_id=parent.id
WHERE ...

Option 2 (CTEs)

This option is similar to the previous one, in that it also does the grouping before the JOIN, but is slightly easier to read because the relations are defined at the top level. However, it is slower (on versions below 12) because the CTEs are materialized before they are used.

This is what the query looks like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
WITH child1_cte AS (
  SELECT
    parent_id,
    JSONB_AGG(child1) FILTER (WHERE child1.id IS NOT NULL) AS child1
  FROM child1 GROUP BY parent_id
), child2_cte AS (
  SELECT
    parent_id,
    JSONB_AGG(child2) FILTER (WHERE child2.id IS NOT NULL) AS child2
  FROM child2 GROUP BY parent_id
)
SELECT
  ROW_TO_JSON(parent) AS parent,
  COALESCE(child1_cte, '[]') AS child1,
  COALESCE(child2_cte, '[]') AS child2
FROM parent
LEFT JOIN child1_cte ON child1_cte.parent_id=parent.id
LEFT JOIN child2_cte ON child2_cte.parent_id=parent.id
WHERE ...

Option 3 (DISTINCT inside JSONB_AGG)

This query is the most similar to the original query (which was returning duplicate entries), but is the slowest. This is because O(n^k) rows must be generated (where n is the number of rows, and k is the number of tables being joined) and then filtered for uniqueness. It may also require changing the schema, because DISTINCT requires all the columns to be comparable for equality. This means that, e.g. JSON columns won't work, and must be converted to JSONB.

This is what the query looks like:

1
2
3
4
5
6
7
8
9
SELECT
  ROW_TO_JSON(parent) AS parent,
  COALESCE(JSONB_AGG(DISTINCT child1) FILTER (WHERE child1.id IS NOT NULL), '[]') AS child1,
  COALESCE(JSONB_AGG(DISTINCT child2) FILTER (WHERE child2.id IS NOT NULL), '[]') AS child2
FROM parent
LEFT JOIN child1 ON child1.parent_id=parent.id
LEFT JOIN child2 ON child2.parent_id=parent.id
WHERE ...
GROUP BY parent.id

Solution (aggregate inside JOIN, with LATERAL)

This query is very similar to Option 1, but uses lateral subqueries so that the parent.id column can be referenced inside the JOIN clause.

This provides good performance for queries, regardless of what fraction of the table is returned.

The query:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
SELECT
  ROW_TO_JSON(parent) AS parent,
  COALESCE(child1, '[]') AS child1,
  COALESCE(child2, '[]') AS child2
FROM parent
LEFT JOIN LATERAL (
  SELECT
    parent_id,
    JSONB_AGG(child1) FILTER (WHERE child1.id IS NOT NULL) AS child1
  FROM child1
  WHERE child1.parent_id=parent.id
  GROUP BY parent_id
) child1 ON child1.parent_id=parent.id
LEFT JOIN LATERAL (
  SELECT
    parent_id,
    JSONB_AGG(child2) FILTER (WHERE child2.id IS NOT NULL) AS child2
  FROM child2
  WHERE child2.parent_id=parent.id
  GROUP BY parent_id
) child2 ON child2.parent_id=parent.id
WHERE ...

While all the queries above return the same results, they have very different performance characteristics, particularly when the same query may be used with different WHERE clauses to return a small or large number of results. (In my case, it could require retrieving only 1 parent based on the id, or it could require getting all the data.) I have found this query to be the most performant in both cases.

I hope this is helpful; be sure to validate my advice for your use-case by using EXPLAIN (docs<https://www.postgresql.org/docs/current/using-explain.html>). There is also a great series of blog posts<https://www.depesz.com/2013/04/16/explaining-the-unexplainable/> that shows how to underderstand the output from EXPLAIN.