I recently came across a problem in which I needed to retrieve data from multiple joined tables. While I was using PostgreSQL, the problem applies to relational databases in general. Joining tables results in duplication of data in the resultset, and one needs to extract the data manually or use an ORM.
PostgreSQL has powerful JSON features that make it possible to get results as JSON, which can be fed to a JSON parser. JSON supports nested structures, unlike database resultsets. This means that one can write queries like the following, to get results as a JSON object:
|
|
It's also easy to get data from multiple tables in a query:
|
|
The result for this query has one column per table, and the child tables belonging to each parent row are combined into JSON arrays.
Recently I encountered a situation where one parent table was being joined to multiple child tables:
|
|
This caused duplication inside the generated JSON array, if child_1 had multiple rows for a given parent. Additionally, a WHERE clause was being used to select only a small fraction of the rows from parent, so queries that did sequential scans on the child tables were very slow.
Here is a rundown of the things I tried. To go straight to the solution, feel free to skip this section.
This applies the aggregate function before doing the JOIN, so that the JOIN is always done with a relation that has one row per parent. This simplifies the relations being joined, by doing the grouping before the JOIN. However, because the WHERE clause isn't inside the JOIN clause, the planner isn't able to only extract the relevant rows fnom the child tables.
One way around this is to repeat the WHERE clause inside the JOIN clause, but this leads to duplication within the query, and will complicate the query if the filtering is being done on the contents of the child tables.
This is what the query looks like:
|
|
This option is similar to the previous one, in that it also does the grouping before the JOIN, but is slightly easier to read because the relations are defined at the top level. However, it is slower (on versions below 12) because the CTEs are materialized before they are used.
This is what the query looks like:
|
|
This query is the most similar to the original query (which was returning duplicate entries), but is the slowest. This is because O(n^k) rows must be generated (where n is the number of rows, and k is the number of tables being joined) and then filtered for uniqueness. It may also require changing the schema, because DISTINCT requires all the columns to be comparable for equality. This means that, e.g. JSON columns won't work, and must be converted to JSONB.
This is what the query looks like:
|
|
This query is very similar to Option 1, but uses lateral subqueries so that the parent.id column can be referenced inside the JOIN clause.
This provides good performance for queries, regardless of what fraction of the table is returned.
The query:
|
|
While all the queries above return the same results, they have very different performance characteristics, particularly when the same query may be used with different WHERE clauses to return a small or large number of results. (In my case, it could require retrieving only 1 parent based on the id, or it could require getting all the data.) I have found this query to be the most performant in both cases.
I hope this is helpful; be sure to validate my advice for your use-case by using EXPLAIN (docs<https://www.postgresql.org/docs/current/using-explain.html>). There is also a great series of blog posts<https://www.depesz.com/2013/04/16/explaining-the-unexplainable/> that shows how to underderstand the output from EXPLAIN.