We decided to use AWS RDS for our Postgres database, and wanted to use the same instance for loading MusicBrainz data instead of having to manage a separate Postgres instance. We also wanted to speed up the import by eliminating unnecessary steps, since we only use the database to run a single SQL query.
I tried passing the RDS instance information to InitDb.pl (which we were previously using for loading the data), but found that it was failing because RDS does not support custom extensions. I looked at the extensions being used to see whether they were really required for our use case. The custom extensions in MusicBrainz are musicbrainz_unaccent and musicbrainz_collate.
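RDS only allows extensions from its approved list, so it is worth checking up front what is available before assuming a dump will load. A minimal sketch, run with psql against the RDS instance:

SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;

Neither musicbrainz_unaccent nor musicbrainz_collate will appear in that list, so they have to be worked around rather than installed.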
I realized that I would need to modify the script to avoid creating the extensions. InitDb.pl proved difficult to modify, both because I don't know Perl and because the script does a lot more than just load data.
I looked for other scripts in the repository that could be used for loading data, and came across script/create_test_db.sh, which was also doing the import. I made changes and tried an import, investigated the errors that occurred, updated the script to avoid them, and tried again.
Rather than dropping and recreating the database itself, we decided to clear out the existing tables, so every import starts by dropping the MusicBrainz schemas instead of the DB.
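A minimal sketch of that reset step, assuming the standard MusicBrainz schema names (adjust the list to whichever CreateTables.sql files you actually load):

DROP SCHEMA IF EXISTS musicbrainz, cover_art_archive, documentation, json_dump, sitemaps, statistics, wikidocs CASCADE;
CREATE SCHEMA musicbrainz;
CREATE SCHEMA cover_art_archive;
CREATE SCHEMA documentation;
CREATE SCHEMA json_dump;
CREATE SCHEMA sitemaps;
CREATE SCHEMA statistics;
CREATE SCHEMA wikidocs;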
Not creating the extensions was an easy fix: just comment out the line in the script that runs the extensions SQL file.
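Roughly, the change amounts to commenting out the psql call that loads the extensions file; the line below is only an illustration (the path and invocation are assumptions, not the literal contents of create_test_db.sh):

# $PSQL $DATABASE < ./admin/sql/Extensions.sql   # skipped: RDS will not install the custom extensions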
Because we were extracting and importing all the .tar.bz2 files together, all the tables needed to be created together.
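The snippets below use two shell variables: $PSQL, the psql command line pointed at the RDS instance, and $DATABASE, the target database name. A hedged example (endpoint, user, and database name are placeholders, with the password supplied via PGPASSWORD or ~/.pgpass):

PSQL="psql -h mydb.example.us-east-1.rds.amazonaws.com -U musicbrainz"
DATABASE=musicbrainz

With those set, the tables can be created: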
for i in admin/sql/CreateTables.sql admin/sql/caa/CreateTables.sql admin/sql/documentation/CreateTables.sql admin/sql/json_dump/CreateTables.sql admin/sql/sitemaps/CreateTables.sql admin/sql/statistics/CreateTables.sql admin/sql/wikidocs/CreateTables.sql; do
  echo "$i"
  # UNLOGGED tables skip the write-ahead log, which speeds up the initial bulk load
  pv "$i" | sed -e 's/CREATE TABLE/CREATE UNLOGGED TABLE/' | $PSQL $DATABASE
done
To extract the files in parallel (to make maximum use of the CPU cores), we used the following code:
for j in /media/musicbrainz/data/mbdump*.tar.bz2; do
  echo "Extracting $j"
  # Run each extraction in a background subshell so the archives unpack in parallel
  ( cd /media/musicbrainz/data && tar xjf "$j" mbdump ) &
done
wait   # block until every archive has been extracted before loading the data
Since Postgres uses a single backend process per COPY, loading multiple tables at the same time makes better use of the CPU cores. GNU Parallel made it easy to use all of them:
# Run from the directory containing the extracted dump files; {} is the file and {/} its basename (= the table name)
parallel "echo \"Copying table {/}\" ; cat {} | $PSQL -c \"COPY {/} FROM STDIN DELIMITER E'\t';\" $DATABASE || echo \"Error inserting into {/}\"" ::: *
The main difficulty we encountered was the schema search path not being set. MusicBrainz creates a musicbrainz schema, and the search path needs to be set to musicbrainz,public so that the tables can be found without schema-qualifying every name.
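One way to make that setting stick for every new connection is to set it on the database itself; a minimal sketch (the database name is whatever $DATABASE points at):

ALTER DATABASE musicbrainz SET search_path TO musicbrainz, public;

Alternatively, each psql session in the script can start with SET search_path TO musicbrainz, public;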
In the data dump there is a file called editor_sanitized for which no corresponding table exists in the schema; its COPY fails, which the || echo fallback in the parallel command above simply reports before moving on.
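If you would rather have a clean run, one option (an assumption on our part, not something the original script does) is to remove that file from the extracted dump before running the parallel COPY:

rm -f editor_sanitized   # no matching table in the schema, so skip this dump file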
Thanks to OneMusicAPI for providing the AWS resources for this experiment.