DBIx::MultiRow – Updating multiple database rows quickly and easily

A requirement arises in many systems to update multiple SQL database rows. For small numbers of rows requiring updates, it can be adequate to use an UPDATE statement for each row that requires an update. But if there are a large number of rows that require an update, then the overhead of issuing large numbers of UPDATE statements can result in the operation as a whole taking a long time to complete.

The traditional advice for improving performance for multiple UPDATE statements is to “prepare” the required query once, and then “execute” the prepared query once for each row requiring an update. But in many cases this only provides a modest improvement as each UPDATE operation still requires a round-trip communication with the database server. In the case where the application server and database server are on different hosts, the round-trip will involve network latency as well. The dominant factor in the time taken to complete the overall operation tends to be the “admin” work in conveying the application’s intention to the database server rather than the actual updates to the database.

A more effective solution to this problem is to attempt to reduce the number of UPDATE statements. Let us start with a simple table:

staff
namesalary
Bob1100
Jane1150
Frank1050
Susan1125
John1100

Let us say we want to apply some pay rises to all staff. We could update each row using an UPDATE statement:

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1200 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

That would require five update statements. Can we do any better? Yes; in this example, all staff are getting the same salary, so we can make those changes in a single UPDATE statement:

UPDATE staff
    SET salary = 1200
    WHERE name IN ('Bob', 'Jane', 'Frank', 'Susan', 'John');

That reduces the number of queries from five to one. But this approach will only work when all rows are having the same update for the same columns. What if not everyone was getting the same salary?

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1250 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1250 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

We are no longer setting all the salary fields to the same value, so we can’t collapse it into a single statement. But we can group the updates according to the value being set, and then do one UPDATE statement per distinct value. In this case there are two distinct values, so we can do it in two UPDATE statements:

UPDATE staff SET salary = 1200 WHERE name IN ('Bob', 'Frank', 'John');
UPDATE staff SET salary = 1250 WHERE name IN ('Jane', 'Susan');

So we can reduce the number of queries from five to two. Not quite as good, but still an improvement.

But the extent to which this helps us reduces as the proportion of distinct SET values goes up. Eventually every SET value is different, for example:

UPDATE staff SET salary = 1125 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1100 WHERE name = 'Frank';
UPDATE staff SET salary = 1175 WHERE name = 'Susan';
UPDATE staff SET salary = 1150 WHERE name = 'John';

Since every SET clause is now different, we can’t coalesce any queries using the approaches mentioned before, so we still have one UPDATE statement for every row requiring changes. Is there anything we can do?

What if we had a database table which contained a representation of the updates we wanted to make, e.g.

updates
namesalary
Bob1125
Jane1200
Frank1100
Susan1175
John1150

and we could persuade the database server to apply those updates to the target table? This is in fact entirely possible in many database systems. In the case of PostgreSQL, it can be done using a FROM clause with UPDATE, like this:

UPDATE staff
    SET salary = updates.salary
    FROM updates
    WHERE updates.name = staff.name;

This does an INNER JOIN between the tables “staff” and “updates” where the column “name” matches. The SET clause then takes the “salary” field from the “updates” table and uses it to update the “salary” field of the “staff” table.

We can easily contrive for an “updates” table to exist by creating a temporary table and populating it. It is relatively straightforward to populate a table with multiple rows with just one query (or at least, far fewer queries than the number of rows desired).

So, given a list of updates to apply we could effect them using the following steps:

  1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
  2. Use INSERT to populate the temporary table with the updates
  3. Use UPDATE … FROM to update the target table using updates in the temporary table
  4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.

OK, that’s great, we have some theoretical approaches for reducing the number of queries, now what?

The code required to implement the above logic is sufficiently fiddly that we would probably not want to have to repeat it. So we could think in terms of creating a re-usable module which would implement that logic.

This is the intention of DBIx::MultiRow.

The input to the function would be a list of updates which the caller desires to be made. Each update should contain two things:

  1. An indication of which row should be updated
  2. New values for one or more fields

Going back to our first example:

UPDATE staff SET salary = 1200 WHERE name = 'Bob';
UPDATE staff SET salary = 1200 WHERE name = 'Jane';
UPDATE staff SET salary = 1200 WHERE name = 'Frank';
UPDATE staff SET salary = 1200 WHERE name = 'Susan';
UPDATE staff SET salary = 1200 WHERE name = 'John';

To do this using DBIx::MultiRow, we would write:

use DBIx::MultiRow 'multi_update';

multi_update($dbh,
    table         => \%staff_table_data,
    key_columns   => [ 'name' ],
    value_columns => [ 'salary' ],
    updates => [
        [ 'Bob',   1200 ],
        [ 'Jane',  1200 ],
        [ 'Frank', 1200 ],
        [ 'Susan', 1200 ],
        [ 'John',  1200 ],
    ]
);

“key_columns” specifies the columns which will be used to identify rows which need to be updated (using WHERE).

“value_columns” specifies the columns which will be updated (using SET).

“updates” specifies the updates to be made. Each element is a two-element array. The first element provides the value of the column (specified by “key_columns”) to identify the row to be updated. The second element provides the value to be SET in the column specified by “value_columns”.

In this case, multi_update will spot that the target values are all 1200, and will effect the desired changes using a single UPDATE statement as described above. (It will use placeholders and parameter binding if it thinks it’s appropriate.)

If given our second example with two distinct values, multi_update will spot that there are two distinct values, 1200 and 1250, and will effect this with two UPDATE statements as described above.

Optimising the number of UPDATEs by grouping the distinct SET values can be done in a way which is compatible with most common SQL databases. But the temporary table / UPDATE … FROM approach requires knowledge of the specific SQL database being used.

So DBIx::MultiRow is structured so that approaches which are generic across different SQL databases are expressed in a base class, and approaches which only work for specific SQL databases are expressed in a subclass. An object of the relevant class is instantiated when the call is made, and control then passed to the implementation relevant to the database in use.

If there is no database-specific subclass for the database in use, then DBIx::MultiRow will just use the base class which implements approaches that should work for any SQL database.

At the time of writing, the only database-specific subclass is for PostgreSQL. So if the caller has a PostgreSQL database, and calls multi_update with data to represent our third example (where the target values are all unique), then the PostgreSQL-specific subclass will effect the updates using the table / UPDATE … FROM approach.

There are a few more details worth mentioning. Let’s expand the original table a bit:

staff
companyfirst_namelast_namesalarybonuslast_updated
WidgetCoBobJones11001002013-01-01 12:30:06.815182+01
WidgetCoJaneSmith11501052013-01-01 12:30:06.815182+01
WidgetCoJaneAustin11501102013-01-01 12:30:06.815182+01
WidgetCoFrankJohnson1050902013-01-01 12:30:06.815182+01
WidgetCoJohnGrey11001002013-01-01 12:30:06.815182+01
FrobnitzBobJones11001102013-01-01 12:30:06.815182+01
FrobnitzSusanBlack11251152013-01-01 12:30:06.815182+01

“name” has now been split into “first_name” and “last_name”. To match on names we now need to match on two columns. So whereas previously we specified the matching column using a scalar, when there is more than one column to match, we use an ArrayRef.

Similarly, let’s say we want to update two fields, salary and bonus. If we want to update more than one column, whereas previously we specified the new value using a scalar, when there is more than one column to update, we use an ArrayRef.

Also, let’s say we want to make updates only for people in the company “WidgetCo”. “Bob Jones” appears in both companies, so we need to be careful to identify the company, but we don’t want to have to specify it repeatedly for each row. The ‘where’ parameter can be used to specify a global condition, e.g. where => { company => 'WidgetCo' }.

Another thing which we may want to do is update a column with the same value for all rows being updated. So here we could update the “last_updated” column. Rather than have to specify the same value over and over again, we can use the ‘set’ parameter.

The ‘set’ parameter takes a HashRef whose keys are the columns to be updated. Values specified as scalars are applied as-is, while values specified as ScalarRef specify literal SQL. In our example, we’ll set the “last_updated” column to the result of calling the SQL function NOW().

So to effect these changes, we could call multi_update like this:

multi_update($dbh,
    \%table_info,
    key_columns   => [qw{ first_name last_name }],
    value_columns => [qw{ salary bonus }],
    updates => [
        [ [ 'Bob',   'Jones'   ], [ 1125, 110 ] ],
        [ [ 'Jane',  'Smith'   ], [ 1200, 130 ] ],
        [ [ 'Jane',  'Austin'  ], [ 1200, 130 ] ],
        [ [ 'Frank', 'Johnson' ], [ 1125, 110 ] ],
        [ [ 'John',  'Grey'    ], [ 1150, 110 ] ],
    ],
    where => { company => 'WidgetCo' },
    set   => { last_update => \'NOW()' }
);

In addition to supporting explicit global conditions (using the ‘where’ parameter), DBIx::MultiRow will analyse the updates to look for any cases where the values to be matched in a column are all the same, and it will factorise these out into global conditions in the WHERE clause. So if there happen to be any implicit global conditions, then these will be identified and simplified anyway.

Similarly, DBIx::MultiRow will analyse the updates to look for any cases where the values to be SET for a column are all the same, and it will factorise these out into simple sub-clauses in the SET clause.

One more thing which deserves an explanation is the first names parameter \%table_info. In order to be able to reliably create temporary tables, and do SQL queries where the types of bound parameters cannot be inferred, DBIx::MultiRow requires the caller to provide information about the target table. So in the case of the six-column table above, %table_info might be:

(
    quoted_table_name => 'staff',
    columns => {
        'company'      => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'first_name'   => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'last_name'    => { dbi_type => SQL_VARCHAR,  sql_type => 'TEXT' },
        'salary'       => { dbi_type => SQL_NUMERIC,  sql_type => 'NUMERIC(9,2)' },
        'bonus'        => { dbi_type => SQL_NUMERIC,  sql_type => 'NUMERIC(9,2)' },
        'last_updated' => {
            dbi_type => SQL_DATETIME,
            sql_type => 'TIMESTAMP WITH TIME ZONE'
        }
    }
)

One upshot of doing things this way is that DBIx::MultiRow does not guarantee the order in which rows will be updated, and so in the general case, row update lock aquisition order is not known. If deadlock avoidance is a concern for the caller, then this would have to be handled separately. For example under PostgreSQL this may mean doing a separate SELECT … ORDER BY … FOR UPDATE statement to acquire the row update locks in a defined order first.

Having implemented code to do UPDATE on multiple rows, it was then only a small step to do a similar thing for DELETE. So DBIx::MultiRow includes another function multi_delete which can be used to delete multiple rows:

multi_delete($dbh,
    \%table_info,
    key_columns => [qw{ first_name last_name }],
    deletes     => [
        [ 'Bob',   'Jones'   ],
        [ 'Jane',  'Smith'   ],
        [ 'Jane',  'Austin'  ],
        [ 'Frank', 'Johnson' ],
    ],
    where       => { company => 'WidgetCo' },
);

If you want to try DBIx::MultiRow for yourself, we have released it as part of NET-A-PORTER’s commitment to release valuable open source where we can. You should be able to find it on Github at https://github.com/hochgurgler/DBIx-MultiRow.

13 thoughts on “DBIx::MultiRow – Updating multiple database rows quickly and easily

    • The first optimisation (coalescing of updates which update the same columns to the same values) is implemented in the base class and should work for MySQL already.

      The second optimisation (putting the updates in a temporary table, then applying them to the target table using a single query) is currently only implemented in a PostgreSQL-specific sublass, so will only work for PostgreSQL. (If the current implementation is used with MySQL, the updates should still be effected correctly, but without this optimisation.)

      Looking at the MySQL documentation, MySQL appears to have an UPDATE style which would support applying updates from one table to another. So in theory it ought to be possible to implement a MySQL-specific subclass which did the same thing for MySQL.

      At the present time, I don’t have any plans to implement a MySQL-specific subclass. I am happy to consider patches … or inducements :-)

  1. Pingback: FW: William Blunn’s DBIX::MULTIROW – UPDATING MULTIPLE DATABASE ROWS QUICKLY AND EASILY | Jason L. Froebe - Tech tips and How Tos for Fellow Techies

  2. Hi Martin,

    You must have missed the “If there is no database-specific subclass for the database in use, then DBIx::MultiRow will just use the base class which implements approaches that should work for any SQL database.” bit :)

    I’m looking forward to when William uploads it to CPAN.

    Jason

  3. The temporary table is pointless:
    UPDATE staff SET salary = updates.salary FROM (
    SELECT "Bob" AS name, 1125 AS salary
    UNION ALL SELECT "Jane", 1200
    UNION ALL SELECT "Frank", 1100
    UNION ALL SELECT "Susan", 1175
    UNION ALL SELECT "John", 1150
    ) AS update WHERE staff.name = update.name

    And no new module is needed to make this easy to write because SQL::Interp already does that job perfectly – in this case, via DBIx::Simple’s integration of SQL::Interp:
    my $db = DBIx::Simple->connect( @credentials );
    $db->iquery( 'UPDATE staff SET salary = updates.salary FROM', [
    { name => 'Bob', salary => 1125 },
    { name => 'Jane', salary => 1200 },
    { name => 'Frank', salary => 1100 },
    { name => 'Susan', salary => 1175 },
    { name => 'John', salary => 1150 },
    ], 'AS update WHERE staff.name = update.name' );

    This will create the above query under the hood (modulo whitespace obviously), except it‘ll emit placeholders and use bind values, automatically.

  4. Hello Aristotle,
    It looks like you’ve made two points here.
    One point is about whether the functionality provided is already provided by an existing module. You’re absolutely right; we shouldn’t be reinventing the wheel. However one thing that DBIx::MultiRow provides over and above the approach you suggested is that it should work regardless of the database in use. The code you mentioned uses the “UPDATE … FROM …” construction which will work for PostgreSQL but not for several other popular SQL databases. However, the same thing coded to use DBIx::MultiRow should work fine for many popular SQL databases. Being able to change the underlying database is not always just a pipe dream. Some places actively use the same code to access different databases in different scenarios; peppering the code with database-specific constructs simply isn’t an option for them.
    The other point is about whether the temporary table is necessary. When implementing the PostgreSQL-specific subclass, I did experiment with using a literal in-line table similar to the approach you suggested. (I used VALUES rather than SELECT … UNION ALL, but I think the effect is equivalent.) Using EXPLAIN I found that the PostgreSQL query planner generated query plans which I felt would not scale well, for example incorporating a sequential scan of the target table.
    However when I switched to using a temporary table, the resulting query plans seemed to me to be overall better. (There’s a comment in the code which mentions this point.)
    I’m not too bothered about the extra complexity of creating a temporary table because the code complexity is factorised into a single place (DBIx::MultiRow), and I think the execution overhead will be more than offset by the improved query plan.
    But I’m always open to suggestions. If you can find an approach which is better, I’d be interested to hear about it.
    Regards,
    Bill

  5. The changes can be performed in one query instead of four. The example below is generalizable. It looks pretty gross, but since it would be programmatically generated, the “DRY” principle doesn’t much matter:

    UPDATE staff

    SET salary =
    IF (first_name=’Bob’ AND last_name=’Jones’ , 1125,
    IF (first_name=’Jane’ AND last_name=’Smith’ , 1200,
    IF (first_name=’Jane’ AND last_name=’Austin’ , 1200,
    IF (first_name=’Frank’ AND last_name=’Johnson’ , 1125,
    IF (first_name=’John’ AND last_name=’Grey’ , 1150,
    salary))))),

    bonus =
    IF (first_name=’Bob’ AND last_name=’Jones’ , 110,
    IF (first_name=’Jane’ AND last_name=’Smith’ , 130,
    IF (first_name=’Jane’ AND last_name=’Austin’ , 130,
    IF (first_name=’Frank’ AND last_name=’Johnson’ , 110,
    IF (first_name=’John’ AND last_name=’Grey’ , 110,
    bonus))))),

    last_update =
    IF (first_name=’Bob’ AND last_name=’Jones’ , NOW(),
    IF (first_name=’Jane’ AND last_name=’Smith’ , NOW(),
    IF (first_name=’Jane’ AND last_name=’Austin’ , NOW(),
    IF (first_name=’Frank’ AND last_name=’Johnson’ , NOW(),
    IF (first_name=’John’ AND last_name=’Grey’ , NOW(),
    last_update))))),

    WHERE
    (
    ( first_name=’Bob’ AND last_name=’Jones’ )
    OR ( first_name=’Jane’ AND last_name=’Smith’ )
    OR ( first_name=’Jane’ AND last_name=’Austin’ )
    OR ( first_name=’Frank’ AND last_name=’Johnson’ )
    OR ( first_name=’John’ AND last_name=’Grey’ )
    )
    AND first_name IN (‘Bob’, ‘Jane’, ‘Frank’, John’)
    AND last_name IN (‘Jones’, ‘Smith’, ‘Austin’, ‘Johnson’, ‘Grey’)
    AND company = ‘WidgetCo’

    I haven’t tested this (or even syntax-checked it, sorry) but in similar code I’ve written before, the MySQL optimizer tended to handle this format pretty well. I’ve limited the first_name and last_name in two different ways to try to give the optimizer multiple ways to find matching rows efficiently. (Can’t hurt, might help.)

    If the OR clauses can’t use a two-column index and there are a million rows of “Bob Smith”s, then there would probably be a table scan, and the temporary table solution you present is an excellent improvement. I’m honestly not sure if the ORs present a problem for modern optimizers and if a properly-indexed temporary table would solve it. That hadn’t come up in the work I was doing, but then you’re trying to find a more generic solution than I was.

    The query is also longer than necessary, due to the parameters of the WHERE clause being repeated. But the multiplier is only the number of columns changed, so this isn’t O(n**2) or anything. And socket compression should mitigate the effect greatly.

    • Hello Jamie,
      That is an interesting suggestion.
      With your suggestion, it seems to me that at least one of the aspects of cost will have a component of the square of the number of updates — the nested-IF will presumably be O(n) but will need to be executed O(n) times, so O(n²) — so wouldn’t scale well over the number of updates. (I usually want to keep each component down to no worse than n*log(n).)
      I think for this reason I would have discounted that approach in the past.
      However for small numbers of updates O(n²) should be acceptable.
      Also there is gap we can potentially identify between the two existing strategies.
      Currently for two and three distinct sets of updates we would run two or three queries respectively.
      With the approach you suggest, we might be able to cover those with a single query.
      We would probably want to apply additional constraints to the strategy selection to avoid falling foul of any hostile cost components.
      So I think it would be worth considering adding this approach to the existing strategies for a future enhancement.
      I think that your query could be improved though:
      (1) I don’t think we need the bits in the WHERE which go
      AND first_name IN (‘Bob’, ‘Jane’, ‘Frank’, John’)
      AND last_name IN (‘Jones’, ‘Smith’, ‘Austin’, ‘Johnson’, ‘Grey’)
      because I think we are already constrained to this by the preceding AND/OR piece.
      (2) I think the nested-IF could be replaced by a CASE WHEN … THEN … END.
      (3) Having used a CASE, we could collapse all cases with the same THEN into CASE x IN (a,b,c) THEN y
      (4) As an alternative to (3), some databases support UPDATE foo SET (x,y,z) = (a,b,c) …, so we could perhaps collapse the per-SET-column CASE expressions into a single CASE expression.
      (5) Some databases support the construct (x,y,z) IN ((a,b,c),(d,e,f)) which could be used instead of AND/OR expressions to shorten the query
      Regards,
      Bill

Leave a Reply