DBI and Very Large Data Sets

May 25, 2007

Recently I had to do some post processing on a MySQL table with 10+ million rows of data. The amount of logic involved necessitated the use of a script, and stored procedures weren’t an option. The problem was that storing that many rows in memory really isn’t feasible, unless you’re blessed with several gigs of memory just sitting around unused.

Also, when using a garbage collected language (Ruby, Python, Perl … or if you must, Java), you just don’t have as much control over your memory usage, so the problem of keeping memory allocation under control is compounded.

Two Options (That I Couldn’t Use)

1. Cursors

If you are using a database that supports cursors, this is a great time to use them. Buffer the data to your script in a reasonably sized chunk. Problem solved.

2. mysql_use_result

Rather than reading all the data at once to a local buffer, you can use mysql_use_result to read the data directly from the database as you need it.

This wasn’t an option for me since it’s a blocking process and I needed to write the updated rows back to the table as they were processed.

Paging, with a Twist

Update: The solution below stands true for Perl DBI/DBD-mysql. When using Mysql/Ruby’s free method on Mysql::Result objects it frees up memory just fine. No need to use two scripts when paging.

My first instinct was to use LIMIT and OFFSET to page through the data in reasonably sized chunks. Yes, it does hurt to have the sort the table every time before you can return data, and this method isn’t terribly fast when your offset gets big. But it does work.

The problem is that reusing the same statement handle over and over again was that the script still eating a ton of memory during the process (which was rather long running, several hours minimum). The memory would never be released until the script finally exited. My research led me to the answer that this is just an unfortunate side effect of the way most garbage collectors work.

The solution? Have the script that does the actual processing work run separate from the script that determines the offset for each pass through the loop. What you end up with is a script that invokes the processing script as a separate instance for each page of data returned from the database. As each child process exits the memory is released back to the OS.

Some pseudo code (should be similar logic for Perl, Ruby, and Python):

The parent script:

total_rows = statement.exec( SELECT COUNT(*) FROM table )
limit = 100000
offset = 0

while ( offset <= total_rows )
  system(processor_script limit offset)

  offset += limit
end

And the child script:

chunk = statement.exec( SELECT * FROM table ORDER BY id LIMIT limit OFFSET offset )

for each row in chunk
  do some processing logic here
end  

It still kind of strikes me as a hack, but it works. The memory usage of each invocation of the script stayed well under 50 mb of memory, which was a big difference from the several hundred mb of memory the script was previously eating up.