Wednesday, December 5, 2007

MySQL Streaming Result Set

WTF? I thought this was a Java blog. Well, yes, it is. And, if you're just mucking around with java, or using it for a small application or research project, you probably won't have to deal with java.sql. But, if you're building most any kind of decent sized java application in the business world, it's hard to avoid using a database (or ten). We've been generally happily using JDBC and Hibernate to communicate with MySQL. That is, until one of my grab-all-the-data-from-a-largish-table queries died with an OutOfMemoryError before returning a single result. I was confused. How could that be? Could the ResultSet be consuming gigs worth of memory? Why, yes, it can, I quickly learned. A bit more searching and I happened upon the MySQL JDBC API Implementation Notes:
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
Ouch! But, that last sentence provides hope. On my plate was a project to write a tool to analyze site activity which was stored in a table with tens of millions of rows. Would be a bit clunky if I couldn't stream/iterate through the data. Here's the magic incantation:
To enable this functionality, you need to create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this any result sets created with the statement will be retrieved row-by-row.
In fact, you don't need to specify forward-only and read-only as they are the defaults, but the Integer.MIN_VALUE fetch size is essential. This signals the driver to stream the data rather than send it all-at-once. There are caveats to this approach. Be sure to read the ResultSet section of the MySQL JDBC API Implementation Notes before using this for serious work. Also, note that if there are delays in processing the streaming data that your network timeout values are sufficiently large. The mysqld net_write_timeout variable should be set at least as large as the longest possible delay. Also, there appear to be undocumented issues with streaming from one table and using a separate connection to read data from that same table. I've observed odd behavior in this case, such as the ResultSet being set to null. Best to ask for all the required data in the initial streaming query.