How to read all rows from huge table?

So it turns out that the crux of the problem is that, by default, a Postgres JDBC connection runs in "autoCommit" mode, and the driver needs cursors to be able to "page" through data (e.g. read the first 10K rows, then the next 10K, and so on), but cursors can only exist within a transaction. So by default the driver reads all rows into RAM, and only lets your program start processing "the first result row, then the second" after everything has arrived, for two reasons: the connection isn't in a transaction (so cursors can't be used), and no fetch size has been set.

The way the psql command-line tool achieves batched responses (its FETCH_COUNT setting) is to "wrap" its SELECT queries in a short-lived transaction (if one isn't already open), so that cursors can work. You can do something similar with JDBC:

  static void readLargeQueryInChunksJdbcWay(Connection conn, String originalQuery, int fetchCount, ConsumerWithException<ResultSet, SQLException> consumer) throws SQLException {
    boolean originalAutoCommit = conn.getAutoCommit();
    if (originalAutoCommit) {
      conn.setAutoCommit(false); // start temp transaction
    }
    try (Statement statement = conn.createStatement()) {
      statement.setFetchSize(fetchCount);
      ResultSet rs = statement.executeQuery(originalQuery);
      while (rs.next()) {
        consumer.accept(rs); // or just do your work here
      }
    } finally {
      if (originalAutoCommit) {
        conn.setAutoCommit(true); // reset it, also ends (commits) temp transaction
      }
    }
  }
  @FunctionalInterface
  public interface ConsumerWithException<T, E extends Exception> {
    void accept(T t) throws E;
  }
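
For what it's worth, calling that helper could look something like this (the connection URL, credentials, table, and column are placeholders):

  // Hypothetical usage of the helper above (assumes the usual java.sql imports)
  try (Connection conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
    readLargeQueryInChunksJdbcWay(conn, "SELECT * FROM big_table", 10_000,
        rs -> System.out.println(rs.getLong(1)));
  }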

This has the benefit of requiring less RAM, and, in my tests, it seemed to run faster overall, even when saving the RAM wasn't necessary. Weird. It also means your processing of the first row "starts sooner" (since rows arrive a page at a time).

You can also do it the "raw Postgres cursor" way (an explicit DECLARE/FETCH loop), though in my experiments the JDBC way above seemed slightly faster for whatever reason.
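
A minimal sketch of that explicit-cursor approach (the method name and cursor name are placeholders I made up; it reuses the ConsumerWithException interface from above and assumes a read-only query):

  static void readLargeQueryWithExplicitCursor(Connection conn, String originalQuery, int fetchCount, ConsumerWithException<ResultSet, SQLException> consumer) throws SQLException {
    boolean originalAutoCommit = conn.getAutoCommit();
    conn.setAutoCommit(false); // a cursor can only live inside a transaction
    try (Statement statement = conn.createStatement()) {
      // "my_cursor" is an arbitrary name chosen for this sketch
      statement.execute("DECLARE my_cursor NO SCROLL CURSOR FOR " + originalQuery);
      while (true) {
        // FETCH returns an empty result set once the cursor is exhausted
        try (ResultSet rs = statement.executeQuery("FETCH FORWARD " + fetchCount + " FROM my_cursor")) {
          if (!rs.next()) {
            break;
          }
          do {
            consumer.accept(rs);
          } while (rs.next());
        }
      }
      statement.execute("CLOSE my_cursor");
      conn.commit(); // end the temp transaction (also releases the cursor)
    } finally {
      conn.setAutoCommit(originalAutoCommit);
    }
  }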

Another option is to turn autoCommit mode off everywhere, though you then still have to specify a fetchSize for each new Statement (or set a default fetch size in the connection URL string), for example:
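
Something along these lines, using pgjdbc's defaultRowFetchSize URL parameter (the URL, credentials, and table name here are placeholders):

  // Sketch: set a connection-wide default fetch size via the URL,
  // so each new Statement pages through results without calling setFetchSize()
  String url = "jdbc:postgresql://localhost:5432/mydb?defaultRowFetchSize=10000";
  try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
    conn.setAutoCommit(false); // still required, or the driver fetches everything at once
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {
      while (rs.next()) {
        // process each row here
      }
    }
    conn.commit();
  }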


Use a CURSOR in PostgreSQL, or let the JDBC driver handle this for you.

LIMIT and OFFSET will get slow when handling large datasets.


The short version is: call stmt.setFetchSize(50) and conn.setAutoCommit(false) to avoid reading the entire ResultSet into memory.

Here's what the docs say:

Getting results based on a cursor

By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.

Note:

  • Cursor-based ResultSets cannot be used in all situations. There are a number of restrictions which will make the driver silently fall back to fetching the whole ResultSet at once.

  • The connection to the server must be using the V3 protocol. This is the default for (and is only supported by) server versions 7.4 and later.

  • The Connection must not be in autocommit mode. The backend closes cursors at the end of transactions, so in autocommit mode the backend will have closed the cursor before anything can be fetched from it.

  • The Statement must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY. This is the default, so no code will need to be rewritten to take advantage of this, but it also means that you cannot scroll backwards or otherwise jump around in the ResultSet.

  • The query given must be a single statement, not multiple statements strung together with semicolons.

Example 5.2. Setting fetch size to turn cursors on and off.

Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();

// Turn use of the cursor on.
st.setFetchSize(50);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("a row was returned.");
}
rs.close();

// Turn the cursor off.
st.setFetchSize(0);
rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("many rows were returned.");
}
rs.close();

// Close the statement.
st.close();