Drupal - How does the batch API work internally?

I ran into a timeout issue using Migrate the other day and started to wonder how the Batch API works internally.

This is how the Batch API works (based on my understanding):

Initialize

  1. Batch processing is initialized differently depending on whether the client (browser) has JavaScript enabled.
  2. JavaScript-enabled clients are identified by the 'has_js' cookie set in drupal.js. If no JavaScript-enabled page has been visited during the current user's browser session, the non-JavaScript version is used.
  3. If JavaScript is enabled, Batch uses AJAX requests to keep the connection alive across requests.
  4. If JavaScript is not enabled, Batch sets a meta refresh tag in the HTML so the page reloads at regular intervals, keeping the connection alive across requests.

(This is how the progress bar is kept updated with the progress of the job.)
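
A simplified sketch of that decision, based on the 'has_js' cookie described above. The function name here is hypothetical, for illustration only; core performs the equivalent check inside its batch engine:

    // Hypothetical helper: core does the equivalent check when the
    // batch progress page is first built.
    function example_batch_pick_flavour() {
      // drupal.js sets the 'has_js' cookie on every JavaScript-enabled page view.
      if (isset($_COOKIE['has_js']) && $_COOKIE['has_js']) {
        // JS flavour: the progress bar polls /batch?id=...&op=do via AJAX.
        return 'js';
      }
      // Non-JS flavour: the page embeds a <meta http-equiv="Refresh"> tag
      // that re-requests the batch URL at regular intervals.
      return 'nojs';
    }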

Batch Process

  1. To start the process, Batch creates a queue and adds all the operations (functions and their arguments) that you define in the batch array, like:

    $batch = array(
      'operations' => array(
        array('batch_example_process', array($options1, $options2)),
        array('batch_example_process', array($options3, $options4)),
      ),
      'finished' => 'batch_example_finished',
      'title' => t('Processing Example Batch'),
      'init_message' => t('Example Batch is starting.'),
      'progress_message' => t('Processed @current out of @total.'),
      'error_message' => t('Example Batch has encountered an error.'),
      'file' => drupal_get_path('module', 'batch_example') . '/batch_example.inc',
    );
    

    Additionally, it assigns a batch ID that is unique across all batches (see the kickoff sketch just after this list).

  2. Batch then claims the queue items one by one and executes each function with the arguments defined for it.

  3. This is the crucial part: the function (operation) that implements the batch operation should chunk the data and process it efficiently, keeping PHP's memory limit and execution timeout in mind. Failing to do so leads to exactly the kind of timeout problem you hit.
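
For completeness, this is roughly how a batch defined like the array above is handed to the engine. batch_set() and batch_process() are the actual Batch API functions; the redirect path is just an example:

    // Register the batch definition; Drupal stores it under its unique
    // batch ID and queues the operations.
    batch_set($batch);

    // Kick off processing. Inside a Form API submit handler this call is
    // made for you automatically; elsewhere, call it yourself with the
    // path to redirect to once the batch completes.
    batch_process('admin/content');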

The Batch Function

The functions that implement batch operations should take care of the following things (a consolidated example follows this list):

  • Tracking the number of items the operation has to process, like:

    if (!isset($context['sandbox']['progress'])) {
      $context['sandbox']['progress'] = 0;
      $context['sandbox']['current_node'] = 0;
      $context['sandbox']['max'] = db_result(db_query('SELECT COUNT(DISTINCT nid) FROM {node}'));
    }
    
  • Limiting the number of items processed in a single call, for example by setting a limit:

    // For this example, we decide that we can safely process 5 nodes at a time without a timeout.
    $limit = 5;
    
  • Updating the progress information as each item is processed, like:

    // Update our progress information.
    $context['sandbox']['progress']++;
    $context['sandbox']['current_node'] = $node->nid;
    $context['message'] = t('Now processing %node', array('%node' => $node->title));
    
  • Informing the batch engine whether the operation has finished or not, like:

    // Inform the batch engine that we are not finished,
    // and provide an estimation of the completion level we reached.
    if ($context['sandbox']['progress'] != $context['sandbox']['max']) {
      $context['finished'] = $context['sandbox']['progress'] / $context['sandbox']['max'];
    }
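
Putting those fragments together, a complete operation callback might look roughly like this. It is a sketch stitched from the snippets above, using the same Drupal 6-style database API as the db_result() call earlier, with the actual per-node work left as a comment:

    function batch_example_process($options1, $options2, &$context) {
      // First pass: set up the sandbox that persists across requests.
      if (!isset($context['sandbox']['progress'])) {
        $context['sandbox']['progress'] = 0;
        $context['sandbox']['current_node'] = 0;
        $context['sandbox']['max'] = db_result(db_query('SELECT COUNT(DISTINCT nid) FROM {node}'));
      }

      // For this example, we decide that we can safely process 5 nodes
      // at a time without a timeout.
      $limit = 5;

      // Fetch the next $limit nodes after the one we stopped at last time.
      $result = db_query_range("SELECT nid FROM {node} WHERE nid > %d ORDER BY nid ASC", $context['sandbox']['current_node'], 0, $limit);
      while ($row = db_fetch_array($result)) {
        $node = node_load($row['nid'], NULL, TRUE);

        // ... do whatever work the batch exists for on $node here ...

        // Update our progress information.
        $context['sandbox']['progress']++;
        $context['sandbox']['current_node'] = $node->nid;
        $context['message'] = t('Now processing %node', array('%node' => $node->title));
      }

      // Inform the batch engine that we are not finished, and provide an
      // estimation of the completion level we reached.
      if ($context['sandbox']['progress'] != $context['sandbox']['max']) {
        $context['finished'] = $context['sandbox']['progress'] / $context['sandbox']['max'];
      }
    }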
    

Drupal core's batch engine supplies sensible defaults for most of the points above if the implementing function misses them, but it is always best to handle them explicitly in the implementing function.

Batch Finished callback

  • This is the last callback invoked, when defined in the batch array. It usually reports how much was processed, etc.
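
A minimal sketch of such a callback, matching the 'finished' => 'batch_example_finished' entry in the batch array above:

    function batch_example_finished($success, $results, $operations) {
      if ($success) {
        // $results can be populated by the operations, e.g. one entry
        // per processed item.
        drupal_set_message(t('@count items processed.', array('@count' => count($results))));
      }
      else {
        // A fatal error occurred; $operations holds the unprocessed operations.
        $error_operation = reset($operations);
        drupal_set_message(t('An error occurred while processing %operation with arguments: @args', array(
          '%operation' => $error_operation[0],
          '@args' => print_r($error_operation[1], TRUE),
        )), 'error');
      }
    }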

ANSWERS

If the page with the batch request is closed, does the batch processing stop? Will it restart when the same URL is opened again? The Migrate module sometimes continues, but it's probably using queues?

Yes, it stops. Ideally it should restart the batch when the URL is opened again, but as said above, that depends on the function you implement.

To solve your PHP timeout problem, use the Drush batch support available in the Migrate module. But first dig into Migrate's batch functions and try to chunk the data you are processing.


If the page with the batch request is closed, does the batch processing stop?

Yes, it will be stopped.

Will it restart when the same URL is opened again? The Migrate module sometimes continues, but it's probably using queues?

As Dinesh said, it depends on the implementation.

You should run migrations using Drush, because:

Drush runs at the command line and is not subject to any time limits (in particular, PHP's max_execution_time does not apply). So, when you start a migration process running via drush, it simply starts up and keeps running until it's done.

When running processes through a web interface, the PHP max_execution_time (typically 30 seconds if not less) applies. Thus, for long-running processes we need to use the Batch API, which manages the breaking up of a process across multiple requests. So, a migration process will start up, run for 25 seconds or so, then stop and let the Batch API issue a fresh page request, in which the migration process is restarted, ad infinitum.
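
Concretely, with the Migrate module's Drush commands (the migration name "Article" is just an example):

    # Run one migration straight through; no PHP max_execution_time applies.
    drush migrate-import Article

    # See how far each registered migration has got.
    drush migrate-status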

So, understanding that, why is Drush better?

It's faster

The Batch API introduces a lot of overhead - shutting down and reinvoking the page requests, the migration process needs to run through all the necessary constructors again, database connections reestablished and queries rerun, etc. And, for a partial import, it needs to pick up where it left off - if the first 500 source records have been imported, it needs to find the 501st record. Depending on your source format and how it's constructed, this may or may not scale - if you're using highwater marks with an SQL source, the query itself can eliminate the earlier records and start right where you left off. If not, then Migrate needs to scroll through the source data looking for the first non-imported record. With, say, a big XML file as your source, after many iterations it may very well take longer than your PHP max_execution_time to get to where you can pick up, and your migration can stall.

It's more reliable

Running migrations through your browser adds your desktop, and your local Internet connection, as points of failure. A network glitch when Batch API is moving to the next page request, a browser crash, an accidental close of the wrong tab or window can all interrupt your migration. Running in drush reduces the moving parts - you eliminate your desktop and local Internet connection as factors.

It's more helpful

If something does go wrong when running in Drush, you'll see any useful error messages there are. Failures using the Batch API often get swallowed up, and all you get to see is the completely useless "An AJAX HTTP request terminated abnormally. Debugging information follows. Path: /batch?id=901&op=do StatusText: ResponseText: ReadyState: 4".

You can find more information on this here.

In the meantime, if you want the batch to keep running even when the browser window is closed, consider the Background Process module. Its Background Batch submodule does the trick.

This module takes over the existing Batch API and runs batch jobs in a background process. This means that if you leave the batch page, the job continues, and you can return to the progress indicator later.
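
Getting it running is a matter of downloading and enabling the module; the machine names below are assumed from the drupal.org project page:

    # Fetch the Background Process project and enable its Background Batch
    # submodule (machine names assumed, check the project page).
    drush dl background_process
    drush en -y background_batch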