Batch Processing

CellProfiler is designed to analyze images in a high-throughput manner. Once a pipeline has been established for a set of images, CellProfiler can export batches of images to be analyzed on a computing cluster with that pipeline. We often process tens or even hundreds of thousands of images for one analysis in this manner. We do this by breaking the entire set of images into separate batches, then submitting each batch as an individual job to a cluster. Each batch can be analyzed independently of the rest.

Submitting files for batch processing

Below is a basic workflow for submitting your image batches to the cluster.
  1. Create a folder for your project on your cluster. For high-throughput analysis, it is recommended to create a separate project folder for each run.
  2. Within this project folder, create the following folders (both of which must be connected to the cluster computing network):
    • Create an images folder, then transfer all of your images to this folder as the input folder. The input folder must be readable by everyone (or at least your cluster) because each of the separate cluster computers will read input files from this folder.
    • Create an output folder where all your output data will be stored. The output folder must be writeable by everyone (or at least your cluster) because each of the separate cluster computers will write output files to this folder.
    If you cannot create folders and set read/write permissions on these folders (or don't know how), ask your Information Technology (IT) department for help.
  3. In the CellProfiler folder panel, set the Default Input and Default Output Folders to the images and output folders created above, respectively.
  4. Create a pipeline for your image set. You should test it on a few example images from your image set. The module settings selected for your pipeline will be applied to all your images, but the results may vary depending on the image quality, so it is critical to ensure that your settings are robust against your "worst-case" images.

    For instance, some images may contain no cells. If this happens, the automatic thresholding algorithms will incorrectly choose a very low threshold, and therefore "find" spurious objects. This can be overcome by setting a lower limit on the threshold in the IdentifyPrimaryObjects module.

    The Test mode in CellProfiler may be used for previewing the results of your settings on images of your choice. Please refer to %(TEST_MODE_HELP_REF)s for more details on how to use this utility.

  5. Add the CreateBatchFiles module to the end of your pipeline. This module is needed to resolve the pathnames to your files with respect to your local machine and the cluster computers. If you are processing large batches of images, you may also consider adding ExportToDatabase to your pipeline, after your measurement modules but before the CreateBatchFiles module. This module will export your data either directly to a MySQL database or into a set of comma-separated files (CSVs) along with a script to import your data into a MySQL database. Please refer to the help for these modules to learn more about which settings are appropriate.
  6. Analyze your images to create a batch file. Click the Analyze images button and the analysis will begin locally, processing the first image set only. Do not be surprised if processing the first image set takes much longer than usual when using LoadImages: this module creates a list of all images to be processed, which can take a while if there are many of them. (This process can be sped up by creating your list of images as a CSV and using the LoadData module to load it.)

    At the end of processing the first cycle locally, the CreateBatchFiles module halts execution, creates the proper batch file (a file called Batch_data.mat) and saves it in the Default Output Folder (Step 3). You are now ready to submit this batch file to the cluster to run each of the batches of images on different computers on the cluster.

  7. Submit your batches to the cluster. Log on to your cluster, and navigate to the directory where you have installed CellProfiler on the cluster. A single batch can be submitted with the following command:
    ./python-2.6.sh CellProfiler.py -p <Default_Output_Folder_path>/Batch_data.mat -c -r -b -f <first_image_set_number> -l <last_image_set_number>
    This command runs the batch by using additional options to CellProfiler that specify the following (type "CellProfiler.py -h" to see a list of available options):
    • -p <Default_Output_Folder_path>/Batch_data.mat: The location of the batch file, where <Default_Output_Folder_path> is the output folder path as seen by the cluster computer.
    • -c: Run "headless", i.e., without the GUI.
    • -r: Run the pipeline specified on startup, which is contained in the batch file.
    • -b: Do not build extensions, since by this point, they should already be built.
    • -f <first_image_set_number>: Start processing with the image set specified, <first_image_set_number>.
    • -l <last_image_set_number>: Finish processing with the image set specified, <last_image_set_number>.
    To submit all the batches for a full image set, you will need a script that calls CellProfiler with these options for sequential ranges of image set numbers (e.g., 1-50, 51-100, etc.) and submits each range as an individual job.

    The above notes assume that you are running CellProfiler using our source code (see "Developer's Guide" under Help for more details). If you are using the compiled version, you would replace ./python-2.6.sh CellProfiler.py with the CellProfiler executable file itself and run it from the installation folder.
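
As a rough illustration of the script described in Step 7, the following Python sketch splits a run into consecutive image set ranges and builds one command line per batch, using only the options listed above. The function names, the batch size, the image set count and the example path are hypothetical. The sketch assumes a Python 3 interpreter for the helper script itself, independent of the one used to run CellProfiler. How each command is wrapped and submitted as a job depends on your cluster's scheduler and is not shown.

```python
import shlex


def batch_ranges(n_image_sets, batch_size):
    """Yield (first, last) image set numbers, 1-based and inclusive."""
    for first in range(1, n_image_sets + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, n_image_sets)


def batch_command(batch_file, first, last,
                  launcher=("./python-2.6.sh", "CellProfiler.py")):
    """Build the command line for one batch.

    `launcher` is the source-code invocation from Step 7; replace it with
    the CellProfiler executable if you use the compiled version.
    """
    args = list(launcher) + [
        "-p", batch_file,   # batch file path as seen by the cluster
        "-c",               # run headless
        "-r",               # run the pipeline on startup
        "-b",               # do not build extensions
        "-f", str(first),   # first image set in this batch
        "-l", str(last),    # last image set in this batch
    ]
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Hypothetical values: 120 image sets in batches of 50.
    for first, last in batch_ranges(120, 50):
        print(batch_command("/cluster/project/output/Batch_data.mat",
                            first, last))
```

Each printed line can then be passed to whatever job submission command your cluster provides. Choosing a smaller batch size produces more, shorter jobs, which makes a failed batch cheaper to rerun.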

Once all the jobs are submitted, the cluster will run each batch individually and output any measurements or images specified in the pipeline. Specifying the output filename when calling CellProfiler will also produce an output file containing the measurements for that batch of images in the output folder. Check the output from the batch processes to make sure all batches complete. Batches that fail for transient reasons can be resubmitted.
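
One way to find batches that need resubmitting is to have each job leave a marker file when it finishes and then list the ranges with no marker. The sketch below assumes such a convention; it is not something CellProfiler does on its own. The marker file name (batch_<first>_<last>.done), the function names and the idea that your job wrapper creates the marker after CellProfiler exits successfully are all hypothetical and should be adapted to your own setup.

```python
from pathlib import Path


def batch_ranges(n_image_sets, batch_size):
    """Yield (first, last) image set numbers, 1-based and inclusive."""
    for first in range(1, n_image_sets + 1, batch_size):
        yield first, min(first + batch_size - 1, n_image_sets)


def marker_name(first, last):
    """Hypothetical marker file written by the job wrapper on success."""
    return "batch_%d_%d.done" % (first, last)


def incomplete_batches(output_dir, n_image_sets, batch_size):
    """Return the (first, last) ranges that have no marker file yet."""
    out = Path(output_dir)
    return [
        (first, last)
        for first, last in batch_ranges(n_image_sets, batch_size)
        if not (out / marker_name(first, last)).exists()
    ]
```

Calling incomplete_batches with the same image set count and batch size used at submission time returns the ranges to resubmit. A batch that fails repeatedly is more likely to indicate a pipeline or file problem than a transient cluster error.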

For additional help on batch processing, please post your questions on the CellProfiler forum.