CellProfiler is designed to analyze images in a high-throughput manner.
Once a pipeline has been established for a set of images, CellProfiler
can export batches of images to be analyzed on a computing cluster with the
pipeline. We often process tens or even hundreds of thousands of images for one analysis in this
manner. To do so, we break the entire set of images into
separate batches and submit each batch as an individual
job to a cluster. Each batch can be analyzed independently of
the rest.
Submitting files for batch processing
Below is a basic workflow for submitting your image batches to the cluster.
- Create a folder for your project on your cluster. For high throughput
analysis, it is recommended to create a separate project folder for each run.
- Within this project folder, create the following folders (both of which must be connected to
the cluster computing network):
- Create an images folder and transfer all of your images to it; this folder serves
as the input folder. The input folder must be readable by everyone (or at least by your
cluster) because each of the separate cluster computers will read input files from
this folder.
- Create an output folder where all your output data will be stored. The
output folder must be writeable by everyone (or at least your cluster) because
each of the separate cluster computers will write output files to this folder.
If you cannot create folders and set read/write permissions on these folders (or don't know
how), ask your Information Technology (IT) department for help.
- In the CellProfiler folder panel, set the Default Input and Default Output Folders
to the images and output folders created above, respectively.
- Create a pipeline for your image set. You should test it on a few example
images from your image set. The module settings selected for your pipeline will be
applied to all your images, but the results may vary
depending on the image quality, so it is critical to ensure that your settings are
robust against your "worst-case" images.
For instance, some images may contain no cells. If this happens, the automatic thresholding
algorithms will incorrectly choose a very low threshold, and therefore "find"
spurious objects. This can be overcome by setting a lower limit on the threshold in
the IdentifyPrimaryObjects module.
The Test mode in CellProfiler may be used for previewing the results of your settings
on images of your choice. Please refer to %(TEST_MODE_HELP_REF)s
for more details on how to use this utility.
- Add the CreateBatchFiles module to the end of your pipeline.
This module is needed to resolve the pathnames to your files with respect to
your local machine and the cluster computers. If you are processing large batches
of images, you may also consider adding ExportToDatabase to your pipeline,
after your measurement modules but before the CreateBatchFiles module. This module
will export your data either directly to a MySQL database or into a set of
comma-separated files (CSVs) along with a script to import your data into a
MySQL database. Please refer to the help for these modules in order to learn more
about which settings are appropriate.
- Analyze your images to create a batch file. Click the Analyze images
button and the analysis will begin locally processing the first image set only.
Do not be surprised if processing the first image set takes much longer than usual
when using LoadImages: this module builds a list of all images to be
processed, which can take a while if there are many of them. (You can speed this
up by creating your list of images as a CSV and using the LoadData module to load it.)
At the end of processing the first cycle locally, the CreateBatchFiles
module halts execution, creates the proper batch file (a file called
Batch_data.mat) and saves it in the Default Output Folder set above. You
are now ready to submit this batch file to the cluster to run each of the batches
of images on different computers on the cluster.
- Submit your batches to the cluster. Log on to your cluster, and navigate
to the directory where you have installed CellProfiler on the cluster. A single
batch can be submitted with the following command:
./python-2.6.sh CellProfiler.py -p <Default_Output_Folder_path>/Batch_data.mat -c -r -b -f <first_image_set_number> -l <last_image_set_number>
This command runs the batch by using additional options to CellProfiler that
specify the following (type "CellProfiler.py -h" to see a list of available options):
-p <Default_Output_Folder_path>/Batch_data.mat: The
location of the batch file, where <Default_Output_Folder_path> is the
output folder path as seen by the cluster computer.
-c: Run "headless", i.e., without the GUI
-r: Run the pipeline specified on startup, which is contained in
the batch file.
-b: Do not build extensions, since by this point, they should
already be built.
-f <first_image_set_number>: Start processing with the image
set specified, <first_image_set_number>
-l <last_image_set_number>: Finish processing with the image
set specified, <last_image_set_number>
To submit all the batches for a full image set, you will need a script that calls
CellProfiler with these options for sequential ranges of image set numbers (e.g., 1-50, 51-100,
etc.) and submits each call as an individual job.
The above notes assume that you are running CellProfiler using our source code (see
"Developer's Guide" under Help for more details). If you are using the compiled version,
you would replace ./python-2.6.sh CellProfiler.py with the CellProfiler
executable file itself and run it from the installation folder.
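The script that walks through sequential image set ranges could look something like the following minimal Python sketch. It only builds the command lines; how each one is handed to the cluster depends on your job scheduler, so that part is left out. The function names, the batch size of 50, the total of 120 image sets, and the path /path/to/output are illustrative assumptions, not part of CellProfiler.

```python
def batch_ranges(n_image_sets, batch_size):
    """Yield 1-based, inclusive (first, last) image set ranges."""
    for first in range(1, n_image_sets + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, n_image_sets)


def batch_commands(batch_file, n_image_sets, batch_size,
                   launcher="./python-2.6.sh CellProfiler.py"):
    """Build one CellProfiler command line per batch.

    The options (-p, -c, -r, -b, -f, -l) are the ones described above.
    Replace `launcher` with the CellProfiler executable if you are
    using the compiled version.
    """
    return [
        "%s -p %s -c -r -b -f %d -l %d" % (launcher, batch_file, first, last)
        for first, last in batch_ranges(n_image_sets, batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical values: 120 image sets in batches of 50, and a
    # made-up output folder path as seen by the cluster computers.
    for cmd in batch_commands("/path/to/output/Batch_data.mat", 120, 50):
        print(cmd)
```

Each printed line would then be wrapped in whatever job submission command your cluster uses, so that every batch runs as its own job.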
Once all the jobs are submitted, the cluster will run each batch individually
and output any measurements or images specified in the pipeline. Specifying the output filename when
calling CellProfiler will also produce an output file containing the measurements
for that batch of images in the output folder. Check the output from the batch
processes to make sure all batches complete. Batches that fail for transient reasons
can be resubmitted.
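One way to check for batches that did not finish is to look for the per-batch output files. The sketch below assumes that you named each batch's output file after its image set range; the pattern "%d_to_%d_OUT.mat" and the function name are hypothetical, so adjust them to whatever naming you actually used. A file that exists and is non-empty does not prove that the batch succeeded, so treat the result as a first pass only.

```python
import os


def missing_batches(output_dir, ranges, name_pattern="%d_to_%d_OUT.mat"):
    """Return the (first, last) ranges whose output file is absent or empty.

    `name_pattern` is an assumed naming scheme, not one that
    CellProfiler imposes.
    """
    missing = []
    for first, last in ranges:
        path = os.path.join(output_dir, name_pattern % (first, last))
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append((first, last))
    return missing
```

For example, missing_batches("/path/to/output", [(1, 50), (51, 100)]) returns the ranges that are candidates for resubmission.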
For additional help on batch processing, please post your questions on
the CellProfiler forum.