This project was supported by Grant Number R01EB007511 from the National Institute Of Biomedical Imaging And Bioengineering. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of Biomedical Imaging And Bioengineering or the National Institutes of Health.
I gratefully acknowledge the advice and support I have received from Dan Gillespie, Linda Petzold and her research group at UCSB, and Michael Hucka.
Copyright (c) 1999-2009, California Institute of Technology
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This is version 0.12 of Cain, developed by Sean Mauch at the
Center for Advanced Computing Research
at the
California Institute of Technology.
Cain performs stochastic and deterministic simulations of chemical reactions. It can spawn multiple simulation processes to utilize multi-core computers. It stores models, methods, and simulation output (populations and reaction counts) in an XML format. In addition, SBML models can be imported and exported. The models and methods can be read from input files or edited within the program.
The GUI (Graphical User Interface) is written in Python and uses the wxPython toolkit. The solvers are implemented as command line executables, written in C++, which are driven by Cain. This makes it easy to launch batch jobs. It also simplifies the process of adding new solvers. Cain offers a variety of solvers:
The reactions may have mass-action kinetic laws or arbitrary propensity functions. For the latter, custom command line executables are generated when the simulations are launched. For the former one has the choice of generating a custom executable or of using one of the built-in mass-action solvers. Compiling and launching the solvers is done internally; you do not need to know how to write or compile programs. However, to use the custom executables your computer must have compiler software. Without a compiler you can only simulate systems with mass-action kinetics.
Once you have run a simulation to generate trajectories (possible realizations of the system) you can visualize the results by plotting the species populations or reaction counts. You can also view the output in a table or export it to a spreadsheet.
Cain is free, open source software that is available at http://cain.sourceforge.net/. Distributions are available for Mac OS X, Microsoft Windows, and Linux/Unix. See the appropriate section below for installation instructions.
Mac OS X.
To install Cain, download the disk image and drag the application bundle to
your Applications folder (or wherever you want to place it). The CainExamples
folder contains data files. Drag it to an appropriate location.
To use Cain you will need Python, wxPython, numpy, and matplotlib. Unfortunately, Leopard (10.5) comes with rather old versions of the first three and does not have matplotlib. There are several ways of obtaining the necessary software to run Cain. The easiest solution is to install the Enthought Python Distribution. The EPD is designed for those working in scientific computing and comes with all of the packages that Cain needs. It is a commercial product, but is free for educational use if you are associated with a degree-granting institution.
The other option is to download and install the packages.
Get Python, wxPython, and numpy from the sites indicated above.
(Just download the binaries; installation is a snap.)
The easiest way to get matplotlib is to use the
EasyInstall
module. Just follow the directions in the 'Installing "Easy Install"'
section. Then in an xterm execute the commands:
sudo easy_install matplotlib
To upgrade a package with EasyInstall, use the -U option, for example:
sudo easy_install -U matplotlib
If you do not have the necessary packages installed, Cain will show an error
message when you attempt to launch the application.
In order to compile custom executables (either for kinetic laws that are not mass-action or to speed up simulations that use mass-action kinetics) you will need a C++ compiler. The GNU GCC compiler is freely available, but it is not installed by default. You can get it by installing the Xcode tools on your Mac OS X install disc. Alternatively, you can download the Xcode package from the Apple Developer Connection. You will need to register for a free account. After that, log in and follow the Downloads and the Developer Tools links. Download and install Xcode 3.0. This will install the compilers as well as Apple's integrated development environment.
To uninstall Cain, simply delete the Cain and CainExamples folders.
Microsoft Windows.
For Microsoft Windows, Cain is distributed as an executable.
Download and run the installer. Also download the example data files.
Unzip these and place them in a convenient location.
The mass-action solvers are
pre-compiled. In order to use custom propensities you will need a compiler;
Cain uses Microsoft Visual Studio 2008. If you do not already have MSVS 2008,
get
Microsoft Visual C++ 2008 Express Edition. It is a free, command line version of Visual Studio.
To uninstall Cain, select Cain→Uninstall Cain from the start menu. Then delete the CainExamples folder.
Linux/Unix.
For Linux or Unix, use the platform-independent distribution. You will
need appropriate versions of
Python,
wxPython,
matplotlib, and
numpy,
as well as a C++ compiler. I recommend using
the current version of GNU GCC.
Note that only Python versions 2.4.x and 2.5.x are currently supported
because the numpy package does not work with later versions.
If you do not have the necessary Python packages installed, Cain will show
an error message when you attempt to launch the application.
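As a quick check before launching, the following minimal Python sketch reports which packages cannot be imported. The import names (in particular `wx` for wxPython) and the helper `missing_packages` are assumptions made for illustration; this is not the check that Cain itself performs.

```python
def missing_packages(names):
    """Return the names in `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the packages listed above (wxPython imports as "wx").
required = ["wx", "numpy", "matplotlib"]
missing = missing_packages(required)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages can be imported.")
```

An empty result only shows that the packages are importable; it does not confirm that their versions are recent enough.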
If you run RedHat Linux the Enthought Python Distribution may be convenient. It includes all of the packages that Cain requires. There is a free version for those associated with educational institutions.
Download the platform-independent distribution and place it in a convenient
location. Then uncompress the zip file.
unzip Cain.zip
Build the mass-action solvers.
cd Cain
make
Then launch the GUI.
python Cain.py&
There are example files distributed in CainExamples.zip.
To uninstall Cain, simply delete the Cain and CainExamples directories.
CentOS 5.2 Linux.
Because CentOS 5.2 is
"enterprise" Linux, it has an old version of Python.
It is recommended that you upgrade to a more recent version.
The easiest approach is to install the
Enthought Python Distribution.
It includes all of the packages that Cain requires. There is a
free version for those associated with educational institutions.
Select Applications→Add/Remove Software to launch the Package Manager.
Search for f2c and then install the libf2c library. Download and save
the Enthought
Python Distribution installer file. You may either install for all users
or just for your own use. Let's assume the former. (If you do not have
administrator privileges, you can install EPD in your home directory.)
In a terminal switch to
superuser with "su". Start the installation with something like
"sh epd_py25-4.1.30101-rh5-x86.installer". Choose an appropriate
location like "/usr/lib/python2.5".
To put python on your path, execute
"export PATH=/usr/lib/python2.5/bin:$PATH" in a terminal.
It will be convenient to add that command to your .bash_profile in your
home directory.
Next ensure that you have a C++ compiler. In the Package Manager install Development Libraries and Development Tools.
If you want to do things the hard way, it is also possible to use the version of python that ships with CentOS 5.2. Cain has some minor problems, but still works. To get wxPython install Fedora Core 6, Python 2.4 common-gtk2-unicode and Fedora Core 6, Python 2.4 gtk2-unicode from the wxPython downloads page. To get numpy and matplotlib you will need to install something like the following packages:
If you use an old version of python and plot any simulation output you may
see the following message in the shell:
** (python:6182): WARNING **: IPP request failed with status 1030
I don't know what causes this error.
After you exit Cain you will need to press Ctrl-c in the shell to get the
prompt back.
RedHat 5.3 Linux.
Follow the same instructions as for CentOS 5.2.
Fedora 10.
Select System→Administration→Add/Remove Software. Install the following packages:
The application window is composed of nine panels and a toolbar. The panels are labeled in the figure below. These will each be described in turn. If you pause the cursor over a title in any of the panels you will see a tool tip that describes the relevant functionality.
The five panels in the top row follow the workflow of running a stochastic simulation. Select a model (a system of species and the reactions that govern their evolution) in the model list. Then select a simulation method. One can use exact or approximate methods to generate realizations of the process. In the recorder panel you specify the species and reactions that you want to record. In the launcher panel you select the number of trajectories to generate and start the simulation. The simulation output is listed in the rightmost panel.
The four panels in the bottom row comprise the model editor. These show the species, reactions, parameters, and compartments for the selected model. If no model is selected in the model list panel, the model editor is empty. You can edit a model via the grids and their toolbars in each panel.
Cain comes with a collection of example data files in the CainExamples folder.
Open one of these files, BirthDeath.xml for example.
You will see lists of the defined models and methods in the first two panels.
When you select a model, its species, reactions, etc. are shown in the editor
panels in the bottom row. Select a model and a method. Then select the
species and/or reactions to record. You can
generate a suite of trajectories with the quick launch button
in the launcher frame. A description of
the output will appear in the top, right panel.
Play around with the buttons and panels. Most of the text labels and buttons have tool tips describing their functionality. Just pause the cursor over an object to see what it does. Hopefully most of the widgets do what you expect. Note that you must have a model and a method selected in order to launch a simulation. When you get stuck or confused, come back here and continue reading this manual.
Note that there is a splitter between the top and bottom row of panels. It appears differently on each operating system. On Mac OS X splitters are indicated with what appears to be a small indentation at their centers. You can click and drag the splitter to change the ratio of space allocated to the top and bottom rows. There are also vertical splitters between each of the editor panels in the bottom row.
With Cain you can investigate any number of models during a session. The first panel lists the models by their identifiers. The following actions are available from the model list toolbar:
The simulation methods are listed by their identifiers. The
left half of this panel has much the same functionality as the models
list. One addition, however, is the help button.
Hitting the help button will open a window
with documentation on the selected method.
Recall that if a simulation method with associated parameters
has been used in a simulation, you cannot edit or delete it without
first deleting the dependent output. You can change its name
by double clicking on the identifier.
In the right half of this panel you select the simulation method. First select the output category. There are several types of output:
Below is a list of the available methods and options for each output category. Each of the solvers uses sparse arrays for the state change vectors [Li 2006]. The methods that require exponential deviates use the ziggurat method [Marsaglia 2000].
In the recorder panel you specify the species and/or reactions that you
want to record in the simulation output. Use the select-all and
deselect-all buttons to select or deselect all of
the species or reactions.
If you modify a model, hit the refresh button
to update the recordable items.
The items that may be recorded depend on
the type of simulation. When generating time series data of stochastic or
deterministic trajectories, you may select any combination of
species and reactions. When using histograms to record stochastic trajectories,
you may select any combination of species. For trajectories that record
all reaction events, all of the species and reactions are marked as
being recorded.
In this panel you select the number of trajectories that you would like to generate and the number of processes to use. For best performance, set the number of processes to the number of cores that you have in your computer. If you have a dual-core processor, select 2. If you have a dual-socket computer with quad-core processors, select 8.
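If you are unsure how many cores your computer has, Python can report the count. This is a small sketch for a current Python interpreter, not part of Cain. Note that `multiprocessing.cpu_count()` counts logical cores, which may exceed the number of physical cores on processors that run several hardware threads per core.

```python
import multiprocessing

# Number of logical cores reported by the operating system.
cores = multiprocessing.cpu_count()
print("Suggested number of processes: %d" % cores)
```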
Note the slider at the bottom of the launcher panel. This allows you to set the priority of the solvers. (Mac OS X and Linux users may be familiar with the nice program, which allows one to set the priority of a process.) By default, the solvers are launched with the lowest possible priority. This way your computer will remain responsive. You can continue to work with Cain, check your email, or surf the web. If your computer is not busy with other tasks, launching with a low priority has a negligible effect on the running time of the simulations.
There are two ways to launch simulations; you can use
either the mass-action launch button
or the compile and launch button.
The mass-action launch button
will launch the simulation using the built-in mass-action solvers. Of
course you can only use this option if your model uses only mass-action
kinetics. The compile and launch button will compile the solver if necessary
and then launch using
the custom solver. Compilation typically takes a few seconds. If you
entered any propensity functions which are not proper C++ expressions,
you will be notified of the compilation errors.
Note that you can set the compilation options through the preferences button
in the main tool bar.
Unless you are generating a small number of trajectories (in
which case compiling the solver may take longer than running the
simulation), the compile and launch option will probably be faster.
The mass-action solvers use a function that can evaluate
kinetic laws with any stoichiometry. Evaluating this function is not as
fast as evaluating a specific propensity function.
When a simulation is running, the fraction of trajectories that have
been generated is shown in the progress bar. You can abort a running
simulation with the abort button. This will wait for
each process to finish generating its current trajectory and then
exit. The trajectories that have been generated up to that point will
be stored. You can also kill a simulation
with the kill button. This will kill the solver
processes and store the partial results if possible. Note that you can
repeatedly launch suites of simulations to accumulate more trajectories. You
don't have to calculate them all in a single run.
You can run simulations from the command line if you want. This may be
useful if you want to use several computers to generate the output.
(See the Command Line Solvers
section.) First export a solver with the export solver button.
You have the option of
exporting a custom solver or a generic mass-action solver.
A custom solver is specific to the selected model. A generic solver may
be used with any model that has mass-action kinetics.
Next export ASCII input files with the export jobs button.
This will write an input file for
each process; the trajectories will be split between the processes.
Each file contains a description of the selected model and method
as well as the number of trajectories. Suppose you
export a solver to solver.exe. Then you enter 1000 trajectories
and 4 processes in the launcher window and export the job with a base name
of batch. This will create the solver inputs:
batch_0.txt, batch_1.txt, batch_2.txt, and
batch_3.txt.
You can generate the trajectories for the first batch with the command:
./solver.exe <batch_0.txt >trajectories_0.txt
You can import the simulation results with the import trajectories button.
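The batch example above can be sketched in Python. The sketch assumes that the trajectories are divided as evenly as possible among the processes and that the output files for the remaining batches follow the `trajectories_0.txt` pattern; neither detail is stated in this manual, and the helper names are hypothetical.

```python
def split_trajectories(total, processes):
    """Divide `total` trajectories as evenly as possible (assumed scheme)."""
    base, extra = divmod(total, processes)
    return [base + (1 if i < extra else 0) for i in range(processes)]


def batch_commands(solver, base_name, processes):
    """Build one shell command per exported input file."""
    return [
        "%s <%s_%d.txt >trajectories_%d.txt" % (solver, base_name, i, i)
        for i in range(processes)
    ]


print(split_trajectories(1000, 4))
for command in batch_commands("./solver.exe", "batch", 4):
    print(command)
```

Each printed command can be run on a different computer, after which the resulting trajectory files can be imported one at a time.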
If you select the "Mathematica" method in the methods
editor, the export jobs button changes to the export to Mathematica
button.
The Mathematica notebook defines the ODEs that describe the reactions
and species populations as well as commands for numerically solving
the set of ODEs and plotting the results. The final section in the
notebook has commands for saving time series data from the solution in a text
file. You can import this in Cain with the import trajectories
button.
Information about the simulation output is displayed in the final frame. The outputs are grouped according to the model and method used to generate them. After selecting an item, the following operations are available in the toolbar:
The species editor allows you to view and edit the species. The
identifier (ID) field is required. In order to be compatible with
SBML, there are a couple of restrictions.
The identifier is a string that starts with an underscore or a letter
and is composed entirely of underscores, letters and digits. Spaces
and special characters like $ are not allowed. "s1",
"species158", "unstableDimer", and
"_pi_314" are all valid identifiers. (Don't enter the
quotes.) "2x4", "species 158", "s-158"
are invalid. Finally, the identifiers must be unique. The name field
is optional. It is an arbitrary string that describes the species, for
example "unstable dimer". "Coolest #*@$ species
ever!!!" is also a valid name, but show a little restraint - this
is science. The compartment field is optional. It does not affect the
simulation output. By default the name and compartment fields are hidden.
Use the corresponding button in the tool bar to show
or hide these columns.
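The identifier rules above translate directly into a regular expression. The following Python sketch is only an illustration of those rules, with hypothetical helper names; it is not the validation code used by Cain.

```python
import re

# A letter or underscore, followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_identifier(text):
    """Return True if `text` satisfies the identifier rules."""
    return IDENTIFIER.fullmatch(text) is not None


def duplicate_identifiers(identifiers):
    """Return the identifiers that occur more than once, in order of discovery."""
    seen, duplicates = set(), []
    for identifier in identifiers:
        if identifier in seen and identifier not in duplicates:
            duplicates.append(identifier)
        seen.add(identifier)
    return duplicates


for candidate in ["s1", "species158", "unstableDimer", "_pi_314",
                  "2x4", "species 158", "s-158"]:
    print("%-14s %s" % (candidate, is_valid_identifier(candidate)))
```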
The initial amount field is required. It is the initial population of the species and must evaluate to a non-negative integer. You may enter a number or any Python expression involving the parameters. The following are examples of valid initial amounts.
There is a tool bar for editing the species. You can select rows by clicking on the row label along the left side of the table. The following operations are available:
A Note About Identifiers and Compartments.
Note that in designing a scheme to describe species and compartments
one could use either species identifiers that have compartment scope
or global scope. We follow the SBML
convention that the identifiers have global scope and therefore must
be unique. Consider a trivial problem with two species X and
Y and two compartments A and B. If species
identifiers had compartment scope then one could describe the species
as below.
ID | Compartment |
---|---|
X | A |
Y | A |
X | B |
Y | B |
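Because identifiers have global scope and must be unique, the species are instead described with distinct identifiers, as below.
ID | Compartment |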
---|---|
X_A | A |
Y_A | A |
X_B | B |
Y_B | B |
The identifier and name for a reaction are analogous to those for a species. By default the name field is hidden. Specify the reactants and products by typing the species identifiers preceded by their stoichiometries. For example: "s1", "s1 + s2", "2 s2", or "2 s1 + s2 + 2 s3". The stoichiometries must be positive integers. A reaction may have an empty set of reactants or an empty set of products, but not both. (A "reaction" without reactants or products has no effect on the system.) The MA field indicates if the equation has mass-action kinetics. In the Propensity field you can either enter a propensity factor for use in a mass-action kinetic law or you can enter an arbitrary propensity function. The reactions editor has the same tool bar as the species editor. Again, you will be informed of any bad input when you try to launch a simulation.
If the MA field is checked, a reaction will use a mass-action kinetics law. For this case you can enter a number or a Python expression that evaluates to a number in the Propensity field. This non-negative number will be used as the propensity factor in the reaction's propensity function. Below are some examples of reactions and their propensity functions for a propensity factor of k. [X] indicates the population of species X. (0 indicates the empty set; either no reactants or no products.)
Reaction | Propensity Function |
---|---|
0 → X | k |
X → Y | k [X] |
X + Y → Z | k [X] [Y] |
2 X → Y | k [X] ([X] - 1)/ 2 |
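The propensity functions in the table can be computed with one small function. The sketch below uses the usual combinatorial form, the propensity factor times a binomial coefficient for each reactant, which reproduces every row of the table. Applying the same form to reactions of higher order is an assumption, and the function name is hypothetical.

```python
from math import comb


def mass_action_propensity(k, reactants, populations):
    """Mass-action propensity for a propensity factor k.

    reactants maps species identifiers to stoichiometries;
    populations maps species identifiers to current populations.
    """
    propensity = k
    for species, stoichiometry in reactants.items():
        # Number of distinct ways to choose the reacting molecules.
        propensity *= comb(populations[species], stoichiometry)
    return propensity


populations = {"X": 5, "Y": 3}
k = 2.0
print(mass_action_propensity(k, {}, populations))                # 0 -> X
print(mass_action_propensity(k, {"X": 1}, populations))          # X -> Y
print(mass_action_propensity(k, {"X": 1, "Y": 1}, populations))  # X + Y -> Z
print(mass_action_propensity(k, {"X": 2}, populations))          # 2 X -> Y
```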
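If the MA field is not checked, the Propensity field must contain a C++ expression for the propensity function. Below are some examples.
C++ Expression | Propensity Function |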
---|---|
2.5 | 2.5 |
5*pow(s1, 2) | 5 [s1]^2 |
1e5*s2 | 100000 [s2] |
P*s1*s2/(4+s2) | P [s1] [s2] / (4 + [s2]) |
log(Q)*sqrt(s1) | log(Q) √[s1] |
Model parameters are constants that you can use in the expressions for the species initial amounts and the reaction propensities. The ID field is required, but the name is optional (and hidden by default). For the value you can enter a number or any Python expression. You can use the standard mathematical functions and constants as well as other parameters to define the values. You will get an error message if the parameters cannot be evaluated. Below is an example of a valid set of parameters.
ID | Value | Name |
---|---|---|
R | sqrt(10) | Radius |
Area | pi * R**2 | |
Volume | H * Area | |
H | 5.5 | Height |
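Note that in this table Volume refers to H, which is listed after it, so the values cannot simply be evaluated from top to bottom. One way to handle this, sketched below in Python, is to evaluate repeatedly and postpone any expression that refers to a parameter that is not yet known. This is an assumed strategy for illustration, not a description of how Cain evaluates parameters, and `evaluate_parameters` is a hypothetical name.

```python
import math


def evaluate_parameters(expressions):
    """Evaluate parameter expressions that may refer to one another."""
    # Functions and constants from the math module (sqrt, pi, e, ...).
    math_names = {name: getattr(math, name)
                  for name in dir(math) if not name.startswith("_")}
    values = {}
    pending = dict(expressions)
    while pending:
        progressed = False
        for identifier, expression in list(pending.items()):
            # Parameters evaluated so far take precedence over math names.
            # Caveat: a parameter that overrides a math name is only honored
            # by expressions evaluated after it.
            names = dict(math_names, **values)
            try:
                value = eval(expression, {"__builtins__": {}}, names)
            except NameError:
                continue  # depends on a parameter that is not yet known
            values[identifier] = float(value)
            del pending[identifier]
            progressed = True
        if not progressed:
            raise ValueError("cannot evaluate: " + ", ".join(sorted(pending)))
    return values


parameters = {"R": "sqrt(10)", "Area": "pi * R**2",
              "Volume": "H * Area", "H": "5.5"}
for identifier, value in evaluate_parameters(parameters).items():
    print(identifier, value)
```

A set of expressions with a circular or undefined reference raises `ValueError`, which mirrors the error message described above.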
It is permitted to use the mathematical constants pi and e as parameter identifiers. In this case, the values you assign to them will override their natural values. You may also use the names of Python built-in functions and math functions as parameter identifiers. However, to avoid confusion it is best to avoid such names. The following set of parameters is valid, but misleading. Below lambda has the value cos(6).
ID | Value |
---|---|
pi | 3 |
e | 2 |
sin | pi * e |
lambda | cos(sin) |
In this panel you can edit the compartments. Recall that in Cain compartments are optional. They are provided primarily to facilitate importing and exporting models in SBML format. The identifier and size fields are required; the name field is optional.
The tool bar at the top of the window provides short-cuts for common
operations. The models, methods, and simulation output
comprise the application state. Each of these is loaded/stored when
you open/save a file. With the file operations you can
clear the state, open a file, save the state,
save to a different file name, or quit.
You can seed the random number generator by clicking the die icon
and entering an unsigned 32-bit
integer. (An integer between 0 and 4294967295, inclusive.) This number
is used to generate new Mersenne Twister states. You
won't ordinarily need this feature. It is primarily used for testing
and comparing simulation methods.
In the final group you can edit the application preferences
or open the help browser. In the
application preferences you can set up the compiler.
Each of the species initial amounts, the propensity factors for mass-action kinetic laws, and the parameter values are interpreted as Python expressions. Python supports the standard numerical operations. Additionally you can use the functions and constants defined in the math module.
For kinetic laws which are not mass-action, the propensity function must be a valid C++ expression. Both Python and C++ use the C math library, so the syntax is almost the same. One difference to note: C++ does not support the power operator x**y. Use pow(x, y) instead. You can use any of the standard math functions without the std namespace qualification as well as the constants pi and e. Of course you can also use the model parameters. Use the species identifiers to denote the species populations.
Consider the linear birth-death process presented in Section 1.3 of Stochastic Modelling for Systems Biology. The text introduces some differences between continuous, deterministic modelling and discrete, stochastic modelling. Let X(t) be the population of bacteria which reproduce at a rate of λ and die at a rate of μ. The continuous model of this process is the differential equation
$$X'(t) = (\lambda - \mu) X(t), \quad X(0) = x_0$$
which has the solution
$$X(t) = x_0 e^{(\lambda - \mu) t}.$$
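To see that the closed-form solution agrees with direct numerical integration of the differential equation, here is a minimal Python sketch that uses the forward Euler method. The initial population of 50 and the end time of 5 are assumed values chosen for illustration, and forward Euler is used only for simplicity; it is not necessarily the integrator that Cain uses.

```python
import math


def exact_population(x0, birth, death, t):
    """Closed-form solution X(t) = x0 exp((birth - death) t)."""
    return x0 * math.exp((birth - death) * t)


def euler_population(x0, birth, death, t_end, steps):
    """Integrate X' = (birth - death) X with the forward Euler method."""
    dt = t_end / steps
    x = float(x0)
    for _ in range(steps):
        x += dt * (birth - death) * x
    return x


x0, birth, death, t_end = 50, 0.0, 1.0, 5.0  # x0 and t_end are assumed values
print("exact:", exact_population(x0, birth, death, t_end))
print("Euler:", euler_population(x0, birth, death, t_end, 10000))
```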
We can numerically solve the continuous, deterministic model in Cain.
Open the file BirthDeath.xml and select the
Birth Death 0 1 model, which has parameter values λ = 0
and μ = 1. In the method editor, select the
Deterministic Trajectory category with the default method and options.
Click the mass-action launch button
to generate the solution. You can plot the solution by clicking the
plot button
in the output list
and selecting Population trajectories. The population as a
function of time is plotted below.
The ODE integration method in Cain solves a different formulation of the model than the population-based formulation above. (Select Deterministic Trajectory for the category and ODE, Integrate Reactions for the method.) As the method name suggests, it integrates the reaction counts. The birth-death process can be modelled with the following set of differential equations:
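$$B'(t) = \lambda X(t), \quad B(0) = 0$$
$$D'(t) = \mu X(t), \quad D(0) = 0$$
$$X'(t) = B'(t) - D'(t), \quad X(0) = x_0$$
which for λ ≠ μ has the solution
$$B(t) = \lambda x_0 \frac{1 - e^{(\lambda - \mu) t}}{\mu - \lambda}$$
$$D(t) = \mu x_0 \frac{1 - e^{(\lambda - \mu) t}}{\mu - \lambda}$$
$$X(t) = x_0 e^{(\lambda - \mu) t}.$$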
Here B(t) is the number of birth reactions and D(t) is the number of death reactions. The time derivative of the population X'(t) depends on the birth rate and death rate. While the population solutions are the same, the reaction-based formulation of the model carries more information than the population-based formulation. The population depends only on the difference λ - μ. However, the reaction counts depend on the two parameters separately. For λ = 0 and μ = 1 no birth reactions occur. In the output list you can plot with the Cumulative reaction count trajectories option to generate a figure like the one below.
For λ = 10 and μ = 11, the population is the same, but the reaction counts differ. Below is a plot of the reaction counts for this case.
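The difference between the two parameter sets can be checked numerically from the closed-form solution of the reaction-based formulation. In the Python sketch below the initial population of 50 and the evaluation time t = 1 are assumed values; the function name is hypothetical.

```python
import math


def reaction_counts(x0, birth, death, t):
    """Return (B(t), D(t), X(t)) for the birth-death model with birth != death."""
    growth = math.exp((birth - death) * t)
    factor = x0 * (1.0 - growth) / (death - birth)
    births = birth * factor
    deaths = death * factor
    return births, deaths, x0 * growth


x0, t = 50, 1.0  # assumed initial population and evaluation time
for birth, death in [(0.0, 1.0), (10.0, 11.0)]:
    births, deaths, population = reaction_counts(x0, birth, death, t)
    print("birth=%g death=%g: B=%.3f D=%.3f X=%.3f"
          % (birth, death, births, deaths, population))
```

In both cases the population equals x0 + B(t) - D(t), while the individual reaction counts are much larger for the second parameter set.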
Now consider the discrete stochastic model, which has reaction propensities instead of deterministic reaction rates. This model is composed of the birth reaction X → 2 X and the death reaction X → 0 which have propensities λX and μX, respectively. First we will generate a trajectory that records all of the reactions. Select the Birth Death 0 1 model, and the All Reaction Events category and then generate a trajectory with the mass-action launch button. Below is a plot of the species populations. We see that the population changes by discrete amounts.
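For readers who want to see what such a trajectory contains, here is a minimal Python sketch of Gillespie's direct method for this two-reaction model, recording the state after every reaction event. It is an illustration only: Cain's solvers are separate C++ executables, and the initial population, end time, and seed below are assumed values.

```python
import random


def birth_death_events(x0, birth, death, t_end, rng):
    """Direct method for X -> 2 X (birth) and X -> 0 (death).

    Returns a list of (time, population) pairs, one per reaction event,
    starting with the initial state.
    """
    t, x = 0.0, x0
    events = [(t, x)]
    while x > 0:
        a_birth, a_death = birth * x, death * x
        a_total = a_birth + a_death
        if a_total <= 0.0:
            break  # no reaction can fire
        t += rng.expovariate(a_total)  # exponential time to the next event
        if t > t_end:
            break
        # Choose the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        events.append((t, x))
    return events


events = birth_death_events(50, 0.0, 1.0, 5.0, random.Random(1))
print("number of events:", len(events) - 1)
print("final population:", events[-1][1])
```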
Below we use Cain to reproduce the results in the text that demonstrate how increasing λ + μ while holding λ - μ = -1 increases the volatility in the system. For each test, we generate an ensemble of five trajectories and plot these populations along with the deterministic solution.
[Figure: four panels of population trajectories with the deterministic solution, for λ = 0, μ = 1; λ = 3, μ = 4; λ = 7, μ = 8; and λ = 10, μ = 11.]
For a simple problem like this we can store and visualize all of the reactions. However, for more complicated models (or longer running times) generating a suite of trajectories may involve billions of reaction events. Storing, and particularly plotting, that much data could be time consuming or just impossible on your computer. Thus instead of storing all of the reaction events, one typically stores snapshots of the populations and reaction counts at set points in time. Below we select the Time Series, Uniform category and the Direct method with 51 frames. For each test, we generate an ensemble of ten trajectories and plot the populations and the cumulative reaction counts. Note that because we are only sampling the state, we don't see the same "noisiness" in the trajectories.
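Sampling at frames amounts to reading off the piecewise-constant state at uniformly spaced times. The Python sketch below shows this for a list of (time, population) events. It assumes that the frames include both the start time and the end time; how Cain places its frames is not stated here, and the function name is hypothetical.

```python
def sample_frames(events, t_end, frames):
    """Sample a piecewise-constant trajectory at uniformly spaced times.

    events is a time-ordered list of (time, population) pairs whose first
    entry is the initial state at time 0.
    """
    times = [t_end * i / (frames - 1) for i in range(frames)]
    samples = []
    index = 0
    for t in times:
        # Advance to the last event that occurred at or before time t.
        while index + 1 < len(events) and events[index + 1][0] <= t:
            index += 1
        samples.append(events[index][1])
    return times, samples


events = [(0.0, 3), (0.4, 2), (1.1, 1), (1.7, 0)]
times, samples = sample_frames(events, 2.0, 5)
print(times)
print(samples)
```

Only the state at each frame is kept, so any events between two frames are invisible in the sampled output.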
[Figure: populations and cumulative reaction counts for λ = 0, μ = 1; λ = 3, μ = 4; λ = 7, μ = 8; and λ = 10, μ = 11.]
When you use the plot button
to plot trajectories, there are six plotting methods. You can plot the
populations, the binned
reaction counts (the reaction counts for each frame),
or the cumulative reaction counts. For each of these
you can plot either the statistics (mean and optionally the standard deviation)
or the trajectories.
Below are the six plotting methods for the
birth-death model with λ = 3 and μ = 4.
[Figure: six panels showing Population Statistics, Population Trajectories, Binned Reaction Count Statistics, Binned Reaction Count Trajectories, Cumulative Reaction Count Statistics, and Cumulative Reaction Count Trajectories.]
After you select the plotting method, you can select further options in the plotting window shown below. The window has a list of either the species or the reactions. By clicking on the first checkbox field you can select which items to plot. The "Select all" button will select all of the items. By clicking on a button in the Color field you can select the line color. You can customize the appearance of the lines. The "Color lines" button will automatically color the selected lines according to hue. In the subsequent item fields you can select the line style (solid, dashed, etc.) and the line width. If you select a style for the marker, one will be placed at each frame. In the subsequent fields you customize the appearance of the markers. In the bottom half of the window you can select whether and where the legend will be displayed. Finally you can enter the title and axes labels if you like.
Consider how the value of λ affects the population of X at time t = 2.5. From the plots above it appears that with increasing λ there is greater variance in the population, and also a greater likelihood of extinction (X = 0). However, it is not possible to quantify these observations by looking at trajectory plots. Recording histograms of the state is the right tool for this. We select the Histograms, Transient Behavior output category. Since we are only interested in the population at t = 2.5, we set the end time to that value and set the number of frames to 1. We set the number of bins to 128 and launch a simulation with 100000 trajectories for each value of λ. When plotting histograms you choose the species and the frame (time). You can also choose colors and enter a title and axes labels if you like. The plot configuration window for histograms is shown below.
The histograms for each value of λ are shown below. We see that for λ = 0, the mode (most likely population) is 4, but for λ = 3, 7, or 10, the mode is 0. The likelihood of extinction increases with increasing λ.
[Figure: histograms of the population at t = 2.5 for λ = 0, μ = 1; λ = 3, μ = 4; λ = 7, μ = 8; and λ = 10, μ = 11.]
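The trend in the extinction probability can be cross-checked against the standard closed-form result for the linear birth-death process: for λ ≠ μ, a single individual has no descendants at time t with probability μ(1 - g) / (μ - λ g), where g = exp((λ - μ) t), and independent lineages multiply. This formula is quoted as a known result rather than taken from this manual, and the initial population of 50 used below is an assumption.

```python
import math


def extinction_probability(x0, birth, death, t):
    """P(X(t) = 0) for the linear birth-death process with birth != death."""
    growth = math.exp((birth - death) * t)
    single = death * (1.0 - growth) / (death - birth * growth)
    # Each of the x0 initial individuals starts an independent lineage.
    return single ** x0


x0, t = 50, 2.5  # x0 = 50 is an assumed initial population
for birth, death in [(0.0, 1.0), (3.0, 4.0), (7.0, 8.0), (10.0, 11.0)]:
    print("birth=%g death=%g: P(extinct) = %.4f"
          % (birth, death, extinction_probability(x0, birth, death, t)))
```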
Consider a system with a single species X and two reactions: immigration and death. The immigration reaction is 0→X with a unit propensity factor. The death reaction is X→0 and has propensity factor 0.1. Since both reactions use mass action kinetic laws, the propensities are 1 and 0.1 X, respectively. The analogous deterministic process is X' = 1 - 0.1X. We can find the steady state solution of this by setting X' to zero and solving for X, which yields X = 10. (More accurately it is a stationary point that happens to be a steady state solution.)
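The deterministic steady state can be verified with a few lines of Python. The sketch also evaluates a Poisson distribution with the same mean, which is the standard stationary distribution of the stochastic immigration-death process; that fact is quoted as a known result, not taken from this manual, and the function names are hypothetical.

```python
import math

immigration = 1.0   # propensity of 0 -> X
death_factor = 0.1  # propensity factor of X -> 0

# Setting X' = immigration - death_factor * X to zero gives the steady state.
steady_state = immigration / death_factor


def population(x0, t):
    """Solution of X' = immigration - death_factor * X with X(0) = x0."""
    return steady_state + (x0 - steady_state) * math.exp(-death_factor * t)


def poisson_pmf(n, mean):
    """Probability of n under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


print("steady state:", steady_state)
for x0 in (0, 10, 30):
    print("X(0) = %d gives X(100) = %.4f" % (x0, population(x0, 100.0)))
print("Poisson probability of X = 10: %.4f" % poisson_pmf(10, steady_state))
```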
The discrete system does not have the same kind of steady state solution as the continuous model. At steady state, there is a probability distribution for the population of X. To determine this distribution open the ImmigrationDeath.xml example file and select ImmigrationDeath10 from the model list and SteadyState from the list of methods. The initial population is 10. The direct method with the elapsed time solver option generates average value histograms. We specify that we will record X and generate two trajectories. We hit
A simulation method may be either deterministic or stochastic. One can obtain a deterministic method by modelling the reactions with ordinary differential equations. Numerically integrating the equations gives an approximate solution.
The simulations may be performed with exact or approximate methods. Gillespie's direct method and Gibson and Bruck's next reaction method are both exact methods. Various formulations of both of these methods are available. For the direct method, there are a variety of ways of generating the discrete deviate that determines which reaction fires. The next reaction method uses a priority queue. Several data structures can be used to implement a priority queue. The choice of data structure will influence the performance, but not the output.
Tau-leaping is an approximate, discrete, stochastic method. It is used to generate an ensemble of trajectories, each of which is an approximate realization of the stochastic process. The tau-leaping method takes jumps in time and uses Poisson deviates to determine how many times each reaction fires. One can choose fixed time steps or specify a desired accuracy. The latter is the preferred method. There is a hybrid method which combines the direct method and tau-leaping. An adaptation of the direct method is used for reactions that are slow or involve small populations; the tau-leaping method is used for the rest. This offers improved accuracy and performance for the case that some species have small populations. For this hybrid method, one specifies the desired accuracy.
One can model the reactions with a set of ordinary differential equations. In this case one assumes that the populations are continuous and not discrete (integer). One can numerically integrate the differential equations to obtain an approximate solution. Note that since this is a deterministic model, it generates a single solution instead of an ensemble of trajectories.
Each of the stochastic simulation methods uses discrete, uniform deviates (random integers). We use the Mersenne Twister 19937 algorithm to generate these. Both of the exact methods also use exponential deviates that determine the reaction times. For these we use the ziggurat method.
We consider discrete stochastic simulations that are modelled with a set of species and a set of reactions that transform the species' amounts. Instead of using a continuum approximation and dealing with species mass or concentration, the amount of each species is a non-negative integer which is the population. Depending on the species, this could be the number of molecules or the number of organisms, etc. Reactions transform a set of reactants into a set of products, each being a linear combination of species with integer coefficients.
Consider a system of N species represented by the state vector $X(t) = (X_1(t), \ldots, X_N(t))$. $X_n(t)$ is the population of the nth species at time t. There are M reaction channels which change the state of the system. Each reaction is characterized by a propensity function $a_m$ and a state change vector $V_m = (V_{m1}, \ldots, V_{mN})$. $a_m\,dt$ is the probability that the mth reaction will occur in the infinitesimal time interval [t .. t + dt). The state change vector is the difference between the state after the reaction and before the reaction.
To generate a trajectory (a possible realization of the evolution of the system) one starts with an initial state and then repeatedly fires reactions. To fire a reaction, one must answer the two questions:
Once the state vector X has been initialized, Gillespie's direct method proceeds by repeatedly stepping forward in time until a termination condition is reached. (See Exact Stochastic Simulation of Coupled Chemical Reactions.) At each step, one generates two uniform random deviates in the interval (0..1). The first deviate, along with the sum of the propensities, is used to generate an exponential deviate which is the time to the first reaction to fire. The second deviate is used to determine which reaction will fire. Below is the algorithm for a single step.
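The step described above can be sketched in Python as follows. This is a minimal illustration, not Cain's solver code: the function name `direct_step`, the representation of propensities as Python callables, and the use of the inversion method and a linear search are all assumptions made for the example. The model is the immigration-death system used elsewhere in this manual.

```python
import math
import random

def direct_step(x, t, propensities, state_changes, rng):
    """Take one step of the direct method. Return the new time,
    or None if every propensity is zero and no reaction can fire."""
    a = [f(x) for f in propensities]
    alpha = sum(a)
    if alpha <= 0.0:
        return None
    # First deviate: exponential waiting time by inversion.
    # 1 - U lies in (0, 1], so the logarithm is always defined.
    tau = -math.log(1.0 - rng.random()) / alpha
    # Second deviate: pick the reaction with a linear search on the propensities.
    target = rng.random() * alpha
    mu = 0
    cumulative = a[0]
    while cumulative <= target and mu < len(a) - 1:
        mu += 1
        cumulative += a[mu]
    # Fire reaction mu by adding its state change vector.
    for n, change in enumerate(state_changes[mu]):
        x[n] += change
    return t + tau

# Immigration (propensity 1) and death (propensity 0.1 X) for one species.
rng = random.Random(1)
x, t = [10], 0.0
propensities = [lambda x: 1.0, lambda x: 0.1 * x[0]]
state_changes = [(1,), (-1,)]
while t < 50.0:
    t = direct_step(x, t, propensities, state_changes, rng)
```

Recomputing every propensity at each step, as done here, is what gives the O(M) cost per step discussed next.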
Consider the computational complexity of the direct method. We assume that the reactions are loosely coupled and hence computing a propensity $a_m$ is O(1). Thus the cost of computing the propensities is O(M). Determining μ requires iterating over the array of propensities and thus has cost O(M). With our loosely coupled assumption, updating the state has unit cost. Therefore the computational complexity of a step with the direct method is O(M).
To improve the computational complexity of the direct method, we first write it in a more generic way. A time step consists of the following:
There are several ways of improving the performance of the direct method:
The original formulation of the direct method uses the inversion method to generate an exponential deviate. This is easy to program, but is computationally expensive due to the evaluation of the logarithm. There are a couple of recent algorithms (ziggurat and acceptance complement) that have much better performance.
There are many algorithms for generating discrete deviates. The static case (fixed probability mass function) is well studied. The simplest approach is CDF inversion with a linear search. One can implement this with a build-up or chop-down search on the PMF. The method is easy to code and does not require storing the CDF. However, it has linear complexity in the number of events, so it is quite slow. A better approach is CDF inversion with a binary search. For this method, one needs to store the CDF. The binary search results in logarithmic computational complexity. A better approach still is Walker's algorithm, which has constant complexity. Walker's algorithm is a binning approach in which each bin represents either one or two events.
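To make the first two approaches concrete, here is a small sketch of CDF inversion with a chop-down linear search and with a binary search. The function names are hypothetical, and `u` is assumed to be a uniform deviate scaled to the interval [0, sum of the PMF). Walker's algorithm is not shown.

```python
import bisect
import itertools

def linear_search(pmf, u):
    """Chop-down search: subtract probabilities from u until it goes negative.
    Needs only the PMF; cost is linear in the number of events."""
    last = len(pmf) - 1
    for i in range(last):
        u -= pmf[i]
        if u < 0.0:
            return i
    return last

def binary_search(cdf, u):
    """Binary search on a stored CDF; cost is logarithmic in the number of events."""
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)

pmf = [0.2, 0.5, 0.3]
cdf = list(itertools.accumulate(pmf))  # must be rebuilt whenever the PMF changes
print(linear_search(pmf, 0.25), binary_search(cdf, 0.25))
```

The comment on `cdf` points at the trade-off discussed next: the binary search is faster per deviate, but the stored CDF has to be rebuilt when the probabilities change.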
Generating discrete deviates with a dynamically changing PMF is significantly trickier than in the static case. CDF inversion with a linear search adapts well to the dynamic case; it does not have any auxiliary data structures. The faster methods have significant preprocessing costs. In the dynamic case these costs are incurred in updating the PMF. The binary search and Walker's algorithm both have linear preprocessing costs. Thus all three considered algorithms have the same complexity for the combined task of generating a deviate and modifying the PMF. There are algorithms that can both efficiently generate deviates and modify the PMF. In fact, there is a method that has constant complexity. See the documentation of the source code for details.
The original formulation of the direct method uses CDF inversion with a linear search. Subsequent versions have stored the PMF in sorted order or used CDF inversion with a binary search. These modifications have yielded better performance, but have not changed the worst-case computational complexity of the algorithm. Using a more sophisticated discrete deviate generator will improve the performance of the direct method, particularly for large problems.
For representing reactions and the state change vectors, one can use either dense or sparse arrays. Using dense arrays is more efficient for small or tightly coupled problems. Otherwise sparse arrays will yield better performance. Consider loosely coupled problems. For small problems one can expect modest performance benefits (10 %) in using dense arrays. For more than about 30 species, it is better to use sparse arrays.
For loosely coupled problems, it is better to continuously update the sum of the propensities α instead of recomputing it at each time step. Note that this requires some care. One must account for round-off error and periodically recompute the sum.
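One way to keep a running sum while guarding against round-off drift is sketched below. The class name `PropensitySum` and the fixed recompute interval are assumptions for illustration; the text above does not say how often Cain recomputes the sum.

```python
class PropensitySum:
    """Maintain the sum of the propensities incrementally."""

    def __init__(self, propensities, recompute_interval=1000):
        self.values = list(propensities)
        self.total = sum(self.values)
        self.recompute_interval = recompute_interval
        self.updates = 0

    def set(self, index, value):
        # Constant-time update: add the difference instead of re-summing.
        self.total += value - self.values[index]
        self.values[index] = value
        self.updates += 1
        # Periodically recompute from scratch to discard accumulated round-off.
        if self.updates >= self.recompute_interval:
            self.total = sum(self.values)
            self.updates = 0
```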
The following options are available with the direct method. Inversion with a 2-D search is the default; it is an efficient method for most problems. If performance is important (i.e. if it will take a long time to generate the desired number of trajectories) it may be worthwhile to try each method with a small number of trajectories and then select the best method for the problem.
Gillespie's first reaction method generates a uniform random deviate for each reaction at each time step. These uniform deviates are used to compute exponential deviates which are the times at which each reaction will next fire. By selecting the minimum of these times, one identifies the time and the index of the first reaction to fire. The algorithm for a single step is given below.
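A minimal sketch of such a step is shown here. The function name and the callable propensities are assumptions for the example, exponential deviates are generated by inversion for simplicity, and reactions with zero propensity are skipped since they cannot fire.

```python
import math
import random

def first_reaction_step(x, t, propensities, state_changes, rng):
    """Take one step of the first reaction method. Return the new time,
    or None if no reaction can fire."""
    best_time, mu = math.inf, None
    for m, f in enumerate(propensities):
        a = f(x)
        if a > 0.0:
            # One exponential deviate per reaction: its tentative waiting time.
            tau = -math.log(1.0 - rng.random()) / a
            if tau < best_time:
                best_time, mu = tau, m
    if mu is None:
        return None
    # The reaction with the smallest waiting time fires.
    for n, change in enumerate(state_changes[mu]):
        x[n] += change
    return t + best_time

# Immigration-death example: propensities 1 and 0.1 X.
rng = random.Random(1)
x, t = [10], 0.0
propensities = [lambda x: 1.0, lambda x: 0.1 * x[0]]
state_changes = [(1,), (-1,)]
while t < 50.0:
    t = first_reaction_step(x, t, propensities, state_changes, rng)
```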
As with the direct method, using an efficient exponential deviate generator will improve the performance. But with the first reaction method an exponential deviate is generated for each reaction, so using a good generator is critical. One can also improve the efficiency by only computing those propensities that have changed. For this one needs a reaction influence data structure. The implementation of the first reaction method in Cain uses these optimizations.
The first reaction method is not as efficient as the direct method. Taking a step has linear complexity in the number of reactions and it requires more random numbers than the direct method. For small problems it has acceptable performance, but it is not efficient for large problems. The first reaction method may be adapted to re-use the reaction times instead of regenerating them at each step. This method is introduced below.
Gibson and Bruck's next reaction method is an adaptation of the first reaction method. (See "Efficient Exact Stochastic Simulation of Chemical Systems with Many Species and Many Channels.") Instead of computing the time to each reaction, one deals with the time at which a reaction will occur. These times are not computed anew at each time step, but re-used. The reaction times are stored in an indexed priority queue (indexed because the reaction indices are stored with the reaction times). Also, propensities are computed only when they have changed. Below is the algorithm for a single step.
Consider the computational complexity of the next reaction method. We assume that the reactions are loosely coupled and hence computing a propensity $a_m$ is O(1). Let D be an upper bound on the number of propensities that are affected by firing a single reaction. Then the cost of updating the propensities and the reaction times is O(D). Since the cost of inserting or changing a value in the priority queue is O(log M), the cost of updating the priority queue is O(D log M). Therefore the computational complexity of a step with the next reaction method is O(D log M).
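The following sketch illustrates the re-use of reaction times. It is a simplification under stated assumptions: the names are hypothetical, the minimum is found with a linear search rather than an indexed priority queue (so finding it costs O(M) here), and a reaction whose propensity was zero simply receives a fresh exponential deviate when it becomes positive again. The list `influence[m]` names the reactions whose propensities change when reaction m fires.

```python
import math
import random

def exponential(rate, rng):
    """Exponential deviate by inversion; infinite for a zero rate."""
    return -math.log(1.0 - rng.random()) / rate if rate > 0.0 else math.inf

def initialize(x, propensities, rng):
    a = [f(x) for f in propensities]
    times = [exponential(rate, rng) for rate in a]
    return a, times

def next_reaction_step(x, a, times, propensities, state_changes, influence, rng):
    """Fire the reaction with the smallest absolute time; return that time."""
    mu = min(range(len(times)), key=times.__getitem__)
    t = times[mu]
    if t == math.inf:
        return None
    for n, change in enumerate(state_changes[mu]):
        x[n] += change
    for m in influence[mu]:
        if m == mu:
            continue
        old, a[m] = a[m], propensities[m](x)
        if old > 0.0 and a[m] > 0.0:
            # Re-use the stored time by rescaling the remaining waiting time.
            times[m] = t + (old / a[m]) * (times[m] - t)
        else:
            times[m] = t + exponential(a[m], rng)
    # The fired reaction always needs a new time.
    a[mu] = propensities[mu](x)
    times[mu] = t + exponential(a[mu], rng)
    return t

# Immigration-death example: both reactions change the death propensity.
rng = random.Random(3)
x = [10]
propensities = [lambda x: 1.0, lambda x: 0.1 * x[0]]
state_changes = [(1,), (-1,)]
influence = [[1], [1]]
a, times = initialize(x, propensities, rng)
t = 0.0
while t < 50.0:
    t = next_reaction_step(x, a, times, propensities, state_changes, influence, rng)
```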
One can reformulate the next reaction method to obtain a more efficient algorithm. The most expensive parts of the algorithm are maintaining the binary heap, updating the state, and generating exponential deviates. Improving the generation of exponential deviates is a minimally invasive procedure. Instead of using the inversion method, one can use the ziggurat method or the acceptance complement method. (See Marsaglia 2000 and Rubin 2006.) Reducing the cost of the binary heap operations is a more complicated affair. We present several approaches below.
Indexed Priority Queues
The term priority queue has almost become synonymous with
binary heap. For most applications, a binary heap is an
efficient way of implementing a priority queue. For a heap with M
elements, one can access the minimum element in constant time. The
cost to insert or extract an element or to change the value of an
element is O(log M). Also, the storage requirements are
linear in the number of elements. While a binary heap is rarely the
most efficient data structure for a particular application, it is
usually efficient enough. If performance is important and the heap
operations constitute a significant portion of the computational cost
in an application, then it may be profitable to consider other data
structures.
Linear Search
The simplest method of implementing a priority queue is to store the
elements in an array and use a linear search to find the minimum
element. The computational complexity of finding the minimum element
is O(M). Inserting, deleting, and modifying elements can be
done in constant time. For the next reaction method, linear search is
the most efficient algorithm when the number of reactions is small.
Partitioning
For larger problem sizes, one can utilize the under-appreciated method
of partitioning. One stores the elements in an array, but classifies
the elements into two categories: lower and upper. One uses a splitting
value to discriminate; the elements in the lower partition are less than
the splitting value. Then one can determine the minimum value in the queue
with a linear search on the elements in the lower partition. Inserting,
erasing, and modifying values can all be done in constant time. However,
there is the overhead of determining in which partition an element belongs.
When the lower partition becomes empty, one must choose a new splitting
value and re-partition the elements (at cost O(M)).
By choosing the splitting value so that there are $O(M^{1/2})$
elements in the lower partition, one can attain an average cost of
$O(M^{1/2})$ for determining the minimum element.
This choice balances the costs of searching and re-partitioning.
The cost of a search, $O(M^{1/2})$, times the number
of searches before one needs to re-partition, $O(M^{1/2})$,
has the same complexity as the cost of re-partitioning. There are
several strategies for choosing the splitting value and partitioning
the elements. Partitioning with a linear search is an efficient method
for problems of moderate size.
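A compact sketch of the partitioning idea follows. The class name `PartitionQueue` and the way the splitting value is chosen (the element just above the roughly M^(1/2) smallest, found with `heapq.nsmallest`) are assumptions for illustration; they are one of the several possible strategies, not necessarily the ones used in Cain.

```python
import heapq
import math
import random

class PartitionQueue:
    """Indexed queue of reaction times with a lower and an upper partition."""

    def __init__(self, times):
        self.times = list(times)   # assumed non-empty
        self.lower = set()         # indices whose time is below the splitting value
        self.split = -math.inf

    def set(self, index, value):
        # Constant-time update, plus the test for which partition the value is in.
        self.times[index] = value
        if value < self.split:
            self.lower.add(index)
        else:
            self.lower.discard(index)

    def _repartition(self):
        # O(M) pass: aim for about sqrt(M) elements in the lower partition.
        k = max(1, math.isqrt(len(self.times)))
        self.split = heapq.nsmallest(k + 1, self.times)[-1]
        self.lower = {i for i, v in enumerate(self.times) if v < self.split}
        if not self.lower:
            # Ties or a single element: fall back to searching everything.
            self.split = math.inf
            self.lower = set(range(len(self.times)))

    def top(self):
        """Return the index of the minimum time."""
        if not self.lower:
            self._repartition()
        return min(self.lower, key=self.times.__getitem__)

rng = random.Random(4)
queue = PartitionQueue([rng.random() for _ in range(50)])
first = queue.top()
queue.set(first, queue.times[first] + 1.0)  # the fired reaction moves later in time
```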
Binary Heaps
When using indexed binary heaps, there are a few implementation details
that have a significant impact on performance. See the documentation
of the source code for details.
Binary heaps have decent performance for a wide range of problem sizes.
Because the algorithms are fairly simple, they perform well for small
problems. Because of the logarithmic complexity, they are suitable for
fairly large problems.
Hashing
There is a data structure that can perform each of the operations
(finding the minimum element, inserting, removing, and modifying)
in constant time. This is accomplished with hashing. (One could also
refer to the method as bucketing.) The reaction times are stored in
a hash table.
(See "Introduction to Algorithms, Second Edition.")
The hashing function is a linear function of the reaction
time (with a truncation to convert from a floating point value to an
integer index).
The constant in the linear function is chosen to give the desired load.
For hashing with chaining, if the load is O(1), then all
operations can be done in constant time. As with binary heaps, the
implementation is important.
The following options are available with the next reaction method.
With tau-leaping we take steps forward in time. For each reaction we calculate a predicted average propensity. We then generate Poisson deviates to determine how many times each reaction will fire during the step. The advantage of tau-leaping is that it can jump over many reactions and thus may be much more efficient than exact methods. The disadvantage is that it is not an exact method.
There are several options for the tau-leaping solver. By default it will use an adaptive step size and will correct negative populations. You can also choose not to correct negative populations; in that case the simulation will fail if a species is overdrawn. There is also a fixed time step option. This option is only useful for studying the tau-leaping method. With a fixed step size it is difficult to gauge the accuracy of the simulation.
In tau-leaping, one uses an expected value of the propensities in advancing the solution. The propensities are assumed to be constant over the time step. There are several ways of selecting the expected propensity values. The simplest is forward stepping; the expected propensities are the values at the beginning of the step. One can also use midpoint stepping. In this case one advances to the midpoint of the interval with a deterministic step. Then one uses the midpoint propensity values to take a stochastic step and fire the reactions. Midpoint stepping is analogous to a second order Runge-Kutta method for ODEs. One can also use higher order approximations to determine the expected propensities. You can use a fourth order Runge-Kutta scheme with deterministic steps to choose the expected propensities and then take a stochastic step with these values. Note that regardless of how you choose the expected propensities, the tau-leaping solver is still a first-order accurate stochastic method. That is, you can choose a first, second, or fourth order method for calculating the expected propensities, but you still assume that the propensities are constant when taking the stochastic step. Thus it is a first-order stochastic method. However, using higher order formulas for the expected propensities is typically more accurate.
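The difference between forward and midpoint stepping can be illustrated with a fixed-step sketch. Everything here is an assumption for the example: the function names, the callable propensities, and the Poisson generator (Knuth's multiplication method, which is only adequate for small means). The sketch does not adapt the step size and does not correct negative populations.

```python
import math
import random

def poisson(mean, rng):
    """Poisson deviate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def tau_leap_step(x, tau, propensities, state_changes, rng, midpoint=False):
    """Advance the populations x by one tau-leaping step of length tau."""
    a = [max(f(x), 0.0) for f in propensities]
    if midpoint:
        # Deterministic half step, then evaluate the propensities there.
        half = list(x)
        for rate, change in zip(a, state_changes):
            for n, c in enumerate(change):
                half[n] += 0.5 * tau * rate * c
        a = [max(f(half), 0.0) for f in propensities]
    # Stochastic step: the propensities are held constant over the interval.
    for rate, change in zip(a, state_changes):
        count = poisson(rate * tau, rng)
        for n, c in enumerate(change):
            x[n] += count * c

# Conversion X -> Y with unit rate constant, using midpoint stepping.
rng = random.Random(5)
x = [100, 0]
for _ in range(20):
    tau_leap_step(x, 0.05, [lambda x: x[0]], [(-1, 1)], rng, midpoint=True)
```

In both variants the stochastic step itself is the same; only the propensity values fed into the Poisson deviates differ.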
Cain does not currently offer an implementation of the direct method for systems of reactions that have time-dependent propensities. However, we present the method here because it will help us understand hybrid methods. Let α(t) be the sum of the propensities. If each of the propensities was approximately constant on the time scale of 1/α(t), which is the average time to the next reaction, then an approximate solution method could treat them as if they were actually constant. Of course one would need to evaluate all of the propensities at each step. If any of the propensities varied significantly on that time scale then we would need to account for this behavior. In the following exposition we will assume that no propensities become zero during a step.
Consider the exponential distribution with rate parameter λ. The probability density function is $\lambda e^{-\lambda t}$; the mean is 1/λ. Let E be an exponential deviate with unit rate constant. We can obtain an exponential deviate with rate constant λ simply by dividing by λ. Now consider the case that the rate parameter is not constant. An exponential deviate is T where $\int_0^T \lambda(t)\,dt = E$. Note that for constant λ this equation reduces to λ T = E.
Recall that when using the direct method one uses exponential deviates to determine when reactions fire. To determine the time to the next reaction we generate a unit exponential deviate and then divide that by the sum of the propensities α. This gives us an exponential deviate with rate parameter α. Now consider a system of reactions in which the reaction propensities are functions of time. In order to determine the time to the next reaction we need to generate a unit exponential deviate E and then numerically solve $\int_t^{t+T} \alpha(x)\,dx = E$ for T.
To solve for T we can numerically integrate α(t). Below is a simple algorithm for this.
    T = 0
    while α(t+T) Δt < E:
        E -= α(t+T) Δt
        T += Δt
    T += E / α(t+T)
You might recognize the above algorithm as the forward Euler method, the simplest method for integrating ordinary differential equations. The accuracy of this method depends on Δt. There are more accurate methods of numerically integrating α(t). The midpoint method and the fourth-order Runge-Kutta method are good options.
So now we know how to determine when the next reaction fires, but how do we determine which reaction fires? To do this, we integrate each of the reaction propensities: $\mathrm{pmf}_i = \int_t^{t+T} a_i(x)\,dx$. To select a reaction we draw a discrete deviate with this weighted probability mass function. Below we use the forward Euler method to calculate the time step T and the probability mass function pmf used to pick a reaction. We assume that Δt has been initialized to an appropriate value.
    s = 0
    for i in 1..M:
        pmf_i = 0
        p_i = a_i(t)
        s += p_i
    T = 0
    while s Δt < E:
        E -= s Δt
        T += Δt
        s = 0
        for i in 1..M:
            pmf_i += p_i Δt
            p_i = a_i(t+T)
            s += p_i
    Δt = E / s
    T += Δt
    for i in 1..M:
        pmf_i += p_i Δt
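For readers who want to experiment, the forward Euler procedure can be written as a runnable Python function. This is a sketch, not code from Cain (which does not implement this method): the name `time_and_pmf` is hypothetical, each propensity is assumed to be a Python function of time only, and, as in the exposition, the sum of the propensities is assumed to stay positive.

```python
def time_and_pmf(propensities, t, e, dt):
    """Return the time step T to the next reaction and the integrated
    propensities pmf, given a unit exponential deviate e and Euler step dt."""
    p = [a(t) for a in propensities]
    s = sum(p)
    pmf = [0.0] * len(p)
    T = 0.0
    while s * dt < e:
        # A full Euler step fits: consume it and re-evaluate the propensities.
        e -= s * dt
        T += dt
        for i, a in enumerate(propensities):
            pmf[i] += p[i] * dt
            p[i] = a(t + T)
        s = sum(p)
    # Final partial step uses up the remainder of the deviate.
    dt = e / s
    T += dt
    for i in range(len(p)):
        pmf[i] += p[i] * dt
    return T, pmf

# With constant propensities 2 and 3 this reduces to T = E / 5.
T, pmf = time_and_pmf([lambda t: 2.0, lambda t: 3.0], 0.0, 1.0, 0.01)
```

The entries of `pmf` sum to the original deviate, so they can be passed directly to a discrete deviate generator that accepts unnormalized weights.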
The hybrid direct/tau-leaping method combines the direct method and the tau-leaping method. It is more accurate than tau-leaping for problems that have species with small populations. For some problems it is also faster than tau-leaping. Recall that tau-leaping is only efficient if many reactions fire during a time step. This hybrid method divides the reactions into two groups: volatile/slow and stable. We use the direct method to simulate the reactions in the volatile/slow group and tau-leaping to simulate the stable reactions.
Like regular tau-leaping, one specifies an accuracy goal with the allowed error ε. One assumes that the expected value of the reaction propensities is constant during a time step. The time step is chosen so that the expected relative change in any propensity is less than ε. A reaction is volatile if firing it a single time would produce a relative change of more than ε in any of its reactants. Consider these examples with ε = 0.1: The reaction X → Y is volatile if x < 10. The reaction X + Y → Z is volatile if either x < 10 or y < 10. The reaction 2 X → Y is volatile if x < 20.
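The volatility test in these examples can be expressed in a few lines. The function name `is_volatile` and the dictionary layout are hypothetical; `changes` maps each reactant to the amount by which one firing changes its population. Treating a reactant with zero population as volatile is an assumption of this sketch.

```python
def is_volatile(changes, x, epsilon):
    """True if a single firing changes some reactant by a relative
    amount greater than epsilon."""
    return any(
        x[species] == 0 or amount / x[species] > epsilon
        for species, amount in changes.items()
    )

# The three examples above with epsilon = 0.1.
print(is_volatile({"X": 1}, {"X": 9}, 0.1))                   # X -> Y, x < 10
print(is_volatile({"X": 1, "Y": 1}, {"X": 50, "Y": 5}, 0.1))  # X + Y -> Z, y < 10
print(is_volatile({"X": 2}, {"X": 20}, 0.1))                  # 2 X -> Y, x = 20
```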
Reactions that are "slow" are also simulated with the direct method. A reaction is classified as slow if it would fire few times during a time step. The threshold for few times is 0.1. During a time step one first computes the tau-leaping step τ. Then any reactions in the stable group that have become volatile or slow are moved to the volatile/slow group.
To take a step with the hybrid method we determine a time step τ for the stable reactions and generate a unit exponential deviate e for the volatile/slow reactions. Let σ be the sum of the PMF for the discrete deviate generator. If e ≤ σ τ, we reduce the time step to e/σ and take a tau-leaping step as well as fire a volatile/slow reaction. Otherwise we reduce e by σ τ and save this value for the next step, update the PMF with the integrated propensities, and take a tau-leaping step. To integrate the propensities one can use the forward Euler method, the midpoint method, or the fourth order Runge-Kutta method.
One can generate an approximate deterministic trajectory by considering the system of reactions as a set of ordinary differential equations and then numerically integrating these equations to determine the reaction counts and species populations. There are many schemes for numerically integrating ODEs. The Cain solver uses the Cash-Karp variant of the Runge-Kutta method. This is a fifth-order explicit method with an adaptive step size. There are also a number of solvers with fixed step size. These are primarily useful for testing algorithms. The adaptive step size solver is preferred for normal work.
For a test problem we consider the auto-regulatory network presented in Stochastic Modelling for Systems Biology. There are five species: Gene, P2Gene, Rna, P, and P2, with initial amounts 10, 0, 1, 0, and 0, respectively. There are eight reactions which have mass-action kinetic laws. The table below shows the reactions and propensity factors.
Reaction | Rate constant |
---|---|
Gene + P2 → P2Gene | 1 |
P2Gene → Gene + P2 | 10 |
Gene → Gene + Rna | 0.01 |
Rna → Rna + P | 10 |
2 P → P2 | 1 |
P2 → 2 P | 1 |
Rna → 0 | 0.1 |
P → 0 | 0.01 |
The first figure below shows a single trajectory. A close-up is shown in the next figure. We can see that the system is fairly noisy.
Auto-regulatory system on the time interval [0..50].
Auto-regulatory system on the time interval [0..5].
In order to present a range of problem sizes, we duplicate the species and reactions. For a test problem with 50 species and 80 reactions we have 10 auto-regulatory groups. The reaction propensity factors in each group are scaled by a unit, uniform random deviate. We study systems ranging from 5 to 50,000 species.
The table below shows the performance for various formulations of the direct method. Using a linear search is efficient for a small number of reactions, but does not scale well to larger problems. In the first row we recompute the sum of the propensities at each time step. (This is the original formulation of the direct method.) In the next row we see that immediately updating the sum significantly improves the performance. The following two rows show the effect of ordering the reactions. In the former we periodically sort the reactions and in the latter we swap reactions when modifying the propensities. Ordering the reactions pays off for the largest problem size, but for the rest the overhead outweighs the benefits.
The 2-D search method has the best overall performance. It is fast for small problems and scales well enough to best the more sophisticated methods. Because the auto-regulatory network is so noisy, ordering the reactions hurts the performance of the method.
The binary search on a complete CDF has good performance for the smallest problem size, but has poor scalability. Ordering the reactions is a significant help, but the method is still very slow for large problems. The binary search on a partial, recursive CDF is fairly slow for the smallest problem, but has good scalability. The method is in the running for the second best overall performance.
Because of its complexity, the composition rejection method has poor performance for small problems. However, it has excellent scalability. It edges out the 2-D search method for the test with 80,000 reactions. Although its complexity is independent of the number of reactions, the execution time rises with problem size largely because of caching effects. As with all of the other methods, larger problems and increased storage requirements lead to cache misses. The composition rejection method is tied with the binary search on a partial CDF for the second best overall performance.
Algorithm | Option | 5 species, 8 reactions | 50 species, 80 reactions | 500 species, 800 reactions | 5,000 species, 8,000 reactions | 50,000 species, 80,000 reactions |
---|---|---|---|---|---|---|
Linear Search | Delayed update | 101 | 264 | 1859 | 17145 | 168455 |
Linear Search | Immediate update | 109 | 163 | 780 | 6572 | 63113 |
Linear Search | Complete sort | 107 | 197 | 976 | 7443 | 22862 |
Linear Search | Bubble sort | 110 | 205 | 1001 | 7420 | 25872 |
2-D Search | Default | 109 | 130 | 218 | 347 | 1262 |
2-D Search | Complete sort | 115 | 148 | 247 | 402 | 1566 |
2-D Search | Bubble sort | 124 | 149 | 220 | 328 | 1674 |
Binary Search | Complete CDF | 105 | 219 | 1196 | 10378 | 103209 |
Binary Search | Complete CDF, sorted | 114 | 202 | 835 | 3825 | 30273 |
Binary Search | Partial, recursive CDF | 232 | 328 | 433 | 552 | 1314 |
Rejection | Composition | 341 | 365 | 437 | 482 | 1189 |
In the next table we show the performance of the first reaction method. We consider a simple implementation and two implementations that take innovations from the next reaction method. Because a step in the first reaction method has linear computational complexity in the number of reactions, all of the formulations have poor scalability. The simple formulation is fairly slow for small problem sizes. Even for small problems, there is a heavy price for computing the propensity function and an exponential deviate for each reaction. Using the reaction influence graph to reduce recomputing the propensity functions is a moderate help. Storing absolute times instead of the waiting times greatly improves performance. By storing the absolute times, one avoids computing the propensity functions and an exponential deviate for all of the reactions at each time step. Only the reactions influenced by the fired reaction need to be recomputed. However, this formulation is still not competitive with the direct method.
Option | 5 species, 8 reactions | 50 species, 80 reactions | 500 species, 800 reactions | 5,000 species, 8,000 reactions | 50,000 species, 80,000 reactions |
---|---|---|---|---|---|
Simple | 201 | 1968 | 19843 | 159133 | 1789500 |
Reaction influence | 194 | 1510 | 13324 | 110828 | 890948 |
Absolute time | 133 | 249 | 1211 | 10368 | 102316 |
In the table below we show the performance for various formulations of the next reaction method. Using a linear search is only efficient for a small number of reactions. Manual loop unrolling improves its performance, but it is still not practical for large problems.
The size adaptive and cost adaptive versions of the partition method have pretty good performance. They are competitive with more sophisticated methods up to the test with 800 reactions, but the square root complexity shows in the larger tests.
The binary heap methods have good performance. On 64-bit processors the pair formulation is typically better than the pointer formulation. (Vice-versa for 32-bit processors.)
Using hashing for the priority queue yields the best overall performance for the next reaction method. It is efficient for small problems and has good scalability.
Algorithm | Option | 5 species, 8 reactions | 50 species, 80 reactions | 500 species, 800 reactions | 5,000 species, 8,000 reactions | 50,000 species, 80,000 reactions |
---|---|---|---|---|---|---|
Linear Search | Simple | 124 | 386 | 2990 | 28902 | 287909 |
Linear Search | Unrolled | 120 | 228 | 1116 | 9557 | 94156 |
Partition | Fixed size | 139 | 381 | 582 | 1455 | 5175 |
Partition | Size adaptive | 163 | 193 | 285 | 500 | 1735 |
Partition | Cost adaptive | 124 | 196 | 303 | 537 | 1828 |
Partition | Propensities | 146 | 191 | 333 | 723 | 2515 |
Binary Heap | Pointer | 166 | 199 | 290 | 413 | 1448 |
Binary Heap | Pair | 154 | 192 | 272 | 374 | 1304 |
Hashing | Chaining | 151 | 187 | 307 | 320 | 964 |
The table below shows the best performing formulation in each category. Only the methods based on a linear search perform poorly. The rest at least offer reasonable performance. The direct method with a 2-D search and the next reaction method that uses a hash table offer the best overall performance. The former is faster up to the test with 800 reactions; the latter has better performance for the large problems.
Method | Algorithm | Option | 5 species, 8 reactions | 50 species, 80 reactions | 500 species, 800 reactions | 5,000 species, 8,000 reactions | 50,000 species, 80,000 reactions |
---|---|---|---|---|---|---|---|
Direct | Linear search | Complete sort | 107 | 197 | 976 | 7443 | 22862 |
Direct | 2-D search | Default | 109 | 130 | 218 | 347 | 1262 |
Direct | Binary search | Partial, recursive CDF | 232 | 328 | 433 | 552 | 1314 |
Direct | Rejection | Composition | 341 | 365 | 437 | 482 | 1189 |
First reaction | Linear search | Absolute time | 133 | 249 | 1211 | 10368 | 102316 |
Next reaction | Linear search | Unrolled | 120 | 228 | 1116 | 9557 | 94156 |
Next reaction | Partition | Cost adaptive | 124 | 196 | 303 | 537 | 1828 |
Next reaction | Binary heap | Pair | 154 | 192 | 272 | 374 | 1304 |
Next reaction | Hashing | Chaining | 151 | 187 | 307 | 320 | 964 |
Of course the performance of the various formulations depends upon the problem. The species populations could be highly variable, or fairly stable. The range of propensities could be large or small. However, the performance results for the auto-regulatory network are typical. Most problems give similar results. The biggest difference is that for some systems ordering the reactions is useful when using the direct method. The auto-regulatory system is too noisy for this to improve performance.
Overview.
Cain is written in Python and uses the
wxPython GUI toolkit.
For plotting simulation results, it uses
matplotlib and
numpy. Cain utilizes command line
executables to perform the simulations. These executables read a description
of the problem (model, method, and number of trajectories)
from stdin and write the trajectories to stdout. Cain launches the
executables; it sends the input and captures the output with pipes. When
launching a job, the user selects the number of processes to use. Cain launches
this number of executables. It asks the pool of solvers to each generate
trajectories one at a time until the desired number has been collected.
This allows the user to stop or abort a running simulation and store
the trajectories that have been generated so far.
Source code.
The source code for Cain, both Python application code and C++ solver
code, is available in
STLib.
The Cain application is in stlib/applications/stochastic.
The Python source code is split into the three top-level directories:
gui, io, and state. The script Cain.py
launches the application. Thus you can launch Cain with the shell command
python Cain.py. The solvers are in the solvers directory.
There is a makefile there. The executables use STLib's stochastic and
numerical packages, located in stlib/src. Consult
STLib's documentation for information about the stochastic simulation
methods, random number generators, etc.
Mac OS X Distribution.
For Mac OS X, Cain is distributed as an application bundle. The Python source
code and the command line executables are placed in a folder called
Cain.app. From the Finder, this appears as an application called
Cain. You can make the application bundle by executing
make bundle in stlib/applications/stochastic.
This copies the appropriate content
into the Cain.app/Contents/Resources directory.
I pack the application bundle and example data files (in
stlib/data/stochastic) into a disk image for easy distribution.
To make a disk image, use Disk Utility. Click New Image and make a 40 MB image called Cain. Quit Disk Utility. In the mounted devices, change the name to Cain. Drag the application bundle to the disk image. Rename the data folder "stochastic" to "CainExamples". Remove the folder "CainExamples/sbml/bmd-2007-9-25". Drag the data folder to the disk image. Finally, eject the disk image.
Microsoft Windows Distribution.
For MS Windows, Cain is distributed as a frozen executable.
I use py2exe to accomplish this. (See
that web site for details on how the Python interpreter and the Python source
are packed into a stand-alone executable.)
Execute make win in stlib/applications/stochastic
to build the Cain executable in the dist directory.
I use Inno Setup to make an
installer.
Cain uses command line executables to carry out the simulations. The executables are in the solvers directory.
XML
Cain stores models, methods, simulation output, and random number state
in an XML format. See the Cain XML File Format
section for the specification.
SBML
Cain can import and export SBML models. However, it has limited ability
to parse kinetic laws; complicated expressions may not be parsed. In this case
you have to enter the propensity function in the Reaction Editor.
If the SBML model has reversible reactions, they will each be split into
two irreversible reactions. (The stochastic simulation algorithms only work
for irreversible reactions.) You will need to correct the propensity
functions. Also, only mass-action
kinetic laws can be exported to SBML. Other kinetic laws are omitted.
Input for solvers.
For batch processing, you can export a text file for input to one of the
solvers. The different categories of solvers require different inputs.
However, the input for each of the solvers starts with the following:
<number of species>
<number of reactions>
<list of initial amounts>
<packed reactions>
<list of propensity factors>
<number of species to record>
<list of species to record>
<number of reactions to record>
<list of reactions to record>
<maximum allowed steps>
<number of solver parameters>
<list of solver parameters>
To make the text processing easier and to make the files easier to read, each term in brackets occupies a single line. Note the following about the input fields:
<number of reactants> <index1> <stoichiometry1> ... <indexM> <stoichiometryM> <number of products> <index1> <stoichiometry1> ... <indexN> <stoichiometryN>
An empty set of reactants or products is indicated with a single zero.
Below are a couple examples of packed reactions:
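As a minimal sketch of the packing rule, the Python below packs the birth and death reactions used later in this section. The function names are hypothetical and are not part of Cain; the resulting strings match the packed reactions in the exported input file shown further down.

```python
def pack_side(terms):
    """Pack one side of a reaction from (species index, stoichiometry) pairs."""
    fields = [len(terms)]          # an empty side packs to a single zero
    for index, stoichiometry in terms:
        fields.extend([index, stoichiometry])
    return fields


def pack_reaction(reactants, products):
    """Return the packed form: the reactants followed by the products."""
    fields = pack_side(reactants) + pack_side(products)
    return " ".join(str(field) for field in fields)


birth = pack_reaction([(0, 1)], [(0, 2)])   # X -> 2 X
death = pack_reaction([(0, 1)], [])         # X -> 0
print(birth)   # 1 0 1 1 0 2
print(death)   # 1 0 1 0
```

Joining the two with a space gives the single line of packed reactions for the model.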
The input that is common to all solvers is followed by solver-specific input. Consult the source code for documentation. As an example, each of the exact methods (direct, first reaction, next reaction) that record time series data at uniformly spaced intervals uses the following solver-specific input:
<number of frames>
<list of frame times>
<list of MT 19937 state>
<number of trajectories>
The state of the Mersenne Twister 19937 is a list of 624 32-bit unsigned integers followed by an array index that specifies the current position in the list. Thus the state is defined with 625 integers.
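For readers who want to see a state of this shape, CPython's random module also uses MT 19937 and exposes its state as 624 words followed by a position index. The sketch below only illustrates the 625-integer layout; whether a state produced this way is accepted by the Cain solvers is an assumption that has not been checked here, and the seed is arbitrary.

```python
import random

generator = random.Random(12345)   # arbitrary seed, chosen for illustration
version, internal_state, gauss_next = generator.getstate()

words = internal_state[:-1]     # the 624 32-bit state words
position = internal_state[-1]   # index of the current position in the list

# One line of text in the style of <list of MT 19937 state>.
state_line = " ".join(str(value) for value in internal_state)
print(len(internal_state))      # 625
```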
Consider the following simple problem with one species and two reactions:
birth X → 2 X and death X → 0. Let the propensity
factors be 0.9 and 1.1, respectively. Let the initial population
of X be 50. We wish to use the direct method to simulate the process
from time t = 0 to t = 10, recording all species and reactions
each second (resulting in
11 frames). Enter this model in Cain, set the number of trajectories to
1000, and export it as a batch job with
the file name input.txt. To do this,
click the disk icon in the Launcher
panel. Below is the resulting data file (with most of the Mersenne Twister
state omitted).
1
2
50
1 0 1 1 0 2 1 0 1 0
0.90000000000000002 1.1000000000000001
1
0
2
0 1
0
0

11
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
2147483648 1477382859 ... 1489482022 816301522 625
1000
Solver output.
The different categories of solvers produce different output.
Consult the source code for documentation.
Below is the file format for exact solvers that record trajectories with
frames. As before, each term in brackets occupies a single line.
<number of species>
<number of reactions>
<number of species to record>
<list of species to record>
<number of reactions to record>
<list of reactions to record>
<output class>
<number of frames>
<list of frame times>
for each task:
  <number of trajectories>
  for each trajectory:
    <list of initial MT 19937 state>
    <success or failure>
    <list of populations>
    <list of reaction counts>
    <list of final MT 19937 state>
We can use the input file that we exported above to generate a trajectory with the direct method.
./solvers/Direct2DSearch.exe <input.txt >output.txt
The contents of the output file are shown below. Again, most of the Mersenne Twister state is omitted.
1
2
1
0
2
0 1
TrajectoryFrames
11
0 1 2 3 4 5 6 7 8 9 10
1
2147483648 1477382859 ... 1489482022 816301522 625
success
50 68 58 37 42 25 21 12 8 7 11
0 0 71 53 120 112 147 160 192 200 223 248 246 275 256 294 264 306 275 318 285 324
3669219105 1764262773 ... 664743223 247954458 616
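The populations and reaction counts in this output can be cross-checked against each other. The sketch below assumes that the reaction counts are cumulative and are stored frame by frame, with one value per recorded reaction (birth first, then death); under that reading, the population at each frame is the initial population plus births minus deaths. The function name is hypothetical.

```python
populations = [50, 68, 58, 37, 42, 25, 21, 12, 8, 7, 11]
reaction_counts = [0, 0, 71, 53, 120, 112, 147, 160, 192, 200, 223, 248,
                   246, 275, 256, 294, 264, 306, 275, 318, 285, 324]


def rebuild_populations(initial, counts, reactions_per_frame=2):
    """Recompute the birth-death population at each frame from the counts."""
    rebuilt = []
    for start in range(0, len(counts), reactions_per_frame):
        births, deaths = counts[start:start + reactions_per_frame]
        # Each birth adds one X and each death removes one X.
        rebuilt.append(initial + births - deaths)
    return rebuilt


rebuilt = rebuild_populations(populations[0], reaction_counts)
print(rebuilt == populations)   # True
```

The agreement at all 11 frames supports the assumed layout for this example.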
It is easy to add a new simulation method to Cain if it is similar to one of the built-in methods. Just write a program that reads the same input format and generates the same output format as one of the other solvers. You can write your program in any language and use any software or hardware resources that you have on your machine. Although it is not required, it is a good idea to make your program serial. Cain utilizes concurrency by running multiple instances of a solver. Place your program in the solvers directory. Then edit the _data variable in the file state/simulationMethods.py. Insert a description of your method into one of the existing categories.
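A new solver first has to read the input that is common to all solvers. The Python below is a minimal sketch of such a reader, not the parser used by the built-in solvers. It assumes that each term occupies one line, as described above, and that an empty list is written as an empty line; the function and key names are hypothetical.

```python
def unpack_reactions(fields, reaction_count):
    """Split the flat <packed reactions> field into (reactants, products)."""
    reactions = []
    position = 0
    for _ in range(reaction_count):
        sides = []
        for _side in ("reactants", "products"):
            term_count = fields[position]
            position += 1
            # Each term is a (species index, stoichiometry) pair.
            terms = [(fields[position + 2 * k], fields[position + 2 * k + 1])
                     for k in range(term_count)]
            position += 2 * term_count
            sides.append(terms)
        reactions.append((sides[0], sides[1]))
    return reactions


def parse_common_input(lines):
    """Parse the twelve common lines; return (problem, remaining lines)."""
    def ints(line):
        return [int(token) for token in line.split()]

    def floats(line):
        return [float(token) for token in line.split()]

    # Lines 5, 7 and 10 hold list lengths, which the lists themselves imply.
    problem = {
        "species_count": int(lines[0]),
        "reaction_count": int(lines[1]),
        "initial_amounts": floats(lines[2]),
        "propensity_factors": floats(lines[4]),
        "recorded_species": ints(lines[6]),
        "recorded_reactions": ints(lines[8]),
        "maximum_steps": int(lines[9]),
        "solver_parameters": floats(lines[11]),
    }
    problem["reactions"] = unpack_reactions(ints(lines[3]),
                                            problem["reaction_count"])
    return problem, lines[12:]
```

The remaining lines returned by parse_common_input hold the solver-specific input, which depends on the category of the solver.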
<?xml version="1.0" encoding="utf-8"?>
<cain>
  <listOfModels>
    One or more <model> elements.
  </listOfModels>
  <listOfMethods>
    One or more <method> elements.
  </listOfMethods>
  <listOfOutput>
    One or more output elements.
  </listOfOutput>
  <random>
    Zero or more <stateMT19937> elements.
  </random>
</cain>
In the next few sections we will describe each of the top-level elements. Each element attribute has one of the following formats:
<model id="Identifier" name="String">
  <listOfParameters>
    One or more <parameter> elements.
  </listOfParameters>
  <listOfCompartments>
    One or more <compartment> elements.
  </listOfCompartments>
  <listOfSpecies>
    One or more <species> elements.
  </listOfSpecies>
  <listOfReactions>
    One or more <reaction> elements.
  </listOfReactions>
</model>
Parameters are Python expressions which may use mathematical functions and other parameter identifiers. Parameters must evaluate to a numerical value.
<parameter id="Identifier" expression="PythonExpression" name="String"/>
Compartments are only used for information. They do not affect simulation output.
<compartment id="Identifier" name="String" spatialDimensions="Dimension" size="Number" constant="Boolean" outside="Identifier"/>
The initial amount of a species must evaluate to a non-negative integer.
<species id="Identifier" initialAmount="PythonExpression" name="String" compartment="Identifier"/>
The reaction element is simpler than its SBML counterpart. There is no reversible attribute. In stochastic simulation one represents a reversible reaction by specifying both the forward and backward reactions along with their kinetic laws. Note that while the listOfReactants and listOfProducts elements are optional, at least one of the two must be present. Instead of containing a kineticLaw element, the reaction element has the propensity attribute. For mass action kinetics, the propensity is a Python expression.
<reaction id="Identifier" massAction="true" propensity="PythonExpression" name="String">
  <listOfReactants>
    One or more <speciesReference> elements.
  </listOfReactants>
  <listOfProducts>
    One or more <speciesReference> elements.
  </listOfProducts>
</reaction>
If the reaction does not use a mass action kinetics law, the propensity is a C++ expression. (See the Reaction Editor section.)
<reaction id="Identifier" massAction="false" propensity="C++Expression" name="String">
  ...
</reaction>
The speciesReference element is used to represent reactants and products. The stoichiometry attribute must be a positive integer. Omitting it indicates that the stoichiometry is one.
<speciesReference species="Identifier" stoichiometry="Integer"/>
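As an illustration of the elements above, the sketch below uses Python's xml.etree.ElementTree to build a model element for the birth-death process from the solver input example. The identifiers (BirthDeath, X, birth, death) are made up for this example. The parameter and compartment lists are left out for brevity; whether Cain accepts a model without them is an assumption that has not been checked.

```python
import xml.etree.ElementTree as ET

model = ET.Element("model", id="BirthDeath", name="Birth-death process")

species_list = ET.SubElement(model, "listOfSpecies")
ET.SubElement(species_list, "species", id="X", initialAmount="50")

reactions = ET.SubElement(model, "listOfReactions")

# Birth: X -> 2 X. The omitted reactant stoichiometry means one.
birth = ET.SubElement(reactions, "reaction", id="birth",
                      massAction="true", propensity="0.9")
ET.SubElement(ET.SubElement(birth, "listOfReactants"),
              "speciesReference", species="X")
ET.SubElement(ET.SubElement(birth, "listOfProducts"),
              "speciesReference", species="X", stoichiometry="2")

# Death: X -> 0. With no products, only listOfReactants is present.
death = ET.SubElement(reactions, "reaction", id="death",
                      massAction="true", propensity="1.1")
ET.SubElement(ET.SubElement(death, "listOfReactants"),
              "speciesReference", species="X")

xml_text = ET.tostring(model, encoding="unicode")
```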
state/simulationMethods.py. The category indicates what kind of
output the solver produces: histograms, trajectories with state recorded at
specified frames, or trajectories that record every reaction. Next is the
simulation method: direct, next reaction, tau-leaping, etc. Most methods
have a number of options. For instance, with tau-leaping you can choose
to use adaptive step size or a fixed step size.
The starting time of the simulation is zero. The time interval is specified
by defining the end time. For solvers that record time series data at
uniformly-spaced intervals, the number of frames defines the frame times.
For solvers that generate histograms
one needs to specify the number of bins. The solvers dynamically adjust the
bin width to maintain this constant number of bins. Some solvers require
a parameter value. For example, tau-leaping with an adaptive step size uses
an error tolerance parameter. Which solvers require a parameter is
indicated in state/simulationMethods.py.
<method id="Identifier" category="Integer" method="Integer" options="Integer" endTime="Number" numberOfFrames="Integer" numberOfBins="Integer" solverParameter="Number"/>
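Since the start time is zero and the frames are uniformly spaced, the frame times follow from the endTime and numberOfFrames attributes. The sketch below assumes the first frame is at time zero and the last frame is at the end time, which matches the earlier example (end time 10, 11 frames, one frame per second). The function name and the handling of a single frame are assumptions.

```python
def uniform_frame_times(end_time, number_of_frames):
    """Return uniformly spaced frame times from 0 to end_time, inclusive."""
    if number_of_frames < 1:
        raise ValueError("At least one frame is required.")
    if number_of_frames == 1:
        # Assumption: a single frame records the state at the end time.
        return [float(end_time)]
    step = end_time / (number_of_frames - 1)
    return [index * step for index in range(number_of_frames)]


print(uniform_frame_times(10.0, 11))
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```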
histogramFrames - Record histograms of species populations
at specified frames.
trajectoryFrames - Record the species populations and
reaction counts at specified frames. In this way one can plot the
realizations as a function of time.
trajectoryAllReactions - Record every reaction event for
each trajectory.
<histogramFrames model="Identifier" method="Identifier" numberOfTrajectories="Integer">
  <frameTimes>
    List of numbers.
  </frameTimes>
  <recordedSpecies>
    List of indices.
  </recordedSpecies>
  One <histogram> element for each frame and each recorded species.
</histogramFrames>
For a histogram one stores the lower bound and bin width as attributes. The number of bins can be deduced from the lists of bin values. One also stores the frame index and recorded species index as attributes. The histogram bin values are stored in two lists. By computing the histogram distance between the two parts, one can estimate the error in the combined histogram.
<histogram lowerBound="Number" width="Number" frame="Integer" species="Integer">
  <firstHistogram>
    List of numbers.
  </firstHistogram>
  <secondHistogram>
    List of numbers.
  </secondHistogram>
</histogram>
In a trajectoryFrames element one records the list of frame times, the indices of the recorded species, and the indices of the recorded reactions. For each trajectory generated, there is a list of populations and a list of reaction counts. The number of populations in each list is the product of the number of frames and the number of recorded species. The populations at a given frame are contiguous. Likewise for the recorded species.
<trajectoryFrames model="Identifier" method="Identifier">
  <frameTimes>
    List of numbers.
  </frameTimes>
  <recordedSpecies>
    List of indices.
  </recordedSpecies>
  <recordedReactions>
    List of indices.
  </recordedReactions>
  For each trajectory:
  <populations>
    List of numbers.
  </populations>
  <reactionCounts>
    List of numbers.
  </reactionCounts>
</trajectoryFrames>
In a trajectoryAllReactions element one stores the simulation end time as an attribute. Unlike the other simulation output elements, this quantity can't be deduced from a list of frame times. For each trajectory generated, one records a list of reaction indices and a list of reaction times. Each index/time pair specifies a reaction event.
<trajectoryAllReactions model="Identifier" method="Identifier" endTime="Number">
  For each trajectory:
  <indices>
    List of indices.
  </indices>
  <times>
    List of numbers.
  </times>
</trajectoryAllReactions>
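Because every reaction event is recorded, the populations at any time can be recovered by replaying the events from the initial amounts. The sketch below assumes the reaction times are stored in increasing order. The function name, the state change table, and the three events are hypothetical; they are not taken from Cain output.

```python
def populations_at(time, initial_amounts, state_changes, indices, times):
    """Replay index/time pairs up to `time` and return the populations."""
    populations = list(initial_amounts)
    for reaction, event_time in zip(indices, times):
        if event_time > time:
            break   # times are assumed to be in increasing order
        for species, change in state_changes[reaction]:
            populations[species] += change
    return populations


# Birth-death process: reaction 0 adds one X, reaction 1 removes one X.
state_changes = [[(0, +1)], [(0, -1)]]
indices = [0, 0, 1]          # hypothetical events: birth, birth, death
times = [0.1, 0.5, 0.7]

print(populations_at(0.6, [50], state_changes, indices, times))   # [52]
```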
<random seed="Integer">
  For each state:
  <stateMT19937>
    List of integers.
  </stateMT19937>
</random>
Why is this program called Cain?
I couldn't think of a Catchy And Informative Name.
Where can I get the latest version of Cain?
http://cain.sourceforge.net/
I found an error. What should I do?
Don't panic! Send an email to . Please include
any relevant data files and a description of what causes the problem. If
Cain wrote a file called ErrorLog.txt, include that as well.
I have a question or feature request. Who should I contact?
Why can't I edit the model or the method?
You have generated output using that model or that method.
Once you have done this, Cain will not let you modify them. If
it did, the model or method would no longer correspond
to the generated output. If you delete the output
you will be able to edit the model or method.
Saving plots in PNG (Portable Network Graphics) format may not work. Select another format, like PDF (Portable Document Format), instead.
The status bar at the bottom of the main window does not work correctly on OS X, so the tool tips are not displayed. This is a known issue with wxPython.
Version 2.8.9.2 of wxPython may not correctly display tables in the documentation. I would recommend using version 2.8.9.1. You can also view the PDF version (available as a download) for the correct output.