
BatchExperiments Tutorial

A short tutorial designed to introduce a new user to the basic features and terminology of BatchExperiments. It is based on my experience learning the system and does not cover many details; the aim is a compact introduction to the workflow and terminology, so that a new user is better equipped to tackle the Technical Report.

At the time I began to learn BatchExperiments, I could not find such a compact introductory presentation, instead having to flip back and forth between the Technical Report and the online documentation. As of right now the simulations in the code are purely a technical demonstration, and do not make a great deal of statistical sense. I may fix this at some point.

Experiment Registries

BatchExperiments manages all of the information needed to execute your simulation in a conceptual object known as a registry. A registry consists of two parts: an ExperimentRegistry object in R, which is the primary way for the user to interact with the registry, and a database on the filesystem where the status and results are collected. We will refer to a generic instance of the R object as a reg. When a reg is created, it is associated with a filesystem database. The reg object can then be deleted and recreated at will, as all of the persistent state of the simulation is stored in the filesystem database, not within R.

When first setting up a new simulation, use makeExperimentRegistry to create a new filesystem database and obtain a reg object associated with it. The most immediately relevant parameters of this function are id, a string used to name the simulation, and packages, a vector of names of packages required by your simulation code.

reg <- makeExperimentRegistry(id="myExperiment", packages=c("nlme", "lme4"))

NB: There is also a file.dir parameter which can be used to provide a path where the filesystem database will be placed. The default value places the database in the "<id>-files" directory within R's current working directory.
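For example, to place the database at an explicit location (the path below is only a placeholder):

reg <- makeExperimentRegistry(id="myExperiment",
                              file.dir="~/simulations/myExperiment-files",
                              packages=c("nlme", "lme4"))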

Once the database is created, the reg object can be reconstructed (e.g., in a later R session) with the loadRegistry function by passing it the path to the folder containing the filesystem database.

reg <- loadRegistry("myExperiment-files")

Defining Problems

A problem is a method for generating a dataset suitable for further analysis. It consists of a static object, which does not change across replications; a dynamic function, which is called on each replication to produce a random dataset; or both.

As an example, we will implement a bootstrap study of simple linear regression models. In a bootstrap investigation one resamples the original observations to generate the test data sets. Thus, we can take the static part of the problem to be the original data frame, and the dynamic part to be a function producing a vector of indices which can be used to extract the resampled data from the static part.

The dynamic function can accept an arbitrary set of parameters, which can be used to control the properties of the generated data. The values taken by most of these parameters will be defined later in a parameter design. However, parameters named static and job are special and are handled differently: the static parameter (if present) will always contain the static object, while the job parameter (if present) will contain an ExperimentJob object which is unique to each replicate in the simulation. Either of these special parameters may be omitted if it is not required; in particular, the job object is rarely necessary.

This function will act as the dynamic part of a bootstrap problem, where the static object is an observed data frame. It resamples n row indices from the static object.

bootstrapRows <- function(static, n, ...){
  sample.int(nrow(static), size=n, replace=TRUE, ...)
}

Once a problem is defined, it must be added to the registry using the addProblem function. As with registries, the id parameter should be a unique string which will be used for later reference to the problem. If your problem lacks either a static or dynamic part, that parameter may be omitted.

addProblem(reg, id="faithful", static=faithful, dynamic=bootstrapRows)

Note that we can study a different observed dataset by adding a new problem with a different static object.

addProblem(reg, id="trees", static=trees, dynamic=bootstrapRows)

Defining Algorithms

An algorithm is a function which transforms a problem instance into some type of result object. The algorithm signature is handled in a similar fashion to the dynamic problem function. Parameters named job and static work exactly as described in the problem function; the dynamic parameter will contain the result of the call to the dynamic function; and any other parameters will be supplied via an algorithm parameter design.

This example algorithm function works with the previously defined problems: it fits a generalized linear model to the static data object, with its rows resampled according to the indices in dynamic, and returns the complete model object.

fitGLM <- function(static, dynamic, formula, family){
  resampled <- static[dynamic,]
  glm(formula=formula, family=eval(parse(text=family)), data=resampled)
}

NB: Parameters supplied through a design must be R atomic types. A combination of parse and eval, as used here for the family argument, provides a way around this requirement. The restriction also applies to problem parameters.
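To illustrate the idiom in isolation (this snippet is not part of the simulation itself), a family specification stored as a plain string can be converted back into the object that glm expects:

familyString <- "Gamma(link=\"log\")"          # an atomic character value, storable in a design
familyObject <- eval(parse(text=familyString)) # reconstructs the family object for glm()
class(familyObject)                            # "family"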

As with a problem, an algorithm must be added to the registry after being defined.

addAlgorithm(reg, "glmAlgorithm", fitGLM)

Defining Experimental Designs

Both the problems and the algorithms described so far are incomplete, because the functions have parameters which have yet to be defined. For instance, the bootstrapRows problem has an explicit parameter n which controls the number of observations in the resampled data, while fitGLM requires the glm() parameters formula, describing the linear model, and family, the error distribution of the generalized model.

Suppose that for our GLM algorithm we wished to investigate the properties of two linear models crossed with two error families. Such a complete factorial design is the easiest design to construct, but more complicated designs can also be created. First we construct vectors containing the desired values of each parameter in the glm algorithm. As mentioned previously, parameters supplied through a design must be one of the R atomic types.

faithfulModels <- c("waiting~eruptions",
                    "waiting~eruptions+I(eruptions^2)")
glmFamilies <- c("gaussian", "Gamma(link=\"log\")") 

The actual factorial design is created using the makeDesign function. The first argument to makeDesign is the registry id of the algorithm or problem for which we are going to create a parameter design. The exhaustive argument is a named list of parameter values from which to construct the factorial design, where the names of the elements in the list will be used as parameter names for the associated values when the algorithm function is called.

desList <- list(formula=faithfulModels, family=glmFamilies)
faithGLMdesign <- makeDesign("glmAlgorithm", exhaustive=desList)

For example, the first combination of parameters in the resulting fitGLM algorithm design will produce calls to glm such as

glm(formula=waiting~eruptions,family=gaussian,data=resampled)

A design for a problem is constructed in the same manner. In our case the only parameter in the problem is n, the resampled dataset size. This sets up a design to explore three different values for this parameter.

faithBootDesign <- makeDesign("faithful", exhaustive=list(n=c(50,100,200)))     

We now have a complete description of the problem and algorithm combinations we wish to explore. We can now add the design to the registry, specifying how many replicates of each combination we wish to simulate. These individual replicates are the atomic computing units of BatchExperiments, and are known as jobs.

addExperiments(reg, prob.designs=faithBootDesign, 
                    algo.designs=faithGLMdesign, repls = 500)

After running this command, the registry will be ready to generate and analyze 500 problem replicates under every combination of design parameters. However, it will not begin executing the jobs until instructed to.
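One way to confirm that the intended combinations were created is summarizeExperiments, which tabulates the registered jobs by problem and algorithm:

summarizeExperiments(reg)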

Details of the current state of the registry's jobs can be obtained using showStatus. At this point the jobs have been defined but not yet submitted. As they progress through the computation pipeline their status changes: Submitted, Started, Running, and then Done, Error, or Expired depending on the result.

showStatus(reg)

Testing the algorithm can be done using the testJob function, which attempts to run a single job from the registry on the local machine. This will not alter the status of the job or registry, nor save the results of the tested job; the results of the run are simply returned to the R console.
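For example, to test-run the job with id 1 (any valid job id will do):

testJob(reg, 1)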

NB: Local process execution means that testJob will not use any of the computing environment settings described in the subsequent section. It is therefore possible for a job to run successfully using testJob but still fail when you attempt to execute the full simulation.

ClusterFunctions

The main attraction of BatchExperiments is the capacity to apply your generic problems and algorithms in a variety of computing environments, in particular on clusters equipped with Batch Scheduling Systems, such as Torque. The computing environment configuration is specified by a cluster function, several common examples of which are included in the package.

The out-of-the-box configuration uses makeClusterFunctionsInteractive(). These cluster functions simply run all jobs locally in the host R process, and are mostly useful for testing purposes. A slightly more useful option is makeClusterFunctionsMulticore(), which spawns multiple R sessions on the host machine in order to take advantage of any additional CPU cores. The cluster functions can be set using the setConfig command.

setConfig(cluster.functions=makeClusterFunctionsMulticore())

NB: This is a quick way of setting the cluster functions within the current session. Once you have your computing environment set up, a more permanent and safer solution is to write a BatchExperiments configuration file, where settings such as these can be specified. See the BatchJobs wiki for more details.
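As a minimal sketch, such a configuration file (conventionally named .BatchJobs.R and placed in your home or working directory) is an ordinary R script that assigns configuration variables, for example:

## .BatchJobs.R -- read automatically when the package is loaded
cluster.functions = makeClusterFunctionsMulticore()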

Torque

More likely, however, you are using BatchExperiments to drive a batch scheduler such as Torque. With the Torque cluster functions, BatchExperiments creates and submits a script to the scheduler using qsub each time a job is submitted. The script is built from a template file which must be created first; a simple example is shown below.

#!/bin/bash
## The torque.tmpl file, a template for building PBS scripts
#PBS -N <%= job.name %>
#PBS -o <%= log.file %>
## merge stderr into stdout
#PBS -j oe
#PBS -l walltime=<%= resources$walltime %>,ncpus=<%= resources$ncpus %>
#PBS -q <%= resources$queue %>
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout

The general form of this script should be familiar to Torque users. The template variables (anything between <%= and %>) contain arbitrary R code that BatchExperiments evaluates and fills in with values appropriate to the job when the script is submitted to qsub. This example script employs a few special variables defined by BatchExperiments, namely the job.name, log.file, resources, and rscript objects.

All of the aforementioned special template objects are generated automatically by BatchExperiments for each job, with the exception of the resources list, for which appropriate values must be supplied. Default values for the resources list may be set using setConfig.

setConfig(default.resources = list(queue="default",
                                   walltime="12:00:00",
                                   ncpus=1))

NB: Since R usually does all of its computation in a single process, there is typically no need to request more than one CPU for each Torque task. BatchExperiments achieves parallelism by submitting multiple Torque jobs simultaneously.

Once the template file is in place, the cluster functions can be set to use Torque by calling makeClusterFunctionsTorque with the path to the script template as the first parameter.

setConfig(cluster.functions=makeClusterFunctionsTorque("./torque.tmpl"))

A final useful configuration variable is max.concurrent.jobs. As the name implies, this variable controls how many jobs the cluster functions will attempt to queue up with the scheduler at a single time.

setConfig(max.concurrent.jobs=32) #value depends on your local config

Submitting Jobs

Now that we have described a set of experiments to execute and configured the computing environment settings, we are ready to begin the computation. Submitting jobs for execution is done with the submitJobs function.

The simplest invocation of submitJobs is shown below.

submitJobs(reg)

NB: This function will not return until all jobs have been successfully submitted to whatever system is set up to execute them, which, depending on your configuration, could take a long time. However, interrupting submitJobs will not corrupt your experiment in any way; it will simply leave the registry in a state where some of the jobs have been submitted and some have yet to be run.

submitJobs takes an optional second argument, a vector of job indices to submit to the computation engine. These vectors can be easily obtained using the find* functions, which query the registry for jobs in various states of computation. For example, to resume a partially completed simulation run, one could enter

submitJobs(reg, findNotStarted(reg))

Other highly useful find functions are findErrors and findDone.
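For instance, a common pattern is to resubmit only the jobs that failed on a previous attempt:

submitJobs(reg, findErrors(reg))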

Chunking Jobs

A default call to submitJobs will instruct BatchExperiments to start a new R session for each individual job. If an individual replicate of your experiment takes a short amount of time to complete, then the overhead associated with initializing R (or having Torque schedule and execute your request) will result in a significant loss of efficiency.

The solution to this problem is to chunk your jobs. For the unfamiliar, chunk is a utility function (from the BBmisc package, on which BatchExperiments depends) which partitions a vector into a list of sub-vectors. The user can specify either the number of elements in each sub-vector or the total number of sub-vectors, as well as whether the elements should be shuffled before partitioning. This is useful because if you supply submitJobs with a list of job index vectors (as opposed to a simple vector of indices), it will group the jobs in each sub-vector together and execute them serially within a single R process. The Technical Report recommends chunking your jobs so that each chunk takes at least 5-10 minutes to complete.

submitJobs(reg, chunk(findNotSubmitted(reg), 
                      chunk.size=50, shuffle=TRUE))

Manipulating Results

Once the jobs have finished, their results can be collected with reduceResultsExperiments, which applies a summarizing function to each completed job's result and combines the returned values with the corresponding experiment parameters. Here we extract the fitted regression coefficients from each bootstrap replicate.

getRegParams <- function(job, res){
  # res is the glm object returned by fitGLM for one job
  x <- as.list(coef(res))
  names(x) <- paste("b", seq_along(x)-1, sep="") #b0,b1,...
  x
}
res <- reduceResultsExperiments(reg, fun=getRegParams)
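The resulting res is a data frame combining each job's experiment parameters with the coefficient values returned by getRegParams. If the complete fitted model for a particular job is needed, it can be retrieved with loadResult (job id 1 is used here purely as an example):

fit1 <- loadResult(reg, 1)  # the full glm object returned by fitGLM for job 1
summary(fit1)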