This repository contains the files needed to reproduce the experiments from the bipca paper. We recommend using the docker image bipca-experiment:latest; however, all of the files needed to reproduce our experimental installation are contained within.
The best way to reproduce the bipca environment is to launch a container from bipca-experiment:latest. Every experiment in the paper was created with this image. Below we detail container configuration and walk through what the image does at runtime.
To get started immediately, run
docker run -it --rm -p 8080:8080 -p 8029:8787 -e USER=$(id --name -u) -e USERID=$(id -g) --name bipca -v /data:/data bipca-experiment:latest
replacing /data:/data with <your_local_data_directory>:/data. This launches jupyter-lab on host port 8080 and rstudio on host port 8029.
The details of this command are provided in "An example run command".
The environment variables for this image are important. They are:
- USER: The username that jupyter-lab will be run under, as well as the user for r-session and any commands run from the entrypoint. Defaults to rstudio-bipca.
- USERID: The user ID assigned to $USER. Defaults to 1000.
- PASSWORD: The password assigned to $USER. Defaults to bipca.
- ROOT: Gives $USER sudo. Defaults to true.
- DISABLE_AUTH: Disables rstudio password authentication. Defaults to true, so no login splash is displayed when connecting to rstudio.
- JUPYTER: Enables jupyter-lab on container start. Defaults to true.
- RSTUDIO: Enables rstudio on container start. Defaults to true.
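As a sketch of how these variables combine, the helper below starts a container that runs only rstudio, with password authentication turned back on. The function name bipca_rstudio_only is hypothetical, and the sketch assumes the boolean variables accept the literal strings true and false, as their defaults suggest.

```shell
# Hypothetical helper: run rstudio only, with password login enabled.
# Assumes the boolean variables accept the literal strings "true"/"false".
bipca_rstudio_only() {
  password=${1:?usage: bipca_rstudio_only PASSWORD}
  docker run -it --rm \
    -p 8029:8787 \
    -e USER="$(id -un)" \
    -e USERID="$(id -u)" \
    -e PASSWORD="$password" \
    -e DISABLE_AUTH=false \
    -e JUPYTER=false \
    --name bipca \
    bipca-experiment:latest
}
```

Calling bipca_rstudio_only with a password of your choice should then serve rstudio on host port 8029 with a login page, under your own username.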
By default, this image installs bipca at runtime using pip install -e /bipca/python. This means that any directory whose python subdirectory is an installable python package can be mounted at /bipca and installed. In particular, we use this for development of bipca: by mounting a host copy of the bipca github repo at /bipca, we can make changes to the source code that propagate into the container. Without any mount, the installed version is the one baked into the image: the python directory of bipca (whichever version was last pulled into the submodule of this repo) was copied to /bipca when the image was built. If a host-local checkout of the bipca github repository is mounted at the container's /bipca at runtime, the editable install points at the mounted source, so host-side changes to bipca (such as pulling the latest commits from master, changing branches, or adding new code) are picked up inside the container.
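To confirm which copy of bipca a running container is actually using, one option is to ask python where it imports the package from. This is a minimal sketch: the helper name bipca_install_path is hypothetical, and it assumes that python is on the container's PATH for docker exec and that the package imports as bipca.

```shell
# Hypothetical helper: print the file that `import bipca` resolves to
# inside a running container (default container name: bipca).
# Assumes `python` is on the container's PATH and the module is named bipca.
bipca_install_path() {
  container=${1:-bipca}
  docker exec "$container" python -c 'import bipca; print(bipca.__file__)'
}
```

A path underneath /bipca suggests that the editable install is in effect, whether the source there comes from the image or from a host mount.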
Let's review what exactly this container does given the default settings.
- Creates $USER rstudio-bipca with $USERID 1000 and $PASSWORD bipca. If you're running unix, this means that any process launched within the container by the default user will run under USERID 1000, which will show up in your host ps as whichever user on your host machine has USERID 1000.
- Gives sudo to $USER within the container.
- Installs bipca from the path /bipca as an editable module using pip.
- Launches a jupyter-lab session on port 8080. This is an extremely permissive notebook environment. There is no password or token associated with it, so anyone who can reach a host port mapped to port 8080 of the container has full access to the notebook.
- Launches rstudio-server listening on port 8787. Since $DISABLE_AUTH is true by default, anyone who requests port 8787 of the docker container is automatically logged in as $USER.
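Because neither service asks for credentials by default, it can be prudent to publish the ports on the loopback interface only. The sketch below uses docker's HOST_IP:HOST_PORT:CONTAINER_PORT form of -p; the helper name bipca_run_local is hypothetical, and all other settings are left at the image defaults.

```shell
# Hypothetical helper: publish jupyter-lab and rstudio on 127.0.0.1 only,
# so the unauthenticated services are not reachable from other machines.
bipca_run_local() {
  data_dir=${1:-/data}
  docker run -it --rm \
    -p 127.0.0.1:8080:8080 \
    -p 127.0.0.1:8029:8787 \
    --name bipca \
    -v "$data_dir":/data \
    bipca-experiment:latest
}
```

With this binding, remote access would go through something like an SSH tunnel rather than an open port on the host.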
The following command was used for all experiments in the paper:
docker run -it --rm -p 8080:8080 -p 8029:8787 -e USER=$(id --name -u) -e USERID=$(id -g) --name bipca -v ~/bipca:/bipca -v /data:/data bipca-experiment:latest
Its anatomy:
- docker run ... bipca-experiment:latest: run a container from the image bipca-experiment:latest.
- -it: interactive session.
- --rm: remove the container when it is stopped.
- -p 8080:8080: forward container port 8080 to host port 8080.
- -p 8029:8787: forward container port 8787 to host port 8029.
- -e USER=$(id --name -u): rename the default user to the user that is calling docker run. This changes the rstudio username, as well as any places within the container where the username is shown, such as the shell prompt or ps.
- -e USERID=$(id -g): change the id of $USER to the id of the current user on the host. This is especially important on systems which run docker natively, e.g. linux, as processes run within the container (for example jupyter-lab or r-session) appear in the host's process manager (viewed with htop or ps) under this $USERID. By setting it to the current user, you ensure that your docker processes are mapped correctly to your username on the host. Note that id -g prints the numeric group ID; on many linux systems this equals the user ID, but if yours differ, id -u prints the user ID.
- --name bipca: name the container bipca, which makes it easy to manipulate from outside of the container.
- -v ~/bipca:/bipca: mount the host-side bipca checkout at ~/bipca (typically /home/$(id --name -u)/bipca) to the container-side volume /bipca (see the section "bipca installation" above).
- -v /data:/data: mount the host /data directory as a volume in the container at /data. As with any volume mounted in this way, changes made to the container /data will persist into the host /data.
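Naming the container is what makes commands like the following possible from a second terminal on the host. These wrappers are hypothetical conveniences around standard docker subcommands, and bipca_shell assumes that bash is available in the image.

```shell
# Hypothetical helpers that rely on the container being named "bipca".

# Open an interactive shell in the running container (assumes bash exists).
bipca_shell() { docker exec -it bipca bash; }

# Show the last N lines of the container's output (default: 50).
bipca_logs() { docker logs --tail "${1:-50}" bipca; }

# Stop the container; because it was started with --rm, it is then removed.
bipca_stop() { docker stop bipca; }
```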
The following files can be used as a reference for rebuilding our environment inside of your own environment without using a dockerfile or docker image.
- The experimental files to replicate experiments from the bipca paper. See bipca-experiment/experiment.
- Scripts for normalization and preprocessing using comparison methods: runNormalization.r
- Build scripts for the environment that all experiments were run in, barring basic installation of conda/mamba, python, rstudio, and littler. These scripts are contained in bipca-experiment/build-scripts.
The following are used to build the docker image bipca-experiment:latest, but are not important for independently reproducing the experiments in your own environment.
- Jupyter notebook and ipython configuration files for setting up jupyter lab as it was used in the paper, plus ccache and R makevars for speeding up R installations. These are less important for recreating the experiments, but they are contained in bipca-experiment/root.
- Configuration files for a jupyter service running in s6. These are unnecessary for most users, but our docker image is set up to use s6 to manage jupyter on container start. See /etc/services.d/jupyter.
- The dockerfile used to build bipca-experiment:latest, contained in bipca-experiment/dockerfiles/bipca-base.dockerfile.
- A build script that increments versions automatically and tags them, bipca-experiment/build.sh.
If you intend to build or modify the docker image ad hoc, we recommend that you either base your image on bipca-experiment:latest, or use bipca-experiment/dockerfiles/bipca-base.dockerfile. We recommend you roll your own build command, as bipca-experiment/build.sh is really only a convenience wrapper for internal development and publishing. To build latest, cd to the root of this repository and run:
docker build -t <IMAGE NAME> --build-arg GITHUB_PAT=<GITHUB_PAT> --target=final -f dockerfiles/bipca-base.dockerfile .
<IMAGE NAME> for latest is bipca-experiment:latest. <GITHUB_PAT> is a build argument that passes a github personal access token into the build. This is used by the R package remotes to install repositories from github. It is not always necessary, but if you share an IP address with many other people who are not using a PAT, github can rate-limit that IP once its quota of unauthenticated requests is exhausted.
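If you build often, a small wrapper can read the token from your environment instead of having it typed on the command line. This is a sketch only: the function name bipca_build is hypothetical, and it assumes it is called from the root of this repository with GITHUB_PAT already exported.

```shell
# Hypothetical helper: build the image, taking the token from the
# GITHUB_PAT environment variable. Run from the root of this repository.
bipca_build() {
  image=${1:-bipca-experiment:latest}
  # Abort with a message if the token has not been exported.
  : "${GITHUB_PAT:?set GITHUB_PAT to a github personal access token}"
  docker build -t "$image" \
    --build-arg GITHUB_PAT="$GITHUB_PAT" \
    --target=final \
    -f dockerfiles/bipca-base.dockerfile .
}
```

Passing a different tag as the first argument, for example bipca_build my-bipca:dev, builds under that name without touching latest.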