paper repository for Vazquez-Garcia & Obermayer et al.
contents:
- `data`: contains input data for paper figures (prepared using `prepare_data.R`)
- `paper_figures.Rmd`: R code to produce paper figures (uses data in `data`, nothing else needed)

sessionInfo:
**R version 4.3.2 (2023-10-31)**
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: grid, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: dendextend(v.1.17.1), lme4(v.1.1-35.1), gtools(v.3.9.5), RColorBrewer(v.1.1-3), variancePartition(v.1.32.5), BiocParallel(v.1.36.0), limma(v.3.58.1), readxl(v.1.4.3), pROC(v.1.18.5), glmnet(v.4.1-8), Matrix(v.1.6-5), car(v.3.1-2), carData(v.3.0-5), ggrepel(v.0.9.5), circlize(v.0.4.16), ComplexHeatmap(v.2.18.0), cowplot(v.1.1.3), scales(v.1.3.0), caret(v.6.0-94), lattice(v.0.21-9), lubridate(v.1.9.3), forcats(v.1.0.0), stringr(v.1.5.1), purrr(v.1.0.2), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.2.1), tidyverse(v.2.0.0), dplyr(v.1.1.4), ggpubr(v.0.6.0) and ggplot2(v.3.5.1)
loaded via a namespace (and not attached): bitops(v.1.0-7), Rdpack(v.2.6), gridExtra(v.2.3), rlang(v.1.1.3), magrittr(v.2.0.3), clue(v.0.3-65), GetoptLong(v.1.0.5), matrixStats(v.1.2.0), compiler(v.4.3.2), png(v.0.1-8), vctrs(v.0.6.5), reshape2(v.1.4.4), pkgconfig(v.2.0.3), shape(v.1.4.6.1), crayon(v.1.5.2), backports(v.1.4.1), pander(v.0.6.5), caTools(v.1.18.2), utf8(v.1.2.4), prodlim(v.2023.08.28), tzdb(v.0.4.0), nloptr(v.2.0.3), xfun(v.0.42), EnvStats(v.2.8.1), recipes(v.1.0.10), remaCor(v.0.0.18), broom(v.1.0.5), parallel(v.4.3.2), cluster(v.2.1.4), R6(v.2.5.1), stringi(v.1.8.3), boot(v.1.3-28.1), parallelly(v.1.37.0), rpart(v.4.1.21), numDeriv(v.2016.8-1.1), cellranger(v.1.1.0), Rcpp(v.1.0.12), iterators(v.1.0.14), knitr(v.1.45), future.apply(v.1.11.1), IRanges(v.2.36.0), splines(v.4.3.2), nnet(v.7.3-19), timechange(v.0.3.0), tidyselect(v.1.2.0), viridis(v.0.6.5), rstudioapi(v.0.15.0), abind(v.1.4-5), timeDate(v.4032.109), gplots(v.3.1.3.1), doParallel(v.1.0.17), codetools(v.0.2-19), listenv(v.0.9.1), lmerTest(v.3.1-3), plyr(v.1.8.9), Biobase(v.2.62.0), withr(v.3.0.0), future(v.1.33.1), survival(v.3.5-7), pillar(v.1.9.0), KernSmooth(v.2.23-22), foreach(v.1.5.2), stats4(v.4.3.2), generics(v.0.1.3), S4Vectors(v.0.40.2), hms(v.1.1.3), aod(v.1.3.3), munsell(v.0.5.0), minqa(v.1.2.6), globals(v.0.16.2), RhpcBLASctl(v.0.23-42), class(v.7.3-22), glue(v.1.7.0), tools(v.4.3.2), fANCOVA(v.0.6-1), data.table(v.1.15.0), ModelMetrics(v.1.2.2.2), gower(v.1.0.1), ggsignif(v.0.6.4), mvtnorm(v.1.2-4), rbibutils(v.2.2.16), ipred(v.0.9-14), colorspace(v.2.1-0), nlme(v.3.1-163), cli(v.3.6.2), fansi(v.1.0.6), viridisLite(v.0.4.2), lava(v.1.8.0), corpcor(v.1.6.10), gtable(v.0.3.4), rstatix(v.0.7.2), digest(v.0.6.34), BiocGenerics(v.0.48.1), pbkrtest(v.0.5.2), rjson(v.0.2.21), lifecycle(v.1.0.4), hardhat(v.1.3.1), GlobalOptions(v.0.1.2), statmod(v.1.5.0) and MASS(v.7.3-60)
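
To compare a local R installation against the versions listed above, the `name(v.version)` entries can be parsed mechanically. The following is a minimal Python sketch and not part of the repository; the function name `parse_session_info` and the short `sample` string (an excerpt of the list above) are illustrative assumptions.

```python
import re

# Matches entries such as "lme4(v.1.1-35.1)" or "data.table(v.1.15.0)"
# in the session info text above.
ENTRY = re.compile(r"([A-Za-z][A-Za-z0-9.]*)\(v\.([^)]+)\)")


def parse_session_info(text):
    """Return a dict mapping package name to version string."""
    return {name: version for name, version in ENTRY.findall(text)}


# Excerpt copied from the "other attached packages" line above.
sample = "dendextend(v.1.17.1), lme4(v.1.1-35.1), gtools(v.3.9.5) and ggplot2(v.3.5.1)"
versions = parse_session_info(sample)
for name, version in sorted(versions.items()):
    print(f"{name}\t{version}")
```

The resulting dictionary can then be compared with whatever a local R session reports, for example to spot packages whose versions differ.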
- `swibrid_runs`: contains config files for various SWIBRID runs on human or mouse data, or the simulations
  - `benchmarks`: config files for the benchmarks
    - `dense`: using dense MSA; can be run as is using `swibrid test` in that folder
    - `sparse`: using sparse MSA; this requires the `sparsecluster` package to be installed
  - `mouse`: config files for mouse data
    - download raw fastq files from SRA (accession PRJNA1190672) into `raw_data` and run `demultiplex_dataset.sh`; this will put fastq and `info.csv` files for individual samples into `input` and make it possible to run all samples in one go
    - download mm10 genome from UCSC or elsewhere
    - download gencode M12 reference and use `swibrid prepare_annotation`
    - use `config.yaml` for running all mouse data
    - use `config_noSg.yaml` for running everything only on Sm + Sa (potentially restrict info files in `input` to reads with Sa primer)
  - `human`: config files for human data
    - raw sequencing data for human donors cannot be shared due to patient privacy legislation
    - `demultiplex_dataset.sh` is used to demultiplex input for each run; demultiplexed fastq and `info.csv` files would be expected in `input`
    - get hg38 genome and gencode v33 reference, create LAST index
    - `config.yaml` for "regular" runs
    - `config_reads_averaging.yaml` to use averaging of features over reads, not clusters
    - `combine_replicates.sh` to pool reads from technical replicates
    - `plot_bars.sh` and `plot_bars.py` to plot isotype fractions as in Fig. 1
    - `plot_circles.sh` and `plot_circles.py` to create bubble plots of Fig. 1
    - `plot_clustering.sh` to create read plots for Fig. 1 and S2
    - `plot_breakpoints.sh` and `plot_breakpoint_stats.py` to create breakpoint matrix plot of Fig. 2A
  - `external`: config files for public datasets (Vincendeau et al. and Panchakshari et al.)
    - for Vincendeau et al., download data from SRA (PRJNA831666) into the Vincendeau subfolder and run `make_info.py` on every sample to create dummy files with primer locations
    - for Panchakshari et al., use `get_data.sh` in the `HTGTS` folder to download data, collapse read mates with `bbmerge`, and create info files
- `supplementary_note.ipynb`: Python code to make plots for supplementary note (needs `numpy`, `scipy`, `pandas`, `seaborn`)
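
Before opening the notebook, it can help to confirm that the packages named above are importable in the current environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is an illustrative assumption and not part of the repository.

```python
import importlib.util

# Packages named above as requirements for supplementary_note.ipynb.
REQUIRED = ["numpy", "scipy", "pandas", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("missing packages:", ", ".join(missing))
else:
    print("all required packages are available")
```

`find_spec` only locates a package without importing it, so the check is quick and has no side effects from the packages themselves.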