TL;DR: use parquet instead of CSV. At a minimum, compress bcerror CSVs with gzip before adding them to the repo.
Parquet files are much smaller on disk and faster to parse than CSVs. Not high priority, but it would be useful to incorporate into remora pipelines where we just want per-base stats.
Might be as simple as:
```r
library(readr)
library(nanoparquet)

write_parquet(read_csv("file.csv"), "file.parquet")

# then compare file sizes
file.info("file.csv")
file.info("file.parquet")

# reload file in subsequent analyses
read_parquet("file.parquet")
```
Could also combine multiple CSVs into one parquet file with a column for sample name.
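A minimal sketch of the combining step, assuming dplyr is available and using hypothetical file names (swap in the repo's actual bcerror CSVs):

```r
library(readr)
library(dplyr)
library(nanoparquet)

# Hypothetical input files; replace with the real bcerror CSV paths
csv_files <- c("sampleA.csv", "sampleB.csv")

# Read each CSV, name the list elements by sample, and stack them;
# .id adds a "sample" column identifying the source file
tables <- lapply(csv_files, read_csv)
names(tables) <- sub("\\.csv$", "", csv_files)
combined <- bind_rows(tables, .id = "sample")

write_parquet(combined, "combined.parquet")
```

Keeping sample name as a column (rather than one parquet per sample) makes downstream grouped summaries a single `group_by(sample)` away.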