I'm a Ph.D. scientist in the biopharmaceutical industry. My background combines hands-on cell-therapy experience (iPSC, TIL, HSC-derived CAR-iNKT) with quantitative analysis and machine learning.
🧪 scrnaseq-tumor-microenvironment — Complete end-to-end single-cell RNA-seq pipeline on the Salcher 2022 LuCA lung cancer atlas (892K cells × 19 NSCLC studies, ~13 GB). Streaming h5py reads off the published h5ad build a stratified ~92K-cell working subsample without loading the full atlas into RAM. scVI corrects 19-study batch effects; nine cell lineages are annotated by Leiden clustering + marker-gene scoring and benchmarked against LuCA's published expert labels (95% agreement, ARI 0.92). A per-patient cell-composition classifier (scikit-learn) predicting tumor histology is evaluated with leave-one-study-out cross-validation and a label-permutation negative control — and reports an honest null result, with a study confound documented rather than over-claimed. Stack: scanpy · scVI · scikit-learn.
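For readers unfamiliar with the ARI figure quoted above, here is a minimal, dependency-free sketch of how the adjusted Rand index compares two labelings of the same cells. It is illustrative only and not the repository's code, which presumably relies on scikit-learn's `adjusted_rand_score`; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (up to renaming); values near 0 mean
    agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical

    # Count pairs of items that share a cluster in both labelings,
    # in labeling A alone, and in labeling B alone.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2           # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big cluster
    return (pairs_both - expected) / (maximum - expected)


# Toy example with hypothetical labels: cluster names differ, partitions match.
predicted = ["T", "T", "B", "B", "Myeloid"]
reference = ["t_cell", "t_cell", "b_cell", "b_cell", "myeloid"]
print(adjusted_rand_index(predicted, reference))
```

Because the index works on pairs of cells rather than on label names, it does not require the predicted lineage names to match the reference vocabulary.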
🧬 codon-discovery-pca-kmeans — PCA, UMAP, and K-Means applied to k-mer frequencies from two bacterial genomes (Caulobacter crescentus and E. coli K-12). Started as an MIT IDSS bootcamp case study; rebuilt as a portfolio piece with silhouette-based cluster selection, cross-organism validation, and a UMAP comparison. The polished version reaches a more conservative biological conclusion than the original — an example of replacing visual cluster-counting with quantitative criteria.
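As a rough illustration of the feature step, the sketch below turns a DNA string into a k-mer frequency vector, the kind of input PCA and K-Means would then operate on. This is an assumption-laden sketch, not the repository's code: the function name, the `step` parameter, and the choice to skip windows containing ambiguous bases are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, step=1):
    """Relative frequency of every possible k-mer over A/C/G/T in `seq`.

    step=1 slides the window one base at a time (overlapping k-mers);
    step=k reads the sequence in consecutive, non-overlapping words.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

    # Assumption: windows containing anything other than A/C/G/T are skipped.
    valid = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(valid.values())

    # Fixed ordering over all 4**k possible k-mers, so vectors from
    # different fragments or genomes line up column by column.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: (valid.get(kmer, 0) / total if total else 0.0) for kmer in all_kmers}


freqs = kmer_frequencies("ATGATG", k=3)
print(freqs["ATG"], freqs["TGA"], freqs["GAT"])
```

Stacking one such vector per genome fragment gives the matrix on which dimensionality reduction and silhouette-based cluster selection can be run.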
🔬 cell-confluency-segmentation — Two-backend (classical + deep-learning) cell-segmentation and confluency-estimation pipeline for adherent-cell microscopy. The classical Otsu + morphology backend runs in under a second; an optional Cellpose backend handles touching cells. Validated on synthetic ground-truth data with known cell counts. Includes a fetch script for the public BBBC005 microscopy dataset.
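To make the classical idea concrete, here is a minimal, dependency-free sketch of Otsu thresholding followed by a confluency estimate. It is not the repository's implementation (which presumably uses scikit-image or OpenCV) and it omits the morphological cleanup step. It assumes 8-bit grayscale input with cells brighter than the background; both function names are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity t that maximizes between-class variance,
    treating pixels <= t as background and pixels > t as foreground."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def confluency(image):
    """Fraction of pixels classified as cells in a 2D list of 0-255 ints.

    Assumption: cells are brighter than the background.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return sum(1 for p in flat if p > t) / len(flat)


# Toy 2x5 "image": six dark background pixels, four bright cell pixels.
image = [[10, 10, 10, 200, 200],
         [10, 10, 10, 200, 200]]
print(confluency(image))
```

On real micrographs the thresholded mask would normally be cleaned with morphological opening and closing before the fraction is computed.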
📊 baby-feeding-trend — Pandas-idiomatic pipeline for analyzing real-world infant feeding patterns by time of day and across months. Started as a personal data project on my own child's feeding log; rebuilt with proper CSV parsing, two-level groupby aggregation, rolling-mean trend visualization, and pytest unit tests. Ships with a synthetic data generator so the analysis runs end-to-end without sharing private data.
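The two aggregation ideas named above can be illustrated without any dependencies. The sketch below is not the repository's pandas code: it mimics a two-level (month, hour) grouping and a fixed-window rolling mean in plain Python. The record layout, function names, and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def volume_by_month_and_hour(records):
    """Sum volume per (year-month, hour-of-day) key.

    `records` is an iterable of (ISO timestamp string, volume) pairs;
    this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for stamp, volume in records:
        t = datetime.fromisoformat(stamp)
        totals[(t.strftime("%Y-%m"), t.hour)] += volume
    return dict(totals)


def rolling_mean(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


# Synthetic records (hypothetical values).
log = [
    ("2024-01-05T08:15:00", 90),
    ("2024-01-06T08:40:00", 120),
    ("2024-02-01T21:05:00", 60),
]
print(volume_by_month_and_hour(log))
print(rolling_mean([1, 2, 3, 4], window=2))
```

In pandas the same steps map onto `DataFrame.groupby` with two keys and `Series.rolling(window).mean()`.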
- Ph.D., Molecular & Cellular Life Sciences — University of Wyoming
- MIT IDSS Applied Data Science Program — Data Science & Machine Learning (Mar–Jun 2024)
- 4 peer-reviewed publications — 2 first-author (Frontiers in Neurology, Journal of Neurotrauma), 2 middle-author (Scientific Reports (Nature Portfolio), European Journal of Neuroscience)
- 4+ years cell-therapy industry experience at uBriGene Biosciences and Forecyte Bio (CDMO settings — CAR-T, TIL, iPSC, plasmid SME, HSC-derived CAR-iNKT)
Python · pandas · numpy · scikit-learn · scanpy · scVI · anndata · h5py · seaborn · matplotlib · scikit-image · OpenCV · Cellpose · UMAP · pytest · Jupyter · Git · R (single-cell RNA-seq)
Note: earlier publications appear under my prior legal name, Wupu Osimanjiang (changed May 2022).