A Python script to stream, filter, and process the Cellosaurus XML for cell line ancestry values
This project contains a Python script that streams and processes the XML file from the Cellosaurus database. The script keeps only the cell lines that have ancestry data and outputs the relevant fields for each one, including disease site, ancestry percentages, and other key information.
CSV file of cell lines and ancestry
XLSX file of cell lines and ancestry
- Streaming: Efficiently process large XML files without loading them entirely into memory.
- Filtering: Filter cell lines based on presence of ancestry data.
- Output: Generate a CSV file with the results and an Excel file with the same results sorted by descending African ancestry percentage as reported in Cellosaurus.
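The three steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the element and attribute names (`cell-line`, `genome-ancestry`, `population-list`, `population-name`, `population-percentage`) are assumptions about the Cellosaurus XML layout and should be checked against the real file, and `SAMPLE_XML`, `LINE-A`/`LINE-B`/`LINE-C`, and `stream_ancestry_rows` are hypothetical names used only for this example. It uses the standard library alone so it runs without the dependencies listed below.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature input. Tag and attribute names are ASSUMED to
# mirror the Cellosaurus XML; verify them against the real file.
SAMPLE_XML = b"""<Cellosaurus>
  <cell-line-list>
    <cell-line>
      <name-list><name type="identifier">LINE-A</name></name-list>
      <genome-ancestry>
        <population-list>
          <population population-name="African" population-percentage="12.5"/>
          <population population-name="European, North" population-percentage="60.0"/>
        </population-list>
      </genome-ancestry>
    </cell-line>
    <cell-line>
      <name-list><name type="identifier">LINE-B</name></name-list>
    </cell-line>
    <cell-line>
      <name-list><name type="identifier">LINE-C</name></name-list>
      <genome-ancestry>
        <population-list>
          <population population-name="African" population-percentage="80.1"/>
        </population-list>
      </genome-ancestry>
    </cell-line>
  </cell-line-list>
</Cellosaurus>"""


def stream_ancestry_rows(source):
    """Yield one dict per cell line that carries ancestry data."""
    # iterparse reads the document incrementally instead of building
    # the whole tree up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "cell-line":
            continue
        populations = elem.findall("./genome-ancestry/population-list/population")
        if populations:  # filtering step: skip lines without ancestry
            row = {"name": elem.findtext("./name-list/name", default="")}
            for pop in populations:
                row[pop.get("population-name")] = float(pop.get("population-percentage"))
            yield row
        elem.clear()  # drop the processed subtree to keep memory low


# Sort by descending African ancestry percentage (missing values count as 0).
rows = sorted(
    stream_ancestry_rows(io.BytesIO(SAMPLE_XML)),
    key=lambda r: r.get("African", 0.0),
    reverse=True,
)

# Write the result as CSV; populations absent for a line are left blank.
fieldnames = ["name"] + sorted({key for r in rows for key in r} - {"name"})
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

For the real file, a path or open file handle would replace `io.BytesIO(SAMPLE_XML)`, and the sorted rows could be handed to pandas to produce the Excel output.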
Clone this repository and install the required dependencies:
git clone https://github.com/YOUR-USERNAME/cellosaurus-xml-processor.git
cd cellosaurus-xml-processor
pip install pandas openpyxl requests