This repository contains code for generating clinical discharge summaries using In-Context Learning (ICL) with differential privacy guarantees. The project uses the MIMIC-IV dataset and various language models through Ollama.
- Python 3.9+
- Access to MIMIC-IV dataset (PhysioNet credentialed user required)
- Ollama installed on your system
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Pull the default model:
ollama pull llama3.2
Note: You can pull additional models later using ollama pull MODEL_NAME. Check available models at https://ollama.com/search
- Disk Space: At least 10GB free space
  - ~8GB for MIMIC-IV dataset files
  - ~2GB for generated outputs and model files
- GPU Memory Requirements:
  - As a rule of thumb, for 16-bit weights you need at least twice as much VRAM (in GB) as the billions of parameters in your chosen model; quantized builds need less
  - Example requirements:
    - Llama 2 7B: ~14GB VRAM
    - Mistral 7B: ~14GB VRAM
    - Mixtral 8x7B (about 47B parameters in total): ~94GB VRAM
  - For smaller GPUs, consider using the smaller variants of these models
- RAM: Minimum 16GB recommended for processing the MIMIC-IV dataset
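The VRAM rule of thumb above can be written as a one-line calculation. This is a minimal sketch: the function name `estimate_vram_gb` and the `bytes_per_param` argument are illustrative and not part of this repository. The default of 2 bytes per parameter is an assumption corresponding to 16-bit weights, and the estimate ignores activation and KV-cache overhead.

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM estimate in GB: parameters (in billions) times bytes per parameter."""
    if params_billions <= 0 or bytes_per_param <= 0:
        raise ValueError("params_billions and bytes_per_param must be positive")
    return params_billions * bytes_per_param


if __name__ == "__main__":
    # Hypothetical usage: the two 7B examples from the list above.
    for name, size in [("Llama 2 7B", 7), ("Mistral 7B", 7)]:
        print(f"{name}: ~{estimate_vram_gb(size):.0f}GB VRAM")
```

Passing a smaller `bytes_per_param` (for example 0.5 for a 4-bit quantized model) gives a correspondingly lower estimate.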
- Create and activate a new conda environment:
conda create -n dp-clinical python=3.9
conda activate dp-clinical
- Clone the repository:
git clone https://github.com/yourusername/DP-Clinical-ICL.git
cd DP-Clinical-ICL
- Install the required packages:
pip install -r requirements.txt
- Request access to MIMIC-IV dataset at:
  - MIMIC-IV Clinical Database: https://physionet.org/content/mimiciv/2.2/
  - MIMIC-IV Notes: https://www.physionet.org/content/mimic-iv-note/2.2/
- Navigate to the data directory:
cd data
- Note: The download process requires approximately 10GB of disk space and may take a considerable amount of time depending on your internet connection. It's recommended to use tmux to prevent the download from being interrupted if your connection drops:
# Install tmux if not already installed
sudo apt-get install tmux
# Create a new tmux session
tmux new -s mimic_download
# Now run the download commands inside tmux
# To detach from the session: press Ctrl+B, then D
# To reattach to the session later:
tmux attach -t mimic_download
- Run the commands to download the necessary files:
wget -r -N -c -np --user [YOUR_USERNAME] --ask-password https://physionet.org/files/mimic-iv-note/2.2/
wget -r -N -c -np --user [YOUR_USERNAME] --ask-password https://physionet.org/files/mimiciv/2.2/
- If everything went well, you should have the following structure:
data/
├── generated/
│   └── [Generated datasets will be saved here]
└── physionet.org/
    └── files/
        ├── mimic-iv-note/
        │   └── 2.2/
        │       └── note/
        │           └── discharge.csv.gz
        └── mimiciv/
            └── 2.2/
                └── hosp/
                    ├── diagnoses_icd.csv.gz
                    ├── procedures_icd.csv.gz
                    ├── d_icd_procedures.csv.gz
                    └── d_icd_diagnoses.csv.gz
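Before moving on, it can help to confirm that the download produced every file in the tree above. The following is a small sketch using only the standard library; the names `EXPECTED_FILES` and `missing_mimic_files` are hypothetical helpers, not part of this repository, and the paths are copied from the directory structure shown above.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "physionet.org/files/mimic-iv-note/2.2/note/discharge.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/diagnoses_icd.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/procedures_icd.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/d_icd_procedures.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/d_icd_diagnoses.csv.gz",
]


def missing_mimic_files(data_dir: str = "data") -> list[str]:
    """Return the expected files that are absent or empty under data_dir."""
    root = Path(data_dir)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = missing_mimic_files()
    if missing:
        print("Missing or empty files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected MIMIC-IV files are present.")
```

Run it from the repository root so that the default `data` directory resolves correctly.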
Before running the generation script, you need to process the MIMIC-IV dataset to create the required format. The extract_data_amc.py script handles this by:
- Loading and merging the necessary MIMIC-IV files
- Formatting ICD codes correctly
- Creating the required data structure with discharge summaries and their associated codes
Note: The extraction process typically takes a few minutes to complete, as it needs to process and merge large CSV files. The exact time depends on your system's CPU and memory speed.
- Make sure all MIMIC-IV files are in place as shown in the directory structure above
- Navigate to the root directory again:
cd ..
- Run the extraction script:
python extract_data_amc.py
This will create two files in your data/ directory:
- mimiciv_icd10.feather: The main dataset file containing:
  - Discharge summaries (text)
  - ICD-10 codes (target, icd10_diag, icd10_proc)
  - Code descriptions (long_title)
- mimiciv_icd10_split.feather: Train/validation/test split information
You can check if the extraction was successful by verifying that both files exist and contain data:
ls -l data/mimiciv_icd10*.feather
The extracted dataset should contain properly formatted records with:
- Diagnostic codes including periods (e.g., "I25.10")
- Procedure codes without periods (e.g., "02HN3DZ")
- Non-empty text fields
- At least one ICD-10 code per record
Only after successful data extraction should you proceed to running DP_ICL_gen.py.
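The record-level expectations listed above can also be checked programmatically. This is a hedged sketch, not part of the repository: `record_problems` is a hypothetical helper that inspects one record given as a plain dictionary. It assumes that three-character diagnostic codes carry no period, which the list above does not state explicitly.

```python
def _as_list(value):
    """Treat a missing value as an empty list; accept any iterable of codes."""
    return [] if value is None else list(value)


def record_problems(record: dict) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("empty text")
    diag = _as_list(record.get("icd10_diag"))
    proc = _as_list(record.get("icd10_proc"))
    if not diag and not proc:
        problems.append("no ICD-10 codes")
    # Assumption: only codes longer than three characters need a period,
    # placed directly after the third character (e.g. "I25.10").
    if any(len(code) > 3 and code[3:4] != "." for code in diag):
        problems.append("diagnostic code without period")
    if any("." in code for code in proc):
        problems.append("procedure code with period")
    return problems
```

With pandas installed, the records could be obtained with `pd.read_feather("data/mimiciv_icd10.feather").to_dict("records")` and passed to the function one at a time.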
If you want to use your own dataset instead of MIMIC-IV, you'll need to format it according to the following specifications:
- Your dataset should be saved as a Feather file (.feather) with the following columns:
  - _id: Unique identifier for each record
  - text: The clinical discharge summary text
  - target: List of ICD-10 codes associated with the text (in order: first every diagnostic code, then every procedure code)
  - icd10_diag: List of ICD-10 diagnostic codes
  - icd10_proc: List of ICD-10 procedure codes
  - long_title: List of long descriptions for the ICD codes
- The ICD-10 codes should be properly formatted:
  - Diagnostic codes should include a period after the first 3 characters (e.g., "A01.1")
  - Procedure codes should not include periods
- Each record should have:
  - Non-empty text field
  - At least one ICD-10 code in either diagnostic or procedure codes
  - Corresponding long titles for each code
- To use your custom dataset, use the --custom_dataset_path parameter to specify its location:
python DP_ICL_gen.py --custom_dataset_path /path/to/your/dataset.feather
This will create processed dataset files in the data/ directory.
Example of a correctly formatted record:
{
    '_id': 1234,
    'text': 'Patient admitted with chest pain...',
    'target': ['I25.10', 'Z95.5', '02HN3DZ'],
    'icd10_diag': ['I25.10', 'Z95.5'],
    'icd10_proc': ['02HN3DZ'],
    'long_title': ['Atherosclerotic heart disease', 'Presence of coronary stent', 'Insertion of stent into coronary artery']
}
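A record in the format described above can be assembled from raw code lists along the following lines. This is a minimal sketch under stated assumptions: `add_period` and `build_record` are hypothetical helper names, and the input is assumed to be lists of (code, long title) pairs, which is not a format defined by this repository.

```python
def add_period(code: str) -> str:
    """Insert a period after the third character of an undotted diagnostic code."""
    code = code.strip().upper()
    if "." in code or len(code) <= 3:
        return code
    return f"{code[:3]}.{code[3:]}"


def build_record(record_id, text, diagnoses, procedures):
    """Build one record; diagnoses and procedures are lists of (code, long title) pairs."""
    diag_codes = [add_period(code) for code, _ in diagnoses]
    # Procedure codes are stored without periods.
    proc_codes = [code.strip().upper().replace(".", "") for code, _ in procedures]
    return {
        "_id": record_id,
        "text": text,
        # Diagnostic codes first, then procedure codes.
        "target": diag_codes + proc_codes,
        "icd10_diag": diag_codes,
        "icd10_proc": proc_codes,
        # Titles follow the same order as "target".
        "long_title": [title for _, title in list(diagnoses) + list(procedures)],
    }
```

With pandas and pyarrow installed, a list of such records could then be written with `pd.DataFrame(records).to_feather("/path/to/your/dataset.feather")`.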
The main script for generating clinical summaries is DP_ICL_gen.py. Here are some key parameters and how to use them:
python DP_ICL_gen.py --model_name llama3.2 --num_shots 5 --generated_dataset_size 100
- --model_name: Specify which Ollama model to use (e.g., llama2, mistral, mixtral)
- --num_shots: Number of few-shot examples (default: 5)
- --generated_dataset_size: Number of samples to generate (default: 100)
- --prompt_index: Index of the prompt template to use (0-2 for the built-in prompts, or 3 for a custom prompt)
- --temperature: Temperature for generation (default: 0.7)
To add your own prompt, you can modify the prompts list in DP_ICL_gen.py. The prompts are defined starting at line 58 of the file:
prompts = [
    "Generate a clinical discharge summary...",  # Prompt 0
    """Please generate a realistic and concise clinical...""",  # Prompt 1
    """Please generate a realistic, concise, and professional...""",  # Prompt 2
    """[ADD HERE YOUR OWN PROMPT]
ICD10-CODES= """  # Prompt 3 (Custom)
]
NOTE:
- The prompt should end with ICD10-CODES= so that the script can insert the ICD10 codes in the right place
To use your custom prompt:
- Open DP_ICL_gen.py
- Find the prompts list
- Replace the text in index 3 at line 111 ([ADD HERE YOUR OWN PROMPT]) with your custom prompt
- Run the script with --prompt_index 3
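How DP_ICL_gen.py inserts the codes is not shown in this README. The following is only a sketch of why the trailing marker matters, assuming the codes are appended as a comma-separated list directly after ICD10-CODES=. The function name `fill_prompt` and the separator are hypothetical.

```python
MARKER = "ICD10-CODES="


def fill_prompt(template: str, codes: list[str]) -> str:
    """Append the ICD-10 codes after the trailing marker of a prompt template."""
    stripped = template.rstrip()
    if not stripped.endswith(MARKER):
        # A template without the trailing marker gives the codes no defined position.
        raise ValueError(f"prompt must end with {MARKER!r}")
    return f"{stripped} {', '.join(codes)}"
```

A check like this can be run on a custom prompt before a long generation job to catch a missing marker early.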
- Generate 100 samples using llama3.2 with custom prompt (inserted in the script):
python DP_ICL_gen.py --model_name llama3.2 --num_shots 5 --generated_dataset_size 100 --prompt_index 3
- Generate samples with higher temperature for more diversity:
python DP_ICL_gen.py --model_name mistral --temperature 0.9 --num_shots 3 --generated_dataset_size 50
- Non-private generation (without DP):
python DP_ICL_gen.py --model_name llama2 --nonprivate --num_shots 5 --generated_dataset_size 100
Generated datasets will be saved in the data/generated/ directory in both Feather and CSV formats. The filename will include the parameters used for generation, making it easy to identify different runs.
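When comparing several settings, the command lines above can be generated rather than typed by hand. This is a small sketch that uses only the flags documented in this README; `build_command` is a hypothetical helper, and the temperature values in the loop are arbitrary examples.

```python
import shlex


def build_command(model_name, num_shots=5, generated_dataset_size=100,
                  temperature=None, prompt_index=None, nonprivate=False):
    """Return the argument list for one DP_ICL_gen.py run."""
    cmd = ["python", "DP_ICL_gen.py",
           "--model_name", model_name,
           "--num_shots", str(num_shots),
           "--generated_dataset_size", str(generated_dataset_size)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt_index is not None:
        cmd += ["--prompt_index", str(prompt_index)]
    if nonprivate:
        cmd.append("--nonprivate")
    return cmd


if __name__ == "__main__":
    # Print one command per temperature; pass the list to subprocess.run to execute it.
    for temperature in (0.5, 0.7, 0.9):
        print(shlex.join(build_command("mistral", num_shots=3,
                                       generated_dataset_size=50,
                                       temperature=temperature)))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.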
You can check available models at https://ollama.com/search. Make sure to have the desired model pulled in Ollama before running the generation script.
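Whether a model is already pulled can be checked from Python through the local Ollama HTTP API, which lists installed models at /api/tags. This is a sketch under the assumption that the Ollama server is running on its default address (http://localhost:11434); `model_is_pulled` and `fetch_tags` are hypothetical helper names, not part of this repository.

```python
import json
from urllib.request import urlopen


def model_is_pulled(tags: dict, model_name: str) -> bool:
    """Check a parsed /api/tags response for model_name (no tag means ':latest')."""
    wanted = model_name if ":" in model_name else f"{model_name}:latest"
    return any(entry.get("name") == wanted for entry in tags.get("models", []))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

For example, `model_is_pulled(fetch_tags(), "llama3.2")` would return False if the model still needs to be pulled.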
To pull a model, run:
ollama pull [MODEL_NAME]
- The script uses the Sentence Transformers model 'all-MiniLM-L6-v2' for embedding calculations
- For private generation, different epsilon values can be specified using the --epsilons parameter
- The --cardio flag can be used to filter for cardiology-related codes only