DP-Clinical-ICL

This repository contains code for generating clinical discharge summaries using In-Context Learning (ICL) with differential privacy guarantees. The project uses the MIMIC-IV dataset and various language models through Ollama.

Prerequisites

Python 3.9+
Access to MIMIC-IV dataset (PhysioNet credentialed user required)
Ollama installed on your system

Ollama Setup

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull the default model:

ollama pull llama3.2

Note: You can pull additional models later using ollama pull MODEL_NAME. Check available models at https://ollama.com/search

System Requirements

Disk Space: At least 10GB free space
- ~8GB for MIMIC-IV dataset files
- ~2GB for generated outputs and model files
GPU Memory Requirements:
- As a rule of thumb, you need at least twice as much VRAM (in GB) as the billions of parameters in your chosen model
- Example requirements:
  - Llama 2 7B: ~14GB VRAM
  - Mistral 7B: ~14GB VRAM
  - Mixtral 8x7B: ~94GB VRAM (about 47B total parameters, because the experts share the attention layers)
  - For smaller GPUs, consider using the smaller variants of these models
RAM: Minimum 16GB recommended for processing the MIMIC-IV dataset
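
The VRAM rule of thumb above amounts to one multiplication. The sketch below is only an illustration of that arithmetic; `estimate_vram_gb` is a hypothetical helper, not part of this repository, and the default of 2 bytes per parameter is the assumption behind the "twice the parameter count" rule.

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM estimate in GB: parameter count (in billions) times bytes per parameter."""
    return params_billions * bytes_per_param


# The two 7B examples from the list above.
for name, size_b in [("Llama 2 7B", 7), ("Mistral 7B", 7)]:
    print(f"{name}: ~{estimate_vram_gb(size_b):.0f}GB VRAM")
```

Passing a smaller `bytes_per_param` gives a rough idea of how much a quantized variant might need, although real usage also depends on context length and runtime overhead.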

Installation

Create and activate a new conda environment:

conda create -n dp-clinical python=3.9
conda activate dp-clinical

Clone the repository:

git clone https://github.com/yourusername/DP-Clinical-ICL.git
cd DP-Clinical-ICL

Install the required packages:

pip install -r requirements.txt

Dataset Setup

Request access to MIMIC-IV dataset at:
- MIMIC-IV Clinical Database: https://physionet.org/content/mimiciv/2.2/
- MIMIC-IV Notes: https://www.physionet.org/content/mimic-iv-note/2.2/
Navigate to the data directory:

cd data

Note: The download process requires approximately 10GB of disk space and may take a considerable amount of time depending on your internet connection. It's recommended to use tmux to prevent the download from being interrupted if your connection drops:

# Install tmux if not already installed
sudo apt-get install tmux

# Create a new tmux session
tmux new -s mimic_download

# Now run the download commands inside tmux
# To detach from the session: press Ctrl+B, then D
# To reattach to the session later: tmux attach -t mimic_download

Run the commands to download the necessary files:

wget -r -N -c -np --user [YOUR_USERNAME] --ask-password https://physionet.org/files/mimic-iv-note/2.2/
wget -r -N -c -np --user [YOUR_USERNAME] --ask-password https://physionet.org/files/mimiciv/2.2/

If everything went well, you should have the following structure:

data/
├── generated/
│   └── [Generated datasets will be saved here]
└── physionet.org/
    └── files/
        ├── mimic-iv-note/
        │   └── 2.2/
        │       └── note/
        │           └── discharge.csv.gz
        └── mimiciv/
            └── 2.2/
                └── hosp/
                    ├── diagnoses_icd.csv.gz
                    ├── procedures_icd.csv.gz
                    ├── d_icd_procedures.csv.gz
                    └── d_icd_diagnoses.csv.gz
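
To confirm the download matches the tree above, a short check like the following can help. It is a minimal sketch: `missing_mimic_files` is a hypothetical helper, not part of the repository, and the default `data_dir="data"` assumes the script is run from the repository root (pass `"."` if you are still inside `data/`).

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "physionet.org/files/mimic-iv-note/2.2/note/discharge.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/diagnoses_icd.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/procedures_icd.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/d_icd_procedures.csv.gz",
    "physionet.org/files/mimiciv/2.2/hosp/d_icd_diagnoses.csv.gz",
]


def missing_mimic_files(data_dir="data"):
    """Return the expected files that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_mimic_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected MIMIC-IV files are present.")
```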

Data Extraction

Before running the generation script, you need to process the MIMIC-IV dataset to create the required format. The extract_data_amc.py script handles this by:

Loading and merging the necessary MIMIC-IV files
Formatting ICD codes correctly
Creating the required data structure with discharge summaries and their associated codes

Note: The extraction process typically takes a few minutes to complete, as it needs to process and merge large CSV files. The exact time depends on your system's CPU and memory speed.

Running the Extraction

Make sure all MIMIC-IV files are in place as shown in the directory structure above
Navigate to the root directory again:

cd ..

Run the extraction script:

python extract_data_amc.py

This will create two files in your data/ directory:

mimiciv_icd10.feather: The main dataset file containing:
- Discharge summaries (text)
- ICD-10 codes (target, icd10_diag, icd10_proc)
- Code descriptions (long_title)
mimiciv_icd10_split.feather: Train/validation/test split information

Verifying the Extraction

You can check if the extraction was successful by verifying that both files exist and contain data:

ls -l data/mimiciv_icd10*.feather

The extracted dataset should contain properly formatted records with:

Diagnostic codes including periods (e.g., "I25.10")
Procedure codes without periods (e.g., "02HN3DZ")
Non-empty text fields
At least one ICD-10 code per record
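
The four checks above can be expressed as a small function over one record. This is a sketch under assumptions: `record_problems` is a hypothetical helper, records are treated as plain dictionaries with the column names listed earlier, and three-character diagnostic codes are accepted with or without a trailing period because this README does not say how the extraction script writes them.

```python
def record_problems(record):
    """Return a list of problems found in one record (empty list if it looks fine)."""
    problems = []

    # Non-empty text field.
    text = record.get("text") or ""
    if not text.strip():
        problems.append("empty text")

    diag = list(record.get("icd10_diag", []))
    proc = list(record.get("icd10_proc", []))

    # At least one ICD-10 code per record.
    if not diag and not proc:
        problems.append("no ICD-10 codes")

    # Diagnostic codes longer than three characters need a period at position 3.
    for code in diag:
        if len(code) > 3 and code[3] != ".":
            problems.append(f"diagnostic code without period after 3 characters: {code}")

    # Procedure codes must not contain periods.
    for code in proc:
        if "." in code:
            problems.append(f"procedure code contains a period: {code}")

    return problems
```

To run it over the extracted file, one option is to load the data with `pandas.read_feather("data/mimiciv_icd10.feather")` and call the function on each row converted to a dictionary.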

Only after successful data extraction should you proceed to running DP_ICL_gen.py.

Using Custom Dataset

If you want to use your own dataset instead of MIMIC-IV, you'll need to format it according to the following specifications:

Required File Format

Your dataset should be saved as a Feather file (.feather) with the following columns:

_id: Unique identifier for each record
text: The clinical discharge summary text
target: List of ICD-10 codes associated with the text (in order: first every diagnostic code, then every procedure code)
icd10_diag: List of ICD-10 diagnostic codes
icd10_proc: List of ICD-10 procedure codes
long_title: List of long descriptions for the ICD codes
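
A record with these columns can be assembled as sketched below. `build_record` and the `titles` mapping are hypothetical names used for illustration only. The sketch also assumes that `long_title` follows the same order as `target`, which this section does not state explicitly.

```python
def build_record(record_id, text, diag_codes, proc_codes, titles):
    """Assemble one record; `titles` maps each ICD-10 code to its long description."""
    # target lists every diagnostic code first, then every procedure code.
    target = list(diag_codes) + list(proc_codes)
    return {
        "_id": record_id,
        "text": text,
        "target": target,
        "icd10_diag": list(diag_codes),
        "icd10_proc": list(proc_codes),
        # Assumption: one long title per code, in the same order as target.
        "long_title": [titles[code] for code in target],
    }
```

A list of such records can then be written with `pandas.DataFrame(records).to_feather(path)`, which requires `pyarrow` to be installed.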

Data Requirements

The ICD-10 codes should be properly formatted:
- Diagnostic codes should include a period after the first 3 characters (e.g., "A01.1")
- Procedure codes should not include periods
Each record should have:
- Non-empty text field
- At least one ICD-10 code in either diagnostic or procedure codes
- Corresponding long titles for each code
To use your custom dataset:
- Pass the path to your Feather file with the --custom_dataset_path parameter:
```
python DP_ICL_gen.py --custom_dataset_path /path/to/your/dataset.feather
```
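
If your source data stores codes in a different form, they can be normalized to the formatting rules above before the file is written. The helpers below are a hedged sketch, not the repository's own code: `format_diag_code` and `format_proc_code` are hypothetical names, and leaving three-character diagnostic codes without a trailing period is an assumption.

```python
def format_diag_code(code):
    """Place a period after the first three characters of an ICD-10 diagnostic code."""
    bare = code.replace(".", "").strip().upper()
    # Assumption: codes of three characters or fewer are left without a period.
    return bare if len(bare) <= 3 else f"{bare[:3]}.{bare[3:]}"


def format_proc_code(code):
    """Remove any periods from an ICD-10 procedure code."""
    return code.replace(".", "").strip().upper()
```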

Example Data Format

{
    '_id': 1234,
    'text': 'Patient admitted with chest pain...',
    'target': ['I25.10', 'Z95.5', '02HN3DZ'],
    'icd10_diag': ['I25.10', 'Z95.5'],
    'icd10_proc': ['02HN3DZ'],
    'long_title': ['Atherosclerotic heart disease', 'Presence of coronary stent', 'Insertion of stent into coronary artery']
}

This will create processed dataset files in the data/ directory.

Data Generation

The main script for generating clinical summaries is DP_ICL_gen.py. Here are some key parameters and how to use them:

Basic Usage

python DP_ICL_gen.py --model_name llama3.2 --num_shots 5 --generated_dataset_size 100

Important Parameters

--model_name: Specify which Ollama model to use (e.g., llama2, mistral, mixtral)
--num_shots: Number of few-shot examples (default: 5)
--generated_dataset_size: Number of samples to generate (default: 100)
--prompt_index: Index of the prompt template to use (0-2 for the built-in prompts, 3 for a custom prompt)
--temperature: Temperature for generation (default: 0.7)

Custom Prompt

To add your own prompt, you can modify the prompts list in DP_ICL_gen.py. The prompts are defined starting at line 58 of the file:

prompts = [
    "Generate a clinical discharge summary...",  # Prompt 0
    """Please generate a realistic and concise clinical...""",  # Prompt 1
    """Please generate a realistic, concise, and professional...""",  # Prompt 2
    """[ADD HERE YOUR OWN PROMPT]
ICD10-CODES= """  # Prompt 3 (Custom)
]

NOTE:

The prompt should end with ICD10-CODES= so that the script can insert the ICD10 codes in the right place

To use your custom prompt:

Open DP_ICL_gen.py
Find the prompts list
Replace the text in index 3 at line 111 ([ADD HERE YOUR OWN PROMPT]) with your custom prompt
Run the script with --prompt_index 3
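
The reason for the trailing ICD10-CODES= marker can be illustrated with a small sketch. How DP_ICL_gen.py actually inserts the codes is not shown in this README, so `build_prompt` below is a hypothetical helper, and the comma-separated code format is an assumption.

```python
MARKER = "ICD10-CODES="


def build_prompt(template, codes):
    """Append ICD-10 codes after the trailing marker of a prompt template."""
    stripped = template.rstrip()
    # A template that does not end with the marker leaves no defined place for the codes.
    if not stripped.endswith(MARKER):
        raise ValueError(f"prompt must end with {MARKER!r}")
    return stripped + " " + ", ".join(codes)


custom_prompt = """Write a short discharge summary for the codes below.
ICD10-CODES= """
print(build_prompt(custom_prompt, ["I25.10", "02HN3DZ"]))
```

A check like this can be run on a custom prompt before pasting it into the prompts list.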

Example Commands

Generate 100 samples using llama3.2 with custom prompt (inserted in the script):

python DP_ICL_gen.py --model_name llama3.2 --num_shots 5 --generated_dataset_size 100 --prompt_index 3

Generate samples with higher temperature for more diversity:

python DP_ICL_gen.py --model_name mistral --temperature 0.9 --num_shots 3 --generated_dataset_size 50

Non-private generation (without DP):

python DP_ICL_gen.py --model_name llama2 --nonprivate --num_shots 5 --generated_dataset_size 100
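
When several runs are needed, the commands above can be generated from Python instead of being typed by hand. This is a minimal sketch: `build_command` is a hypothetical helper, it uses only the flags documented in this README, and it assumes `python` on your PATH is the interpreter of the `dp-clinical` environment.

```python
import shlex


def build_command(model_name, num_shots=5, generated_dataset_size=100,
                  temperature=None, prompt_index=None, nonprivate=False):
    """Return the argument list for one DP_ICL_gen.py run."""
    cmd = [
        "python", "DP_ICL_gen.py",
        "--model_name", model_name,
        "--num_shots", str(num_shots),
        "--generated_dataset_size", str(generated_dataset_size),
    ]
    # Optional flags are added only when requested, so script defaults apply otherwise.
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt_index is not None:
        cmd += ["--prompt_index", str(prompt_index)]
    if nonprivate:
        cmd.append("--nonprivate")
    return cmd


# The three example runs from this section.
runs = [
    build_command("llama3.2", prompt_index=3),
    build_command("mistral", num_shots=3, generated_dataset_size=50, temperature=0.9),
    build_command("llama2", nonprivate=True),
]
for cmd in runs:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the run.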

Output

Generated datasets will be saved in the data/generated/ directory in both Feather and CSV formats. The filename will include the parameters used for generation, making it easy to identify different runs.

Available Models

You can check available models at https://ollama.com/search. Make sure to have the desired model pulled in Ollama before running the generation script.

To pull a model run:

ollama pull [MODEL_NAME]

Notes

The script uses the Sentence Transformers model 'all-MiniLM-L6-v2' for embedding calculations
For private generation, different epsilon values can be specified using the --epsilons parameter
The --cardio flag can be used to filter for cardiology-related codes only
