DP-Clinical-ICL

A user-friendly web application for generating clinical discharge summaries. This tool helps medical researchers create realistic patient discharge summaries while maintaining privacy. It uses artificial intelligence (specifically, In-Context Learning) and can apply privacy protection to the generated data.

Important: This application is designed to work with MIMIC-IV version 2.2. Other versions may not be compatible.

Note for Technical Users: This README provides user-friendly instructions focused on using the Streamlit web interface. If you want to understand the technical details, command-line usage, custom dataset format, or implementation details, please check README_OLD.md.

What You Need Before Starting

Computer Requirements:
- Graphics Card (GPU):
  - An NVIDIA GPU with at least 14GB of VRAM is practically required
  - While the application can run on CPU, generation would take many hours or even days
- At least 16GB of RAM (memory)
- At least 10GB of free disk space
- Any operating system (Windows, Mac, or Linux)
Software Requirements:
- Python 3.9 or newer (if you don't have it, visit Python's download page)
- Conda (download from Anaconda's website)
- Ollama (will be installed automatically by our script)
Access Requirements:
- MIMIC-IV version 2.2 dataset access credentials
  - Visit PhysioNet MIMIC-IV 2.2
  - Create an account and complete the required training (see more in the Step 2: Acquire MIMIC access section)
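
Several of the requirements above can be checked from a script before you run the installer. The following is a minimal sketch and not part of the repository: the function names are illustrative, the thresholds are copied from the list above, and the GPU query assumes the `nvidia-smi` tool that ships with NVIDIA drivers is on your PATH. RAM is not checked, because the Python standard library has no portable way to read it.

```python
import shutil
import subprocess
import sys

# Thresholds taken from the requirements list above
MIN_PYTHON = (3, 9)
MIN_DISK_GB = 10
MIN_VRAM_GB = 14


def parse_vram_mib(nvidia_smi_output):
    """Parse one memory value (in MiB) per line of nvidia-smi CSV output."""
    return [int(line) for line in nvidia_smi_output.splitlines()
            if line.strip().isdigit()]


def query_vram_mib():
    """Return the total memory of each NVIDIA GPU in MiB, or [] if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []


def check_requirements(path="."):
    """Return a dict mapping each checked requirement to True or False."""
    free_gb = shutil.disk_usage(path).free / 1e9
    vram_gb = [mib / 1024 for mib in query_vram_mib()]
    return {
        "python": sys.version_info >= MIN_PYTHON,
        "disk": free_gb >= MIN_DISK_GB,
        "conda": shutil.which("conda") is not None,
        "gpu": any(gb >= MIN_VRAM_GB for gb in vram_gb),
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'OK' if ok else 'missing or below the minimum'}")
```

A failed "gpu" check does not stop the application from running, but generation on CPU will be very slow.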

Installation Guide

Step 1: Opening a Terminal

On Windows:

- Press the Windows key + R
- Type "cmd" and press Enter
  - Or search for "Command Prompt" in the Start menu

On Mac:

- Press Command + Space
- Type "Terminal" and press Enter
  - Or find Terminal in Applications > Utilities

On Linux:

- Press Ctrl + Alt + T
  - Or search for "Terminal" in your applications menu

Step 2: Installing the Application

First, download the code:
```
git clone https://github.com/DataTools4Heart/DP-Clinical-ICL.git
```
- If this doesn't work, you may need to install Git first
Move into the downloaded folder:
```
cd DP-Clinical-ICL
```
Make the setup script executable and run it:
```
chmod +x run_app.sh
./run_app.sh
```

The setup script will automatically:

- Set up a new environment for the application
- Install all required software
- Install Ollama (the AI model manager)
- Download the default AI model (llama3.2)
- Start the application in your web browser

If everything works correctly, your default web browser should open automatically with the application running.

Using the Application

The application works like a step-by-step wizard, with the main steps shown in the left sidebar.

Note About Step Progression:

- You can skip any step that was already completed in a previous session
- The app will detect existing files and configurations
- For example:
  - If you've already downloaded the dataset, you can skip directly to extraction
  - If you've already extracted the data, you can go straight to generation
  - System checks can be skipped if you've verified your setup before
- Just click the desired step in the left sidebar to navigate

Step 1: System Check

This first page checks if your computer meets all requirements:

- Shows how much memory (RAM) you have
- Checks your available disk space
- Verifies GPU compatibility and memory
  - This is crucial as generation speed depends heavily on GPU availability
  - A compatible GPU reduces generation time from hours to minutes
  - The app will warn you if no GPU is found or if GPU memory is insufficient
  - While you can proceed without a GPU, it's not recommended for practical use

If any requirements aren't met, you'll see clear error messages explaining what's missing. Pay special attention to GPU-related messages, as they will significantly impact the usability of the application.

You can skip this step in future sessions if your system configuration hasn't changed.

Step 2: Acquire MIMIC access

MIMIC-IV is a credentialed dataset, so you need to acquire access before you can download it. If you already have access, you can skip this step. The required training is a general research course on data or specimens; it is not technical or specific to this application, and trained clinicians should be able to complete the test without difficulty. As an alternative, you can use your own dataset, formatted as described in the Using Your Own Dataset section.

- Go to PhysioNet
- Create an account or log in to your existing account
- Go to PhysioNet MIMIC-IV 2.2
- Scroll down to the bottom of the page, where the required training is listed
- Click on "CITI Data or Specimens Only Research"
- Complete the training following the instructions
- Once the training is completed, you will be able to download the dataset

Step 3: Dataset Download

Here you'll download the medical records database:

- Enter your PhysioNet username and password
- Click the "Download Dataset" button
- Wait for the download to complete (about 8GB of data)
  - This can take anywhere from a few minutes to a few hours, depending on your internet speed
  - The app will show download progress
  - It's safe to leave this running in the background

If you've already downloaded the dataset in a previous session, you can skip this step.

Step 4: Data Extraction

This step prepares the downloaded data:

- Click the "Extract Data" button
- Wait while the app processes the files
  - This typically takes 15-30 minutes
  - You'll see progress messages as it works
  - It's normal if it seems slow at first
  - Don't close the browser window during this step

If you've already extracted the data and the files exist, you can skip directly to Data Generation.

Step 5: Data Generation

This is where you create new discharge summaries. You have several options to control how they're generated:

Generation Time Estimates: The application will show you an estimated completion time based on your settings. For example, with 100 samples and the 5-shot setting:

- Expect ~5 hours on an NVIDIA RTX 3090
- Each sample takes about 1-2 minutes to generate
- The number of shots affects generation time
- More samples = longer total time
- These estimates assume GPU availability
- Times will be longer on less powerful GPUs
- CPU-only generation is not recommended (could take days)

Important Note About Generation Times:

- With a compatible GPU (14GB+ VRAM): Expect about 1-2 minutes per summary
- Without a GPU: Generation could take 30+ minutes per summary
- For bulk generation (e.g., 100 summaries), the difference is hours vs days
- We strongly recommend using a computer with a compatible GPU
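
The per-summary figures above turn into a total run time by simple multiplication. Here is a minimal sketch of that arithmetic; the function name is illustrative, and it assumes summaries are generated one after another at a constant rate, which the application may not guarantee.

```python
def estimate_hours(num_summaries, minutes_per_summary):
    """Total hours, assuming summaries are generated sequentially at a fixed rate."""
    return num_summaries * minutes_per_summary / 60


if __name__ == "__main__":
    # Per-summary rates quoted in this README: 1-2 minutes on GPU, 30+ on CPU
    for label, minutes in [("GPU, fast", 1), ("GPU, slow", 2), ("CPU only", 30)]:
        hours = estimate_hours(100, minutes)
        print(f"{label}: about {hours:.1f} hours for 100 summaries")
```

At 1-2 minutes per summary, 100 summaries come to roughly 1.7 to 3.3 hours, and at 30 minutes per summary on CPU to about 50 hours. The ~5 hour RTX 3090 figure quoted above is higher than this product, so treat the per-summary rate as optimistic and rely on the estimate the application displays.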

This is typically the only step you'll need to repeat in subsequent sessions, as it creates new summaries each time.

Basic Options:

AI Model (default is "llama3.2"):
- Think of this like choosing which expert writes your summaries
- Stick with the default unless you have a specific reason to change it
- Other options include "llama2", "mistral", or "mixtral"
Number of Examples (default is 5):
- How many real examples the AI should look at
- More examples = better quality but slower generation
- Recommended: start with 5 and adjust if needed
Number of Summaries (default is 100):
- How many new summaries you want to create
- Start small (like 10) for testing
- Larger numbers take longer to generate

Advanced Options:

Temperature (default is 0.7):
- Controls how creative the AI can be
- Lower (0.1-0.3): Very consistent, repetitive outputs
- Medium (0.5-0.7): Good balance for medical text
- Higher (0.7-1.0): More varied but potentially less accurate
Privacy Protection:
- "Non-private": No privacy protection
- "Default epsilons": Standard privacy protection
- "Custom epsilons": Advanced privacy settings (consult with privacy experts)
Custom Instructions:
- You can write your own instructions for the AI
- Use the large text box to enter specific requirements
- Must end with "ICD10-CODES= "
- The default instructions work well for most cases
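
Because custom instructions must end with "ICD10-CODES= " (including the trailing space), it can help to check the text before pasting it into the box. The helper below is a hypothetical sketch, not something the application provides. Joining a missing suffix on a new line is an assumption about what reads well, not a documented requirement.

```python
REQUIRED_SUFFIX = "ICD10-CODES= "  # note the trailing space


def fix_instruction_suffix(instructions):
    """Return the instructions ending with the required suffix, adding it if absent."""
    if instructions.endswith(REQUIRED_SUFFIX):
        return instructions
    # Tolerate a missing trailing space or stray whitespace at the end
    stripped = instructions.rstrip()
    if stripped.endswith(REQUIRED_SUFFIX.rstrip()):
        return stripped + " "
    # Assumption: put the suffix on its own line when it is missing entirely
    return stripped + "\n" + REQUIRED_SUFFIX


if __name__ == "__main__":
    print(repr(fix_instruction_suffix("Write a discharge summary.")))
```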

Working with Generated Files

- All generated files are saved in a folder called "data/generated"
- Each file name includes information about how it was generated
- You can download files directly from the web interface
- Files remain available until you clear them using the "Clear Generated Files List" button
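
If you prefer to inspect the output folder outside the web interface, a few lines of Python will list what is there. This is a sketch under the assumption that "data/generated" is relative to the folder you launched the application from. The function name is illustrative.

```python
from pathlib import Path


def list_generated_files(folder="data/generated"):
    """Return (name, size in bytes) for each file in the folder, newest first."""
    root = Path(folder)
    if not root.is_dir():
        return []
    files = [p for p in root.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, p.stat().st_size) for p in files]


if __name__ == "__main__":
    for name, size in list_generated_files():
        print(f"{size:>12} bytes  {name}")
```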

Using Your Own Dataset

Instead of using MIMIC-IV, you can use your own clinical dataset. Here's what you need to know:

Dataset Requirements

Your dataset must be in a specific format (Feather file, .feather) with these columns:

- _id: A unique number for each record
- text: The actual discharge summary
- target: All the ICD-10 codes for this record
- icd10_diag: Just the diagnostic codes
- icd10_proc: Just the procedure codes
- long_title: Descriptions of what each code means

Important Format Rules

The ICD-10 codes must be written correctly:
- Diagnostic codes need a period after the first 3 characters (example: "A01.1")
- Procedure codes should not have periods (example: "02HN3DZ")
Each record in your dataset must have:
- Text content (not empty)
- At least one ICD-10 code
- Descriptions for all codes
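
The code-format rules above can be checked mechanically before you build the dataset. The sketch below is one reading of those rules combined with the general shape of ICD-10 diagnostic codes (three characters, then an optional period and up to four more). The patterns and function name are illustrative, and the application's own checks may differ.

```python
import re

# Diagnostic code: 3 characters, then optionally a period and 1-4 more characters
DIAG_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")
# Procedure code: letters and digits only, no period
PROC_PATTERN = re.compile(r"[0-9A-Z]+")


def find_code_problems(diag_codes, proc_codes):
    """Return a list of human-readable problems; an empty list means the codes pass."""
    problems = []
    if not diag_codes and not proc_codes:
        problems.append("record has no ICD-10 codes")
    problems += [f"bad diagnostic code: {c}" for c in diag_codes
                 if not DIAG_PATTERN.fullmatch(c)]
    problems += [f"bad procedure code: {c}" for c in proc_codes
                 if not PROC_PATTERN.fullmatch(c)]
    return problems


if __name__ == "__main__":
    print(find_code_problems(["A01.1", "I2510"], ["02HN3DZ"]))
```

Here "I2510" is reported because it lacks the period after the first three characters.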

Example Record

Here's what a single record in your dataset should look like:

Record ID: 1234
Text: "Patient admitted with chest pain..."
Diagnostic Codes: ["I25.10", "Z95.5"]
Procedure Codes: ["02HN3DZ"]
Code Descriptions: ["Atherosclerotic heart disease", "Presence of coronary stent", "Insertion of stent into coronary artery"]
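
Expressed with the column names from Dataset Requirements, the record above could be assembled as follows. This is a sketch under stated assumptions: the helper names are hypothetical, codes are stored as lists of strings, and `target` lists diagnostic codes before procedure codes. Check README_OLD.md for the exact encoding the application expects. Writing the file uses pandas' `DataFrame.to_feather`, which needs the `pyarrow` package.

```python
def build_record(record_id, text, diag_codes, proc_codes, descriptions):
    """Assemble one row using the column names from Dataset Requirements."""
    return {
        "_id": record_id,
        "text": text,
        "target": diag_codes + proc_codes,  # assumption: diagnostic codes first
        "icd10_diag": diag_codes,
        "icd10_proc": proc_codes,
        "long_title": descriptions,
    }


def write_feather(records, path):
    """Write the rows to a Feather file; requires pandas and pyarrow."""
    import pandas as pd  # imported here so build_record works without pandas

    pd.DataFrame(records).to_feather(path)


if __name__ == "__main__":
    record = build_record(
        1234,
        "Patient admitted with chest pain...",
        ["I25.10", "Z95.5"],
        ["02HN3DZ"],
        ["Atherosclerotic heart disease", "Presence of coronary stent",
         "Insertion of stent into coronary artery"],
    )
    print(record["target"])
```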

Common Problems and Solutions

"The application seems frozen":
- This is normal during data extraction and generation
- Look for progress messages at the bottom of the page
- Don't close the browser window
"Out of memory" errors:
- Try generating fewer summaries at once
- Close other large applications
- Use a smaller AI model
"Files not found" errors:
- Make sure the dataset download completed successfully
- Try the download step again
"Model not found" errors:
- Wait a few minutes after starting the application
- The model might still be downloading
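
To see whether the model has finished downloading, you can ask Ollama which models are installed locally with `ollama list`. The sketch below wraps that command. The function names are illustrative, and the parsing assumes the command prints a header row followed by one model per line with the name in the first column, which may change between Ollama versions.

```python
import shutil
import subprocess


def parse_model_names(ollama_list_output):
    """Extract model names from the table printed by `ollama list`."""
    lines = ollama_list_output.strip().splitlines()
    # Skip the header row; the first whitespace-separated field is the name
    return [line.split()[0] for line in lines[1:] if line.strip()]


def model_is_available(model="llama3.2"):
    """Return True if a local model matches `model`, ignoring the tag after ':'."""
    if shutil.which("ollama") is None:
        return False
    result = subprocess.run(["ollama", "list"],
                            capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return False
    names = parse_model_names(result.stdout)
    return any(name.split(":")[0] == model for name in names)


if __name__ == "__main__":
    print("llama3.2 available:", model_is_available("llama3.2"))
```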

Getting Help

If you encounter problems:

- Check the error messages in the application
- Look through the Common Problems and Solutions section above
- Contact your institution's IT support
- Contact the maintainer directly:
  - Michele Miranda
  - Email: [email protected] or [email protected]
- Create an issue on our GitHub page

Citation

If you use this tool in your research, please cite: [Add citation information]
