DataTools4Heart Feature Extraction Suite

This repository contains feature extraction definitions that process patient data represented in the DT4H CDM and transform it into a tabular format that can be used to train ML models. The feature extraction process is realized via four main concepts: populations, feature groups, feature sets and pipelines.

Broadly, the feature extraction suite extracts patients' data from the FHIR patient data repository based on a population definition.

Next, feature groups extract a group of raw features for specific healthcare resources such as conditions, medications, lab measurements, etc. For each feature group, a timeseries table is created such that:

  • Each record matching the FHIR query of the feature group is mapped to a row in the table
  • Each feature defined in the feature group is converted to a column in the table

In the next step, feature sets work on the timeseries data generated by the feature groups to extract the final tabular dataset. Feature sets allow the following dataset manipulations:

  • Identification of reference time points that would lead to data points in the final dataset
  • Grouping data based on the reference time points in configurable time periods
  • Applying aggregations on the grouped data

Pipelines are used to associate feature sets with populations: a dataset, as configured by the feature set, will be generated for the population specified in the pipeline.
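
As a rough orientation, the definitions themselves are JSON documents kept in this repository. The sketch below is purely illustrative: the directory layout and file names are assumptions, not the repository's actual schema.

# Illustrative layout only; actual file names and structure may differ.
ls feature-extraction-suite
#   populations/study1_cohort.json    -> which patients to extract
#   feature-groups/lab-results.json   -> raw features per FHIR resource, one timeseries table each
#   feature-sets/study1-fs.json       -> reference time points, time windows, aggregations
#   pipelines/study1-pipeline.json    -> binds study1-fs to study1_cohort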

Current Definitions

Looking at the current definitions, the feature groups defined so far are mainly driven by the DT4H CDM profiles: vital signs, encounters, electrocardiographs, medications, etc.

The study-features.json file contains the input (independent) and output (dependent) variables required for sub-study 1, "Medication prescription in patients with acute heart failure and chronic kidney disease or hyperkalaemia".


Deployment Guideline (with Nginx)

Prerequisites

Clone the Repository

After mapping the data source to the common data model, the feature extraction process can be started. DT4H feature extraction configurations are maintained in the project’s GitHub repository.

Navigate into a working directory to run the tools: <workspaceDir>

git clone https://github.com/DataTools4Heart/feature-extraction-suite

Run Docker Containers

Run the following scripts in the <workspaceDir>:

sh ./feature-extraction-suite/docker/pull.sh
sh ./feature-extraction-suite/docker/run.sh
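
A quick way to verify the containers came up is to list running containers. The grep filter below is an assumption about the naming convention; adjust it to whatever names docker ps actually reports on your deployment:

docker ps --format '{{.Names}}: {{.Status}}' | grep -i feast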

Running Behind Nginx

  • For feature-extraction-suite deployment, data-ingestion-suite must first be deployed successfully and mapping must be run. If you used the Nginx Docker container during the data-ingestion-suite deployment, update the Nginx config for feature-extraction-suite by following these steps:

Navigate into the working directory:

cd <workspaceDir>

Edit the ./data-ingestion-suite/docker/proxy/nginx.conf file and uncomment the following lines:

# location /dt4h/feast {
#     proxy_pass http://dt4h-feast:8085/onfhir-feast;
#     proxy_set_header Host $host;
#     proxy_set_header X-Real-IP $remote_addr;
# }

Restart the Nginx container:

./data-ingestion-suite/docker/proxy/restart.sh
  • Or, if your host machine is already running Nginx, insert the following proxy configuration and restart Nginx:
location /dt4h/feast {
    proxy_pass http://<hostname>:<port>/onfhir-feast;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
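
Whichever Nginx option you use, a simple smoke test is to request the dataset metadata endpoint through the proxy (the same /dt4h/feast/api/Dataset endpoint used in the Dataset Statistics section below). A 502 response typically means Nginx cannot reach the onfhir-feast service:

curl -s -o /dev/null -w '%{http_code}\n' 'http://<hostname>/dt4h/feast/api/Dataset'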

Starting Feature Extraction

  • To start the feature extraction process for a specific study, use the following cURL commands. Replace <hostname> with your server hostname.

Study 1

curl -X POST 'http://<hostname>/dt4h/feast/api/DataSource/myFhirServer/FeatureSet/study1-fs/Population/study1_cohort/$extract?entityMatching=pid|pid,encounterId|encounterId&reset=true'

Study 2

curl -X POST 'http://<hostname>/dt4h/feast/api/DataSource/myFhirServer/FeatureSet/study2-fs/Population/study2_cohort/$extract?entityMatching=pid|pid,encounterId|encounterId&reset=true'

Study 3

curl -X POST 'http://<hostname>/dt4h/feast/api/DataSource/myFhirServer/FeatureSet/study3-fs/Population/study3_cohort/$extract?entityMatching=pid|pid&reset=true'
  • The extraction process may take a long time to complete, depending on the size of the data.

  • After completion, the dataset will be available under the output-data directory, for example:

<workspaceDir>/feature-extraction-suite/output-data/myFhirServer/dataset/study1-fs/<datasetId>/part-00000-550c22da-d8e3-4113-8b3a-8d935e77ee06-c000.snappy.parquet
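
The output is standard Parquet, so it can be inspected with any Parquet-capable tool. For example, assuming Python with pandas and pyarrow is available on the host (pandas can read a directory of part files as one dataset):

python3 -c "import pandas as pd; df = pd.read_parquet('feature-extraction-suite/output-data/myFhirServer/dataset/study1-fs/<datasetId>'); print(df.shape); print(df.head())"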

Dataset Statistics

  • For statistics (metadata) about all datasets:
https://<hostname>/dt4h/feast/api/Dataset
  • For statistics (metadata) about a specific dataset:
https://<hostname>/dt4h/feast/api/Dataset/<datasetId>
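
For example, to list the dataset metadata from the command line (jq is optional here, and assumes the endpoint returns JSON):

curl -s 'https://<hostname>/dt4h/feast/api/Dataset' | jq .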

Clean Installation from Scratch

Use this section to completely remove all feature-extraction-suite containers, volumes, and data, then perform a fresh installation.

1. Stop containers and remove all data

Run the clean-and-stop script to stop all containers and remove associated volumes:

Warning: This will permanently delete all persisted data, including extracted datasets, metadata, and feature extraction history.

sh ./feature-extraction-suite/docker/clean-and-stop.sh
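
Before reinstalling, you can check that nothing was left behind. The grep filter is an assumption about the naming convention; adjust it to the container and volume names used on your deployment:

docker ps -a | grep -i feast
docker volume ls | grep -i feast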

2. (Optional) Clean data-ingestion-suite

If you also want to perform a clean installation of the data-ingestion-suite, follow the instructions in the data-ingestion-suite README - Clean Installation from Scratch section before proceeding.

3. Pull the latest updates

# Pull the latest feature extraction suite code
cd feature-extraction-suite
git pull
cd ..

# Pull the latest images
sh ./feature-extraction-suite/docker/pull.sh

4. Start the containers

After completing the above steps (and ensuring data-ingestion-suite is running if you cleaned it), start the feature extraction suite:

sh ./feature-extraction-suite/docker/run.sh
sh ./data-ingestion-suite/docker/proxy/restart.sh # Optional
