Data harmonization pipeline and data analysis scripts validation #4

hcadavid · 2024-09-24T08:56:04Z

hcadavid
Sep 24, 2024
Maintainer

As previously discussed, I updated the pairing rules so that participants with conditions (HF, MI, stroke, etc.) whose onset dates can't be calculated are also included as FHIR resources (these conditions are recorded with an 'undefined' onset date).

I re-ran the pipeline (the new database is available at /groups/umcg-lifelines/tmp01/projects/ov22_0581/pheno_lifelines_sqlite/db-lifelines-sep23.db). I also re-ran the queries that generate the CSV files separating incident from prevalent stroke, MI, and HF. I did this in a new folder, keeping the original one for comparison purposes:

/groups/umcg-lifelines/tmp01/projects/ov22_0581/pheno_lifelines_sqlite/queries_output_sep23

This time I also extracted the participants whose conditions are reported as active but can't be classified as prevalent or incident when working with the harmonized FHIR data, because the corresponding onset time can't be inferred (for example, because the date of the assessment with the '1' in stroke_followup_adu_q_1 is not available). These are in the 'xxxx_undefined_osdate.csv' files. Only 35 participants were skipped in total this time, although there are now more conditions with an 'undefined' onset date.

I now see that the differences between the results of the queries on the FHIR data and those of the scripts that use the original Lifelines raw data are due to limitations of the intermediate FHIR representation when values are missing. For instance, when processing the raw data directly, you can tell that a condition is incident just by checking that there is a '1' in xxxxx_followup_adu_q_1, even if other details are missing. On the harmonized FHIR data, on the other hand, the only way to determine whether a condition is incident or prevalent is to compare the condition's 'onset' date with a reference date (in this case, the baseline assessment). Hence, when the onset date of a condition can't be inferred during the harmonization process, its prevalent/incident status can no longer be determined.
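To make the comparison rule concrete, here is a minimal sketch in Python. The function name `classify_condition`, the use of `None` for an 'undefined' onset date, and the choice to treat an onset on the baseline date itself as prevalent are all assumptions made for illustration; this is not the pipeline's actual code.

```python
from datetime import date
from typing import Optional


def classify_condition(onset: Optional[date], baseline: date) -> str:
    """Classify a condition relative to the baseline assessment date.

    Hypothetical helper: `None` stands in for an 'undefined' onset date.
    """
    if onset is None:
        # Without an onset date there is nothing to compare with the baseline.
        return "undefined"
    # Assumption: an onset on the baseline date itself counts as prevalent.
    return "prevalent" if onset <= baseline else "incident"


baseline = date(2010, 1, 1)  # made-up reference date, for illustration only
print(classify_condition(date(2008, 5, 1), baseline))
print(classify_condition(date(2013, 5, 1), baseline))
print(classify_condition(None, baseline))
```

The third call shows the case discussed above: once the onset date is missing, the function can only report "undefined", even if the raw questionnaire flag would have been enough to call the condition incident.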

Please let me know your thoughts. This may eventually require further discussion with @baukearends, depending on which elements we want to include in the analysis/prediction model training. For example, would there be alternative ways to estimate these onset dates when the assessment date is unavailable?

KasiaSmietanka · 2024-09-24T09:29:04Z

KasiaSmietanka
Sep 24, 2024

Regarding the date for stroke_followup_adu_q_1: can't we use the date when the questionnaire was filled in? As far as I remember, this should be available for most of the individuals.
For prediction purposes, we need to know the approximate date when the event happened, that is, whether it is prevalent (event before baseline) or incident (event after baseline). The most commonly used CVD prediction models are designed for the general population without prevalent CVD.

5 replies

hcadavid Sep 24, 2024
Maintainer Author

Yes, this is the one being used for estimating the onset date, as proposed in this document ('date' column in 1a_v_1_results.csv, 1b_q_1_results.csv, 1c_q_1_results.csv, etc.). As you said, it is indeed available for most individuals, so the records with a missing 'date' correspond to the small subset for which the onset date cannot be estimated. At the next meeting, we could discuss the best way to handle this subset (during pre-processing, for example).

In any case, could you look at the updated results and see if the mismatch with yours improved?

KasiaSmietanka Sep 24, 2024

Would it be possible for you to share the no. of identified incident/prevalent cases of each of the endpoints, as you did before? Thanks.

hcadavid Sep 24, 2024
Maintainer Author

Sure! I just added a file with the result counts of the queries (that is, the number of rows in each CSV file):

/groups/umcg-lifelines/tmp01/projects/ov22_0581/pheno_lifelines_sqlite/queries_output_sep23/csv_counts.txt
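For anyone who wants to reproduce such counts, the following is a rough sketch of how the rows of each CSV file in a folder could be counted. The helper names `count_rows` and `count_all` are hypothetical, and the sketch assumes each file starts with a single header line that should not be counted; it is not necessarily how csv_counts.txt was produced.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path, has_header: bool = True) -> int:
    """Return the number of non-empty data rows in a CSV file."""
    with csv_path.open(newline="") as handle:
        n_rows = sum(1 for row in csv.reader(handle) if row)
    # Assumption: the first line is a header and is not a data row.
    return max(n_rows - 1, 0) if has_header else n_rows


def count_all(folder: Path) -> dict[str, int]:
    """Map each *.csv file name in `folder` to its data-row count."""
    return {path.name: count_rows(path) for path in sorted(folder.glob("*.csv"))}
```

Calling `count_all(Path("queries_output_sep23"))` from the parent folder would then return one entry per CSV file.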

KasiaSmietanka Sep 25, 2024

Thank you. Our numbers are quite similar, with the differences likely resulting from the exclusion of individuals due to data inconsistency. In my approach, I do not remove individuals with inconsistent data. I believe this dataset is now ready for comparison with CBS.

| File | Hector's counts | Kasia's counts |
| --- | --- | --- |
| hf_prevalent.csv | 839 | 844 |
| hf_incident.csv | 2193 | 2238 |
| mi_prevalent.csv | 1366 | 1377 |
| mi_incident.csv | 1014 | 1021 |
| stroke_prevalent.csv | 980 | 986 |
| stroke_incident.csv | 745 | 777 |
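The size of each gap can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the counts listed above, and the function name `differences` is made up for the example.

```python
# Counts copied from the comparison above: (Hector's count, Kasia's count).
counts = {
    "hf_prevalent.csv": (839, 844),
    "hf_incident.csv": (2193, 2238),
    "mi_prevalent.csv": (1366, 1377),
    "mi_incident.csv": (1014, 1021),
    "stroke_prevalent.csv": (980, 986),
    "stroke_incident.csv": (745, 777),
}


def differences(counts: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Return Kasia's count minus Hector's count for each file."""
    return {name: kasia - hector for name, (hector, kasia) in counts.items()}


for name, diff in differences(counts).items():
    hector = counts[name][0]
    # Express the gap relative to Hector's count for that file.
    print(f"{name}: {diff:+d} ({100 * diff / hector:.1f}% of Hector's count)")
```

Kasia's count is the higher one for every file, which is consistent with the explanation that individuals with inconsistent data are excluded on one side only.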

KasiaSmietanka Sep 25, 2024

@baukearends - I am wondering if it is possible to transfer the processed self-reported outcomes to the CBS platform. This would allow us to verify more granularly whether the self-reported outcomes correlate with ICD diagnosis codes. Please let me know your thoughts.
