
Data Protocols

Giuseppe Tavella edited this page Jul 21, 2023 · 9 revisions

Final dataframe

After cleaning, formatting and organizing data, this is the final dataframe you'll be working with to do your analyses.

Note: the number of Components can vary.

| column | type | example 1 | example 2 |
|---|---|---|---|
| blood_type | string | A | AB |
| rh_factor | integer | 0 | 1 |
| sex | integer | 1 | 0 |
| kell_antigen | string | KK | |
| rh_pheno | string | CcDEe | |
| id_test | string | 010165 | 014422 |
| date_test | date | 8/13/2017 16:55:00 | 10/21/2018 17:10:00 |
| num_test | integer | 4 | 11 |
| MCHC | float | 33.6100 | 28.7500 |
| PLT | float | 178.0000 | 208.0000 |
| .. | .. | .. | .. |
| HIV_NAT | string | negative | negative |

Meaning of booleans 0-1:

| column | 0 | 1 |
|---|---|---|
| sex | female | male |
| rh_factor | negative | positive |
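As a rough illustration of the column types in the final dataframe table above, the sketch below converts a few raw cells to their target types. It assumes the raw cells arrive as strings and that dates use the month/day/year format seen in the examples; `parse_row` is a hypothetical helper, not part of the project.

```python
from datetime import datetime


def parse_row(raw: dict) -> dict:
    """Convert raw string cells to the types listed in the final dataframe."""
    return {
        # id_test stays a string so that leading zeros (e.g. "010165") survive.
        "id_test": raw["id_test"],
        # Assumed format, inferred from the examples: month/day/year hour:min:sec.
        "date_test": datetime.strptime(raw["date_test"], "%m/%d/%Y %H:%M:%S"),
        "num_test": int(raw["num_test"]),
        "MCHC": float(raw["MCHC"]),
    }


row = parse_row(
    {
        "id_test": "010165",
        "date_test": "8/13/2017 16:55:00",
        "num_test": "4",
        "MCHC": "33.6100",
    }
)
print(row["date_test"].year, row["num_test"], row["MCHC"])  # 2017 4 33.61
```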

So, what changes do we apply to get from the initial dataframe, loaded directly from the user, to the dataframe ready for analysis?

Which columns will be renamed, deleted, added or modified? Which data types will change? How will missing values change?

Structure

Every new Blood Test contains different information.

It contains metadata about the blood itself (e.g. the blood type), about the test (e.g. when the blood was donated) and about the person (e.g. male or female).

Each Blood Test also contains the blood component values (e.g. hemoglobin), which are the whole point of analyzing blood in the first place.

Most components are numeric values (float). However, some components (e.g. HIV COMBO) have string values such as positive, negative etc.

Hence the following structure:

Blood Test > Metadata, Components

As the name suggests, Metadata is all about extra information or context.

Metadata > Blood, Test, Person

Components are the actual blood component values.

Components > Numeric, Non-numeric

Let's now look more closely at what counts as Metadata and what counts as Components. This gives us a clear view of which data to analyze and which to leave out, and ultimately gives us predictable data types to work with during development, regardless of user input.

Columns behavior

On Metadata

The column names are

  • case insensitive: blood type = Blood Type
  • order sensitive: blood_type = blood type != type blood
  • limited to _ and space as separators: blood_type = blood type != blood-type
  • given in either the short or the long version, never both.

Any name that breaks these rules is treated as a different column.

| actual column name | match | no match |
|---|---|---|
| blood_type | Blood Type, blood type, bloodtype | blood-type, type blood |
| date_test | Date test, date test | Test date, test-date |
| HGB | hgb, _hgb | H-GB |
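The matching rules in the table above could be sketched roughly as follows. This assumes that matching means lowercasing the name and ignoring underscores and spaces; `normalize`, `match_column` and `KNOWN_COLUMNS` are hypothetical names used only for illustration.

```python
from typing import Optional

# Hypothetical subset of the actual column names.
KNOWN_COLUMNS = ("blood_type", "date_test")


def normalize(name: str) -> str:
    """Lowercase the name and drop the two allowed separators (_ and space)."""
    return name.lower().replace("_", "").replace(" ", "")


def match_column(name: str) -> Optional[str]:
    """Return the actual column name that `name` refers to, or None."""
    for known in KNOWN_COLUMNS:
        if normalize(name) == normalize(known):
            return known
    return None


print(match_column("Blood Type"))  # blood_type
print(match_column("bloodtype"))   # blood_type
print(match_column("blood-type"))  # None: "-" is not an allowed separator
print(match_column("Test date"))   # None: word order matters
```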

After some debate, I've decided: column names are unique, and any column whose name does not exactly match one of the defined column names will be left out.

Therefore, when working with column names, leave them exactly as they are. Don't modify anything, and keep the case as it is.

On Components

Same goes for Components. You need to type in exactly the component you mean, in short format.

WBCF and WBC can mean two completely different things.

Any misspelled blood component is automatically left out, because no corresponding component can be found. So, before starting any analysis, make sure that all the data has been loaded correctly.
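One way to check that nothing was silently left out is to list the names that did not match exactly. The sketch below assumes exact, case-sensitive matching against a small illustrative subset of the short names; `KNOWN_COMPONENTS` and `split_known_unknown` are hypothetical.

```python
# Assumption: a small subset of the short component names, for illustration only.
KNOWN_COMPONENTS = {"WBC", "HGB", "PLT", "MCHC"}


def split_known_unknown(columns):
    """Separate column names into exact matches and names that will be left out."""
    kept = [c for c in columns if c in KNOWN_COMPONENTS]
    dropped = [c for c in columns if c not in KNOWN_COMPONENTS]
    return kept, dropped


kept, dropped = split_known_unknown(["WBC", "WBCF", "hgb", "PLT"])
print(kept)     # ['WBC', 'PLT']
print(dropped)  # ['WBCF', 'hgb']
```

Printing the dropped names before any analysis makes a misspelling visible instead of letting the column disappear unnoticed.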

Metadata

Blood

Features / Columns
blood_type
rh_pheno
kell_antigen

Test

Features / Columns
date_test
id_test
num_test

Person

Features / Columns
sex
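The three Metadata groups above can be written down as a simple mapping and used to split a flat record. This is only a sketch: the dict layout and the `split_record` helper are assumptions, and it treats every column not listed above as a component, so a derived column such as rh_factor would need to be added to the mapping first.

```python
# Column groups copied from the lists above; the dict layout is an assumption.
METADATA = {
    "blood": ["blood_type", "rh_pheno", "kell_antigen"],
    "test": ["date_test", "id_test", "num_test"],
    "person": ["sex"],
}


def split_record(record: dict) -> dict:
    """Group a flat record into Metadata (Blood, Test, Person) and Components."""
    metadata_columns = {col for cols in METADATA.values() for col in cols}
    return {
        "metadata": {
            group: {col: record[col] for col in cols if col in record}
            for group, cols in METADATA.items()
        },
        # Everything that is not listed as metadata is treated as a component.
        "components": {k: v for k, v in record.items() if k not in metadata_columns},
    }


record = {"blood_type": "A", "sex": 1, "id_test": "010165", "MCHC": 33.61, "PLT": 178.0}
grouped = split_record(record)
print(grouped["metadata"]["blood"])  # {'blood_type': 'A'}
print(grouped["components"])         # {'MCHC': 33.61, 'PLT': 178.0}
```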

Components

Numeric

Numeric Blood Values constitute the majority of our data, and they are pretty much what any blood test is all about.

See table below.

Non-numeric

See table below.

Components map

We use this map to look up blood components. Typical flow: the user types in a component; in development, we always refer to a blood component by its short version.

| short | long | type |
|---|---|---|
| WBC | White Blood Cell Count | numeric |
| BC | Blood Count | numeric |
| HGB | Hemoglobin | numeric |
| HCT | Hematocrit | numeric |
| MCV | Mean Corpuscular Volume | numeric |
| MCH | Mean Corpuscular Hemoglobin | numeric |
| MCHC | Mean Corpuscular Hemoglobin Concentration | numeric |
| PLT | Platelets | numeric |
| NEUT | Neutrophils | numeric |
| LYMPH | Lymphocytes | numeric |
| EOS | Eosinophils | numeric |
| chol | Cholesterol | numeric |
| ALT | Alanine Transaminase | numeric |
| AST | Aspartate Transaminase | numeric |
| creat | Creatinine | numeric |
| trigl | Triglycerides | numeric |
| prot_tot | Total protein | numeric |
| ser_iron | Serum iron | numeric |
| anti_HCV | Hepatitis C virus Antibodies | non-numeric |
| HCVNAT | Nucleic Acid Amplification Testing for HCV | non-numeric |
| HIV_NAT | Nucleic Acid Amplification for HIV | non-numeric |
| HBV_NAT | Nucleic Acid Amplification for Hepatitis B virus | non-numeric |
| virology | screening virology (HBsAg- HCV- HIV- syphilis) | non-numeric |
| HIV_COMBO | ? | non-numeric |
| anti_syph | Syphilis Antibodies | non-numeric |
| HBsAg | Hepatitis B surface antigen | non-numeric |
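In code, a map like the one above might be held as a dictionary keyed by the short name. The sketch below copies a few rows only; the tuple layout and the names `COMPONENTS_MAP`, `LONG_TO_SHORT` and `is_numeric` are assumptions made for illustration.

```python
# A few rows of the components map; the (long name, type) layout is an assumption.
COMPONENTS_MAP = {
    "HGB": ("Hemoglobin", "numeric"),
    "PLT": ("Platelets", "numeric"),
    "chol": ("Cholesterol", "numeric"),
    "HBsAg": ("Hepatitis B surface antigen", "non-numeric"),
}

# Reverse lookup: long name -> short name.
LONG_TO_SHORT = {long_name: short for short, (long_name, _) in COMPONENTS_MAP.items()}


def is_numeric(short: str) -> bool:
    """True if the component holds float values, False if it holds strings."""
    return COMPONENTS_MAP[short][1] == "numeric"


print(LONG_TO_SHORT["Hemoglobin"])  # HGB
print(is_numeric("PLT"))            # True
print(is_numeric("HBsAg"))          # False
```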

Metadata map

| short | long | type |
|---|---|---|
| blood_type | Blood Type | string |
| rh_pheno | Rh phenotype | string |
| kell_antigen | Kell Antigen | string |
| date_test | Date test | date |
| id_test | ID test | string |
| num_test | Test number | integer |
| sex | Sex | boolean string |

How the data will be transformed:

blood_type (A+ AB- O+ ..) --> blood_type (A AB O ..) AND new column rh_factor (+|-)
sex (M|F) --> sex (1|0)
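The two transformations above could look roughly like this on a single row. The sketch assumes the 0/1 coding from the "Meaning of booleans" table (negative = 0, positive = 1, female = 0, male = 1); `split_blood_type`, `transform_row` and `SEX_CODES` are hypothetical names.

```python
from typing import Tuple

# Assumed coding, taken from the "Meaning of booleans" table.
SEX_CODES = {"F": 0, "M": 1}


def split_blood_type(raw: str) -> Tuple[str, int]:
    """Split e.g. 'AB-' into ('AB', 0), where 0 = negative and 1 = positive."""
    if not raw or raw[-1] not in ("+", "-"):
        raise ValueError(f"blood type without Rh sign: {raw!r}")
    return raw[:-1], 1 if raw[-1] == "+" else 0


def transform_row(row: dict) -> dict:
    """Return a copy of the row with blood_type split and sex encoded as 0/1."""
    out = dict(row)
    out["blood_type"], out["rh_factor"] = split_blood_type(row["blood_type"])
    out["sex"] = SEX_CODES[row["sex"]]
    return out


print(transform_row({"blood_type": "AB-", "sex": "F"}))
# {'blood_type': 'AB', 'sex': 0, 'rh_factor': 0}
```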

Reference scale

My hemoglobin is low - how do you know? You can only know by comparing the hemoglobin value with the interval, or range, of recommended, normal values. We call the latter a Reference scale.

component = long version of the blood component. Lymphocytes, platelets, white blood cells etc.
measure = the unit in which the component is measured. 10^6/μl, mg/dl, μg/dl etc.
min = minimum value for this component
max = maximum value for this component

By default, I will use the Reference scale of the laboratory that does my analyses. Here:

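| comp | measure | min | max |
|---|---|---|---|
| WBC | 10^3/μl | 3.70 | 10.00 |
| BC | 10^6/μl | 4.06 | 6.00 |
| HGB | g/dl | 12.00 | 16.50 |
| HCT | % | 37.00 | 50.00 |
| MCV | fl | 80.00 | 98.00 |
| MCH | pg | 26.00 | 31.20 |
| MCHC | g/dl | 31.00 | 36.00 |
| PLT | 10^3/μl | 150.00 | 400.00 |
| NEUT | % | 40.00 | 70.00 |
| LYMPH | % | 20.00 | 40.00 |
| EOS | % | 0.00 | 7.00 |
| chol | mg/dl | 0.00 | 239.00 |
| ALT | U/L | 0.00 | 50.00 |
| AST | U/L | 17.00 | 59.00 |
| creat | mg/dl | 0.66 | 1.25 |
| trigl | mg/dl | 0.00 | 200.00 |
| prot_tot | g/dl | 6.30 | 8.30 |
| ser_iron | μg/dl | 37.00 | 158.00 |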
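Comparing a value against the Reference scale can be sketched as below. The three ranges are copied from the table above; treating min and max as inclusive bounds is an assumption, and `REFERENCE_SCALE` and `classify` are hypothetical names.

```python
# (min, max) per component, copied from the Reference scale table above.
REFERENCE_SCALE = {
    "HGB": (12.00, 16.50),
    "MCHC": (31.00, 36.00),
    "PLT": (150.00, 400.00),
}


def classify(comp: str, value: float) -> str:
    """Return 'low', 'normal' or 'high' for a value of the given component."""
    low, high = REFERENCE_SCALE[comp]
    if value < low:
        return "low"
    if value > high:
        return "high"
    # Assumption: the bounds themselves count as normal.
    return "normal"


print(classify("MCHC", 28.75))  # low
print(classify("MCHC", 33.61))  # normal
print(classify("PLT", 178.0))   # normal
```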
