Data Protocols
After cleaning, formatting and organizing data, this is the final dataframe you'll be working with to do your analyses.
Note: the number of Components can vary.
| column | type | example 1 | example 2 |
|---|---|---|---|
| blood_type | string | A | AB |
| rh_factor | integer | 0 | 1 |
| sex | integer | 1 | 0 |
| kell_antigen | string | KK | |
| rh_pheno | string | CcDEe | |
| id_test | string | 010165 | 014422 |
| date_test | date | 8/13/2017 16:55:00 | 10/21/2018 17:10:00 |
| num_test | integer | 4 | 11 |
| MCHC | float | 33.6100 | 28.7500 |
| PLT | float | 178.0000 | 208.0000 |
| .. | .. | .. | .. |
| HIV_NAT | string | negative | negative |
Meaning of booleans 0-1:
| column | 0 | 1 |
|---|---|---|
| sex | female | male |
| rh_factor | negative | positive |
So, what changes are we going to apply to get from the initial dataframe, loaded directly from the user, to the dataframe ready for analysis?
What columns will be renamed, deleted, added, modified? What data types will change? How will missing values change?
Every new Blood Test contains different information.
It contains metadata about the blood itself (e.g. the blood type), about the test (e.g. when the blood was donated) and about the person (e.g. male or female).
Each Blood Test also contains the blood component values (e.g. hemoglobin), which are the whole point of analyzing blood in the first place.
Most components are numeric values (float). However, some components (e.g. HIV COMBO) have string values such as positive, negative etc.
Hence the following structure:
Blood Test > Metadata, Components
As the name suggests, Metadata is all about extra information or context.
Metadata > Blood, Test, Person
Components are the actual blood component values.
Components > Numeric, Non-numeric
Let's now better understand what we can consider as Metadata and what we consider as Values. This will help us have a crystal-clear view of what data to analyze, what data to leave out, what to do with what, and eventually have predictable data types to work with during development - regardless of user input.
The column names are
- case insensitive. blood type = Blood Type
- order sensitive. blood_type = blood type != type blood
- only characters allowed are `_` and space. blood_type = blood type != blood-type
- cannot have both short and long version, either one or the other.

Any naming different from this will be treated as a different column.
| actual column name | match | no match |
|---|---|---|
| blood_type | Blood Type, blood type, bloodtype | blood-type, type blood |
| date_test | Date test, date test | Test date, test-date |
| HGB | hgb, _hgb | H-GB |
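
The matching rules above can be sketched as a small normalizer. This is only a minimal sketch, assuming that case, spaces and underscores are ignored while word order and any other character (such as `-`) still matter. The names `normalize` and `matches` are hypothetical and not part of the project.

```python
def normalize(name: str) -> str:
    """Lowercase the name and drop spaces and underscores (assumed rule)."""
    return name.lower().replace("_", "").replace(" ", "")


def matches(candidate: str, actual: str) -> bool:
    """Return True when both names normalize to the same key."""
    return normalize(candidate) == normalize(actual)


# Check a few of the names from the table against blood_type.
for candidate in ["Blood Type", "bloodtype", "blood-type", "type blood"]:
    print(candidate, "->", matches(candidate, "blood_type"))
```

Because the hyphen is kept and word order is preserved, `blood-type` and `type blood` normalize to different keys than `blood_type`.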
After some self-debating, I've decided: column names are unique, and column names that don't exactly match the already defined column names will be left out.
Therefore, when working with column names, leave them exactly as they are. Don't modify anything, and keep the case as it is.
The same goes for Components. You need to type in exactly the component you mean, in short format.
Spelling WBCF instead of WBC can mean two completely different things.
Thus, any misspelled blood component will automatically be left out, because no corresponding component can be found. Before starting any analysis, make sure that all the data has been loaded correctly.
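
One way to check that everything was loaded correctly is to list which input columns would be left out. The sketch below assumes exact, case-sensitive matching as decided above. `KNOWN_COLUMNS` is only an illustrative subset of the defined names, and `split_known` is a hypothetical helper.

```python
# Illustrative subset of the defined column names (not the full list).
KNOWN_COLUMNS = {
    "blood_type", "rh_pheno", "kell_antigen",
    "date_test", "id_test", "num_test", "sex",
    "WBC", "HGB", "PLT",
}


def split_known(columns):
    """Separate input columns into those kept and those left out."""
    kept = [c for c in columns if c in KNOWN_COLUMNS]
    dropped = [c for c in columns if c not in KNOWN_COLUMNS]
    return kept, dropped


kept, dropped = split_known(["blood_type", "WBCF", "HGB", "Sex"])
print(kept)     # ['blood_type', 'HGB']
print(dropped)  # ['WBCF', 'Sex']
```

A non-empty `dropped` list is a hint that a column or component was misspelled in the input.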
Blood

| Features / Columns |
|---|
| blood_type |
| rh_pheno |
| kell_antigen |

Test

| Features / Columns |
|---|
| date_test |
| id_test |
| num_test |

Person

| Features / Columns |
|---|
| sex |
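
The Metadata groups listed above can be written down as a small lookup, so that every column outside it is treated as a Component. This is a sketch under that assumption. `METADATA_GROUPS` and the two helper functions are hypothetical names.

```python
# Metadata columns grouped as Blood, Test and Person.
METADATA_GROUPS = {
    "blood": ["blood_type", "rh_pheno", "kell_antigen"],
    "test": ["date_test", "id_test", "num_test"],
    "person": ["sex"],
}


def metadata_columns():
    """Flatten the groups into a single list of metadata column names."""
    return [col for cols in METADATA_GROUPS.values() for col in cols]


def component_columns(columns):
    """Return the columns that are not metadata, in their original order."""
    meta = set(metadata_columns())
    return [c for c in columns if c not in meta]


print(component_columns(["blood_type", "sex", "MCHC", "PLT", "HIV_NAT"]))
# ['MCHC', 'PLT', 'HIV_NAT']
```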
Numeric Blood Values constitute the majority of our data, and they are pretty much what any blood test is all about.
See table below.
We use this table to map component names. Typical flow: the user types in a component name, and in development we always refer to that blood component by its short version.
| short | long | type |
|---|---|---|
| WBC | White Blood Count | numeric |
| BC | Blood Count | numeric |
| HGB | Hemoglobin | numeric |
| HCT | Hematocrit | numeric |
| MCV | Mean Corpuscular Volume | numeric |
| MCH | Mean Corpuscular Hemoglobin | numeric |
| MCHC | Mean Corpuscular Hemoglobin Concentration | numeric |
| PLT | Platelets | numeric |
| NEUT | Neutrophils | numeric |
| LYMPH | Lymphocytes | numeric |
| EOS | Eosinophils | numeric |
| chol | Cholesterol | numeric |
| ALT | Alanine Transaminase | numeric |
| AST | Aspartate Transaminase | numeric |
| creat | Creatinine | numeric |
| trigl | Triglycerides | numeric |
| prot_tot | Total protein | numeric |
| ser_iron | Serum iron | numeric |
| anti_HCV | Hepatitis C virus Antibodies | non-numeric |
| HCVNAT | Nucleic Acid Amplification Testing for HCV | non-numeric |
| HIV_NAT | Nucleic Acid Amplification for HIV | non-numeric |
| HBV_NAT | Nucleic Acid Amplification for Hepatitis B virus | non-numeric |
| virology | screening virology (HBsAg- HCV- HIV- syphilis) | non-numeric |
| HIV_COMBO | ? | non-numeric |
| anti_syph | Syphilis Antibodies | non-numeric |
| HBsAg | Hepatitis B surface antigen | non-numeric |
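
The table above can be kept as a lookup keyed by the short name. The sketch below copies only a few rows for illustration. `COMPONENTS`, `long_name` and `is_numeric` are hypothetical names, not the project's actual code.

```python
# short name -> (long name, type); a few rows copied from the table above.
COMPONENTS = {
    "WBC": ("White Blood Count", "numeric"),
    "HGB": ("Hemoglobin", "numeric"),
    "PLT": ("Platelets", "numeric"),
    "HIV_NAT": ("Nucleic Acid Amplification for HIV", "non-numeric"),
}


def long_name(short: str) -> str:
    """Look up the long name of a component by its short name."""
    return COMPONENTS[short][0]


def is_numeric(short: str) -> bool:
    """Tell whether a component holds numeric values."""
    return COMPONENTS[short][1] == "numeric"


print(long_name("HGB"))       # Hemoglobin
print(is_numeric("HIV_NAT"))  # False
```

An unknown short name such as `WBCF` raises a `KeyError` here, which makes misspellings visible early.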
| short | long | type |
|---|---|---|
| blood_type | Blood Type | string |
| rh_pheno | Rh phenotype | string |
| kell_antigen | Kell Antigen | string |
| date_test | Date test | date |
| id_test | ID test | string |
| num_test | Test number | integer |
| sex | Sex | boolean string |
How the data will be transformed: \
blood_type (A+ AB- O+ ..) --> blood_type (A AB O ..) AND new column rh_factor (1|0, for +|-)
sex (M|F) --> sex (1|0)
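
The two transformations can be sketched value by value. This assumes that the input blood type always ends in `+` or `-` and that sex is given as `M` or `F`, with the 0-1 meanings from the table near the top. `split_blood_type` and `encode_sex` are hypothetical helper names.

```python
SEX_CODES = {"M": 1, "F": 0}  # 1 = male, 0 = female


def split_blood_type(value: str):
    """Split 'AB-' into ('AB', 0) and 'O+' into ('O', 1)."""
    value = value.strip()
    if not value or value[-1] not in ("+", "-"):
        raise ValueError(f"blood type without Rh sign: {value!r}")
    rh_factor = 1 if value[-1] == "+" else 0
    return value[:-1], rh_factor


def encode_sex(value: str) -> int:
    """Map 'M' to 1 and 'F' to 0."""
    return SEX_CODES[value.strip().upper()]


# Apply both transformations to a couple of made-up input records.
records = [{"blood_type": "A+", "sex": "M"}, {"blood_type": "AB-", "sex": "F"}]
for record in records:
    record["blood_type"], record["rh_factor"] = split_blood_type(record["blood_type"])
    record["sex"] = encode_sex(record["sex"])
print(records)
```

With a pandas dataframe, the same functions could be applied column-wise, for example through `Series.map`.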
"My hemoglobin is low": how do you know? You can only know by comparing the hemoglobin value against the interval, or range, of recommended normal values. We call the latter a Reference scale. \
component = long version of the blood component. Lymphocytes, platelets, white blood cells etc.
measure = the unit in which the component is measured. 10^6/μl, mg/dl, μg/dl etc.
min = minimum normal value for this component
max = maximum normal value for this component
By default, I will use the Reference scale of the laboratory that does my analyses. Here:
| comp | measure | min | max |
|---|---|---|---|
| WBC | 10^3/μl | 3.70 | 10.00 |
| BC | 10^6/μl | 4.06 | 6.00 |
| HGB | g/dl | 12.00 | 16.50 |
| HCT | % | 37.00 | 50.00 |
| MCV | fl | 80.00 | 98.00 |
| MCH | pg | 26.00 | 31.20 |
| MCHC | g/dl | 31.00 | 36.00 |
| PLT | 10^3/μl | 150.00 | 400.00 |
| NEUT | % | 40.00 | 70.00 |
| LYMPH | % | 20.00 | 40.00 |
| EOS | % | 0.00 | 7.00 |
| chol | mg/dl | 0.00 | 239.00 |
| ALT | U/L | 0.00 | 50.00 |
| AST | U/L | 17.00 | 59.00 |
| creat | mg/dl | 0.66 | 1.25 |
| trigl | mg/dl | 0.00 | 200.00 |
| prot_tot | g/dl | 6.30 | 8.30 |
| ser_iron | μg/dl | 37.00 | 158.00 |
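
Comparing a value against the Reference scale can be sketched as follows. The sketch copies only a few rows from the table above and assumes that both `min` and `max` count as normal (inclusive bounds). `REFERENCE` and `classify` are hypothetical names.

```python
# component -> (min, max); a few rows copied from the Reference scale above.
REFERENCE = {
    "WBC": (3.70, 10.00),
    "HGB": (12.00, 16.50),
    "MCHC": (31.00, 36.00),
    "PLT": (150.00, 400.00),
}


def classify(component: str, value: float) -> str:
    """Return 'low', 'normal' or 'high' relative to the reference range."""
    low, high = REFERENCE[component]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


print(classify("MCHC", 28.75))  # low
print(classify("PLT", 178.0))   # normal
```

The two calls use the example values from the first table: an MCHC of 28.75 falls below the 31.00 minimum, while a PLT of 178 sits inside its range.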