|
1 | 1 | # Measurements |
2 | 2 |
|
3 | 3 | The `eds.measurements` pipeline's role is to detect and normalise numerical measurements within a medical document. |
4 | | -We use simple regular expressions to extract and normalize measurements, and use `Measurement` classes to store them. |
5 | | - |
6 | | -!!! warning |
7 | | - |
8 | | - The ``measurements`` pipeline is still in active development and has not been rigorously validated. |
9 | | - If you come across a measurement expression that goes undetected, please file an issue ! |
| 4 | +We use simple regular expressions to extract and normalize measurements, and use `SimpleMeasurement` classes to store them. |
10 | 5 |
|
11 | 6 | ## Scope |
12 | 7 |
|
13 | | -The `eds.measurements` pipeline can extract simple (eg `3cm`) measurements. |
14 | | -It can detect elliptic enumerations (eg `32, 33 et 34kg`) of measurements of the same type and split the measurements accordingly. |
| 8 | +By default, the `eds.measurements` pipeline lets you match all measurements, i.e. measurements in most units as well as unitless measurements. If a unit is not in our registry,
| 9 | +you can add it manually. Otherwise, the measurement will be matched without its unit.
15 | 10 |
|
16 | | -The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit. |
17 | | - |
18 | | -The current pipeline annotates the following measurements out of the box: |
| 11 | +If you prefer matching specific measurements only, you can create your own measurement config and set the `all_measurements` parameter to `False`. Nevertheless, some default measurement configs are already provided out of the box:
19 | 12 |
|
20 | 13 | | Measurement name | Example | |
21 | 14 | | ---------------- | ---------------------- | |
22 | 15 | | `eds.size` | `1m50`, `1.50m` | |
23 | 16 | | `eds.weight` | `12kg`, `1kg300` | |
24 | 17 | | `eds.bmi` | `BMI: 24`, `24 kg.m-2` | |
25 | 18 | | `eds.volume` | `2 cac`, `8ml` | |
| 19 | +| `eds.bool` | `positive`, `negatif` | |
| 20 | + |
| 21 | +The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit (eg `span._.value.g_per_cl` or `span._.value.kg_per_m3` for a density). |
| 22 | + |
| 23 | +The measurements that can be extracted can have one or many of the following characteristics: |
| 24 | +- Unitless measurements |
| 25 | +- Measurements with unit |
| 26 | +- Measurements with range indication (especially < or >)
| 27 | +- Measurements with power |
| 28 | + |
| 29 | +The measurement can be written in many complex forms. Among them, this pipe can detect: |
| 30 | +- Measurements with range indication, numerical value, power and units in many different orders and separated by customizable stop words |
| 31 | +- Composed units (eg `1m50`) |
| 32 | +- Measurements with "unitless patterns", i.e. some textual information next to a numerical value which allows us to retrieve a unit even if it is not written (eg in the text `Weight: 80`, this pipe will detect the numerical value `80` and match it to the unit `kg`)
| 33 | +- Elliptic enumerations (eg `32, 33 et 34mol`) of measurements of the same type, which are split into individual measurements
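
To make the last item more concrete, here is a minimal sketch of how an elliptic enumeration could be split with a regular expression. It is not the pipeline's actual implementation: `NUMBER`, `ENUM_PATTERN` and `split_enumeration` are hypothetical names, and the pattern only handles `,` and `et` as separators.

```python
import re

# Hypothetical, simplified patterns (not the ones used by eds.measurements).
NUMBER = r"\d+(?:\.\d+)?"
ENUM_PATTERN = re.compile(
    rf"(?P<numbers>{NUMBER}(?:(?:\s*,\s*|\s+et\s+){NUMBER})*)\s*(?P<unit>[a-zA-Z]+)"
)


def split_enumeration(text):
    """Return one (value, unit) pair per number sharing the trailing unit."""
    match = ENUM_PATTERN.search(text)
    if match is None:
        return []
    unit = match.group("unit")
    values = re.findall(NUMBER, match.group("numbers"))
    return [(float(value), unit) for value in values]


print(split_enumeration("32, 33 et 34mol"))
```

Each number in the enumeration inherits the unit written after the last one, which is the behaviour described above.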
26 | 34 |
|
27 | 35 | ## Usage |
28 | 36 |
|
| 37 | +This pipe works better when the `eds.dates` and `eds.tables` pipes are used alongside it. They let `eds.measurements` avoid matching dates as measurements and apply a specific matching to each table, taking advantage of the structured data.
| 38 | + |
| 39 | +Matched measurements are labeled with a default measurement name when one is available (eg `eds.size`). Otherwise, if no default measurement is linked to the dimension of the unit and `all_measurements` is set to `True`, they are labeled `eds.measurement`.
| 40 | + |
| 41 | +As mentioned above, the value of each matched measurement can be accessed via `span._.value`. This gives you a `SimpleMeasurement` object with the following attributes:
| 42 | +- `value_range` ("<", "=" or ">") |
| 43 | +- `value` |
| 44 | +- `unit` |
| 45 | +- `registry` (this attribute stores the entire unit config, such as the links between units and the dimension of each unit, like `length`, `quantity of matter`...)
| 46 | + |
| 47 | +`SimpleMeasurement` objects are especially useful for converting measurements to another unit of the same dimension (eg densities stay densities). To do so, access the attribute of your `SimpleMeasurement` named after the usual unit abbreviations, with `per` and `_` as separators (eg `object.kg_per_dm3`, `object.mol_per_l`, `object.g_per_cm2`).
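
The attribute-based conversion can be illustrated with a small toy class. This is only a sketch of the idea, not the real `SimpleMeasurement`: `ToyMeasurement`, `UNITS` and `parse_unit` are hypothetical, and the unit table is a tiny assumed subset.

```python
# Assumed toy unit table: unit -> (dimension, scale to a base unit).
UNITS = {
    "kg": ("mass", 1000.0),  # base unit: g
    "g": ("mass", 1.0),
    "m3": ("volume", 1000.0),  # base unit: l
    "dm3": ("volume", 1.0),
    "l": ("volume", 1.0),
    "cl": ("volume", 0.01),
}


def parse_unit(name):
    """Return (dimensions, scale) for names such as 'kg' or 'kg_per_m3'."""
    numerator, _, denominator = name.partition("_per_")
    dim, scale = UNITS[numerator]
    if denominator:
        den_dim, den_scale = UNITS[denominator]
        return (dim, den_dim), scale / den_scale
    return (dim, None), scale


class ToyMeasurement:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for unit names.
        try:
            target_dims, target_scale = parse_unit(name)
        except KeyError:
            raise AttributeError(name)
        source_dims, source_scale = parse_unit(self.unit)
        if source_dims != target_dims:
            raise AttributeError(f"cannot convert {self.unit} to {name}")
        return self.value * source_scale / target_scale


density = ToyMeasurement(1.0, "kg_per_dm3")
print(density.g_per_cl, density.kg_per_m3)
```

Resolving the unit name in `__getattr__` is what allows any combination of known units to be requested without declaring one property per unit; a conversion across dimensions fails with an `AttributeError` in this sketch.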
| 48 | + |
| 49 | +Moreover, for now, `SimpleMeasurement` objects support the following operations:
| 50 | +- comparison with another `SimpleMeasurement` object of the same dimension, with automatic conversion (eg a density in kg_per_m3 and a density in g_per_l)
| 51 | +- addition of another `SimpleMeasurement` object of the same dimension, with automatic conversion
| 52 | +- subtraction of another `SimpleMeasurement` object of the same dimension, with automatic conversion
| 53 | + |
| 54 | +Note that for all operations listed above, different `value_range` attributes between the two measurements do not matter: by default, the `value_range` of the first measurement is kept.
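
The following sketch shows what these operations and the `value_range` rule could look like on a toy class. `ToyWeight` and `SCALES_TO_GRAMS` are hypothetical and do not reflect the real `SimpleMeasurement` code; in particular, expressing the result in the unit of the first operand is an assumption made for this example.

```python
# Assumed scales to grams, for illustration only.
SCALES_TO_GRAMS = {"kg": 1000.0, "g": 1.0, "mg": 0.001}


class ToyWeight:
    """Hypothetical stand-in for a measurement whose dimension is mass."""

    def __init__(self, value_range, value, unit):
        self.value_range = value_range  # "<", "=" or ">"
        self.value = value
        self.unit = unit

    def _value_in(self, unit):
        # Convert this measurement's value to the given unit.
        return self.value * SCALES_TO_GRAMS[self.unit] / SCALES_TO_GRAMS[unit]

    def __add__(self, other):
        # The result keeps the unit and value_range of the first operand.
        return ToyWeight(
            self.value_range, self.value + other._value_in(self.unit), self.unit
        )

    def __sub__(self, other):
        return ToyWeight(
            self.value_range, self.value - other._value_in(self.unit), self.unit
        )

    def __lt__(self, other):
        # value_range is ignored: only the converted values are compared.
        return self.value < other._value_in(self.unit)

    def __eq__(self, other):
        return self.value == other._value_in(self.unit)


heavy = ToyWeight("<", 1.0, "kg")
light = ToyWeight(">", 500.0, "g")
total = heavy + light
print(total.value_range, total.value, total.unit)
```

Here `total` carries the `value_range` of `heavy`, the first operand, even though `light` has a different one.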
| 55 | + |
| 56 | +Below is a complete example of a use case where we want to extract size, weight and BMI measurements from a simple text.
| 57 | + |
29 | 58 | ```python |
30 | 59 | import spacy |
31 | 60 |
|
@@ -77,7 +106,7 @@ str(measurements[4]._.value.kg_per_m2) |
77 | 106 |
|
78 | 107 | ## Custom measurement |
79 | 108 |
|
80 | | -You can declare custom measurements by changing the patterns |
| 109 | +You can declare custom measurements by changing the patterns. |
81 | 110 |
|
82 | 111 | ```python |
83 | 112 | import spacy |
@@ -114,21 +143,24 @@ nlp.add_pipe( |
114 | 143 | ## Declared extensions |
115 | 144 |
|
116 | 145 | The `eds.measurements` pipeline declares a single [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object, |
117 | | -the `value` attribute that is a `Measurement` instance. |
| 146 | +the `value` attribute that is a `SimpleMeasurement` instance. |
118 | 147 |
|
119 | 148 | ## Configuration |
120 | 149 |
|
121 | 150 | The pipeline can be configured using the following parameters : |
122 | 151 |
|
123 | | -| Parameter | Explanation | Default | |
124 | | -| ----------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------- | |
125 | | -| `measurements` | A list or dict of the measurements to extract | `["eds.size", "eds.weight", "eds.angle"]` | |
126 | | -| `units_config` | A dict describing the units with lexical patterns, dimensions, scales, ... | ... | |
127 | | -| `number_terms` | A dict describing the textual forms of common numbers | ... | |
128 | | -| `stopwords` | A list of stopwords that do not matter when placed between a unitless trigger | ... | |
129 | | -| `unit_divisors` | A list of terms used to divide two units (like: m / s) | ... | |
130 | | -| `ignore_excluded` | Whether to ignore excluded tokens for matching | `False` | |
131 | | -| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` | |
| 152 | +| Parameter | Explanation | Default | |
| 153 | +| ------------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------- | |
| 154 | +| `measurements` | A list or dict of the measurements to extract | `None` # Extract measurements from all units | |
| 155 | +| `units_config` | A dict describing the units with lexical patterns, dimensions, scales, ... | ... # Config of mostly all commonly used units | |
| 156 | +| `number_terms` | A dict describing the textual forms of common numbers | ... # Config of mostly all commonly used textual forms of common numbers | |
| 157 | +| `value_range_terms` | A dict describing the textual forms of ranges ("<", "=" or ">") | ... # Config of mostly all commonly used range terms | |
| 158 | +| `stopwords_unitless` | A list of stopwords that do not matter when placed between a unitless trigger | `["par", "sur", "de", "a", ":", ",", "et"]` | |
| 159 | +| `stopwords_measure_unit` | A list of stopwords that do not matter when placed between a measure and a unit | `["|", "¦", "…", "."]` | |
| 160 | +| `measure_before_unit`    | A bool to tell if the numerical value is usually placed before the unit          | ...                                                                       | |
| 161 | +| `unit_divisors` | A list of terms used to divide two units (like: m / s) | `["/", "par"]` | |
| 162 | +| `ignore_excluded` | Whether to ignore excluded tokens for matching | `False` | |
| 163 | +| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` | |
132 | 164 |
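
As a rough sketch, the parameters with explicit defaults in the table above can be gathered into a config dict. The variable name `measurements_config` is hypothetical, and the values are simply copied from the table rather than checked against the library.

```python
# Values copied from the "Default" column of the table above.
measurements_config = dict(
    measurements=None,  # extract measurements from all units
    stopwords_unitless=["par", "sur", "de", "a", ":", ",", "et"],
    stopwords_measure_unit=["|", "¦", "…", "."],
    unit_divisors=["/", "par"],
    ignore_excluded=False,
    attr="NORM",
)

# With spaCy and EDS-NLP installed, and `nlp` created as in the usage
# example, the dict would be passed to the pipe like this:
# nlp.add_pipe("eds.measurements", config=measurements_config)
```

Only the keys you want to change need to be given; the other parameters keep their defaults.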
|
133 | 165 | ## Authors and citation |
134 | 166 |
|
|