Commit ebeb79c

🥅 Improved measurements + Improved tables
1 parent bb86cc0 commit ebeb79c

11 files changed: +3363 -413 lines changed

docs/pipelines/misc/measurements.md

Lines changed: 54 additions & 22 deletions
@@ -1,31 +1,60 @@
11
# Measurements
22

33
The `eds.measurements` pipeline's role is to detect and normalise numerical measurements within a medical document.
4-
We use simple regular expressions to extract and normalize measurements, and use `Measurement` classes to store them.
5-
6-
!!! warning
7-
8-
The ``measurements`` pipeline is still in active development and has not been rigorously validated.
9-
If you come across a measurement expression that goes undetected, please file an issue !
4+
We use simple regular expressions to extract and normalize measurements, and use `SimpleMeasurement` classes to store them.
105

116
## Scope
127

13-
The `eds.measurements` pipeline can extract simple (eg `3cm`) measurements.
14-
It can detect elliptic enumerations (eg `32, 33 et 34kg`) of measurements of the same type and split the measurements accordingly.
8+
By default, the `eds.measurements` pipeline lets you match all measurements, i.e. measurements in most units as well as unitless measurements. If a unit is not in our registry,
9+
then you can add it manually. Otherwise, the measurement will be matched without its unit.
1510

16-
The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit.
17-
18-
The current pipeline annotates the following measurements out of the box:
11+
If you prefer matching specific measurements only, you can create your own measurement config and set the `all_measurements` parameter to `False`. Nevertheless, some default measurement configs are already provided out of the box:
1912

2013
| Measurement name | Example |
2114
| ---------------- | ---------------------- |
2215
| `eds.size` | `1m50`, `1.50m` |
2316
| `eds.weight` | `12kg`, `1kg300` |
2417
| `eds.bmi` | `BMI: 24`, `24 kg.m-2` |
2518
| `eds.volume` | `2 cac`, `8ml` |
19+
| `eds.bool` | `positive`, `negatif` |
20+
21+
The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit (eg `span._.value.g_per_cl` or `span._.value.kg_per_m3` for a density).
22+
23+
The measurements that can be extracted can have one or many of the following characteristics:
24+
- Unitless measurements
25+
- Measurements with unit
26+
- Measurements with range indication (especially < or >)
27+
- Measurements with power
28+
29+
The measurement can be written in many complex forms. Among them, this pipe can detect:
30+
- Measurements with range indication, numerical value, power and units in many different orders and separated by customizable stop words
31+
- Composed units (eg `1m50`)
32+
- Measurements with "unitless patterns", i.e. some textual information next to a numerical value that allows us to retrieve a unit even if it is not written (eg in the text `Weight: 80`, this pipe will detect the numerical value `80` and match it to the unit `kg`)
33+
- Elliptic enumerations (eg `32, 33 et 34mol`) of measurements of the same type, which are split into individual measurements
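
To make the elliptic enumeration case concrete, here is a minimal sketch in plain Python. It is not the pipeline's implementation: the `ENUM` pattern and the `split_enumeration` helper are hypothetical names introduced for illustration, and the sketch only handles `,` and `et` as separators with one shared trailing unit.

```python
import re

# Hypothetical pattern (not taken from the library): numbers separated by
# "," or "et", followed by a single unit shared by the whole enumeration.
ENUM = re.compile(
    r"(?P<numbers>\d+(?:\.\d+)?(?:(?:\s*,\s*|\s+et\s+)\d+(?:\.\d+)?)*)"
    r"\s*(?P<unit>[a-zA-Z]+)"
)


def split_enumeration(text):
    """Return one (value, unit) pair per number found in the enumeration."""
    match = ENUM.search(text)
    if match is None:
        return []
    unit = match.group("unit")
    values = re.findall(r"\d+(?:\.\d+)?", match.group("numbers"))
    return [(float(value), unit) for value in values]


print(split_enumeration("Poids : 32, 33 et 34kg"))
# [(32.0, 'kg'), (33.0, 'kg'), (34.0, 'kg')]
```

The unit is read once and attached to every number, which is the behaviour described for enumerations of measurements of the same type.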
2634

2735
## Usage
2836

37+
This pipe works better when the `eds.dates` and `eds.tables` pipes are used alongside it. They let `eds.measurements` avoid matching dates as measurements and apply a specific matching to each table, taking advantage of the structured data.
38+
39+
The matched measurements are labeled with a default measurement name if one is available (eg `eds.size`). Otherwise, they are labeled `eds.measurement`, provided that no named measurement is linked to the dimension of the unit and that `all_measurements` is set to `True`.
40+
41+
As mentioned above, each matched measurement can be accessed via the `span._.value` attribute. This gives you a `SimpleMeasurement` object with the following attributes:
42+
- `value_range` ("<", "=" or ">")
43+
- `value`
44+
- `unit`
45+
- `registry` (this attribute stores the entire unit config, such as the links between units and their dimensions like `length`, `quantity of matter`...)
46+
47+
`SimpleMeasurement` objects are especially useful for converting measurements to another unit of the same dimension (eg densities stay densities). To do so, access an attribute on your `SimpleMeasurement` named after the usual unit abbreviations, with `per` and `_` as separators (eg `object.kg_per_dm3`, `mol_per_l`, `g_per_cm2`).
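
The naming convention can be illustrated with a small standalone sketch. This is not how `SimpleMeasurement` is implemented: the `SCALES` table, `unit_factor` and `convert` are hypothetical helpers, the scale table covers only a handful of units, and only a single `_per_` separator is handled.

```python
# Hypothetical scale table: (dimension, factor to a base unit of that dimension).
SCALES = {
    "g": ("mass", 1.0),
    "kg": ("mass", 1000.0),
    "l": ("volume", 1.0),
    "dm3": ("volume", 1.0),
    "m3": ("volume", 1000.0),
}


def unit_factor(name):
    """Return (dimensions, scale to base units) for names like 'kg_per_m3'."""
    numerator, _, denominator = name.partition("_per_")
    num_dim, num_scale = SCALES[numerator]
    if not denominator:
        return (num_dim,), num_scale
    den_dim, den_scale = SCALES[denominator]
    return (num_dim, den_dim), num_scale / den_scale


def convert(value, source, target):
    """Convert a value between two units that share the same dimensions."""
    src_dims, src_scale = unit_factor(source)
    tgt_dims, tgt_scale = unit_factor(target)
    if src_dims != tgt_dims:
        raise ValueError(f"Cannot convert {source} to {target}")
    return value * src_scale / tgt_scale


print(convert(2.0, "kg_per_dm3", "g_per_l"))  # 2000.0
```

Comparing the dimension tuples before converting is what keeps "densities stay densities": a mass cannot silently become a volume.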
48+
49+
Moreover, for now, `SimpleMeasurement` objects can be manipulated with the following operations:
50+
- compared with another `SimpleMeasurement` object of the same dimension, with automatic conversion (eg a density in kg_per_m3 and a density in g_per_l)
51+
- summed with another `SimpleMeasurement` object of the same dimension, with automatic conversion
52+
- subtracted from another `SimpleMeasurement` object of the same dimension, with automatic conversion
53+
54+
Note that for all operations listed above, different `value_range` attributes between the two measurements do not matter: by default, the `value_range` of the first measurement is kept.
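
The semantics of these operations can be sketched with a toy class. `ToyMeasurement` and the `TO_GRAMS` table are hypothetical and only cover masses. The sketch mirrors the behaviour described above (automatic conversion, `value_range` of the first operand kept) and is not the library's `SimpleMeasurement`.

```python
from dataclasses import dataclass

# Hypothetical conversion table restricted to one dimension (mass).
TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


@dataclass
class ToyMeasurement:
    value: float
    unit: str
    value_range: str = "="

    def to(self, unit):
        """Return the numerical value expressed in another mass unit."""
        return self.value * TO_GRAMS[self.unit] / TO_GRAMS[unit]

    def __add__(self, other):
        # The result keeps the unit and the value_range of the first operand.
        return ToyMeasurement(
            self.value + other.to(self.unit), self.unit, self.value_range
        )

    def __sub__(self, other):
        return ToyMeasurement(
            self.value - other.to(self.unit), self.unit, self.value_range
        )

    def __lt__(self, other):
        # value_range is ignored when comparing.
        return self.value < other.to(self.unit)


a = ToyMeasurement(1.5, "kg", "<")
b = ToyMeasurement(500, "g", ">")
print(a + b)  # ToyMeasurement(value=2.0, unit='kg', value_range='<')
print(b < a)  # True
```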
55+
56+
Below is a complete example of a use case where we want to extract size, weight and BMI measurements from a simple text.
57+
2958
```python
3059
import spacy
3160

@@ -77,7 +106,7 @@ str(measurements[4]._.value.kg_per_m2)
77106

78107
## Custom measurement
79108

80-
You can declare custom measurements by changing the patterns
109+
You can declare custom measurements by changing the patterns.
81110

82111
```python
83112
import spacy
@@ -114,21 +143,24 @@ nlp.add_pipe(
114143
## Declared extensions
115144

116145
The `eds.measurements` pipeline declares a single [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object,
117-
the `value` attribute that is a `Measurement` instance.
146+
the `value` attribute that is a `SimpleMeasurement` instance.
118147

119148
## Configuration
120149

121150
The pipeline can be configured using the following parameters:
122151

123-
| Parameter | Explanation | Default |
124-
| ----------------- | -------------------------------------------------------------------------- | -------------------------------------------------------------------- |
125-
| `measurements` | A list or dict of the measurements to extract | `["eds.size", "eds.weight", "eds.angle"]` |
126-
| `units_config` | A dict describing the units with lexical patterns, dimensions, scales, ... | ... |
127-
| `number_terms` | A dict describing the textual forms of common numbers | ... |
128-
| `stopwords` | A list of stopwords that do not matter when placed between a unitless trigger | ... |
129-
| `unit_divisors` | A list of terms used to divide two units (like: m / s) | ... |
130-
| `ignore_excluded` | Whether to ignore excluded tokens for matching | `False` |
131-
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` |
152+
| Parameter | Explanation | Default |
153+
| ------------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
154+
| `measurements` | A list or dict of the measurements to extract | `None` # Extract measurements from all units |
155+
| `units_config` | A dict describing the units with lexical patterns, dimensions, scales, ... | ... # Config of mostly all commonly used units |
156+
| `number_terms` | A dict describing the textual forms of common numbers | ... # Config of mostly all commonly used textual forms of common numbers |
157+
| `value_range_terms` | A dict describing the textual forms of ranges ("<", "=" or ">") | ... # Config of mostly all commonly used range terms |
158+
| `stopwords_unitless` | A list of stopwords that do not matter when placed between a unitless trigger | `["par", "sur", "de", "a", ":", ",", "et"]` |
159+
| `stopwords_measure_unit` | A list of stopwords that do not matter when placed between a measure and a unit | `["|", "¦", "…", "."]` |
160+
| `measure_before_unit` | A bool to tell if the numerical value is usually placed before the unit | `False` |
161+
| `unit_divisors` | A list of terms used to divide two units (like: m / s) | `["/", "par"]` |
162+
| `ignore_excluded` | Whether to ignore excluded tokens for matching | `False` |
163+
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"NORM"` |
132164

133165
## Authors and citation
134166

docs/pipelines/misc/tables.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
# Tables
2+
3+
The `eds.tables` pipeline's role is to detect tables present in a medical document.
4+
We use simple regular expressions to extract table-like text.
5+
6+
## Usage
7+
8+
This pipe lets you match different forms of tables. They can have a frame or not, and rows can be spread across multiple consecutive lines (in case of bad parsing, for example). You can also indicate the presence of headers with the `col_names` and `row_names` boolean parameters.
9+
10+
Each matched table is returned as a `Span` object. You can then access an equivalent dictionary-formatted table with the `table` extension, or use `to_pd_table()` to get the equivalent pandas DataFrame. The keys of the dictionary are determined as follows:
11+
- If `col_names` is True, the dictionary keys are the names of the columns (str).
12+
- Else, if `row_names` is True, the dictionary keys are the names of the rows (str).
13+
- Otherwise, the dictionary keys are the indexes of the columns (int).
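
The key rules above can be illustrated with a short standalone sketch. `table_to_dict` is a hypothetical helper working on raw strings, whereas the `table` extension stores cells as `Span` objects. How the values are laid out under each key is an assumption made for illustration.

```python
def table_to_dict(rows, sep="¦", col_names=False, row_names=False):
    """Build a dictionary from table rows, following the key rules above."""
    cells = [[cell.strip() for cell in row.split(sep)] for row in rows]
    if col_names:
        # First row holds the column names; values are the column contents.
        header, body = cells[0], cells[1:]
        return {name: [row[i] for row in body] for i, name in enumerate(header)}
    if row_names:
        # First cell of each row is the row name; values are the other cells.
        return {row[0]: row[1:] for row in cells}
    # No header at all: keys are the column indexes.
    return {i: [row[i] for row in cells] for i in range(len(cells[0]))}


rows = [
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11",
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79",
]
print(table_to_dict(rows)[2])  # ['4.97', '4.68']
print(table_to_dict(rows, row_names=True)["Hématies"])
# ['x10*12/L', '4.68', '4.53-5.79']
```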
14+
15+
`to_pd_table()` can be customised with the `as_spans` parameter. If set to `True`, the pandas DataFrame will contain the cells as spans; otherwise it will contain the cells as raw strings.
16+
17+
```python
18+
import spacy
19+
20+
nlp = spacy.blank("fr")
21+
nlp.add_pipe("eds.normalizer")
22+
nlp.add_pipe("eds.tables")
23+
24+
text = """
25+
SERVICE
26+
MEDECINE INTENSIVE –
27+
REANIMATION
28+
Réanimation / Surveillance Continue
29+
Médicale
30+
31+
COMPTE RENDU D'HOSPITALISATION du 05/06/2020 au 10/06/2020
32+
Madame DUPONT Marie, née le 16/05/1900, âgée de 20 ans, a été hospitalisée en réanimation du
33+
05/06/1920 au 10/06/1920 pour intoxication médicamenteuse volontaire.
34+
35+
36+
Examens complémentaires
37+
Hématologie
38+
Numération
39+
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
40+
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
41+
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
42+
Hématocrite ¦% ¦44.2 ¦39.2-48.6
43+
VGM ¦fL ¦94.4 + ¦79.6-94
44+
TCMH ¦pg ¦31.6 ¦27.3-32.8
45+
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
46+
Plaquettes ¦x10*9/L ¦191 ¦172-398
47+
VMP ¦fL ¦11.5 + ¦7.4-10.8
48+
49+
Sur le plan neurologique : Devant la persistance d'une confusion à distance de l'intoxication au
50+
...
51+
52+
2/2Pat : <NOM> <Prenom>|F |<date> | <ipp> |Intitulé RCP
53+
54+
"""
55+
56+
doc = nlp(text)
57+
58+
# A table span
59+
table = doc.spans["tables"][0]
60+
# Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
61+
# Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
62+
# Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
63+
# Hématocrite ¦% ¦44.2 ¦39.2-48.6
64+
# VGM ¦fL ¦94.4 + ¦79.6-94
65+
# TCMH ¦pg ¦31.6 ¦27.3-32.8
66+
# CCMH ¦g/dL ¦33.5 ¦32.4-36.3
67+
# Plaquettes ¦x10*9/L ¦191 ¦172-398
68+
# VMP ¦fL ¦11.5 + ¦7.4-10.8
69+
70+
# Convert span to Pandas table
71+
df = table._.to_pd_table(as_spans=False)
72+
type(df)
73+
# >> pandas.core.frame.DataFrame
74+
```
75+
The resulting pandas DataFrame:
76+
| | 0 | 1 | 2 | 3 |
77+
| ---: | :---------- | :------- | :----- | :-------- |
78+
| 0 | Leucocytes | x10*9/L | 4.97 | 4.09-11 |
79+
| 1 | Hématies | x10*12/L | 4.68 | 4.53-5.79 |
80+
| 2 | Hémoglobine | g/dL | 14.8 | 13.4-16.7 |
81+
| 3 | Hématocrite | % | 44.2 | 39.2-48.6 |
82+
| 4 | VGM | fL | 94.4 + | 79.6-94 |
83+
| 5 | TCMH | pg | 31.6 | 27.3-32.8 |
84+
| 6 | CCMH | g/dL | 33.5 | 32.4-36.3 |
85+
| 7 | Plaquettes | x10*9/L | 191 | 172-398 |
86+
| 8 | VMP | fL | 11.5 + | 7.4-10.8 |
87+
88+
## Declared extensions
89+
90+
The `eds.tables` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object. The first one is the `to_pd_table()` method, which returns a parsed pandas version of the table. The second one is `table`, which contains the table stored as a dictionary containing cells as `Span` objects.
91+
92+
## Configuration
93+
94+
The pipeline can be configured using the following parameters:
95+
96+
| Parameter | Explanation | Default |
97+
| ----------------- | ------------------------------------------------ | ---------------------- |
98+
| `tables_pattern` | Pattern to identify table spans | `rf"(\b.*{sep}.*\n)+"` |
99+
| `sep_pattern` | Pattern to identify column separation | `r"¦"` |
100+
| `ignore_excluded` | Ignore excluded tokens | `True` |
101+
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"TEXT"` |
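
To see what the default patterns match, here is a minimal sketch using only the standard `re` module. It assumes that `{sep}` in `tables_pattern` is filled with the value of `sep_pattern`. The sample text is a shortened version of the example above, and the pipeline itself may apply the pattern differently (for instance on normalised text).

```python
import re

# Default values taken from the configuration table above.
sep = r"¦"
tables_pattern = rf"(\b.*{sep}.*\n)+"

text = (
    "Numération\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "\n"
    "Sur le plan neurologique\n"
)

# Each match is a run of consecutive lines that all contain the separator.
matches = [m.group(0) for m in re.finditer(tables_pattern, text)]
print(len(matches))  # 1
print(matches[0])
```

Lines without the separator, such as the `Numération` heading and the blank line, end the run, so the two rows form a single table span.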
102+
103+
## Authors and citation
104+
105+
The `eds.tables` pipeline was developed by AP-HP's Data Science team.

edsnlp/pipelines/factories.py

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@
1515
from .misc.measurements.factory import create_component as measurements
1616
from .misc.reason.factory import create_component as reason
1717
from .misc.sections.factory import create_component as sections
18+
from .misc.tables.factory import create_component as tables
1819
from .ner.adicap.factory import create_component as adicap
1920
from .ner.cim10.factory import create_component as cim10
2021
from .ner.covid.factory import create_component as covid

edsnlp/pipelines/misc/measurements/factory.py

Lines changed: 21 additions & 3 deletions
@@ -15,9 +15,15 @@
1515
ignore_excluded=True,
1616
units_config=patterns.units_config,
1717
number_terms=patterns.number_terms,
18+
value_range_terms=patterns.value_range_terms,
1819
unit_divisors=patterns.unit_divisors,
1920
measurements=None,
20-
stopwords=patterns.stopwords,
21+
stopwords_unitless=patterns.stopwords_unitless,
22+
stopwords_measure_unit=patterns.stopwords_measure_unit,
23+
measure_before_unit=False,
24+
parse_doc=True,
25+
parse_tables=True,
26+
all_measurements=True,
2127
)
2228

2329

@@ -29,7 +35,13 @@ def create_component(
2935
measurements: Optional[Union[Dict[str, MeasureConfig], List[str]]],
3036
units_config: Dict[str, UnitConfig],
3137
number_terms: Dict[str, List[str]],
32-
stopwords: List[str],
38+
value_range_terms: Dict[str, List[str]],
39+
all_measurements: bool,
40+
parse_tables: bool,
41+
parse_doc: bool,
42+
stopwords_unitless: List[str],
43+
stopwords_measure_unit: List[str],
44+
measure_before_unit: bool,
3345
unit_divisors: List[str],
3446
ignore_excluded: bool,
3547
attr: str,
@@ -38,9 +50,15 @@ def create_component(
3850
nlp,
3951
units_config=units_config,
4052
number_terms=number_terms,
53+
value_range_terms=value_range_terms,
54+
all_measurements=all_measurements,
55+
parse_tables=parse_tables,
56+
parse_doc=parse_doc,
4157
unit_divisors=unit_divisors,
4258
measurements=measurements,
43-
stopwords=stopwords,
59+
stopwords_unitless=stopwords_unitless,
60+
stopwords_measure_unit=stopwords_measure_unit,
61+
measure_before_unit=measure_before_unit,
4462
attr=attr,
4563
ignore_excluded=ignore_excluded,
4664
)
