
Commit 07871ed

Author: Christel Gérardin
Commit message: Bug fixes, table pipe is more robust, Updated doc
1 parent fed2a02 commit 07871ed

File tree

5 files changed: +172 -18 lines

docs/pipelines/misc/measurements.md

Lines changed: 6 additions & 3 deletions
```diff
@@ -8,14 +8,15 @@ We use simple regular expressions to extract and normalize measurements, and use
 By default, the `eds.measurements` pipeline lets you match all measurements, i.e. measurements in most units as well as unitless measurements. If a unit is not in our register,
 then you can add it manually. Otherwise, the measurement will be matched without its unit.

-If you prefer to match specific measurements only, you can create your own measurement config. Nevertheless, some default measurement configs are already provided out of the box:
+If you prefer to match specific measurements only, you can create your own measurement config and set the `all_measurements` parameter to `False`. Nevertheless, some default measurement configs are already provided out of the box:

 | Measurement name | Example                |
 | ---------------- | ---------------------- |
 | `eds.size`       | `1m50`, `1.50m`        |
 | `eds.weight`     | `12kg`, `1kg300`       |
 | `eds.bmi`        | `BMI: 24`, `24 kg.m-2` |
 | `eds.volume`     | `2 cac`, `8ml`         |
+| `eds.bool`       | `positive`, `negatif`  |

 The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit (e.g. `span._.value.g_per_cl`, or `span._.value.kg_per_m3` for a density).
```

```diff
@@ -25,15 +26,17 @@ The measurements that can be extracted can have one or many of the following cha
 - Measurements with range indication (especially < or >)
 - Measurements with power

-The measurement can be written in many coplex forms. Among them, this pipe can detect:
+The measurement can be written in many complex forms. Among them, this pipe can detect:
 - Measurements with range indication, numerical value, power and units in many different orders, separated by customizable stop words
 - Composed units (e.g. `1m50`)
 - Measurements with "unitless patterns", i.e. some textual information next to a numerical value which lets us retrieve a unit even when it is not written (e.g. in the text `Height: 80`, this pipe will detect the numerical value `80` and match it to the unit `kg`)
 - Elliptic enumerations (e.g. `32, 33 et 34mol`) of measurements of the same type, splitting the measurements accordingly

 ## Usage

-The matched measurements are labelled with `eds.measurement` by default. However, if you are only creating your own measurement or using a predefined one, your measurements will be labelled with the name of this measurement (e.g. `eds.weight`).
+This pipe works best together with the `eds.dates` and `eds.tables` pipes. These pipes let `eds.measurements` avoid matching dates as measurements and perform a specific matching within each table, benefiting from the structured data.
+
+The matched measurements are labelled with the name of the default measurement they match when one applies (e.g. `eds.size`); otherwise, when `all_measurements` is set to `True` and the measure can be linked to the dimension of its unit, they are labelled `eds.measurement`.

 As said before, each matched measurement can be accessed via `span._.value`. This gives you a `SimpleMeasurement` object with the following attributes:
 - `value_range` ("<", "=" or ">")
```
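The attribute-based conversion shown above (`span._.value.g_per_cl`, `span._.value.kg_per_m3`) can be pictured with a small standalone sketch. This illustrates the `__getattr__` idea only; it is not edsnlp's actual `SimpleMeasurement` implementation, and the `SimpleValue` class, its units and conversion factors are all made up for the example.

```python
# Illustrative sketch only: a value object that converts units on the
# fly when an attribute named after a unit is accessed.
CONVERSIONS = {
    # factors relative to a base unit (grams) -- hypothetical values
    "g": 1.0,
    "kg": 1000.0,
    "mg": 0.001,
}


class SimpleValue:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __getattr__(self, target_unit):
        # Only called for attributes not found the normal way: treat the
        # attribute name as a target unit and convert.
        if target_unit not in CONVERSIONS:
            raise AttributeError(target_unit)
        return self.value * CONVERSIONS[self.unit] / CONVERSIONS[target_unit]


weight = SimpleValue(1.5, "kg")
print(weight.g)  # 1500.0
```

Looking up the attribute name in a conversion table is what makes `value.g` or `value.kg` work without declaring one property per unit.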

docs/pipelines/misc/tables.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@

# Tables

The `eds.tables` pipeline's role is to detect tables present in a medical document.
We use simple regular expressions to extract table-like text.

## Usage

This pipe lets you match different forms of tables. They may or may not have a frame, and rows can be spread over multiple consecutive lines (in case of bad parsing, for example). You can also indicate the presence of headers with the `col_names` and `row_names` boolean parameters.

Each matched table is returned as a `Span` object. You can then access an equivalent dictionary-formatted table with the `table` extension, or use `to_pd_table()` to get the equivalent pandas DataFrame. The keys of the dictionary are determined as follows:

- If `col_names` is `True`, the dictionary keys are the names of the columns (str).
- Otherwise, if `row_names` is `True`, the dictionary keys are the row names (str).
- Otherwise, the dictionary keys are the indexes of the columns (int).

`to_pd_table()` can be customised with the `as_spans` parameter. If set to `True`, the pandas DataFrame will contain the cells as spans; otherwise it will contain the cells as raw strings.
```python
import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.tables")

text = """
SERVICE
MEDECINE INTENSIVE –
REANIMATION
Réanimation / Surveillance Continue
Médicale

COMPTE RENDU D'HOSPITALISATION du 05/06/2020 au 10/06/2020
Madame DUPONT Marie, née le 16/05/1900, âgée de 20 ans, a été hospitalisée en réanimation du
05/06/1920 au 10/06/1920 pour intoxication médicamenteuse volontaire.


Examens complémentaires
Hématologie
Numération
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
Hématocrite ¦% ¦44.2 ¦39.2-48.6
VGM ¦fL ¦94.4 + ¦79.6-94
TCMH ¦pg ¦31.6 ¦27.3-32.8
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
Plaquettes ¦x10*9/L ¦191 ¦172-398
VMP ¦fL ¦11.5 + ¦7.4-10.8

Sur le plan neurologique : Devant la persistance d'une confusion à distance de l'intoxication au
...

2/2Pat : <NOM> <Prenom>|F |<date> | <ipp> |Intitulé RCP

"""

doc = nlp(text)

# A table span
table = doc.spans["tables"][0]
# Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
# Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
# Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
# Hématocrite ¦% ¦44.2 ¦39.2-48.6
# VGM ¦fL ¦94.4 + ¦79.6-94
# TCMH ¦pg ¦31.6 ¦27.3-32.8
# CCMH ¦g/dL ¦33.5 ¦32.4-36.3
# Plaquettes ¦x10*9/L ¦191 ¦172-398
# VMP ¦fL ¦11.5 + ¦7.4-10.8

# Convert the span to a pandas DataFrame
df = table._.to_pd_table(as_spans=False)
type(df)
# >> pandas.core.frame.DataFrame
```
The pandas DataFrame:

|      | 0           | 1        | 2      | 3         |
| ---: | :---------- | :------- | :----- | :-------- |
|    0 | Leucocytes  | x10*9/L  | 4.97   | 4.09-11   |
|    1 | Hématies    | x10*12/L | 4.68   | 4.53-5.79 |
|    2 | Hémoglobine | g/dL     | 14.8   | 13.4-16.7 |
|    3 | Hématocrite | %        | 44.2   | 39.2-48.6 |
|    4 | VGM         | fL       | 94.4 + | 79.6-94   |
|    5 | TCMH        | pg       | 31.6   | 27.3-32.8 |
|    6 | CCMH        | g/dL     | 33.5   | 32.4-36.3 |
|    7 | Plaquettes  | x10*9/L  | 191    | 172-398   |
|    8 | VMP         | fL       | 11.5 + | 7.4-10.8  |
## Declared extensions

The `eds.tables` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object. The first one is the `to_pd_table()` method, which returns a parsed pandas version of the table. The second one is `table`, which stores the table as a dictionary containing the cells as `Span` objects.

## Configuration

The pipeline can be configured using the following parameters:

| Parameter         | Explanation                                       | Default                |
| ----------------- | ------------------------------------------------- | ---------------------- |
| `tables_pattern`  | Pattern to identify table spans                   | `rf"(\b.*{sep}.*\n)+"` |
| `sep_pattern`     | Pattern to identify column separation             | `r"¦"`                 |
| `ignore_excluded` | Ignore excluded tokens                            | `True`                 |
| `attr`            | spaCy attribute to match on, e.g. `NORM` or `TEXT` | `"TEXT"`               |

## Authors and citation

The `eds.tables` pipeline was developed by AP-HP's Data Science team.
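The key-selection rules listed in the Usage section can be sketched with plain Python, using strings instead of spaCy spans. `table_to_dict` below is a hypothetical helper written for this illustration; it is not part of the `eds.tables` API.

```python
# Sketch of the dictionary-key rules: column names, row names, or
# integer column indexes, depending on the header flags.
def table_to_dict(rows, col_names=False, row_names=False):
    if col_names:
        # First row holds the column names.
        header, body = rows[0], rows[1:]
        return {name: [row[i] for row in body] for i, name in enumerate(header)}
    if row_names:
        # First cell of each row holds the row name.
        return {row[0]: row[1:] for row in rows}
    # Fall back to integer column indexes.
    return {i: [row[i] for row in rows] for i in range(len(rows[0]))}


rows = [
    ["Leucocytes", "x10*9/L", "4.97", "4.09-11"],
    ["Hématies", "x10*12/L", "4.68", "4.53-5.79"],
]
print(sorted(table_to_dict(rows)))  # integer keys: [0, 1, 2, 3]
print(list(table_to_dict(rows, row_names=True)))  # ['Leucocytes', 'Hématies']
```

The same cells thus end up keyed three different ways depending on which headers the table declares.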

edsnlp/pipelines/misc/tables/factory.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,4 +1,4 @@
-from typing import Dict, List, Optional, Union
+from typing import List, Optional

 from spacy.language import Language

@@ -20,8 +20,8 @@
 def create_component(
     nlp: Language,
     name: str,
-    tables_pattern: Optional[Dict[str, Union[List[str], str]]],
-    sep_pattern: Optional[str],
+    tables_pattern: Optional[List[str]],
+    sep_pattern: Optional[List[str]],
     attr: str,
     ignore_excluded: bool,
     col_names: Optional[bool] = False,
```
Lines changed: 2 additions & 2 deletions
```diff
@@ -1,2 +1,2 @@
-sep = r"¦"
-regex = rf"(?:{sep}?(?:[^{sep}\n]*{sep})+[^{sep}\n]*{sep}?\n)+"
+sep = [r"¦", r"|"]
+regex = [r"(?:¦?(?:[^¦\n]*¦)+[^¦\n]*¦?\n)+", r"(?:\|?(?:[^\|\n]*\|)+[^\|\n]*\|?\n)+"]
```
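The two committed patterns can be sanity-checked directly with Python's `re` module. The snippet below is a quick standalone check using the regexes exactly as committed; the sample text is made up.

```python
import re

# Table patterns as committed: one for the "¦" separator, one for "|".
regex = [
    r"(?:¦?(?:[^¦\n]*¦)+[^¦\n]*¦?\n)+",
    r"(?:\|?(?:[^\|\n]*\|)+[^\|\n]*\|?\n)+",
]

text = (
    "Examens complémentaires\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "Conclusion\n"
)

# The "¦" pattern grabs the two consecutive table rows as one block and
# skips the surrounding prose lines, which contain no separator.
block = re.search(regex[0], text).group()
print(block.count("\n"))  # 2

# The "|" pattern does the same for pipe-separated rows.
print(bool(re.fullmatch(regex[1], "A|B|C\nD|E|F\n")))  # True
```

Each pattern requires at least one separator per line, so consecutive separator-bearing lines are captured as a single table span.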

edsnlp/pipelines/misc/tables/tables.py

Lines changed: 56 additions & 10 deletions
```diff
@@ -1,4 +1,4 @@
-from typing import Dict, Optional, Union
+from typing import List, Optional

 import pandas as pd
 from spacy.language import Language
@@ -19,10 +19,10 @@ class TablesMatcher:
     ----------
     nlp : Language
         spaCy nlp pipeline to use for matching.
-    tables_pattern : Optional[str]
-        The regex pattern to identify tables.
-    sep_pattern : Optional[str]
-        The regex pattern to identify separators
+    tables_pattern : Optional[List[str]]
+        The regex patterns to identify tables.
+    sep_pattern : Optional[List[str]]
+        The regex patterns to identify separators
         in the detected tables
     col_names : Optional[bool]
         Whether the tables_pattern matches column names
@@ -39,9 +39,9 @@ class TablesMatcher:
     def __init__(
         self,
         nlp: Language,
-        tables_pattern: Optional[str],
-        sep_pattern: Optional[str],
-        attr: Union[Dict[str, str], str],
+        tables_pattern: Optional[List[str]],
+        sep_pattern: Optional[List[str]],
+        attr: str,
         ignore_excluded: bool,
         col_names: Optional[bool] = False,
         row_names: Optional[bool] = False,
@@ -54,7 +54,7 @@ def __init__(
             sep_pattern = patterns.sep

         self.regex_matcher = RegexMatcher(attr=attr, ignore_excluded=True)
-        self.regex_matcher.add("table", [tables_pattern])
+        self.regex_matcher.add("table", tables_pattern)

         self.term_matcher = EDSPhraseMatcher(nlp.vocab, attr=attr, ignore_excluded=True)
         self.term_matcher.build_patterns(
```
```diff
@@ -138,7 +138,53 @@ def get_tables(self, matches):
             if all(row[-1].start == row[-1].end for row in processed_table):
                 processed_table = [row[:-1] for row in processed_table]

-            tables_list.append(processed_table)
+            # Check if all rows have the same dimension.
+            # If not, try to merge neighbour rows
+            # to find a new table
+            row_len = len(processed_table[0])
+            if not all(len(row) == row_len for row in processed_table):
+
+                # Method to find all possible lengths of the rows
+                def divisors(n):
+                    result = set()
+                    for i in range(1, int(n**0.5) + 1):
+                        if n % i == 0:
+                            result.add(i)
+                            result.add(n // i)
+                    return sorted(list(result))
+
+                if self.col_names:
+                    n_rows = len(processed_table) - 1
+                else:
+                    n_rows = len(processed_table)
+
+                for n_rows_to_merge in divisors(n_rows):
+                    row_len = sum(len(row) for row in processed_table[:n_rows_to_merge])
+                    if all(
+                        sum(
+                            len(row)
+                            for row in processed_table[
+                                i * n_rows_to_merge : (i + 1) * n_rows_to_merge
+                            ]
+                        )
+                        == row_len
+                        for i in range(n_rows // n_rows_to_merge)
+                    ):
+                        processed_table = [
+                            [
+                                cell
+                                for subrow in processed_table[
+                                    i * n_rows_to_merge : (i + 1) * n_rows_to_merge
+                                ]
+                                for cell in subrow
+                            ]
+                            for i in range(n_rows // n_rows_to_merge)
+                        ]
+                        tables_list.append(processed_table)
+                        break
+                continue
+            else:
+                tables_list.append(processed_table)

             # Convert to dictionnaries according to self.col_names
             # and self.row_names
```
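The row-merging heuristic added in this hunk can be exercised on its own. The sketch below mirrors the committed algorithm on plain lists of strings instead of spaCy spans; `merge_rows` is an illustrative wrapper written for this example, not a function from the codebase.

```python
# When rows of a parsed table have inconsistent lengths (e.g. one
# logical row was split across several physical lines), try merging
# groups of k consecutive rows, for each divisor k of the row count,
# until every merged row has the same length.
def divisors(n):
    result = set()
    for i in range(1, int(n**0.5) + 1):
        if n % i == 0:
            result.add(i)
            result.add(n // i)
    return sorted(result)


def merge_rows(table):
    n_rows = len(table)
    for k in divisors(n_rows):
        target = sum(len(row) for row in table[:k])
        if all(
            sum(len(row) for row in table[i * k : (i + 1) * k]) == target
            for i in range(n_rows // k)
        ):
            # Flatten each group of k consecutive rows into one row.
            return [
                [cell for subrow in table[i * k : (i + 1) * k] for cell in subrow]
                for i in range(n_rows // k)
            ]
    return table


# Each logical 4-cell row was split into a 3-cell and a 1-cell line:
broken = [["a", "b", "c"], ["d"], ["e", "f", "g"], ["h"]]
print(merge_rows(broken))  # [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
```

Because k = 1 is always tried first, a table whose rows already share the same length is returned unchanged.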
