|
| 1 | +# Tables |
| 2 | + |
| 3 | +The `eds.tables` pipeline detects tables present in a medical document. |
| 4 | +It uses simple regular expressions to extract table-like text. |
| 5 | + |
| 6 | +## Usage |
| 7 | + |
| 8 | +This pipe matches several forms of tables: they can be framed or unframed, and rows can be spread over multiple consecutive lines (for example, after a faulty parsing). You can also indicate the presence of headers with the `col_names` and `row_names` boolean parameters. |
| 9 | + |
| 10 | +Each matched table is returned as a `Span` object. You can then access an equivalent dictionary-formatted table through the `table` extension, or call `to_pd_table()` to get the equivalent pandas DataFrame. The keys of the dictionary are determined as follows: |
| 11 | +- If `col_names` is True, the dictionary keys are the names of the columns (str). |
| 12 | +- Else if `row_names` is True, the dictionary keys are the names of the rows (str). |
| 13 | +- Otherwise, the dictionary keys are the indexes of the columns (int). |
| 14 | + |
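The key rules above can be illustrated with a small, library-free sketch. The helper `table_to_dict` is hypothetical and not part of `eds.tables`; the layout of the dictionary values (one list of cells per key) is also an assumption made for illustration, and cells are plain strings here rather than `Span` objects.

```python
from typing import Dict, List, Union


def table_to_dict(
    rows: List[List[str]],
    col_names: bool = False,
    row_names: bool = False,
) -> Dict[Union[str, int], List[str]]:
    """Hypothetical helper mirroring the key rules described above.

    Assumes a rectangular table (every row has the same number of cells).
    """
    if col_names:
        # The first row is a header: keys are the column names.
        header, *body = rows
        return {name: [row[i] for row in body] for i, name in enumerate(header)}
    if row_names:
        # The first cell of each row is its name: keys are the row names.
        return {row[0]: row[1:] for row in rows}
    # No header at all: keys are the integer indexes of the columns.
    return {i: [row[i] for row in rows] for i in range(len(rows[0]))}


rows = [
    ["Leucocytes", "x10*9/L", "4.97"],
    ["Hématies", "x10*12/L", "4.68"],
]
by_index = table_to_dict(rows)
by_row = table_to_dict(rows, row_names=True)
```

With `row_names=True`, the keys of `by_row` are `"Leucocytes"` and `"Hématies"`; without any header flag, the keys of `by_index` are `0`, `1` and `2`.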
| 15 | +`to_pd_table()` can be customised with the `as_spans` parameter. If set to `True`, the pandas DataFrame contains the cells as spans; otherwise it contains the cells as raw strings. |
| 16 | + |
| 17 | +```python |
| 18 | +import spacy |
| 19 | + |
| 20 | +nlp = spacy.blank("fr") |
| 21 | +nlp.add_pipe("eds.normalizer") |
| 22 | +nlp.add_pipe("eds.tables") |
| 23 | + |
| 24 | +text = """ |
| 25 | +SERVICE |
| 26 | +MEDECINE INTENSIVE – |
| 27 | +REANIMATION |
| 28 | +Réanimation / Surveillance Continue |
| 29 | +Médicale |
| 30 | +
|
| 31 | +COMPTE RENDU D'HOSPITALISATION du 05/06/2020 au 10/06/2020 |
| 32 | +Madame DUPONT Marie, née le 16/05/1900, âgée de 20 ans, a été hospitalisée en réanimation du |
| 33 | +05/06/1920 au 10/06/1920 pour intoxication médicamenteuse volontaire. |
| 34 | +
|
| 35 | +
|
| 36 | +Examens complémentaires |
| 37 | +Hématologie |
| 38 | +Numération |
| 39 | +Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11 |
| 40 | +Hématies ¦x10*12/L¦4.68 ¦4.53-5.79 |
| 41 | +Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7 |
| 42 | +Hématocrite ¦% ¦44.2 ¦39.2-48.6 |
| 43 | +VGM ¦fL ¦94.4 + ¦79.6-94 |
| 44 | +TCMH ¦pg ¦31.6 ¦27.3-32.8 |
| 45 | +CCMH ¦g/dL ¦33.5 ¦32.4-36.3 |
| 46 | +Plaquettes ¦x10*9/L ¦191 ¦172-398 |
| 47 | +VMP ¦fL ¦11.5 + ¦7.4-10.8 |
| 48 | +
|
| 49 | +Sur le plan neurologique : Devant la persistance d'une confusion à distance de l'intoxication au |
| 50 | +... |
| 51 | +
|
| 52 | +2/2Pat : <NOM> <Prenom>|F |<date> | <ipp> |Intitulé RCP |
| 53 | +
|
| 54 | +""" |
| 55 | + |
| 56 | +doc = nlp(text) |
| 57 | + |
| 58 | +# A table span |
| 59 | +table = doc.spans["tables"][0] |
| 60 | +# Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11 |
| 61 | +# Hématies ¦x10*12/L¦4.68 ¦4.53-5.79 |
| 62 | +# Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7 |
| 63 | +# Hématocrite ¦% ¦44.2 ¦39.2-48.6 |
| 64 | +# VGM ¦fL ¦94.4 + ¦79.6-94 |
| 65 | +# TCMH ¦pg ¦31.6 ¦27.3-32.8 |
| 66 | +# CCMH ¦g/dL ¦33.5 ¦32.4-36.3 |
| 67 | +# Plaquettes ¦x10*9/L ¦191 ¦172-398 |
| 68 | +# VMP ¦fL ¦11.5 + ¦7.4-10.8 |
| 69 | + |
| 70 | +# Convert span to Pandas table |
| 71 | +df = table._.to_pd_table(as_spans=False) |
| 72 | +type(df) |
| 73 | +# >> pandas.core.frame.DataFrame |
| 74 | +``` |
| 75 | +The resulting pandas DataFrame: |
| 76 | +| | 0 | 1 | 2 | 3 | |
| 77 | +| ---: | :---------- | :------- | :----- | :-------- | |
| 78 | +| 0 | Leucocytes | x10*9/L | 4.97 | 4.09-11 | |
| 79 | +| 1 | Hématies | x10*12/L | 4.68 | 4.53-5.79 | |
| 80 | +| 2 | Hémoglobine | g/dL | 14.8 | 13.4-16.7 | |
| 81 | +| 3 | Hématocrite | % | 44.2 | 39.2-48.6 | |
| 82 | +| 4 | VGM | fL | 94.4 + | 79.6-94 | |
| 83 | +| 5 | TCMH | pg | 31.6 | 27.3-32.8 | |
| 84 | +| 6 | CCMH | g/dL | 33.5 | 32.4-36.3 | |
| 85 | +| 7 | Plaquettes | x10*9/L | 191 | 172-398 | |
| 86 | +| 8 | VMP | fL | 11.5 + | 7.4-10.8 | |
| 87 | + |
| 88 | +## Declared extensions |
| 89 | + |
| 90 | +The `eds.tables` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object. The first one is the `to_pd_table()` method, which returns a parsed pandas version of the table. The second one is `table`, which contains the table stored as a dictionary whose cells are `Span` objects. |
| 91 | + |
| 92 | +## Configuration |
| 93 | + |
| 94 | +The pipeline can be configured using the following parameters: |
| 95 | + |
| 96 | +| Parameter | Explanation | Default | |
| 97 | +| ----------------- | ------------------------------------------------ | ---------------------- | |
| 98 | +| `tables_pattern` | Pattern to identify table spans | `rf"(\b.*{sep}.*\n)+"` | |
| 99 | +| `sep_pattern` | Pattern to identify column separation | `r"¦"` | |
| 100 | +| `ignore_excluded` | Ignore excluded tokens | `True` | |
| 101 | +| `attr` | spaCy attribute to match on, e.g. `NORM` or `TEXT` | `"TEXT"` | |
| 102 | + |
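To get a feel for the two default patterns in the table above, here is a minimal sketch that applies them to a raw string with Python's standard `re` module. It only illustrates what the patterns capture: the pipeline itself matches on a spaCy `Doc` (honouring `attr` and `ignore_excluded`), which this sketch does not reproduce, and the cell-splitting step is an assumption about how rows could be broken into cells.

```python
import re

# Default values taken from the configuration table above.
sep_pattern = r"¦"
tables_pattern = rf"(\b.*{sep_pattern}.*\n)+"

text = (
    "Numération\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "\n"
    "Sur le plan neurologique\n"
)

# The pattern captures consecutive lines that each contain a separator.
match = re.search(tables_pattern, text)
table_text = match.group(0)

# Illustrative post-processing: split each line on the separator, trim cells.
rows = [
    [cell.strip() for cell in re.split(sep_pattern, line)]
    for line in table_text.splitlines()
]
```

Here `table_text` holds only the two separator-delimited lines, and `rows` holds one list of four cells per line. Lines without a `¦` separator, such as the "Numération" title, are left out of the match.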
| 103 | +## Authors and citation |
| 104 | + |
| 105 | +The `eds.tables` pipeline was developed by AP-HP's Data Science team. |