aphp
diff --git a/docs/pipelines/misc/measurements.md b/docs/pipelines/misc/measurements.md
Lines changed: 54 additions & 22 deletions
diff --git a/docs/pipelines/misc/tables.md b/docs/pipelines/misc/tables.md
Lines changed: 105 additions & 0 deletions
diff --git a/edsnlp/pipelines/factories.py b/edsnlp/pipelines/factories.py
Lines changed: 1 addition & 0 deletions
diff --git a/edsnlp/pipelines/misc/measurements/factory.py b/edsnlp/pipelines/misc/measurements/factory.py
Lines changed: 21 additions & 3 deletions
@@ -1,31 +1,60 @@
 # Measurements
 
 The `eds.measurements` pipeline's role is to detect and normalise numerical measurements within a medical document.
-We use simple regular expressions to extract and normalize measurements, and use `Measurement` classes to store them.
-
-!!! warning
-
-    The ``measurements`` pipeline is still in active development and has not been rigorously validated.
-    If you come across a measurement expression that goes undetected, please file an issue !
+We use simple regular expressions to extract and normalize measurements, and use `SimpleMeasurement` classes to store them.
 
 ## Scope
 
-The `eds.measurements` pipeline can extract simple (eg `3cm`) measurements.
-It can detect elliptic enumerations (eg `32, 33 et 34kg`) of measurements of the same type and split the measurements accordingly.
+By default, the `eds.measurements` pipeline matches all measurements, i.e. measurements in most units as well as unitless measurements. If a unit is not in our registry,
+you can add it manually; otherwise, the measurement will be matched without its unit.
 
-The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit.
-
-The current pipeline annotates the following measurements out of the box:
+If you prefer to match specific measurements only, you can create your own measurement config and set the `all_measurements` parameter to `False`. Some default measurement configs are provided out of the box:
 
 | Measurement name | Example                |
 | ---------------- | ---------------------- |
 | `eds.size`       | `1m50`, `1.50m`        |
 | `eds.weight`     | `12kg`, `1kg300`       |
 | `eds.bmi`        | `BMI: 24`, `24 kg.m-2` |
 | `eds.volume`     | `2 cac`, `8ml`         |
+| `eds.bool`       | `positive`, `negatif`  |
+
+The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit (eg `span._.value.g_per_cl` or `span._.value.kg_per_m3` for a density).
+
+The measurements that can be extracted can have one or many of the following characteristics:
+- Unitless measurements
+- Measurements with unit
+- Measurements with range indication (especially < or >)
+- Measurements with power
+
+The measurement can be written in many complex forms. Among them, this pipe can detect:
+- Measurements with range indication, numerical value, power and units in many different orders and separated by customizable stop words
+- Composed units (eg `1m50`)
+- Measurements with "unitless patterns", i.e. some textual information next to a numerical value which allows us to retrieve a unit even if it is not written (eg in the text `Weight: 80`, this pipe will detect the numerical value `80` and match it to the unit `kg`)
+- Elliptic enumerations (eg `32, 33 et 34mol`) of measurements of the same type, which are split into separate measurements
 
 ## Usage
 
+This pipe works better when used together with the `eds.dates` and `eds.tables` pipes. They let `eds.measurements` avoid matching dates as measurements and apply a specific matching strategy to each table, taking advantage of the structured data.
+
+Matched measurements are labeled with a default measurement name when one is available (eg `eds.size`). Otherwise, if no measurement config is linked to the dimension of the matched unit and `all_measurements` is set to `True`, they are labeled `eds.measurement`.
+
+As mentioned above, each matched measurement can be accessed via `span._.value`. This gives you a `SimpleMeasurement` object with the following attributes:
+- `value_range` ("<", "=" or ">")
+- `value`
+- `unit`
+- `registry` (stores the entire unit config, such as the links between units and their dimensions like `length`, `quantity of matter`...)
+
+`SimpleMeasurement` objects are especially useful for converting measurements to another unit of the same dimension (eg densities stay densities). To do so, access an attribute on the `SimpleMeasurement` named after the usual unit abbreviations, with `per` and `_` as separators (eg `object.kg_per_dm3`, `object.mol_per_l`, `object.g_per_cm2`).
+
+For now, `SimpleMeasurement` objects also support the following operations:
+- comparison with another `SimpleMeasurement` object of the same dimension, with automatic conversion (eg a density in kg_per_m3 and a density in g_per_l)
+- addition of another `SimpleMeasurement` object of the same dimension, with automatic conversion
+- subtraction of another `SimpleMeasurement` object of the same dimension, with automatic conversion
+
+Note that for all operations listed above, differing `value_range` attributes between the two measurements do not matter: by default, the `value_range` of the first measurement is kept.
+
+Below is a complete example of a use case where we want to extract size, weight and BMI measurements from a simple text.
+
 ```python
 import spacy
 
@@ -77,7 +106,7 @@ str(measurements[4]._.value.kg_per_m2)
 
 ## Custom measurement
 
-You can declare custom measurements by changing the patterns
+You can declare custom measurements by changing the patterns.
 
 ```python
 import spacy
@@ -114,21 +143,24 @@ nlp.add_pipe(
 ## Declared extensions
 
 The `eds.measurements` pipeline declares a single [spaCy extension](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object,
-the `value` attribute that is a `Measurement` instance.
+the `value` attribute that is a `SimpleMeasurement` instance.
 
 ## Configuration
 
 The pipeline can be configured using the following parameters :
 
-| Parameter         | Explanation                                                                    | Default                                                              |
-| ----------------- | --------------------------------------------------------------------------     | -------------------------------------------------------------------- |
-| `measurements`    | A list or dict of the measurements to extract                                  | `["eds.size", "eds.weight", "eds.angle"]` |
-| `units_config`    | A dict describing the units with lexical patterns, dimensions, scales, ...     | ... |
-| `number_terms`    | A dict describing the textual forms of common numbers                          | ... |
-| `stopwords`       | A list of stopwords that do not matter when placed between a unitless trigger  | ... |
-| `unit_divisors`   | A list of terms used to divide two units (like: m / s)                         | ... |
-| `ignore_excluded` | Whether to ignore excluded tokens for matching                                 | `False`                                                              |
-| `attr`            | spaCy attribute to match on, eg `NORM` or `TEXT`                               | `"NORM"`                                                             |
+| Parameter                | Explanation                                                                      | Default                                                                   |
+| ------------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
+| `measurements`           | A list or dict of the measurements to extract                                    | `None` # Extract measurements from all units                              |
+| `units_config`           | A dict describing the units with lexical patterns, dimensions, scales, ...       | ... # Config of mostly all commonly used units                            |
+| `number_terms`           | A dict describing the textual forms of common numbers                            | ... # Config of mostly all commonly used textual forms of common numbers  |
+| `value_range_terms`      | A dict describing the textual forms of ranges ("<", "=" or ">")                  | ... # Config of mostly all commonly used range terms                      |
+| `stopwords_unitless`     | A list of stopwords that do not matter when placed between a unitless trigger    | `["par", "sur", "de", "a", ":", ",", "et"]`                               |
+| `stopwords_measure_unit` | A list of stopwords that do not matter when placed between a measure and a unit  | `["|", "¦", "…", "."]`                                                    |
+| `measure_before_unit`    | A bool to tell if the numerical value is usually placed before the unit          | `False`                                                                   |
+| `unit_divisors`          | A list of terms used to divide two units (like: m / s)                           | `["/", "par"]`                                                            |
+| `ignore_excluded`        | Whether to ignore excluded tokens for matching                                   | `False`                                                                   |
+| `attr`                   | spaCy attribute to match on, eg `NORM` or `TEXT`                                 | `"NORM"`                                                                  |
 
 ## Authors and citation
 
 
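The conversion, comparison and arithmetic behaviour that the updated `measurements.md` describes for `SimpleMeasurement` can be illustrated with a minimal, self-contained sketch. This is not the edsnlp implementation: `ToyMeasurement`, `UNITS` and the `to` method are hypothetical names, the registry is reduced to four units, and unit attributes return plain floats here, which may differ from what the library returns.

```python
from dataclasses import dataclass

# Hypothetical registry (assumption): unit -> (dimension, scale relative to the smallest unit)
UNITS = {
    "cm": ("length", 1),
    "m": ("length", 100),
    "g": ("mass", 1),
    "kg": ("mass", 1000),
}


@dataclass(frozen=True)
class ToyMeasurement:
    value_range: str  # "<", "=" or ">"
    value: float
    unit: str

    def to(self, unit: str) -> "ToyMeasurement":
        """Convert to another unit of the same dimension."""
        dimension, scale = UNITS[self.unit]
        target_dimension, target_scale = UNITS[unit]
        if dimension != target_dimension:
            raise ValueError(f"cannot convert {self.unit} to {unit}")
        return ToyMeasurement(self.value_range, self.value * scale / target_scale, unit)

    def __getattr__(self, name: str) -> float:
        # Only reached when normal lookup fails: treat unit names as conversions.
        if name in UNITS:
            return self.to(name).value
        raise AttributeError(name)

    def __add__(self, other: "ToyMeasurement") -> "ToyMeasurement":
        # The value_range of the first operand is kept, as the docs describe.
        return ToyMeasurement(self.value_range, self.value + other.to(self.unit).value, self.unit)

    def __sub__(self, other: "ToyMeasurement") -> "ToyMeasurement":
        return ToyMeasurement(self.value_range, self.value - other.to(self.unit).value, self.unit)

    def __lt__(self, other: "ToyMeasurement") -> bool:
        # Compare after converting the second operand to the first operand's unit.
        return self.value < other.to(self.unit).value


height = ToyMeasurement("<", 1.5, "m")
delta = ToyMeasurement("=", 30, "cm")
total = height + delta  # expressed in metres, value_range "<"
```

Keeping the first operand's unit and `value_range` makes the result predictable, at the cost of silently dropping the range information of the second operand.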
@@ -0,0 +1,105 @@
+# Tables
+
+The `eds.tables` pipeline's role is to detect tables present in a medical document.
+We use simple regular expressions to extract table-like text.
+
+## Usage
+
+This pipe lets you match different forms of tables. They can have a frame or not, and rows can be spread over multiple consecutive lines (for example after bad parsing). You can also indicate the presence of headers with the `col_names` and `row_names` boolean parameters.
+
+Each matched table is returned as a `Span` object. You can then access an equivalent dictionary-formatted table with the `table` extension, or use `to_pd_table()` to get the equivalent pandas DataFrame. The keys of the dictionary are determined as follows:
+- If `col_names` is True, the dictionary keys are the names of the columns (str).
+- Else if `row_names` is True, the dictionary keys are the names of the rows (str).
+- Otherwise, the dictionary keys are the indexes of the columns (int).
+
+`to_pd_table()` can be customised with the `as_spans` parameter. If set to `True`, the pandas DataFrame will contain the cells as spans; otherwise it will contain the cells as raw strings.
+
+```python
+import spacy
+
+nlp = spacy.blank("fr")
+nlp.add_pipe("eds.normalizer")
+nlp.add_pipe("eds.tables")
+
+text = """
+SERVICE
+MEDECINE INTENSIVE –
+REANIMATION
+Réanimation / Surveillance Continue
+Médicale
+
+COMPTE RENDU D'HOSPITALISATION du 05/06/2020 au 10/06/2020
+Madame DUPONT Marie, née le 16/05/1900, âgée de 20 ans, a été hospitalisée en réanimation du
+05/06/1920 au 10/06/1920 pour intoxication médicamenteuse volontaire.
+
+
+Examens complémentaires
+Hématologie
+Numération
+Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
+Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
+Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
+Hématocrite ¦% ¦44.2 ¦39.2-48.6
+VGM ¦fL ¦94.4 + ¦79.6-94
+TCMH ¦pg ¦31.6 ¦27.3-32.8
+CCMH ¦g/dL ¦33.5 ¦32.4-36.3
+Plaquettes ¦x10*9/L ¦191 ¦172-398
+VMP ¦fL ¦11.5 + ¦7.4-10.8
+
+Sur le plan neurologique : Devant la persistance d'une confusion à distance de l'intoxication au
+...
+
+2/2Pat : <NOM> <Prenom>|F |<date> | <ipp> |Intitulé RCP
+
+"""
+
+doc = nlp(text)
+
+# A table span
+table = doc.spans["tables"][0]
+# Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
+# Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
+# Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
+# Hématocrite ¦% ¦44.2 ¦39.2-48.6
+# VGM ¦fL ¦94.4 + ¦79.6-94
+# TCMH ¦pg ¦31.6 ¦27.3-32.8
+# CCMH ¦g/dL ¦33.5 ¦32.4-36.3
+# Plaquettes ¦x10*9/L ¦191 ¦172-398
+# VMP ¦fL ¦11.5 + ¦7.4-10.8
+
+# Convert span to Pandas table
+df = table._.to_pd_table(as_spans=False)
+type(df)
+# >> pandas.core.frame.DataFrame
+```
+The resulting pandas DataFrame:
+
+|      | 0           | 1        | 2      | 3         |
+| ---: | :---------- | :------- | :----- | :-------- |
+|    0 | Leucocytes  | x10*9/L  | 4.97   | 4.09-11   |
+|    1 | Hématies    | x10*12/L | 4.68   | 4.53-5.79 |
+|    2 | Hémoglobine | g/dL     | 14.8   | 13.4-16.7 |
+|    3 | Hématocrite | %        | 44.2   | 39.2-48.6 |
+|    4 | VGM         | fL       | 94.4 + | 79.6-94   |
+|    5 | TCMH        | pg       | 31.6   | 27.3-32.8 |
+|    6 | CCMH        | g/dL     | 33.5   | 32.4-36.3 |
+|    7 | Plaquettes  | x10*9/L  | 191    | 172-398   |
+|    8 | VMP         | fL       | 11.5 + | 7.4-10.8  |
+
+## Declared extensions
+
+The `eds.tables` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object. The first one is the `to_pd_table()` method, which returns a parsed pandas version of the table. The second one is `table`, which contains the table stored as a dictionary containing cells as `Span` objects.
+
+## Configuration
+
+The pipeline can be configured using the following parameters:
+
+| Parameter         | Explanation                                      | Default                |
+| ----------------- | ------------------------------------------------ | ---------------------- |
+| `tables_pattern`  | Pattern to identify table spans                  | `rf"(\b.*{sep}.*\n)+"` |
+| `sep_pattern`     | Pattern to identify column separation            | `r"¦"`                 |
+| `ignore_excluded` | Ignore excluded tokens                           | `True`                 |
+| `attr`            | spaCy attribute to match on, eg `NORM` or `TEXT` | `"TEXT"`               |
+
+## Authors and citation
+
+The `eds.tables` pipeline was developed by AP-HP's Data Science team.
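To make the default `tables_pattern` and `sep_pattern` from the configuration table more concrete, here is a rough sketch that applies them with the standard `re` module to a shortened version of the example text. It only approximates the documented behaviour: the real pipe works on a spaCy `Doc` and stores cells as `Span` objects, whereas `find_tables` and `to_column_dict` are hypothetical helpers operating on raw strings, and the dictionary layout (column index mapped to a list of cells) is an assumption based on the rule for tables without headers.

```python
import re
from typing import Dict, List

SEP_PATTERN = r"¦"  # default `sep_pattern` from the docs
TABLES_PATTERN = rf"(\b.*{SEP_PATTERN}.*\n)+"  # default `tables_pattern` from the docs


def find_tables(text: str) -> List[str]:
    """Return every block of consecutive lines that contain the separator."""
    return [match.group(0) for match in re.finditer(TABLES_PATTERN, text)]


def to_column_dict(table_text: str) -> Dict[int, List[str]]:
    """Split each row on the separator and key the cells by column index."""
    rows = [
        [cell.strip() for cell in re.split(SEP_PATTERN, line)]
        for line in table_text.splitlines()
    ]
    n_cols = min(len(row) for row in rows)  # guard against ragged rows
    return {i: [row[i] for row in rows] for i in range(n_cols)}


text = (
    "Numération\n"
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "\n"
    "Sur le plan neurologique\n"
)

tables = find_tables(text)
columns = to_column_dict(tables[0])
```

Lines without the separator, such as the `Numération` heading, fall outside the match, which is why the heading does not appear in the first column.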
@@ -15,6 +15,7 @@
 from .misc.measurements.factory import create_component as measurements
 from .misc.reason.factory import create_component as reason
 from .misc.sections.factory import create_component as sections
+from .misc.tables.factory import create_component as tables
 from .ner.adicap.factory import create_component as adicap
 from .ner.cim10.factory import create_component as cim10
 from .ner.covid.factory import create_component as covid
 
@@ -15,9 +15,15 @@
     ignore_excluded=True,
     units_config=patterns.units_config,
     number_terms=patterns.number_terms,
+    value_range_terms=patterns.value_range_terms,
     unit_divisors=patterns.unit_divisors,
     measurements=None,
-    stopwords=patterns.stopwords,
+    stopwords_unitless=patterns.stopwords_unitless,
+    stopwords_measure_unit=patterns.stopwords_measure_unit,
+    measure_before_unit=False,
+    parse_doc=True,
+    parse_tables=True,
+    all_measurements=True,
 )
 
 
@@ -29,7 +35,13 @@ def create_component(
     measurements: Optional[Union[Dict[str, MeasureConfig], List[str]]],
     units_config: Dict[str, UnitConfig],
     number_terms: Dict[str, List[str]],
-    stopwords: List[str],
+    value_range_terms: Dict[str, List[str]],
+    all_measurements: bool,
+    parse_tables: bool,
+    parse_doc: bool,
+    stopwords_unitless: List[str],
+    stopwords_measure_unit: List[str],
+    measure_before_unit: bool,
     unit_divisors: List[str],
     ignore_excluded: bool,
     attr: str,
@@ -38,9 +50,15 @@ def create_component(
         nlp,
         units_config=units_config,
         number_terms=number_terms,
+        value_range_terms=value_range_terms,
+        all_measurements=all_measurements,
+        parse_tables=parse_tables,
+        parse_doc=parse_doc,
         unit_divisors=unit_divisors,
         measurements=measurements,
-        stopwords=stopwords,
+        stopwords_unitless=stopwords_unitless,
+        stopwords_measure_unit=stopwords_measure_unit,
+        measure_before_unit=measure_before_unit,
         attr=attr,
         ignore_excluded=ignore_excluded,
     )
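The new factory arguments can be overridden like any other spaCy component setting. The fragment below is a sketch of a config that restricts matching to three of the predefined measurements, as the updated documentation suggests. The keys come from the factory signature in this diff, the measurement names come from the documentation table, and the variable name `measurements_config` is only illustrative.

```python
# Restrict matching to a few predefined measurements instead of all units.
measurements_config = {
    "measurements": ["eds.size", "eds.weight", "eds.bmi"],
    "all_measurements": False,  # factory default in this diff: True
    "parse_doc": True,  # factory default in this diff
    "parse_tables": True,  # factory default in this diff
    "measure_before_unit": False,  # factory default in this diff
}

# With edsnlp installed, the dict would be passed along these lines (not run here):
# nlp.add_pipe("eds.measurements", config=measurements_config)
```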