# Using an LLM as a span qualifier

In this tutorial, we will learn how to use the `LLMSpanClassifier` pipe to qualify spans.
You should first install the extra dependencies in a Python environment (python>=3.8):

```bash
pip install edsnlp[llm]
```

## Using a local LLM server
We assume that an LLM server compatible with the OpenAI API is available.
For example, using the vLLM library, you can launch an LLM server from the command line as follows:
```bash
vllm serve Qwen/Qwen3-8B --port 8000 --enable-prefix-caching --tensor-parallel-size 1 --max-num-seqs=10 --max-num-batched-tokens=35000
```
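
Before going further, you can check that the server answers. Here is a minimal sketch, assuming the `openai` Python client is installed and the server runs on the port used above:

```{ .python .no-check }
from openai import OpenAI

# List the models served locally to make sure the endpoint is reachable
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY_API_KEY")
print([model.id for model in client.models.list()])
```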

## Using an external API
You can also use the [OpenAI API](https://openai.com/index/openai-api/) or the [Groq API](https://groq.com/).

!!! warning

    As you are probably working with sensitive medical data, please check whether you can use an external API or whether you need to expose an API within your own infrastructure.

## Import dependencies
```{ .python .no-check }
from datetime import datetime

import pandas as pd

import edsnlp
import edsnlp.pipes as eds
from edsnlp.pipes.qualifiers.llm.llm_qualifier import LLMSpanClassifier
from edsnlp.utils.span_getters import make_span_context_getter
```
## Define prompt and examples
```{ .python .no-check }
task_prompts = {
    0: {
        "normalized_task_name": "biopsy_procedure",
        "system_prompt": "You are a medical assistant and you will help answering questions about dates present in clinical notes. Don't answer reasoning. "
        + "We are interested in detecting biopsy dates (either procedure, analysis or result). "
        + "You should answer in a JSON object following this schema {'biopsy':bool}. "
        + "If there is not enough information, answer {'biopsy':'False'}."
        + "\n\n#### Examples:\n",
        "examples": [
            (
                "07/12/2020",
                "07/12/2020 : Anapath / biopsies rectales : Muqueuse rectale normale sous réserve de fragments de petite taille.",
                "{'biopsy':'True'}",
            ),
            (
                "24/12/2021",
                "Chirurgie 24/12/2021 : Colectomie gauche + anastomose colo rectale + clearance hépatique gauche (une méta posée sur",
                "{'biopsy':'False'}",
            ),
        ],
        "prefix_prompt": "\nDetermine if '{span}' corresponds to a biopsy date. The text is as follows:\n<<< ",
        "suffix_prompt": " >>>",
        "json_schema": {
            "properties": {
                "biopsy": {"title": "Biopsy", "type": "boolean"},
            },
            "required": [
                "biopsy",
            ],
            "title": "DateModel",
            "type": "object",
        },
        "response_mapping": {
            "(?i)(oui)|(yes)|(true)": "1",
            "(?i)(non)|(no)|(false)|(don't)|(not)": "0",
        },
    },
}
```
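
The `response_mapping` field associates regular expressions over the raw model answer with normalized labels. The exact matching logic is handled by the pipe itself when a mapping is provided; the snippet below is only an illustrative sketch of how such patterns behave on a raw answer:

```{ .python .no-check }
import re

raw_answer = "{'biopsy':'True'}"
mapping = task_prompts[0]["response_mapping"]
# Illustration only: take the value of the first pattern matching the raw answer
label = next(
    (value for pattern, value in mapping.items() if re.search(pattern, raw_answer)),
    None,
)
print(label)  # "1"
```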

## Format these examples for few-shot learning
```{ .python .no-check }
def format_examples(raw_examples, prefix_prompt, suffix_prompt):
    examples = []

    for date, context, answer in raw_examples:
        prompt = prefix_prompt.format(span=date) + context + suffix_prompt
        examples.append((prompt, answer))

    return examples
```
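
For instance, applying this helper to the task defined above yields the (prompt, answer) pairs used as few-shot examples:

```{ .python .no-check }
demo_task = task_prompts[0]
demo_examples = format_examples(
    demo_task["examples"],
    demo_task["prefix_prompt"],
    demo_task["suffix_prompt"],
)
# Each element is a (formatted prompt, expected answer) pair
print(demo_examples[0][0])
print(demo_examples[0][1])
```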

## Set parameters and prompts
```{ .python .no-check }
# Set prompt
prompt_id = 0
raw_examples = task_prompts.get(prompt_id).get("examples")
prefix_prompt = task_prompts.get(prompt_id).get("prefix_prompt")
user_prompt = task_prompts.get(prompt_id).get("user_prompt")  # not defined for this task, stays None
system_prompt = task_prompts.get(prompt_id).get("system_prompt")
suffix_prompt = task_prompts.get(prompt_id).get("suffix_prompt")
examples = format_examples(raw_examples, prefix_prompt, suffix_prompt)

# Define the JSON schema used to constrain the model output
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "DateModel",
        # "strict": True,
        "schema": task_prompts.get(prompt_id)["json_schema"],
    },
}

# Set parameters
response_mapping = None
max_tokens = 200
extra_body = {
    # "chat_template_kwargs": {"enable_thinking": False},
}
temperature = 0
```
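
As a sanity check, you can verify that a sample answer complies with the schema passed in `response_format`. A small sketch, assuming the `jsonschema` package is available in your environment:

```{ .python .no-check }
from jsonschema import validate

# Raises a ValidationError if the sample answer does not match the schema
validate(instance={"biopsy": True}, schema=task_prompts[prompt_id]["json_schema"])
```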

=== "For local serving"

    ```{ .python .no-check }
    ### For local serving
    model_name = "Qwen/Qwen3-8B"
    api_url = "http://localhost:8000/v1"
    api_key = "EMPTY_API_KEY"
    ```

=== "Using the Groq API"

    !!! warning

        This section involves the use of an external API. Please ensure you have the necessary credentials and understand the potential risks associated with external API usage.

    ```{ .python .no-check }
    ### Using the Groq API
    model_name = "openai/gpt-oss-20b"
    api_url = "https://api.groq.com/openai/v1"
    api_key = "TOKEN"  # your API key
    ```

## Define the pipeline
```{ .python .no-check }
nlp = edsnlp.blank("eds")
nlp.add_pipe("sentencizer")
nlp.add_pipe(eds.dates())
nlp.add_pipe(
    LLMSpanClassifier(
        name="llm",
        model=model_name,
        span_getter=["dates"],
        attributes={"_.biopsy_procedure": True},
        context_getter=make_span_context_getter(
            context_sents=(3, 3),
            context_words=(1, 1),
        ),
        prompt=dict(
            system_prompt=system_prompt,
            user_prompt=user_prompt,
            prefix_prompt=prefix_prompt,
            suffix_prompt=suffix_prompt,
            examples=examples,
        ),
        api_params=dict(
            max_tokens=max_tokens,
            temperature=temperature,
            response_format=response_format,
            extra_body=extra_body,
        ),
        api_url=api_url,
        api_key=api_key,
        response_mapping=response_mapping,
        n_concurrent_tasks=4,
    )
)
```

## Apply it on a document

```{ .python .no-check }
# Let's try with a fake LLM-generated text
text = """
Centre Hospitalier Départemental – RCP Prostate – 20/02/2025

M. Bernard P., 69 ans, retraité, consulte après avoir noté une faiblesse du jet urinaire et des levers nocturnes répétés depuis un an. PSA à 15,2 ng/mL (05/02/2025). TR : nodule ferme sur lobe gauche.

IRM multiparamétrique du 10/02/2025 : lésion PIRADS 5, 2,1 cm, atteinte de la capsule suspectée.
Biopsies du 12/02/2025 : adénocarcinome Gleason 4+4=8, toutes les carottes gauches positives.
Scanner TAP et scintigraphie osseuse du 14/02 : absence de métastases viscérales ou osseuses.

En RCP du 20/02/2025, patient classé cT3a N0 M0, haut risque. Décision : radiothérapie externe + hormonothérapie longue (24 mois). Planification de la simulation scanner le 25/02.
"""
```

```{ .python .no-check }
t0 = datetime.now()
doc = nlp(text)
t1 = datetime.now()
print("Execution time", t1 - t0)

for span in doc.spans["dates"]:
    print(span, span._.biopsy_procedure)
```

Let's check the type of the returned attribute:
```{ .python .no-check }
type(span._.biopsy_procedure)
```
## Apply on multiple documents
```{ .python .no-check }
texts = [
    text,
] * 2

notes = pd.DataFrame({"note_id": range(len(texts)), "note_text": texts})
docs = edsnlp.data.from_pandas(notes, nlp=nlp, converter="omop")
predicted_docs = docs.map_pipeline(nlp, 2)
```

```{ .python .no-check }
t0 = datetime.now()
note_nlp = edsnlp.data.to_pandas(
    predicted_docs,
    converter="ents",
    span_getter="dates",
    span_attributes=[
        "biopsy_procedure",
    ],
)
t1 = datetime.now()
print("Execution time", t1 - t0)
note_nlp.head()
```