Commit 9018780

feat: new qualifier llm pipe

1 parent 8dfb62c commit 9018780

File tree

9 files changed: +1500 -0 lines changed

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@

## LLM Span Classifier {: #edsnlp.pipes.qualifiers.llm.factory.create_component }

::: edsnlp.pipes.qualifiers.llm.factory.create_component
    options:
        heading_level: 3
        show_bases: false
        show_source: false
        only_class_level: true

## APIParams {: #edsnlp.pipes.qualifiers.llm.llm_qualifier.APIParams }

::: edsnlp.pipes.qualifiers.llm.llm_qualifier.APIParams
    options:
        heading_level: 3
        show_bases: false
        show_source: false
        only_class_level: true

## PromptConfig {: #edsnlp.pipes.qualifiers.llm.llm_qualifier.PromptConfig }

::: edsnlp.pipes.qualifiers.llm.llm_qualifier.PromptConfig
    options:
        heading_level: 3
        show_bases: false
        show_source: false
        only_class_level: true
Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@

# Using an LLM as a span qualifier

In this tutorial, we will learn how to use the `LLMSpanClassifier` pipe to qualify spans.

First, install the extra dependencies in a Python environment (Python >= 3.8):

```bash
pip install edsnlp[llm]
```

## Using a local LLM server

We assume that an LLM server compatible with the OpenAI API is available.
For example, using the vLLM library, you can launch an LLM server from the command line as follows:

```bash
vllm serve Qwen/Qwen3-8B --port 8000 --enable-prefix-caching --tensor-parallel-size 1 --max-num-seqs=10 --max-num-batched-tokens=35000
```

## Using an external API

You can also use the [OpenAI API](https://openai.com/index/openai-api/) or the [Groq API](https://groq.com/).

!!! warning

    As you are probably working with sensitive medical data, please check whether you are allowed to use an external API or whether you need to expose an API within your own infrastructure.
## Import dependencies

```{ .python .no-check }
from datetime import datetime

import pandas as pd

import edsnlp
import edsnlp.pipes as eds
from edsnlp.pipes.qualifiers.llm.llm_qualifier import LLMSpanClassifier
from edsnlp.utils.span_getters import make_span_context_getter
```

## Define prompt and examples

```{ .python .no-check }
task_prompts = {
    0: {
        "normalized_task_name": "biopsy_procedure",
        "system_prompt": "You are a medical assistant and you will help answering questions about dates present in clinical notes. Don't answer reasoning. "
        + "We are interested in detecting biopsy dates (either procedure, analysis or result). "
        + "You should answer in a JSON object following this schema {'biopsy':bool}. "
        + "If there is not enough information, answer {'biopsy':'False'}."
        + "\n\n#### Examples:\n",
        "examples": [
            (
                "07/12/2020",
                "07/12/2020 : Anapath / biopsies rectales : Muqueuse rectale normale sous réserve de fragments de petite taille.",
                "{'biopsy':'True'}",
            ),
            (
                "24/12/2021",
                "Chirurgie 24/12/2021 : Colectomie gauche + anastomose colo rectale + clearance hépatique gauche (une méta posée sur",
                "{'biopsy':'False'}",
            ),
        ],
        "prefix_prompt": "\nDetermine if '{span}' corresponds to a biopsy date. The text is as follows:\n<<< ",
        "suffix_prompt": " >>>",
        "json_schema": {
            "properties": {
                "biopsy": {"title": "Biopsy", "type": "boolean"},
            },
            "required": [
                "biopsy",
            ],
            "title": "DateModel",
            "type": "object",
        },
        "response_mapping": {
            "(?i)(oui)|(yes)|(true)": "1",
            "(?i)(non)|(no)|(false)|(don't)|(not)": "0",
        },
    },
}
```
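The `response_mapping` regexes above are meant to turn the model's free-text answer into a binary label. As a minimal standalone sketch of that idea (this `map_response` helper is hypothetical and only illustrates the intent, it is not the pipe's internal implementation):

```python
import re

# Same patterns as in the task_prompts dict above
response_mapping = {
    r"(?i)(oui)|(yes)|(true)": "1",
    r"(?i)(non)|(no)|(false)|(don't)|(not)": "0",
}

def map_response(answer, mapping):
    # Return the label of the first pattern found in the raw answer
    for pattern, label in mapping.items():
        if re.search(pattern, answer):
            return label
    return None  # no pattern matched

print(map_response("{'biopsy':'True'}", response_mapping))   # 1
print(map_response("{'biopsy':'False'}", response_mapping))  # 0
```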

## Format these examples for few-shot learning

```{ .python .no-check }
def format_examples(raw_examples, prefix_prompt, suffix_prompt):
    examples = []

    for date, context, answer in raw_examples:
        prompt = prefix_prompt.format(span=date) + context + suffix_prompt
        examples.append((prompt, answer))

    return examples
```
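To see what this produces, here is the helper applied to a toy example (the function is repeated so the snippet runs standalone; the date and context below are made up):

```python
def format_examples(raw_examples, prefix_prompt, suffix_prompt):
    examples = []
    for date, context, answer in raw_examples:
        prompt = prefix_prompt.format(span=date) + context + suffix_prompt
        examples.append((prompt, answer))
    return examples

# Made-up raw example: (date string, context, expected answer)
raw = [("01/01/2024", "Biopsie du 01/01/2024 : prelevement realise.", "{'biopsy':'True'}")]
prefix = "\nDetermine if '{span}' corresponds to a biopsy date. The text is as follows:\n<<< "
suffix = " >>>"

formatted = format_examples(raw, prefix, suffix)
print(formatted[0][0])
```

Each raw example thus becomes a `(user message, expected answer)` pair, presumably replayed as few-shot chat history before the actual query.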

## Set parameters and prompts

```{ .python .no-check }
# Set prompt
prompt_id = 0
raw_examples = task_prompts.get(prompt_id).get("examples")
prefix_prompt = task_prompts.get(prompt_id).get("prefix_prompt")
user_prompt = task_prompts.get(prompt_id).get("user_prompt")
system_prompt = task_prompts.get(prompt_id).get("system_prompt")
suffix_prompt = task_prompts.get(prompt_id).get("suffix_prompt")
examples = format_examples(raw_examples, prefix_prompt, suffix_prompt)

# Define JSON schema
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "DateModel",
        # "strict": True,
        "schema": task_prompts.get(prompt_id)["json_schema"],
    },
}

# Set parameters
response_mapping = None
max_tokens = 200
extra_body = {
    # "chat_template_kwargs": {"enable_thinking": False},
}
temperature = 0
```
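Since the request asks for structured output, the model reply should be a JSON string conforming to the schema above. A quick standard-library sanity check on a made-up reply (this check is illustrative; it is not something the pipe requires you to do):

```python
import json

# Same schema as defined in task_prompts above
json_schema = {
    "properties": {"biopsy": {"title": "Biopsy", "type": "boolean"}},
    "required": ["biopsy"],
    "title": "DateModel",
    "type": "object",
}

raw_reply = '{"biopsy": true}'  # hypothetical model output
parsed = json.loads(raw_reply)

# All required keys must be present, and "biopsy" must be a boolean
ok = all(key in parsed for key in json_schema["required"]) and isinstance(parsed["biopsy"], bool)
print(ok)  # True
```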

=== "For local serving"

    ```{ .python .no-check }
    model_name = "Qwen/Qwen3-8B"
    api_url = "http://localhost:8000/v1"
    api_key = "EMPTY_API_KEY"
    ```

=== "Using the Groq API"

    !!! warning

        This section involves the use of an external API. Please ensure you have the necessary credentials and understand the risks associated with sending data to an external API.

    ```{ .python .no-check }
    model_name = "openai/gpt-oss-20b"
    api_url = "https://api.groq.com/openai/v1"
    api_key = "TOKEN"  # your API key
    ```

## Define the pipeline

```{ .python .no-check }
nlp = edsnlp.blank("eds")
nlp.add_pipe("sentencizer")
nlp.add_pipe(eds.dates())
nlp.add_pipe(
    LLMSpanClassifier(
        name="llm",
        model=model_name,
        span_getter=["dates"],
        attributes={"_.biopsy_procedure": True},
        context_getter=make_span_context_getter(
            context_sents=(3, 3),
            context_words=(1, 1),
        ),
        prompt=dict(
            system_prompt=system_prompt,
            user_prompt=user_prompt,
            prefix_prompt=prefix_prompt,
            suffix_prompt=suffix_prompt,
            examples=examples,
        ),
        api_params=dict(
            max_tokens=max_tokens,
            temperature=temperature,
            response_format=response_format,
            extra_body=extra_body,
        ),
        api_url=api_url,
        api_key=api_key,
        response_mapping=response_mapping,
        n_concurrent_tasks=4,
    )
)
```

## Apply it on a document

```{ .python .no-check }
# Let's try with a fake LLM-generated text
text = """
Centre Hospitalier Départemental – RCP Prostate – 20/02/2025

M. Bernard P., 69 ans, retraité, consulte après avoir noté une faiblesse du jet urinaire et des levers nocturnes répétés depuis un an. PSA à 15,2 ng/mL (05/02/2025). TR : nodule ferme sur lobe gauche.

IRM multiparamétrique du 10/02/2025 : lésion PIRADS 5, 2,1 cm, atteinte de la capsule suspectée.
Biopsies du 12/02/2025 : adénocarcinome Gleason 4+4=8, toutes les carottes gauches positives.
Scanner TAP et scintigraphie osseuse du 14/02 : absence de métastases viscérales ou osseuses.

En RCP du 20/02/2025, patient classé cT3a N0 M0, haut risque. Décision : radiothérapie externe + hormonothérapie longue (24 mois). Planification de la simulation scanner le 25/02.
"""
```

```{ .python .no-check }
t0 = datetime.now()
doc = nlp(text)
t1 = datetime.now()
print("Execution time", t1 - t0)

for span in doc.spans["dates"]:
    print(span, span._.biopsy_procedure)
```

Let's check the type of the predicted attribute:

```{ .python .no-check }
type(span._.biopsy_procedure)
```

## Apply on multiple documents

```{ .python .no-check }
texts = [
    text,
] * 2

notes = pd.DataFrame({"note_id": range(len(texts)), "note_text": texts})
docs = edsnlp.data.from_pandas(notes, nlp=nlp, converter="omop")
predicted_docs = docs.map_pipeline(nlp, 2)
```

```{ .python .no-check }
t0 = datetime.now()
note_nlp = edsnlp.data.to_pandas(
    predicted_docs,
    converter="ents",
    span_getter="dates",
    span_attributes=[
        "biopsy_procedure",
    ],
)
t1 = datetime.now()
print("Execution time", t1 - t0)
note_nlp.head()
```

edsnlp/pipes/llm/llm_span_qualifier/__init__.py

Whitespace-only changes.
Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@

from edsnlp.core import registry

from .llm_span_qualifier import LLMSpanClassifier

create_component = registry.factory.register(
    "eds.llm_span_qualifier",
)(LLMSpanClassifier)

0 commit comments