add outlines to wrap Ollama based OSS models #431

staru09 · 2025-09-23T18:20:02Z

Enable it by setting this flag at the top of the pipeline YAML file:

use_outlines_for_ollama: true

I tested it on local pipelines, but the results aren't very consistent from run to run.
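
For context, constrained decoding generally needs a machine-readable description of the expected output, typically a JSON Schema. The snippet below is a minimal sketch of how flat pipeline output schemas (type names such as `string`, `number`, `integer`, `boolean`, `list[string]`) could be translated into one. It is not the code from this PR: the function names `type_to_json_schema` and `schema_to_json_schema` are hypothetical, and nested object types such as `list[{speaker: string}]` are deliberately left unsupported.

```python
import re

# Scalar type names used in the pipeline's output schemas,
# mapped to their JSON Schema equivalents.
_SCALARS = {
    "string": "string",
    "number": "number",
    "integer": "integer",
    "boolean": "boolean",
}


def type_to_json_schema(type_name: str) -> dict:
    """Translate one type string (e.g. "list[string]") to a JSON Schema fragment."""
    type_name = type_name.strip()
    match = re.fullmatch(r"list\[(.+)\]", type_name)
    if match:
        # Recurse on the element type of the list.
        return {"type": "array", "items": type_to_json_schema(match.group(1))}
    if type_name not in _SCALARS:
        # Nested objects and unknown names are out of scope for this sketch.
        raise ValueError(f"unsupported type: {type_name!r}")
    return {"type": _SCALARS[type_name]}


def schema_to_json_schema(schema: dict) -> dict:
    """Build an object schema in which every declared field is required."""
    return {
        "type": "object",
        "properties": {name: type_to_json_schema(t) for name, t in schema.items()},
        "required": list(schema),
        "additionalProperties": False,
    }


if __name__ == "__main__":
    import json

    example = {"slip": "boolean", "positive_signals": "list[string]"}
    print(json.dumps(schema_to_json_schema(example), indent=2))
```

Marking every field as required and forbidding extra properties is an assumption made here to keep the generated output as rigid as possible; a real integration may choose differently.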

staru09 · 2025-09-23T18:53:27Z

Here is the dataset I am using (generated synthetically), and here is the pipeline I am trying to run:

use_outlines_for_ollama: true
default_lm_api_base: "http://localhost:11434"
bypass_cache: true

datasets:
  ai_btc:
    type: file
    path: "./btc.json"

system_prompt:
  dataset_description: internal Slack-style conversations for the "AI agents for bitcoin payments" project
  persona: an objective delivery analyst assessing developer reliability and schedule risk

default_model: ollama/gemma3:12b

operations:
  - name: extract_reliability_signals
    type: map
    output:
      mode: structured_output
      schema:
        report_key: string
        week: string
        time: string
        channel: string
        speaker: string
        message: string
        positive_signals: "list[string]"
        negative_signals: "list[string]"
        reliability_delta: number
        slip: boolean
        fix_action: boolean
    prompt: |
      You analyze a single project chat message and extract reliability signals for the speaker.

      Message metadata:
      - Week: {{ input.week }}
      - Time: {{ input.time }}
      - Channel: {{ input.channel }}
      - Speaker: {{ input.speaker }}

      Message:
      {{ input.message }}

      Identify signals:
      - positive_signals: items like "shipped on time", "added tests", "fixed bug", "stabilized routing", "reduced incidents", "implemented alerting", "idempotency", "rollback plan".
      - negative_signals: items like "slipped", "blocked", "broke build", "repeated incidents", "performance regressions", "waiting on", "deferred".
      - reliability_delta: a small signed number in [-1.0, 1.0] representing the incremental reliability impact of THIS message (e.g., +0.3 for a clear fix and tests; -0.4 for a slip).
      - slip: true if this message indicates schedule slip or delivery risk; else false.
      - fix_action: true if this message describes a concrete fix/mitigation that reduces risk (tests, alerts, rebalances, idempotency, locks, playbooks).

      Output fields:
      - report_key: always "reliability"
      - week, time, channel, speaker, message: echo from input
      - positive_signals: list[string]
      - negative_signals: list[string]
      - reliability_delta: number in [-1.0, 1.0]
      - slip: boolean
      - fix_action: boolean
    validate:
      - output["report_key"] == "reliability"
      - isinstance(output["positive_signals"], list)
      - isinstance(output["negative_signals"], list)
      - isinstance(output["slip"], bool)
      - isinstance(output["fix_action"], bool)
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 300

  - name: aggregate_reliability_by_speaker
    type: reduce
    reduce_key: speaker
    output:
      mode: structured_output
      schema:
        speaker: string
        messages: integer
        positive_count: integer
        negative_count: integer
        slips: integer
        fixes: integer
        reliability_score: number
        example_positive: string
        example_negative: string
    prompt: |
      Aggregate reliability for this speaker from the given items. Score higher for more fixes, fewer slips.

      Inputs:
      {% for item in inputs %}
      - week: {{ item.week }} | slip: {{ item.slip }} | fix: {{ item.fix_action }} | delta: {{ item.reliability_delta }}
        + {{ item.positive_signals }}
        - {{ item.negative_signals }}
        msg: {{ item.message | truncate(140, True, '…') }}
      {% endfor %}

      Compute:
      - messages: number of items
      - positive_count: total length of positive_signals
      - negative_count: total length of negative_signals
      - slips: count of items with slip=true
      - fixes: count of items with fix_action=true
      - reliability_score: round(sum(reliability_delta) + 0.3*fixes - 0.4*slips - 0.1*negative_count + 0.1*positive_count, 3)
      - example_positive: a short example message capturing a strong positive
      - example_negative: a short example message capturing a notable negative

      Return only JSON matching the schema.
    fold_batch_size: 40
    fold_prompt: |
      You maintain a cumulative reliability summary for this speaker across batches.

      Previous output (empty on first batch):
      {{ output | tojson }}

      Batch inputs:
      {% for item in inputs %}
      - slip: {{ item.slip }} | fix: {{ item.fix_action }} | delta: {{ item.reliability_delta }}
        + {{ item.positive_signals }} | - {{ item.negative_signals }}
        msg: {{ item.message | truncate(120, True, '…') }}
      {% endfor %}

      Update fields cumulatively:
      - messages += len(batch)
      - positive_count += sum of len(positive_signals) over the batch items
      - negative_count += sum of len(negative_signals) over the batch items
      - slips += batch count of slip=true
      - fixes += batch count of fix_action=true
      - reliability_score += sum(reliability_delta) + 0.3*new_fixes - 0.4*new_slips - 0.1*new_negative + 0.1*new_positive; round to 3 decimals
      - example_positive: keep strongest positive seen so far
      - example_negative: keep strongest negative seen so far

      Return only JSON matching the schema.
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 900
    timeout: 180

  - name: select_most_reliable
    type: reduce
    reduce_key: _all
    output:
      mode: structured_output
      schema:
        most_reliable_developer: string
        score: number
        summary: string
        ranking: "list[{speaker: string, score: number, messages: integer}]"
    prompt: |
      You are given per-speaker reliability aggregates. Identify the most reliable developer and produce a ranking.

      Inputs (one per speaker):
      {% for s in inputs %}
      - speaker: {{ s.speaker }} | messages: {{ s.messages }} | score: {{ s.reliability_score }}
        +pos: {{ s.positive_count }} | -neg: {{ s.negative_count }} | slips: {{ s.slips }} | fixes: {{ s.fixes }}
        example+: {{ s.example_positive | default('') }}
        example-: {{ s.example_negative | default('') }}
      {% endfor %}

      Build:
      - ranking: descending by reliability_score (include up to 10 entries)
      - most_reliable_developer: the top speaker name
      - score: that developer's reliability_score (rounded to 3 decimals)
      - summary: 1–2 paragraphs citing the specific positives and low risk indicators

      Return only JSON matching the schema.
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 700

pipeline:
  steps:
    - name: reliability_extraction
      input: ai_btc
      operations:
        - extract_reliability_signals

    - name: reliability_aggregate
      input: reliability_extraction
      operations:
        - aggregate_reliability_by_speaker

    - name: reliability_selection
      input: reliability_aggregate
      operations:
        - select_most_reliable

  output:
    type: file
    path: "./ai_reliability.json"
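
The two reduce prompts above ask the model to count items and evaluate a weighted sum, which is one plausible source of run-to-run variation. As a reference point, the same aggregation and ranking can be recomputed deterministically from the map outputs. The sketch below mirrors the formula in the `aggregate_reliability_by_speaker` prompt; the function names `aggregate_speaker` and `rank` and the sample records are hypothetical and not part of this PR.

```python
def aggregate_speaker(speaker: str, items: list[dict]) -> dict:
    """Recompute the per-speaker aggregate exactly as the reduce prompt describes."""
    positive_count = sum(len(i["positive_signals"]) for i in items)
    negative_count = sum(len(i["negative_signals"]) for i in items)
    slips = sum(1 for i in items if i["slip"])
    fixes = sum(1 for i in items if i["fix_action"])
    delta_sum = sum(i["reliability_delta"] for i in items)
    # Same weights as in the prompt: +0.3 per fix, -0.4 per slip,
    # -0.1 per negative signal, +0.1 per positive signal.
    score = round(
        delta_sum
        + 0.3 * fixes
        - 0.4 * slips
        - 0.1 * negative_count
        + 0.1 * positive_count,
        3,
    )
    return {
        "speaker": speaker,
        "messages": len(items),
        "positive_count": positive_count,
        "negative_count": negative_count,
        "slips": slips,
        "fixes": fixes,
        "reliability_score": score,
    }


def rank(aggregates: list[dict], limit: int = 10) -> list[dict]:
    """Sort aggregates by reliability_score, highest first, keeping up to `limit`."""
    ordered = sorted(aggregates, key=lambda a: a["reliability_score"], reverse=True)
    return ordered[:limit]


if __name__ == "__main__":
    # Hypothetical map outputs for one speaker.
    sample = [
        {"positive_signals": ["added tests", "fixed bug"], "negative_signals": [],
         "reliability_delta": 0.3, "slip": False, "fix_action": True},
        {"positive_signals": [], "negative_signals": ["slipped"],
         "reliability_delta": -0.4, "slip": True, "fix_action": False},
    ]
    print(aggregate_speaker("alice", sample))
```

Comparing these values against the model's `reliability_score` and counts for the same inputs would show whether the inconsistency comes from the arithmetic in the reduce steps or from the signals extracted in the map step.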

ollama+outlines

f5b4605
