staru09 commented Sep 23, 2025

To enable it, set this flag at the top of the YAML file:

use_outlines_for_ollama: true

I tested this in local pipelines, but the results aren't very consistent.

staru09 commented Sep 23, 2025

Here is the dataset I am using (I generated it synthetically), and here is the pipeline I am trying to run:

use_outlines_for_ollama: true
default_lm_api_base: "http://localhost:11434"
bypass_cache: true

datasets:
  ai_btc:
    type: file
    path: "./btc.json"

system_prompt:
  dataset_description: internal Slack-style conversations for the "AI agents for bitcoin payments" project
  persona: an objective delivery analyst assessing developer reliability and schedule risk

default_model: ollama/gemma3:12b

operations:
  - name: extract_reliability_signals
    type: map
    output:
      mode: structured_output
      schema:
        report_key: string
        week: string
        time: string
        channel: string
        speaker: string
        message: string
        positive_signals: "list[string]"
        negative_signals: "list[string]"
        reliability_delta: number
        slip: boolean
        fix_action: boolean
    prompt: |
      You analyze a single project chat message and extract reliability signals for the speaker.

      Message metadata:
      - Week: {{ input.week }}
      - Time: {{ input.time }}
      - Channel: {{ input.channel }}
      - Speaker: {{ input.speaker }}

      Message:
      {{ input.message }}

      Identify signals:
      - positive_signals: items like "shipped on time", "added tests", "fixed bug", "stabilized routing", "reduced incidents", "implemented alerting", "idempotency", "rollback plan".
      - negative_signals: items like "slipped", "blocked", "broke build", "repeated incidents", "performance regressions", "waiting on", "deferred".
      - reliability_delta: a small signed number in [-1.0, 1.0] representing the incremental reliability impact of THIS message (e.g., +0.3 for a clear fix and tests; -0.4 for a slip).
      - slip: true if this message indicates schedule slip or delivery risk; else false.
      - fix_action: true if this message describes a concrete fix/mitigation that reduces risk (tests, alerts, rebalances, idempotency, locks, playbooks).

      Output fields:
      - report_key: always "reliability"
      - week, time, channel, speaker, message: echo from input
      - positive_signals: list[string]
      - negative_signals: list[string]
      - reliability_delta: number in [-1.0, 1.0]
      - slip: boolean
      - fix_action: boolean
    validate:
      - output["report_key"] == "reliability"
      - isinstance(output["positive_signals"], list)
      - isinstance(output["negative_signals"], list)
      - isinstance(output["slip"], bool)
      - isinstance(output["fix_action"], bool)
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 300

  - name: aggregate_reliability_by_speaker
    type: reduce
    reduce_key: speaker
    output:
      mode: structured_output
      schema:
        speaker: string
        messages: integer
        positive_count: integer
        negative_count: integer
        slips: integer
        fixes: integer
        reliability_score: number
        example_positive: string
        example_negative: string
    prompt: |
      Aggregate reliability for this speaker from the given items. Score higher for more fixes, fewer slips.

      Inputs:
      {% for item in inputs %}
      - week: {{ item.week }} | slip: {{ item.slip }} | fix: {{ item.fix_action }} | delta: {{ item.reliability_delta }}
        + {{ item.positive_signals }}
        - {{ item.negative_signals }}
        msg: {{ item.message | truncate(140, True, '…') }}
      {% endfor %}

      Compute:
      - messages: number of items
      - positive_count: total length of positive_signals
      - negative_count: total length of negative_signals
      - slips: count of items with slip=true
      - fixes: count of items with fix_action=true
      - reliability_score: round(sum(reliability_delta) + 0.3*fixes - 0.4*slips - 0.1*negative_count + 0.1*positive_count, 3)
      - example_positive: a short example message capturing a strong positive
      - example_negative: a short example message capturing a notable negative

      Return only JSON matching the schema.
    fold_batch_size: 40
    fold_prompt: |
      You maintain a cumulative reliability summary for this speaker across batches.

      Previous output (empty on first batch):
      {{ output | tojson }}

      Batch inputs:
      {% for item in inputs %}
      - slip: {{ item.slip }} | fix: {{ item.fix_action }} | delta: {{ item.reliability_delta }}
        + {{ item.positive_signals }} | - {{ item.negative_signals }}
        msg: {{ item.message | truncate(120, True, '…') }}
      {% endfor %}

      Update fields cumulatively:
      - messages += len(batch)
      - positive_count += sum(len(+) per item)
      - negative_count += sum(len(-) per item)
      - slips += batch count of slip=true
      - fixes += batch count of fix_action=true
      - reliability_score += sum(reliability_delta) + 0.3*new_fixes - 0.4*new_slips - 0.1*new_negative + 0.1*new_positive; round to 3 decimals
      - example_positive: keep strongest positive seen so far
      - example_negative: keep strongest negative seen so far

      Return only JSON matching the schema.
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 900
    timeout: 180

  - name: select_most_reliable
    type: reduce
    reduce_key: _all
    output:
      mode: structured_output
      schema:
        most_reliable_developer: string
        score: number
        summary: string
        ranking: "list[{speaker: string, score: number, messages: integer}]"
    prompt: |
      You are given per-speaker reliability aggregates. Identify the most reliable developer and produce a ranking.

      Inputs (one per speaker):
      {% for s in inputs %}
      - speaker: {{ s.speaker }} | messages: {{ s.messages }} | score: {{ s.reliability_score }}
        +pos: {{ s.positive_count }} | -neg: {{ s.negative_count }} | slips: {{ s.slips }} | fixes: {{ s.fixes }}
        example+: {{ s.example_positive | default('') }}
        example-: {{ s.example_negative | default('') }}
      {% endfor %}

      Build:
      - ranking: descending by reliability_score (include up to 10 entries)
      - most_reliable_developer: the top speaker name
      - score: that developer's reliability_score (rounded to 3 decimals)
      - summary: 1–2 paragraphs citing the specific positives and low risk indicators

      Return only JSON matching the schema.
    litellm_completion_kwargs:
      temperature: 0.2
      max_tokens: 700

pipeline:
  steps:
    - name: reliability_extraction
      input: ai_btc
      operations:
        - extract_reliability_signals

    - name: reliability_aggregate
      input: reliability_extraction
      operations:
        - aggregate_reliability_by_speaker

    - name: reliability_selection
      input: reliability_aggregate
      operations:
        - select_most_reliable

  output:
    type: file
    path: "./ai_reliability.json"
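
Since the results vary between runs, it may help to check each `extract_reliability_signals` output outside the model. The sketch below mirrors the `validate` rules from the pipeline and adds a range check on `reliability_delta`, which the prompt asks for but `validate` does not enforce. `check_map_output` and the sample records are hypothetical helpers for illustration, not part of DocETL or this PR.

```python
from typing import Any


def check_map_output(output: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if output.get("report_key") != "reliability":
        problems.append("report_key must be 'reliability'")
    for field in ("positive_signals", "negative_signals"):
        value = output.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
    for field in ("slip", "fix_action"):
        if not isinstance(output.get(field), bool):
            problems.append(f"{field} must be a boolean")
    delta = output.get("reliability_delta")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (
        isinstance(delta, bool)
        or not isinstance(delta, (int, float))
        or not -1.0 <= delta <= 1.0
    ):
        problems.append("reliability_delta must be a number in [-1.0, 1.0]")
    return problems


# Made-up records, not taken from btc.json.
good = {
    "report_key": "reliability",
    "positive_signals": ["added tests"],
    "negative_signals": [],
    "reliability_delta": 0.3,
    "slip": False,
    "fix_action": True,
}
bad = dict(good, slip="true", reliability_delta=1.5)

good_problems = check_map_output(good)
bad_problems = check_map_output(bad)
```

Counting how many records fail each rule with and without `use_outlines_for_ollama` would give a rough measure of how much the flag changes output consistency.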

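The counts and the `reliability_score` formula in `aggregate_reliability_by_speaker` are plain arithmetic, so they can be recomputed deterministically and compared against what the model returns. This is a minimal sketch under that assumption. The `aggregate` function and the sample items are hypothetical, and the example fields (`example_positive`, `example_negative`) are left out because they need model judgment.

```python
from typing import Any


def aggregate(items: list[dict[str, Any]]) -> dict[str, Any]:
    """Apply the 'Compute' rules from the reduce prompt to one speaker's items."""
    positive = sum(len(i["positive_signals"]) for i in items)
    negative = sum(len(i["negative_signals"]) for i in items)
    slips = sum(1 for i in items if i["slip"])
    fixes = sum(1 for i in items if i["fix_action"])
    delta = sum(i["reliability_delta"] for i in items)
    score = delta + 0.3 * fixes - 0.4 * slips - 0.1 * negative + 0.1 * positive
    return {
        "messages": len(items),
        "positive_count": positive,
        "negative_count": negative,
        "slips": slips,
        "fixes": fixes,
        "reliability_score": round(score, 3),
    }


# Made-up map outputs for one speaker, not taken from btc.json.
sample = [
    {"positive_signals": ["added tests", "fixed bug"], "negative_signals": [],
     "reliability_delta": 0.3, "slip": False, "fix_action": True},
    {"positive_signals": [], "negative_signals": ["slipped"],
     "reliability_delta": -0.4, "slip": True, "fix_action": False},
]
result = aggregate(sample)
```

For the two sample items the score works out to -0.1 + 0.3 - 0.4 - 0.1 + 0.2 = -0.1. A mismatch between this value and the model's `reliability_score` for the same inputs points at the reduce step rather than the map step.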