@andompesta
Contributor

Description

Fix a problem in which words composed of several sub-tokens are split into their base sub-tokens.
@Pouyanpi

Related Issue(s)

Resolves issue #1197

Checklist

  • [ x ] I've read the CONTRIBUTING guidelines.
  • [ x ] I've updated the documentation if applicable.
  • I've added tests if applicable.
  • [ x ] @mentions of the person or team responsible for reviewing proposed changes.

@andompesta andompesta marked this pull request as draft May 16, 2025 08:46
@Pouyanpi Pouyanpi added the bug Something isn't working label May 16, 2025
@Pouyanpi Pouyanpi added this to the v0.14.0 milestone May 16, 2025
@Pouyanpi Pouyanpi requested review from Pouyanpi and Copilot May 16, 2025 12:22
Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.62%. Comparing base (85400a5) to head (2acf809).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1198   +/-   ##
========================================
  Coverage    68.62%   68.62%           
========================================
  Files          161      161           
  Lines        15966    15966           
========================================
  Hits         10957    10957           
  Misses        5009     5009           
Flag Coverage Δ
python 68.62% <100.00%> (ø)

Files with missing lines Coverage Δ
nemoguardrails/rails/llm/llmrails.py 87.12% <100.00%> (ø)

if words:
yield words[0]
for word in words[1:]:
# yield words[0]
Member

This adds an extra space before the first word, no?
I think the previous logic was good here.
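
The concern above can be illustrated with a minimal sketch. The helper name `iter_words` is hypothetical and is not the code in this PR; it assumes the text is split on single spaces, and shows how yielding the first word bare and prefixing only the later words with a space avoids an extra leading space.

```python
from typing import Iterator


def iter_words(text: str) -> Iterator[str]:
    """Yield pieces of `text` so that "".join(pieces) reproduces it.

    Hypothetical helper for illustration only, not taken from the PR.
    """
    words = text.split(" ")
    if words:
        # First word: emitted without a leading space.
        yield words[0]
        # Later words carry the space that preceded them.
        for word in words[1:]:
            yield " " + word


pieces = list(iter_words("hello brave world"))
print(pieces)
print("".join(pieces))
```

If every word, including the first, were prefixed with a space, the joined output would start with a space that the original text does not have.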


async for chunk_list, chunk_str_rep in buffer_strategy(streaming_handler):
-    chunk_str = " ".join(chunk_list)
+    chunk_str = "".join(chunk_list)
Member

Good catch! I assume all tokenizers are adding spaces and any other whitespace and special chars in some of the chunks, so this should work for any LLM provider.
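
To illustrate the point above, here is a small sketch with made-up chunks. The chunk values are an assumption about how a tokenizer might split text, with whitespace carried inside the chunks; they are not output from any particular LLM provider.

```python
# Hypothetical sub-token chunks, assuming the tokenizer keeps whitespace
# inside the chunks themselves.
chunk_list = ["Gu", "ard", "rails", " are", " use", "ful", "."]

# Joining with a space breaks words apart at sub-token boundaries
# and doubles the spaces that the chunks already contain.
spaced = " ".join(chunk_list)

# Joining with the empty string restores the original text.
joined = "".join(chunk_list)

print(repr(spaced))
print(repr(joined))
```

Under this assumption the empty-string join is the lossless choice, because the chunks already encode every separator.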

@Pouyanpi
Collaborator

Pouyanpi commented Sep 2, 2025

Closing this in favor of #1259.

@Pouyanpi Pouyanpi closed this Sep 2, 2025
