
Conversation

tolgaerdonmez

Problem

default_transform() bins documents by token length over the 0–100k interval, separating it into three bins. But for longer documents with token length >100k, the function raises the following:

    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    )

That message covers the case of empty or very short documents, but it is also raised incorrectly for documents longer than 100k tokens.

Solution (Currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper bound to inf. This solves the problem easily, but could be inefficient for very large documents.

    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
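To illustrate why the unbounded last bin fixes the fall-through, here is a minimal sketch. The `get_bin` helper is hypothetical (not the actual ragas code); it just shows that with `float("inf")` as the upper bound, any large token count lands in the last bin instead of falling past every range and hitting the "too short" error:

```python
# Hypothetical sketch (not the actual ragas implementation) of how an
# unbounded last bin removes the >100k fall-through case.
bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]

def get_bin(token_count: int) -> int:
    """Return the index of the bin containing token_count."""
    for i, (low, high) in enumerate(bin_ranges):
        if low <= token_count <= high:
            return i
    # With an unbounded last bin, only invalid counts (e.g. negative)
    # can fall through to this error now.
    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). "
        "Please provide longer documents."
    )

print(get_bin(50))       # small document -> bin 0
print(get_bin(250_000))  # very large document -> last bin (index 2)
```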

Better Solution Proposal (Let's discuss this)

If the given document is larger than 100k tokens, separate the document in half and start the transformation again, until the halves fit into the initial bin sizes.
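The halving idea above can be sketched as a small recursive helper (names and the 100k cutoff are assumptions from this discussion, not ragas code):

```python
# Sketch of the proposed approach: if a document exceeds the largest
# bin, split it in half and recurse until every piece fits.
MAX_TOKENS = 100_000  # assumed upper bound of the last bin

def split_to_fit(tokens: list, max_tokens: int = MAX_TOKENS) -> list:
    """Recursively halve a token sequence until each piece fits max_tokens."""
    if len(tokens) <= max_tokens:
        return [tokens]
    mid = len(tokens) // 2
    return split_to_fit(tokens[:mid], max_tokens) + split_to_fit(tokens[mid:], max_tokens)

chunks = split_to_fit(["tok"] * 250_000)
print([len(c) for c in chunks])  # every chunk now fits within 100k tokens
```

Note that plain halving loses context at the cut points, which is what motivates the overlap-based variant discussed below in the thread.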

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Mar 5, 2025
@tolgaerdonmez
Author

I've found another solution:
Separate the document into halves of the total token length using LangChain's text splitters with overlap.
Use the token counting function used in ragas itself as the length function.
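As a rough illustration of that idea: LangChain provides splitters (e.g. RecursiveCharacterTextSplitter) that accept a `length_function` parameter, so ragas's own token counter could be plugged in. The stdlib-only sketch below mimics the behavior without the dependency; `token_len` is a placeholder for the real token counter, and all names are hypothetical:

```python
# Stdlib sketch of overlap-based splitting with a pluggable length
# function. token_len stands in for ragas's token counter.
def token_len(text: str) -> int:
    return len(text.split())  # placeholder: whitespace "tokens"

def split_with_overlap(text: str, chunk_tokens: int, overlap: int) -> list:
    """Split text into chunks of at most chunk_tokens tokens,
    where adjacent chunks share `overlap` tokens of context."""
    words = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(doc, chunk_tokens=4, overlap=1))
```

The overlap keeps some shared context across the cut points, which plain halving would lose.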

@jjmachan
Member

Hey @tolgaerdonmez! 👋

Hope you're doing well! I really loved your work on improving bin coverage for default_transform() in Knowledge Graph transformations - it's exactly the kind of thoughtful improvement that makes Ragas better for everyone.

Quick question for you - we're trying to figure out what to do with the Testset Generation module as we gear up for v0.4, and since you've been working in this space, I'd love to get your take on it.

Mind checking out this discussion when you have a moment?
🔗 #2231

Basically we're wondering if we should keep it as part of the core library, spin it off into its own thing, or maybe even retire it if folks aren't really using it much. No pressure at all, but given your experience with knowledge graph transformations and document processing, your perspective would be super helpful!

Just drop a 👍 👎 or 🚀 on the issue, or feel free to share any thoughts you have.

Thanks for being awesome! 🙏

Contributor

@anistark anistark left a comment


The changes look good. Thanks for the PR @tolgaerdonmez

Could you please rebase and check for potential conflicts?

@anistark
Contributor

anistark commented Sep 8, 2025

I've found another solution: Separate the document into halves of the total token length using LangChain's text splitters with overlap. Use the token counting function used in ragas itself as the length function.

This might be a bit over-engineered, but we can take it up as a discussion in a new issue perhaps. This fix is good to go.
