
Conversation

tolgaerdonmez

Problem

default_transform() bins documents by token length over the 0–100k interval, separating it into three bins. But for longer documents with token length >100k, the function raises the following:

    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    )

That message covers the case of empty or very short documents, but it is also raised incorrectly for documents longer than 100k tokens.

Solution (Currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper bound to inf. This solves the problem easily, but could be inefficient for very large documents.

    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
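To illustrate why the unbounded last bin fixes the fall-through, here is a minimal sketch. The `get_bin` helper is hypothetical (not the actual ragas code); it just shows that with `float("inf")` as the upper bound, any large token count lands in the last bin instead of falling past every range and hitting the "too short" error:

```python
# Hypothetical sketch (not the actual ragas implementation) of how an
# unbounded last bin removes the >100k fall-through case.
bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]

def get_bin(token_count: int) -> int:
    """Return the index of the bin containing token_count."""
    for i, (low, high) in enumerate(bin_ranges):
        if low <= token_count <= high:
            return i
    # With an unbounded last bin, only invalid counts (e.g. negative)
    # can fall through to this error now.
    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). "
        "Please provide longer documents."
    )

print(get_bin(50))       # small document -> bin 0
print(get_bin(250_000))  # very large document -> last bin (index 2)
```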

Better Solution Proposal (Let's discuss this)

If the given document is larger than 100k tokens, separate the document in half and start the transformation again, until the halves fit into the initial bin sizes.
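The halving idea above can be sketched as a small recursive helper (names and the 100k cutoff are assumptions from this discussion, not ragas code):

```python
# Sketch of the proposed approach: if a document exceeds the largest
# bin, split it in half and recurse until every piece fits.
MAX_TOKENS = 100_000  # assumed upper bound of the last bin

def split_to_fit(tokens: list, max_tokens: int = MAX_TOKENS) -> list:
    """Recursively halve a token sequence until each piece fits max_tokens."""
    if len(tokens) <= max_tokens:
        return [tokens]
    mid = len(tokens) // 2
    return split_to_fit(tokens[:mid], max_tokens) + split_to_fit(tokens[mid:], max_tokens)

chunks = split_to_fit(["tok"] * 250_000)
print([len(c) for c in chunks])  # every chunk now fits within 100k tokens
```

Note that plain halving loses context at the cut points, which is what motivates the overlap-based variant discussed below in the thread.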

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Mar 5, 2025
@tolgaerdonmez
Author

I've found another solution:
Separate the document into halves of the total token length using LangChain's text splitters with overlap.
Use the token counting function used in ragas itself as the length function.
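As a rough illustration of that idea: LangChain provides splitters (e.g. RecursiveCharacterTextSplitter) that accept a `length_function` parameter, so ragas's own token counter could be plugged in. The stdlib-only sketch below mimics the behavior without the dependency; `token_len` is a placeholder for the real token counter, and all names are hypothetical:

```python
# Stdlib sketch of overlap-based splitting with a pluggable length
# function. token_len stands in for ragas's token counter.
def token_len(text: str) -> int:
    return len(text.split())  # placeholder: whitespace "tokens"

def split_with_overlap(text: str, chunk_tokens: int, overlap: int) -> list:
    """Split text into chunks of at most chunk_tokens tokens,
    where adjacent chunks share `overlap` tokens of context."""
    words = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(doc, chunk_tokens=4, overlap=1))
```

The overlap keeps some shared context across the cut points, which plain halving would lose.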

@jjmachan
Member

Hey @tolgaerdonmez! 👋

Hope you're doing well! I really loved your work on improving bin coverage for default_transform() in Knowledge Graph transformations - it's exactly the kind of thoughtful improvement that makes Ragas better for everyone.

Quick question for you - we're trying to figure out what to do with the Testset Generation module as we gear up for v0.4, and since you've been working in this space, I'd love to get your take on it.

Mind checking out this discussion when you have a moment?
🔗 #2231

Basically we're wondering if we should keep it as part of the core library, spin it off into its own thing, or maybe even retire it if folks aren't really using it much. No pressure at all, but given your experience with knowledge graph transformations and document processing, your perspective would be super helpful!

Just drop a 👍 👎 or 🚀 on the issue, or feel free to share any thoughts you have.

Thanks for being awesome! 🙏

Contributor

@anistark anistark left a comment


The changes look good. Thanks for the PR @tolgaerdonmez

Could you please rebase and check for potential conflicts?

@anistark
Contributor

anistark commented Sep 8, 2025

I've found another solution: Separate the document into halves of the total token length using LangChain's text splitters with overlap. Use the token counting function used in ragas itself as the length function.

This might be a bit over-engineered, but we can take it up as a discussion in a new issue perhaps. This fix is good to go.
