
Move crawl and QA logs to new mongo collection #2791


Open: wants to merge 14 commits into main from issue-2765-split-off-logs

Conversation

tw4l (Member) commented Aug 5, 2025

Fixes #2765

(This is a necessary part of ensuring mongo documents don't exceed 16MB. Hopefully it's also sufficient but we'll need to see in practice if there are other fields that need to be separated out.)

This PR moves crawl and QA run logs into a separate crawl_logs mongo collection.

It adds a new backend module (without a distinct API router; the crawls module is getting quite large, so a separate module for the new mongo collection seemed to make sense), as well as a migration to move crawl logs from Crawl objects into the new collection. The existing nightly test for crawl error logs is fleshed out, and a new nightly test is added for behavior logs.

The migration has been tested locally. I've also verified that the new collection's indices are used by the existing crawl error and behavior log endpoints.
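The core transformation the migration performs can be sketched roughly as follows. This is a minimal illustration only: the field names (errors, behaviorLogs, crawlId) and document shapes are assumptions for the sketch, not the exact Browsertrix schema.

```python
import json
from typing import Any


def split_logs(
    crawl_doc: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Strip log lines off a crawl document and return per-line documents
    destined for a separate crawl_logs collection.

    Field names here (errors, behaviorLogs, crawlId) are illustrative,
    not the exact Browsertrix schema.
    """
    log_docs: list[dict[str, Any]] = []
    for field, context in (("errors", "error"), ("behaviorLogs", "behavior")):
        # Remove the log array from the crawl document so it can no longer
        # push the document toward the 16 MB limit.
        for line in crawl_doc.pop(field, None) or []:
            try:
                parsed = json.loads(line)  # each log line is a JSON string
            except json.JSONDecodeError:
                parsed = {"message": line}  # keep unparseable lines verbatim
            log_docs.append(
                {
                    "crawlId": crawl_doc["_id"],
                    "qaRunId": None,
                    "context": context,
                    **parsed,
                }
            )
    return crawl_doc, log_docs
```

With documents of this shape, a migration would iterate over crawls, split the logs off each document, insert the resulting per-line documents into the new collection, and unset the old array fields on the crawl document.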

@tw4l tw4l requested a review from ikreymer August 5, 2025 19:07
@tw4l tw4l force-pushed the issue-2765-split-off-logs branch from 902a347 to 2ebc399 on August 5, 2025 19:07
qa_run_id=None,
)

while behavior_logs:
Member

I think it might be possible to do this entirely in mongo, might be faster using $function and calling JSON.parse?

Member Author
@tw4l tw4l Aug 13, 2025

I think that it would be more complicated/less efficient than it seems at first glance to handle it that way. As far as I can tell we wouldn't be able to write to both mongo collections in a single query, so we'd need to put the crawl document's log lines into memory between queries anyway. At that point, it seems like we may as well keep a single codepath for parsing/writing log lines so we know everything's consistent.
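A rough sketch of the application-side approach being argued for here: since mongo can't update the crawl document and insert into crawl_logs in one query, the lines pass through application memory, and batching keeps that footprint bounded. The helper name and batch size below are illustrative, not from the PR.

```python
from collections.abc import Callable, Iterable
from itertools import islice
from typing import Any


def write_log_lines(
    lines: Iterable[dict[str, Any]],
    insert_many: Callable[[list[dict[str, Any]]], None],
    batch_size: int = 1000,
) -> int:
    """Write already-parsed log lines to the new collection in batches.

    insert_many stands in for a collection write, e.g. a pymongo/motor
    crawl_logs.insert_many call.
    """
    written = 0
    it = iter(lines)
    # Drain the iterator in fixed-size chunks so memory use stays bounded
    # even for crawls with very large log arrays.
    while batch := list(islice(it, batch_size)):
        insert_many(batch)
        written += len(batch)
    return written
```

Keeping the parse-then-insert step in one helper like this is what gives the single codepath for both crawl and QA run logs.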

Co-authored-by: Ilya Kreymer <[email protected]>
Successfully merging this pull request may close these issues.

[Task]: Move logs out of crawl documents to ensure documents don't exceed 16 MB Mongo limit
2 participants