Enhance read_lines to yield Path objects for streaming and add integration test `commoncrawl` dataset
#1001
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
…`commoncrawl` dataset example.
],
"recordSet": [
  {
    "@id": "warc-records",
    "@type": "cr:RecordSet",
    "field": [
      {
        "@id": "warc-records/url",
        "@type": "cr:Field",
        "name": "url",
        "dataType": "sc:URL",
        "source": {
          "fileSet": {
            "@id": "warc-files"
          },
          "extract": {
            "fileProperty": "fullpath"
          }
        }
      }
    ]
  }
@mqzhou-dev This means we will need to describe our dataset at record level, correct? I cannot tag Greg here but afaik we do not want to go down to record level in our description. I asked him separately to confirm.
EDIT: Nope: I misunderstood.
I'll confirm once I hear from Greg.
I am unsure whether our URLs are records at the FileSet level (as in this proposal?) or records at the FileObject level.
We have a FileObject called warc.paths.gz, whose content is a list of URLs (all incomplete, i.e. relative paths).
So warc.paths.gz looks like this:
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
Each of these URLs describes one file in the FileSet.
So the associated FileSet is composed of:
File1: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File2: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File3: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File4: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
...
- So I think our URLs are records at the FileObject level.
- i.e. we get the URLs when we read the FileObject.
- And records at the FileSet level are parquet objects.
- So when we read the FileSet objects we get a bunch of parquet files.
- If I am correct, the source @id for the URLs must be warc.paths.gz (FileObject) rather than warc-paths (FileSet).
But I might be confusing the record definition here with the FileSet object that is manifested from a FileObject :) Fairly complex idea :)
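For illustration, the expansion described in this comment (reading the paths out of warc.paths.gz and turning each one into a full file URL) might be sketched as below. This is only a sketch under assumptions, not how this PR or mlcroissant implements it: the function name `iter_file_urls` is hypothetical, the base URL is inferred from the File1..File4 examples above, and an in-memory gzip buffer stands in for the real manifest.

```python
import gzip
import io
from typing import BinaryIO, Iterator

# Assumption: base URL inferred from the example file URLs in the comment.
BASE_URL = "http://data.commoncrawl.org/"


def iter_file_urls(manifest: BinaryIO, base_url: str = BASE_URL) -> Iterator[str]:
    """Lazily yield one full URL per non-empty line of a gzipped path manifest.

    Hypothetical helper: it streams the manifest line by line instead of
    loading it into memory at once.
    """
    with gzip.open(manifest, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:
                # Join with exactly one slash between base URL and path.
                yield base_url.rstrip("/") + "/" + path.lstrip("/")


# In-memory stand-in for warc.paths.gz, using two paths from the comment.
sample_paths = [
    "cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/"
    "part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet",
    "cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/"
    "part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet",
]
manifest = io.BytesIO(gzip.compress("\n".join(sample_paths).encode("utf-8")))

urls = list(iter_file_urls(manifest))
for url in urls:
    print(url)
```

In this reading, the lines of the manifest are what you get from the FileObject, and the resulting URLs identify the members of the FileSet.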
@mqzhou-dev Sorry sorry, I misunderstood, and Greg informs me it's OK to go down to record level :)
Then this looks good to me if we can clarify that the records are URLs at the FileSet level - and I might have misunderstood that as well :)
This PR stacks on top of #999, which is not merged yet. Once #999 is submitted, I will pull and rebase my commit on top of it.
Please only review the changes in c02b1ba.