
Conversation

@mqzhou-dev commented Jan 22, 2026

This PR stacks on top of #999, which is not merged yet. Once #999 is submitted, I will pull and rebase my commit on top of it.

Please only review the changes in c02b1ba.

@mqzhou-dev requested a review from a team as a code owner, January 22, 2026 01:07

github-actions bot commented Jan 22, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

Comment on lines +240 to +261
  ],
  "recordSet": [
    {
      "@id": "warc-records",
      "@type": "cr:RecordSet",
      "field": [
        {
          "@id": "warc-records/url",
          "@type": "cr:Field",
          "name": "url",
          "dataType": "sc:URL",
          "source": {
            "fileSet": {
              "@id": "warc-files"
            },
            "extract": {
              "fileProperty": "fullpath"
            }
          }
        }
      ]
    }
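
For illustration, a rough sketch of how a consumer might materialize the record set declared in this excerpt with the mlcroissant reference library. The record-set id "warc-records" is taken from the excerpt; the local metadata path is a placeholder, and this is not part of the PR's tooling, just one reading of what the source declaration above would yield.

import mlcroissant as mlc

# Placeholder path for the Croissant metadata file edited in this PR.
ds = mlc.Dataset(jsonld="croissant.json")

# With the source declared above, each record of "warc-records" should correspond to
# one file matched by the "warc-files" FileSet, with its "url" field extracted from
# that file's fullpath.
for record in ds.records(record_set="warc-records"):
    print(record)
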
@handecelikkanat (Contributor) commented Jan 23, 2026

@mqzhou-dev This means we will need to describe our dataset at record level, correct? I cannot tag Greg here but afaik we do not want to go down to record level in our description. I asked him separately to confirm.
EDIT: Nope, I misunderstood.

I'll confirm once I hear from Greg.

@handecelikkanat (Contributor) commented Jan 23, 2026

@mqzhou-dev

I am unsure whether our URLs are records at the FileSet level (as here in the proposal?) or records at the FileObject level.

We have a FileObject called warc.paths.gz, whose content is a bunch of URLs (all incomplete).

So warc.paths.gz is like this:

cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet

Each of these URLs describes one file in the FileSet.

So the associated FileSet is composed of:

File1: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File2: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File3: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File4: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
...
  • So I think our URLs are records at the FileObject level.
    • i.e. we get the URLs when we read the FileObject.
  • And records at the FileSet level are Parquet objects.
    • i.e. when we read the FileSet objects we get a bunch of Parquet files.
  • If I am correct, the source @id for the URLs must be warc.paths.gz (the FileObject) rather than warc-paths (the FileSet).

But I might be confusing the record definition here for a FileSet object that is manifested from a FileObject :) Fairly complex idea :)
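
For illustration only, a minimal sketch of what "reading the FileObject" means here, assuming warc.paths.gz is a gzipped text file with one relative path per line as shown above. The contentUrl below is a placeholder, not the real location; the base prefix is the one used in the File1..File4 list above.

import gzip
import urllib.request

# Placeholder: the real contentUrl of the warc.paths.gz FileObject is not given in this thread.
PATHS_GZ_URL = "http://data.commoncrawl.org/placeholder/warc.paths.gz"
BASE_URL = "http://data.commoncrawl.org/"  # prefix used in the File1..File4 list above

# Reading the FileObject yields one relative path per line...
with urllib.request.urlopen(PATHS_GZ_URL) as resp:
    relative_paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

# ...and each path, joined with the base, is the URL of one file in the FileSet.
fileset_urls = [BASE_URL + p for p in relative_paths if p]
for url in fileset_urls[:4]:
    print(url)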

@handecelikkanat (Contributor) commented Jan 23, 2026

@mqzhou-dev Sorry sorry, I misunderstood, and Greg informs me it's OK to go down to record level :)

Then it looks good to me if we can clarify that the records are URLs at the FileSet level - and I might have misunderstood that as well :)
