Enhance `read_lines` to yield `Path` objects for streaming and add integration test `commoncrawl` dataset #1001

mqzhou-dev · 2026-01-22T01:07:59Z

This PR stacks on top of #999, which is not merged yet. Once #999 is submitted, I will pull and rebase my commit on top of it.

Please only review the changes in c02b1ba.

github-actions · 2026-01-22T01:08:08Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

handecelikkanat · 2026-01-23T16:55:56Z

datasets/1.1/commoncrawl-CC-MAIN-2025-43-draft/metadata.json

+  ],
+  "recordSet": [
+    {
+      "@id": "warc-records",
+      "@type": "cr:RecordSet",
+      "field": [
+        {
+          "@id": "warc-records/url",
+          "@type": "cr:Field",
+          "name": "url",
+          "dataType": "sc:URL",
+          "source": {
+            "fileSet": {
+              "@id": "warc-files"
+            },
+            "extract": {
+              "fileProperty": "fullpath"
+            }
+          }
+        }
+      ]
+    }


@mqzhou-dev This means we will need to describe our dataset at the record level, correct? I cannot tag Greg here, but as far as I know we do not want to go down to the record level in our description. I asked him separately to confirm.
EDIT: Nope, I misunderstood.

I'll confirm once I hear from Greg.

@mqzhou-dev

I am unsure whether our URLs are records at the FileSet level (as in the proposal here?) or records at the FileObject level.

We have a FileObject called warc.paths.gz, whose content is a list of URLs (all incomplete).

So warc.paths.gz is like this:

cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet

Each of these URLs describes one file in the FileSet.

So the associated FileSet is composed of:

File1: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File2: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File3: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00002-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
File4: http://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/part-00003-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet
...

So I think our URLs are records at the FileObject level, i.e. we get the URLs when we read the FileObject.

The records at the FileSet level are parquet objects, so when we read the FileSet we get a bunch of parquet files.
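For illustration, the relationship described above (a gzipped manifest whose lines, once prefixed, name the files of the FileSet) can be sketched in Python. This is only a minimal sketch under stated assumptions: `iter_manifest_urls` is a hypothetical helper and not part of mlcroissant, the base URL is taken from the File1..File4 examples above, and a small in-memory manifest stands in for the real warc.paths.gz.

```python
import gzip
import io
from typing import Iterator

# Assumption: base URL copied from the File1..File4 examples in this thread.
BASE_URL = "http://data.commoncrawl.org/"


def iter_manifest_urls(manifest_bytes: bytes, base_url: str = BASE_URL) -> Iterator[str]:
    """Yield one full URL per non-empty line of a gzipped path manifest.

    Hypothetical helper for illustration only; not an mlcroissant API.
    """
    # gzip.open accepts a file-like object; "rt" decodes the lines as text.
    with gzip.open(io.BytesIO(manifest_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if path:  # skip blank lines
                yield base_url + path


# In-memory stand-in for warc.paths.gz, built from the first two paths quoted above.
PREFIX = "cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-51/subset=crawldiagnostics/"
NAMES = [
    "part-00000-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet",
    "part-00001-2e1354aa-67a6-459b-81f6-7e2c39db0a5b.c000.gz.parquet",
]
sample_manifest = gzip.compress(
    ("\n".join(PREFIX + name for name in NAMES) + "\n").encode("utf-8")
)

# Reading the FileObject gives URLs; each URL then names one file of the FileSet.
urls = list(iter_manifest_urls(sample_manifest))
for url in urls:
    print(url)
```

In this sketch the URLs are what you get from reading the manifest (the FileObject), while the parquet files they point to would be what you get from reading the FileSet.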

If I am correct, the source @id for the URLs must be warc.paths.gz (FileObject) rather than warc-paths (FileSet).

But I might be confusing the record definition here for a FileSet manifested from a FileObject :) It is a fairly complex idea :)

@mqzhou-dev Sorry, I misunderstood, and Greg informs me it's OK to go down to the record level :)

Then this looks good to me if we can clarify that the records are URLs at the FileSet level, though I might have misunderstood that as well :)

ccl-core added 2 commits January 19, 2026 22:47

Fixing commoncrawl definition and some small bugs

4fd13de

Fix mypy and pytype

49e48a9

mqzhou-dev requested a review from a team as a code owner January 22, 2026 01:07

mqzhou-dev force-pushed the pr-999 branch 6 times, most recently from c1ffb10 to c02b1ba Compare January 22, 2026 02:01

mqzhou-dev requested review from benjelloun and ccl-core January 22, 2026 02:13

mqzhou-dev mentioned this pull request Jan 22, 2026

Fixing commoncrawl definition and some small bugs #999

Open

mqzhou-dev force-pushed the pr-999 branch 2 times, most recently from 23b60a2 to 90968f8 Compare January 22, 2026 10:20

Enhance read_lines to yield Path objects for streaming and add a …

97022be

…`commoncrawl` dataset example.

mqzhou-dev force-pushed the pr-999 branch from 90968f8 to 97022be Compare January 22, 2026 10:33

mqzhou-dev requested a review from handecelikkanat January 22, 2026 16:36

handecelikkanat reviewed Jan 23, 2026
