@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python 449 90

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 203 16

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java 124 14

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java 37 18

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook 28 5

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook 60 10

Repositories

Showing 10 of 75 repositories
  • cc-index-annotations Public

    Example code to join an annotation to a host or URL index

    Python 1 0 0 0 Updated Dec 10, 2025
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook 28 5 0 1 Updated Dec 8, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    66 84 3 1 Updated Dec 5, 2025
  • whirlwind-python-notebook Public

    A Jupyter notebook illustrating the basics of Common Crawl's datasets.

    Jupyter Notebook 2 Apache-2.0 0 0 0 Updated Dec 5, 2025
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 203 Apache-2.0 16 2 1 Updated Dec 5, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java 102 Apache-2.0 4 2 (1 issue needs help) 0 Updated Dec 4, 2025
  • cc-index-table Public

    Index Common Crawl archives in tabular format

    Java 124 Apache-2.0 14 7 1 Updated Dec 4, 2025
  • nutch Public Forked from Aloisius/nutch

    Common Crawl fork of Apache Nutch

    Java 40 Apache-2.0 1,269 6 (1 issue needs help) 1 Updated Dec 4, 2025
  • crawler-commons Public Forked from crawler-commons/crawler-commons

    A set of reusable Java components that implement functionality common to any web crawler

    Java 2 Apache-2.0 91 0 2 Updated Dec 4, 2025
  • ia-hadoop-tools Public Forked from Aloisius/ia-hadoop-tools

    Web archiving tools on Hadoop

    Java 4 29 2 1 Updated Dec 3, 2025