---
title: Petabyte-Scale Web Crawling and Data Processing
logo: success-stories/ahrefs.svg
card_logo: success-stories/white/ahrefs.svg
background: /success-stories/ahrefs-bg.jpg
theme: blue
synopsis: "Ahrefs built the world's third-largest web crawler using OCaml, processing 500 billion requests daily and indexing petabytes of web data with a lean, efficient team."
url: https://ahrefs.com/
priority: 2
why_ocaml_reasons:
- Performance
- Reliability
- Expressiveness
- Native Compilation
- Industrial Strength
- Scalability
- Maintainability
---

## Challenge

Ahrefs is a Singapore-based SaaS company that provides comprehensive SEO tools and marketing intelligence powered by big data. Since 2011, they've been crawling the entire web daily to maintain extensive databases of backlinks, keywords, and website analytics that help businesses with SEO strategy, competitor analysis, and content optimization. Today, they're trusted by 44% of Fortune 500 companies.

Building and operating a web crawler at internet scale presents extraordinary challenges. Ahrefs needed to index billions of web pages continuously, process petabytes of data in real time, and turn this massive dataset into actionable insights for thousands of customers worldwide. The technical demands are staggering: their systems must handle **500 billion backend requests per day** while maintaining **over 100PB of storage**.

As a self-funded company, Ahrefs couldn't solve these challenges by throwing unlimited resources at the problem. They needed maximum efficiency from a small team—systems that could run reliably for months without intervention, code that could be understood and maintained by a lean engineering organization, and performance that could compete with tech giants despite having a fraction of their headcount.

The question wasn't just whether they could build a web-scale crawler, but whether they could do it sustainably within the constraints of a bootstrapped company.

## Result

Over a decade later, Ahrefs operates one of the world's most sophisticated web crawling operations, ranking as the **third-largest web crawler globally**. Their OCaml-powered systems process **500 billion requests daily**, maintain an index of **456.5 billion pages** across **267.6 million domains**, and update metrics for **300 million pages every 24 hours**.

This technical achievement translates directly to business success. Ahrefs has grown into a **$100M+ ARR company** with **150 employees** managing **4000+ servers**—all while maintaining their original philosophy of operational efficiency. They've become the sector leader in SEO tools, proving that the right technology choices can create sustainable competitive advantages.

The reliability of their OCaml systems is perhaps most impressive: programs written years ago continue running without surprises, requiring minimal maintenance from their engineering team. This "boring" reliability has allowed Ahrefs to focus engineering effort on building new features and capabilities rather than fighting infrastructure fires.

Their success demonstrates that OCaml can power not just technical excellence at massive scale, but sustainable business growth in highly competitive markets.

## Why OCaml

Ahrefs chose OCaml because it solved their central constraint: building world-class infrastructure with limited resources.

* **Expressiveness reduces team requirements** - OCaml allowed their small team to develop crawling and data processing systems in relatively few lines of code, essential when you can't hire armies of engineers the way big tech companies can.
* **Reliability minimizes operational overhead** - Systems run for months without surprises, crucial when you can't afford large operations teams to babysit infrastructure.
* **Native performance handles web scale** - Compilation to native code provided the performance needed for processing 500 billion requests daily without requiring expensive hardware optimizations.
* **Type safety prevents data disasters** - When processing petabytes of evolving web data, catching format issues at compile time rather than in production saves hours of debugging and prevents costly system failures.
* **Language philosophy matches business model** - OCaml's expressiveness made it economical to create specialized, efficient systems tailored to their exact requirements rather than adapting bloated generic solutions.

## Solution

Ahrefs built their crawling infrastructure around OCaml's strengths, creating a distributed system that balances performance, reliability, and maintainability. **[OCaml](https://ocaml.org/)** serves as the primary language for all crawling and data processing systems, compiled natively for maximum performance across their **4000+ servers**.

The architecture treats data consistency as paramount. Using **[ATD (Adjustable Type Definitions)](https://github.com/ahrefs/atd)** to define shared data structures, they ensure type safety throughout their processing pipeline—from initial web crawling through to final data storage. This approach catches schema mismatches at compile time rather than runtime, crucial when processing billions of pages daily.
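
As a minimal sketch of how a shared ATD schema keeps a pipeline type-safe, a definition might look like the following (the type and field names here are illustrative, not Ahrefs' actual data model):

```ocaml
(* page.atd: a hypothetical schema sketch.
   Running `atdgen -t page.atd` generates the OCaml type definitions,
   and `atdgen -j page.atd` generates JSON (de)serializers, so every
   stage of the pipeline shares one compile-time-checked data shape. *)
type page = {
  url : string;
  status : int;                 (* HTTP status recorded by the crawler *)
  ?title : string option;       (* optional: may be absent in older records *)
  ~backlinks : string list;     (* `~` gives a default: [] when missing *)
}
```

A schema change in one place (say, renaming `backlinks`) then fails to compile in every consumer, rather than surfacing later as a runtime parse error somewhere in a petabyte-scale pipeline.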

Their storage layer combines **[ClickHouse](https://clickhouse.com/)** for analytical workloads, **[MySQL](https://www.mysql.com/)** for transactional data, and **[Elasticsearch](https://www.elastic.co/)** for search functionality, all orchestrated on **[AWS](https://aws.amazon.com/)**. The key insight was designing these systems to work together seamlessly through shared OCaml types rather than complex API layers.

Ahrefs maintains their own libraries and frameworks rather than relying on generic solutions. This "build it ourselves" philosophy requires more initial investment but delivers systems perfectly tailored to web crawling demands. Their **1.5 million lines of OCaml code** represent years of accumulated domain expertise encoded in reliable, maintainable software.

The result is a unified system where improvements to crawling algorithms, data processing pipelines, or storage efficiency can be implemented quickly and deployed confidently across their entire infrastructure.

## Lessons Learned

Ahrefs' experience building web-scale infrastructure in OCaml offers valuable insights:

* **Reliability pays compound interest**: OCaml's "boring" stability means systems built years ago still run without surprises, freeing engineering time for new capabilities rather than maintenance.
* **Types scale better than tests**: At petabyte scale, compile-time guarantees about data consistency prevent entire classes of runtime failures that would be catastrophic at this volume.
* **Expressiveness enables specialization**: OCaml's high-level abstractions made it economical to build highly specialized systems rather than adapting generic frameworks to their unique requirements.
* **Small teams can compete with giants**: The right language choice allowed Ahrefs to build infrastructure that competes with tech giants despite having a fraction of their resources.
* **Performance and maintainability aren't mutually exclusive**: OCaml's combination of native compilation and high-level abstractions delivered both the performance needed for web scale and the clarity needed for long-term maintenance.

## Open Source

Ahrefs supports the OCaml ecosystem through contributions that benefit infrastructure and data processing applications:

- **[Ahrefs DevKit](https://github.com/ahrefs/devkit):** Tools and utilities for building distributed applications.
- **[OCaml Community Tools](https://github.com/ocaml-community):** Contributions to widely used infrastructure tools like `ocurl` and `ocaml-mariadb`.
- **[ATD](https://github.com/ahrefs/atd):** Schema definition language for cross-platform data serialization.