@@ -1,12 +1,10 @@
---
title: Robotic process automation
title: What is robotic process automation (RPA)
description: Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.
sidebar_position: 8.7
slug: /concepts/robotic-process-automation
---

# What is robotic process automation (RPA)? {#what-is-robotic-process-automation-rpa}

**Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.**

---
@@ -1,12 +1,11 @@
---
title: I - Webhooks & advanced Actor overview
title: Webhooks & advanced Actor overview
description: Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.
sidebar_position: 6.1
sidebar_label: I - Webhooks & advanced Actor overview
slug: /expert-scraping-with-apify/actors-webhooks
---

# Webhooks & advanced Actor overview {#webhooks-and-advanced-actors}

**Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.**

---
@@ -15,7 +14,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh

## Advanced Actor overview {#advanced-actors}

In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in three short lessons [here](../../webscraping/scraping_basics_javascript/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.
In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in [three short lessons](../../webscraping/scraping_basics_javascript/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.

Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile).
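
For orientation, a minimal Node.js Actor Dockerfile — roughly the shape you'll find in the linked docs and in generated Actor templates — might look like the sketch below. The base image tag and start command are assumptions, so check the Dockerfile in your own project:

```dockerfile
# Sketch of a typical Node.js Actor image; details may differ in your project.
FROM apify/actor-node:20

# Copy only the package files first so the dependency layer can be cached.
COPY package*.json ./

# Install production dependencies only.
RUN npm install --omit=dev

# Copy the rest of the Actor's source code into the image.
COPY . ./

# Start the Actor when the container spins up.
CMD npm start
```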

@@ -1,12 +1,11 @@
---
title: IV - Apify API & client
title: Apify API & client
description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
sidebar_position: 6.4
sidebar_label: IV - Apify API & client
slug: /expert-scraping-with-apify/apify-api-and-client
---

# Apify API & client {#api-and-client}

**Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.**

---
@@ -1,12 +1,11 @@
---
title: VI - Bypassing anti-scraping methods
title: Bypassing anti-scraping methods
description: Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.
sidebar_position: 6.6
sidebar_label: VI - Bypassing anti-scraping methods
slug: /expert-scraping-with-apify/bypassing-anti-scraping
---

# Bypassing anti-scraping methods {#bypassing-anti-scraping-methods}

**Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.**

---
@@ -1,12 +1,11 @@
---
title: II - Managing source code
title: Managing source code
description: Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.
sidebar_position: 6.2
sidebar_label: II - Managing source code
slug: /expert-scraping-with-apify/managing-source-code
---

# Managing source code {#managing-source-code}

**Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.**

---
@@ -1,12 +1,11 @@
---
title: V - Migrations & maintaining state
title: Migrations & maintaining state
description: Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected.
sidebar_position: 6.5
sidebar_label: V - Migrations & maintaining state
slug: /expert-scraping-with-apify/migrations-maintaining-state
---

# Migrations & maintaining state {#migrations-maintaining-state}

**Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected.**

---
@@ -1,19 +1,18 @@
---
title: VII - Saving useful run statistics
title: Saving useful run statistics
description: Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.
sidebar_position: 6.7
sidebar_label: VII - Saving useful run statistics
slug: /expert-scraping-with-apify/saving-useful-stats
---

# Saving useful run statistics {#savings-useful-run-statistics}

**Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.**

---

Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a Key-Value store or Dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results or potentially include statistics about each request within its corresponding dataset item.

The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of captchas hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important.
The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of CAPTCHAs hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important.
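
As a rough illustration (not the course's exact solution), such statistics can live in a plain object that gets persisted to the Key-Value store; the `STATS` key name and the fields below are just assumptions:

```js
import { Actor } from 'apify';

await Actor.init();

// Load previously persisted stats (e.g. after a resurrection), or start fresh.
const stats = (await Actor.getValue('STATS')) ?? { itemsSaved: 0, retries: 0, captchasHit: 0 };

// Persist the stats whenever the platform emits a persistState event (e.g. before a migration).
Actor.on('persistState', async () => {
    await Actor.setValue('STATS', stats);
});

// ...inside your request handlers, mutate `stats` as things happen...
stats.itemsSaved += 1;

// Save the final numbers before the run finishes.
await Actor.setValue('STATS', stats);
await Actor.exit();
```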

## Learning 🧠 {#learning}

@@ -1,12 +1,11 @@
---
title: V - Handling migrations
title: Handling migrations
description: Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.
sidebar_position: 5
sidebar_label: V - Handling migrations
slug: /expert-scraping-with-apify/solutions/handling-migrations
---

# Handling migrations {#handling-migrations}

**Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.**

---
@@ -5,8 +5,6 @@ sidebar_position: 6.7
slug: /expert-scraping-with-apify/solutions
---

# Solutions

**View all of the solutions for all of the activities and tasks of this course. Please try to complete each task on your own before reading the solution!**

---
@@ -1,12 +1,11 @@
---
title: I - Integrating webhooks
title: Integrating webhooks
description: Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!
sidebar_position: 1
sidebar_label: I - Integrating webhooks
slug: /expert-scraping-with-apify/solutions/integrating-webhooks
---

# Integrating webhooks {#integrating-webhooks}

**Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!**

---
@@ -1,12 +1,11 @@
---
title: II - Managing source
title: Managing source
description: View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.
sidebar_position: 2
sidebar_label: II - Managing source
slug: /expert-scraping-with-apify/solutions/managing-source
---

# Managing source

**View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.**

---
@@ -1,12 +1,11 @@
---
title: VI - Rotating proxies/sessions
title: Rotating proxies/sessions
description: Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.
sidebar_position: 6
sidebar_label: VI - Rotating proxies/sessions
slug: /expert-scraping-with-apify/solutions/rotating-proxies
---

# Rotating proxies/sessions {#rotating-proxy-sessions}

**Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.**

---
@@ -1,12 +1,11 @@
---
title: VII - Saving run stats
title: Saving run stats
description: Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.
sidebar_position: 7
sidebar_label: VII - Saving run stats
slug: /expert-scraping-with-apify/solutions/saving-stats
---

# Saving run stats {#saving-stats}

**Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.**

---
@@ -1,12 +1,11 @@
---
title: IV - Using the Apify API & JavaScript client
title: Using the Apify API & JavaScript client
description: Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.
sidebar_position: 4
sidebar_label: IV - Using the Apify API & JavaScript client
slug: /expert-scraping-with-apify/solutions/using-api-and-client
---

# Using the Apify API & JavaScript client {#using-api-and-client}

**Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.**

---
@@ -1,12 +1,11 @@
---
title: III - Using storage & creating tasks
title: Using storage & creating tasks
description: Get quiz answers and explanations for the lesson about using storage and creating tasks on the Apify platform.
sidebar_position: 3
sidebar_label: III - Using storage & creating tasks
slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
---

# Using storage & creating tasks {#using-storage-creating-tasks}

## Quiz answers 📝 {#quiz-answers}

**Q: What is the relationship between Actors and tasks?**
@@ -1,12 +1,11 @@
---
title: III - Tasks & storage
title: Tasks & storage
description: Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.
sidebar_position: 6.3
sidebar_label: III - Tasks & storage
slug: /expert-scraping-with-apify/tasks-and-storage
---

# Tasks & storage {#tasks-and-storage}

**Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.**

---
@@ -5,8 +5,6 @@ sidebar_position: 14.1
slug: /node-js/analyzing-pages-and-fixing-errors
---

# How to analyze and fix errors when scraping a website {#scraping-with-sitemaps}

**Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.**

---
@@ -7,8 +7,6 @@ slug: /node-js/dealing-with-dynamic-pages

import Example from '!!raw-loader!roa-loader!./dealing_with_dynamic_pages.js';

# How to scrape from dynamic pages {#dealing-with-dynamic-pages}

**Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?**

---
@@ -31,7 +31,7 @@ Sitemap is usually a simple XML file that contains a list of all pages on the we
- _Does not directly reflect the website_ - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
- _Updated in intervals_ - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
- _Hard to find or unavailable_ - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
- _Streamed, compressed, and archived_ - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.
- _Streamed, compressed, and archived_ - Sitemaps are often streamed and archived with .tgz extensions and compressed with Gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.
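
To illustrate that last point, handling a gzip-compressed sitemap in plain Node.js takes only a little extra code — the URL below is a placeholder and the regex is a simplification of real XML parsing:

```js
import { gunzipSync } from 'node:zlib';

// Placeholder URL — real sitemaps are often listed in robots.txt or served from a CDN.
const response = await fetch('https://example.com/sitemap.xml.gz');
const compressed = Buffer.from(await response.arrayBuffer());

// Decompress the gzipped payload into the raw XML string.
const xml = gunzipSync(compressed).toString('utf8');

// Naive extraction of <loc> entries; a scraping framework would handle this for you.
const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((match) => match[1]);
console.log(`Found ${urls.length} URLs`);
```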

## Pros and cons of categories, search, and filters

@@ -5,8 +5,6 @@ sidebar_position: 2
slug: /anti-scraping/mitigation/using-proxies
---

# Using proxies {#using-proxies}

**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.**

---
@@ -8,8 +8,6 @@ slug: /puppeteer-playwright/executing-scripts/collecting-data
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Extracting data {#extracting-data}

**Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.**

---
4 changes: 1 addition & 3 deletions sources/academy/webscraping/puppeteer_playwright/index.md
@@ -1,5 +1,5 @@
---
title: Puppeteer & Playwright
title: Puppeteer & Playwright course
description: Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.
sidebar_position: 3
category: web scraping & automation
@@ -9,8 +9,6 @@ slug: /puppeteer-playwright
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Puppeteer & Playwright course {#puppeteer-playwright-course}

**Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.**

---
@@ -8,15 +8,13 @@ slug: /puppeteer-playwright/page/interacting-with-a-page
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Interacting with a page {#interacting-with-a-page}

**Learn how to programmatically do actions on a page such as clicking, typing, and pressing keys. Also, discover a common roadblock that comes up when automating.**

---

The **Page** object has a whole boat-load of functions which can be used to interact with the loaded page. We're not going to go over every single one of them right now, but we _will_ use a few of the most common ones to add some functionality to our current project.
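
As a quick taste of what those functions look like (a sketch, not the course code — the selectors and URL are made up), the same calls exist on the `page` object in both Puppeteer and Playwright:

```js
import { chromium } from 'playwright'; // Puppeteer's page object exposes the same methods used below

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://example.com'); // placeholder URL

await page.click('button#accept-cookies'); // click an element (hypothetical selector)
await page.type('input[name="q"]', 'hello world'); // type into an input (hypothetical selector)
await page.keyboard.press('Enter'); // press a key
await page.screenshot({ path: 'screenshot.png' }); // write a screenshot to the filesystem

await browser.close();
```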

Let's say that we want to automate searching for **hello world** on Google, then click on the first result and log the title of the page to the console, then take a screenshot and write it it to the filesystem. In order to understand how we're going to automate this, let's break down how we would do it manually:
Let's say that we want to automate searching for **hello world** on Google, then click on the first result and log the title of the page to the console, then take a screenshot and write it to the filesystem. In order to understand how we're going to automate this, let's break down how we would do it manually:

1. Click on the button which accepts Google's cookies policy (To see how it looks, open Google in an anonymous window.)
2. Type **hello world** into the search bar