Pro Scraping guide: Using Incogniton and Puppeteer to extract data
This article takes you a step further, teaching the step-by-step approach to professional scraping using the Incogniton API with Puppeteer (Node.js).
Puppeteer, a key tool in the scraping repertoire, excels at automating browser interactions, while Incogniton enhances your privacy with robust anti-fingerprinting and web-unblocking technologies. Together, they help you scrape efficiently and securely without compromising on anonymity.
Dynamic Content: A Quick Overview
To understand why dynamic content calls for more advanced tools and approaches, you first need to understand what it is.
JavaScript-rendered content, or dynamic content, refers to webpage content that is not present in the initial server-rendered HTML but is instead loaded or modified by client-side JavaScript code. Simply put, it is content that your browser generates by running JavaScript.
Examples include lazy-loaded images, infinite scrolling, and pages in Single-page applications (SPAs).
Whenever you need to confirm whether a page is dynamically rendered, disabling JavaScript on the page is the way to go. You can disable JS in any Chromium-based browser by following these steps:
- Open Chrome DevTools (F12).
- Press Ctrl + Shift + P (or Cmd + Shift + P on Mac) to open the Command Menu.
- Enter “Disable JavaScript” in the search bar and select the corresponding option.
- Refresh the page and observe the difference: JavaScript-generated content is not populated.
To return to the default state, close the DevTools panel and refresh the page.
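If you'd rather check this programmatically, Puppeteer's page.setJavaScriptEnabled() lets you compare a page with and without JavaScript. Below is a minimal sketch; it assumes the full puppeteer package (which bundles its own Chromium) is installed, separate from the puppeteer-core setup we use with Incogniton later on.
import puppeteer from "puppeteer";

// Compare a page's HTML size with and without JavaScript to spot dynamic rendering
const checkDynamicRendering = async (url) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // First pass: JavaScript disabled, roughly what the server sends
  await page.setJavaScriptEnabled(false);
  await page.goto(url, { waitUntil: "domcontentloaded" });
  const staticLength = (await page.content()).length;
  // Second pass: JavaScript enabled, what a real visitor sees
  await page.setJavaScriptEnabled(true);
  await page.goto(url, { waitUntil: "networkidle0" });
  const renderedLength = (await page.content()).length;
  console.log({ staticLength, renderedLength }); // a large gap suggests dynamic content
  await browser.close();
};

checkDynamicRendering("https://quotes.toscrape.com/js/");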
Now back to scraping. Let’s take a closer look at our scraping tools.
Puppeteer
Puppeteer is a JavaScript library that automates Chrome and Chromium browsers via the Chrome DevTools Protocol (CDP), enabling you to run scripts programmatically and simulate user interactions such as clicking or typing.
Incogniton browser
Incogniton is an anti-detect browser with built-in Puppeteer integration, providing anonymity for web scraping and other automation tasks.
The browser also comes with free proxies and supports proxy integration, enabling traffic routing through different IPs to bypass geo-restrictions and IP blocks. Together, these features establish Incogniton as a dependable web-unblocking browser — ideal for tasks that prioritise privacy.
Scraping Dynamic Content with Incogniton Headless Chrome
The Incogniton-Puppeteer API lets us create fully isolated browser profiles with distinct browser fingerprints, combining advanced anonymity with automated, headless browsing.
Note that some features of Incogniton are available as paid options.
Let’s get to the process, shall we?
Step 1: Install and set up Incogniton
If you already have an Incogniton account, create a profile and grab its profile ID. Otherwise, follow these steps for a quick start:
- Visit the Incogniton download page. Select the version for your device (Windows or macOS) and download the Incogniton app.
- Install the app on your computer, following the prompts. As the app installs, navigate to the website, choose a plan and create an account.
- Upon installation, sign in with your credentials.
- Navigate to profile management and create a new profile.
- Set up your proxy for IP rotation; Incogniton provides a suite of proxy deals.
- Complete the profile creation process and get the profile ID.
Keep the Incogniton App open and ensure the profile status shows “Ready” and not “Launching” or “Syncing” before you run your script.
Step 2: Set up the Incogniton-Puppeteer API
Create a new file named anon-scraper2.js in the root folder. Then, install the puppeteer-core library. I opt for the lightweight puppeteer-core over puppeteer because it does not bundle Chrome, which makes it ideal when you already have a Chrome instance to attach to, as we do with Incogniton.
npm install puppeteer-core
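One assumption worth flagging: the snippets in this guide use ES-module import syntax, so your project needs to opt in to ESM, either by renaming the script to anon-scraper2.mjs or by adding "type": "module" to package.json, for example with:
npm pkg set type=module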
The next thing is to launch the browser through Incogniton's local API and establish a connection between it and Puppeteer using the puppeteer.connect() function.
import puppeteer from "puppeteer-core";
// promise-based delay helper, used below to wait out the browser boot
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// non-headless launch
const startIncogniton = async ({ profileId }) => {
try {
const launchUrl = `http://localhost:35000/automation/launch/puppeteer`;
const requestBody = {
profileID: profileId,
};
// Make a POST request with body data
const response = await fetch(launchUrl, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(requestBody),
});
const data = await response.json();
const { puppeteerUrl } = data;
// Wait for the browser to launch
console.log('The Incogniton browser is launching...');
await delay(30000); // await initial boot process
// Connect Puppeteer to the launched browser
const browser = await puppeteer.connect({
browserURL: puppeteerUrl,
acceptInsecureCerts: true,
});
return browser;
} catch (error) {
console.error("Error starting Incogniton session ->", error);
throw error;
}
};
The code above connects Puppeteer to the newly created Incogniton browser instance by fetching the browser's puppeteerUrl using the profile ID you provide. It then establishes an HTTP connection to the browser with the puppeteer.connect() function.
The delay mechanism ensures the browser launch process is complete before the script proceeds.
That said, a headless browser instance is better suited for large-scale scraping and other RPA (Robotic Process Automation) tasks that require fast execution times and reduced resource consumption.
Why? Because headless browsers operate without rendering a graphical interface, which boosts performance and lowers resource overhead. So we'll go with that.
To use the Incogniton headless Chrome, add the customArgs field to the request body as shown in the code below:
import puppeteer from "puppeteer-core";
// headless launch (reuses the delay() helper defined earlier)
const startIncogniton = async ({ profileId }) => {
try {
const launchUrl = `http://localhost:35000/automation/launch/puppeteer`;
const requestBody = {
profileID: profileId,
customArgs: '--headless=new', // use headless mode
};
// Make a POST request with body data
const response = await fetch(launchUrl, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(requestBody),
});
const data = await response.json();
const { puppeteerUrl } = data;
// Wait for the browser to launch
console.log('The Incogniton browser is launching...');
await delay(30000); // await initial boot process
// Connect Puppeteer to the launched browser
const browser = await puppeteer.connect({
browserURL: puppeteerUrl,
acceptInsecureCerts: true,
});
return browser;
} catch (error) {
console.error('Error starting Incogniton session ->', error);
throw error;
}
};
If you're used to working with Puppeteer on its own, you might be more familiar with the puppeteer.launch() function; however, we use puppeteer.connect() because, once again, we're connecting to an already active Incogniton browser instance rather than launching a new one.
Note: The default Incogniton port is 35000. If you've configured Incogniton to run on a different port, adjust the launchUrl accordingly.
To confirm that the anti-detect browser is working as expected, let’s conduct an IPHey test. See script below:
// imports...
const incognitonProfileId = "YOUR_INCOGNITON_PROFILE_ID";
const ipheyTest = async (browser) => {
try {
const page = await browser.newPage();
// Navigate to the IPHey website and wait till zero network requests
await page.goto("https://iphey.com/", { waitUntil: "networkidle0" });
// Check for 'trustworthy status' in the DOM
const ipResult = await page.$eval(
".trustworthy-status:not(.hide)",
(elt) => (elt ? elt.innerText.trim() : "")
);
console.log("IP Result ->", ipResult); // expected output: 'Trustworthy'
await page.close();
} catch (error) {
console.error("Error during IPHEY test ->", error);
} finally {
await browser.close();
}
};
// Execute iphey test
const testIncognitonProfile = async () => {
const browser = await startIncogniton({ profileId: incognitonProfileId }); // your profile Id
await ipheyTest(browser);
};
testIncognitonProfile();
All things being equal, the ipheyTest() function should log 'Trustworthy'. Other possible results are 'Suspicious' or 'Not reliable'.
Troubleshooting tip: If you encounter any errors with the Incogniton API, close all browser tabs, perform a ‘force stop’ of the browser instance, and try again. If that doesn’t work, reach out to the Incogniton support.
One significant thing to note here is the use of { waitUntil: 'networkidle0' }, which instructs Puppeteer to wait until there are no remaining network connections for at least 500 ms. Using setTimeout() instead is common as well, but it is not a reliable option: a fixed delay doesn't guarantee the page is fully loaded under varying network conditions.
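If waiting for complete network idle ever feels too coarse or too slow for a given page, a more targeted alternative is to wait for the exact element you plan to read. Here is a minimal sketch of how the wait inside ipheyTest() could look with page.waitForSelector() instead:
// Navigate, then wait for the status element itself rather than for the network to go quiet
const waitForIpheyStatus = async (page) => {
  await page.goto("https://iphey.com/", { waitUntil: "domcontentloaded" });
  await page.waitForSelector(".trustworthy-status:not(.hide)", {
    visible: true,
    timeout: 30000,
  });
};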
Step 3: Scrape a dynamically loaded page
To demonstrate scraping client-side rendered data, we'll use the JS-generated content page from the Quotes to Scrape website. The page dynamically loads quotes with JavaScript. You can confirm this by disabling JavaScript on the page and refreshing, as described earlier: the quotes disappear.
The function below scrapes JS-generated data using the Incogniton API:
const scrapeDynamicContent = async (profileId) => {
try {
// Start Incogniton browser
const browser = await startIncogniton({ profileId });
const page = await browser.newPage();
// Navigate to the dynamic page with client-side rendering
await page.goto("https://quotes.toscrape.com/js/", {
waitUntil: "networkidle0",
});
// Extract quotes and authors from dynamically rendered content
const quotes = await page.$$eval(".quote", (elements) =>
elements.map((element) => ({
text:
element.querySelector(".text") &&
element.querySelector(".text").innerText.trim(),
author:
element.querySelector(".author") &&
element.querySelector(".author").innerText.trim(),
}))
);
console.log("Extracted Quotes ->", quotes);
// Close the browser after scraping
await browser.close();
} catch (error) {
console.error("Error scraping dynamically loaded content ->", error);
}
};
Pay attention to how I have used .$$eval() rather than .$eval() here, since the intention is to extract every matching quote rather than a single element. Note also the && operator, which short-circuits on the queried field so that quotes with a missing field do not throw an error.
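If you prefer optional chaining over the && guard, the extraction callback inside the function can be written a little more concisely; this is just a stylistic variant of the same idea:
const quotes = await page.$$eval(".quote", (elements) =>
  elements.map((element) => ({
    // ?. short-circuits to undefined when the field is missing
    text: element.querySelector(".text")?.innerText.trim(),
    author: element.querySelector(".author")?.innerText.trim(),
  }))
);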
Here’s what the output data should look like:
Extracted Quotes -> [ {
text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
author: 'Albert Einstein'
},
{
text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
author: 'J.K. Rowling'
},
{
text: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
author: 'Jane Austen'
},
// other quotes...
]
Notice how I use the .trim() function to remove stray whitespace - a good example of proper data cleaning in real-world scenarios.
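If you want to take the cleaning a step further before storing the results, a small post-processing pass works well. Here is a minimal sketch; stripping the decorative curly quote marks is an illustrative choice, not something the site requires:
// Normalise scraped quotes: collapse whitespace and strip decorative “ ” marks
const cleanQuotes = (quotes) =>
  quotes.map(({ text, author }) => ({
    text: (text || "").replace(/[“”]/g, "").replace(/\s+/g, " ").trim(),
    author: (author || "").trim(),
  }));

// Usage: const cleaned = cleanQuotes(quotes);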
Speaking of real-world scenarios, in the next section you will learn how to handle pagination and crawl through pages recursively for professional, site-wide data extraction.
Handle paginated data scraping and web crawling requirements
Web crawling in scraping typically involves identifying links on a page, following those links to subsequent pages, and repeating the process recursively until all relevant data is collected. Most practical scraping scenarios require crawling through pages in this way.
It's far less complex than it sounds; just follow closely.
For paginated data scraping, we will keep working with the Quotes to Scrape website, extracting quotes and their authors across all of its pages.
Using the Incogniton-Puppeteer API, we implement a recursive function that follows the “next” link from page to page, continuing until no “next” link remains, which indicates we have reached the last page.
The recursive function is structured like this:
const scrapeRecursively = async ({ browser, givenPage, scrapeUrl, allData }) => {
try {
// Work on the page handle we were given
const page = givenPage;
await page.goto(scrapeUrl, { waitUntil: 'networkidle0' });
const quotes = await page.$$eval('.quote', elements =>
elements.map(el => ({
text: el.querySelector('.text').innerText.trim(),
author: el.querySelector('.author').innerText.trim(),
}))
);
allData.push(quotes);
const nextLink = await page.$('li.next a');
// if there's a next button, continue scraping
if (nextLink) {
const href = await nextLink.evaluate(el => el.href);
const nextUrl = new URL(href);
await scrapeRecursively({
browser,
givenPage: page,
scrapeUrl: nextUrl.href,
allData,
});
}
return {
data: allData,
};
} catch (error) {
console.error('Error scraping dynamically loaded content ->', error);
throw error;
}
};
// Usage example
const scrapeAllPages = async profileId => {
try {
const browser = await startIncogniton({ profileId });
const page = await browser.newPage();
const allData = [];
await scrapeRecursively({
browser,
givenPage: page,
scrapeUrl: 'https://quotes.toscrape.com/js/',
allData,
});
console.log('ALL SCRAPED DATA ->', allData);
await browser.close();
} catch (err) {
console.error(err);
throw err;
}
}
The function navigates to the URL, waits for the page to load, and extracts all quotes on the page. Next, it checks for a “next page” link using the li.next a selector. If found, it constructs the full URL and recursively calls itself to scrape the next page, until there are no further pages to scrape.
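One small detail: because each recursive call pushes a per-page array into allData, the final result is an array of arrays. If you'd rather work with one flat list, for counting or for the export step covered in the next article, flatten it after the crawl, for example inside scrapeAllPages():
// Flatten the per-page arrays into a single list of quotes
const allQuotes = allData.flat();
console.log(`Scraped ${allQuotes.length} quotes across all pages`);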
Mission complete! You’ve successfully scraped a paginated, dynamic website — all while gaining insights on ways to keep your private data private.
Find the complete source code for Anonymous-Scraper102 here.
Just a heads-up: this method isn’t a magic fix-all.
Yes, Incogniton’s real fingerprints ensure your automation software doesn’t come off as a bot, so as long as you stay responsible and ethical in its use, you’re unlikely to get hit with annoying anti-bot prompts like CAPTCHAs.
However, for those rare cases where you do get one, I'll dive into the solutions in a future article, so hit follow :)
Conclusion
And there you have it — everything you need to know about scraping the web anonymously. With this knowledge, you have what it takes to tackle web scraping challenges while maintaining your privacy and security. The possibilities are limitless.
Nonetheless, as I have reiterated throughout, always be mindful of each website's terms of service and uphold ethical practices.
In the next article, we’ll explore how to export your scraped data into formats like JSON or CSV, or store it in a database, so you can seamlessly process and analyse your data for insights. Don’t miss it.
Disclaimer: This guide is for educational use only. While scraping public data is generally accepted, you should always read the website’s terms, and follow the legal regulations of your region.