Pro Scraping Guide: Using Incogniton and Puppeteer to Extract Data

This guide takes you through the step-by-step approach to professional web scraping using the Incogniton API with Puppeteer (Node.js).

Puppeteer excels at automating browser interactions, while Incogniton enhances your privacy with robust anti-fingerprinting and Web Unblocker technologies. Together, they help you scrape efficiently and securely without compromising on anonymity.

Dynamic Content: A Quick Overview

JavaScript-rendered content, or dynamic content, refers to webpage content that is not present in the initial server-rendered HTML but is instead loaded or modified by client-side JavaScript. Simply put, it is content that your browser generates by running JavaScript after the page arrives.

Examples include lazy-loaded images, infinite scrolling, and views in single-page applications (SPAs).
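As a tiny illustration, consider this hypothetical page (not from the original guide): the text never appears in the raw HTML source, because a script inserts it after the page loads:

<!-- Server-rendered HTML: the container arrives empty -->
<div id="quote"></div>
<script>
  // Client-side JavaScript fills the container in after load;
  // a scraper reading only the raw HTML would see nothing here
  document.getElementById("quote").textContent =
    "This text is rendered dynamically by JavaScript.";
</script>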

Disable JavaScript via the Chromium DevTools panel

To confirm if a page is dynamically rendered, you can disable JavaScript:

  • Open Chrome DevTools (F12)
  • Press Ctrl + Shift + P or Cmd + Shift + P on Mac to open the Command Menu
  • Enter "Disable JavaScript" in the search bar and select the corresponding option
  • Refresh the page and observe the difference – JavaScript-generated content will not be populated

To return to the default state, close the DevTools panel and refresh the page.
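You can also run the same check from a script. The sketch below uses Puppeteer's page.setJavaScriptEnabled() and assumes you already have a connected page object (the Incogniton setup later in this guide provides one):

// Script-based version of the DevTools check
await page.setJavaScriptEnabled(false); // must be called before navigation
await page.goto("https://quotes.toscrape.com/js/");
// Dynamically rendered content (the quotes) will now be missing from the page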

Now let's look at our scraping toolkit.

Puppeteer

Puppeteer is a Node.js library that automates Chromium-based browsers via the Chrome DevTools Protocol (CDP), enabling programmatic page control and simulated user interactions such as clicking or typing.
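For a flavor of that API, here is a minimal interaction sketch. The URL and selectors are hypothetical, and browser is assumed to come from the startIncogniton() helper defined later in this guide:

// Simulate typing and clicking on a hypothetical login page
const demoInteraction = async (browser) => {
  const page = await browser.newPage();
  await page.goto("https://example.com/login"); // hypothetical URL
  await page.type("#username", "demo-user");    // hypothetical selector
  await page.click("button[type='submit']");    // simulate a click
  await page.close();
};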

Incogniton Browser

Incogniton is an anti-detect browser with built-in Puppeteer integration, providing anonymity for web scraping and other automation tasks.

Scraping Dynamic Content with Incogniton Headless Chrome

The Incogniton-Puppeteer API facilitates the creation of fully isolated browser profiles with distinct browser fingerprints, combining advanced anonymity with automated, headless browsing.

Note that some features of Incogniton are available as paid options.

Step 1: Install and Set Up Incogniton

Incogniton setup

If you already have an Incogniton account, create a profile and get the profile ID. Otherwise, follow these steps:

  1. Visit the Incogniton download page. Select the version for your device (Windows or macOS) and download the app.

  2. Install the app on your computer. While it's installing, navigate to the website, choose a plan, and create an account.

  3. Upon installation, sign in with your credentials.

  4. Navigate to profile management and create a new profile.

  5. Set up your proxy for IP rotation; Incogniton provides a suite of proxy deals.

  6. Complete the profile creation process and get the profile ID.

Keep the Incogniton App open and ensure the profile status shows "Ready" and not "Launching" or "Syncing" before you run your script.

Step 2: Set Up the Incogniton-Puppeteer API

Create a new file named anon-scraper.js in your project folder. Then, install the puppeteer-core library:

npm install puppeteer-core

We use puppeteer-core instead of puppeteer because it doesn't bundle Chrome, making it ideal when you already have a Chrome instance (Incogniton).
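Note that the snippets below use ES module syntax (import). If your project doesn't already use ES modules, set "type": "module" in your package.json (or name the file anon-scraper.mjs):

{
  "type": "module"
}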

Now, establish a connection between your Incogniton instance and Puppeteer using the puppeteer.connect() function:

import puppeteer from "puppeteer-core";

// Function to introduce a delay
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Non-headless launch
const startIncogniton = async ({ profileId }) => {
  try {
    const launchUrl = `http://localhost:35000/automation/launch/puppeteer`;
    const requestBody = {
      profileID: profileId,
    };

    // Make a POST request with body data
    const response = await fetch(launchUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(requestBody),
    });

    if (!response.ok) {
      throw new Error(`Launch request failed: ${response.status} ${response.statusText}`);
    }

    const data = await response.json();
    const { puppeteerUrl } = data;

    // Wait for the browser to launch
    console.log('The Incogniton browser is launching...');
    await delay(30000); // await initial boot process

    // Connect Puppeteer to the launched browser
    const browser = await puppeteer.connect({
      browserURL: puppeteerUrl,
      acceptInsecureCerts: true,
    });

    return browser;
  } catch (error) {
    console.error("Error starting Incogniton session ->", error);
    throw error;
  }
};

For headless mode (better suited for large-scale scraping), modify the request body:

// Headless launch
const startIncognitonHeadless = async ({ profileId }) => {
  const launchUrl = `http://localhost:35000/automation/launch/puppeteer`;
  const requestBody = {
    profileID: profileId,
    customArgs: '--headless=new', // launch the profile in headless mode
  };

  // Rest of the function is identical to startIncogniton above...
};

Note: The default Incogniton port is 35000. If you've configured Incogniton to run on a different port, adjust the launchUrl accordingly.
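One way to avoid hard-coding the port is to read it from an environment variable. A small sketch (the INCOGNITON_PORT variable name is our own choice, not part of the Incogniton API):

// Read the Incogniton API port from the environment, defaulting to 35000
const INCOGNITON_PORT = process.env.INCOGNITON_PORT ?? 35000;
const launchUrl = `http://localhost:${INCOGNITON_PORT}/automation/launch/puppeteer`;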

To confirm that the anti-detect browser is working properly, let's run an IPHey test:

const incognitonProfileId = "YOUR_INCOGNITON_PROFILE_ID";

const ipheyTest = async (browser) => {
  try {
    const page = await browser.newPage();
    // Navigate to the IPHey website and wait for the network to go idle
    await page.goto("https://iphey.com/", { waitUntil: "networkidle0" });

    // Check for the 'trustworthy' status badge in the DOM
    // (page.$eval throws if the selector matches no element)
    const ipResult = await page.$eval(
      ".trustworthy-status:not(.hide)",
      (elt) => elt.innerText.trim()
    );

    console.log("IP Result ->", ipResult); // expected output: 'Trustworthy'
    await page.close();
  } catch (error) {
    console.error("Error during IPHEY test ->", error);
  } finally {
    await browser.close();
  }
};

// Execute iphey test
const testIncognitonProfile = async () => {
  const browser = await startIncogniton({ profileId: incognitonProfileId });
  await ipheyTest(browser);
};

testIncognitonProfile().catch(console.error);

All things being equal, the ipheyTest() function should return 'Trustworthy'. Other possible results are 'Suspicious' or 'Not reliable'.
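If you want the script to fail fast when the profile does not pass, you could turn the check into an assertion. A sketch building on the ipResult value from ipheyTest() above:

// Throw if the profile does not look trustworthy
const assertTrustworthy = (ipResult) => {
  if (ipResult !== "Trustworthy") {
    // Surfaces 'Suspicious' or 'Not reliable' results immediately
    throw new Error(`IPHey check failed: profile reported as "${ipResult}"`);
  }
};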

Troubleshooting tip: If you encounter any errors with the Incogniton API, close all browser tabs, perform a force stop of the browser instance, and try again. If that doesn't work, reach out to Incogniton support.

Note the use of { waitUntil: "networkidle0" }, which tells Puppeteer to consider navigation finished once there have been no network connections for at least 500 ms. This is more reliable than a fixed setTimeout() because it adapts to actual network conditions instead of guessing at a safe delay.
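When you know exactly which element you need, waiting for that selector is often faster than waiting for the whole network to settle. A sketch, assuming a page object from the setup above:

// Wait only for the element you plan to scrape, not all network activity
await page.goto("https://quotes.toscrape.com/js/");
await page.waitForSelector(".quote", { timeout: 10000 });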

Step 3: Scrape a Dynamically Loaded Page

To demonstrate scraping client-side rendered data, we'll use the JS-generated content page from the Quotes to Scrape website. The page dynamically loads quotes with JavaScript.

const scrapeDynamicContent = async (profileId) => {
  try {
    // Start Incogniton browser
    const browser = await startIncogniton({ profileId });
    const page = await browser.newPage();

    // Navigate to the dynamic page with client-side rendering
    await page.goto("https://quotes.toscrape.com/js/", {
      waitUntil: "networkidle0",
    });

    // Extract quotes and authors from dynamically rendered content
    const quotes = await page.$$eval(".quote", (elements) =>
      elements.map((element) => ({
        text: element.querySelector(".text")?.innerText.trim(),
        author: element.querySelector(".author")?.innerText.trim(),
      }))
    );

    console.log("Extracted Quotes ->", quotes);

    // Close the browser after scraping
    await browser.close();
  } catch (error) {
    console.error("Error scraping dynamically loaded content ->", error);
  }
};

Pay attention to how we use .$$eval() instead of .$eval() since we want to extract multiple quotes, not just a single element. We also use optional chaining (?.) so quotes with missing fields yield undefined instead of throwing an error.

The output data should look like this:

Extracted Quotes -> [
  {
    text: '"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."',
    author: 'Albert Einstein'
  },
  {
    text: '"It is our choices, Harry, that show what we truly are, far more than our abilities."',
    author: 'J.K. Rowling'
  },
  {
    text: '"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."',
    author: 'Jane Austen'
  },
  // other quotes...
]

Notice how we use the .trim() function to remove surrounding whitespace – a good example of simple data cleaning in a real-world scenario.
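To persist the cleaned data, you could write the array to disk. A minimal sketch using Node's built-in fs module (the filename is our own choice):

import { writeFile } from "node:fs/promises";

// Persist the scraped quotes as formatted JSON
const saveQuotes = async (quotes) => {
  await writeFile("quotes.json", JSON.stringify(quotes, null, 2), "utf8");
  console.log(`Saved ${quotes.length} quotes to quotes.json`);
};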

Step 4: Handle Paginated Data Scraping

Web crawling typically involves identifying links on a page and following those links to subsequent pages, repeating the process recursively until all relevant data is collected.

Recursive function flow chart

For paginated data scraping, we'll continue working with the quotes website, extracting quotes across multiple pages:

const scrapeRecursively = async ({ givenPage, scrapeUrl, allData }) => {
  try {
    // Use the provided page
    const page = givenPage;
    await page.goto(scrapeUrl, { waitUntil: 'networkidle0' });

    // Extract quotes from current page
    const quotes = await page.$$eval('.quote', elements =>
      elements.map(el => ({
        text: el.querySelector('.text')?.innerText.trim(),
        author: el.querySelector('.author')?.innerText.trim(),
      }))
    );

    // Add current page data to the collection
    allData.push(...quotes);

    // Look for next page link
    const nextLink = await page.$('li.next a');

    // If there's a next button, continue scraping
    if (nextLink) {
      const href = await nextLink.evaluate(el => el.href);
      await scrapeRecursively({
        givenPage: page,
        scrapeUrl: href,
        allData,
      });
    }

    return {
      data: allData,
    };
  } catch (error) {
    console.error('Error scraping dynamically loaded content ->', error);
    throw error;
  }
};

// Usage example
const scrapeAllPages = async profileId => {
  try {
    const browser = await startIncogniton({ profileId });
    const page = await browser.newPage();
    const allData = [];

    await scrapeRecursively({
      givenPage: page,
      scrapeUrl: 'https://quotes.toscrape.com/js/',
      allData,
    });

    console.log('ALL SCRAPED DATA ->', allData);
    await browser.close();
  } catch (err) {
    console.error(err);
    throw err;
  }
};

The function navigates to the URL, waits for the page to load, and extracts all quotes. Next, it checks for a "next page" link using the li.next a selector. If found, it reads the link's absolute URL from its href and recursively calls itself to scrape the next page, repeating until no further pages remain.
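Recursion mirrors the crawl naturally, but an iterative loop does the same job without growing the call stack on deeply paginated sites. An equivalent sketch, reusing the same selectors:

// Iterative alternative to scrapeRecursively
const scrapeIteratively = async (page, startUrl) => {
  const allData = [];
  let url = startUrl;
  while (url) {
    await page.goto(url, { waitUntil: 'networkidle0' });
    const quotes = await page.$$eval('.quote', els =>
      els.map(el => ({
        text: el.querySelector('.text')?.innerText.trim(),
        author: el.querySelector('.author')?.innerText.trim(),
      }))
    );
    allData.push(...quotes);
    // Follow the next-page link, or stop when there is none
    const nextLink = await page.$('li.next a');
    url = nextLink ? await nextLink.evaluate(el => el.href) : null;
  }
  return allData;
};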

scraping multiple pages with the recursive scraper function

Conclusion

You've now learned how to scrape dynamically loaded websites while maintaining privacy and security using Incogniton and Puppeteer. This approach helps you avoid being detected as a bot in most cases, reducing the likelihood of encountering CAPTCHAs and other anti-bot measures.

However, it's important to note that this method isn't foolproof. While Incogniton's real fingerprints make your automation software less likely to be detected as a bot, you should always use these tools responsibly and ethically.

As always, be mindful of each website's terms of service and uphold ethical scraping practices:

  • Respect robots.txt files
  • Implement reasonable rate limiting (see the sketch after this list)
  • Only collect publicly available data
  • Don't overload servers with requests
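For example, reusing the delay() helper from earlier, you could pause between consecutive page loads. A minimal rate-limiting sketch, meant to run inside an async scraping function (urlsToScrape and the interval are our own placeholders):

// Pause politely between page visits
const POLITE_DELAY_MS = 2000;

for (const url of urlsToScrape) {
  await page.goto(url, { waitUntil: 'networkidle0' });
  // ...extract data here...
  await delay(POLITE_DELAY_MS); // wait before hitting the next page
}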

Disclaimer: This guide is for educational use only. While scraping public data is generally accepted, you should always read the website's terms and follow the legal regulations of your region.

Find the complete source code for Anonymous-Scraper here.
