Introduction
Welcome to the world of web automation! If you’ve ever wanted to automate web-based tasks or scrape data from websites, you’re in for a treat. In this blog post, we’ll introduce you to Puppeteer, a powerful Node.js package that makes web automation and scraping a breeze. We’ll explain how it works, showcase its key features, and explore some real-life examples to help you understand the full potential of Puppeteer.
What is Puppeteer?
Puppeteer is a Node.js library developed by the Chrome team at Google. It offers a high-level API to control headless or full-fledged Chromium-based browsers programmatically. With Puppeteer, you can carry out a wide range of tasks, such as generating PDFs, crawling websites for data, automating form submissions, and testing web applications.
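To give a sense of how compact these tasks can be, here is a minimal sketch that renders a page and saves it as a PDF. The URL and output file name are placeholders, not part of any real project:
import puppeteer from 'puppeteer';

(async () => {
  // Launch a headless browser and open a fresh tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Placeholder URL; swap in any page you want to capture
  await page.goto('https://example.com');

  // Render the current page to an A4 PDF on disk
  await page.pdf({ path: 'example.pdf', format: 'A4' });

  await browser.close();
})();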
Getting Started with Scraping
First, install Puppeteer along with TypeScript and ts-node:
npm install puppeteer @types/puppeteer typescript ts-node
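Depending on your project setup you may also need a tsconfig.json. The options below are just one reasonable baseline for running the script with ts-node, not a required configuration; the "DOM" lib entry matters because the code we pass to page.evaluate() uses document. (Recent Puppeteer releases ship their own type definitions, so @types/puppeteer may be redundant.)
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "esModuleInterop": true,
    "lib": ["ES2020", "DOM"]
  }
}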
Next, create a new TypeScript file (e.g., scrapeBlogPosts.ts), and add the following code:
import puppeteer, { Browser, Page } from 'puppeteer';

interface BlogPost {
  title: string;
  description: string;
  url: string;
}

async function scrapeBlogPosts(url: string): Promise<BlogPost[]> {
  // Launch a new browser instance
  const browser: Browser = await puppeteer.launch();
  const page: Page = await browser.newPage();

  // Navigate to the provided URL
  await page.goto(url);

  // Extract the desired data from the website
  const data: BlogPost[] = await page.evaluate(() => {
    const articles = Array.from(document.querySelectorAll('.blog-post'));

    const scrapedData: BlogPost[] = articles.map(article => {
      const title = article.querySelector<HTMLElement>('.post-title')?.innerText;
      const description = article.querySelector<HTMLElement>('.post-description')?.innerText;
      const url = article.querySelector<HTMLAnchorElement>('.post-link')?.href;

      return {
        title: title || '',
        description: description || '',
        url: url || '',
      };
    });

    return scrapedData;
  });

  // Close the browser instance
  await browser.close();

  // Return the scraped data
  return data;
}

// Usage example
(async () => {
  const blogPosts: BlogPost[] = await scrapeBlogPosts('https://www.exampleblog.com');
  console.log(blogPosts);
})();
Let's break down this TypeScript scraping function and explain each part.
1. Import statements
import puppeteer, { Browser, Page } from 'puppeteer';
In this line, we import the puppeteer module, along with the Browser and Page types, which are used later for type annotations.
2. Interface definition
interface BlogPost {
  title: string;
  description: string;
  url: string;
}
Here, we define an interface BlogPost to represent the structure of the blog post objects we'll be scraping. It has three properties: title, description, and url, all of which are strings.
3. Function definition
async function scrapeBlogPosts(url: string): Promise<BlogPost[]> {
...
}
We define an asynchronous function scrapeBlogPosts() that takes a string url as an argument and returns a Promise that resolves to an array of BlogPost objects. This function will perform the web scraping task.
4. Launching a browser instance and creating a new page
const browser: Browser = await puppeteer.launch();
const page: Page = await browser.newPage();
We create a new Browser instance by calling puppeteer.launch() and a new Page instance using browser.newPage(). These instances will load the website and extract the desired data.
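By default, puppeteer.launch() starts the browser in headless mode. While developing a scraper it can help to watch the automation run, and launch options make that easy. The slowMo value below is arbitrary, purely for illustration:
// Run with a visible browser window and slow each action down for debugging
const browser: Browser = await puppeteer.launch({
  headless: false,
  slowMo: 100, // extra milliseconds between Puppeteer operations
});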
5. Navigating to the provided URL
await page.goto(url);
We use the page.goto() method to navigate to the URL passed as an argument to the scrapeBlogPosts() function.
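page.goto() also accepts options. If the target blog renders its posts client-side, waiting for network activity to settle before scraping is often more reliable; this is just a sketch of that idea, and the timeout value is arbitrary:
// Wait until the network has been mostly idle before treating navigation as done
await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });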
6. Extracting data from the website
const data: BlogPost[] = await page.evaluate(() => {
...
});
We use the page.evaluate() method to execute JavaScript in the context of the loaded page. Inside the evaluate() function, we query the DOM for blog post elements (assuming they have a class "blog-post") and extract each post's title, description, and URL. The extracted data is stored in the data variable as an array of BlogPost objects.
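If the posts are rendered asynchronously, it can help to wait for them to appear before extracting anything, and for this kind of per-element extraction Puppeteer's page.$$eval() is a convenient alternative to page.evaluate(). A rough sketch, assuming the same class names as above:
// Wait until at least one blog post is present in the DOM
await page.waitForSelector('.blog-post');

// $$eval runs the callback over every element matching the selector
const data: BlogPost[] = await page.$$eval('.blog-post', articles =>
  articles.map(article => ({
    title: article.querySelector<HTMLElement>('.post-title')?.innerText || '',
    description: article.querySelector<HTMLElement>('.post-description')?.innerText || '',
    url: article.querySelector<HTMLAnchorElement>('.post-link')?.href || '',
  }))
);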
7. Closing the browser instance
await browser.close();
After extracting the data, we close the Browser instance to free up resources.
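In a longer-running script it is worth guaranteeing that the browser closes even when navigation or extraction throws. One common pattern, shown here only as a sketch (page.title() stands in for the real scraping logic):
const browser: Browser = await puppeteer.launch();
try {
  const page: Page = await browser.newPage();
  await page.goto(url);
  console.log(await page.title());
} finally {
  // Runs whether or not the code above threw, so the browser never leaks
  await browser.close();
}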
8. Returning the scraped data
return data;
Finally, we return the scraped data as an array of BlogPost objects.
9. Usage example:
(async () => {
  const blogPosts: BlogPost[] = await scrapeBlogPosts('https://www.exampleblog.com');
  console.log(blogPosts);
})();
This part demonstrates how to call the scrapeBlogPosts() function and log the results to the console. An IIFE (Immediately Invoked Function Expression) defines and calls an async function, which invokes scrapeBlogPosts() with the URL of the website to scrape. The scraped data is then logged to the console.
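Because the async IIFE's promise is never awaited anywhere, it is a good habit to catch errors explicitly rather than let a rejection go unhandled. A slightly hardened version of the same usage example might look like this, after which you can run the script with npx ts-node scrapeBlogPosts.ts:
(async () => {
  try {
    const blogPosts: BlogPost[] = await scrapeBlogPosts('https://www.exampleblog.com');
    console.log(blogPosts);
  } catch (error) {
    // Surface scraping failures and signal a non-zero exit code
    console.error('Scraping failed:', error);
    process.exitCode = 1;
  }
})();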
That's the TypeScript version of the web scraping function, explained step by step.