Use Puppeteer to scrape the historical hot list on Juejin.

Preface#

After becoming a full-time worker, I don't have much time to browse Juejin; often I can only lie in bed and scroll the hot list on weekends. By then, though, the earlier hot-list entries are already gone (the hot list changes constantly), which means we may miss some excellent Juejin articles. So is there a way to record Juejin's historical hot list with just one command? This is where Puppeteer comes in.


Brief Introduction#

Puppeteer is a Node.js library developed and maintained by Google. It provides a high-level API for automating web page operations by controlling a Headless Chrome instance (a Chrome browser without a UI). It can be used for many tasks, including taking screenshots, generating PDFs, scraping data, and automating form filling and page interaction.
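
For example, generating a PDF of a page takes only a few lines. Here is a minimal sketch, assuming a default installation; the URL and output path are just placeholders:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Render the current page to an A4 PDF (works in headless mode).
  await page.pdf({ path: 'example.pdf', format: 'A4' });
  await browser.close();
})();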

Let's pause on Headless Chrome, since the code later builds on this concept. A headless browser is a web browser without a visible user interface: it runs in the background and performs page operations and browsing behavior, but displays no graphical window. A traditional browser provides a visible interface for the user to interact with, such as typing a URL to navigate or clicking submit when logging in or registering; a headless browser automates those same operations in the background without windows constantly popping up. It lets developers script all kinds of tasks, such as:

  • Test automation in web applications
  • Taking web page screenshots
  • Running automated tests on JavaScript libraries
  • Collecting website data
  • Automating web page interaction
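
As a minimal sketch, headless mode is just a launch option, so switching between the two is a one-line change; slowMo is an optional setting that slows each action down so you can watch what the browser does:

const puppeteer = require('puppeteer');

(async () => {
  // headless: false opens a visible Chrome window, handy for debugging;
  // omit it (or pass headless: true) to run invisibly in the background.
  const browser = await puppeteer.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();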

Next, I will demonstrate how to get started with Puppeteer quickly and fetch the Juejin hot-list article information with just one command.

Configure Puppeteer#

The configuration here simply follows the official quick-start guide; I will run through it briefly.

Here, we can install Puppeteer directly:

npm i puppeteer

One thing to note here is that we can configure Puppeteer by creating a puppeteer.config.cjs file:

const {join} = require('path');

/**
 * @type {import("puppeteer").Configuration}
 */
module.exports = {
  // Changes the cache location for Puppeteer.
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};

After creating this configuration file and running the installation command, we can see that a .cache folder has been added. Opening it, we find it stores a number of binary files. This is a startup-speed optimization: when Puppeteer is installed, it downloads a Chrome binary suitable for our current operating system into .cache, so Puppeteer does not need to re-download the required files on subsequent launches.
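
If you want to confirm which binary Puppeteer resolved, here is a quick sketch (puppeteer.executablePath() is part of the public API; the printed path should point inside the .cache directory configured above):

const puppeteer = require('puppeteer');

// Print the path of the Chrome binary Puppeteer will launch.
console.log(puppeteer.executablePath());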

Getting Started with Puppeteer#

Now we create a test.js file and enter the following content. I will explain each line:

// Import the Puppeteer library so that we can use its functions. Using ESM syntax is also possible here.
// import puppeteer from 'puppeteer';
const puppeteer = require('puppeteer');

(async () => {
  // Launch a Chrome browser instance. Puppeteer runs in headless mode by default,
  // so this is equivalent to puppeteer.launch({ headless: true }).
  const browser = await puppeteer.launch();
  // Create a new page object.
  const page = await browser.newPage();
  // Navigate the page to the specified URL, simulating typing an address and hitting enter.
  await page.goto('https://example.com');
  // Take a screenshot of the current page. Note: To ensure consistent content display on different devices, the default browser window size is 800x600.
  await page.screenshot({path: 'example.png'});

  // Close Chrome.
  await browser.close();
})();

After running the node .\test.js command, you should see an example.png screenshot of the page appear in your directory; if so, you have succeeded.

Great, you have now gotten started with Puppeteer and mastered the most basic operations. Next, we will implement the requirement described in the preface.

Function Implementation#

First, we need to know where the article titles and links live on the Juejin hot list. More precisely, it is not we who need to understand their positions, but Puppeteer. Puppeteer offers the selector method Page.$$(), which runs document.querySelectorAll inside the browser, and the DevTools console offers the same $$ shorthand for experimenting. Open the console on the hot-list page and enter $$('a'): you get a pile of <a> tags, which is obviously not what we expected.

At this point, we need to narrow the scope. Hover over the hot list, right-click, and choose "Inspect".

Now we can quickly locate this part of the content in the console. Adjust the selector to $$('.hot-list>a') and we get the link elements. Getting the titles works on the same principle with a little processing: $$('.article-title').map(x => x.innerText) returns the titles of the Juejin hot list.
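
Putting the two together in the DevTools console (the .hot-list and .article-title class names come from Juejin's markup at the time of writing and may change):

// Run in the DevTools console on https://juejin.cn/hot/articles
// $$ is the console shorthand for document.querySelectorAll.
const links = $$('.hot-list>a').map((a) => a.href);
const titles = $$('.article-title').map((el) => el.innerText);
console.log(titles, links);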

Pitfall#

If we run the code directly at this point, we will most likely get []. This touches on a very important issue: page-load delay.

Here, we turn off the default headless mode by changing the code to:

const browser = await puppeteer.launch({ headless: false })

Now when we run the code, we can see that the script finishes its work before the page has fully loaded. We need to set waitUntil or delay the script. Modify the code:

await page.goto("https://juejin.cn/hot/articles", {
    waitUntil: "domcontentloaded",
});
await page.waitForTimeout(2000);

But if we check the documentation, we will find that page.waitForTimeout is deprecated and Frame.waitForSelector is recommended instead. It waits for an element matching the given selector to appear in the frame before the code continues, which is more efficient than a fixed delay. For now, let's leave it like this; the complete code below will use it.
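
As a sketch, swapping the fixed delay for a selector wait would look like this (using the title selector we derived above):

await page.goto("https://juejin.cn/hot/articles", {
    waitUntil: "domcontentloaded",
});
// Continue as soon as the first hot-list title is rendered,
// instead of always sleeping for two seconds.
await page.waitForSelector(".article-title");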

Complete and Optimize the Function#

Once we can successfully scrape the content, we need to save it to a local file. Here we bring in Node.js's file system module to write out the scraped content:

import puppeteer from "puppeteer";
import fs from "fs";

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('https://juejin.cn/hot/articles', {
        waitUntil: "domcontentloaded"
    });
    await page.waitForTimeout(2000);

    // The [data-v-cfcb8fcc] part is a Vue scoped-style attribute from
    // Juejin's current build and may change over time.
    let hotList = await page.$$eval(".article-title[data-v-cfcb8fcc]", (title) => {
        return title.map((x) => x.innerText);
    });

    console.log(hotList);

    // Save the article titles to a text file
    fs.writeFile('titles.txt', hotList.join('\n'), (err) => {
        if (err) throw err;
        console.log('The article titles have been saved to the titles.txt file');
    });

    await browser.close();
})();

Now we have all the article titles. But titles alone are not enough: it would be a hassle to look each article up manually when we want to read on the weekend, so we save the titles and links together. Since the title node sits inside its <a> tag, we can use closest("a").href to get the link:

const articleList = await page.$$eval(
    ".article-title[data-v-cfcb8fcc]",
    (articles) => {
      return articles.map((article) => ({
        title: article.innerText,
        // The title node sits inside the <a>, so walk up to it for the href.
        link: article.closest("a").href,
      }));
    }
  );

  console.log(articleList);

  // Save the article titles and links to a text file
  const formattedData = articleList.map(
    (article) => `${article.title} - ${article.link}`
  );
  fs.writeFile("articles.txt", formattedData.join("\n"), (err) => {
    if (err) throw err;
    console.log("The article titles and links have been saved to the articles.txt file");
  });

Great! But now we find that running the script again the next day overwrites the previous day's file, which is not acceptable. We need to separate the hot-list articles by day, so that each day gets its own file. Here we also add the behavior mentioned earlier of waiting for an element matching the given selector to appear in the frame. Finally, we get:

import puppeteer from "puppeteer";
import fs from "fs";

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://juejin.cn/hot/articles", {
        waitUntil: "domcontentloaded",
    });

    // Build the file name from today's date (locale-dependent, e.g. 7-15-2023.txt)
    const currentDate = new Date().toLocaleDateString();
    const fileName = `${currentDate.replace(/\//g, "-")}.txt`;

    await page.waitForSelector(".article-title[data-v-cfcb8fcc]");

    const articleList = await page.$$eval(
        ".article-title[data-v-cfcb8fcc]",
        (articles) => {
            return articles.map((article) => ({
                title: article.innerText,
                link: article.closest("a").href,
            }));
        }
    );

    console.log(articleList);

    const formattedData = articleList.map(
        (article) => `${article.title} - ${article.link}`
    );
    fs.writeFile(fileName, formattedData.join("\n"), (err) => {
        if (err) throw err;
        console.log(`The article titles and links have been saved to the file: ${fileName}`);
    });

    await browser.close();
})();

After running the code, we get a dated text file containing that day's hot-list article titles and links.

Summary#

Puppeteer, as a Node.js library developed and maintained by the Google team, makes all kinds of automation dramatically easier. Imagine only ever needing one node command to store the current hot-list articles and their information. Isn't that nice? 🐱

That said, web scraping is only one of its many capabilities, and hardly the most significant one. As the official documentation says, it can also be used for automated form submission, UI testing, capturing a site's timeline trace, and crawling SPAs to produce pre-rendered content (I will write an article about front-end first-screen optimization in the future).
