Building a Node.js Crawler with Cheerio

November 9, 2024

Cheerio is a fast, flexible, and elegant library for parsing and manipulating HTML and XML on the server side.

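Official site: The industry standard for working with HTML in JavaScript | cheerio

Install: npm install cheerio

Features:

It implements a subset of core jQuery, with nearly identical selectors.
It can parse nearly any HTML or XML document.
It works in both browser and server environments.

In Node.js we can use it to parse and manipulate HTML documents, which makes it a good fit for crawler-style data extraction.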

1. API

1.1. Loading a document

First, pass the HTML document you want to parse to cheerio:

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

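1.2. Selecting elements

Once the document is loaded, use the returned function to select elements. Its API is similar to jQuery's.

To select all <p> elements in the document:

const $p = $('p');

To select elements with a specific class name:

const $selected = $('.selected');

Methods such as find, children, and eq are also supported for narrowing a selection: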

const $ = cheerio.load(
  `<ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>`,
);

const listItems = $('ul').find('li');

const secondItem = $('li').eq(1);

1.3. Manipulating elements

Get or set attributes:

// Set the 'src' attribute of an image element
$('img').attr('src', 'https://example.com/image.jpg');

// Get the 'href' attribute of a link element
const href = $('a').attr('href');

Get or set an element's text content (use .html() for its inner HTML):

// Set the text content of an element
$('h1').text('Hello, World!');

// Get the text content of an element
const text = $('p').text();

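2. Common issues

2.1. Proxy

Some sites, such as those hosted overseas, may only be reachable through a proxy. The tunnel package can route requests through a local HTTP proxy, for example one exposed by a VPN client. Two examples follow; adjust the proxy host and port to match your own setup.

Google Translate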

const axios = require('axios');
const cheerio = require('cheerio');
const tunnel = require('tunnel')

const agent = tunnel.httpsOverHttp({
  proxy: {
    host: 'localhost',
    port: 10809,
  },
});

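// Translate `source` via the mobile Google Translate page and scrape the result.
function translate({ source, to = 'en', from = 'zh-CN' }) {
  return axios({
    // encodeURIComponent also escapes &, = and ?, which encodeURI leaves intact
    url: `https://translate.google.com/m?sl=${from}&tl=${to}&q=${encodeURIComponent(source)}`,
    method: 'get',
    httpsAgent: agent, // send the request through the local proxy
  }).then(res => {
    const $ = cheerio.load(res.data)

    // The translated text is read from the .result-container element
    let text = $('.result-container').text()
    // Capitalize the first letter; skip when nothing was matched
    if (text) {
      text = text[0].toLocaleUpperCase() + text.slice(1)
    }
    console.log(source, text)

    return { source, to, from, text }
  }).catch(err => {
    // Errors are logged and swallowed, so the promise then resolves to undefined
    console.error(err);
  })
}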
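Because the text to translate is placed in the q query parameter, how it is escaped matters. The sketch below compares encodeURI and encodeURIComponent on a sample string containing &; the sample string and variable names are assumptions for illustration, and no request is sent.

```javascript
// Sample input containing a character that is reserved in query strings.
const source = 'rock & roll';
const base = 'https://translate.google.com/m?sl=en&tl=zh-CN&q=';

// encodeURI leaves & untouched, so the query is split at that point.
const withEncodeURI = new URL(base + encodeURI(source));

// encodeURIComponent escapes & as %26, keeping the value in one piece.
const withComponent = new URL(base + encodeURIComponent(source));

const truncated = withEncodeURI.searchParams.get('q');
const intact = withComponent.searchParams.get('q');
console.log(JSON.stringify(truncated), JSON.stringify(intact));
```

For values interpolated into a query string, encodeURIComponent is the safer default.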

Wikipedia

const fs = require('fs-extra')
const cheerio = require('cheerio')
const axios = require('axios')
const tunnel = require('tunnel')

const agent = tunnel.httpsOverHttp({
  proxy: {
    host: 'localhost',
    port: 10809,
  },
});

function main() {
  const name = '国家5A级旅游景区'
  return axios({
    // Percent-encode the non-ASCII page title
    url: 'https://zh.wikipedia.org/wiki/' + encodeURIComponent(name),
    method: 'get',
    httpsAgent: agent,
    headers: {
      'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6'
    }
  }).then(res => {
    const $ = cheerio.load(res.data)
    // Rows of the first table on the page
    const trs = $('table:eq(0) tbody tr')
    console.log(trs.length)

    const topPosts = []
    // A row may omit these cells (e.g. when an earlier cell spans several rows),
    // so keep the last seen value and reuse it as a fallback.
    let province = ''
    let city = ''
    let area = ''
    trs.each((i, row) => {
      const $row = $(row)
      const tds = $row.find('td')
      province = $row.find('th').eq(0).text().trim() || province
      // Negative indexes count from the end of the row
      city = tds.eq(-4).text().trim() || city
      area = tds.eq(-3).text().trim() || area
      topPosts.push({
        name: tds.eq(-2).text().trim(),
        province,
        city,
        area,
        batch: tds.eq(-1).text().trim(),
      })
    })

    fs.writeJSONSync(name + '.json', topPosts, { spaces: 2 })
  }).catch(err => {
    console.error(err)
  })
}

main()

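2.2. Garbled GB2312 text

Node.js's built-in Buffer encodings do not include GBK or GB2312, so pages served in those encodings come out garbled if decoded as UTF-8. Use the iconv-lite library to decode the raw response bytes: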

const fs = require('fs-extra')
const cheerio = require('cheerio')
const axios = require('axios')
const iconv = require('iconv-lite');
const download = require('download');

const BASE_URL = 'https://www.china-emu.cn'

async function main() {
  // emu.json holds a list of { name, url } entries
  const list = fs.readJSONSync('emu.json')
  const result = []
  for (const item of list) {
    const response = await getEmu(item.url)
    if (response) {
      console.log(item.name)
      result.push({
        name: item.name,
        ...response
      })
      // response is an object; the image path is in response.cover
      if (response.cover) {
        const extname = response.cover.split('.').pop()
        await download(BASE_URL + response.cover, 'images', {
          filename: item.name + '.' + extname
        });
      }
    }
  }

  fs.writeJsonSync('emu-detail.json', result, { spaces: 2 })
}

function getEmu(path) {
  // Request raw bytes so axios does not decode the body as UTF-8
  return axios.get(BASE_URL + path, { responseType: 'arraybuffer' }).then(response => {
    // iconv.decode already returns a regular JavaScript string
    const html = iconv.decode(Buffer.from(response.data), 'gb2312');
    const $ = cheerio.load(html)
    const $img = $('.pic-bg img')
    const result = {
      cover: $img.attr('src')
    }
    const $para = $('.para')
    $para.each(function (index, item) {
      // NOTE: key and value both read .para-M, as in the original. The key
      // probably belongs to a different element, but its selector is not given.
      const key = $(item).find('.para-M').text()
      const value = $(item).find('.para-M').text().trim()
      result[key] = value
    })
    return result
  })
}

main().catch(console.error)
