[Scrapy Tutorial 8] Practical Techniques for Scraping Paginated Data with the Scrapy Framework

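Most people have come across websites that use pagination, splitting their content across several pages. Besides improving page load performance, this also improves the user experience.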

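The earlier tutorials in this Scrapy series only scraped a single page. So how do you scrape multiple pages with Scrapy? This article continues from [Scrapy Tutorial 7] Exporting CSV Files with the Scrapy Framework to Streamline Data Processing and walks through the technique. The steps are: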

Creating a method for scraping page content in the Scrapy project
Locating the page's "next page" button with Scrapy
Scraping multiple pages with Scrapy

1. Creating a Method for Scraping Page Content in the Scrapy Project

First, let's review the parse() method currently defined in the Scrapy spider (spiders/inside.py):

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']

    def parse(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }
            yield NewsScraperItem

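Lines 11-32 of the example above scrape a single page of AI news from the INSIDE website (INSIDE 硬塞的網路趨勢觀察). To make this scraping logic reusable, this article moves it into a new method, as in the following example: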

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']

    def parse(self, response):

        yield from self.scrape(response)  # Scrape the page content

    def scrape(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield NewsScraperItem

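Because the scraping logic now lives in its own method, parse() has to call it with the "yield from" keyword and pass in the response, as on line 11 of the example above. scrape() is a generator function, so a plain call would only create a generator object without handing any items back to Scrapy.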
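
To see why "yield from" matters here, the following is a minimal sketch in plain Python, with no Scrapy involved. The sample_page dictionary and the two parse_* function names are made up for illustration; only the delegation behavior is the point.

```python
def scrape(page):
    """Generator that yields one dict per article title on a page."""
    for title in page["titles"]:
        yield {"post_title": title}


def parse_plain_yield(page):
    # Yields the generator object itself, not the items inside it.
    yield scrape(page)


def parse_yield_from(page):
    # Delegates to scrape() and re-yields each of its items one by one.
    yield from scrape(page)


# Hypothetical stand-in for a scraped page.
sample_page = {"titles": ["Title A", "Title B"]}

items = list(parse_yield_from(sample_page))
wrapped = list(parse_plain_yield(sample_page))
```

Here items holds the two dictionaries, while wrapped holds a single generator object, which is not something a spider callback is supposed to emit.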

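2. Locating the Page's "Next Page" Button with Scrapy

Open the AI news page on INSIDE and scroll down to find the pagination area, shown in the figure below:

To continue on to the second page with Scrapy, you need the URL of the next page, which is usually found in the href attribute of the "next page" (下一頁) button.

Right-click the "next page" button in the figure above and choose "Inspect" to see its HTML source, shown in the figure below:

Next, return to spiders/inside.py in the Scrapy project. Inside parse(), call Scrapy's xpath() method with the CSS class of the "next page" button to locate it, as on lines 14-15 of the following example: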

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']

    def parse(self, response):

        yield from self.scrape(response)  # Scrape the page content

        # Locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

    def scrape(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield NewsScraperItem
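
As a rough way to try this kind of lookup outside of Scrapy, here is a minimal sketch using xml.etree.ElementTree from Python's standard library on a made-up, well-formed snippet. The markup, the example.com URLs, and the find_next_href helper are assumptions for illustration, not the real INSIDE page. ElementTree supports only a subset of XPath, so the sketch selects the a element and reads href from it instead of using /@href.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, assumed to resemble a typical pager.
PAGE_WITH_NEXT = """
<div class="pagination">
  <a class="pagination_item" href="https://example.com/tag/ai?page=1">1</a>
  <a class="pagination_item pagination_item-next" href="https://example.com/tag/ai?page=2">Next</a>
</div>
"""

# Hypothetical last page: no element carries the "next" class.
LAST_PAGE = """
<div class="pagination">
  <a class="pagination_item" href="https://example.com/tag/ai?page=1">1</a>
</div>
"""


def find_next_href(markup):
    """Return the href of the "next page" link, or None if there is none."""
    root = ET.fromstring(markup.strip())
    # The predicate compares the whole class attribute string exactly.
    link = root.find(".//a[@class='pagination_item pagination_item-next']")
    return link.get("href") if link is not None else None


next_href = find_next_href(PAGE_WITH_NEXT)
missing_href = find_next_href(LAST_PAGE)
```

Because @class='...' compares the entire attribute value, the match depends on the class names appearing exactly as written and in that order.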

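3. Scraping Multiple Pages with Scrapy

After locating the "next page" button on the INSIDE AI news page, the next step is to check whether the button exists. If it does, there are more pages to scrape; otherwise the crawl stops, as on lines 17-21 of the following example: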

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']

    def parse(self, response):

        yield from self.scrape(response)  # Scrape the page content

        # Locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

        if next_page_url:

            url = next_page_url.get()  # Get the URL of the next page

            yield scrapy.Request(url, callback=self.parse)  # Send the request

    def scrape(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield NewsScraperItem

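On line 21, the callback keyword argument tells Scrapy to run parse() again once the next page has been requested. parse() then receives the next page's response, scrapes its content (line 11), and looks for the "next page" button again. If the button exists, there is another page, so the spider takes its URL and sends a new request, and so on until no next page remains.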
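
The follow-the-next-link cycle described above can be sketched without Scrapy. This is only an illustration: FAKE_SITE, its example.com URLs, and the crawl() function are hypothetical, and a while loop stands in for Scrapy scheduling a request and invoking the callback. The sketch also assumes the href may be relative and resolves it with urllib.parse.urljoin, since scrapy.Request expects an absolute URL. Inside a real spider, response.urljoin(href) or response.follow(href, callback=self.parse) serve the same purpose.

```python
from urllib.parse import urljoin

# Hypothetical site: each URL maps to its items and its "next" link href.
FAKE_SITE = {
    "https://example.com/tag/ai": {
        "items": ["a1", "a2"],
        "next": "/tag/ai?page=2",
    },
    "https://example.com/tag/ai?page=2": {
        "items": ["b1"],
        "next": "/tag/ai?page=3",
    },
    "https://example.com/tag/ai?page=3": {
        "items": ["c1"],
        "next": None,  # last page: no "next page" button
    },
}


def crawl(start_url):
    """Yield every item, following "next" links until a page has none."""
    url = start_url
    while url is not None:
        page = FAKE_SITE[url]
        yield from page["items"]  # same role as self.scrape(response)
        href = page["next"]
        # urljoin resolves relative hrefs and leaves absolute URLs as they are.
        url = urljoin(url, href) if href else None


all_items = list(crawl("https://example.com/tag/ai"))
```

If the site's "next page" href is already absolute, as the article's example assumes, the urljoin step simply returns it unchanged.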

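That is how to scrape every page of AI news on INSIDE. Sometimes, though, you don't want that much data and only need a certain number of pages, for example the first 3. How can that be done in Scrapy?

In that case, define a class attribute that counts how many pages have been processed so far, as on line 8 of the following example: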

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']
    count = 1  # Number of pages processed

    def parse(self, response):

        yield from self.scrape(response)  # Scrape the page content

        # Locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

        if next_page_url:

            url = next_page_url.get()  # Get the URL of the next page

            yield scrapy.Request(url, callback=self.parse)  # Send the request

    def scrape(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield NewsScraperItem

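Because the spider always fetches the first page when it starts, count on line 8 starts at 1.

Next, before each request for the next page, add 1 to count and send the request only while count is 3 or less; otherwise stop, as on lines 22-25 of the following example: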

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']
    count = 1  # Number of pages processed

    def parse(self, response):

        yield from self.scrape(response)  # Scrape the page content

        # Locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

        if next_page_url:

            url = next_page_url.get()  # Get the URL of the next page

            InsideSpider.count += 1

            if InsideSpider.count <= 3:
                yield scrapy.Request(url, callback=self.parse)  # Send the request

    def scrape(self, response):

        # Scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # Scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # Scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield NewsScraperItem
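
The example above keeps count on the class, so the value is shared by every instance created in the same Python process. As a minimal sketch of the same counting rule kept per instance instead, consider the following. PageLimiter and allow_next() are hypothetical names, not part of Scrapy; the only thing carried over from the article is the rule "start at 1, add 1 before each request, continue while the count is within the limit".

```python
class PageLimiter:
    """Hypothetical helper that applies the article's page-count rule."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.count = 1  # the first page is always fetched

    def allow_next(self):
        """Add 1 to the count and report whether another page may be requested."""
        self.count += 1
        return self.count <= self.max_pages


limiter = PageLimiter(max_pages=3)
decisions = [limiter.allow_next() for _ in range(4)]

# A second limiter starts from its own count and is unaffected by the first.
other = PageLimiter(max_pages=3)
other_first_decision = other.allow_next()
```

With a limit of 3, the first two calls allow a request (pages 2 and 3) and every later call refuses, which matches the three pages fetched by the spider above.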

Finally, run the Scrapy spider with the following command:

$ scrapy crawl inside

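Part of the output is shown in the figure below:

That covers scraping the first 3 pages of AI news on INSIDE. Adjust the page count to suit your own needs.

4. Summary

Scraping paginated data comes up often when building Python web scrapers in practice. This article used a real example to show how to implement it with the Scrapy framework. I hope it helps readers who need to scrape many pages. If you have other ideas or questions, feel free to share them in the comments below.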

GitHub repository: https://github.com/mikeku1116/news-scraper
