[Scrapy Tutorial 9] Must-Know Tips for Emailing Scraped Data as a Gmail Attachment with the Scrapy Framework

Photo by Solen Feyissa on Unsplash

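When collecting data with a Python web scraper, besides storing the results in a database or exporting them to a file, one of the most common use cases is notification: once the scraper has gathered the data you need, it pushes the results out through a messaging channel.

For example, the article "[Python Scraping Tutorial] Building an Automated Notification Service with a Python Web Scraper and LINE Notify" integrates the LINE Notify service to alert users to price drops found by a Python scraper. This article covers another notification channel: email.

This article continues from "[Scrapy Tutorial 8] Practical Techniques for Scraping Paginated Data with the Scrapy Framework": the scraped results are saved to a CSV file and then mailed to the user as a Gmail attachment. Before starting, follow the steps in section 2 of "[Python in Practice] Sending Gmail Email with Python" to obtain a Gmail app password, so that you can send mail through Gmail's SMTP (Simple Mail Transfer Protocol) server. This article covers: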

The Scrapy framework workflow
A review of the Scrapy scraper project
Sending email through Gmail with Scrapy MailSender

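1. The Scrapy Framework Workflow

First, let's review the Scrapy workflow described in "[Scrapy Tutorial 1] A Quick Introduction to the Five Execution Modules and Architecture of the Scrapy Framework", shown in the diagram below:

As the diagram shows, to post-process the data a Scrapy scraper collects, the SPIDERS component, after receiving the response (6), stores the scraped data in ITEMS and passes it to the ITEM PIPELINE (7, 8), where you define your own processing logic.

It follows that exporting the scraped results to a CSV file and sending it as a Gmail attachment belongs in the ITEM PIPELINE, that is, in the Scrapy project's pipelines.py file.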
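
To make the pipeline hooks concrete, here is a minimal sketch in plain Python that does not use Scrapy at all. `CollectingPipeline` and `run_toy_crawl` are hypothetical names invented for this illustration; only the hook names `process_item` and `close_spider` match what a real Scrapy pipeline implements. The toy driver stands in for the Scrapy engine and is not how the engine is actually written.

```python
class CollectingPipeline:
    """Toy pipeline exposing the same hook names a Scrapy pipeline uses."""

    def __init__(self):
        self.rows = []
        self.closed = False

    def process_item(self, item, spider):
        # Called once for every scraped item.
        self.rows.append(item)
        return item

    def close_spider(self, spider):
        # Called once after the last item: the natural place for
        # "finish up" work such as closing a file or sending a notification.
        self.closed = True


def run_toy_crawl(pipeline, items):
    """Hypothetical stand-in for the engine: feed every item, then close."""
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline


pipeline = run_toy_crawl(
    CollectingPipeline(),
    [{"post_title": "A"}, {"post_title": "B"}],
)
print(len(pipeline.rows), pipeline.closed)
```

Because `close_spider` runs only after every item has passed through `process_item`, work that needs the complete data set, such as mailing the finished file, fits there.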

2. A Review of the Scrapy Scraper Project

Next, let's review the three parts of the current Scrapy project:

"SPIDERS spider (inside.py)"

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']
    count = 1  # number of pages requested so far

    def parse(self, response):

        yield from self.scrape(response)  # scrape the page content

        # locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

        if next_page_url:

            url = next_page_url.get()  # get the URL of the next page

            InsideSpider.count += 1

            if InsideSpider.count <= 3:
                yield scrapy.Request(url, callback=self.parse)  # send the request

    def scrape(self, response):

        # scrape the post titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # scrape the publication dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        # pair up the three lists and yield one dict per post
        for data in zip(post_titles, post_dates, post_authors):
            item = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2]
            }

            yield item

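The spider above scrapes article information from the first three pages of AI news on the INSIDE (硬塞的網路趨勢觀察) website. For implementation details, see "[Scrapy Tutorial 8] Practical Techniques for Scraping Paginated Data with the Scrapy Framework".

"ITEMS data model (items.py)"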
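
The spider combines three parallel lists with `zip`. The following sketch isolates that step with made-up sample data; `build_items` is a hypothetical helper, not part of the project. It also shows one thing worth keeping in mind: `zip` stops at the shortest list, so if a page is missing one author, posts are silently dropped, and if the missing value is in the middle of the page, the remaining columns would shift out of alignment.

```python
def build_items(titles, dates, authors):
    """Combine three parallel lists into one dict per post."""
    return [
        {"post_title": title, "post_date": date, "post_author": author}
        for title, date, author in zip(titles, dates, authors)
    ]


# Made-up sample data; the third post has no author on the page.
titles = ["Post 1", "Post 2", "Post 3"]
dates = ["2021/01/01", "2021/01/02", "2021/01/03"]
authors = ["Alice", "Bob"]

items = build_items(titles, dates, authors)
print(len(items))  # zip stops at the shortest list, so only 2 items remain
```

If that risk matters for your target site, one option is to select each post's container element first and extract the three fields relative to it.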

import scrapy


class NewsScraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
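    post_title = scrapy.Field()  # post title
    post_date = scrapy.Field()  # publication date
    post_author = scrapy.Field()  # post author

It defines the three fields that will later be exported to the CSV file: post title, publication date, and post author.

"ITEM PIPELINE (pipelines.py)"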

from itemadapter import ItemAdapter
from scrapy.exporters import CsvItemExporter


class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

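The code above comes from "[Scrapy Tutorial 7] Exporting CSV Files with the Scrapy Framework to Process Data More Efficiently" and exports the scraped data to a CSV file. This article attaches that CSV file to a Gmail message and sends it. (Note: CsvItemExporter on line 8 defaults to UTF-8 encoding. If you plan to open the exported CSV file in Microsoft Excel, set the encoding to big5; otherwise the Chinese text may appear garbled.)

3. Sending Email through Gmail with Scrapy MailSender

To send email from a Scrapy project, you can use the built-in MailSender class from the scrapy.mail module, which needs only basic configuration. It is built on Twisted's non-blocking I/O, so sending mail does not block the crawler while it waits on the SMTP server.

Open the Scrapy project's settings.py file and add the following Gmail SMTP settings: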
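
The encoding note can be checked with the standard library alone. The sketch below writes the same row with two encodings: `big5`, as used in the pipeline, and `utf-8-sig` (UTF-8 with a byte order mark), which Excel generally also recognizes. `to_csv_bytes` is a hypothetical helper for this illustration, and whether `utf-8-sig` suits your Excel version or behaves identically when passed to CsvItemExporter is an assumption to verify yourself. The last part shows a limitation of Big5: characters outside its repertoire, such as emoji, cannot be encoded.

```python
import csv
import io


def to_csv_bytes(rows, encoding):
    """Serialize rows to CSV text, then encode with the given codec."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["post_title", "post_author"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [{"post_title": "文章標題", "post_author": "Mike"}]

big5_bytes = to_csv_bytes(rows, "big5")
utf8_bom_bytes = to_csv_bytes(rows, "utf-8-sig")

# The same text produces different bytes under each encoding.
print(big5_bytes != utf8_bom_bytes)

# Big5 cannot represent emoji, so encoding such a title fails.
try:
    to_csv_bytes([{"post_title": "AI 😊", "post_author": "Mike"}], "big5")
    big5_handles_emoji = True
except UnicodeEncodeError:
    big5_handles_emoji = False
print(big5_handles_emoji)
```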

MAIL_HOST = "smtp.gmail.com"
MAIL_PORT = 587
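MAIL_FROM = "the Gmail address used to request the app password"
MAIL_PASS = "the Gmail app password"
MAIL_TLS = True  # enable a secure connection (STARTTLS)

In addition, enable the CsvPipeline created in "[Scrapy Tutorial 7] Exporting CSV Files with the Scrapy Framework to Process Data More Efficiently", as in the following example: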

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'news_scraper.pipelines.CsvPipeline': 500,
}
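
Since MAIL_PASS is a real credential, you may prefer not to hard-code it in settings.py. The following is a minimal sketch, assuming the values are supplied through environment variables; the helper name `mail_settings_from_env` and the choice to reuse the setting names as environment variable names are assumptions for this illustration, not something the original project does.

```python
import os


def mail_settings_from_env(environ=os.environ):
    """Build the mail settings from environment variables (hypothetical helper)."""
    return {
        "MAIL_HOST": environ.get("MAIL_HOST", "smtp.gmail.com"),
        "MAIL_PORT": int(environ.get("MAIL_PORT", "587")),
        "MAIL_FROM": environ.get("MAIL_FROM", ""),
        "MAIL_PASS": environ.get("MAIL_PASS", ""),
        # Any value other than "true" (case-insensitive) disables TLS.
        "MAIL_TLS": environ.get("MAIL_TLS", "true").lower() == "true",
    }


# Example with an explicit mapping instead of the real environment.
example = mail_settings_from_env(
    {"MAIL_FROM": "sender@example.com", "MAIL_PASS": "app-password"}
)
print(example["MAIL_HOST"], example["MAIL_PORT"], example["MAIL_TLS"])
```

In settings.py you could then assign, for instance, `MAIL_PASS = os.environ.get("MAIL_PASS", "")` and keep the password out of version control.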

Once the settings are in place, open the ITEM PIPELINE file (pipelines.py) and import the project's settings module and the MailSender class, as in the following example:

from itemadapter import ItemAdapter
from news_scraper import settings
from scrapy.exporters import CsvItemExporter
from scrapy.mail import MailSender

Because the email should be sent only after Scrapy has finished writing the data to the CSV file, the Scrapy MailSender object is created in the close_spider() method of the CsvPipeline class, as in the following example:

class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        mail = MailSender(smtphost=settings.MAIL_HOST,
                          smtpport=settings.MAIL_PORT,
                          smtpuser=settings.MAIL_FROM,
                          smtppass=settings.MAIL_PASS,
                          smtptls=settings.MAIL_TLS)

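Line 15 of the example above creates the Scrapy MailSender object from the values just set in settings.py. Note that the keyword argument names must match exactly.

Next, specify the Gmail attachment as three values: the display name of the attachment (attach_name), its Internet media type (mime_type), and a file object (file_object), as in the following example: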

class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        mail = MailSender(smtphost=settings.MAIL_HOST,
                          smtpport=settings.MAIL_PORT,
                          smtpuser=settings.MAIL_FROM,
                          smtppass=settings.MAIL_PASS,
                          smtptls=settings.MAIL_TLS)

        attach_name = "posts.csv"  # display name of the attachment
        mime_type = "text/csv"  # media type of a CSV file
        file_object = open("posts.csv", "rb")  # read the exported CSV file

        # send the email
        return mail.send(to=["example@gmail.com"],  # recipients
                         subject="news",  # email subject
                         body="",  # email body
                         attachs=[(attach_name, mime_type, file_object)])  # attachment

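Finally, line 26 calls the send() method of Scrapy MailSender to mail the CSV file exported by the scraper. Again, the keyword argument names must match exactly. The result is shown in the figure below:

4. Summary

In practice, writing the data collected by a Python web scraper to a file and mailing it to users is a very common requirement. The Scrapy framework provides the MailSender class for this, so with a few settings developers can connect to an SMTP (Simple Mail Transfer Protocol) service such as Gmail and send the scraped data file as a notification. I hope this tutorial helps readers who want to add email to their Python web scraping projects.

Which notification channels do you use in your Python web scraping projects? Feel free to share in the comments below.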
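
To see what the (attach_name, mime_type, file_object) triple turns into without sending anything, the sketch below builds a comparable message with Python's standard `email` package. This is not Scrapy's implementation, and `build_message` is a hypothetical helper; it only illustrates how the display name, the media type, and the file's bytes each end up in the attachment part. The addresses and CSV content are sample values.

```python
import io
from email.message import EmailMessage


def build_message(sender, recipients, subject, body, attachs):
    """Assemble an email whose attachments are given as
    (attach_name, mime_type, file_object) triples."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    for attach_name, mime_type, file_object in attachs:
        maintype, subtype = mime_type.split("/", 1)
        msg.add_attachment(
            file_object.read(),  # raw bytes of the file
            maintype=maintype,
            subtype=subtype,
            filename=attach_name,  # name shown to the recipient
        )
    return msg


# An in-memory stand-in for the exported posts.csv file.
csv_bytes = b"post_title,post_date,post_author\r\nHello,2021/01/01,Mike\r\n"
message = build_message(
    sender="sender@example.com",
    recipients=["example@gmail.com"],
    subject="news",
    body="Scraped posts attached.",
    attachs=[("posts.csv", "text/csv", io.BytesIO(csv_bytes))],
)

attachment = next(message.iter_attachments())
print(attachment.get_filename(), attachment.get_content_type())
```

The media type only labels the attachment; it does not convert the file, which is why a CSV file is labeled `text/csv` rather than with a spreadsheet type.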

GitHub repository: https://github.com/mikeku1116/news-scraper

Comments

Chito, August 11, 2021, 11:00 PM
Hi Mike, my code is the same as the example, but I get an error when I run it:
"""
[scrapy.mail] ERROR: Unable to send mail: To=['mymail@gmail.com'] Cc=[] Subject="news" Attachs=1- 502 Server does not support secure communication via TLS / SSL
"""
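The error message seems to be related to the TLS connection. I searched online but could not find an answer. Can anyone tell me what went wrong?
My environment is Python 3.8.8 / Scrapy 2.5.0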