[Python Scraping Tutorial] Scraping Query-Based Web Pages with Selenium and BeautifulSoup

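Web pages use a variety of techniques to avoid loading large amounts of data at once and hurting performance. Besides pagination, or the scroll-triggered dynamic loading used by e-commerce sites and social platforms, another common approach is to add a query feature.

A typical query-based page is a ticket booking system. When the page loads, it does not display every available timetable and ticket. Instead, the user specifies the desired time and clicks the query button, and only then is the data loaded. How would you scrape this kind of query-based page with a Python web scraper?

This article uses the Taiwan Stock Exchange (TWSE) 「個股日收盤價及月平均價」 ("Individual Stock Daily Closing Price and Monthly Average Price") query page as an example, and shows how to combine Python's selenium and beautifulsoup packages to fill in the query conditions automatically and scrape the results. The key steps are:

Analyzing the "Individual Stock Daily Closing Price and Monthly Average Price" page
Installing the selenium and beautifulsoup packages
Automating the query conditions with selenium
Scraping the query results with beautifulsoup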

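1. Analyzing the "Individual Stock Daily Closing Price and Monthly Average Price" page

The TWSE "Individual Stock Daily Closing Price and Monthly Average Price" page lets users look up a stock's closing price for each trading day of a given year and month, together with the monthly average, as shown in the figure below.

If you need to analyze many stocks, querying and downloading each one by hand takes a lot of time. A Python web scraper makes collecting the data far more efficient.

As the figure above shows, the page is split into a query form and a results area. Filling in the query form calls for Python's selenium automation package, which imitates a user entering the year, month, and stock code and clicking the query button. Once the results are on the page, Python's beautifulsoup package can parse the HTML source and extract the data you need.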

2. Installing the selenium and beautifulsoup packages

This article uses Visual Studio Code. In the Terminal panel, install Python's selenium, beautifulsoup, and webdriver-manager packages with the following commands:

$ pip install selenium

$ pip install beautifulsoup4

$ pip install webdriver-manager

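The webdriver-manager package helps selenium by automatically downloading the browser driver (WebDriver) when the scraper runs.

3. Automating the query conditions with selenium

Create a scraper.py file and import the beautifulsoup, selenium, and webdriver-manager packages installed above, along with the standard time module, as in the following example: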

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

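To automate the year and month drop-down menus in the query form, the example also imports the Select class from the selenium package, as on line 3 of the example above.

Next, create a Stock class with a constructor and a daily() method that will scrape the "Individual Stock Daily Closing Price and Monthly Average Price" page, as in the following example: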

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        pass  # filled in over the following steps

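So that users can pass in several stock codes, the constructor takes a *args parameter (*stock_numbers), which packs the codes into a tuple. A later loop can then iterate over them.

The daily() method takes two parameters, the year and the month, so users can choose which month to query.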
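
As a small illustration of the *args packing described above, here is a sketch using a hypothetical StockCodes class that mirrors only the constructor of Stock. It uses plain Python and does not open a browser.

```python
class StockCodes:
    """Hypothetical stand-in that copies only the constructor of Stock."""

    def __init__(self, *stock_numbers):
        # *stock_numbers collects every positional argument into one tuple
        self.stock_numbers = stock_numbers


codes = StockCodes("2451", "2454", "2369")
print(type(codes.stock_numbers).__name__)  # tuple

# The tuple can be read with an ordinary loop
for stock_number in codes.stock_numbers:
    print(stock_number)

# An existing list can be unpacked into the same constructor with *
watchlist = ["2330", "2317"]
more_codes = StockCodes(*watchlist)
print(more_codes.stock_numbers)  # ('2330', '2317')
```

Accepting *args keeps the call site short for a handful of codes, while the * unpacking form still works when the codes already live in a list.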

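Inside daily(), first create a browser object with the webdriver module, using webdriver-manager to download the browser driver automatically. Then call selenium's get() method to request the TWSE "Individual Stock Daily Closing Price and Monthly Average Price" page, as in the following example: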

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

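Next, locate the date fields of the query form. Right-click the year field and choose "Inspect" to view its HTML source.

The year field is a drop-down menu whose name attribute is yy, so it can be located with selenium's find_element_by_name() method. Because it is a drop-down, pass the located element to selenium's Select class to create a drop-down object, then call select_by_value() to choose the year passed in by the user, as in the following example: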

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

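For the month, inspect the field's HTML source in the same way.

The month drop-down has the name mm. Again, locate it with find_element_by_name(), pass it to the Select class to create a drop-down object, and call select_by_value() to choose the month passed in by the user, as in the following example: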

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

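The stock code field can be inspected in the same way.

It is a text input (named stockNo) rather than a drop-down, so locating it with find_element_by_name() is enough. Use the send_keys() method to simulate typing, then call submit() to send the form, which runs the query, as in the following example: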

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()
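
The listings in this article use the Selenium 3 calling style. The find_element_by_* helpers were removed in Selenium 4.3, and newer Selenium 4 releases expect the driver path to be wrapped in a Service object. A rough sketch of the same steps in that style might look like the following. The function name submit_query is hypothetical, and the element names yy, mm, and stockNo are simply the ones used in this article, so the live page may differ.

```python
QUERY_URL = "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html"


def submit_query(year, month, stock_number):
    """Open the query page, fill in the form, and return the browser object."""
    # Imported inside the function so this file can still be loaded
    # on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select
    from webdriver_manager.chrome import ChromeDriverManager

    # Selenium 4 receives the driver path through a Service object
    service = Service(ChromeDriverManager().install())
    browser = webdriver.Chrome(service=service)
    browser.get(QUERY_URL)

    # find_element(By.NAME, ...) replaces find_element_by_name(...)
    Select(browser.find_element(By.NAME, "yy")).select_by_value(year)
    Select(browser.find_element(By.NAME, "mm")).select_by_value(month)

    stockno = browser.find_element(By.NAME, "stockNo")
    stockno.send_keys(stock_number)
    stockno.submit()
    return browser
```

Calling submit_query("2019", "7", "2330") would launch Chrome, so it is left out of the sketch itself.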

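After the scraper submits the query, the page produces the results, that is, the daily closing prices and monthly average for the specified stock. To keep the scraper from reading the page before the results have finished loading, call the time module's sleep() function to pause briefly, as in the following example: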

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()

        time.sleep(2)  # give the query results time to load
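
A fixed two-second pause can be too short on a slow connection and wasteful on a fast one. One common alternative is to poll until a condition holds, giving up after a timeout. The sketch below shows the idea with a hypothetical helper named wait_until, written with the standard library only. Selenium ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "has the result table appeared yet?": true on the third check
checks = iter([False, False, True])
result = wait_until(lambda: next(checks), timeout=2.0, interval=0.01)
print(result)  # True
```

With a real browser object, the condition could be something like lambda: 'id="report-table"' in browser.page_source, assuming the table id used later in this article.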

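4. Scraping the query results with beautifulsoup

Next, to scrape the query results, pass the page source to beautifulsoup for parsing, as in the following example. The example uses the lxml parser, which is a separate package (pip install lxml). Python's built-in "html.parser" can be passed instead.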

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()

        time.sleep(2)  # give the query results time to load

        soup = BeautifulSoup(browser.page_source, "lxml")  # parse the rendered page

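Back on the "Daily Closing Price and Monthly Average Price" page, inspect the results table to view its HTML source.

The table has an id attribute (report-table), so it can be located with beautifulsoup's find() method, as in the following example: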

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()

        time.sleep(2)  # give the query results time to load

        soup = BeautifulSoup(browser.page_source, "lxml")  # parse the rendered page

        table = soup.find("table", {"id": "report-table"})  # the results table

With the table object in hand, use beautifulsoup's find_all() method to collect all the data cells (td) under the table, as in the following example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()

        time.sleep(2)  # give the query results time to load

        soup = BeautifulSoup(browser.page_source, "lxml")  # parse the rendered page

        table = soup.find("table", {"id": "report-table"})  # the results table

        elements = table.find_all(
            "td", {"class": "dt-head-center dt-body-center"})  # every data cell

To get the text of each cell, loop over the cells and call beautifulsoup's getText() method on each one. The example uses a Python list comprehension:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box
        stockno.send_keys("2330")
        stockno.submit()

        time.sleep(2)  # give the query results time to load

        soup = BeautifulSoup(browser.page_source, "lxml")  # parse the rendered page

        table = soup.find("table", {"id": "report-table"})  # the results table

        elements = table.find_all(
            "td", {"class": "dt-head-center dt-body-center"})  # every data cell

        data = [element.getText() for element in elements]  # text of each cell

Everything so far queries a single stock. To query several stocks automatically, repeat the steps in a loop, as in the following example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
import time

class Stock:
    def __init__(self, *stock_numbers):
        self.stock_numbers = stock_numbers

    def daily(self, year, month):
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(
            "https://www.twse.com.tw/zh/page/trading/exchange/STOCK_DAY_AVG.html")

        select_year = Select(browser.find_element_by_name("yy"))
        select_year.select_by_value(year)  # select the requested year

        select_month = Select(browser.find_element_by_name("mm"))
        select_month.select_by_value(month)  # select the requested month

        stockno = browser.find_element_by_name("stockNo")  # locate the stock code input box

        result = []
        for stock_number in self.stock_numbers:
            stockno.clear()  # clear the stock code input box
            stockno.send_keys(stock_number)
            stockno.submit()

            time.sleep(2)  # give the query results time to load

            soup = BeautifulSoup(browser.page_source, "lxml")

            table = soup.find("table", {"id": "report-table"})

            elements = table.find_all(
                "td", {"class": "dt-head-center dt-body-center"})

            data = (stock_number,) + tuple(element.getText() for element in elements)

            result.append(data)

        print(result)

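Line 24 of the example creates the result list, which stores the daily closing prices and monthly average for each stock. The loop on line 25 then iterates over the stock codes given at initialization. On each pass it first clears the previous stock code from the input box, then types the new code and submits the query. Finally, so that each stock's data can be told apart, line 39 prepends the stock code to the scraped values and packs everything into a tuple, which is then appended to the result list.

Next, create a Stock object, passing in the stock codes you want to query, then call the daily() method with a year and month to scrape the daily closing prices and monthly average for those stocks, as in the following example: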
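
The tuple packing on line 39 can be tried on its own. The sketch below uses a short hand-written list named cells (a stand-in for the scraped cell text, not real scraper output) to show how the stock code is prepended.

```python
stock_number = "2451"
cells = ["108/07/01", "71.70", "月平均收盤價", "68.02"]  # stand-in for scraped cell text

# (stock_number,) is a one-element tuple; without the trailing comma
# it would be a plain string and the + below would raise a TypeError
data = (stock_number,) + tuple(text for text in cells)
print(data)  # ('2451', '108/07/01', '71.70', '月平均收盤價', '68.02')
```

Because the stock code always sits at index 0, later code can separate it from the values with data[0] and data[1:].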

stock = Stock('2451', '2454', '2369')  # create the Stock object
stock.daily("2019", "7")  # scrape each stock's daily closing prices for the given year and month

Output

[
 ('2451', '108/07/01', '71.70', '108/07/02', '71.30', '108/07/03', '67.70', '108/07/04', '67.50', '108/07/05', '67.80', '108/07/08', '67.70', '108/07/09', '68.70', '108/07/10', '68.60', '108/07/11', '68.10', '108/07/12', '67.70', '108/07/15', '68.00', '108/07/16', '67.60', '108/07/17', '67.30', '108/07/18', '67.00', '108/07/19', '67.40', '108/07/22', '67.00', '108/07/23', '67.50', '108/07/24', '66.90', '108/07/25', '67.50', '108/07/26', '67.40', '108/07/29', '68.20', '108/07/30', '67.80', '108/07/31', '68.00', '月平均收盤價', '68.02'), 
 ('2454', '108/07/01', '314.00', '108/07/02', '319.00', '108/07/03', '317.50', '108/07/04', '319.00', '108/07/05', '321.00', '108/07/08', '314.00', '108/07/09', '313.50', '108/07/10', '318.00', '108/07/11', '322.50', '108/07/12', '318.00', '108/07/15', '314.50', '108/07/16', '314.00', '108/07/17', '310.00', '108/07/18', '302.00', '108/07/19', '304.50', '108/07/22', '308.50', '108/07/23', '314.50', '108/07/24', '308.00', '108/07/25', '313.00', '108/07/26', '313.50', '108/07/29', '319.00', '108/07/30', '325.50', '108/07/31', '314.50', '月平均收盤價', '314.70'), 
 ('2369', '108/07/01', '8.36', '108/07/02', '8.39', '108/07/03', '8.45', '108/07/04', '8.46', '108/07/05', '8.48', '108/07/08', '8.49', '108/07/09', '8.40', '108/07/10', '8.36', '108/07/11', '8.41', '108/07/12', '8.62', '108/07/15', '8.91', '108/07/16', '8.81', '108/07/17', '8.83', '108/07/18', '8.76', '108/07/19', '8.80', '108/07/22', '8.90', '108/07/23', '8.88', '108/07/24', '8.81', '108/07/25', '8.87', '108/07/26', '9.11', '108/07/29', '9.19', '108/07/30', '8.98', '108/07/31', '8.91', '月平均收盤價', '8.70')
]
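
Each tuple in this output is flat: the stock code, then alternating date and price strings, ending with the monthly average label and value. For analysis it may help to reshape it. The sketch below is one possible approach, not part of the original scraper. The helper names parse_roc_date, to_price, and to_record are hypothetical, and the comma handling assumes larger prices may contain thousands separators. Dates use the ROC calendar, where year 108 is 2019.

```python
from datetime import date

ROC_YEAR_OFFSET = 1911  # ROC year 108 + 1911 = 2019


def parse_roc_date(text):
    """Convert a date such as '108/07/01' to a datetime.date."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year + ROC_YEAR_OFFSET, month, day)


def to_price(text):
    """Convert a price string to float (assumes ',' is a thousands separator)."""
    return float(text.replace(",", ""))


def to_record(row):
    """Split one result tuple into stock code, daily closes, and monthly average."""
    stock_number, *cells = row
    pairs = list(zip(cells[0::2], cells[1::2]))  # (label, value) pairs
    *daily, (_, average) = pairs  # the last pair is the monthly average row
    return {
        "stock_number": stock_number,
        "closes": [(parse_roc_date(day), to_price(price)) for day, price in daily],
        "monthly_average": to_price(average),
    }


# Abbreviated excerpt of the first tuple in the output above
sample = ("2451", "108/07/01", "71.70", "108/07/02", "71.30", "月平均收盤價", "68.02")
record = to_record(sample)
print(record["closes"][0])        # (datetime.date(2019, 7, 1), 71.7)
print(record["monthly_average"])  # 68.02
```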

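5. Summary

By combining the selenium and beautifulsoup packages, the scraper in this article retrieves data from a query-based web page. When you need to analyze a large amount of data, this makes collecting it much more efficient. Try the approach described here to build your own automated query scraper.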

GitHub repository: https://github.com/mikeku1116/python-stock-scraper

Your Py Coach Mike (你的Py教練Mike)
