
Intro

Pocket is a free save-for-later service. When you save an article page to it, Pocket automatically reduces the page to the plain article text. Its results are far more accurate than Python scraping toolkits such as goose3, and it supports websites in many languages. So we can use it to scrape articles from websites very easily.

But because Pocket requires logging in and JavaScript to run its service, we cannot simply use requests or a similar toolkit to fetch the article directly. Instead we can use Selenium, a toolkit that lets you control a web browser from code, to perform the login in a real browser and then grab the article we want to scrape.

Steps

pip install selenium 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver_path = './chromedriver.exe'    # your chromedriver path
options = Options()
options.add_argument("user-data-dir=./chromecache/pocket")
    # specify a user data folder so that you do not need to log in every time
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get('https://app.getpocket.com')

articles = [href for a in driver.find_elements(By.XPATH, '//article/a')
            if (href := a.get_attribute('href'))
            and href.startswith('https://getpocket.com/read/')]

for a in articles:
    driver.get(a)
    text=WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,'//article/article'))).text
    print(text)   # or save the articles however you like
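For instance, instead of printing, you could write each article to its own text file. Below is a minimal sketch; the filename scheme, which uses the first line of the extracted text as the title, is an assumption, and `save_article` is a hypothetical helper, not part of Selenium or Pocket:

```python
import os
import re

def save_article(text, out_dir="articles"):
    """Save article text to <out_dir>/<title>.txt, using its first line as the title."""
    os.makedirs(out_dir, exist_ok=True)
    # assume the first line of the extracted text is the article title
    title = text.splitlines()[0] if text else "untitled"
    # keep only filename-safe characters
    safe = re.sub(r'[^\w\- ]', '', title).strip() or "untitled"
    path = os.path.join(out_dir, safe + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

In the loop above you would then call `save_article(text)` in place of `print(text)`.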

After logging in once, you can combine all the snippets and let the script scrape automatically.
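If you run the script repeatedly, you may also want to skip items you have already fetched. Each reader URL ends with an item id you can use as a key. A small sketch, assuming the trailing-id format seen in the hrefs above (the helpers here are hypothetical, and `seen` would need to be persisted to a file to survive restarts):

```python
from urllib.parse import urlparse

def pocket_item_id(href):
    # 'https://getpocket.com/read/123456' -> '123456'
    return urlparse(href).path.rstrip('/').rsplit('/', 1)[-1]

seen = set()   # ids scraped so far (in memory only)

def is_new(href):
    """Return True the first time an item id is seen, False afterwards."""
    item_id = pocket_item_id(href)
    if item_id in seen:
        return False
    seen.add(item_id)
    return True
```

You could then filter the loop with `if is_new(a):` before calling `driver.get(a)`.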

Full Code

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver_path = './chromedriver.exe'    # your chromedriver path
options = Options()
options.add_argument("user-data-dir=./chromecache/pocket")
    # specify a user data folder so that you do not need to log in every time
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get('https://app.getpocket.com')

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//article/a')))
    # wait until the saved items ('//article/a') are loaded
articles = [href for a in driver.find_elements(By.XPATH, '//article/a')
            if (href := a.get_attribute('href'))
            and href.startswith('https://getpocket.com/read/')]

for a in articles:
    driver.get(a)
    text = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//article/article'))).text
        # wait until the article is loaded
    print(text)   # or save the articles however you like

