Intro
Pocket is a free read-it-later service. When you save an article's URL to it, Pocket automatically strips the page down to a clean, readable article view. In my experience its results are far more accurate than Python scraping toolkits like goose3, and it supports websites in many languages, so we can use it to scrape articles from websites very easily.
However, Pocket requires you to log in and needs JavaScript to render its pages, so we cannot fetch the articles directly with requests or a similar library. Instead we can use Selenium, a toolkit that lets you control a web browser from code, to perform the login in a real browser and then grab the articles we want.
Steps
- First, save the articles you want in your Pocket account and install Selenium. Selenium supports many programming languages; here I use Python.
pip install selenium
- Then use Selenium with ChromeDriver (or any other browser driver Selenium supports) to open the Pocket page.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

driver_path = './chromedriver.exe'  # your chromedriver path
options = Options()
# reuse a user data directory so you don't have to log in every time
options.add_argument('user-data-dir=./chromecache/pocket')
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get('https://app.getpocket.com')
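On the very first run, the Chrome window will show Pocket's login page. One simple way to handle this (just a sketch; the pause is my own addition, not anything Selenium requires) is to block the script until you have logged in manually. Thanks to the user-data-dir set above, the session is kept for later runs.
input('Log in to Pocket in the opened browser, then press Enter to continue...')  # pause for manual login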
- After logging in, get the article list. The article links are at the XPath //article/a. If Pocket was able to transform a website into an article, the link starts with https://getpocket.com/read/.
from selenium.webdriver.common.by import By
# note: the := walrus operator requires Python 3.8+
articles = [href for a in driver.find_elements(By.XPATH, '//article/a')
            if (href := a.get_attribute('href')).startswith('https://getpocket.com/read/')]
- Then visit each link to get the article text. Here I use an explicit wait (WebDriverWait) to wait until each page has loaded.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
for a in articles:
    driver.get(a)
    text = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//article/article'))).text
    print(text)  # or save the articles however you like
- You can also scrape the images in the articles; see the sketch after this list.
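For instance, here is a minimal sketch of downloading the images on the current reader page. It assumes the images appear as ordinary img tags inside Pocket's article element and that their URLs are publicly fetchable without cookies; the ./images folder and file names are my own choices.
import os
import urllib.request

os.makedirs('./images', exist_ok=True)
for i, img in enumerate(driver.find_elements(By.XPATH, '//article//img')):
    src = img.get_attribute('src')
    if src and src.startswith('http'):
        # the file name and .jpg extension here are made up; inspect src if you need the real ones
        urllib.request.urlretrieve(src, f'./images/img_{i}.jpg')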
After your first login, you can combine all the code and let it scrape automatically.
Full Code
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver_path = './chromedriver.exe'  # your chromedriver path
options = Options()
# reuse a user data directory so you don't have to log in every time
options.add_argument('user-data-dir=./chromecache/pocket')
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get('https://app.getpocket.com')

# wait until the saved-article list has loaded
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//article/a')))
articles = [href for a in driver.find_elements(By.XPATH, '//article/a')
            if (href := a.get_attribute('href')).startswith('https://getpocket.com/read/')]

for a in articles:
    driver.get(a)
    # wait until the article body is loaded
    text = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//article/article'))).text
    print(text)  # or save the articles however you like
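If you would rather keep the results than print them, one option is to write each article to its own text file. This is just a sketch; naming the files after the numeric id at the end of each read URL is my own convention, not anything Pocket provides.
for a in articles:
    driver.get(a)
    text = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//article/article'))).text
    item_id = a.rstrip('/').split('/')[-1]  # e.g. the id in https://getpocket.com/read/<id>
    with open(f'{item_id}.txt', 'w', encoding='utf-8') as f:  # one file per article
        f.write(text)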