
Scrape practice with Beautiful Soup – Part 1

This post is a note to myself on how to scrape/crawl with Beautiful Soup.
The main references are the video SUB) Crawling text and images with Python from JoCoding (in Korean) and the Beautiful Soup documentation.
I am doing this on Ubuntu 20.04.4 LTS, using PyCharm Community Edition.

1. Set up the environment
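A typical setup might look like the following (a sketch: it assumes Python 3 and pip are installed, and the environment directory name is just an example):

```shell
# Create and activate a virtual environment (directory name is arbitrary)
python3 -m venv scrape-env
source scrape-env/bin/activate

# Install the scraping dependency into the environment
pip install beautifulsoup4
```

Using a virtual environment keeps the project's packages separate from the system Python.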

2. Install Beautiful Soup and run a sample test

pip install beautifulsoup4

Create sample.py:

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and print the href of every <a> tag
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))

Then run it:

python sample.py
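As a quick sanity check that does not depend on the network, Beautiful Soup can also parse a literal HTML string. The snippet below is made up for illustration:

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet, just to exercise the parser
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# Same find_all/get pattern as sample.py, collected into a list
links = [anchor.get('href', '/') for anchor in soup.find_all('a')]
print(links)  # → ['/a', '/b']
```

If this prints the two hrefs, the installation works and any later problems are with fetching the page, not with parsing.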

3. Try scraping what you want

Inspecting the Premier League table page shows that each team name sits inside an element like:

<span class="long">
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the league table and print every team name with a running index
response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')
i = 1
for anchor in soup.select('span.long'):
    print(str(i) + ":" + anchor.get_text())
    i = i + 1

python scrape_pl.py
To print only the first 20 entries (the 20 Premier League teams), stop the loop with a counter:

response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')
i = 1
limit = 21  # stop after printing entries 1..20
for anchor in soup.select('span.long'):
    print(str(i) + ":" + anchor.get_text())
    i = i + 1
    if i == limit:
        break
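select() takes a CSS selector, so span.long matches every <span> whose class includes "long". The same pattern can be exercised offline on a hand-written snippet (the HTML below is made up, not taken from the Premier League site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure targeted above
html = '''
<td><span class="short">ARS</span><span class="long">Arsenal</span></td>
<td><span class="short">LIV</span><span class="long">Liverpool</span></td>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only the "long" spans match; the "short" ones are skipped
i = 1
for anchor in soup.select('span.long'):
    print(str(i) + ":" + anchor.get_text())
    i = i + 1
# → 1:Arsenal
# → 2:Liverpool
```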

4. Save the scraped data as a txt file

from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')
i = 1
limit = 22  # the break fires before writing entry 21, so 20 lines are saved
f = open("pl_standings.txt", 'w')
for anchor in soup.select('span.long'):
    data = str(i) + ":" + anchor.get_text() + "\n"
    i = i + 1
    if i == limit:
        break
    f.write(data)
f.close()
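The save step can also be sketched with a context manager and enumerate, which closes the file automatically and avoids the manual counter. The snippet below runs offline on made-up HTML standing in for the fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded table page
html = '<span class="long">Arsenal</span><span class="long">Liverpool</span>'
soup = BeautifulSoup(html, 'html.parser')

# 'with' closes the file even if an error occurs mid-loop
with open("pl_standings.txt", 'w') as f:
    for i, anchor in enumerate(soup.select('span.long'), start=1):
        f.write(str(i) + ":" + anchor.get_text() + "\n")
```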

That's it for this post.
