Scrape practice with Beautiful Soup – Part 1

This post is a note to myself on how to scrape/crawl using Beautiful Soup.
The video SUB) Crawling text and images with Python from JoCoding (in Korean) and the Beautiful Soup documentation are the main references.
I am doing this on Ubuntu 20.04.4 LTS, using PyCharm Community Edition.

1. Set up the environment

  • Open PyCharm Community Edition.
  • Create a new project. Using PyCharm is convenient because you don’t have to install/activate a virtual environment every time you run the project.
  • If you want to create a Git repository to track changes, there is a Version Control tab in the bottom-left corner (probably).
  • If you want to share/push your project to GitHub: Git -> GitHub -> Share Project on GitHub.
  • You can uncheck whatever you don’t want to push. Usually, people do not push the .idea folder, I guess.
  • Click the Add button to commit.
  • Create a .gitignore file under the project root folder. (There is one in the venv/.idea folders, but it is better, maybe required, to have one in the project root.)
  • Go to https://www.toptal.com/developers/gitignore and generate a .gitignore for your setup.
  • Copy the result and paste it into the .gitignore file you just created.

2. Install Beautiful Soup and run a sample test

  • Install beautifulsoup4:
pip install beautifulsoup4
  • Create a file named sample.py.
  • Copy the sample code from the Wikipedia article:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# fetch the page and parse it with Python's built-in HTML parser
with urlopen('https://en.wikipedia.org/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    # print every link's href, falling back to '/' when the attribute is missing
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
  • The code above is equivalent to the version below, which skips the context manager:
response = urlopen('https://en.wikipedia.org/wiki/Main_Page')
soup = BeautifulSoup(response, 'html.parser')
for anchor in soup.find_all('a'):
    print(anchor.get('href', '/'))
  • On your console, run:
python sample.py
  • You will see a result like the screenshot below.
  • Now you are ready to do some scraping with this basic code.
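  • A side note on anchor.get('href', '/'): Tag.get works like dict.get, so the second argument is returned when the attribute is missing. A minimal offline sketch (the HTML string here is made up for illustration):
from bs4 import BeautifulSoup

# a tiny hand-written snippet, just to try find_all and get without a network call
html = '<a href="/wiki/Python">Python</a> <a name="top">anchor without href</a>'
soup = BeautifulSoup(html, 'html.parser')

for anchor in soup.find_all('a'):
    # falls back to '/' for the second tag, which has no href
    print(anchor.get('href', '/'))
# output:
# /wiki/Python
# /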

3. Try what you want to scrape.

  • In this example, I will scrape the club names from the Premier League tables page. If you inspect the HTML of the club names, they all share the same format:
<span class="long">
  • Create a new file. In my case, I named it scrape_pl.py.
  • You can copy the content from sample.py and replace the necessary parts, or just copy the code below:
from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')

# print every club name with a running number
i = 1
for anchor in soup.select('span.long'):
    print(str(i) + ":" + anchor.get_text())
    i = i + 1
  • Run the file by typing this on the console:
python scrape_pl.py
  • You will see a result like the screenshot below.
  • You might notice there are more than the 20 entries we expected.
  • That is because the page also lists the PL2 and U18 tables.
  • So, what should we do?
  • We need to find a tag, or some other way, to narrow the selection as much as we can.
  • But in this case, all the entries share almost the same format.
  • So, I will simply limit the for loop to the first 20 entries:
response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')

i = 1
limit = 21
for anchor in soup.select('span.long'):
    print(str(i) + ":" + anchor.get_text())
    i = i + 1
    # stop once the first 20 entries have been printed
    if i == limit:
        break
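  • An equivalent, slightly tidier way to cap the output is to slice the selection and let enumerate do the counting; a sketch, assuming the same span.long structure on the page:
from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')

# keep only the first 20 matches and number them starting at 1
for i, anchor in enumerate(soup.select('span.long')[:20], start=1):
    print(str(i) + ":" + anchor.get_text())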

4. Save the scraped data as a txt file.

  • Add and edit the code from above, like below:
response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')

i = 1
limit = 22
f = open("pl_standings.txt", 'w')
for anchor in soup.select('span.long'):
    data = str(i) + ":" + anchor.get_text() + "\n"
    i = i + 1
    # note: this break check runs before the write below
    if i == limit:
        break
    f.write(data)
f.close()
  • I didn’t give an exact path; with a relative filename, open() creates the file in the current working directory, which here is the project root.
  • If you run the file again, you will see the created txt file.
  • For some reason, with limit = 21 it only wrote up to 19th place, so I changed it to 22.
  • The issue is the order inside the loop: the break check runs before f.write(data), so the entry built in the final iteration is discarded. Moving f.write(data) above the check makes limit = 21 behave as expected.
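  • For reference, here is a minimal sketch of that fix, with the write moved before the break check (and the file handled by a with block):
from bs4 import BeautifulSoup
from urllib.request import urlopen

response = urlopen('https://www.premierleague.com/tables')
soup = BeautifulSoup(response, 'html.parser')

# write each entry first, then decide whether to stop
with open("pl_standings.txt", 'w') as f:
    i = 1
    for anchor in soup.select('span.long'):
        f.write(str(i) + ":" + anchor.get_text() + "\n")
        if i == 20:
            break
        i = i + 1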

Done for this post.