By: Rick Dobson | Updated: 2023-07-14 | Comments | Related: > Python
Problem
BeautifulSoup can be used for screen scraping web page content and in this tutorial, we will take a look at different examples of scraping a web page using Python code and BeautifulSoup.
Solution
BeautifulSoup (sometimes referred to as Beautiful Soup library) is one of several widely used screen scraping packages for a web page. It is highly regarded for its ease of use and power. Other popular screen scraping packages include Selenium and Scrapy.
Screen scraping enables developers to create solutions that permit comparative analyses from all over the internet, such as web applications that look for the best price on an item from many different online stores.
Screen scraping with BeautifulSoup works for a moment in time. When a website revises the content in its web pages over time (as is often the case), this permits you to save and track selected content from a web page for a time series by comparing content for different time series values for successive scrapings of a web page.
This tutorial will demonstrate the basics of scraping a web page with BeautifulSoup. This tip starts by scraping a whole page and reviewing HTML tags from a common type of page at the MSSQLTips.com website. Excerpts from the sample page will be reviewed and described to highlight the use of selected HTML tags on a sample web page. Without understanding the basics of HTML programming, it will be exceedingly difficult to scrape meaningful content from a web page.
An internet search discovered four previous articles on MSSQLTips.com that presented coverage of how to use BeautifulSoup. Two articles were authored by Nai Biao Zhou ("Using Python to Download Data from an HTML Table to an SQL Server Database" and "Using Advanced Python Web Scraping to Conduct a Web Content Audit"). Two were authored by me ("Web Screen Scraping with Python to Populate SQL Server Tables" and "Scraping HTML Tables with Python to Populate SQL Server Tables"). The BeautifulSoup Documentation presents and demonstrates the complete array of features and capabilities of the BeautifulSoup web scraper.
Here is an image of the title, author name, and problem statement from a browser session for one of the five resources referenced above.
Loading the Software Used in this Tip
The source code implementing the Python programming language is regularly updated and available without charge from Python.org and other sources. For this tip, a stand-alone version of Python (version 3.10) was downloaded from Python.org onto a computer running Windows 10. You can also obtain Python as a component of Condor or SQL Server 2019 (and subsequent versions). This approach bundles Python with other non-core Python software.
Python is one of the most popular programming languages for its ease of use as well as its power, which is derived, in part, from the many modules that can be installed along with it. These packages are for web scraping and other functional areas, including web development, statistics, and data science.
When you download Python from Python.org, it comes with an integrated development environment called IDLE and another stand-alone program known as the Package Manager for Python, more commonly called pip. The IDLE application for Python is roughly analogous in function to SSMS for T-SQL in SQL Server.
Pip is particularly valuable because you can use it to selectively install packages, such as BeautifulSoup, to extend the core Python capabilities that you download from Python.org. After downloading and installing Python from Python.org, you can download and install BeautifulSoup by running the following line of code in a command window from the directory containing Python. This tip uses beautifulsoup4 (version 4.11.1).
pip install bs4
BeautifulSoup can run with a built-in Python parser known as html.parser. This parser is for parsing HTML documents. BeautifulSoup can also run with the lxml parser for HTML and XML documents. The pip to install lxml for use with Python appears below. The lxml parser is widely considered to be faster and more efficient than the Python built-in html parser. For the documents in this tip, the two parsers are interchangeable.
pip install lxml
Reading and Displaying Your First Page with Python and BeautifulSoup
The following Python script requests a web page from the MSSQLTips.com site.
- The script begins with references for the requests and bs4 libraries.
- The requests library allows Python to send HTTP requests, such as GET, for retrieving a web page from a website on a web server. The requests() function from the requests library takes a URL as an argument. The requests library is built into Python 3.7 and above. When working with earlier Python versions, you can add the requests library via pip.
- The from clause for importing the bs4 library assigns the alias name BeautifulSoup to the library.
- An assignment expression sets the myURL string variable to the URL for the web page to be read from the MSSQLTips.com site. The assignment references the sample web page for this tip. The sample page contains my author profile at MSSQLTips.com and other content. If you want to inspect the page from a browser, copy the myURL string value to the address box in your browser.
- The requests() function with the get HTTP command returns the web page content
from myURL to r, which contains the response to the HTTP request
- r.content is the content from the request in bytes
- html.parser is the name of the parser for formatting the bytes returned to Python
- soup is the name of the BeautfulSoup object with the formatted content from r.content
- You can display the contents of soup with a standard Python print() function
statement or by adding the BeautifulSoup prettify() function to the Python print
function.
- The statement for referencing the standard Python print() function is commented out with a preceding comment marker (#). If the statement were not commented out, the print function would display the soup object content as one long string.
- The last statement in the script shows the syntax for using the prettify() function. This function displays the soup content in a format that makes it easier to identify the tags on a web page. This can, in turn, make it easier to locate tags from which you want to extract content.
#Prepared by Rick Dobson for MSSQLTips.com #reference libraries import requests from bs4 import BeautifulSoup #assign a url to myURL myURL = "https://www.mssqltips.com/sqlserverauthor/57/rick-dobson/" #get url content r = requests.get(myURL) #parse HTML code soup = BeautifulSoup(r.content, 'html.parser') #print (soup) #pretty print the page print(soup.prettify())
Over 5,750 lines within the soup object are displayed by the final print statement in the preceding script. The following screenshot shows the top several lines of HTML extracted by the requests() function and formatted with the html.parser module from the URL at myURL. This excerpt appears in an IDLE Shell window; recall that IDLE is the Python-integrated development application that downloads from Python.org.
- The output commences with <!DOCTYPE html>, which is a declaration to inform browsers and other HTML readers, such as Python’s requests library, that the following text is in an HTML document. The declaration statement is not part of the HTML document.
- The html tag is the first statement in the document. The html tag
appears with class and lang attributes.
- The class="no-js" attribute assignment allows the use of CSS (cascading style sheets) for styling JavaScript free pages
- The lang="en" assignment indicates the language for the document is English
- The head tag trails the html tag to mark the start of the head section in an HTML page: the head section of an HTML document resides between the starting (<head) and ending head (</head>) section tags. The head section contains metadata about the HTML document. Head sections are for a document, but the head section content does not appear within a document when a browser displays the web page. BeautifulSoup can display the head section of a web page along with its body section.
- The first two tags after the head tag are meta tags nested within the head
section.
- The first meta tag defines the web page’s language as en-us for US-style English in contrast to the more generic language reference of en in the html tag
- The second meta tag specifies the character set for the document as utf-8.
- The link tag following the first two meta tags is for setting the small
identifier logo, a favicon, on a browser tab for a web page. When multiple
sources are referenced in a browser session, favicons can indicate the origin
of the source for a browser tab.
- Most browsers and languages allow multiple ways of specifying the same thing.
- There are several different ways to assign favicons:
- Favicon.ico is a common name for the image of a favicon.
- The file with the name favicon.ico is in the root directory of the website.
- This resource has additional detail on different ways to set favicons for web pages
- The screenshot at the end of the Solution section shows a browser session
with five tabs – one for each of the web pages referenced in the session.
- The first four tabs are for web pages from the MSSQLTips.com site. Each of these four tabs has the same image on its left edge. This image is the favicon for the site.
- The fifth tab is for the website with the BeautifulSoup documentation. Notice that its favicon looks like a web page with a navigation bar on the left side of the page and web content on the right side of the page.
The following screenshot shows the last tag in the head section, followed by the first tag in the body section of the web page. Web pages can have both sections.
- The ending tag for the head section is </head>
- The beginning tag for the body section starts with <body followed with a style attribute assignment. An angle bracket (>) pointing right terminates the starting body tag
- Subsequent tags in the body section specify the content and style or content in the remainder of the body section
The following screenshot illustrates the tags to end a web page at https://www.mssqltips.com/sqlserverauthor/57/rick-dobson/; this URL specifies the name and location of the sample page for this tip on the mssqltips.com website.
- The first tag (</body>) closes the body section of the HTML code for the page
- The second tag (</html>) closes the HTML code for the page. This tag matches the html tag (<html>) just before the <head> tag
Separating and Displaying the Head and Body Sections of a Web Page
The preceding section shows how to create the soup for a web page at a URL. It also reveals that the HTML document has outer tags of <html> and </html> for the beginning and end of the web page. Nested within the HTML tags are start and end tags for head and body sections within the sample web page for this tip. Each of the nested sections has its starting and ending tags of <head> and </head> for the head section and <body> and </body> for the body section. The preceding section shows and discusses some tags in the head section. However, it did not show any content for the body section. This section illustrates how to extract and print to the IDLE shell window the head section, the body section, or both sections together. In addition, this section shows and discusses a segment of HTML code from within the body section.
An HTML page can be thought of as a tree. For example, the html tag serves as a root node for the other tags on an HTML page. The html node in this tip has two child nodes called head and body. Within each of these nodes, there are many more tree nodes.
- The first several lines of code for this tip section are identical to the
preceding tip section
- The same two libraries (requests and bs4) are referenced at the beginning of the script
- The next line assigns a URL for the web page. The get function from the requests library extracts the web page code and assigns it to the r Response object; the page code, along with other values, are deposited in r by the get() function
- The BeautifulSoup function extracts the content for the page at myURL to create a BeautifulSoup soup object
- The second set of lines extracts the contents within the head and body sections
and displays the contents in each HTML document section
- The soup.head.contents syntax in the first line of the second section extracts the contents within the head section; the assignment statement on the line deposits its contents as a Python list object named head_contents
- The soup.body.contents syntax in the second line of the second section extracts the contents within the body section; the assignment statement on the line deposits its contents as a Python list object named body_contents
- The w3schools.com website defines a Python list object as one that stores multiple items, which are frequently starting and ending HTML tags for nodes on a web page
- The print() function displays the items in a list object on the standard
output device, such as the IDLE shell window
- The * in the print() function instructs Python to print all the list items in the object named after the asterisk
- The sep = "\n" parameter assignment within the print() function instructs Python to end each list item with a line feed
#Prepared by Rick Dobson for MSSQLTips.com #reference libraries import requests from bs4 import BeautifulSoup #assign a url to myURL myURL = "https://www.mssqltips.com/sqlserverauthor/57/rick-dobson/" #get url content r = requests.get(myURL) #parse HTML code soup = BeautifulSoup(r.content, 'html.parser') #create head and body HTML segments #get all the tags in the head section of myURL #contents returns a Python list object head_contents = soup.head.contents #get all the tags in the body section of myURL #contents returns a Python list object body_contents = soup.body.contents #display the list elements in head_contents print(*head_contents, sep = "\n") #display the list elements in body_contents print(*body_contents, sep = "\n")
Recall that the preceding section shows a screenshot with an excerpt from the soup object for the whole HTML document in the myURL string object. The first excerpt displayed in the prior section is for the top of the document. Recall also that the sample document for this tip has two sections named head and body.
The head_contents list object in the preceding script contains just the tag items within <head> and </head> tags as opposed to tags from the whole HTML document, namely <html> and </html>.
The following screenshot displays the last tag item in head_contents.
- Notice that there is no <\head> tag at the end of the screenshot. This is because the soup.head.contents syntax returns only the tags within a section – namely, a head section. As a result, both the head_contents and body_contents list objects omit the beginning and ending tags for their respective sections
- The following screenshot displays a screenshot that shows the last tag within
the head section, which is some HTML code in a script block from <script>
through </script>
- The first line in the script block is an if statement. This statement runs the code in the braces ({}) below it if the screen’s width is greater than 749 pixels
- The first and second lines in the braces below the if statement contains "window.googletag" and "googletag.cmd.push(function()". If you perform an internet search for these terms, you can discover that they pertain to the loading and refreshing of advertisements on a web page URL, such as the one at myURL
If you comment out with a hash sign (#) the print() function statement for the head_contents list object in the preceding Python script, and run the modified code, you can return to the IDLE shell window all the tags just for the body section of the web page at the myURL location.
Here is a screenshot with the first six tags in the body section from the body_contents Python list object.
- The first tag (<a name="top"></a>) is for an anchor that points at the top of the current page
- The sixth tag is for an HTML unordered list item (li) whose class attribute is "name") in an HTML list of items with a starting tag of (ul) for the overall list
Here is a screenshot with the last six tags in the body section.
- The first (<script>) and last (</script>) tags indicate that the code segment is for an HTML script
- The gtag() function and window.dataLayer are for the Google Tag Manager. Google Tag Manager is a popular package for tracking the usage of pages within a website
Coordinating Tag Content to a Segment of a Web Page
The previous section shows how to isolate the body section of an HTML document and how to display all the code within it. The contents of the body section are the list of tags from within the section. This section shows one approach for extracting a subset of the list of tags from the body section of a page that matches the content showing in a browser.
Recall that the list of tags for the body section can be displayed in an IDLE shell window. The IDLE application you can download from Python.org offers an Edit, Find command for searching text that displays in the IDLE shell window.
- As you have seen in earlier sections, you can read and display the HTML for a web page on a website with BeautifulSoup and Python
- You can also open a web page in your favorite browser, such as Microsoft Edge or Google Chrome, to discover a segment within a web page whose HTML you want to examine and/or save for downstream processing
The web page at the myURL location has a segment that provides background information for a contributing author. In fact, there is a separate page for each contributing author to MSSQLTips.com. Here is an excerpt from the page that illustrates the kinds of content found in the author background segment of the page. The excerpt has three key types of information: an image of the author, a short author summary for sharing with MSQLTips.com visitors, and an image with a hyperlink to the author’s page on the LinkedIn website.
Some readers may be interested in learning how to display just the content behind this page segment from the HTML for the whole author profile page. In any event, this kind of task is web scraping. The approach to scraping this author profile information is performed with the IDLE application for a session that contains just the body_contents list object developed with Python and BeautifulSoup in the previous section. As you will recall, Python can display the output from a print function statement in the IDLE shell window.
As you can see, the background information for the author starts with the name Rick Dobson. The Edit, Find command in the IDLE shell window lets you search the text in the body_contents list object for all the instances of Rick Dobson, and there are multiple instances for a variety of different reasons. The Edit command is a top-level menu item, and the Find menu item is a child-level menu item below Find.
By successively searching for the list object text for Rick Dobson, you can find the HTML that generates the preceding excerpt from the text in a browser window. A search dialog like the one below appears when you invoke the Edit, Find command from the IDLE shell window menu. The text Rick Dobson was typed into the Find: dialog box. For the web page in this tip, I had to press the Find Next button three times before I arrived at the code segment for the author’s background information.
Here is an HTML code segment for the content that appears in the preceding browser image. The following bullet points offer highlights about how the code works.
- The first line of code is for the opening of a div tag with a class attribute
- Opening and closing div tags serve as a container for a collection of other tags and/or text from the web page
- The div tag’s class attribute value references the HTML grid programming feature, which helps to control the position of tag elements on a web page
- The final line in the shell window is </div>; this line closes the opening div tag in the excerpt in the browser window
- The next line is for the author image in another nested <div> tag with a matching </div> tag at the end of the line. The code within this div container designates a .jpg file for the author’s image as well as a width and height for the displayed image
- The next line opens another nested div tag (<div> and </div>)
for a container with a class attribute value that references a different grid
programming specification than the preceding one. This div container includes
the following three items
- a text string containing an author summary
- a couple of <br/> tags; each of these tags is for a line break on the web page
- an anchor tag for jumping to the author’s LinkedIn page; the anchor
tag contains
- an href attribute value that points at the URL for the author’s Linkedin page
- and an image for the icon denoting the Linkedin website
Next Steps
The next steps suggested by this tip can take you in at least two different directions.
- First, you can search for different content on the sample page within this tip. For example, the sample page contains a list of articles by the author. You can return the full list or a subset from the full list of articles.
- Second, you can search other web pages besides the sample one for this tip. Pick a page you regularly visit with a browser, and then use the techniques covered in this tip to learn about the HTML tags controlling the page's appearance.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2023-07-14