HTML Parsing

Let’s talk about HTML parsing, because who doesn’t love reading through endless lines of markup just for fun?

Okay, okay… let’s be real here. Sometimes you need to extract data from an HTML file or webpage and it can feel like pulling teeth with a rusty spoon. That’s where HTML parsing comes in!

HTML parsing is the process of converting raw HTML text into a structured representation, typically a tree of nested elements that mirrors the document’s content. Once the markup is a tree, a program can navigate it to extract specific data or modify the HTML.
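To make the tree idea concrete, here is a minimal sketch that uses only Python’s standard-library `html.parser` module to print an indented outline of a small fragment. The `OutlinePrinter` class, the `outline` helper, and the short `VOID_TAGS` set are names made up for this illustration, not part of any library, and the void-tag list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
# (Illustrative subset only, not the full list from the HTML standard.)
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}


class OutlinePrinter(HTMLParser):
    """Collects an indented outline of the tags and text in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend into it.
        self.lines.append("  " * self.depth + f"<{tag}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Climb back up one level when an element closes.
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text, shown one level below its parent tag.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(markup):
    parser = OutlinePrinter()
    parser.feed(markup)
    parser.close()
    return parser.lines


sample = "<ul><li>Item 1</li><li>Item 2</li></ul>"
print("\n".join(outline(sample)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the tree. A library like Beautiful Soup builds a much richer version of this structure for you.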

Now, time to get going with some examples!

To begin with, you need an HTML parsing library. For Python, we recommend Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/) because it’s easy to use and copes well with messy real-world markup. Here’s how you can install it:

# Install Beautiful Soup with pip. The leading $ is the shell prompt, not part of the command.
$ pip install beautifulsoup4

# Equivalent form that runs pip for a specific interpreter via Python's -m (run module) option:
$ python -m pip install beautifulsoup4

Once that’s done, let’s say we have an HTML file called “example.html” with the following content:

<!DOCTYPE html> <!-- This is the document type declaration, indicating that this is an HTML document -->
<html lang="en"> <!-- This is the opening tag for the root element, with the "lang" attribute specifying the language as English -->
  <head> <!-- This is the opening tag for the head element, which contains metadata for the document -->
    <meta charset="UTF-8"> <!-- This is a meta tag specifying the character encoding for the document as UTF-8 -->
    <title>Example Document</title> <!-- This is the title of the document, which will be displayed in the browser tab -->
  </head>
  <body> <!-- This is the opening tag for the body element, which contains the visible content of the document -->
    <h1>This is a heading</h1> <!-- This is a heading element, used to display a heading on the page -->
    <p>This is some text.</p> <!-- This is a paragraph element, used to display a block of text on the page -->
    <ul> <!-- This is the opening tag for the unordered list element, which contains a list of items -->
      <li>Item 1</li> <!-- This is a list item element, used to display an item in a list -->
      <li>Item 2</li> <!-- This is another list item element, used to display another item in the list -->
    </ul> <!-- This is the closing tag for the unordered list element -->
  </body> <!-- This is the closing tag for the body element -->
</html> <!-- This is the closing tag for the root element -->
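If you would like to follow along without creating the file by hand, the sketch below writes a trimmed copy of the document (comments omitted) to `example.html` in the current working directory. The `write_example` helper and the `EXAMPLE_HTML` constant are hypothetical names used only for this illustration.

```python
from pathlib import Path

# Trimmed copy of the example document (explanatory comments omitted).
EXAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Example Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is some text.</p>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
  </body>
</html>
"""


def write_example(path="example.html"):
    """Write the sample document to `path` and return the Path object."""
    target = Path(path)
    target.write_text(EXAMPLE_HTML, encoding="utf-8")
    return target


created = write_example()
print(f"Wrote {created} ({created.stat().st_size} bytes)")
```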

To parse this HTML file using Beautiful Soup, you can do the following:

# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Path to the HTML file (relative to the current working directory)
file_path = "example.html"

# Open the file as UTF-8 text, matching the <meta charset> in the document
with open(file_path, 'r', encoding='utf-8') as f:
    # Parse the file with Python's built-in 'html.parser' backend
    soup = BeautifulSoup(f, 'html.parser')

# Extract the text of the <title> element
title = soup.title.string

# Print the title
print(title)

# Find all <li> elements and print their text content
for li in soup.find_all('li'):
    # Strip any surrounding whitespace from the text and print it
    print(li.text.strip())

This code opens the HTML file, creates a Beautiful Soup object from it, extracts the document’s title using `title = soup.title.string`, and then finds all `<li>` elements using `for li in soup.find_all('li')`. The text content of each `<li>` element is printed by calling `print(li.text.strip())`.

The output will be:

Example Document
Item 1
Item 2
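One caveat worth knowing: `soup.title` is `None` when a document has no `<title>` element, so `soup.title.string` would raise an `AttributeError` on such input. The sketch below shows one possible way to guard against that, parsing from a string instead of a file. The function name `extract_title_and_items` and both sample strings are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup


def extract_title_and_items(markup):
    """Return (title, list of <li> texts); title is None when there is no <title>."""
    soup = BeautifulSoup(markup, "html.parser")
    title_tag = soup.title  # None if the document has no <title> element
    title = title_tag.get_text(strip=True) if title_tag else None
    # get_text(strip=True) trims surrounding whitespace from each item
    items = [li.get_text(strip=True) for li in soup.find_all("li")]
    return title, items


html_with_title = (
    "<html><head><title>Example Document</title></head>"
    "<body><ul><li>Item 1</li><li> Item 2 </li></ul></body></html>"
)
html_without_title = "<ul><li>Only item</li></ul>"

print(extract_title_and_items(html_with_title))
print(extract_title_and_items(html_without_title))
```

Returning `None` for a missing title is just one design choice. Depending on your use case, raising an error or falling back to the first heading may suit you better.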
    

That’s it! You now have a basic understanding of HTML parsing using Beautiful Soup in Python.
