Regular expressions are a powerful tool to parse and validate text input
Python includes many built-in functions to work with strings, but as you start working with increasing volumes of text input, you need a more powerful and flexible way to validate and format text. As an example, below is a screenshot of a simple Python program running on a Raspberry Pi. This program opens an HTML file called Good morning.html, stores the contents of the file in memory as a string, and then prints the string to the console.
The file contents look disorganized and a bit overwhelming when viewed as a string, but if you download the HTML file and then open it in a web browser, you will discover that it is a very simple, interactive story.
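As a rough sketch of the kind of program described above, reading a file into a string and printing it could look like the following. The read_html helper, the sample.html stand-in file, and the UTF-8 encoding are assumptions made for illustration; they are not taken from the original program.

```python
import tempfile
from pathlib import Path


def read_html(path):
    """Return the whole file as a single string (assumes UTF-8 text)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Stand-in file so the sketch runs anywhere. If you have downloaded the
# story, call read_html("Good morning.html") instead.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.html"
    sample_path.write_text("<html><body><p>Hello</p></body></html>",
                           encoding="utf-8")
    file_text = read_html(sample_path)

print(file_text)
```

With the real story file, the printed string is far longer than this one-line sample.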
Inside the HTML file, there is a lot of text. If you look closely, you will notice the story text is prefaced with an HTML tag, <tw-storydata>. In this example, we are only interested in the story text, so we decide to extract certain portions of the file using regular expressions.
To do this, we start by importing the regular expression module with the import re statement. Next, we read in the file contents and store them in a string. Then we search the file contents for the <tw-storydata> tag using the re.search() function. Take a look at the code below, then let's dive into the details of the regular expression.
'''Regular expressions in Python'''
import re

# Code to open the local file if you have downloaded it
with open("Good morning.html") as f:
    file_text = f.read()

story_data = re.search('\
<tw-storydata.*?\
name="(?P<story_name>.*?)"\
.*</tw-storydata>', file_text)
print("Story name: '" + story_data.group("story_name") + "'")
While the regular expression above may look complicated, the goal is simple: we want to find and store the story name. Let's look at it step by step to see how it works:
- We start by searching for all of the story data. This is the text between the <tw-storydata> opening and closing HTML tags
- To do this, we pass the re.search() function a string parameter that contains the regular expression describing what we are seeking
- Start with a backslash character (\) to wrap to the next line to make the regular expression easier to read
- Add the <tw-storydata opening element that we are looking for
- The period character (.) indicates that we are looking for any character
- The asterisk character (*) modifies the search to look for any number of characters
- The question mark (?) modifies the search to be non-greedy, so the search will stop at the first instance of name=" encountered after the <tw-storydata opening element
- The next step is to look for and capture the story name, so we search for the name=" attribute and then use the Python syntax for a named group, (?P<story_name>.*?), which captures everything up to the next double quote
- After the closing quote, a period + asterisk (.*) matches the remaining attributes and the story contents up to the closing element. This part of the search is greedy, because it has no question mark
- We round out the search with the </tw-storydata> closing element to validate that the structure conforms to our expectations.
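To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch. The sample string and its attribute values are made up for illustration and are not taken from the story file.

```python
import re

# Hypothetical one-line sample; the attribute values are invented.
sample = '<tw-storydata name="Good morning" creator="Twine">'

# Greedy: .* grabs as much as it can, so the match only ends at the
# LAST double quote in the string.
greedy = re.search('name="(.*)"', sample).group(1)

# Non-greedy: .*? grabs as little as it can, so the match ends at the
# FIRST double quote after name=".
lazy = re.search('name="(.*?)"', sample).group(1)

print(greedy)  # Good morning" creator="Twine
print(lazy)    # Good morning
```

This is why the story name is captured with .*? rather than .*: the greedy form would also swallow the attributes that follow it.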
Once the regular expression has run, we print out the story name by looking up the named group, which we called "story_name".
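One thing to keep in mind is that re.search() returns None when nothing matches, and calling .group() on None raises an error. The sketch below wraps the lookup in a small helper that checks for this first. The story_name function and the sample string are hypothetical and only stand in for the real file contents.

```python
import re

# Hypothetical stand-in for the file contents (invented attribute values).
sample = ('<tw-storydata name="Good morning" creator="Twine">'
          '<tw-passagedata pid="1" name="Start">Wake up!</tw-passagedata>'
          '</tw-storydata>')

# Same pattern as in the article, written on one line.
pattern = '<tw-storydata.*?name="(?P<story_name>.*?)".*</tw-storydata>'


def story_name(text):
    """Return the story name, or None when the pattern does not match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group("story_name")


print(story_name(sample))                  # Good morning
print(story_name("<p>no story here</p>"))  # None
```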
for passage in re.findall('\
<tw-passagedata.*?\
name="(.*?)"\
.*?>\
(.*?)\
</tw-passagedata>', file_text):
    print(passage[0] + ": '" + passage[1] + "'")
The snippet of code finds and captures each passage within the story. Here's how it works:
- We loop through all the passages that match the pattern we define
- The <tw-passagedata> elements have a variety of attributes, but we only capture the text within name="" by using a capturing group, (.*?)
- We also capture the actual passage by doing a non-greedy search for the closing angle bracket of the opening tag, .*?>, and then using another capturing group
- We then print the passage name and the actual passage text.
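The sketch below shows what re.findall() hands back when a pattern has two capturing groups: a list of tuples, one per match. It also illustrates a point to watch for: by default the period does not match a newline, so a passage that spans several lines is skipped unless the re.DOTALL flag is passed. The sample string is made up for illustration, and whether the real story file contains multi-line passages is an assumption.

```python
import re

# Hypothetical sample with two passages; the second one spans two lines.
sample = ('<tw-passagedata pid="1" name="Start" tags="">Wake up!</tw-passagedata>'
          '<tw-passagedata pid="2" name="Kitchen" tags="">Make coffee.\n'
          'Eat toast.</tw-passagedata>')

# Same pattern as in the article, written on one line.
pattern = '<tw-passagedata.*?name="(.*?)".*?>(.*?)</tw-passagedata>'

# Default behavior: "." stops at a newline, so only the first passage matches.
single_line = re.findall(pattern, sample)

# With re.DOTALL, "." also matches newlines, so both passages are found.
multi_line = re.findall(pattern, sample, re.DOTALL)

# Each result is a (name, text) tuple, one item per capturing group.
for name, text in multi_line:
    print(name + ": '" + text + "'")
```

Unpacking each tuple into name and text is equivalent to indexing passage[0] and passage[1].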
Ready to give it a try? Jump in and play with this example, complete with the code needed to import the HTML file from Google Drive.
Follow along with this video to help you see how to run this on a Raspberry Pi. Let us know if you have any questions by leaving a comment below.