Python Start • June 28, 2020

Regular expressions in Python on Raspberry Pi

Regular expressions are a powerful tool to parse and validate text input

Python includes many built-in functions to work with strings, but as you start working with increasing volumes of text input, you need a more powerful and flexible way to validate and format text. As an example, below is a screenshot of a simple Python program running on a Raspberry Pi. This program opens an HTML file called Good morning.html, stores the contents of the file in memory as a string and then prints the string to the console.

Regular expressions quickly sift through text input
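
The program in the screenshot is not reproduced in the text, so the following is only a minimal sketch of what such a program might look like. The helper name read_file_contents and the stand-in file contents are assumptions made for illustration. The sketch writes a tiny sample file to a temporary folder so it runs without the real Good morning.html.

```python
'''Read a file into a string and print it (illustrative sketch)'''

import os
import tempfile


def read_file_contents(path):
    # Open the file, read everything into one string, and return it
    with open(path) as f:
        return f.read()


# Stand-in content (an assumption); the real Good morning.html is much longer
sample_html = '<tw-storydata name="Sample story"></tw-storydata>'

# Write the stand-in file to a temporary folder so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "Good morning.html")
    with open(sample_path, "w") as f:
        f.write(sample_html)

    # Store the file contents in memory as a string, then print the string
    file_text = read_file_contents(sample_path)
    print(file_text)
```

If you have downloaded the real file, you could call read_file_contents("Good morning.html") directly instead of creating the stand-in.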

The file contents look disorganized and a bit overwhelming when viewed as a string, but if you download the HTML file and then open it in a web browser, you will discover that it is a very simple, interactive story.

HTML story file as it appears in a web browser

Inside the HTML file, there is a lot of text. If you look closely, you will notice the story text is prefaced with an HTML tag, <tw-storydata>. In this example, we are only interested in the story text, so we decide to extract certain portions of the file using regular expressions.

To do this, we start by importing the regular expression module with the import re statement. Next, we read in the file contents and store them in a string. Then we search the file contents for the <tw-storydata> element using the re.search() function. Take a look at the code below, then let's dive into the details of the regular expression.

'''Regular expressions in Python'''

import re

# Code to open the local file if you have downloaded it
with open("Good morning.html") as f:
    file_text = f.read()

story_data = re.search('\
<tw-storydata.*?\
name="(?P<story_name>.*?)"\
.*</tw-storydata>', file_text)

print("Story name: '" + story_data.group("story_name") + "'")

While the regular expression above may look complicated, the goal is simple: we want to find and store the story name. Let's look at it step by step to see how it works:

We start by searching for all of the story data. This is the text between the opening and closing <tw-storydata> tags.
To do this, we pass the re.search() function a string parameter that contains the regular expression describing what we are looking for.
The string starts with a backslash character (\) to wrap to the next line, which makes the regular expression easier to read. Python drops the backslash and the line break, so neither becomes part of the pattern.
Add the start of the opening tag, <tw-storydata, that we are looking for.
The period character (.) matches any single character except a line break.
The asterisk character (*) modifies the search to match any number of characters, including none.
The question mark (?) makes the search non-greedy, so it stops at the first instance of name=" encountered after the <tw-storydata opening tag.
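A period + asterisk + question mark + closing angle bracket (.*?>) indicates a non-greedy search for the next closing angle bracket. The pattern shown above does not include this step, but the passage pattern later in this article does.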
The next step is to look for and capture the story name, so we search for the name=" attribute and then use Python's named group syntax, (?P<story_name>.*?), to capture everything up to the closing quotation mark.
We round out the search with a period + asterisk (.*) followed by the closing tag, </tw-storydata>, to validate that the structure conforms to our expectations.
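
To make the steps above more concrete, here is a small sketch that compares a wrapped pattern with its single-line form and then contrasts a greedy search with a non-greedy one. The sample string and its creator attribute are made up for illustration and are not taken from the real file.

```python
import re

# A backslash at the end of a line inside a string literal joins the lines,
# so the wrapped pattern is identical to one written on a single line
wrapped = '\
name="\
(.*?)\
"'
single_line = 'name="(.*?)"'
print(wrapped == single_line)  # True

# Made-up opening tag with two attributes (an assumption for this sketch)
sample = '<tw-storydata name="Good morning" creator="Twine">'

# Greedy: .* runs on to the last quotation mark it can reach
greedy = re.search('name="(.*)"', sample)
print(greedy.group(1))  # Good morning" creator="Twine

# Non-greedy: .*? stops at the first quotation mark after name="
non_greedy = re.search('name="(.*?)"', sample)
print(non_greedy.group(1))  # Good morning
```

The greedy version swallows the second attribute as well, which is why the story name pattern relies on the question mark.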

Once the regular expression has run, we print out the story name by looking up the named group, which we called "story_name".
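
As a rough illustration of the named group lookup, the sketch below runs the same pattern against a one-line stand-in string rather than the real file. The stand-in contents are an assumption. It also checks for None first, because re.search() returns None when nothing matches and calling group() on None raises an error.

```python
import re

# Made-up one-line stand-in for the file contents (an assumption)
file_text = '<tw-storydata name="Good morning" creator="Twine"></tw-storydata>'

pattern = '<tw-storydata.*?name="(?P<story_name>.*?)".*</tw-storydata>'
story_data = re.search(pattern, file_text)

# re.search() returns None when nothing matches, so check before calling group()
if story_data is None:
    print("No story data found")
else:
    # A named group can be looked up by name or by position
    print(story_data.group("story_name"))
    print(story_data.group(1))
    # groupdict() returns every named group as a dictionary
    print(story_data.groupdict())

# The same pattern finds nothing in text that has no <tw-storydata> element
no_match = re.search(pattern, "plain text without any tags")
print(no_match)
```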

for passage in re.findall('\
<tw-passagedata.*?\
name="(.*?)"\
.*?>\
(.*?)\
</tw-passagedata>', file_text):
    print(passage[0] + ": '" + passage[1] + "'")

This snippet of code finds and captures each passage within the story. Here's how it works:

We loop through all the passages that match the pattern we define.
The <tw-passagedata> elements have a variety of attributes, but we only capture the text within name="" by using a capturing group, (.*?).
We also capture the actual passage text by doing a non-greedy search for the closing angle bracket, .*?>, and then using another capturing group.
We then print the passage name and the actual passage text.
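
The sketch below shows what re.findall() returns for a pattern with two capturing groups, using a made-up two-passage string instead of the real file. The passage names, text and pid attribute are assumptions. It also shows one possible way to handle a passage that spans several lines: by default the period does not match a line break, and the re.DOTALL flag changes that.

```python
import re

# Made-up stand-in for the file contents; the second passage spans two lines
file_text = '''<tw-passagedata pid="1" name="Start">You wake up.</tw-passagedata>
<tw-passagedata pid="2" name="Kitchen">You make coffee.
It smells great.</tw-passagedata>'''

pattern = '<tw-passagedata.*?name="(.*?)".*?>(.*?)</tw-passagedata>'

# With two capturing groups, findall() returns a list of (name, text) tuples
passages = re.findall(pattern, file_text)
print(passages)  # only the single-line passage is found

# The re.DOTALL flag lets the period match line breaks as well,
# so the two-line passage is found too
all_passages = re.findall(pattern, file_text, re.DOTALL)
for name, text in all_passages:
    print(name + ": '" + text + "'")
```

Whether the flag is needed depends on how the passages are laid out in the file you are parsing.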