
Python Start • Jun 28, 2020

Regular expressions in Python on Raspberry Pi

Regular expressions are a powerful tool to parse and validate text input


Python includes many built-in functions to work with strings, but as you start working with increasing volumes of text input, you need a more powerful and flexible way to validate and format text. As an example, below is a screenshot of a simple Python program running on a Raspberry Pi. This program opens an HTML file called Good morning.html, stores the contents of the file in memory as a string and then prints the string to the console.
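As a rough sketch of the program described above (not the exact code from the screenshot), reading a file into a string and printing it could look like this. The helper name read_text_file and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path


def read_text_file(path):
    """Return the whole file as a single string (UTF-8 is assumed here)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# The file name comes from the article; the existence check is only
# here so the sketch does not crash if the file has not been downloaded.
html_path = Path("Good morning.html")
if html_path.exists():
    file_text = read_text_file(html_path)
    print(file_text)
else:
    print(f"'{html_path}' not found; download it first.")
```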

Regular expressions quickly sift through text input


The file contents look disorganized and a bit overwhelming when viewed as a string, but if you download the HTML file and then open it in a web browser, you will discover that it is a very simple, interactive story.



HTML story file as it appears in a web browser

Inside the HTML file, there is a lot of text. If you look closely, you will notice the story text is prefaced with an HTML tag, <tw-storydata>. In this example, we are only interested in the story text, so we decide to extract certain portions of the file using regular expressions.

To do this, we start by importing the regular expression module with the import re statement. Next, we read in the file contents and store them in a string. Then we search the file contents for the <tw-storydata> element using the re.search() function. Take a look at the code below, then let's dive into the details of the regular expression.

'''Regular expressions in Python'''

import re

# Code to open the local file if you have downloaded it
with open("Good morning.html") as f:
    file_text = f.read()

story_data = re.search('\
<tw-storydata.*?\
name="(?P<story_name>.*?)"\
.*</tw-storydata>', file_text)

print("Story name: '" + story_data.group("story_name") + "'")



While the regular expression above may look complicated, the goal is simple: we want to find and store the story name. Let's look at it step by step to see how it works:

  1. We start by searching for all of the story data. This is the text between the opening and closing <tw-storydata> tags.
  2. To do this, we pass the re.search() function a string that contains the regular expression describing what we are seeking.
  3. The string starts with a backslash character (\) so that it continues on the next line. Inside a Python string literal, a backslash at the end of a line joins the lines without adding anything to the pattern, which makes the regular expression easier to read.
  4. Add the <tw-storydata opening element that we are looking for.
  5. The period character (.) matches any single character except a newline.
  6. The asterisk character (*) repeats the preceding period, so .* matches any number of characters, including none.
  7. The question mark (?) makes the search non-greedy, so it stops at the first instance of name=" encountered after the <tw-storydata opening element.
  8. The next step is to capture the story name, so we match the name=" attribute and then use Python's named group syntax, (?P<story_name>.*?), to capture everything up to the closing double quote.
  9. Another period + asterisk (.*) then skips over the rest of the story data. This one is greedy, so it runs as far as it can while still leaving room for the closing element.
  10. We round out the search with the </tw-storydata> closing element to validate that the structure conforms to our expectations.
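To see the difference between greedy and non-greedy matching in isolation, here is a small sketch that runs on a made-up one-line sample rather than the real file. The sample string and the helper find_story_name are hypothetical, and the patterns use raw strings instead of the backslash line-wrapping shown earlier.

```python
import re

# Hypothetical one-line sample standing in for the real HTML file.
sample = (
    '<tw-storydata name="Good morning" creator="Twine">'
    '<tw-passagedata pid="1" name="Start">Hello</tw-passagedata>'
    '</tw-storydata>'
)

# Greedy: .* runs on to the LAST double quote it can reach.
greedy = re.search(r'name="(.*)"', sample)

# Non-greedy: .*? stops at the FIRST closing double quote.
lazy = re.search(r'name="(?P<story_name>.*?)"', sample)

print("Greedy:     " + greedy.group(1))
print("Non-greedy: " + lazy.group("story_name"))


def find_story_name(text):
    """Return the story name, or None when the pattern does not match."""
    match = re.search(r'<tw-storydata.*?name="(?P<story_name>.*?)"', text)
    return match.group("story_name") if match else None


print(find_story_name(sample))
print(find_story_name("no story data here"))
```

The greedy version swallows everything up to the passage name, while the non-greedy version captures only the story name. Checking for None is worth the extra line, because re.search() returns None when nothing matches and calling .group() on it would raise an error.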


Once the regular expression has run, we print out the story name by looking up the named group, which we called "story_name". Next, we use re.findall() to pull out every passage in the story:

for passage in re.findall('\
<tw-passagedata.*?\
name="(.*?)"\
.*?>\
(.*?)\
</tw-passagedata>', file_text):
    print(passage[0] + ": '" + passage[1] + "'")


This snippet finds and captures each passage within the story. Here's how it works:

  1. We loop through all the passages that match the pattern we define. re.findall() returns one tuple per match, with one item per capturing group.
  2. The <tw-passagedata> elements have a variety of attributes, but we only capture the text within name="" by using a capturing group, (.*?).
  3. We skip the remaining attributes with a non-greedy search for the closing angle bracket, .*?>, then capture the passage text with a second capturing group.
  4. We then print the passage name, passage[0], and the passage text, passage[1].
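The steps above can be tried on a small made-up sample. The sketch below assumes two hypothetical passages, one of which spans two lines, to show that the period does not match a newline unless the re.DOTALL flag is passed. Whether the real file needs that flag depends on how its passages are laid out.

```python
import re

# Hypothetical sample; the second passage contains a line break.
sample = (
    '<tw-passagedata pid="1" name="Wake up" tags="">Rise and shine</tw-passagedata>'
    '<tw-passagedata pid="2" name="Breakfast" tags="">Toast\nor cereal?</tw-passagedata>'
)

# Same pattern as in the article, written as a single raw string.
pattern = r'<tw-passagedata.*?name="(.*?)".*?>(.*?)</tw-passagedata>'

# Default behaviour: "." stops at a newline, so the second passage is missed.
single_line_only = re.findall(pattern, sample)

# With re.DOTALL, "." also matches newlines, so both passages are found.
all_passages = re.findall(pattern, sample, re.DOTALL)

for name, text in all_passages:
    print(name + ": '" + text + "'")

# Each match is a (name, text) tuple, so the list converts to a dictionary.
passages_by_name = dict(all_passages)
```

Unpacking each tuple into name and text does the same job as indexing with passage[0] and passage[1], and reads a little more clearly.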


Ready to give it a try? Jump in and play with this example, complete with the code needed to import the HTML file from Google Drive.


Follow along with this video to help you see how to run this on a Raspberry Pi. Let us know if you have any questions by leaving a comment below.
