Processing Text Data

Overview, Objectives, and Key Terms

Lectures 5 through 9 covered the basic logical structures used in programming and their implementation in Python. Lecture 10 presented Python’s built-in container types. In this lecture, we turn to the practical problem of processing text data. Often, such data starts life in files on our machines. Ultimately, that data is represented as one (or more) strings that can be processed using a combination of the structures already covered (particularly, loops) and more specialized string functions. We’ll wrap up with ways in which we can output existing data into useful text-based formats.

Objectives

By the end of this lesson, you should be able to

  • Open and process text files.
  • Use string functions to parse data into desired formats.
  • Convert data into desired string formats.
  • Write strings to text files.

Key Terms

  • open
  • close
  • read
  • str
  • str.split
  • str.count
  • str.find
  • str.isnumeric
  • str.replace
  • in operator
  • str.format
  • {} for replacement
  • write

Reading Text Files

We’ll start with a case we’ve seen before: the text file from Lecture 4, the contents of which are

time (s)   vel (m/s)  acc (m/s**2)
0.00000000 1.00000000 0.00000000
0.22222222 1.24884887 0.01097394
0.44444444 1.55962350 0.08779150
0.66666667 1.94773404 0.29629630
0.88888889 2.43242545 0.70233196
1.11111111 3.03773178 1.37174211
1.33333333 3.79366789 2.37037037
1.55555556 4.73771786 3.76406036
1.77777778 5.91669359 5.61865569
2.00000000 7.38905610 8.00000000

It was suggested then that any data with such a simple format is best read by using np.loadtxt. However, the simplicity of this data makes it useful for studying more traditional file processing.

In Python, files are first opened to produce a file handle, i.e., a variable that provides the user access to the file and various functions with which to inspect the file. Loading a file and producing such a file handle requires the built-in open function. This function is typically called with two arguments: a filename, always a str value, and the access mode, also a str value, which can be 'r' for read, 'w' for write (which overwrites any existing contents), and 'a' for append. For now, we’ll focus on reading files, and we can open data.txt for reading by executing

In [1]:
f = open('data.txt', 'r')
f

The particular value of f is not important, but it does suggest it is associated with our text file and that it is set for reading. At this point, though, the file is only ready to be processed; we still need to do something with it. Let’s see what we can do with the file handle f:

In [2]:
for item in dir(f):
    if not item.startswith('_'):
        print(item)

In fact, file handles have quite a few functions. The important ones for getting the content of the file into our program are read and readlines. We’ll explore write later on. Let’s start with f.read():

In [3]:
help(f.read)

By default, read will read all the characters. Although not stated, the function produces a str with all of the characters read. Hence, we can turn the contents of our text file (connected now to f) into a single string via

In [4]:
s = f.read()
s

Well, the contents sure are there, but the formatting is a bit weird. If we print the string, we see what we might expect:

In [5]:
print(s)

By default, displaying a str variable (and not printing it) will show all of the characters, including the special \n character. Here, \n is a newline character and represents the break between lines in text files. When a string is printed, the effects of special characters (like \n) are shown.
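
The difference between displaying and printing a string can also be seen outside a notebook with the built-in repr function, which produces the same text a notebook shows for a bare str variable. The sketch below uses a short, made-up string rather than the contents of data.txt:

```python
# A short, made-up string containing two newline characters
s = 'time (s)\n0.0\n'

shown = repr(s)   # the text a notebook displays for a bare `s`
print(shown)      # each \n appears as a backslash followed by n
print(s)          # each \n becomes an actual line break

n_chars = len(s)  # a \n counts as a single character
```

Note that len treats \n as one character, even though it is typed as two.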

Exercise: Define a string that, when printed, shows “ABC” on the first line and “XYZ” on the second line.

Now that we’ve read the file into a single string, we need to close it:

In [6]:
f.close()

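It is important to close files once they are no longer needed. Doing so releases the system resources tied to the file handle and, for files opened for writing, ensures that everything written actually reaches the file.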

Note: Always close a file once it has been read.
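
One way to avoid forgetting the call to close is Python’s with statement, which closes the file automatically when the indented block ends. The with statement is not used elsewhere in this lecture, so treat the following as an optional sketch; the file name example.txt is made up, and the file is created in a temporary folder only so that the example is self-contained:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
f = open(path, 'w')
f.write('line one\nline two\n')
f.close()

# The with statement closes the file when the indented block ends
with open(path, 'r') as f:
    s = f.read()

closed_after = f.closed  # True: no explicit f.close() was needed
```

The explicit open and close calls used in the rest of this lecture do the same job; with simply makes the closing step automatic.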

Exercise: The open function is not limited to files ending in .txt. Any text file works, including .py files. Go and open a previous homework or laboratory file (making sure to set the mode to r and not w!). Read the contents into a string, and use a loop to count the number of # characters in your file.


In addition to f.read(), we can use f.readlines(). Now, we’ve closed the file, so we need to open it again before we read it. Then, following the advice above, the file is closed:

In [7]:
f = open('data.txt', 'r')
lines = f.readlines()
f.close()

The readlines function reads the same contents as read, but the result is a list containing one string for each line of the text file:

In [8]:
lines
Out[8]:
['time (s)   vel (m/s)  acc (m/s**2)\n',
 '0.00000000 1.00000000 0.00000000\n',
 '0.22222222 1.24884887 0.01097394\n',
 '0.44444444 1.55962350 0.08779150\n',
 '0.66666667 1.94773404 0.29629630\n',
 '0.88888889 2.43242545 0.70233196\n',
 '1.11111111 3.03773178 1.37174211\n',
 '1.33333333 3.79366789 2.37037037\n',
 '1.55555556 4.73771786 3.76406036\n',
 '1.77777778 5.91669359 5.61865569\n',
 '2.00000000 7.38905610 8.00000000\n']

Often, data is easier to process by first separating a text file into individual lines.


Exercise: Find an old homework file ending in .py again. Open it and read its contents using readlines. How many lines does the file contain?


Parsing Strings

With read and readlines we get a single string for the whole file or a list of strings corresponding to each line, respectively. This is progress, but we need to do more in order to extract values (whatever they may be) from the string.

For the example of data.txt, the values of interest are the time, velocity, and acceleration. Hence, a representative string is the single line

'0.00000000 1.00000000 0.00000000\n'

which is the value of lines[1]. To extract the three, individual values, the split function can be used:

In [9]:
t, v, a = lines[1].split()
print(t)
print(v)
print(a)
0.00000000
1.00000000
0.00000000
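
Keep in mind that split produces str values, not numbers, so a conversion with float is still needed before doing arithmetic. A minimal sketch, using one line of data.txt typed in directly as a string:

```python
line = '0.22222222 1.24884887 0.01097394\n'

t, v, a = line.split()  # three str values; the trailing \n is discarded
kind = type(t)          # str, not float

# Convert each piece to a number before using it in calculations
t, v, a = float(t), float(v), float(a)
total = t + v           # arithmetic now works as expected
```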

In fact, str values can be split based on different criteria. By default, the str value is divided into sequences of characters separated by one or more whitespace characters (spaces, tabs, and newlines), which is why the trailing \n of lines[1] does not appear in the result. For example:

In [10]:
'The quick brown fox jumps over the lazy dog'.split()
Out[10]:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Here, we’ve split a (potentially familiar) sentence into its individual words. This sentence is a pangram, i.e., a sentence with all the letters of the alphabet. If, however, we had a string with comma-separated values, we could get those values via

In [11]:
'1,1,2,3,5,8,13'.split(',')
Out[11]:
['1', '1', '2', '3', '5', '8', '13']

Hence, split can be used to parse data that has a reasonably simple structure.
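
The two forms of split behave a little differently, which matters for data that is not perfectly tidy. The following sketch uses made-up strings to compare them:

```python
# Default split: any run of spaces, tabs, or newlines is one separator
line = '0.22222222\t1.24884887   0.01097394\n'
parts = line.split()
# parts is ['0.22222222', '1.24884887', '0.01097394']

# Explicit separator: only that exact string separates the values,
# so surrounding spaces are kept and empty fields are possible
csv = '1, 1, 2,,3'
fields = csv.split(',')
# fields is ['1', ' 1', ' 2', '', '3']
```

With an explicit separator, an empty string such as the fourth item above cannot be passed directly to int or float and needs special handling.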


Exercise: Write a short program to turn the string '1,1,2,3,5,8,13' into an ndarray.

Solution. The string must be split into individual strings, those individual strings must be turned into numbers, and those numbers must be put into an array. To size the array, we use the number of items resulting from the split. Here is a complete approach:

In [12]:
import numpy as np
s = '1,1,2,3,5,8,13' # start with the string
v = s.split(',')     # split it into a list of str values
a = np.zeros(len(v)) # initialize an array of proper length
for i in range(len(v)):
    a[i] = int(v[i]) # convert the ith value into an int

The str type provides several other functions to help one parse contents of a str value. For instance, how many times does the letter 'o' appear in 'hello world'?

In [13]:
'hello world'.count('o')
Out[13]:
2

Where does the first 'o' occur?

In [14]:
'hello world'.find('o')
Out[14]:
4

Is there an x in it?

In [15]:
'hello world'.find('x') # -1 if not found
Out[15]:
-1
In [16]:
'x' in 'hello world' # False if not one of the elements (here, characters)
Out[16]:
False

Does '1984' contain only numerical values? How about '3.14159'?

In [17]:
'1984'.isnumeric() # yes, just digits
Out[17]:
True
In [18]:
'3.14159'.isnumeric() # no, the . is not numerical
Out[18]:
False
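
Because isnumeric rejects the decimal point (and a minus sign), it cannot tell whether a string can be converted by float. One possible workaround is simply to attempt the conversion and see whether it fails. The helper is_float below is hypothetical (it is not a built-in function), and it relies on try and except, which may not have been covered yet:

```python
def is_float(text):
    """Return True if float(text) succeeds (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# isnumeric says no to a negative number, even though float accepts it
negative_is_numeric = '-1'.isnumeric()

checks = [is_float('1984'), is_float('3.14159'),
          is_float('-2.5e3'), is_float('N/A')]
```

Here checks is [True, True, True, False], while negative_is_numeric is False.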

If the data to be parsed has non-numerical values, we sometimes need to wrangle the data into a more usable form. Suppose our data for time, velocity, and acceleration suffered from a measurement error at 0 s, leading to this line

0.00000000 1.00000000 N/A

Obviously, N/A is not a number (and stands for “not available”), but it might be the value substituted by the measurement software if that particular value were missed. In order to process the data, it might be reasonable to treat all the N/A values as a zero, which could be accomplished by replacing N/A with 0.0:

In [19]:
s = '0.00000000 1.00000000 N/A'
s = s.replace('N/A', '0.0')
s
Out[19]:
'0.00000000 1.00000000 0.0'
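
Once the marker has been replaced, the line can be split and converted like any other. The sketch below shows the full sequence for a single line; whether 0.0 is a sensible stand-in for a missing value depends on the application and is assumed here only for illustration:

```python
line = '0.00000000 1.00000000 N/A\n'

# Replace the marker first, then split and convert as usual
cleaned = line.replace('N/A', '0.0')
values = []
for item in cleaned.split():
    values.append(float(item))
# values is [0.0, 1.0, 0.0]

# replace changes every occurrence and returns a new string;
# the original string is left untouched
both = 'N/A N/A'.replace('N/A', '0.0')
```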

Putting It All Together: Processing data.txt

Files can be read, and their contents can be put into a single string or a list of strings for each line of text. Using loops, these lines can be processed one by one, perhaps using split, find, count, or replace, depending on the application.

For data.txt, let us produce the three arrays t, v, and a. Take as known that the first line provides text information about the columns of data. Then, the entire processing can be done via

In [20]:
import numpy as np

# open file, read lines, and close file
f = open('data.txt', 'r')
lines = f.readlines()
f.close()

# initialize empty lists, one per column
t, v, a = [], [], []
for line in lines[1:]:
    vals = line.split()
    t.append(float(vals[0]))
    v.append(float(vals[1]))
    a.append(float(vals[2]))
t = np.array(t)
v = np.array(v)
a = np.array(a)
t
Out[20]:
array([0.        , 0.22222222, 0.44444444, 0.66666667, 0.88888889,
       1.11111111, 1.33333333, 1.55555556, 1.77777778, 2.        ])
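
Real files often end with one or more blank lines, for which split returns an empty list and vals[0] fails. One way to guard against that is sketched below. The function name parse_columns is made up, plain lists are returned instead of arrays to keep the example short, and the sample lines are typed in directly rather than read from data.txt:

```python
def parse_columns(lines):
    """Parse lines of three whitespace-separated numbers (hypothetical helper).

    The first line is assumed to be a header and is skipped.
    Blank lines are skipped as well.
    """
    t, v, a = [], [], []
    for line in lines[1:]:
        vals = line.split()
        if len(vals) == 0:  # blank line: nothing to parse
            continue
        t.append(float(vals[0]))
        v.append(float(vals[1]))
        a.append(float(vals[2]))
    return t, v, a

sample = ['time (s)   vel (m/s)  acc (m/s**2)\n',
          '0.00000000 1.00000000 0.00000000\n',
          '\n',
          '2.00000000 7.38905610 8.00000000\n']
t, v, a = parse_columns(sample)
```

The returned lists can be converted with np.array exactly as in the cell above.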

Writing to File

Having read data from a file, analysis can be performed. The results of such analysis are often displayed visually (think plt.plot). Other applications may require that data be saved to file. If the data is solely array-based, then np.savetxt is a good option. More complicated (or less structured) information requires a different approach.

Just as file handles can be created to read from a text file, they can be created to save to a text file. The only difference in syntax is the use of w instead of r:

In [21]:
f = open('new_data.txt', 'w')

Any string can be written to file using the write function:

In [22]:
f.write('Here is some sample text!')
f.close() # Always close a file when done.

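The write function returns the number of characters written, although a notebook displays only the value of the last expression in a cell (here, f.close(), which returns nothing). Here, f is closed after writing just one string. However, several strings can be written in sequence, with the contents of each additional string placed directly after the contents already written; write does not add a newline on its own.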
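
The following sketch writes several strings in sequence and then reopens the file in append mode ('a'). The file name sample.txt is made up, and a temporary folder is used so that no existing file is overwritten:

```python
import os
import tempfile

# A throwaway location so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

f = open(path, 'w')
n = f.write('first line\n')  # write returns the number of characters
f.write('second ')           # no newline is added automatically...
f.write('line\n')            # ...so this continues the same line
f.close()

f = open(path, 'a')          # 'a' adds to the end instead of overwriting
f.write('third line\n')
f.close()

f = open(path, 'r')
contents = f.read()
f.close()
```

Here n is 11 (ten visible characters plus the newline), and the file ends up with three lines. Opening the file with 'w' a second time would instead have discarded the first two lines.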

Often, numerical data should be written in a format that is easily read. A flexible way to produce formatted text is the str.format method. For example, consider the repeating decimal corresponding to 1/3. To print that number with just four decimal places, one could use

In [23]:
"{:.4f}".format(1/3)
Out[23]:
'0.3333'

Several strings, integers, and floating-point values can be formatted all at once:

In [24]:
"{} divided by {} is approximately {:.4f}".format(1, 3, 1/3)
Out[24]:
'1 divided by 3 is approximately 0.3333'

Using format requires a string with one or more sets of {}. The arguments to format are the values to be formatted. If the {} appears without any contents, the corresponding value is formatted into its default str representation. For example, compare the following:

In [25]:
str(1/3)
"{}".format(1/3)
Out[25]:
'0.3333333333333333'

That sure includes a lot of decimal places!

More specific formatting instructions can be passed. For float values, the syntax {:.4f} yields a formatted value with any number of digits to the left of the decimal point and exactly four digits to the right of the decimal point. The colon : is always included when a format specification is given (such as .4f). If scientific notation is preferred, one can use a format like {:.4e}. Compare the following:

In [26]:
'{:.4f}'.format(1/3)
'{:.4e}'.format(1/3)
Out[26]:
'3.3333e-01'

Python provides several other formats for handling special int, float, and other needs, but {}, {:.4f}, and {:.4e} (with values other than 4) should be sufficient for most simple tasks.
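
As a brief illustration of a few of those other formats, the sketch below adds a minimum field width, which is handy for lining values up in columns. The particular widths and values are arbitrary choices for the example:

```python
# Field width: pad to at least 10 characters (numbers are right-aligned)
w = '{:10.4f}'.format(1/3)        # '    0.3333'

# Integers use d; text can be aligned with < (left) or > (right)
d = '{:5d}'.format(42)            # '   42'
r = '{:>8}'.format('Python')      # '  Python'

# Precisions other than 4 work the same way
e = '{:.2e}'.format(12345.678)    # '1.23e+04'
```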

With formatting, the time, velocity, and acceleration data read above can be written back to file:

In [27]:
f = open('new_data.txt', 'w')
# write the header information (remember the newline \n!)
f.write('time (s)   vel (m/s)  acc (m/s**2)\n')
for i in range(len(t)):
    # produce each line of text (again, remember \n!)
    s = "{:.8f} {:.8f} {:.8f}\n".format(t[i], v[i], a[i])
    f.write(s)
f.close()

Now, new_data.txt should have exactly the same contents as data.txt.


Exercise: Go and verify that the previous code produces the expected file new_data.txt. This will require that you download and run this notebook or copy the code and paste it into Spyder.


Exercise: Given a = 1, b = 1/9, and c = 'Python', produce the string '1   0.111   Python'.


Exercise: Given A = np.random.rand(5, 5), write its values to file element by element. Each line of the text file should look similar to 1 4 0.569113, where the values 1 and 4 correspond to the (zero-based) row and column of the element.

Solution. First, recognize the desired format: an integer, a space, an integer, a space, and a float with 6 decimal places. Using the format function, such a string is produced via "{} {} {:.6f}".format(i1, i2, f1), where i1 and i2 have int values and f1 has a float value. Second, that format can be used inside a nested loop structure that iterates through all rows and columns and writes i, j, and A[i, j] to file.

In [28]:
A = np.random.rand(5, 5)
f = open('data_from_A.txt', 'w')
for i in range(5):
    for j in range(5):
        f.write("{} {} {:.6f}\n".format(i, j, A[i, j]))
f.close()

Remember that the \n character is needed to end a line.


Further Reading

The student interested in more complex formatting should read the documentation on Python’s formatting syntax.