Processing Text Data
Overview, Objectives, and Key Terms
Lectures 5 through 9 covered the basic logical structures used in programming and their implementation in Python. Lecture 10 presented Python’s built-in container types. In this lecture, we turn to the practical problem of processing text data. Often, such data starts life in files on our machines. Ultimately, that data is represented as one (or more) strings that can be processed using a combination of the structures already covered (particularly, loops) and more specialized string functions. We’ll wrap up with ways in which we can output existing data into useful text-based formats.
Objectives
By the end of this lesson, you should be able to
- Open and process text files.
- Use string functions to parse data into desired formats.
- Convert data into desired string formats.
- Write strings to text files.
Key Terms
open
close
read
str
str.split
str.count
str.find
str.isnumeric
str.replace
in operator
str.format
{} for replacement
write
Reading Text Files
We’ll start with a case we’ve seen before: the text file from Lecture 4, the contents of which are
time (s) vel (m/s) acc (m/s**2)
0.00000000 1.00000000 0.00000000
0.22222222 1.24884887 0.01097394
0.44444444 1.55962350 0.08779150
0.66666667 1.94773404 0.29629630
0.88888889 2.43242545 0.70233196
1.11111111 3.03773178 1.37174211
1.33333333 3.79366789 2.37037037
1.55555556 4.73771786 3.76406036
1.77777778 5.91669359 5.61865569
2.00000000 7.38905610 8.00000000
It was suggested then that any data with such a simple format is best
read by using np.loadtxt. However, the simplicity of this data makes
it useful for studying more traditional file processing.
In Python, files are first opened to produce a file handle, i.e., a
variable that provides the user access to the file and various functions
with which to inspect the file. To load a file and produce such a file
handle requires the built-in open function. This function is
typically called with two arguments: a filename, always a str value,
and the access mode, which can be r for read, w for write,
or a for append. For now, we’ll focus on reading files, and we
can open data.txt for reading by executing
In [1]:
f = open('data.txt', 'r')
f
The particular value of f is not important, but it does suggest it
is associated with our text file and that it is set for reading. At this
point, though, the file is only ready to be processed; we still need
to do something with it. Let’s see what we can do with the file handle
f:
In [2]:
for item in dir(f):
    if not item[0:1] == '_':
        print(item)
In fact, file handles have quite a few functions. The important ones for
getting the content of the file into our program are read and
readlines. We’ll explore write later on. Let’s start with
f.read():
In [3]:
help(f.read)
By default, read will read all the characters. Although not
stated, the function produces a str with all of the characters read.
Hence, we can turn the contents of our text file (connected now to
f) into a single string via
In [4]:
s = f.read()
s
Well, the contents sure are there, but the formatting is a bit weird. If we print the string, we see what we might expect:
In [5]:
print(s)
By default, displaying a str variable (and not printing it) will
show all of the characters, including the special \n character.
Here, \n is a newline character and represents the break between
lines in text files. When a string is printed, the effects of special
characters (like \n) are shown.
Exercise: Define a string that, when printed, shows “ABC” on the first line and “XYZ” on the second line.
Now that we’ve read the file into a single string, we need to close it:
In [6]:
f.close()
It is important to close files once read. Doing so releases the operating-system resources tied to the file handle, and, for files opened for writing, it ensures that any buffered text is actually written to disk.
Note: Always close a file once it has been read.
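One way to make sure a file is closed, even if an error occurs while reading it, is Python's with statement. The sketch below is self-contained, so it first creates a small sample file in a temporary directory; the file name sample.txt and its contents are assumptions made only for illustration.

```python
import os
import tempfile

# Create a small sample file so the example runs on its own
# (the name and contents are made up for illustration).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('time (s) vel (m/s)\n0.0 1.0\n')

# The file is closed automatically when the with block ends,
# so no explicit f.close() is needed.
with open(path, 'r') as f:
    s = f.read()

print(f.closed)  # True: the handle was closed on leaving the block
print(s)
```

The explicit open and close calls used in this lecture do the same job; the with form simply removes the chance of forgetting the close.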
Exercise: The open function is not limited to files ending in
.txt. Any text file works, including .py files. Go and open a
previous homework or laboratory file (making sure to set the mode to
r and not w!). Read the contents into a string, and use a loop
to count the number of # characters in your file.
In addition to f.read(), we can use f.readlines(). Now, we’ve
closed the file, so we need to open it again before we read it. Then,
following the advice above, the file is closed:
In [7]:
f = open('data.txt', 'r')
lines = f.readlines()
f.close()
The readlines function reads the same contents as read, but the
result is a list of each line of the text file:
In [8]:
lines
Out[8]:
['time (s) vel (m/s) acc (m/s**2)\n',
'0.00000000 1.00000000 0.00000000\n',
'0.22222222 1.24884887 0.01097394\n',
'0.44444444 1.55962350 0.08779150\n',
'0.66666667 1.94773404 0.29629630\n',
'0.88888889 2.43242545 0.70233196\n',
'1.11111111 3.03773178 1.37174211\n',
'1.33333333 3.79366789 2.37037037\n',
'1.55555556 4.73771786 3.76406036\n',
'1.77777778 5.91669359 5.61865569\n',
'2.00000000 7.38905610 8.00000000\n']
Often, data is easier to process by first separating a text file into individual lines.
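A file handle can also be looped over directly, which yields one line per iteration, and str.rstrip can remove the trailing newline from each line. The following is a minimal sketch of that idea; the temporary file and its three lines are assumptions used only so the example runs on its own.

```python
import os
import tempfile

# Build a small sample file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
f = open(path, 'w')
f.write('first line\nsecond line\nthird line\n')
f.close()

# Loop over the handle itself instead of calling readlines().
f = open(path, 'r')
clean_lines = []
for line in f:                             # one line per iteration
    clean_lines.append(line.rstrip('\n'))  # drop the trailing newline
f.close()

print(clean_lines)
print(len(clean_lines))
```

Looping over the handle avoids holding a second copy of the file in memory, which may matter for large files; for small files such as data.txt either approach works.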
Exercise: Go find again an old homework file ending in .py. Open
it and read its contents using readlines. How many lines does the
file contain?
Parsing Strings
With read and readlines we get a single string for the whole
file or a list of strings corresponding to each line, respectively.
This is progress, but we need to do more in order to extract values
(whatever they may be) from the string.
For the example of data.txt, the values of interest are the time,
velocity, and acceleration. Hence, a representative string is the single
line
'0.00000000 1.00000000 0.00000000\n'
which is the value of lines[1]. To extract the three individual
values, the split function can be used:
In [9]:
t, v, a = lines[1].split()
print(t)
print(v)
print(a)
0.00000000
1.00000000
0.00000000
In fact, str values can be split based on different criteria. By
default, the str value is divided into sequences of characters
separated by one or more whitespace characters (spaces, tabs, and
newlines), which is why the trailing \n did not appear in the values
above. For example:
In [10]:
'The quick brown fox jumps over the lazy dog'.split()
Out[10]:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Here, we’ve split a (potentially familiar) sentence into its individual words. This sentence is a pangram, i.e., a sentence with all the letters of the alphabet. If, however, we had a string with comma-separated values, we could get those values via
In [11]:
'1,1,2,3,5,8,13'.split(',')
Out[11]:
['1', '1', '2', '3', '5', '8', '13']
Hence, split can be used to parse data that has a reasonably simple
structure.
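The difference between calling split with no argument and calling it with an explicit ' ' separator can be surprising. The short sketch below compares the two on a made-up line (an assumption for illustration) that mixes repeated spaces, a tab, and a newline.

```python
line = '0.22222222   1.24884887\t0.01097394\n'

# No argument: split on any run of whitespace (spaces, tabs, newlines).
default_parts = line.split()
print(default_parts)
# ['0.22222222', '1.24884887', '0.01097394']

# Explicit separator: every single ' ' is a split point, so repeated
# spaces produce empty strings, and the tab and newline are left alone.
space_parts = line.split(' ')
print(space_parts)
# ['0.22222222', '', '', '1.24884887\t0.01097394\n']
```

For whitespace-separated columns of numbers, the argument-free form is usually the more forgiving choice.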
Exercise: Write a short program to turn the string
'1,1,2,3,5,8,13' into an ndarray.
Solution. The string must be split into individual strings, those individual strings must be turned into numbers, and those numbers must be put into an array. To size the array, we use the number of items resulting from the split. Here is a complete approach:
In [12]:
import numpy as np
s = '1,1,2,3,5,8,13' # start with the string
v = s.split(',') # split it into a list of str values
a = np.zeros(len(v)) # initialize an array of proper length
for i in range(len(v)):
    a[i] = int(v[i]) # convert the ith value into an int
The str type provides several other functions to help one parse
contents of a str value. For instance, how many times does the
letter 'o' appear in 'hello world'?
In [13]:
'hello world'.count('o')
Out[13]:
2
Where does the first 'o' occur?
In [14]:
'hello world'.find('o')
Out[14]:
4
Is there an x in it?
In [15]:
'hello world'.find('x') # -1 if not found
Out[15]:
-1
In [16]:
'x' in 'hello world' # false if not one of the elements (here, characters)
Out[16]:
False
Does '1984' contain only numerical values? How about '3.14159'?
In [17]:
'1984'.isnumeric() # yes, just digits
Out[17]:
True
In [18]:
'3.14159'.isnumeric() # no, the . is not numerical
Out[18]:
False
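Because isnumeric rejects the decimal point (and a minus sign), it cannot tell whether a string represents a real number. One common alternative, sketched here, is to attempt the conversion with float and catch the ValueError raised on failure. The helper name is_float is hypothetical and not part of Python.

```python
def is_float(text):
    """Return True if text can be converted to a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

print(is_float('1984'))     # True
print(is_float('3.14159'))  # True
print(is_float('-2.5e3'))   # True
print(is_float('N/A'))      # False
```

Note that float also accepts strings such as 'nan' and 'inf', so this check is broader than "contains only digits and a decimal point".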
If the data to be parsed has non-numerical values, we sometimes need to wrangle the data into a more usable form. Suppose our data for time, velocity, and acceleration suffered from a measurement error at one of the times, leading to this line
0.00000000 1.00000000 N/A
Obviously, N/A is not a number (and stands for “not available”), but
it might be the value substituted by the measurement software if that
particular value were missed. In order to process the data, it might be
reasonable to treat all the N/A values as a zero, which could be
accomplished by replacing N/A with 0.0:
In [19]:
s = '0.00000000 1.00000000 N/A'
s = s.replace('N/A', '0.0')
s
Out[19]:
'0.00000000 1.00000000 0.0'
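The replacement and the conversion to numbers can be combined in one small function. This is only a sketch: the name parse_line and its missing and fill parameters are hypothetical, and treating missing values as zero (or as nan) is an assumption that depends on the application.

```python
def parse_line(line, missing='N/A', fill='0.0'):
    """Split a line into floats, substituting fill for missing entries."""
    cleaned = line.replace(missing, fill)
    return [float(x) for x in cleaned.split()]

zeros = parse_line('0.00000000 1.00000000 N/A')
print(zeros)
# [0.0, 1.0, 0.0]

# Using 'nan' keeps the gap visible instead of inventing a zero.
nans = parse_line('0.5 N/A N/A', fill='nan')
print(nans)
```

Because replace works on substrings, this approach would also alter any longer token that happens to contain N/A; for data in which that could occur, comparing each split value to the marker is safer.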
Putting It All Together: Processing data.txt
Files can be read, and their contents can be put into a single string or
a list of strings for each line of text. Using loops, these lines can be
processed one by one, perhaps using split, find, count, or
replace, depending on the application.
For data.txt, let us produce the three arrays t, v, and
a. Take as known that the first line provides text information about
the columns of data. Then, the entire processing can be done via
In [20]:
import numpy as np
# open file, read lines, and close file
f = open('data.txt', 'r')
lines = f.readlines()
f.close()
# initialize empty lists
t, v, a = [], [], []
for line in lines[1:]:
    vals = line.split()
    t.append(float(vals[0]))
    v.append(float(vals[1]))
    a.append(float(vals[2]))
t = np.array(t)
v = np.array(v)
a = np.array(a)
t
Out[20]:
array([0. , 0.22222222, 0.44444444, 0.66666667, 0.88888889,
1.11111111, 1.33333333, 1.55555556, 1.77777778, 2. ])
Writing to File
Having read data from a file, analysis can be performed. The results of
such analysis are often displayed visually (think plt.plot). Other
applications may require that data be saved to file. If the data is
solely array-based, then np.savetxt is a good option. More
complicated (or less structured) information requires a different
approach.
Just as file handles can be created to read from a text file, they can
be created to save to a text file. The only difference in syntax is
the use of w instead of r:
In [21]:
f = open('new_data.txt', 'w')
Any string can be written to file using the write function:
In [22]:
f.write('Here is some sample text!')
f.close() # Always close a file when done.
The value returned by write is the number of characters written. Here,
f is closed after writing just one string. However, several strings can
be written in sequence, with the contents of any additional string placed
directly after the contents already written.
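To illustrate sequential writes, and the a (append) mode mentioned earlier but not yet used, here is a minimal sketch. The file name notes.txt, the temporary directory, and the three lines of text are assumptions chosen only so the example runs on its own.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

f = open(path, 'w')            # 'w' creates the file (or overwrites it)
n = f.write('first line\n')    # write returns the number of characters
f.write('second line\n')       # placed directly after the first string
f.close()

f = open(path, 'a')            # 'a' keeps what is already in the file
f.write('third line\n')
f.close()

f = open(path, 'r')
contents = f.read()
f.close()

print(n)
print(contents)
```

Opening the same file again with w instead of a would have discarded the first two lines, which is why the earlier exercise warns against using w on a file you want to keep.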
Often, numerical data should be written in a format that is easily
read. A flexible way to produce formatted text is to use the
str.format method. For example, consider the repeating decimal
corresponding to 1/3. To print that number with just four decimal
places, one could use
In [23]:
"{:.4f}".format(1/3)
Out[23]:
'0.3333'
Several strings, integers, and floating-point values can be formatted all at once:
In [24]:
"{} divided by {} is approximately {:.4f}".format(1, 3, 1/3)
Out[24]:
'1 divided by 3 is approximately 0.3333'
To use format requires a string with one or more sets of {}. The
arguments to format are the values to be formatted. If the {}
appears without any contents, the corresponding value is formatted into
its default str representation. For example, compare the following:
In [25]:
str(1/3)
"{}".format(1/3)
Out[25]:
'0.3333333333333333'
That sure includes a lot of decimal places!
More specific formatting instructions can be passed. For float
values, the syntax {:.4f} yields a formatted value with any number
of digits to the left of the decimal point and exactly four digits to
the right of the decimal point. The colon : is always included when
a format specification is given (such as .4f). If scientific
notation is preferred, one can use a format like {:.4e}. Compare the
following:
In [26]:
'{:.4f}'.format(1/3)
'{:.4e}'.format(1/3)
Out[26]:
'3.3333e-01'
Python provides several other formats for handling special int,
float, and other needs, but {}, {:.4f}, and {:.4e} (with values
other than 4) should be sufficient for most simple tasks.
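As a brief taste of those other formats, the sketch below shows a field width, an integer format, left alignment, and the f-string syntax (available in Python 3.6 and later), which accepts the same format specifications. The variable names and sample values are chosen only for illustration.

```python
x = 1/3

fixed = '{:10.4f}'.format(x)       # width 10, four decimals, right-aligned
sci = '{:.2e}'.format(12345.678)   # scientific notation, two decimals
integer = '{:5d}'.format(42)       # integer padded to a width of 5
left = '{:<8}|'.format('abc')      # left-aligned in a field of width 8
fstr = f'{x:.4f}'                  # f-string using the same .4f spec

print(repr(fixed))    # '    0.3333'
print(repr(sci))      # '1.23e+04'
print(repr(integer))  # '   42'
print(repr(left))     # 'abc     |'
print(repr(fstr))     # '0.3333'
```

Fixed widths are handy when columns of numbers should line up in a text file.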
With formatting, the time, velocity, and acceleration data read above can be written back to file:
In [27]:
f = open('new_data.txt', 'w')
# write the header information (remember the newline \n!)
f.write('time (s) vel (m/s) acc (m/s**2)\n')
for i in range(len(t)):
    # produce each line of text (again, remember \n!)
    s = "{:.8f} {:.8f} {:.8f}\n".format(t[i], v[i], a[i])
    f.write(s)
f.close()
Now, new_data.txt should have exactly the same contents as data.txt.
Exercise: Go and verify that the previous code produces the expected
file new_data.txt. This will require that you download and run this
notebook or copy the code and paste it into Spyder.
Exercise: Given a = 1, b = 1/9, and c = 'Python',
produce the string '1 0.111 Python'.
Exercise: Given A = np.random.rand(5, 5), write its values to
file element by element. Each line of the text file should look similar
to 1 5 0.569113, where the values 1 and 5 correspond to the row and
column of the element.
Solution. First, recognize the desired format: an integer, a space, an
integer, a space, and a float with 6 decimal places. Using the format
function, such a string is produced via
"{} {} {:.6f}".format(i1, i2, f1)
where i1 and i2 have int values and f1 has a float value. Second, that
format can be used inside a nested loop structure that iterates through
all rows and columns and writes i, j, and A[i, j] to file.
In [28]:
A = np.random.rand(5, 5)
f = open('data_from_A.txt', 'w')
for i in range(5):
    for j in range(5):
        f.write("{} {} {:.6f}\n".format(i, j, A[i, j]))
f.close()
Remember that the \n character is needed to end a line.
Further Reading
The student interested in more complex formatting should read the documentation on Python’s formatting syntax.