Part B: Strings, Lists, Tuples, Loops, Conditionals, File I/O

As we will come to appreciate, a running Python program consists of objects. Even numbers are objects. Last time we worked with objects that were vectors. Python attaches many things to an object. The vector objects had component values, but the function length() was also attached, for example. Learning to use and design objects is a key part of using Python effectively. Here we'll talk about some natural kinds of object in Python. Soon we'll use these to build more complex objects of our own.

Sequence objects

Sequence objects are objects that contain elements which are referred to by indices. They include lists, arrays, text strings, and the like. The elements of any sequence object seq can be obtained by indexing: seq[0] or seq[1]. The indices start from 0; that is, seq[0] is the first element and seq[1] is the second.

It is also possible to index from the end: seq[-1] is the last element of a sequence, seq[-2] the next-to-last, and so on.

Example:
In [38]: text = 'abc'

In [39]: print text[1]
b

In [40]: print text[-1]
c

Subsequences can be extracted by slicing: seq[0:5] is a subsequence containing the first five elements, numbered 0, 1, 2, 3, and 4, but not element number 5. Negative indices are allowed: seq[1:-1] strips off the first and last element of a sequence.

Example:
In [41]: text = 'A somewhat longer string.'

In [42]: print text[2:10]
somewhat

In [43]: print text[-7:-1]
string
Question: Why is there no period '.' in the output?

The length of a sequence can be determined by len(seq).
In [44]: print len(text)
25
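The indexing and slicing rules above can be collected in one short sketch. It assumes Python 3, where print is a function (the sessions in this tutorial use the Python 2 print statement), and the list called numbers is made up for illustration.

```python
# Indexing and slicing work the same way for any sequence type.
text = 'A somewhat longer string.'

first = text[0]          # first element
last = text[-1]          # last element, counted from the end
word = text[2:10]        # elements 2..9; element 10 is excluded
inner = text[1:-1]       # everything except the first and last element

print(first, last)
print(word)
print(len(text), len(inner))

# The same expressions apply unchanged to a list.
numbers = [10, 20, 30, 40, 50]
print(numbers[0], numbers[-1], numbers[1:-1])
```

A slice seq[i:j] always has j - i elements when both indices lie inside the sequence, which is one reason the end index is excluded.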

Lists

Lists are sequences that can contain arbitrary objects, such as numbers, strings, vectors, and even other lists:
In [17]: some_prime_numbers = [2, 3, 5, 7, 11, 13]

In [18]: names = ['Smith', 'Jones']

Lists are delimited by square brackets.

Example:
In [19]: a_mixed_list = [3, [2, 'b'], array([1, 1, 0]) ]
We got an error! Remembering the first tutorial, we load in
from numpy import array
Now it works:
In [20]: a_mixed_list = [3, [2, 'b'], array([1, 1, 0])]
Question: There are two lists here. Where are they?

Answer: a_mixed_list and [2, 'b'] are lists, while array([1, 1, 0]) is a numpy array.
In [20]: print array([1,1,0])
[1 1 0]
No commas separate the components—a hint that it is not a Python list. Nonetheless, it is still indexable like a list.

Let's experiment:
In [21]: print a_mixed_list[1]
[2, 'b']

In [22]: print a_mixed_list[1][1]
b

In [23]: print a_mixed_list[2]   
[1 1 0]

In [24]: print a_mixed_list[2][1]
1

So indexing operations can be chained whenever the object retrieved is itself a sequence, such as a list or an array.

Elements and subsequences of a list can be changed by assigning a new value:
In [25]: print names
['Smith', 'Jones']

In [26]: names[1] = 'Python'

In [27]: print names
['Smith', 'Python']
Here we replace everything from index 3 (the fourth element) onward with a new list:
In [28]: print some_prime_numbers
[2, 3, 5, 7, 11, 13]

In [29]: some_prime_numbers[3:] = [17, 19, 23, 29]

In [30]: print some_prime_numbers
[2, 3, 5, 17, 19, 23, 29]

A new element can be appended at the end with list.append(new_element).
In [40]: names.append('Fred')

In [42]: print names
['Smith', 'Python', 'Fred']

A list can be sorted with list.sort().
In [43]: names.sort()

In [44]: print names
['Fred', 'Python', 'Smith']

A list can be reversed with list.reverse().
In [45]: names.reverse()

In [46]: print names
['Smith', 'Python', 'Fred']

Two lists can be concatenated, just like we did previously with text strings:
In [47]: [0, 1] + ['a', 'b']
Out[47]: [0, 1, 'a', 'b']

Lists can also be repeated, like text strings:
In [40]: 3*['Hallelujah']
Out[40]: ['Hallelujah', 'Hallelujah', 'Hallelujah']
See Chapter 7 'Strings' in Learning Python for the many other list operations that are possible.
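One point that may deserve a quick sketch: append(), sort(), and reverse() change the list in place and return None, whereas + and * build new lists. The sketch below assumes Python 3 (print as a function) and also uses the built-in sorted(), which is not covered above; it returns a sorted copy and leaves the original alone.

```python
names = ['Smith', 'Python']

names.append('Fred')            # modifies names in place
result = names.sort()           # sorts in place; the return value is None
print(names, result)

primes = [2, 3, 5, 7, 11, 13]
primes[3:] = [17, 19, 23, 29]   # replace everything from index 3 onward
print(primes)

original = [3, 1, 2]
ordered = sorted(original)      # new list; original is unchanged
doubled = 2 * original          # repetition also builds a new list
print(original, ordered, doubled)
```

A common slip is writing names = names.sort(), which leaves names set to None.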

Tuples

Tuples are much like lists, except that they cannot be changed. Once created, a tuple always has the same elements. They can be used, therefore, in situations where a modifiable sequence does not make sense; for example, as keys in a database.

Example:
In [67]: tuple_1 = (1, 2)

In [68]: tuple_2 = ('a', 'b')
Tuples are delimited by parentheses.

Tuples behave much as list objects do:
In [8]: print tuple_1[1]
2

Let's try:
In [52]: combined_tuple = tuple_1 + 2*tuple_2

In [53]: print combined_tuple
(1, 2, 'a', 'b', 'a', 'b')

However, try
In [9]: tuple_1[1] = 3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/Users/.../Software/ in ()

TypeError: 'tuple' object does not support item assignment

So tuples differ from lists: you cannot change the values of their components once created.
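As a rough illustration of why an unchangeable sequence is useful, the sketch below uses tuples as keys of a dictionary, a lookup table type not otherwise introduced in this section. It assumes Python 3, and the dictionary contents are invented for the example.

```python
point = (1, 2)

# Tuples can serve as dictionary keys because they cannot change.
# The dictionary and its contents are made up for this illustration.
labels = {(0, 0): 'origin', (1, 2): 'sample point'}
print(labels[point])

# Item assignment fails; build a new tuple instead.
try:
    point[1] = 3
    assignment_failed = False
except TypeError:
    assignment_failed = True

moved = point[:1] + (3,)       # note the trailing comma in a one-element tuple
print(assignment_failed, moved)
```

Using a list such as [1, 2] as a key would raise a TypeError, since lists can be modified after creation.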

Loops over sequences

It is often necessary to repeat some operation for each element of a sequence. This is called a loop over the sequence.
In [54]: for prime_number in [2, 3, 5, 7]:
   ....:     square = prime_number**2
   ....:     print square
   ....:     
   ....:     
4
9
25
49

Okay, a lot just happened here; let's unpack it.

First, we have the structure of the for loop: we define a variable prime_number that successively takes on values in the list [2, 3, 5, 7].

The for loop line ends with a colon (':').

Second, the operations that are part of the loop are indicated by indentation. If you are using iPython, you will see that the indentation is automatically provided for you.

Generally, say when you don't use iPython, you must provide the indentation yourself. This is a syntactic requirement of Python.

Third, to complete the loop and execute the code, two blank lines need to be entered if the text is indented.

(In iPython, I use 'ctrl-U' to remove the indentation, if I want to type at the line's beginning. See below.)

Loops work with any sequence. Here is an example with text strings:
In [57]: for vowel in 'aeiou':
   ....:     print 10*vowel
   ....:     
   ....:     
aaaaaaaaaa
eeeeeeeeee
iiiiiiiiii
oooooooooo
uuuuuuuuuu

Loops over a range of numbers are just a special case:
In [58]: from math import sqrt

In [59]: for i in range(10):
   ....:     print i, sqrt(i)
   ....:     
   ....:     
0 0.0
1 1.0
2 1.41421356237
3 1.73205080757
4 2.0
5 2.2360679775
6 2.44948974278
7 2.64575131106
8 2.82842712475
9 3.0

The function range(n) returns a list containing the first n integers; that is, from 0 to n-1:
In [3]: print range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The form range(i, j) returns all integers from i to j-1:
In [4]: print range(4,10)
[4, 5, 6, 7, 8, 9]

and, finally, the form range(i, j, k) returns i, i+k, i+2*k, and so on, stopping before j is reached:
In [5]: print range(4,10,2)
[4, 6, 8]

range() returns a list, as we noted. Often, though, one wants to repeat an operation, perhaps controlled by a for loop, a very large number of times. Using range(10000000) would create a list of ten million integers, which wastes memory and is slow. In these circumstances you can use the xrange() function, which returns not a list but an xrange object. When a for loop iterates over that object, it produces the numbers one at a time, so the full list is never stored.
In [6]: print xrange(10)
xrange(10)

In [7]: for i in xrange(10):
   ...:     print i
   ...:     
   ...:     
0
1
2
3
4
5
6
7
8
9
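For readers on Python 3, which this sketch assumes, range() itself returns a lazy object much like the xrange() object described above, and xrange() no longer exists. Wrapping it in list() shows the numbers it would produce.

```python
# list() forces the lazy range object into an explicit list.
print(list(range(10)))
print(list(range(4, 10)))
print(list(range(4, 10, 2)))

# The stop value is never included, even when the step does not land on it.
stepped = list(range(4, 11, 3))
print(stepped)

# A large range costs almost no memory because no list is built.
total = 0
for i in range(1000000):
    total += i
print(total)
```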

Testing conditions

The most frequent conditions that are tested are equality and order:
equal a == b
not equal a != b
greater than a > b
less than a < b
greater than or equal a >= b
less than or equal a <= b

Several conditions can be combined with and and or, and negations are formed with not. The result of a comparison is the Boolean value True or False; these behave like the numbers 1 and 0 in arithmetic.
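A small sketch of combining comparisons, assuming Python 3 (print as a function); the values of a and b are arbitrary.

```python
a = 3
b = 7

in_order = a < b                     # a comparison yields a Boolean
both = (a < b) and (b < 10)          # true only if both parts are true
either = (a > b) or (b == 7)         # true if at least one part is true
negated = not (a == b)

print(in_order, both, either, negated)

# True and False behave like 1 and 0 in arithmetic.
print(True + True, int(False))

# Comparisons can be chained; this means (0 < a) and (a < b).
chained = 0 < a < b
print(chained)
```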

Typically, conditions are used for decisions:
In [61]: a = 'def'

In [62]: b = 'def'

In [63]: if a == b:
   ....:     print "equal"
   ....: elif a > b:
   ....:     print "greater"
   ....: else:
   ....:     print "less"
   ....:     
   ....:     
equal

There can be any number of elif branches (including none), and the else branch is optional. These must start at the level of indentation that matches the corresponding if.

Note that each line of the control structure ('if', 'elif', and 'else') ends with a colon (':'). And the commands within each portion of the control structure are indented.

(Again, in iPython I use 'ctrl-U' to remove the automatic indentation and start typing at the line's beginning.)

Conditions can also be used to control a loop:
In [65]: x = 1.

In [66]: while x > 1.e-2:
   ....:     print x
   ....:     x = x/2.
   ....:     
   ....:     
1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625

This illustrates a companion control structure—a while loop.
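To make the stopping behavior concrete, here is a variant of the loop above that counts its passes instead of printing each value. It assumes Python 3, and the counter name steps is illustrative.

```python
x = 1.0
steps = 0
while x > 1.0e-2:       # tested before every pass through the body
    x = x / 2.0
    steps += 1

print(steps, x)
```

The condition is checked before each pass, so the loop ends with x already below the threshold. If the condition were false at the start, the body would not run at all.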


Text files

File objects are pre-defined in a Python module that is automatically loaded when Python starts.

First thing, we need a file with something in it:
In [11]: cat > t
df
sdf
dsf
d
^D

This makes a file named 't' with those miscellaneous lines.

Reminder: the interactive session above is all done within iPython, which allows one to use some of the basic Unix shell commands and syntax, such as cat, ls, file redirection (< and >), and so on. I ended the input to the cat Unix command with 'ctrl-D'.

It then produced the file named t:
In [12]: cat t
df
sdf
dsf
d

For a refresher on Unix shell commands, see an online tutorial, such as one of those listed under Computing Tutorials.

If your system doesn't have cat, then somehow make a file 't', with something in it, like the above.

Reading

Text files can be treated as sequences of lines, with the limitation that the lines must be read in sequence. The following program opens the file we just made and prints all of its lines:
In [13]: f = open('t')

In [14]: for line in f:
   ....:     print line[:-1]
   ....:     
   ....:     
df
sdf
dsf
d

Why line[:-1]? The last character of each line is the new-line character, which we don't want to print. (It would create blank lines between the lines of the file; try using only print line.)
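One caveat with line[:-1]: if the file does not end with a new-line character, it removes the last real character of the final line. A possible alternative is line.rstrip('\n'), which removes a trailing new-line only if one is present. The sketch assumes Python 3 and writes its own scratch file in a temporary directory, so the path is not the t used above.

```python
import os
import tempfile

# Create a small scratch file so the sketch is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 't')
with open(path, 'w') as f:
    f.write('df\nsdf\ndsf\nd')        # note: no new-line after the last line

lines = []
with open(path) as f:                 # the file is closed when the block ends
    for line in f:
        lines.append(line.rstrip('\n'))

print(lines)
```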

Question: What happens if the file you try to open doesn't exist?

File objects that work with compressed files are defined in the module gzip.

A gzip.GzipFile object reads and writes files in the gzip format, which by convention have names ending in .gz and are created by Unix gzip. The file name does not select the behavior: opening a file with GzipFile always treats it as gzip data. (.Z files created by Unix compress use a different format that this module does not read.)

When programs access the GzipFile object they retrieve the uncompressed version of the data.

So, using the iPython shell command escape '!', I run the gzip compression program on the file:
In [15]: !gzip t

Giving the compressed version 't.gz':
In [16]: ls t
ls: t: No such file or directory

In [17]: ls t.gz
t.gz

gzip replaced our original file t with the compressed version t.gz.

What's there?
In [18]: cat t.gz
?HtKI?*NI?J)b.??r?

Raw data representing the original information, but in a smaller file.

So how does a program access the original information? A GzipFile object decompresses the data as it is read.
In [19]: import gzip

In [20]: for line in gzip.GzipFile('t.gz'):
   ....:     print line[:-1]
   ....:     
   ....:     
df
sdf
dsf
d

You can even use a URL (Uniform Resource Locator) instead of a filename and so read data directly via the Internet. Let's try this on the class home page:
In [21]: import urllib

In [22]: for line in urllib.urlopen('http://csc.ucdavis.edu/~chaos/courses/nlp/'):
   ....:     print line[:-1]
   ....:     
   ....:     

Nonlinear Physics: Modeling Chaos and Complexity---Spring 2010 Course, Physics 150/250


...

and lots of other output from that page.

Writing

Files can be opened for writing instead of reading. This is specified by an additional argument to open(). See the 'w' in the command below:
In [23]: file = open('tt','w')

In [24]: file.write('The first line\n')

In [25]: file.write('And the')

In [26]: file.write(' second li')

In [27]: file.write('ne')

In [28]: file.write('\n')

In [29]: file.close()

In [30]: cat tt
The first line
And the second line

Each write() command appends the string argument to the output file. The '\n' represents the new-line character. Notice the role that it played in giving two lines.

Files opened for writing should be closed at the end to make sure that all data is actually written to the file. At the end of a program, all open files will be closed automatically, but it is better not to rely on this.
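One way to avoid forgetting close() is the with statement, which closes the file when its indented block ends. The sketch below assumes Python 3 and writes to a temporary directory rather than the tt file above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'tt')

with open(path, 'w') as out:          # 'w' creates or overwrites the file
    out.write('The first line\n')
    out.write('And the')
    out.write(' second li')
    out.write('ne')
    out.write('\n')
# Leaving the with block closes the file, even if an error occurred inside it.

with open(path) as check:
    contents = check.read()

print(out.closed)
print(contents)
```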

Using the module gzip, automatic compression is also available for writing. (But not for URLs, since servers do not permit write access; well, some might, but they shouldn't!)

In [31]: import gzip

In [32]: file = gzip.GzipFile('a_compressed_file.gz', 'w')

In [33]: file.write('The first line\n')

In [34]: file.write('And the')

In [35]: file.write(' second li')

In [36]: file.write('ne')

In [37]: file.write('\n')

In [38]: file.close()

So this works very similarly to the standard file object.

Anyway, let's check what we wrote to the file:
In [39]: for line in gzip.GzipFile('a_compressed_file.gz'):
   ....:     print line[:-1]
   ....:     
   ....:     
The first line
And the second line
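For comparison, a Python 3 version of the same round trip might look like the sketch below. It assumes gzip.open() with the text modes 'wt' and 'rt', since compressed files are handled as bytes by default there, and it uses a temporary directory for the file.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'a_compressed_file.gz')

# 'wt' and 'rt' request text mode; the default for gzip files is binary.
with gzip.open(path, 'wt') as out:
    out.write('The first line\n')
    out.write('And the second line\n')

with gzip.open(path, 'rt') as compressed:
    recovered = [line.rstrip('\n') for line in compressed]

# The file on disk starts with the two-byte gzip signature.
with open(path, 'rb') as raw:
    magic = raw.read(2)

print(recovered)
print(magic == b'\x1f\x8b')
```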

Useful string operations

Python contains a number of string operations that are particularly useful for reading, writing, and formatting text files. The older module string contains some common string operations, too, but many have been built into Python now. Only the most important ones will be described here; see Learning Python Chapter 5 or the Python Library Reference for a complete list.

Getting data out of a string

The function strip(string) removes leading and trailing white space from a string. The same operation is available as a method of the string itself:
In [91]: s = '    abc     '

In [92]: s.strip()
Out[92]: 'abc'

The function split(string) returns a list of the words in the string, with "words" being anything between spaces.
In [93]: s = 'as as ab ab'

In [94]: s.split()
Out[94]: ['as', 'as', 'ab', 'ab']

By the way, the word separator can be changed to any arbitrary string by using split(string, separator).
In [96]: s.split('a')
Out[96]: ['', 's ', 's ', 'b ', 'b']
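Stripping and splitting are often combined when picking a line of text apart. A minimal sketch, assuming Python 3 and an invented comma-separated record:

```python
record = '  alpha, beta ,gamma  \n'

stripped = record.strip()                 # drop surrounding white space
fields = stripped.split(',')              # split on commas only
clean = [field.strip() for field in fields]

print(fields)
print(clean)

# With no argument, split() treats any run of white space as one separator.
words = 'as   as\tab ab'.split()
print(words)

# With an explicit separator, adjacent separators give empty strings.
gaps = 'a,,b'.split(',')
print(gaps)
```

The difference between the last two cases explains the empty string at the start of the s.split('a') output above.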

To extract numbers from strings, use the functions int(string) to return an integer and float(string) to return a real number.
In [111]: int("42")
Out[111]: 42

In [112]: float("42.22")
Out[112]: 42.219999999999999

To find a specific text in a string s, use s.find(text). It returns the first index at which text occurs in s, or -1 if it doesn't occur at all.
In [113]: s = 'A sentence with words.'

In [114]: s.find('with')
Out[114]: 11
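The sketch below shows the "not found" case of find() and what happens when int() or float() is given text that is not a number. It assumes Python 3; the string '4x2' is just a deliberately bad input.

```python
s = 'A sentence with words.'

start = s.find('with')
missing = s.find('without')           # -1 signals "not found"
print(start, missing)

# A found index can be fed straight into a slice.
found = s[start:start + len('with')]
print(found)

# int() and float() raise ValueError for text that is not a number.
try:
    value = float('4x2')
except ValueError:
    value = None
print(int('42'), float('42.22'), value)
```

Because -1 is also a valid index (the last element), it is worth testing the result of find() before using it to index or slice.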

Let's put some of these to work in simple programs.

Example: First, create a simple data file:
In [64]: cat > data
1 3.4
2 5.2
3 9.7
^D

The following reads a file and prints the sum of all numbers in the second column.
In [65]: f = open('data')

In [66]: sum = 0.0

In [67]: for line in f:
   ....:     sum = sum + float(line.split()[1])
   ....:     
   ....:     

In [68]: print "Sum = ", sum
Sum =  18.3
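A slightly more defensive version of the same idea, as a sketch assuming Python 3: io.StringIO stands in for open('data') so no file is needed, blank lines are skipped, and the accumulator is called total so that it does not hide the built-in function sum().

```python
import io

# io.StringIO stands in for open('data') so the sketch needs no file on disk.
data = io.StringIO('1 3.4\n2 5.2\n\n3 9.7\n')

total = 0.0
count = 0
for line in data:
    columns = line.split()
    if len(columns) < 2:        # skip blank or incomplete lines
        continue
    total = total + float(columns[1])
    count += 1

print('Sum = %.1f over %d lines' % (total, count))
```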

Example: The following program prints the names of all user accounts on your computer (if you're running on Unix!):
In [72]: f = open('/etc/passwd')

In [73]: for line in f:
   ....:     print line.split(':')[0]
   ....:     
   ....:     
nobody
root
daemon
uucp
lp
postfix
www
eppc
mysql
sshd
...

Converting data into a string

Any Python object (numbers, strings, vectors, ...) can be turned into a string by writing it in inverse apostrophes:
In [74]: from numpy import array

In [75]: a = 42

In [76]: b = 1./3.

In [77]: c = array([0, 2, 1])

In [78]: print(`a` + ' ' + `b` + ' ' + `c`)
42 0.33333333333333331 array([0, 2, 1])

Another way to convert anything into a string is the function str(data).

The two methods do not always give the same result, although they do for integers (for floating-point numbers, str(data) may show fewer digits). In general, str(data) produces a "nice" representation of the value, whereas the inverse apostrophes return a representation that indicates not only the value, but also the type of the data. For example, if s is a string, then str(s) is the same as s, whereas `s` returns s enclosed in apostrophes to indicate that the data is a string. In practice, try both and use the one you like best.
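In Python 3 the inverse-apostrophe syntax was removed; the built-in function repr() does the same job. A short sketch of the difference, assuming Python 3:

```python
s = 'text'
print(str(s))       # text
print(repr(s))      # 'text'

# Containers show their elements with repr(), even under str().
listing = str(['a', 1])
print(listing)
```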

The function join(words) takes a list of strings and returns the concatenation with words separated by a space. The last line of the preceding program could therefore simply be
In [83]: import string

In [84]: print string.join([`a`, `b`, `c`])
42 0.33333333333333331 array([0, 2, 1])

The word separator can again be changed to an arbitrary string.

The functions string.lower(string) and string.upper(string) convert a string to lower- or uppercase letters.
In [86]: string.lower("HELLO WORLD")
Out[86]: 'hello world'

In [87]: string.upper("goodbye world")
Out[87]: 'GOODBYE WORLD'

The function string.ljust(string, width) returns a string of at least width characters in which string is left-justified. The functions string.rjust and string.center work similarly, but place the supplied text at the end or in the center.
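The same operations exist as methods of string objects, which is the only form available in Python 3 (assumed here). Note that join() is called on the separator. The words and the name/value pairs are invented for the illustration.

```python
words = ['42', '0.333', 'abc']

spaced = ' '.join(words)         # separator.join(list_of_strings)
listed = ', '.join(words)
print(spaced)
print(listed)

print('HELLO WORLD'.lower(), 'goodbye world'.upper())

# Justification pads with spaces up to the requested width.
rows = []
for name, value in [('x', '1.5'), ('velocity', '12.25')]:
    rows.append(name.ljust(10) + value.rjust(8))
    print(rows[-1])

centered = '[' + 'mid'.center(9) + ']'
print(centered)
```

Padding every field to a fixed width like this is a simple way to line up columns in a text file.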

Useful functions not described here

Python has a very large collection of commands for dealing with more or less specialized forms of text. It is impossible to describe them all here, or even list them. You can find all the information you need in the Python Library Reference.

First, there are many functions in the module string that have not been described here.

Here's a useful feature of iPython. We just imported the string package. What functions does it have? To find a simple list we type a <TAB> character after string. (note the period after string):
In [83]: string.
string.Template            string.center
string._TemplateMetaclass  string.count
string.__class__           string.digits
string.__delattr__         string.expandtabs
string.__dict__            string.find
string.__doc__             string.hexdigits
string.__file__            string.index
string.__getattribute__    string.index_error
string.__hash__            string.join
string.__init__            string.joinfields
string.__name__            string.letters
string.__new__             string.ljust
string.__reduce__          string.lower
string.__reduce_ex__       string.lowercase
string.__repr__            string.lstrip
string.__setattr__         string.maketrans
string.__str__             string.octdigits
string._float              string.printable
string._idmap              string.punctuation
string._idmapL             string.replace
string._int                string.rfind
string._long               string.rindex
string._multimap           string.rjust
string._re                 string.rsplit
string.ascii_letters       string.rstrip
string.ascii_lowercase     string.split
string.ascii_uppercase     string.splitfields
string.atof                string.strip
string.atof_error          string.swapcase
string.atoi                string.translate
string.atoi_error          string.upper
string.atol                string.uppercase
string.atol_error          string.whitespace
string.capitalize          string.zfill
string.capwords            

No shortage of string functions! If we want to find out about how to use one of these, in iPython we type a question mark after the function name:
In [83]: string.join?
Type:           function
Base Class:     
String Form:    
Namespace:      Interactive
File:           /sw/lib/python2.4/string.py
Definition:     string.join(words, sep=' ')
Docstring:
    join(list [,sep]) -> string
    
    Return a string composed of the words in list, with
    intervening occurrences of sep.  The default separator is a
    single space.
    
    (joinfields and join are synonymous)

Advanced text manipulation

In addition, an important set of functions deals with finding and changing data according to patterns called regular expressions. These functions are located in the module re. They are very powerful, but the syntax of regular expressions (also used by Unix tools like grep and editors like vi and emacs) is a bit complicated. So you'll have to look up the documentation.
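To give a flavor of what regular expressions look like, here is a minimal sketch using three functions from the module re: search(), findall(), and sub(). It assumes Python 3, and the sample line of text is invented.

```python
import re

line = 'step 12: energy = -3.75, error = 0.02'

# \d+ matches one or more digits.
step = int(re.search(r'step (\d+)', line).group(1))

# An optional minus sign, digits, a decimal point, and more digits.
numbers = [float(x) for x in re.findall(r'-?\d+\.\d+', line)]

# re.sub replaces every match of the pattern.
squeezed = re.sub(r'\s+', ' ', 'too    many   spaces')

print(step, numbers, squeezed)
```

The r prefix marks a raw string, so backslashes reach the re module unchanged.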

The module htmllib contains functions to extract data from HTML files, which are typically used on the World-Wide Web. The module formatter provides a way to create HTML files.

