As we will come to appreciate, a running Python program consists of objects. Even numbers are objects. Last time we worked with objects that were vectors. Python attaches many things to an object. The vector objects had component values, but the function length() was also attached, for example. Learning to use and design objects is a key part of using Python effectively. Here we'll talk about some natural kinds of object in Python. Soon we'll use these to build more complex objects of our own.
Sequence objects are objects that contain elements which are referred to by indices. They include lists, arrays, text strings, and the like. The elements of any sequence object seq can be obtained by indexing: seq[0] or seq[1]. The indices start from 0; that is, seq[0] is the first element and seq[1] is the second.
It is also possible to index from the end: seq[-1] is the last element of a sequence, seq[-2] the next-to-last, and so on.
Example:
In [38]: text = 'abc' In [39]: print text[1] b In [40]: print text[-1] c |
Subsequences can be extracted by slicing: seq[0:5] is a subsequence containing the first five elements, numbered 0, 1, 2, 3, and 4, but not element number 5. Negative indices are allowed: seq[1:-1] strips off the first and last element of a sequence.
Example:
In [41]: text = 'A somewhat longer string.' In [42]: print text[2:10] somewhat In [43]: print text[-7:-1] string |
The length of a sequence can be determined by len(seq).
In [44]: print len(text) 25 |
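The indexing and slicing rules above can be collected into one short script. This is only a sketch, not part of the original session; the variable names first, last, word, and trimmed are made up for illustration.

```python
# Indexing and slicing a sequence (here, a text string).
text = 'A somewhat longer string.'

first = text[0]        # indices start at 0
last = text[-1]        # negative indices count from the end
word = text[2:10]      # elements 2 through 9; element 10 is excluded
trimmed = text[1:-1]   # drop the first and the last character
length = len(text)     # number of elements in the sequence
```

The same expressions work unchanged on lists and tuples, since all of them are sequences.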
Lists are sequences that can contain arbitrary objects, such as numbers, strings, vectors, and even other lists:
In [17]: some_prime_numbers = [2, 3, 5, 7, 11, 13] In [18]: names = ['Smith', 'Jones'] |
Example (the function array must first be imported from numpy):
In [19]: from numpy import array
In [20]: a_mixed_list = [3, [2, 'b'], array([1, 1, 0])]
Here a_mixed_list and [2, 'b'] are lists, while array([1, 1, 0]) is a numpy array.
In [20]: print array([1,1,0]) [1 1 0] |
Let's experiment:
In [21]: print a_mixed_list[1]
[2, 'b']
In [22]: print a_mixed_list[1][1]
b
In [23]: print a_mixed_list[2]
[1 1 0]
In [24]: print a_mixed_list[2][1]
1
So we can chain indexing whenever the object retrieved is itself a sequence, whether a list or an array.
Elements and subsequences of a list can be changed by assigning a new value:
In [25]: print names ['Smith', 'Jones'] In [26]: names[1] = 'Python' In [27]: print names ['Smith', 'Python'] |
In [28]: print some_prime_numbers [2, 3, 5, 7, 11, 13] In [29]: some_prime_numbers[3:] = [17, 19, 23, 29] In [30]: print some_prime_numbers [2, 3, 5, 17, 19, 23, 29] |
A new element can be appended at the end with list.append(new_element).
In [40]: names.append('Fred') In [42]: print names ['Smith', 'Python', 'Fred'] |
A list can be sorted with list.sort().
In [43]: names.sort() In [44]: print names ['Fred', 'Python', 'Smith'] |
A list can be reversed with list.reverse().
In [45]: names.reverse() In [46]: print names ['Smith', 'Python', 'Fred'] |
Two lists can be concatenated, just like we did previously with text strings:
In [47]: [0, 1] + ['a', 'b'] Out[47]: [0, 1, 'a', 'b'] |
Lists can also be repeated, like text strings:
In [40]: 3*['Hallelujah'] Out[40]: ['Hallelujah', 'Hallelujah', 'Hallelujah'] |
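The list operations above are gathered here in script form, as a sketch. One point the interactive session does not show directly is noted in the comments: sort() and reverse() change the list in place and return None rather than a new list. The name sorted_names is made up for illustration.

```python
names = ['Smith', 'Jones']
names[1] = 'Python'            # replace a single element
names.append('Fred')           # add one element at the end
names.sort()                   # sorts in place; the call itself returns None
sorted_names = list(names)     # keep a copy of the sorted order
names.reverse()                # reverses in place

primes = [2, 3, 5, 7, 11, 13]
primes[3:] = [17, 19, 23, 29]  # slice assignment may change the length

combined = [0, 1] + ['a', 'b'] # concatenation builds a new list
repeated = 3 * ['Hallelujah']  # repetition builds a new list
```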
Tuples are much like lists, except that they cannot be changed. Once created, a tuple always has the same elements. They can be used, therefore, in situations where a modifiable sequence does not make sense; for example, as keys in a database.
Example:
In [67]: tuple_1 = (1, 2) In [68]: tuple_2 = ('a', 'b') |
Tuples behave much as list objects do:
In [8]: tuple_1[1]
Out[8]: 2
Let's try:
In [52]: combined_tuple = tuple_1 + 2*tuple_2 In [53]: print combined_tuple (1, 2, 'a', 'b', 'a', 'b') |
However, try
In [9]: tuple_1[1] = 3 --------------------------------------------------------------------------- TypeError Traceback (most recent call last) /Users/.../Software/ |
So tuples differ from lists: you cannot change the values of their components once created.
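A program can catch the TypeError instead of stopping on it. The sketch below also illustrates the "key" use mentioned earlier with a Python dictionary, which is an assumption on our part since dictionaries have not been introduced in these notes; the names grid, label, and changed are hypothetical.

```python
tuple_1 = (1, 2)

try:
    tuple_1[1] = 3            # tuples do not support item assignment
    changed = True
except TypeError as err:
    changed = False
    message = str(err)        # the text of the error

# Because a tuple can never change, it can serve as a dictionary key.
# (A list in the same position would be rejected.)
grid = {(0, 0): 'origin', (1, 2): 'a point'}
label = grid[tuple_1]
```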
It is often necessary to repeat some operation for each element of a sequence. This is called a loop over the sequence.
In [54]: for prime_number in [2, 3, 5, 7]:
   ....:     square = prime_number**2
   ....:     print square
   ....:
   ....:
4
9
25
49
Okay, a lot just happened here; let's unpack it.
First, we have the structure of the for loop: we define a variable prime_number that successively takes on values in the list [2, 3, 5, 7].
The for loop line ends with a colon (':').
Second, the operations that are part of the loop are indicated by indentation. If you are using iPython, you will see that the indentation is automatically provided for you.
Generally, say when you don't use iPython, you must provide the indentation yourself. This is a syntactic requirement of Python.
Third, to complete the loop and execute the code, two blank lines need to be entered if the text is indented.
(In iPython, I use 'ctrl-U' to remove the indentation, if I want to type at the line's beginning. See below.)
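Outside of iPython, for instance in a script file, the same loop looks as follows. This is a sketch that collects the squares in a list (the name squares is ours) so that the role of indentation is visible: the first line back at the left margin is no longer part of the loop.

```python
squares = []
for prime_number in [2, 3, 5, 7]:   # the header line ends with a colon
    square = prime_number**2        # indented lines form the loop body
    squares.append(square)          # still inside the loop
count = len(squares)                # not indented: runs once, after the loop
```

No blank lines are needed to end the loop in a file; the change in indentation is enough.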
Loops work with any sequence. Here is an example with text strings:
In [57]: for vowel in 'aeiou':
   ....:     print 10*vowel
   ....:
   ....:
aaaaaaaaaa
eeeeeeeeee
iiiiiiiiii
oooooooooo
uuuuuuuuuu
Loops over a range of numbers are just a special case:
In [58]: from math import sqrt
In [59]: for i in range(10):
   ....:     print i, sqrt(i)
   ....:
   ....:
0 0.0
1 1.0
2 1.41421356237
3 1.73205080757
4 2.0
5 2.2360679775
6 2.44948974278
7 2.64575131106
8 2.82842712475
9 3.0
The function range(n) returns a list containing the first n integers; that is, from 0 to n-1:
In [3]: print range(10) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
The form range(i, j) returns all integers from i to j-1:
In [4]: print range(4,10) [4, 5, 6, 7, 8, 9] |
and, finally, the form range(i, j, k) returns i, i+k, i+2*k, and so on, stopping before j is reached:
In [5]: print range(4,10,2) [4, 6, 8] |
range() returns a list, as we noted. Often, though, one wants to repeat an operation, perhaps controlled by a for loop, a very large number of times. Using range(10000000) would create a list of ten million integers. This wastes memory and also executes slowly. In these circumstances you can use the xrange() function, which returns not a list but an xrange object that produces the numbers on demand. Each time the for loop asks that object for a value, it returns a single number: the next in the sequence.
In [6]: print xrange(10)
xrange(10)
In [7]: for i in xrange(10):
   ...:     print i
   ...:
   ...:
0
1
2
3
4
5
6
7
8
9
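To see how an object can hand out numbers one at a time without ever storing them all, here is a minimal sketch using a generator function. The name count_up is hypothetical and this is not how xrange() is implemented internally; it only mimics the behavior. Note also that in Python 3 the built-in range() is itself lazy in this way and xrange() no longer exists, so the last line wraps range() in list() to show the elements.

```python
def count_up(n):
    """Yield 0, 1, ..., n-1 one at a time, without building a list."""
    i = 0
    while i < n:
        yield i          # hand back one number, then pause until asked again
        i += 1

total = 0
for i in count_up(1000000):   # a million passes, but never a million-element list
    total += i

first_three = list(count_up(3))
stepped = list(range(4, 10, 2))
```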
The most frequent conditions that are tested are equality and order:
Condition | Expression |
---|---|
equal | a == b |
not equal | a != b |
greater than | a > b |
less than | a < b |
greater than or equal | a >= b |
less than or equal | a <= b |
Several conditions can be combined with and and or, and negations are formed with not. The result of a condition is the Boolean value True or False; these behave like 1 and 0 in arithmetic.
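A short sketch of combined conditions follows; the values of a and b and the variable names are chosen only for illustration.

```python
a, b = 3, 7

in_between = (a < 5) and (b > 5)   # true only if both parts hold
either = (a > 5) or (b > 5)        # true if at least one part holds
not_equal = not (a == b)           # negation of a condition

# The result of a comparison compares equal to 1 (true) or 0 (false),
# so it can take part in arithmetic.
as_number = (a < b) + 0
```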
Typically, conditions are used for decisions:
In [61]: a = 'def'
In [62]: b = 'def'
In [63]: if a == b:
   ....:     print "equal"
   ....: elif a > b:
   ....:     print "greater"
   ....: else:
   ....:     print "less"
   ....:
   ....:
equal
There can be any number of elif branches (including none), and the else branch is optional. These must start at the level of indentation that matches the corresponding if.
Note that each line of the control structure ('if', 'elif', and 'else') ends with a colon (':'), and the commands within each portion of the control structure are indented.
(Again, in iPython I use 'ctrl-U' to remove the automatic indentation and start typing at the line's beginning.)
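The same decision can be wrapped in a function so that it is easy to try with different values. This is a sketch; the function name compare is ours.

```python
def compare(a, b):
    """Return 'equal', 'greater', or 'less' for a relative to b."""
    if a == b:
        return "equal"
    elif a > b:            # only tested when the first condition failed
        return "greater"
    else:                  # reached when no condition above was true
        return "less"

same = compare('def', 'def')
later = compare('xyz', 'def')   # strings are ordered alphabetically
smaller = compare(1, 2)
```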
Conditions can also be used to control a loop:
In [65]: x = 1.
In [66]: while x > 1.e-2:
   ....:     print x
   ....:     x = x/2.
   ....:
   ....:
1.0
0.5
0.25
0.125
0.0625
0.03125
0.015625
This illustrates a companion control structure—a while loop.
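The sketch below repeats the halving loop but stores the values in a list (the name values is ours) instead of printing them. It makes one detail explicit: the condition is tested before each pass, so when the loop ends x already holds the first value that failed the test.

```python
x = 1.0
values = []
while x > 1.0e-2:        # tested before every pass through the body
    values.append(x)
    x = x / 2.0          # without this line the loop would never end

passes = len(values)
```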
File objects are pre-defined in a Python module that is automatically loaded when Python starts.
First thing, we need a file with something in it:
In [11]: cat > t
df
sdf
dsf
d
^D
This makes a file named 't' with those miscellaneous lines.
Reminder: the interactive session above is all done within iPython, and iPython allows one to use some of the basic Unix shell commands and syntax, such as cat, ls, file redirection (< and >), and so on. I ended the input to the cat Unix command with 'ctrl-D'.
It then produced the file named t:
In [12]: cat t
df
sdf
dsf
d
For a refresher on Unix shell commands, see an online tutorial, such as one of those listed under Computing Tutorials.
If your system doesn't have cat, then somehow make a file 't', with something in it, like the above.
Reading
Text files can be treated as sequences of lines, with the limitation that the lines must be read in sequence. The following program opens the file we just made and prints all of its lines:
In [13]: f = open('t')
In [14]: for line in f:
   ....:     print line[:-1]
   ....:
   ....:
df
sdf
dsf
d
Why line[:-1]? The last character of each line is the new-line character, which we don't want to print. (It would create blank lines between the lines of the file; try using only print line.)
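One caveat: line[:-1] removes the last character whatever it is, so it would cut off a real character if the final line of a file had no new-line. A slightly safer variant, sketched here, uses the string method rstrip('\n'). The scratch file location and the with statement (which closes the file automatically) are our additions, not part of the session above.

```python
import os
import tempfile

# Create a small scratch file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 't')
with open(path, 'w') as f:
    f.write('df\nsdf\ndsf\nd')           # note: no new-line after the last line

lines = []
with open(path) as f:                    # the file is closed when the block ends
    for line in f:
        lines.append(line.rstrip('\n'))  # removes a trailing new-line only if present
```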
Question: What happens if the file you try to open doesn't exist?
File objects for compressed files are defined in the module gzip.
The GzipFile objects defined there read and write files compressed by Unix gzip, which conventionally have names ending in .gz. (The module does not handle .Z files created by Unix compress.)
When programs access the GzipFile object they retrieve the uncompressed version of the data.
So, using the iPython shell command escape '!', I run the gzip compression program on the file:
In [15]: !gzip t |
Giving the compressed version 't.gz':
In [16]: ls t
ls: t: No such file or directory
In [17]: ls t.gz
t.gz
gzip replaced our original file t with the compressed version t.gz.
What's there?
In [18]: cat t.gz ?HtKI?*NI?J)b.??r? |
Raw data representing the original information, but in a smaller file.
So how does a program access the original information? A GzipFile object decompresses the data as it is read.
In [19]: import gzip
In [20]: for line in gzip.GzipFile('t.gz'):
   ....:     print line[:-1]
   ....:
   ....:
df
sdf
dsf
d
You can even use a URL (Uniform Resource Locator) instead of a filename and so read data directly via the Internet. Let's try this on the class home page:
In [21]: import urllib In [22]: for line in urllib.urlopen('http://csc.ucdavis.edu/~chaos/courses/nlp/'): ....: print line[:-1] ....: ....: |
and lots of other output from that page.
Writing
Files can be opened for writing instead of reading. This is specified by an additional argument to open(). See the 'w' in the command below:
In [23]: file = open('tt','w')
In [24]: file.write('The first line\n')
In [25]: file.write('And the')
In [26]: file.write(' second li')
In [27]: file.write('ne')
In [28]: file.write('\n')
In [29]: file.close()
In [30]: cat tt
The first line
And the second line
Each write() command appends the string argument to the output file. The '\n' represents the new-line character. Notice the role that it played in giving two lines.
Files opened for writing should be closed at the end to make sure that all data is actually written to the file. At the end of a program, all open files will be closed automatically, but it is better not to rely on this.
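One way to make sure a file is closed even if something goes wrong while writing is a try/finally block, sketched below. The scratch directory and the names out and contents are assumptions made for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'tt')   # scratch location for this sketch

out = open(path, 'w')
try:
    out.write('The first line\n')
    out.write('And the')
    out.write(' second li')
    out.write('ne')
    out.write('\n')
finally:
    out.close()          # runs whether or not a write failed

# Read the file back to check what was written.
f = open(path)
contents = f.read()
f.close()
```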
Using the module gzip, automatic compression is also available for writing. (But not URLs, since servers do not permit write access; well, some might, but they shouldn't!)
In [31]: import gzip In [32]: file = gzip.GzipFile('a_compressed_file.gz', 'w') In [33]: file.write('The first line\n') In [34]: file.write('And the') In [35]: file.write(' second li') In [36]: file.write('ne') In [37]: file.write('\n') In [38]: file.close() |
So this works very similarly to the standard file object.
Anyway, let's check what we wrote to the file:
In [39]: for line in gzip.GzipFile('a_compressed_file.gz'):
   ....:     print line[:-1]
   ....:
   ....:
The first line
And the second line
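For readers on Python 3, here is a sketch of the same write-then-read round trip. It uses gzip.open() with the text modes 'wt' and 'rt', since in Python 3 a plain GzipFile works with bytes rather than text strings. The scratch directory is an assumption of this example.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'a_compressed_file.gz')

# 'wt' = write text; the data is compressed as it is written.
with gzip.open(path, 'wt') as out:
    out.write('The first line\n')
    out.write('And the second line\n')

# 'rt' = read text; the data is decompressed as it is read.
lines = []
with gzip.open(path, 'rt') as f:
    for line in f:
        lines.append(line.rstrip('\n'))

# Reading the file without gzip shows the raw compressed bytes;
# every gzip file starts with the two bytes 1f 8b.
with open(path, 'rb') as raw:
    magic = raw.read(2)
```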
Python contains a number of string operations that are particularly useful for reading, writing, and formatting text files. The older module string contains some common string operations, too, but many have been built into Python now. Only the most important ones will be described here; see Learning Python Chapter 5 or the Python Library Reference for a complete list.
Getting data out of a string
The function strip(string) removes leading and trailing white space from a string. It is also available as a method of the string itself:
In [91]: s = ' abc ' In [92]: s.strip() Out[92]: 'abc' |
The function split(string), or equivalently the method s.split(), returns a list of the words in the string, with "words" being anything separated by white space.
In [93]: s = 'as as ab ab' In [94]: s.split() Out[94]: ['as', 'as', 'ab', 'ab'] |
By the way, the word separator can be changed to an arbitrary string by using split(string, separator), or s.split(separator) in method form.
In [96]: s.split('a') Out[96]: ['', 's ', 's ', 'b ', 'b'] |
To extract numbers from strings, use the functions int(string) to return an integer and float(string) to return a real number.
In [111]: int("42") Out[111]: 42 In [112]: float("42.22") Out[112]: 42.219999999999999 |
To find a specific text in a string, use s.find(text). It returns the first index at which text occurs in s, or -1 if it doesn't occur at all.
In [113]: s = 'A sentence with words.' In [114]: s.find('with') Out[114]: 11 |
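These operations are usually chained when picking a line of a data file apart. A small sketch follows; the sample line and the variable names are invented for illustration.

```python
line = '  3   9.7  \n'

fields = line.strip().split()       # trim the ends, then split on white space
index = int(fields[0])              # text -> integer
value = float(fields[1])            # text -> floating-point number

sentence = 'A sentence with words.'
where = sentence.find('with')       # index of the first match
missing = sentence.find('xyz')      # -1 signals "not found"

parts = 'as as ab ab'.split('a')    # split on a chosen separator instead
```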
Let's put some of these to work in simple programs.
Example: First, create simple data file:
In [64]: cat > data
1 3.4
2 5.2
3 9.7
^D
The following reads a file and prints the sum of all numbers in the second column.
In [65]: f = open('data')
In [66]: sum = 0.0
In [67]: for line in f:
   ....:     sum = sum + float(line.split()[1])
   ....:
   ....:
In [68]: print "Sum = ", sum
Sum = 18.3
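The column sum can also be written as a reusable function. This is a sketch with two small changes of our own: the running total is called total rather than sum, so that Python's built-in sum() function is not hidden, and blank lines are skipped. The function name column_sum is hypothetical, and a list of strings stands in for the open file.

```python
def column_sum(lines, column):
    """Add up the numbers found in one white-space-separated column."""
    total = 0.0
    for line in lines:
        fields = line.split()
        if fields:                          # skip blank lines
            total = total + float(fields[column])
    return total

# Any sequence of lines will do: an open file, or a list as here.
data = ['1 3.4\n', '2 5.2\n', '\n', '3 9.7\n']
second = column_sum(data, 1)
first = column_sum(data, 0)
```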
Example: The following program prints the name of all user accounts on your computer (if you're running on Unix!):
In [72]: f = open('/etc/passwd')
In [73]: for line in f:
   ....:     print line.split(':')[0]
   ....:
   ....:
nobody
root
daemon
uucp
lp
postfix
www
eppc
mysql
sshd
...
Converting data into a string
Any Python object (numbers, strings, vectors, ...) can be turned into a string by writing it in inverse apostrophes:
In [74]: from numpy import array In [75]: a = 42 In [76]: b = 1./3. In [77]: c = array([0, 2, 1]) In [78]: print(`a` + ' ' + `b` + ' ' + `c`) 42 0.33333333333333331 array([0, 2, 1]) |
Another way to convert anything into a string is the function str(data).
The two methods do not always give the same result, although they do for integers. (For the floating-point number b above, str(b) gives the shorter 0.333333333333.) In general, str(data) produces a "nice" representation of the value, whereas the inverse apostrophes, which are shorthand for the function repr(data), return a representation that indicates not only the value, but also the type of the data. For example, if s is a string, then str(s) is the same as s, whereas `s` returns s enclosed in apostrophes to indicate that the data is a string. In practice, try both and use the one you like best.
The function string.join(words) takes a list of strings and returns their concatenation, with the words separated by a space. The print statement of the preceding example could therefore simply be
In [83]: import string
In [84]: print string.join([`a`, `b`, `c`])
42 0.33333333333333331 array([0, 2, 1])
The word separator can again be changed to an arbitrary string.
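The sketch below revisits the conversion and joining steps using forms that also work in Python 3, where the inverse apostrophes were removed from the language: repr(data) takes their place, and joining is done with a method called on the separator string. The variable names are ours.

```python
s = 'abc'
nice = str(s)                  # the string itself
exact = repr(s)                # the string with apostrophes around it

b = 1.0 / 3.0
roundtrip = float(repr(b))     # repr() of a float can be read back exactly

words = [repr(42), str(2.5), s]
spaced = ' '.join(words)       # separator is the string the method is called on
with_commas = ', '.join(words)
```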
The functions string.lower(string) and string.upper(string) convert a string to lower- or uppercase letters.
In [86]: string.lower("HELLO WORLD") Out[86]: 'hello world' In [87]: string.upper("goodbye world") Out[87]: 'GOODBYE WORLD' |
The function string.ljust(string, width) returns a string of at least width characters in which string is left-justified. The functions string.rjust and string.center work similarly, but place the supplied text at the end or in the center.
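Like strip and split, these functions are also available as methods of the string itself. A sketch showing case conversion and what "at least width characters" means in practice:

```python
lower = 'HELLO WORLD'.lower()
upper = 'goodbye world'.upper()

left = 'abc'.ljust(6)       # padded on the right:  'abc   '
right = 'abc'.rjust(6)      # padded on the left:   '   abc'
middle = 'abc'.center(7)    # padded on both sides: '  abc  '
short = 'abcdef'.ljust(3)   # already longer than 3: returned unchanged
```

Justification is handy for printing columns that line up, for example a name with ljust followed by a number with rjust.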
Useful functions not described here
Python has a very large collection of commands for dealing with more or less specialized forms of text. It is impossible to describe them all here, or even list them. You can find all the information you need in the Python Library Reference.
First, there are many functions in the module string that have not been described here.
Here's a useful feature of iPython. We just imported the string package. What functions does it have? To find a simple list we type a <TAB> character after string. (note the period after string):
In [83]: string. string.Template string.center string._TemplateMetaclass string.count string.__class__ string.digits string.__delattr__ string.expandtabs string.__dict__ string.find string.__doc__ string.hexdigits string.__file__ string.index string.__getattribute__ string.index_error string.__hash__ string.join string.__init__ string.joinfields string.__name__ string.letters string.__new__ string.ljust string.__reduce__ string.lower string.__reduce_ex__ string.lowercase string.__repr__ string.lstrip string.__setattr__ string.maketrans string.__str__ string.octdigits string._float string.printable string._idmap string.punctuation string._idmapL string.replace string._int string.rfind string._long string.rindex string._multimap string.rjust string._re string.rsplit string.ascii_letters string.rstrip string.ascii_lowercase string.split string.ascii_uppercase string.splitfields string.atof string.strip string.atof_error string.swapcase string.atoi string.translate string.atoi_error string.upper string.atol string.uppercase string.atol_error string.whitespace string.capitalize string.zfill string.capwords |
No shortage of string functions! If we want to find out about how to use one of these, in iPython we type a question mark after the function name:
In [83]: string.join? Type: function Base Class: |
Advanced text manipulation
In addition, an important set of functions deals with finding and changing data according to patterns called regular expressions. These functions are located in the module re. They are very powerful, but the syntax of regular expressions (also used by Unix tools like grep and editors like vi and emacs) is a bit complicated. So you'll have to look up the documentation.
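To give a first taste of the module re, here is a minimal sketch that finds and replaces decimal numbers in a line of text. The sample line and the pattern are our own choices; the pattern is deliberately simple and would miss forms such as 7 or 1e-3.

```python
import re

line = 'x = 3.4, y = 5.2, z = 9.7'

# \d+ matches one or more digits; \. matches a literal period.
pattern = r'\d+\.\d+'

numbers = re.findall(pattern, line)      # all matches, as strings
values = [float(n) for n in numbers]     # convert each match to a number
masked = re.sub(pattern, '#', line)      # replace every match
```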
The module htmllib contains functions to extract data from HTML files, which are typically used on the World-Wide Web. The module formatter provides a way to create HTML files.