Handling Files

What will we cover?
  • How to open a file
  • How to read and write to an open file
  • How to close a file
  • Building an address book
  • Handling binary data files

Handling files often poses problems for beginners although the reason for this puzzles me slightly. Files in a programming sense are really no different from files that you use in a word processor or other application: you open them, do some work and then close them again.

The biggest difference is that in a program you usually access the file sequentially, that is, you read one line at a time starting at the beginning. In practice the word processor often does the same; it just holds the entire file in memory while you work on it and then writes it all back out when you close it. The other difference is that, when programming, you normally open the file as read only or write only. You can write by creating a new file from scratch (or overwriting an existing one) or by appending to an existing one.

One other thing you can do while processing a file is go back to the beginning.

Files - Input and Output

Let's see that in practice. We will assume that a file exists called menu.txt and that it holds a list of meals:

spam & eggs
spam & chips
spam & spam

Now we will write a program to read the file and display the output - like the 'cat' command in Unix or the 'type' command in DOS.

# First open the file to read(r)
inp = file("menu.txt","r")
# read the file into a list then print
# each item
for line in inp.readlines():
    print line
# Now close it again
inp.close()

Note 1: file() takes two arguments. The first is the filename (which may be passed as a variable or a literal string, as we did here). The second is the mode. The mode determines whether we are opening the file for reading (r) or writing (w), and also whether it's for text or binary usage, by adding a 'b' to the 'r' or 'w', as in: file(fn,"rb")

Note 2: We used the file() function to open the file. In the version of Python used here file() and the built-in function open() take identical parameters and do the same job. Be aware that the Python documentation recommends open() for opening files and that file() was removed in Python 3, so you will see open() in most other code.

Note 3: We read and close the file using functions preceded by the file variable. This notation is known as method invocation and is another example of Object Orientation. Don't worry about it for now, except to realize that it's related in some ways to modules. You can think of a file variable as being a reference to a module containing functions that operate on files and which we automatically import every time we create a file type variable.

Note 4: We close the file at the end with the close() method. In Python, files are automatically closed at the end of the program but it is good practice to get into the habit of closing your files explicitly. Why? Well, the operating system may not write the data out to the file until it is closed (this can boost performance). What this means is that if the program exits unexpectedly there is a danger that your precious data may not have been written to the file! So the moral is: once you finish writing to a file, close it.
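A minimal sketch of one way to make sure close() runs even when something goes wrong part-way through writing. The function name write_lines is hypothetical, and the sketch uses open() rather than file() on the assumption that you may want to run it on Python versions later than the one used in this tutorial.

```python
def write_lines(filename, lines):
    """Write each string in lines to filename, one per line."""
    outp = open(filename, "w")
    try:
        for line in lines:
            outp.write(line + "\n")
    finally:
        # this part runs whether or not the loop above raised an error,
        # so the file is always closed and the data written out
        outp.close()
    return outp.closed
```

Calling write_lines("menu.bak", ["spam & eggs", "spam & chips"]) should leave a closed two line file behind, even if a later version of the loop were to fail half way.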

Consider how you could cope with long files. First of all you would need to read the file one line at a time (in Python by using readline() and a while loop instead of readlines() and a for loop). You might then use a line_count variable which is incremented for each line and then tested to see whether it is equal to 25 (for a 25 line screen). If so, you request the user to press a key (enter, say) before resetting line_count to zero and continuing. You might like to try that as an exercise...
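If you want something to compare your own attempt against, here is one possible sketch of that approach. The names page_file, pause and out are my own inventions: pause is any function that waits for the user, passed in as a parameter so that the loop does not depend on how the key press is read, and out is wherever the lines should be displayed.

```python
import sys

def page_file(filename, pause, page_size=25, out=sys.stdout):
    """Display filename one screenful at a time."""
    inp = open(filename, "r")
    line_count = 0
    line = inp.readline()
    while line != "":            # readline() returns "" at the end of the file
        out.write(line)
        line_count = line_count + 1
        if line_count == page_size:
            pause()              # wait for the user before carrying on
            line_count = 0
        line = inp.readline()
    inp.close()
```

In the version of Python used in this tutorial, page_file("menu.txt", lambda: raw_input("Press Enter to continue...")) would stop after every 25 lines.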

Since Python version 2.2 it has also been possible to treat the file as a list so you don't need to use readlines() inside a for loop, you just iterate over the file. Let's rewrite the previous example to see this feature in action:

# First open the file to read(r)
inp = file("menu.txt","r")
# iterate over the file printing each item
for line in inp:
    print line
# Now close it again
inp.close()

Really that's all there is to it. You open the file, read it in and manipulate it any way you want to. When you're finished you close the file. To create a 'copy' command in Python, we simply open a new file in write mode and write the lines to that file instead of printing them. Like this:

# Create the equivalent of: COPY MENU.TXT MENU.BAK

# First open the files to read(r) and write(w)
inp = file("menu.txt","r")
outp = file("menu.bak","w")

# read file, copying each line to new file
for line in inp:
    outp.write(line)

print "1 file copied..."

# Now close the files
inp.close()
outp.close()

Did you notice that I added a print statement just to reassure the user that something actually happened? This kind of user feedback is usually a good idea.

One final twist is that you might want to append data to the end of an existing file. One way to do that would be to open the file for input, read the data into a list, append the data to the list and then write the whole list out to a new version of the old file. If the file is short that's not a problem but if the file is very large, maybe over 100MB, then you may simply run out of memory to hold the list. Fortunately there's another mode, "a", that we can pass to file() which allows us to append directly to an existing file just by writing. Even better, if the file doesn't exist it will open a new file just as if you'd specified "w".

As an example, let's assume we have a log file that we use for capturing error messages. We don't want to delete the existing messages so we choose to append the error, like this:

def logError(msg):
   err = file("Errors.log","a")
   err.write(msg)
   err.close()

In the real world we would probably want to limit the size of the file in some way. A common technique is to create a filename based on the date, thus when the date changes we automatically create a new file and it is easy for the maintainers of the system to find the errors for a particular day and to archive away old error files if they are not needed. (Recall that there is a time module that can be used to find out the current date.)
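A rough sketch of the date-based naming idea, using the time module. The function names dated_log_name and log_error, the prefix parameter and the exact filename pattern are all assumptions made for illustration, and open() is used in place of file() so that the sketch should also run on later Python versions.

```python
import time

def dated_log_name(prefix="Errors", when=None):
    """Build a name such as Errors-2024-01-05.log for the given day."""
    if when is None:
        when = time.localtime()      # default to today's date
    return "%s-%s.log" % (prefix, time.strftime("%Y-%m-%d", when))

def log_error(msg, prefix="Errors"):
    # "a" appends, and creates the file for the first error of each day
    err = open(dated_log_name(prefix), "a")
    err.write(msg + "\n")
    err.close()
```

Because the name changes with the date, the first call to log_error() after midnight starts a new file without any extra code.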

The Address Book Revisited

You remember the address book program we introduced during the Raw Materials topic and then expanded in the Talking to the User topic? Let's start to make it really useful by saving it to a file and, of course, reading the file at startup. We'll do this by writing some functions so in this example we pull together several of the strands that we've covered in the last few topics.

The basic design will require a function to read the file at startup and another to write the file at the end of the program. We will also create a function to present the user with a menu of options and a separate function for each menu selection. The menu will allow the user to add an entry, remove an entry, find an entry, or quit and save the book.

Loading the Address Book

def readBook(book):
    import os
    filename = 'addbook.dat'
    if os.path.exists(filename):
       store = file(filename,'r')
       while True:
          name = store.readline().strip()
          if not name: break   # an empty name means end of file
          entry = store.readline().strip()
          book[name] = entry
    else:
        store = file(filename,'w') # create new empty file
    store.close()

Notice the use of strip() to remove the newline character from the end of the line.

Saving the Address Book

def saveBook(book):
    store = file("addbook.dat",'w')
    for name,entry in book.items():
        store.write(name + '\n')
        store.write(entry + '\n')
    store.close()

Notice we need to add a newline character ('\n') when we write the data.

Getting User Input

def getChoice(menu):
    print menu
    choice = int( raw_input("Select a choice(1-4): ") )
    return choice

Adding an Entry

def addEntry(book):
    name = raw_input("Enter a name: ")
    entry = raw_input("Enter street, town and phone number: ")
    book[name] = entry

Removing an entry

def removeEntry(book):
    name = raw_input("Enter a name: ")
    del(book[name])

Finding an entry

def findEntry(book):
    name = raw_input("Enter a name: ")
    if name in book.keys():
       print name, book[name]
    else: print "Sorry, no entry for: ", name

Quitting the program

Actually I won't write a separate function for this, instead I'll make the quit option the test in my menu while loop. So the main program will look like this:

def main():
    theMenu = '''
    1) Add Entry
    2) Remove Entry
    3) Find Entry
    4) Quit and save
    '''
    theBook = {}
    readBook(theBook)
    choice = getChoice(theMenu)
    while choice != 4:
        if choice == 1:
            addEntry(theBook)
        elif choice == 2:
            removeEntry(theBook)
        elif choice == 3:
            findEntry(theBook)
        else: print "Invalid choice, try again"
        choice = getChoice(theMenu)
    saveBook(theBook)

Now the only thing left to do is call the main() function when the program is run, and to do that we use a bit of Python magic like this:

if __name__ == "__main__":
    main()

This bit of magic allows us to use any Python file as a module by importing it, or as a program by running it. The difference is that when the file is imported, the internal variable __name__ is set to the module name, but when the file is run, the value of __name__ is set to "__main__". Sneaky, eh? Now if you type all that code into a new text file and save it as addressbook.py, you should be able to run it from an OS prompt by typing:

C:\PROJECTS> python addressbook.py

Or just double click the file in Explorer: it should start up in its own DOS window, and the window will close when you select the quit option.

Or in Linux:

$ python addressbook.py

Study the code, see if you can find the mistakes (I've left at least two minor bugs for you to find, and there may be more!) and try to fix them. This 60-odd line program is typical of the sort of thing you can start writing for yourself. There are a couple of things we can do to improve it which I'll cover in the next section, but even as it stands it's a reasonably useful little tool.

VBScript and JavaScript

Neither VBScript nor JavaScript has native file handling capabilities. This is a security feature to ensure no-one can read your files when you innocently load a web page, but it does restrict their general usefulness. However, as we saw with reusable modules there is a way to do it using Windows Script Host. WSH provides a FileSystem object which allows any WSH language to read files. We will look at a JavaScript example in detail then show similar code in VBScript for comparison, but as before the key elements will really be calls to the WScript objects.

Before we can look at the code in detail it's worth taking time to describe the File System Object Model. An Object Model is a set of related objects which can be used by the programmer. The WSH FileSystem object model consists of the FSO object and a number of File objects, including the TextFile object which we will use. There are also some helper objects, the most notable of which, for our purposes, is the TextStream object. Basically we will create an instance of the FSO object, then use it to create our TextFile objects and from these in turn create TextStream objects from which we can read, or to which we can write, text. The TextStream objects are what we actually use to read from and write to the files.

Type the following code into a file called testFiles.js and run it using cscript as described in the earlier introduction to WSH.

Opening a file

To open a file in WSH we create an FSO object then create a TextFile object from that:

var fileName, fso, txtFile, outFile, line;

// Get file name
fso = new ActiveXObject("Scripting.FileSystemObject");
WScript.Echo("What file name? ");
fileName = WScript.StdIn.ReadLine();

// open txtFile to read, outFile to write
txtFile = fso.OpenTextFile(fileName, 1); // mode 1 = Read
fileName = fileName + ".BAK";
outFile = fso.CreateTextFile(fileName);

Reading and Writing a file

// loop over file till it reaches the end
while ( !txtFile.AtEndOfStream ){
    line = txtFile.ReadLine();
    WScript.Echo(line);
    outFile.WriteLine( line );
    }

Closing files

txtFile.close();
outFile.close();

And in VBScript

<?xml version="1.0"?>

<job>
  <script language="VBScript">
      Dim fso, inFile, outFile, inFileName, outFileName
      Set fso = CreateObject("Scripting.FileSystemObject")
      
      WScript.Echo "Type a filename to backup"
      inFileName = WScript.StdIn.ReadLine
      outFileName = inFileName &amp; ".BAK"
      
      ' open the files
      Set inFile = fso.OpenTextFile(inFileName, 1)
      Set outFile = fso.CreateTextFile(outFileName)

      ' read the file and write to the backup copy
      While not inFile.AtEndOfStream
         line = inFile.ReadLine
         outFile.WriteLine(line)
      Wend
      
      ' close both files
      inFile.Close
      outFile.Close
      
      WScript.Echo inFileName &amp; " backed up to " &amp; outFileName
  </script>
</job>

Handling Non-Text Files

Handling text is one of the most common things that programmers do, but sometimes we need to process raw binary data too. This is not done so often in VBScript or JavaScript so I will only cover how Python does it.

Opening and Closing Binary Files

The key difference between text files and binary files is that text files are composed of octets, or bytes, of binary data whereby each byte represents a character. On some operating systems (notably DOS and Windows) the end of the text can be marked by a special byte value, known generically as end of file, or eof, and line endings may be translated as the file is read or written. A binary file contains arbitrary binary data, so no specific value can be reserved to identify end of file and no byte may be altered on the way through. A different mode of operation is therefore required to read these files. The end result of this is that when we open a file in Python (or indeed many other languages) we must specify that it is being opened in binary mode. The way we do this in Python is to add a 'b' to the mode parameter, like this:

binfile = file("aBinaryFile.bin","rb")

The only difference from opening a text file is the mode value of "rb". You can use any of the other modes too, simply add a 'b': "wb" to write, "ab" to append.

Closing a binary file is no different to a text file, simply call the close() method of the open file object:

binfile.close()

Because the file was opened in binary mode there is no need to give Python any extra information; it knows how to close the file correctly.
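A small sketch showing the three binary modes together. The function name binary_round_trip is hypothetical, the b"..." literals assume Python 2.6 or later, and open() is used in place of file() so that the sketch should also run on later Python versions.

```python
def binary_round_trip(filename):
    """Write, append to, then read back a small binary file."""
    f = open(filename, "wb")     # create (or overwrite) in binary mode
    f.write(b"\x00\x01\x02")
    f.close()

    f = open(filename, "ab")     # append one more byte to the end
    f.write(b"\xff")
    f.close()

    f = open(filename, "rb")     # read the raw bytes back unchanged
    data = f.read()
    f.close()
    return data
```

The four bytes come back exactly as written, with no translation of any kind, which is the whole point of binary mode.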

Data Representation and Storage

Before we discuss how to access the data within a binary file we need to consider how data is represented and stored on a computer. All data is stored as a sequence of binary digits, or bits. These bits are grouped into sets of 8 called bytes. (A group of 4 is sometimes called a nibble!) Larger groups, commonly 16, 32 or 64 bits depending on the machine, are called words. A byte can be any one of 256 different bit patterns and these are given the values 0-255.


The information we manipulate in our programs, strings, numbers etc., must all be converted into sequences of bytes. Thus the characters that we use in strings are each allocated a particular byte pattern. There were originally several such encodings, but the most common is ASCII (American Standard Code for Information Interchange). Unfortunately pure ASCII only caters for 128 values, which is not enough for non-English languages. A newer standard known as Unicode has been produced, which can use more than one byte to represent each character and so allows for many thousands of characters. An encoding of Unicode called UTF-8 is backward compatible with ASCII: the 128 ASCII characters are stored as the same single bytes. The version of Python used here treats ordinary strings as sequences of bytes, and by putting a u in front of a string we can tell Python to treat the string as Unicode.
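A short sketch of the difference between the characters in a string and the bytes used to store them. The variable names are made up for illustration, and the example word is just one that happens to contain a character outside the ASCII range.

```python
text = u"caf\u00e9"                  # "cafe" with an accented final e
plain = u"spam"                      # ASCII characters only

as_utf8 = text.encode("utf-8")       # Unicode string -> sequence of bytes

# characters in the ASCII range encode to the same bytes either way
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# the accented character needs two bytes in UTF-8: 4 characters, 5 bytes
lengths = (len(text), len(as_utf8))

# pure ASCII cannot represent the accented character at all
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```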


In the same way numbers need to be converted to binary codings too. For small integers it is simple enough to use the byte values directly, but for numbers larger than 255 (or negative numbers, or fractions) some additional work needs to be done. Over time various standard codings have emerged for numerical data and most programming languages and operating systems use these. For example, the Institute of Electrical and Electronics Engineers (IEEE) has defined a number of codings for floating point numbers.


The point of all of this is that when we read a binary file we have to interpret the raw bit patterns into the correct type of data for our program. It is perfectly possible to interpret a stream of bytes that were originally written as a character string as a set of floating point numbers. Of course the original meaning will have been lost but the bit patterns could represent either. So when we read binary data it is extremely important that we convert it into the correct data type.
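To make that concrete, here is a small sketch that reads the same four bytes three different ways, using the struct module described in the next section. The "<" at the start of two of the format strings is my own addition: it fixes the byte order so that the result does not depend on the machine.

```python
import struct

raw = b"spam"                             # four bytes that were written as text

as_text = struct.unpack("4s", raw)[0]     # read back as a 4 character string
as_int = struct.unpack("<i", raw)[0]      # the same bytes read as one integer
as_float = struct.unpack("<f", raw)[0]    # ...or as one floating point number
```

All three calls succeed. Nothing in the bytes themselves says which reading is the right one, so only our knowledge of how the data was written tells us to use the first.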

The Struct Module

To encode/decode binary data Python provides a module called struct, short for structure. struct works very much like the format strings we have been using to print mixed data. We provide a string representing the data we are reading and apply it to the byte stream that we are trying to interpret. We can also use struct to convert a set of data to a byte stream for writing, either to a binary file or even to a communications line!

There are many different conversion format codes but we will only use the integer and string codes here. (You can look up the others in the Python documentation for the struct module.) The codes for integer and string are i and s respectively. The struct format strings consist of sequences of codes with numbers prepended to indicate how many of the items we need, for example 4s means a string of four characters.
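A minimal sketch of those two codes in use. The values 10 and "spam" are arbitrary, and the string is written as a b"..." literal (Python 2.6 or later) on the assumption that you may want to run this on later Python versions, where the s code expects bytes.

```python
import struct

# one integer followed by a string of four characters
packed = struct.pack("i4s", 10, b"spam")

size = struct.calcsize("i4s")        # how many bytes that format occupies

# the same format string turns the bytes back into the original values
number, word = struct.unpack("i4s", packed)
```

unpack() always returns a tuple, one item per code in the format string, which is why it can be assigned to two names at once here.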

Let's assume we wanted to write the address details, from our Address Book program above, as binary data with the street number as an integer and the rest as a string (This is a bad idea in practice since street "numbers" sometimes include letters!). The format string would look like:

'i34s' # assuming 34 characters in the address!

To cope with multiple address lengths we could write a function to create the binary string like this:

import struct

def formatAddress(address):
    # split breaks a string into a list of 'words'
    fields = address.split()
    number = int(fields[0])    # the street number, as an integer
    rest = ''
    # strings have no append() so we build the result by concatenation
    for field in fields[1:]: rest = rest + ' ' + field
    rest = rest.strip()        # remove the leading space
    format = "i%ds" % len(rest)  # create the format string
    return struct.pack(format, number, rest)

So we used a string method - split() - (more on them in the next topic!) to split the address string into its parts, extract the first one as the number and then use a for loop to join the remaining fields back together. The length of that string is the number we need in the struct format string. Phew!

formatAddress() will return a sequence of bytes containing the binary representation of our address. Now that we have our binary data let's see how we can write that to a binary file and then read it back again.

Reading & Writing Using Struct

Let's create a binary file containing a single address line using the formatAddress() function defined above. We need to open the file for writing in 'wb' mode, encode the data, write it to the file and then close the file. Let's try it:

import struct
f = file('address.bin','wb')
data = "10 Some St, Anytown, 0171 234 8765"
bindata = formatAddress(data)
f.write(bindata)
f.close()

You can check that the data is indeed in binary format by opening address.bin in notepad. The characters will be readable but the number will not look like 10!

To read it back again we need to open the file in 'rb' mode, read the data into a sequence of bytes, close the file and finally unpack the data using a struct format string. The question is how do we tell what the format string looks like? In this case we know it must be like the one we created in formatAddress(), namely 'iNs' where N is a variable number. How do we determine the value of N?

The struct module provides some helper functions that return the size of each data type, so by firing up the Python prompt and experimenting we can find out how many bytes of data we will get back for each data type:

>>> import struct
>>> print struct.calcsize('i')
4
>>> print struct.calcsize('s')
1

OK, we know that our data will comprise 4 bytes for the number and one byte for each character. So N will be the length of the data minus 4. Let's try using that to read our file:

import struct
f = file('address.bin','rb')
data = f.read()
f.close()
fmtString = "i%ds" % (len(data) - 4)
number, rest = struct.unpack(fmtString, data)
# rebuild the address: the number must be converted back to a string first
address = str(number) + ' ' + rest

And that's it on binary data files, or at least as much as I'm going to say on the subject. As you can see, using binary data introduces several complications and unless you have a very good reason I don't recommend it. But at least if you do need to read a binary file, you can do it (provided you know what the data represented in the first place, of course!).

Things to remember
  • Open files before using them
  • Files can usually only be read or written but not both at the same time
  • Python's readlines() method reads all the lines in a file, while readline() only reads one line at a time, which may help save memory.
  • Close files after use.
  • Binary files need the mode flag to end in 'b'


    If you have any questions or feedback on this page send me mail at: alan.gauld@btinternet.com