
Files

Choosing a Random Line from a File

Problem

You wish to display a random line from a file.

Solution

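If you are dealing with a short file, you can read all the lines into a list and then choose items at random. The loop below prints every line exactly once, in random order; if you only want one line, a single random.choice(lines) call is enough:

import random, sys

lines = sys.stdin.readlines()
# or file.readlines(), where file is a file object
while lines:
  line = random.choice(lines)   # pick one of the remaining lines
  lines.remove(line)            # remove it so it is not picked again
  print line,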
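For a file too long to hold in memory, one well-known alternative (not part of the original recipe) is to make a single pass and replace the current pick with steadily decreasing probability. The sketch below uses Python 3 syntax; the function name random_line is made up for illustration.

```python
import random

def random_line(lines):
    """Return one line chosen uniformly at random from an iterable of lines.

    Only the current pick is kept in memory. Returns None if there are no lines.
    """
    chosen = None
    for count, line in enumerate(lines, start=1):
        # Keep the new line with probability 1/count; after the whole
        # pass every line has had the same chance of being the survivor.
        if random.randrange(count) == 0:
            chosen = line
    return chosen
```

Because an open file object can be iterated line by line, something like random_line(open("somefile")) should work without reading the whole file at once.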

Copying Files

Most of the time this will do:

infile = open("file.in", "rb")
outfile = open("file.out", "wb")
outfile.write(infile.read())

However, for huge files you may want (or need) to do the reads and writes in pieces, and if you dig deeper you may find other technical problems.
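A minimal sketch of copying in pieces, in Python 3 syntax. The function name copy_in_chunks and the 64K default chunk size are arbitrary choices for illustration, not something the original prescribes.

```python
def copy_in_chunks(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, holding at most chunk_size bytes in memory."""
    with open(src_path, "rb") as infile, open(dst_path, "wb") as outfile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:          # an empty read means end of file
                break
            outfile.write(chunk)
```

Like the short version above, this copies only the file's contents, not permissions or other metadata.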

Unfortunately, there’s no totally platform independent answer. On Unix, you can use os.system() to invoke the “cp” command (see your Unix manual for how it’s invoked). On DOS or Windows, use os.system() to invoke the “COPY” command. On the Mac, use macostools.copy(srcpath, dstpath). It will also copy the resource fork and Finder info.

There’s also the shutil module which contains a copyfile() function that implements the copy loop; but in Python 1.4 and earlier it opens files in text mode, and even in Python 1.5 it still isn’t good enough for the Macintosh: it doesn’t copy the resource fork and Finder info.

Note: This should be fixed for modern Python on Mac OS X, but needs to be confirmed (especially due to the way the OS now stores metadata).

Creating/deleting/truncating Files or Directories

To delete (remove, unlink) a file, use os.remove(filename) or os.unlink(filename). They are the same: unlink() is simply the Unix name for this function. For documentation, see the posix section of the library manual.

To remove a directory, use os.rmdir().

To create a directory, use os.mkdir().

To rename a file, use os.rename().

To truncate a file, open it using

f = open(filename, "r+")

and then call f.truncate(offset).

(The offset defaults to the current seek position. The “r+” mode opens the file for reading and writing.)

There’s also os.ftruncate(fd, offset) for files opened with os.open(); it is intended for advanced Unix hacks only.
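The calls above can be strung together as in the following sketch (Python 3 syntax). It works inside a temporary directory, and the names scratch, data.txt and data.bak are placeholders.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing important is touched
base = tempfile.mkdtemp()

subdir = os.path.join(base, "scratch")
os.mkdir(subdir)                           # create a directory

old_name = os.path.join(subdir, "data.txt")
new_name = os.path.join(subdir, "data.bak")
with open(old_name, "w") as f:
    f.write("0123456789")

os.rename(old_name, new_name)              # rename the file

with open(new_name, "r+") as f:            # "r+": read and write, keep contents
    f.truncate(4)                          # keep only the first 4 bytes

size_after_truncate = os.path.getsize(new_name)

os.remove(new_name)                        # delete the file (same as os.unlink)
os.rmdir(subdir)                           # rmdir only removes empty directories
os.rmdir(base)
```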

Emulating ‘getch()’ in UNIX-like systems

The terminal has to be put into cbreak mode, in order to be able to read single characters from the terminal.

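import os, sys, termios, tty

def getch():
  fd = sys.stdin.fileno()
  saved_mode = termios.tcgetattr(fd)   # remember the current terminal settings
  tty.setcbreak(fd)                    # switch the terminal to cbreak mode
  try:
    ch = os.read(fd, 1)                # read exactly one character
  finally:
    # Always restore the original settings, even if the read fails
    termios.tcsetattr(fd, termios.TCSAFLUSH, saved_mode)
  return ch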

Listing Files in a Directory

Problem

You want to make a list of the files in a directory, in order to do something for every file.

Solution

The os.listdir(path) function returns a list of strings containing all the contents of the directory specified by path, in no particular order. Note that the list also includes the names of subdirectories.

>>> import os
>>> os.listdir('/usr')
['X11R6', 'bin', 'dict', 'doc', 'etc', 'games', ... ]

If you want to list all the files matching a wildcard specification such as *.py, you can implement it by checking each filename in the list returned by os.listdir(). The glob() function in the glob module can do this for you:

>>> import glob
>>> glob.glob('/usr/lib/python1.5/B*.py')
['/usr/lib/python1.5/BaseHTTPServer.py', '/usr/lib/python1.5/Bastion.py']

glob.glob() will also list directories that match the wildcard specification.

Once you’ve gotten a list of filenames, you’ll often want to limit it to those names that correspond to files and not directories. The os.path module contains functions called isfile(path), isdir(path), and islink(path), which return true if the path corresponds to a file, a directory, or a symbolic link, respectively.

These functions can be combined with the built-in filter() to easily select those paths that are actually files:

filelist = os.listdir('.')
filelist = filter(os.path.isfile, filelist)
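One detail worth illustrating: os.listdir() returns bare names, so for any directory other than the current one each name has to be joined with the directory before it is tested. A small sketch in Python 3 syntax follows; the helper name files_in is hypothetical.

```python
import os

def files_in(directory):
    """Return the full paths of the regular files directly inside directory."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    # isfile() needs a usable path, not just the bare name from listdir()
    return sorted(p for p in paths if os.path.isfile(p))
```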

Locking Files

Problem

When several programs access and modify a file at the same time, you need to ensure that they can’t make conflicting changes, which can result in the file being corrupted.

Solution

This is accomplished by locking the file.

The posixfile module provides an object that acts like Python’s standard file objects, but adds some extra methods. One of the extra methods is lock(mode, len, start, whence). mode is a string specifying whether you want a read or write lock, or to give up an already-acquired lock.

For example, to gain a write lock for the entire file:

file.lock('w')

Discussion

Locks can be for reading or writing. Multiple read locks can be held by different processes, because several processes can read the same data at the same time without harm. Only a single write lock can be held at a given time, and read locks won’t be granted while a write lock is being held.

When requesting a lock, the default is to lock the entire file. It’s also possible to lock a part of the file, by specifying the start and length of a region in the file. This allows multiple processes to modify different parts of a file at the same time. Consult the posixfile module documentation for more information.
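The posixfile module is not available in Python 3. As a rough sketch of the same idea with a different mechanism, the standard fcntl module (Unix only) provides lockf(). The code below uses Python 3 syntax, and the helper name append_with_lock is made up for illustration.

```python
import fcntl

def append_with_lock(path, text):
    """Append text to path while holding an exclusive (write) lock on the file."""
    with open(path, "a") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)        # blocks until the write lock is granted
        try:
            f.write(text)
            f.flush()                        # push the data out before unlocking
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)    # give up the lock
```

These locks are advisory: they only protect the file from other programs that also ask for a lock before touching it.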

Object Persistence

If you need to automatically save and restore objects and other data structures to files, the pickle library module solves this in a very general way (though you still can’t store things like open files, sockets or windows), and the library module shelve uses pickle and (g)dbm to create persistent mappings containing arbitrary Python objects. For possibly better performance, use the cPickle module.

A more awkward way of doing things is to use pickle’s little sister, marshal. The marshal module provides very fast ways to store noncircular basic Python types to files and strings, and back again. Although marshal does not do fancy things like store instances or handle shared references properly, it does run extremely fast. For example, loading a half megabyte of data may take less than a third of a second (on some machines). This often beats doing something more complex and general such as using gdbm with pickle/shelve.
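A minimal sketch of saving and restoring an object with pickle, in Python 3 syntax. The helper names save_object and load_object are invented for this example.

```python
import pickle

def save_object(obj, path):
    """Write obj to path in pickle format."""
    with open(path, "wb") as f:       # pickle data is binary
        pickle.dump(obj, f)

def load_object(path):
    """Read back an object previously written by save_object()."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Nested structures such as a dictionary of lists and tuples should come back equal to what was saved.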

Processing Lines in a File

You want to read through a file line by line. This is the most common way of processing text files. For example, if you’re writing a Python program that searches for text in a file, you’ll have to loop through the file contents on a line by line basis.

You can code the loop explicitly, like this:

file = open('/tmp/filename', 'r')
while True:
   line = file.readline()
   if line == "": break    # Check for end-of-file
   do_something(line)
file.close()

The fileinput module makes this even simpler by handling the loop for you:

import fileinput
for line in fileinput.input("/tmp/filename"):
   do_something(line)

For sheer simplicity, it’s hard to top reading all the lines from the file into a list, like this:

file = open('/tmp/filename', 'r')
lines = file.readlines()
file.close()
for line in lines:
   do_something(line)

Or like this:

file = open('/tmp/filename', 'r')
for line in file.readlines():
   do_something(line)

This is easy to code, but it does require reading the entire file into memory. These days, most systems will have enough memory to effortlessly handle files a few hundred kilobytes long. A 10 megabyte file will cause problems (swapping to disk, if not an actual crash) for many systems. Use your common sense; if you’re pretty sure you won’t need to handle large files, use file.readlines(); otherwise, use either of the two suggested solutions: the explicit readline() loop or fileinput.
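In later Python versions a file object can itself be used in a for loop, which reads one line at a time without building a list. A sketch in Python 3 syntax, using the text-search example mentioned earlier; the function name count_matching_lines is hypothetical.

```python
def count_matching_lines(path, needle):
    """Count the lines in path that contain needle, reading one line at a time."""
    count = 0
    with open(path, "r") as f:
        for line in f:            # the file object hands out lines lazily
            if needle in line:
                count += 1
    return count
```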

To loop over multiple files using fileinput, pass fileinput.input() a sequence of filenames (a list, tuple, etc.) instead of just a single filename. The equivalent explicit loop over several files looks like this:

for filename in ("file1", "file2", "file3"):
   file  = open(filename, 'r')
   lines = file.readlines()
   file.close()

   for line in lines:
      do_something(line)

If you omit the names passed to fileinput.input(), they default to sys.argv[1:], or to standard input if no arguments were given.
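For comparison, here is a sketch of handing a sequence of filenames to fileinput.input(), in Python 3 syntax. The function name total_lines is invented for illustration.

```python
import fileinput

def total_lines(filenames):
    """Count the lines across several files, read one after another."""
    count = 0
    # The with block closes the fileinput stream when the loop is done
    with fileinput.input(files=filenames) as stream:
        for line in stream:
            count += 1
    return count
```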

Reading a File Backwards

Problem

You want to process a file, but starting with the last lines first.

Partial Solution

If you can read the entire file into memory, you can simply read all the data into a list and then reverse the list:

L = open('myFile').readlines()
L.reverse()

On a character-by-character basis, a similarly memory-bound solution is:

# list() splits the string into characters; splitting on an empty
# separator with string.split() raises an error
L = list(open('thefile', 'r').read())
L.reverse()
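When the file does not fit in memory, one possible approach (an assumption of this sketch, not part of the recipe above) is to read fixed-size blocks starting from the end and split them into lines as you go. The code uses Python 3 syntax, treats "\n" as the only line terminator, and yields lines as bytes; the name reverse_lines is hypothetical.

```python
import os

def reverse_lines(path, block_size=4096):
    """Yield the lines of path from last to first, as bytes without newlines."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        if position == 0:
            return                       # empty file: no lines at all
        tail = b""                       # end of a line whose start is not read yet
        first_block = True
        while position > 0:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            block = f.read(read_size) + tail
            if first_block:
                first_block = False
                if block.endswith(b"\n"):
                    block = block[:-1]   # ignore the newline that ends the file
            pieces = block.split(b"\n")
            tail = pieces[0]             # may be incomplete; finish it next round
            for line in reversed(pieces[1:]):
                yield line
        yield tail                       # what remains is the first line
```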

Reading a File Line by Line the Fastest Way

You can use the sizehint parameter to readlines() to get some of the efficiency of readlines() without reading in the whole file. The following code isn’t optimized, but it shows the idea:

class BufferedFileReader:
  def __init__(self, file):
    self.file = file
    self.lines = []
    self.numlines = 0
    self.index = 0

  def readline(self):
    if self.index >= self.numlines:
      # Buffer used up: fetch roughly 64K worth of complete lines
      self.lines = self.file.readlines(65536)
      self.numlines = len(self.lines)
      self.index = 0

    if self.numlines == 0:
      return ""    # end of file

    line = self.lines[self.index]
    self.index = self.index + 1
    return line
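The same idea can be sketched more compactly as a generator, in Python 3 syntax. The function name buffered_lines is made up, and the 64K default simply mirrors the value used above.

```python
def buffered_lines(f, sizehint=65536):
    """Yield lines from file object f, fetching them in batches of about sizehint."""
    while True:
        batch = f.readlines(sizehint)   # a list of complete lines
        if not batch:                   # an empty list means end of file
            break
        for line in batch:
            yield line
```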

Reading and Writing Binary Data

Use the struct module. It allows you to take a string read from a file containing binary data (usually numbers) and convert it to Python objects; and vice versa.

For example, the following code reads two 4-byte integers in big-endian format from a file:

import struct
f = open(filename, "rb")  # Open in binary mode for portability
s = f.read(8)
x, y = struct.unpack(">ll", s)

The ‘>’ in the format string forces big-endian data; each letter ‘l’ reads one “long integer” (4 bytes) from the string.

You can refer to the Library Reference for more details about the struct module.
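The reverse direction works the same way with struct.pack(). A short round trip in Python 3 syntax, where binary data is represented as bytes objects; the values 1 and -2 are arbitrary.

```python
import struct

# Pack two integers as 4-byte big-endian values, then unpack them again
packed = struct.pack(">ll", 1, -2)
x, y = struct.unpack(">ll", packed)
```

Here packed is 8 bytes long, ready to be written to a file opened in "wb" mode.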

Reading and Writing Compressed Files

Problem

You want to read a file that’s been compressed with the GNU gzip program, or want to write a compressed file that gzip can read.

Solution

The gzip module provides a GzipFile class for reading and writing gzipped files. GzipFile instances imitate the methods of Python’s standard file objects. To read a file:

import gzip

# Open the file in 'r' mode for reading
file = gzip.GzipFile('data.gz', 'r')
line = file.readline()   # Read a single line
data = file.read(1024)   # Read 1K of data
file.close()

Writing compressed data is much the same:

import gzip

# Open the file in 'w' mode for writing
file = gzip.GzipFile('output.gz', 'w')
file.write('First line of output\n')
file.close()

Discussion

It’s legal to use the ‘a’ mode to append data to a compressed file; the gzip file format can handle a file with several chunks of compressed data. When reading such a file with GzipFile, you won’t be able to tell where one chunk leaves off and the next begins because GzipFile seamlessly handles the transition between them.
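A small sketch of appending to a compressed file, in Python 3 syntax, where the data is written as bytes. The helper names append_gzip and read_gzip are invented for illustration.

```python
import gzip

def append_gzip(path, data):
    """Append data (bytes) to the gzip file at path as a new compressed chunk."""
    with gzip.open(path, "ab") as f:
        f.write(data)

def read_gzip(path):
    """Return the whole decompressed contents of the gzip file at path."""
    with gzip.open(path, "rb") as f:
        return f.read()
```

After two calls to append_gzip() on the same file, read_gzip() should return both pieces joined together, with no sign of the boundary.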

The gzip module is built on top of the zlib module, which compresses strings of data. The zlib module can be useful on its own to save disk space or network bandwidth.

For example, you can compress data before sending it over a TCP/IP socket or storing it in a DBM file. Here’s a short example:

>>> import zlib
>>> s = "This is a test of the emergency broadcast chicken."
>>> comp = zlib.compress(s)
>>> comp
'x\234\013\311\310,V\000...'
>>> print zlib.decompress( comp )
This is a test of the emergency broadcast chicken.

The zlib module also provides compressor and decompressor objects that can be used to compress large amounts of data without having to squeeze all of it into memory.
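A sketch of those objects in use, in Python 3 syntax. The generator names compress_stream and decompress_stream are hypothetical; zlib.compressobj() and zlib.decompressobj() are the module's own functions.

```python
import zlib

def compress_stream(chunks):
    """Compress an iterable of bytes chunks, yielding compressed pieces."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:                    # compress() may buffer and return nothing yet
            yield piece
    yield compressor.flush()         # emit whatever is still buffered

def decompress_stream(pieces):
    """Decompress an iterable of compressed pieces, yielding the original data."""
    decompressor = zlib.decompressobj()
    for piece in pieces:
        data = decompressor.decompress(piece)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail
```

Only one chunk at a time needs to be in memory, so the chunks could just as well come from repeated read() calls on a large file.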

Recursively Walking a Directory Tree

Problem

You want to process not merely a single directory, but an entire hierarchy of files and folders, and perhaps find a specific set of files.

Solution

Import the os module, and use os.path.walk(path, function, arg). That recursively walks through the tree rooted at path, and calls function() in each directory that’s visited.

function() must accept 3 arguments: (arg, dirname, name_list). arg is the same as the value passed to os.path.walk; dirname is the current directory name; name_list is a list of the names of the directories and filenames in the directory. You can modify name_list in place to avoid traversing certain subdirectories.

Here’s a simple program to find files:

import os, sys

def find(arg, dirname, names):
  if arg in names:
    print os.path.join(dirname, arg)

os.path.walk(sys.argv[1], find, sys.argv[2])

Sample usage:

$ python ~/t.py /tmp/ README
/tmp/xml-0.4/README
/tmp/xml-0.4/demo/README
/tmp/xml-0.4/dom/README
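os.path.walk() does not exist in Python 3; the os.walk() generator covers the same ground there. A sketch of the same file finder in Python 3 syntax, which returns the matches instead of printing them:

```python
import os

def find(root, name):
    """Return the paths of all entries called name in the tree rooted at root."""
    matches = []
    for dirname, subdirs, filenames in os.walk(root):
        # name may be a file or a subdirectory of the current directory
        if name in subdirs or name in filenames:
            matches.append(os.path.join(dirname, name))
    return matches
```

As with name_list above, removing entries from subdirs in place keeps os.walk() from descending into them.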

Renaming Files

Problem

You want to rename a file on the filesystem, and have already figured out how to deal with your operating system’s naming conventions inside Python.

Solution

Simple: use os.rename:

#Rename file with filename f1 to filename f2
os.rename(f1,f2)

Specifying Filenames

There is a problem that routinely bedevils new Python users who come from a Windows or DOS background. The problem is that on Windows, a backslash is a separator in filenames, whereas on Unix (where Python originated) the backslash has an entirely different function (as an escape character) and the forward slash is the filename separator character.

When a person from a Windows background starts having problems with filenames and goes looking for a solution, the first thing that he (or she) usually finds is Python’s “raw” strings. However, from the perspective of a Windows user, raw strings aren’t truly raw; they are only semi-raw, as you will rudely discover the first time that you use a raw string to specify a filename (or filename part) that ends with a single backslash.
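To make the trailing-backslash problem concrete, here is a short sketch. The path c:\mydir is a placeholder, and the workaround shown is just one of several possibilities.

```python
# r"c:\mydir\" is a syntax error: even in a raw string, the final
# backslash still protects the closing quote, so the string never ends.

# One workaround: keep the raw part and add the last backslash separately.
directory = r"c:\mydir" + "\\"

# The same value written as an ordinary string with doubled backslashes.
same_directory = "c:\\mydir\\"
```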

A better strategy is to routinely code your filenames using forward slashes, and then use a Python function such as os.path.normcase to change the separator to whatever is appropriate to the local operating system. (On Windows, os.path.normcase also converts the name to lowercase; os.path.normpath converts the separators without changing the case.) A nice bonus of this strategy is that it is platform independent:

import os.path
myFilename = "c:/mydir/myfile.txt"
myFilename = os.path.normcase(myFilename)
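The effect can be tried on any platform, because the Windows and POSIX flavours of os.path are importable directly as ntpath and posixpath. A sketch in Python 3 syntax; the mixed-case filename is a placeholder chosen to show the difference between the two functions.

```python
import ntpath       # the Windows flavour of os.path
import posixpath    # the Unix flavour of os.path

name = "C:/MyDir/MyFile.txt"

windows_normpath = ntpath.normpath(name)    # separators converted, case kept
windows_normcase = ntpath.normcase(name)    # separators converted and lowercased
posix_normcase = posixpath.normcase(name)   # left unchanged on Unix
```

In ordinary code you would call os.path.normpath() or os.path.normcase() and let Python pick the flavour that matches the local system.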