
Strings

Introduction to Strings

Relatively few tasks can be performed by only dealing with numbers; programs will usually print out reports, modify the contents of text files, parse [[HTML]] and [[XML]] documents, and perform other operations on strings of characters. Strings are therefore an important data type in modern programming languages, and Python is no exception.

In the source code for a Python program, strings can be written in several ways. They can be surrounded by either single or double quotes:

if version[:5] != 'HTTP/':
   send_error(400, "Bad request version(%s)" % `version`)

Strings are one of Python’s sequence types. This means that strings can be sliced, and the for statement can be used to iterate over the individual characters in a string:

>>> s = "Hello, world"
>>> s[0:5]
'Hello'
>>> s[-5:]
'world'
>>> for char in s:
...   print char,
...
H e l l o ,   w o r l d

Strings are immutable; once a string has been created, whether as a literal in a program’s source code or in the course of a program’s operation, you can’t modify the string in place. Trying to change the first character of a string by item assignment, as in the code s[0] = 'Y', fails with a TypeError; to create a modified version of the string, you have to assemble a whole new string with code like s = 'Y' + s[1:].

This means that Python has to make temporary copies of the string, which will be slow and memory-consuming if you’re modifying very large strings.
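
The behaviour described above can be checked with a minimal sketch. It is written for Python 3 (an assumption; the interactive examples in this section use Python 2 syntax). Item assignment raises TypeError, while concatenation builds a new string and leaves the original untouched.

```python
s = "Hello, world"

# Strings do not support item assignment; the attempt raises TypeError.
try:
    s[0] = "Y"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Building a modified copy works, and the original string is unchanged.
t = "Y" + s[1:]

print(assignment_failed)  # True
print(t)                  # Yello, world
print(s)                  # Hello, world
```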

See the [[Mutable Strings]] section for a way around this by using the <<PythonModule array>> module, and the <<PythonModule UserString>> module for a MutableString type which may be useful if you absolutely must modify strings in-place and still use some of the more common <<PythonModule string>> methods.

Converting a String to a Tuple

Problem

You want to convert a tuple in a string into the corresponding tuple object; such a str2tuple() function would be used like this:

>>> myString = "(12, '  abc  '  ,   '  b,  c', 'd\\'e')"
>>> print str2tuple(myString)
(12, '  abc  ', '  b,  c', "d'e")

Solution

The following str2tuple() function will do the job.

def str2tuple(s):
   return eval(s, {'__builtins__': {}})

Discussion

The above function uses the eval() built-in function, along with a special trick. eval() takes a string and evaluates the contents as a Python expression, returning the value of the expression. One or two additional parameters can be supplied that must be dictionaries that will be used as the table of global and local variables, respectively.

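The str2tuple() function uses eval() and provides a dictionary containing a single key-value pair, mapping the key “__builtins__” to an empty dictionary. “__builtins__” is a special name internal to Python; eval() looks it up in the globals dictionary to find the built-in functions, so an empty dictionary leaves the expression with none of them.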

The net effect is to evaluate the string in a highly restricted environment. The evaluated expression is denied access to Python’s standard built-in functions such as open(), and thus cannot reach anything that isn’t hardcoded in the language. For example:

>>> str2tuple(' (1,"test",10.2)')
(1, 'test', 10.2)
>>> str2tuple('11+13')
24
>>> str2tuple('open("/etc/passwd")')
Traceback (innermost last):
 File "<stdin>", line 1, in ?
 File "<stdin>", line 2, in str2tuple
 File "<string>", line 0, in ?
NameError: open
>>>

Notice that str2tuple() can accept expressions such as “11+13” thanks to its use of eval(), returning an integer value. If you really want to make sure that the result is a tuple, you must use the type() built-in function to determine the type of the result and compare it with the type object for tuples.

v = str2tuple( text )
if type(v) is not type( () ):
   raise ValueError, "Text %s does not contain a tuple" % (text,)

A clearer way to do this is to import the <<PythonModule types>> module, which contains predefined variables for the most common Python types, and compare the type of the result to types.TupleType.

import types

if type(v) is not types.TupleType:
   raise ValueError, "Text %s does not contain a tuple" % (text,)
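
Emptying “__builtins__” narrows what eval() can reach, but it is generally not considered a complete sandbox. If only literal values need to be parsed, the standard library’s ast.literal_eval() (available since Python 2.6) is a commonly suggested alternative. The sketch below assumes Python 3; the name str2tuple_literal is hypothetical, chosen for illustration.

```python
import ast


def str2tuple_literal(s):
    """Parse a string holding a tuple literal; reject everything else."""
    # literal_eval() only accepts Python literals (numbers, strings, tuples,
    # lists, dicts, ...); names and function calls raise ValueError.
    value = ast.literal_eval(s.strip())
    if not isinstance(value, tuple):
        raise ValueError("Text %r does not contain a tuple" % (s,))
    return value


print(str2tuple_literal("(12, '  abc  '  ,   '  b,  c', 'd\\'e')"))
# (12, '  abc  ', '  b,  c', "d'e")
```

With this variant, inputs such as 'open("/etc/passwd")' or '11+13' raise ValueError instead of being evaluated.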

Converting from Strings to Numbers

Problem

You have a string such as ‘533’ and wish to convert it to the number 533.

Solution

The built-in functions int(), long(), and float() can perform this conversion from string to a numeric type:

>>> int( '533' )
533
>>> long( '37778931862957161709568' )
37778931862957161709568L
>>> float( '45e-3'), float('12.75')
(0.045, 12.75)

Note that these functions are restricted to decimal interpretation, so that int(‘0144’) will return a result of 144, and int(‘0x144’) will raise a ValueError exception.

To support different bases, or for use with versions of Python earlier than 1.5, the <<PythonModule string>> module contains the functions atoi(), atol(), and atof(), which convert from ASCII to integer, long integer, or floating point, respectively.

>>> import string
>>> string.atoi('45')
45
>>> string.atol( '37778931862957161709568' )
37778931862957161709568L
>>> string.atof('187.54')
187.54

atoi() and atol() have an optional second argument which can be used to specify the base for the conversion. If the specified base is 0, the functions will follow Python’s rules for integer constants: a leading “0” will cause the string to be interpreted as an octal number, and a leading “0x” will cause base-16 to be used.

>>> string.atoi('255')
255
>>> string.atoi('255', 16) # 255 hex = 597 decimal
597
>>> string.atoi('0255', 0) # Assumed to be octal
173
>>> string.atol('0x40000000000000000000', 0)
302231454903657293676544L
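
In later Python versions the built-in int() accepts the same optional base argument, so the string module functions are not needed for this. A short sketch, assuming Python 3, where octal literals are written with a 0o prefix and a bare leading zero is rejected when the base is 0:

```python
# int() takes an optional base, like string.atoi() above.
print(int("255"))        # 255
print(int("255", 16))    # 597
print(int("0o255", 0))   # 173  (Python 3 octal prefix)
print(int("0x40000000000000000000", 0))  # 302231454903657293676544

# With base 0, Python 3 follows its own literal rules, so "0255" is an error.
try:
    int("0255", 0)
    leading_zero_ok = True
except ValueError:
    leading_zero_ok = False
print(leading_zero_ok)   # False
```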

While you could use the built-in function eval() instead of the above functions, this is not recommended, because someone could pass you a Python expression that might have unwanted side effects.

For example, a caller might pass a string containing a call to os.system() to execute an arbitrary command; this is a serious danger for applications such as CGI scripts that need to handle data from an unknown source. eval() will also be slower than a more specialized conversion operation.
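
One way to avoid eval() on untrusted input is to try the specialized converters in turn and treat ValueError as "not a number". The helper name to_number is hypothetical, and the sketch assumes Python 3.

```python
def to_number(text):
    """Return an int or float parsed from text, or raise ValueError."""
    try:
        return int(text)
    except ValueError:
        # Not an integer literal; float() raises ValueError itself if the
        # text is not a valid floating-point number either.
        return float(text)


print(to_number("533"))     # 533
print(to_number("45e-3"))   # 0.045
```

A string containing a Python expression, such as a call to os.system(), is never executed here; it simply fails both conversions and raises ValueError.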

Mutable Strings

Problem

Python’s strings are immutable, so you can’t change a single character of the string without constructing a new string. Sometimes this is a problem, particularly when dealing with very large strings; constructing a new string will require copying most of the original string, and can be slow.

Solution

Use the <<PythonModule array>> module, which provides a mutable array data type that can only hold values of a single type. If used to hold only characters, an array behaves similarly to a mutable string.

import array

A = array.array('c')
A.fromstring('hello there!')
print A

This prints:

array('c', 'hello there!')

The array object A is mutable, so you can modify an element in place:

A[0] = 'j'
print A

This will print:

array('c', 'jello there!')

Most functions that require strings, such as string.split(), won’t accept array objects, so you’ll have to convert the array to a string in order to pass the data to certain functions:

print string.split( A.tostring() )

But since both strings and arrays are [[sequence types|http://www.python.org/doc/2.4.1/lib/typesseq.html]], several forms of indexing and slicing can be performed on either.
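
The 'c' typecode used above exists only in Python 2. In Python 2.6 and later, the built-in bytearray type offers similar in-place modification for byte data. A minimal sketch, assuming Python 3 and ASCII-only text:

```python
buf = bytearray(b"hello there!")

# Elements of a bytearray are integers, so assign the byte value of "j".
buf[0] = ord("j")

# Slice assignment also works in place and may change the length.
buf[6:11] = b"world"

# Decode to get an ordinary (immutable) string for functions that need one.
text = buf.decode("ascii")
print(text.split())   # ['jello', 'world!']
```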

Parsing Input

Problem

You wish to parse some sort of structured input, such as a configuration file or some data file.

Solution

There are many possible ways to go about this.

For simple input parsing, the easiest approach is usually to split the line into whitespace-delimited words using string.split(), and then to convert decimal strings to numeric values using string.atoi(), string.atol() or string.atof(). If you want to use a delimiter other than whitespace, string.split() can handle that, and can be combined with string.strip(), which removes surrounding whitespace from a string.
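
As an illustration of the split-and-convert approach, the sketch below parses one line of whitespace-delimited fields and one line with a custom delimiter. The input lines and field layout are made up for the example, and it uses Python 3 string methods and the int() and float() built-ins in place of the string module functions.

```python
# A whitespace-delimited record: name, count, weight (hypothetical layout).
line = "widget   12   4.75\n"
name, count, weight = line.split()
record = (name, int(count), float(weight))
print(record)    # ('widget', 12, 4.75)

# A comma-delimited record; strip() removes the whitespace around each field.
line = " red , green,blue \n"
colours = [field.strip() for field in line.split(",")]
print(colours)   # ['red', 'green', 'blue']
```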

For more complicated input parsing, the <<PythonModule re>> module’s regular expressions are better suited for the task, and are more powerful than C’s sscanf().

Discussion

A parser for configuration files is included as the <<PythonModule ConfigParser>> module in the standard library; you should take a look at it and see if it meets your needs.

Python programmers often choose to decree that software configuration files should be written in <<RFC 822>> format, specifying names and their corresponding values; this allows using the parser in the <<PythonModule rfc822>> module to read the files. A sample configuration file might then look like this:

Title: Index to Python Information
Description: Python code, information, and documentation.
Keywords: Python, Python articles, Python documentation
Palette: gold
Sidebar: none
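
The rfc822 module was removed in Python 3; the email package in the standard library parses the same header format. A sketch assuming Python 3, with the sample file above held in a string (the variable names are illustrative):

```python
from email import message_from_string

config_text = """\
Title: Index to Python Information
Description: Python code, information, and documentation.
Keywords: Python, Python articles, Python documentation
Palette: gold
Sidebar: none
"""

config = message_from_string(config_text)

# Header lookups are case-insensitive and return None for missing names.
print(config["Title"])     # Index to Python Information
print(config["palette"])   # gold
print(config["Missing"])   # None

keywords = [k.strip() for k in config["Keywords"].split(",")]
print(keywords)   # ['Python', 'Python articles', 'Python documentation']
```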

There’s a contributed module that emulates sscanf(), written by Steve Clift and available from the Contributed Software section [[here|http://www.python.org/ftp/python/contrib-09-Dec-1999/Misc/sscanfmodule.c.Z]].

If you’re trying to parse some sort of well-known file format, it’s possible that a Python module has already been written to deal with it. Some common cases are:

  • [[The Python Imaging Library|http://www.pythonware.com/]] can read many different graphics formats, ranging from well-known ones such as [[GIF]] and [[JPEG]], to more specialized formats such as [[DCX]] and [[TIFF]].

  • Support for many scientific file formats has been implemented for use with Numeric Python; consult the [[Scientific Computing|http://www.python.org/topics/scicomp/]] topic guide for more information.

  • The [[XML]] [[topic guide|http://www.python.org/topics/xml/]] tracks the available software for processing [[XML]] with Python.

Before spending a lot of effort implementing a module for a new file format, do some research first and see if someone has already done it; you might save yourself a lot of work.

Reversing Strings

Problem

You want to reverse a string (cba from abc), and yet you cannot find a string method to do it.

Solution

There is a rather hackish one-liner to reverse a string, which relies on [[extended slices|http://www.python.org/doc/2.3.5/whatsnew/section-slices.html]]:

>>> text='foobar'
>>> print text[::-1]
raboof
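
An alternative that some readers find more explicit is to combine the reversed() built-in (added in Python 2.4) with the join() string method. A sketch assuming Python 3:

```python
text = "foobar"

# reversed() yields the characters last-to-first; join() reassembles them.
backwards = "".join(reversed(text))

print(backwards)                 # raboof
print(backwards == text[::-1])   # True
```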

Searching and Replacing

Problem

Given a string, you wish to replace all the occurrences of one substring with different text. For example, you wish to replace all occurrences of the string ‘USER-NAME’ with the string ‘Joseph Addison’ in the string contained in the variable data.

Solution

For simple string substitutions, the string.replace function will be the simplest and fastest solution.

import string
newdata = string.replace(data, 'USER-NAME', 'Joseph Addison')

If data contains “USER-NAME has been added to the list.”, then after the above line is executed, newdata will contain ‘Joseph Addison has been added to the list.’.

Discussion

<<<

Note: the following has not yet been revised to account for [[Unicode]]/[[UTF-8]] details, and so should be taken with a grain of salt in modern Python - but the principles are sound.

<<<

String replacement is a common operation, and there are several ways to do it. The fastest and simplest way is the string.replace() function, which can only replace a fixed string with another fixed string. For example, the following line replaces the Latin-1 character 219, a capital U with a circumflex accent (^), with the HTML entity &Ucirc;.

newdata = string.replace(data, chr(219), '&Ucirc;')

The chr() built-in function takes an integer between 0 and 255, and returns a string of length 1 that contains the character with that byte value.

Multiple replacements will require multiple calls to string.replace():

newdata = string.replace(data, chr(219), '&Ucirc;')
newdata = string.replace(newdata, chr(233), '&eacute;')

The replacement string is fixed, so it can’t be varied depending on the string that was matched. For cases that require matching variable strings, such as “match anything between square brackets”, or that require varying the replacement string, you’ll have to use regular expression matching, available through the built-in <<PythonModule re>> module.
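
When several fixed characters need different replacements, re.sub() also accepts a function as the replacement argument, which allows a single pass over the string. The table name ENTITIES and the helper to_entities are hypothetical, and the sketch assumes Python 3, where strings are Unicode:

```python
import re

# Map each character to its HTML entity (illustrative subset).
ENTITIES = {
    "\u00db": "&Ucirc;",   # Latin-1 character 219
    "\u00e9": "&eacute;",  # Latin-1 character 233
}

# An alternation matching any key of the table.
_pattern = re.compile("|".join(re.escape(ch) for ch in ENTITIES))


def to_entities(data):
    """Replace every character listed in ENTITIES with its entity."""
    # The function receives the match object and returns the replacement.
    return _pattern.sub(lambda m: ENTITIES[m.group(0)], data)


print(to_entities("caf\u00e9 \u00db"))   # caf&eacute; &Ucirc;
```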

The following example uses regular expressions to replace URLs in a string with the [[HTML]] for a link to that URL, with the URL as the link text.

import re

data = """http://www.freshmeat.net
 ftp://ftp.python.org/pub/python/src/
"""

newdata = re.sub(r"""(?x)
(  # Start of group 1
 (http|ftp|mailto)  # URL scheme
 :                  # Separating colon
 \S+                # Everything up to the next whitespace character
)  # End of group 1
""", r'<a href="\g<1>">\g<1></a>', data)

Notice that the order of the arguments to re.sub is different from the arguments for string.replace(). string.replace() takes the arguments (string, substring, replacement), while re.sub() takes the arguments (pattern, replacement, string). re.sub() has a different ordering of its arguments for consistency with the other functions in the re module; the regular expression pattern is viewed as the most important argument, so it’s always passed as the first argument.

Regular expression patterns have a complicated syntax. Let’s dissect the above pattern into its components.

  • (?x) Specifies that this pattern is expressed in verbose mode. Most whitespace will be ignored, so you can format the pattern neatly, and comments can be embedded in the pattern by preceding them with a “#”.

  • (http|ftp|mailto) The parenthesized group lists several alternative strings, separated by “|” characters. Any one of these strings can produce a successful match for this component of the pattern.

  • \S+ The \S special sequence matches any character that isn’t a whitespace character. Whitespace characters are space “ ”, tab “\t”, newline “\n”, carriage return “\r”, form feed “\f”, and vertical tab “\v”. The “+” character is a qualifier that specifies how many times the previous component should be repeated; “+” indicates that the \S should be repeated one or more times.

The replacement string can contain sequences that will be replaced by pieces of the matching string. For example, \g<1> will be replaced by the contents of the first group, which in this case will contain the whole URL that matches the regular expression. The replacement string is therefore '<a href="\g<1>">\g<1></a>', which will contain the text of the URL at two different places, along with the required HTML.
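
Putting the pieces together, the substitution can be wrapped in a small function and checked against sample text. The function name linkify is hypothetical, and the sketch assumes Python 3:

```python
import re

URL_PATTERN = re.compile(r"""(?x)
(                      # Start of group 1
 (http|ftp|mailto)     # URL scheme
 :                     # Separating colon
 \S+                   # Everything up to the next whitespace character
)                      # End of group 1
""")


def linkify(text):
    """Wrap each URL in text in an HTML anchor, using the URL as link text."""
    return URL_PATTERN.sub(r'<a href="\g<1>">\g<1></a>', text)


print(linkify("see ftp://ftp.python.org/pub/python/src/ for sources"))
# see <a href="ftp://ftp.python.org/pub/python/src/">ftp://ftp.python.org/pub/python/src/</a> for sources
```

Because \S+ stops only at whitespace, punctuation that directly follows a URL ends up inside the link.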

The <<PythonModule re>> module can perform simple substitutions that are equally possible with string.replace(). This is simply a matter of writing a pattern that doesn’t use any of the regular expression syntax:

data = re.sub('USER-NAME', 'Joseph Addison', data)

Try not to use regular expressions when a simpler fixed-string replacement can do the job, because you’re paying a speed penalty for the extra generality of regular expressions.

For much more information on regular expressions in Python, consult the [[Regular Expression HOWTO|http://www.python.org/doc/howto/regex/]].

Splitting Strings from the right

Problem

You want to split a string starting from the far (i.e. right-hand) side, yet you cannot find a string method to do it (rsplit was only introduced in 2.4).

Solution

Here’s an approach that leverages the [[Reversing Strings]] one-liner:

import sys

def rsplit(s, sep=None, maxsplit=-1):
  """
  Equivalent to str.split, except splitting from the right.
  """
  if sys.version_info < (2, 4, 0):
    # No str.rsplit before 2.4: reverse, split from the left, reverse back.
    if sep is not None:
      sep = sep[::-1]
    L = s[::-1].split(sep, maxsplit)
    L.reverse()
    return [part[::-1] for part in L]
  else:
    return s.rsplit(sep, maxsplit)
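
To see that the reversal trick agrees with the built-in method, the sketch below isolates the fallback branch as a separate function (the name rsplit_by_reversal is hypothetical) and compares it with str.rsplit(). It assumes Python 3.

```python
def rsplit_by_reversal(s, sep=None, maxsplit=-1):
    """Split s from the right by reversing, splitting, and reversing back."""
    if sep is not None:
        sep = sep[::-1]          # the separator must be reversed as well
    parts = s[::-1].split(sep, maxsplit)
    parts.reverse()              # restore left-to-right order of the pieces
    return [p[::-1] for p in parts]


print(rsplit_by_reversal("a.b.c.d", ".", 1))     # ['a.b.c', 'd']
print("a.b.c.d".rsplit(".", 1))                  # ['a.b.c', 'd']
print(rsplit_by_reversal("1ab2ab3", "ab", 1))    # ['1ab2', '3']
```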

Splitting Strings into Equal Sections

Problem

You wish to split a string into N equally-sized parts. The last part might be smaller than the previous ones when the length of the string isn’t an exact multiple of N. For example, dividing the 9-character string ‘Woodhenge’ into 2 parts would result in the list [‘Woodh’, ‘enge’], containing a 5-character and a 4-character string.

Solution

The following function takes a string S and a number N, and returns an N-element list containing the different sections of the string. The function works by taking the length of the string and computing how long the sections must be. The list is then constructed by looping over the string and extracting each section by slicing the string.

def equal_split(S, N):
   """Split up the string S into N parts, returning a list containing
   the parts.  The last part may be smaller than the others."""
   part = (len(S) + N - 1) // N   # section length, rounded up
   L = []
   for i in range(0, N):
       L.append( S[part*i : part*i+part] )
   return L

equal_split(‘this is a test’, 3) will return [‘this ‘, ‘is a ‘, ‘test’].
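
A compact Python 3 sketch of the same approach follows (the name equal_split3 is hypothetical). It uses // so that the section length stays an integer, and the last call shows that trailing parts can come out empty when the string is short relative to N:

```python
def equal_split3(S, N):
    """Split S into N parts; later parts may be shorter or empty."""
    part = (len(S) + N - 1) // N   # section length, rounded up
    return [S[part * i : part * (i + 1)] for i in range(N)]


print(equal_split3("this is a test", 3))   # ['this ', 'is a ', 'test']
print(equal_split3("Woodhenge", 2))        # ['Woodh', 'enge']
print(equal_split3("abcd", 3))             # ['ab', 'cd', '']
```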

More general slicing and indexing of strings works in the same way as other [[sequence types|http://www.python.org/doc/2.4.1/lib/typesseq.html]].