Lesson A9 – Working with files

Text files

Sooner or later you will want to create a text file to store some data, or read data from a file to work with it. For example, you may want to read the coordinates of a molecule from a structure file (perhaps an entry from the PDB database), or write a human-readable log file while a program runs. A text file can be thought of as a sequence of lines (or rows), each holding an essentially unlimited number of characters. In general, a text file is read and written line by line.

The straightforward approach to open a file, write to it, and close it looks something like this in Python:

[1]:
file_ = open("io/file.txt", "w")  # open a file in "w" write mode
file_.write("0 1\n")  # write to the file
file_.close()  # close the file

Note: If you are wondering why we use file_ as the variable name with a trailing underscore, this is just a convention to avoid shadowing a Python built-in name (we did a similar thing in the container exercise with list_ and dict_). For file_ this is only strictly necessary in Python 2, where file is a built-in type. In Python 3, file is no longer a built-in name.

Advanced: The function open() returns a file object that we store in the variable file_. This file object has, for example, the attributes file_.name ('io/file.txt') and file_.mode ('w'). To work with the file we can use, among others, the methods file_.read(), file_.readlines() or file_.write(). The method file_.close() closes the file. After that, the variable file_ still exists, but we cannot work with the file object anymore.

>>> file_ = open("dummy", "w")
>>> file_.close()
>>> print(file_.write(""))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-78b2a279da79> in <module>
      1 file_ = open("dummy", "w")
      2 file_.close()
----> 3 print(file_.write(""))

ValueError: I/O operation on closed file.
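
The attributes mentioned above can be inspected directly. The following is a minimal sketch; the scratch path built with tempfile is an assumption made so that the example runs anywhere, and file_.closed is an additional attribute that tells you whether a file object is still usable.

```python
import os
import tempfile

# Hypothetical scratch location (not the io/ directory used in this lesson)
path = os.path.join(tempfile.mkdtemp(), "dummy")

file_ = open(path, "w")
mode = file_.mode             # the mode the file was opened in
closed_before = file_.closed  # False while the file is open
file_.close()
closed_after = file_.closed   # True once the file has been closed

print(mode, closed_before, closed_after)
```

Checking file_.closed is a cheap way to find out whether a write or read would fail with the ValueError shown above.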

In the same way, reading from the file we just created looks like this (note that the read mode "r" is the default and can be omitted):

[2]:
file_ = open("io/file.txt", "r")  # open a file in "r" read mode
line = file_.readline()  # read the first line of the file
print(line)
file_.close()  # close the file
0 1
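
The reading methods differ in how much they return per call. As a rough sketch (the two-line file content and the temporary path are assumptions for illustration only):

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), "file.txt")
file_ = open(path, "w")
file_.write("0 1\n2 3\n")
file_.close()

file_ = open(path)
first = file_.readline()   # one line, including the trailing "\n"
rest = file_.readlines()   # list of all lines not read so far
file_.close()

file_ = open(path)
everything = file_.read()  # the whole content as a single string
file_.close()

print(repr(first))
print(rest)
print(repr(everything))
```

Note that readlines() and read() continue from the current position in the file, which is why rest does not contain the first line again.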

Doing it this way is, however, a bit dangerous. Note that the file has to be closed explicitly at the end. If you forget this last line, the file remains open, which is a potential source of problems. Open file objects occupy resources (although the file content is not read into RAM) and may slow your program down if you have many files open. Files left open can also be the starting point of bugs, where you unintentionally write to a file that is still open or prevent other applications from accessing it because your open handle blocks them. And even if you remember the last line, imagine something unexpected comes up while you process the file and your program is interrupted before the close() statement is executed. Python can normally handle this well, but the better way to do it is this:

[3]:
with open("io/file.txt", "r") as file_:
    print(file_.readline())
0 1

The with statement uses the file object returned by open() as a so-called context manager. It handles the opening of the file and guarantees that the file is closed correctly in any case, even if an error occurs inside the block.
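
To make the guarantee concrete, here is a small sketch comparing a manual try/finally with the with statement. The temporary path and the deliberately raised RuntimeError are assumptions used only for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as file_:
    file_.write("0 1\n")

# Manual version: `finally` runs whether or not an error occurred
file_ = open(path)
try:
    line = file_.readline()
finally:
    file_.close()

# `with` version: the file is closed although an error is raised inside
try:
    with open(path) as file_:
        raise RuntimeError("something unexpected came up")
except RuntimeError:
    pass

print(file_.closed)
```

Both variants close the file reliably; the with statement simply does it in fewer lines.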

An open file object is an iterator. As we have seen before, iterators are iterable, so we can loop over the lines of the file. Opening a file does not load its content into memory; the content is produced line by line as you iterate over the open file object.

[4]:
with open("io/file.txt", "r") as file_:
    print(file_)
    # Printing the file object itself prints only a representation of it
    # and does not print the content
<_io.TextIOWrapper name='io/file.txt' mode='r' encoding='UTF-8'>
[5]:
with open("io/file.txt", "r") as file_:
    for line in file_:
        # Iterations over lines in the file
        print(line)
        # Print the current value of `line`
0 1
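
The blank line after the output appears because each line read from a file still ends with a newline character, and print() adds another one. A short sketch of two ways to deal with this (the string below stands in for a line read from the file):

```python
line = "0 1\n"  # stand-in for a line as read from the file

print(repr(line))         # repr() makes the trailing newline visible
stripped = line.rstrip("\n")
print(stripped)           # newline removed before printing
print(line, end="")       # or tell print() not to add its own newline
```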

We can loop over a file object only once. Once the last line, and with it the end of the file, has been reached, the file object is exhausted. Calling next() on it again raises a StopIteration exception.

[6]:
with open("io/file.txt", "r") as file_:
    for line in file_:
        print(line)
    next(file_)
0 1

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-6-ea5ea6d2d86e> in <module>
      2     for line in file_:
      3         print(line)
----> 4     next(file_)

StopIteration:
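
If you do need a second pass while the file is still open, one option is to move back to the beginning with file_.seek(0). This is only a sketch; the temporary path and file content are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as file_:
    file_.write("0 1\n2 3\n")

with open(path) as file_:
    first_pass = [line for line in file_]
    second_pass = [line for line in file_]  # exhausted: nothing left
    file_.seek(0)                           # jump back to the start
    third_pass = [line for line in file_]

print(first_pass)
print(second_pass)
print(third_pass)
```

For small files it is often simpler to store the lines in a list once and loop over that list as often as needed.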

Advanced

File parsing

When parsing files line by line, the various methods available for string objects come in handy again. To demonstrate a few of them in action, let’s create a file first.

[7]:
filestring = """# Example data file created by a very scientific program

@Please cite: P. Silie, K. Te, G. Heinrich *Nature Science* __1961__, *1*, 123.
@Legend:
@x: time / ps
@y: observation / au
 0 12.3
 1 435.3
 2 24.4
 3 34.6
 4 221.3
 5 11.2
 6 10.1
 7 10.0
 8 10.2
 9      # data corrupted?
10
"""
[8]:
with open("io/output.txt", "w") as file_:
    file_.write(filestring)

The file consists of a header (lines starting with # or @) followed by a data block. We want to read this file but extract only the actual data, which is the second column of the data part. Our strategy is to read the file line by line, skip the lines that begin with a comment character, and try to get the second column entry where possible. Note that we use the enumerate function, which we met in the iteration exercise, to track the line number and the line content at the same time while we loop through the file.
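
Before looking at the full loop, it may help to see what the individual string methods return for single lines. The two lines below are copied from the example file; the variable names are chosen for illustration only.

```python
good_line = " 0 12.3\n"
bad_line = " 9      # data corrupted?\n"

# startswith() accepts a tuple and checks for any of its entries
is_comment = good_line.startswith(("#", "@"))

# split() without arguments cuts at any run of whitespace
good_columns = good_line.split()
bad_columns = bad_line.split()

value = float(good_columns[1])

print(is_comment)
print(good_columns, value)
print(bad_columns)  # second element is "#", which float() rejects
```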

[9]:
data = []
# Empty list in which the data (second column in file) will be stored

with open("io/output.txt") as file_:
    for i, line in enumerate(file_):
        # i (int): line index starting at 0
        # line (str): content of the line

        if line.startswith(("#", "@")):
            continue
            # Skip these lines
            # Go directly to next iteration
            # Do not execute the rest of the loop

        # Line not skipped
        try:
            # Try to add entry to `data` list
            data.append(
                # Try to get second column and convert to float;
                float(line.split()[1])
                )
                # `.split()` cuts the line at any run of whitespace
                # and returns a list.  We try to convert the second
                # element ([1]) of this list to float

        except (IndexError, ValueError):
            # IndexError: The line does not have at least two columns;
            #     Indexing the list returned by `.split()` fails.
            # ValueError: The line has a second column but the element
            #     can not be converted to a float

            print(f"Can't read from line {i}")
            # Print this if we caught one of the above errors.

print(data)
Can't read from line 1
Can't read from line 15
Can't read from line 16
[12.3, 435.3, 24.4, 34.6, 221.3, 11.2, 10.1, 10.0, 10.2]
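
The same strategy can be wrapped in a function so that it is easy to reuse. The following is a sketch rather than a definitive implementation: the function name parse_second_column, its parameters, and the short example text are all hypothetical. It accepts any iterable of lines, so an open file object works just as well as a list of strings.

```python
def parse_second_column(lines, comment_chars=("#", "@")):
    """Return (data, skipped): floats from column two and unreadable line indices."""
    data = []
    skipped = []
    for i, line in enumerate(lines):
        if line.startswith(comment_chars):
            continue  # header or comment line
        try:
            data.append(float(line.split()[1]))
        except (IndexError, ValueError):
            skipped.append(i)  # remember the line instead of printing
    return data, skipped


# Hypothetical shortened example in the same format as the file above
example = """# header
@x: time / ps
 0 12.3
 1      # data corrupted?
 2 24.4
"""

data, skipped = parse_second_column(example.splitlines())
print(data)
print(skipped)
```

Returning the indices of unreadable lines instead of printing them leaves it to the caller to decide how to report or handle them.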