How to skip the header while processing a .txt file?

August 18, 2017, at 05:01 AM

In Think Python by Allen Downey, exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with something like "Produced by". This is the solution the author gives:

import string

def process_file(filename, skip_header):
    """Makes a dict that contains the words from a file.
    res = dict
    filename: string
    skip_header: boolean, whether to skip the Gutenberg header
    returns: map from each word to the number of times it appears
    """
    res = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        process_line(line, res)

    return res

def process_line(line, res):
    for word in line.split():
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1

def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.
    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break

I really don't understand the flow of execution in this code. Once the code starts reading the file using skip_gutenberg_header(fp), which contains "for line in fp:", it finds the needed line and breaks. However, the next loop picks up right where the break statement left off. But why? My understanding is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?

Answer 1

No, it shouldn't re-start from the beginning. An open file object maintains a file position indicator, which gets moved as you read (or write) the file. You can also move the position indicator via the file's .seek method, and query it via the .tell method.

So if you break out of a for line in fp: loop you can continue reading where you left off with another for line in fp: loop.
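To see this in action, here is a minimal sketch. The sample text and the helper names (SAMPLE, read_in_two_loops, demo) are made up for illustration and are not from the book. It runs two loops over the same open file, then uses .seek(0) to go back to the start:

```python
import os
import tempfile

# Hypothetical sample text standing in for a Gutenberg file.
SAMPLE = (
    "Title: Example\n"
    "Produced by Someone\n"
    "first body line\n"
    "second body line\n"
)

def read_in_two_loops(path):
    """Return (header_lines, body_lines, lines_after_rewind)."""
    header, body = [], []
    with open(path) as fp:
        for line in fp:              # first loop: consume the header
            header.append(line.rstrip("\n"))
            if line.startswith("Produced by"):
                break
        for line in fp:              # second loop: resumes after the break
            body.append(line.rstrip("\n"))
        fp.seek(0)                   # move the position back to the start
        rewound = [line.rstrip("\n") for line in fp]
    return header, body, rewound

def demo():
    # Write the sample to a temporary file so the example is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(SAMPLE)
        path = tmp.name
    try:
        return read_in_two_loops(path)
    finally:
        os.remove(path)

header, body, rewound = demo()
print(header)   # the two header lines
print(body)     # only the lines after "Produced by ..."
print(rewound)  # all four lines, because of seek(0)
```

So if you really did want the second loop to start from the beginning, you would have to call fp.seek(0) (or reopen the file) yourself.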

BTW, this behaviour of files isn't specific to Python: all modern languages that inherit C's notion of streams and files work like this.

The .seek and .tell methods are mentioned briefly in the tutorial.

For a more in-depth treatment of file / stream handling in Python, please see the docs for the io module. There's a lot of info in that document, and some of that information is mainly intended for advanced coders. You will probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you try to read... or the first few times. ;)

Answer 2

My understanding is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?

If fp were a list, then of course it would. However, it's not: it's a file-like object, which is its own iterator and has methods like seek, tell, and read. File-like objects keep state. When you read a line from one, the read position in the file advances, so the next read starts at the following line.

This is commonly used to skip the header of tabular data (when you're not using a csv.reader, at least):

with open("/path/to/file") as f:
    headers = next(f).strip()  # first line
    for line in f:
        # iterate by-line for the rest of the file
        ...
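The difference between a list and a file-like object can be checked directly. The following is only a sketch: first_then_rest is a hypothetical helper, and io.StringIO is used here as a stand-in for a real open file so that the example needs no file on disk.

```python
import io

def first_then_rest(iterable):
    """Take one item in a first loop, then collect what a second loop sees."""
    first = None
    for item in iterable:
        first = item
        break
    rest = [item for item in iterable]   # a second, separate loop
    return first, rest

lines = ["header\n", "a\n", "b\n"]

# A list hands out a fresh iterator for every loop, so the second loop restarts.
list_result = first_then_rest(lines)

# A file-like object is its own iterator, so the second loop continues.
file_like = io.StringIO("".join(lines))
file_result = first_then_rest(file_like)

print(list_result)                     # second loop saw all three items
print(file_result)                     # second loop saw only "a" and "b"
print(iter(lines) is lines)            # False: a new iterator each time
print(iter(file_like) is file_like)    # True: the object iterates itself
```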