Interacting with the file system is a common task when you are doing any kind of system programming with Python. A common sub-task in this is to iterate over files and folders in a directory in the file system.

The os.walk function

Python provides the excellent os.walk directory and file iterator function for this purpose. This function returns a generator that allows you to walk over the files and folders starting from a given directory in an iterative fashion.

For example, this is how you can iterate over the list of mp3 music files say in your $HOME/Music folder and perform some action, in this case, just count them.

import os

def count_mp3(top='~/Music'):
    count = 0

    for root,dirs,files in os.walk(os.path.expanduser(top)):
        for f in files:
            if f.lower().endswith('.mp3'):
                count += 1

    print('Found {} mp3 files in {}'.format(count, top))
    return count

And this is how you can generalize this using the fnmatch module say for any type of file.

import os
import fnmatch

def count_files(top, pattern):
    count = 0

    for root,dirs,files in os.walk(os.path.expanduser(top)):
        count += len(fnmatch.filter(files, pattern))

    print('Found {} files in {} matching {}'.format(count, top, pattern))
    return count

The fnmatch module provides Unix style file matching in Python, so this is how you can call the above function.

>>> count_files('~/Videos', '*.mp4')
Found 118 files in ~/Videos matching *.mp4
118

The find command

If you are familiar with Linux or any other *nix system, you must be used to the very versatile find command. This command allows you to search for files matching a variety of conditions.

For example, the command matching our mp4 video file counter using find is,

$ find ~/Videos -name \*.mp4 | wc -l
118

However find can do much more than filter files by their extensions. Here are some examples.

Find filenames greater than a given size (Say 5MB from your ~/Pictures folder)

$ find ~/Pictures -type f -size +5M
...

Find filenames smaller than a given size.

$ find ~/Pictures -type f -size -5M 
...

Find files modified in the last n days (Say 10 days from your ~/Pictures folder)

$ find ~/Pictures -type f -mtime -10
...

Can we write code in Python similar to the find program which allows you to filter files from a folder which satisfies a set of conditions without having to write a new function every time ?

That is the focus of this post.

Composing functions - Predicate based filtering

The problem with os.walk is that it is a general purpose iterator. It does not accept any predicate (a filter function) as argument which allows one to pass in a condition as a function to it, to customize the directories/files it generates.

How difficult it is to write one ?

As it turns out this is not a very difficult thing to do. Here is a function which returns a generator over files in a directory named top which matches a given predicate. Since it generates only a list of files (with full pathnames) I am calling it as fwalk.

def fwalk(top, predicate=None):
    """ Generator that filters results of os.walk
    using a predicate function """
    
    for root,dirs,files in os.walk(os.path.expanduser(os.path.expandvars(top))):
        for f in files:
            cond = (predicate == None) or predicate(root, f)
            if cond:
                yield os.path.join(root, f)

The idea here is simple. We pass in a predicate function - think of it as filter functions that return either True or False - to fwalk and it generates only those file paths which match the predicate. The predicate function accepts a tuple of (directory, filename) pair.

The fwalk function allows one to create new functions by composing predicate functions and layering their logic on top to achieve the final logic we want.

Here are some examples, which repiclates what find did in the section above.

  1. Finding a list of mp4 files in the ~/Videos folder.

    >>> fwalk('~/Videos',predicate=lambda x,y: fnmatch.fnmatch(y, '*.mp4'))
    <generator object fwalk at 0x7fcf43cdf891>
    >>> len(list(fwalk('~/Videos',predicate=lambda x,y: fnmatch.fnmatch(y, '*.mp4'))))
    118
    
  2. Find pictures of size > 5MB in the ~/Pictures folder.

    >>> predicate = lambda x,y: os.path.getsize(os.path.join(x,y)) > 5*1024**2
    >>> fwalk('~/Pictures', predicate=predicate)
    <generator object fwalk at 0x7fcf43ef7c78>
    >>> len(list(_))
    492
    
  3. Find pictures of size < 5MB in the ~/Pictures folder.

    >>> predicate = lambda x,y: os.path.getsize(os.path.join(x,y)) < 5*1024**2
    >>> fwalk('~/Pictures', predicate=predicate)
    <generator object fwalk at 0x7fcf43ef7b88>
    >>> list(_)
    ['/home/anand/Pictures/skeptichacker_old.png', '/home/anand/Pictures/pp.jpeg',
    ...
    ...
    ...
    ]
        
    

Adding flexibility - Combining Predicates

It does not take much effort to extend the above fwalk function to a more generic use-case which accepts a set of predicate functions than a single one. Here is the rewritten function.

def fwalk(top, predicates=[]):
    """ Generator that filters results of os.walk using a set of predicate functions """
    
    for root,dirs,files in os.walk(os.path.expanduser(os.path.expandvars(top))):
        for f in files:
            cond = all(pred(root, f) for pred in predicates)
            if cond:
                yield os.path.join(root, f)

How does this work ?

As you may have noted, the function accepts a list of predicates and only generates those filepaths which match all the predicates. It makes use of Python’s wonderful all built-in function to do this.

With this function as a tool, we can now start building a generic find like function in Python which can be extended with any set of predicate functions to give us a general purpose, customizable directory iterator in Python.

May I present to you the find function without much ado!

import re

def find(folder, **kwargs):
    """ General purpose `find` emulator using Python """

    # Built-in predicates we support
    # any pattern using keyword 'pattern'
    # max size using keyword 'max_size'
    # min size using keyword 'min_size'
    
    pdict = {'pattern':  lambda x, y: re.search(re.escape(kwargs['pattern']), os.path.join(x,y), re.IGNORECASE),
             'min_size': lambda x, y: os.path.getsize(os.path.join(x,y)) > kwargs['min_size'],
             'max_size': lambda x, y: os.path.getsize(os.path.join(x,y)) < kwargs['max_size']}
    
    predicates = []

    for k in kwargs:
        if k in pdict:
            predicates.append(pdict[k])
            
    return fwalk(folder, predicates)

NOTE: Note how we are using `re` module for general purpose string matching instead of `fnmatch`.

How does this work ?

Our find function supports some built-in predicate functions using its pdict dictionary. The functions are defined using specific keywords as keys in the dictionary. We support the following keys namely pattern, min_size and max_size. The values are anonymous functions defined using lambda and accepting a pair of (x,y) arguments where (x,y) stands for the (directory, filename) pair.

When someone calls the function using one or more of the keys as optional argument, we use the value of the key to build out the predicate function. This is then appended to the list of predicates. Finally fwalk is called using the built predicate list and the result is returned.

Here is our find in action.

## Generator for files matching my vacation pics in Hawaii!
>>> find('~/Pictures', pattern='hawaii')
...
## Generator for Videos above 1G in size
>>> find('~/Videos', min_size=1024**3)
...
## Combine both
## Videos of vacation in hawaii which are over 1G in size
>>> find('~/Videos', min_size=1024**3, pattern='hawaii')
...
## Pictures which are in the size range 1 to 2 MB
>>> find('~/Pictures', min_size=1024**2, max_size = 2*1024**2)
...

The find function demonstrates the power of Python’s keyword based functions on top of the ability to compose functions giving the programmer a lot of flexibility by dynamically composing functions and layering logic. This is a very good example of functional programming in Python utilized to its fullest extent.

Our find function is extensible by easily adding new predicates in its pdict .

Follow Ups

If you understood the code in this post, you can take it upon as an exercise to add predicates for filtering files based on modification and/or creation times. You will need to make use of the stat module in Python for this purpose.

If you have any questions on this post or want to reuse code from here for your purposes, please add a comment or write to us.


Note that name and e-mail are required for posting comments