Find and Prompt for Removal of Duplicate Files

The find_dup_files.py program could really use a more clever name, but it does what its name says. It is a simple but useful program that scans a set of directory trees looking for duplicate files. When duplicate files are found, it lists them and prompts the user to choose how to handle the set. The user may skip the set of duplicates, keep one of the files and delete the rest, delete all of the files, or exit the program.

This program will not likely be needed very often, but it can really help with cleaning up and organizing files that have accumulated over many years. I find the program especially useful with digital pictures, which have been maintained by several people over several years. Duplicate files can also crop up as files are copied between computers.

For the student wanting to learn Python programming by example, this program demonstrates the use of dictionaries, lists, and various file and directory operations. The combined use of dictionaries and lists demonstrated here is a common strategy for finding how often each value occurs in a set of data.

Usage

The program needs one or more directory names as arguments to begin searching for duplicates.

How it works

The program begins by evaluating each file by its file size. As the file system is scanned, a dictionary is built that maps each file size to a list of the files having that size. From this dictionary, a list of lists is extracted, with each sublist holding a set of files that are the same size and are therefore potential duplicates.
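The size-grouping step can be sketched as follows. This is a minimal illustration and not the code from find_dup_files.py: the function name is borrowed from the documentation below, but the body, the decision to skip symbolic links, and the choice to drop sizes that occur only once are assumptions.

```python
import os
from collections import defaultdict


def scan_by_size(directories):
    """Group the files found under the given directories by size in bytes."""
    by_size = defaultdict(list)  # file size -> list of paths with that size
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                if os.path.islink(path):
                    continue  # assumption: symbolic links are ignored
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # file vanished or is unreadable
    # Assumption: a size seen only once cannot be a duplicate, so drop it.
    return [paths for paths in by_size.values() if len(paths) > 1]
```

Comparing sizes requires no file reads at all, which is why it makes a cheap first filter before any hashing is done.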

Next, the sets of potential duplicates are evaluated twice by calculating MD5 hash values. In the first pass, only the first 1024 bytes of each file are processed. The reason for this abbreviated pass is that many files may have the same size, and calculating hash values over whole files is fairly slow. The first pass therefore makes a quick determination that eliminates many non-duplicates. To be certain that the files are duplicates, the second pass compares each remaining set by calculating the MD5 hash value for the whole file.
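The two hash passes might look something like the sketch below. The function names match the documentation that follows, but the bodies are assumptions rather than the program's actual code; in particular, the `nbytes` and `chunk_size` parameters are hypothetical additions made for illustration.

```python
import hashlib


def first_hash(path, nbytes=1024):
    """Return the MD5 hex digest of the first nbytes bytes of the file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def full_hash(path, chunk_size=65536):
    """Return the MD5 hex digest of the whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files that differ only after their first 1024 bytes produce the same `first_hash` value but different `full_hash` values, which is why the second pass is still needed.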

Download find_dup_files

find_dup_files.py

Auto Generated Documentation

find_dup_files.first_hash(file)
Returns the MD5 hash value for the first 1024 bytes of the file.
find_dup_files.full_hash(file)
Returns the MD5 hash value for all of the file contents.
find_dup_files.scan_by_size(directories)
Scans the directories, building a dictionary of lists from which a list of lists is extracted and returned. Each sublist contains a set of file names whose files have the same size.
find_dup_files.filter_dupes(potentialDupes, hash_func)
Loops through a list of lists containing potential duplicate files. Uses the specified hash function to determine whether they do appear to be duplicates. If a list of potential duplicates contains multiple sets of duplicates, a new list is generated for each set.
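As a rough sketch of the behaviour described for filter_dupes, each list of candidates can be regrouped by hash value, keeping only the groups with more than one member. The body below is an assumption based on the description above, not the program's source, and the parameter is renamed `potential_dupes` for illustration.

```python
from collections import defaultdict


def filter_dupes(potential_dupes, hash_func):
    """Split each candidate list into sublists of files with equal hashes."""
    confirmed = []
    for group in potential_dupes:
        by_hash = defaultdict(list)  # hash value -> files producing it
        for path in group:
            by_hash[hash_func(path)].append(path)
        # One candidate list may hold several distinct sets of duplicates;
        # each set becomes its own list. Hashes seen once are dropped.
        confirmed.extend(paths for paths in by_hash.values() if len(paths) > 1)
    return confirmed
```

Because the hash function is passed in as an argument, the same routine can serve both passes: once with first_hash on the size-based lists, then again with full_hash on whatever survives.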