The find_dup_files.py program could really use a more clever name, but it does what its name says. It is a simple but useful program that scans a set of directory trees looking for duplicate files. When duplicate files are found, it lists them and prompts the user to choose how to handle the set. The user may skip the set of duplicates, keep one of the files and delete the rest, delete all of the files, or exit the program.
This program is unlikely to be needed very often, but it can really help with cleaning up and organizing files that have accumulated over many years. I find the program especially useful with digital pictures that have been maintained by several people over several years. Duplicate files can also crop up as files are copied between computers.
For the student wanting to learn Python programming by example, this program demonstrates the use of dictionaries, lists, and various file and directory operations. The combined use of dictionaries and lists demonstrated here is a common strategy for finding how often each value occurs in a set of data.
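The dictionary-of-lists strategy can be sketched on its own, apart from any file handling. This is a minimal illustration, not code taken from find_dup_files.py; the function name group_by_key and the sample word list are made up for the example.

```python
from collections import defaultdict


def group_by_key(items, key):
    """Map each key value to the list of items that produced it.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


words = ["cat", "dog", "bird", "fish", "ox"]

# Dictionary: word length -> list of words with that length.
by_length = group_by_key(words, len)

# List of lists: keep only the lengths that occur more than once.
repeated = [group for group in by_length.values() if len(group) > 1]
```

The length of each list is the frequency of its key, so the same structure answers both "how often does this value occur?" and "which items share it?".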
The program needs one or more directory names as arguments to begin searching for duplicates.
The program begins by evaluating each file by its file size. As the file system is scanned, a dictionary is built that maps each file size to a list of the files with that size. From this dictionary, a list of lists is built, with one inner list for each set of one or more files that are the same size.
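The size-grouping step might look roughly like the sketch below. It is not the program's actual source: the function name group_files_by_size is hypothetical, and the choices to skip symbolic links, ignore unreadable files, and drop sizes that occur only once (a file with a unique size cannot have a duplicate) are assumptions made for the example.

```python
import os
from collections import defaultdict


def group_files_by_size(directories):
    """Return lists of paths that share a file size (hypothetical helper)."""
    sizes = defaultdict(list)  # file size in bytes -> list of paths
    for top in directories:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Assumption: only regular files are compared, not links.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                try:
                    sizes[os.path.getsize(path)].append(path)
                except OSError:
                    # Assumption: files that cannot be examined are skipped.
                    continue
    # Assumption: keep only sizes shared by at least two files.
    return [paths for paths in sizes.values() if len(paths) > 1]
```

Reading a file's size only requires its directory metadata, not its contents, which is why this is a cheap first filter before any hashing.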
Next, the sets of potential duplicates are evaluated twice by calculating MD5 hash values. In the first pass, we only process the first 1024 bytes of each file. The reason for this abbreviated pass is that many files may have the same size, and calculating hash values over whole files is fairly slow. The first pass therefore makes a quick determination that eliminates many non-duplicates. To be certain that the remaining files are duplicates, the second pass compares each set by calculating the MD5 hash value of the whole file.
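The two hashing passes can be sketched as follows, using the standard hashlib module. This is one possible arrangement under stated assumptions rather than the program's real code: the names md5_of_file, split_by_hash, and find_duplicates, the chunk_size parameter, and the 64 KB read size are all invented for the illustration. Only the 1024-byte first pass and the whole-file second pass come from the description above.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, limit=None, chunk_size=65536):
    """Hex MD5 of the first `limit` bytes, or of the whole file if None."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(to_read)
            if not chunk:
                break  # end of file
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def split_by_hash(group, limit=None):
    """Split one candidate group into subgroups whose hashes match."""
    by_hash = defaultdict(list)  # hex digest -> list of paths
    for path in group:
        by_hash[md5_of_file(path, limit)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


def find_duplicates(size_groups):
    """Apply the quick 1024-byte pass, then the whole-file pass."""
    duplicates = []
    for group in size_groups:
        for candidates in split_by_hash(group, limit=1024):
            duplicates.extend(split_by_hash(candidates))
    return duplicates
```

In this sketch, find_duplicates takes a list of same-size groups and returns only the groups that still match after both passes. Files that differ somewhere in their first 1024 bytes are dropped after a single short read, so the whole-file hash is computed only for the candidates that survive.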