Find duplicate files in a directory

in code


When I photographed heavily/professionally, I was rigorous in how I handled my imported raw files, and master processed (PSD/XCF) files. I was much less rigorous in how I sorted and stored my processed JPG files, to the point that I’ve found several directories with anywhere between hundreds and thousand of images, some or many of them straight duplicates.

For the hell of it, and also because I haven’t touched C# since early 2013, I drew up a simple console application in C# to search for duplicate file in a given directory. I made a good start on it in Bash, but…fuck. Bash is slow and interacting with arrays in Bash leaves me wanting to murder somebody.

Order of the program:

  1. Check directory was provided. Check directory exists. Check it has more than one file.
  2. Get list of files in directory.
  3. Generate MD5 checksums for each given file.
  4. For each checksum:
    i. Check each file after this in the list to see if it has the same sum.
    ii. If a duplicate is found, check if it is on the recorded dupe list.
    iii. If it isn’t on the dupe list, add it.
  5. Run through the file list once for each dupe checksum. Print all file names with the same checksum.

I need to find more little projects like this in C#; it was fun to dust off what I knew.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Security.Cryptography;

public class findDupes {
    static void Main(string[] args) {
        CheckBeforeProceeding(args);

        string[] files = Directory.GetFiles(args[0]);
        List<string> filesums = new List<string>();

        foreach (string file in files)
            filesums.Add(GetFileSum(file));

        List<string> dupes = SearchForDupes(filesums);
        PrintDupes(filesums, dupes, files);
    }

    static void PrintDupes(List<string> sums, List<string> dupes, string[] files) {
        // Print output.
        foreach (string dupe in dupes) {
            Console.WriteLine("{0}\n----------", dupe);

            for (int i = 0; i <= (files.Length - 1); i++)
                if (sums[i] == dupe)
                    Console.WriteLine(files[i]);

            Console.WriteLine();
        }
    }

    static List<string> SearchForDupes(List<string> sums) {
        // Search for duplicate files within the given list of sums.
        List<string> dupes = new List<string>();

        for (int i = 0; i <= (sums.Count - 2); i++)
            for (int j = (i + 1); j <= (sums.Count - 2); j++)
                if (sums[i] == sums[j])
                    if (!dupes.Contains(sums[i]))
                        dupes.Add(sums[i]);

        return dupes;
    }

    static void CheckBeforeProceeding(string[] args) {
        // Check things are good with the target dir before proceeding.
        if (args.Length == 0) {
            Console.WriteLine("Error: No directory provided");
            Environment.Exit(1);
        }

        if (!Directory.Exists(args[0])) {
            Console.WriteLine("Error: '{0}' is not a valid directory", args[0]);
            Environment.Exit(2);
        } 

        if (Directory.GetFiles(args[0]).Length == 0) {
            Console.WriteLine("Error: '{0}' does not contain any files", args[0]);
            Environment.Exit(3);
        }

        if (Directory.GetFiles(args[0]).Length == 1) {
            Console.WriteLine("Error: '{0}' only contains 1 file", args[0]);
            Environment.Exit(3);
        }
    }

    static string GetFileSum(string file) {
        // Function scalped from http://stackoverflow.com/a/10520086/1433400
        using (var sum = MD5.Create())
            using (var stream = File.OpenRead(file))
                return BitConverter.ToString(sum.ComputeHash(stream)).Replace("-","").ToLower();
    }
}

Here is some example output:

[mark][new_instagram] $ ~/dupe_find.exe .
d89b812d61bb41d037b1e6704d146d11
----------
./1.jpg
./17.jpg

4c6cc2794010fe3b018038b0e1162744
----------
./132.jpg
./312.jpg

...

Neater output from the program is left as an exercise to the reader.



Remap a JavaScript Object

in code

Ik ben gelukkig

in me

Sacred Dissatisfaction

in running


Your email address will not be published. Required fields are marked *