Scala code to find (and move or remove) duplicate files

My MacBook recently told me I was running out of disk space. I knew that the way I was backing up my iPhone left me with multiple copies of photos and videos, so I finally decided to fix that problem by getting rid of all of the duplicate copies of those files.

So I wrote a little Scala program to find all the duplicates and move them to another location, where I could review them before deleting them. The short story is that I started with over 28,000 photos and videos, and the code shown below helped me find nearly 5,000 duplicates under my ~/Pictures directory that were taking up over 18GB of storage space.

Using a checksum to find identical/duplicate files

A key part of the algorithm is that I find duplicate files by computing a checksum for each file, specifically by using the SHA-1 algorithm with the Java MessageDigest class. SHA-1 is the same algorithm Git uses, so I figured it would work well here.

I did this because I didn’t want to assume that two files were the same just because they had the same filename. Because I had images and videos from both my old iPad and several iPhones, it was very likely that files in different directories had the same name.
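
To show the core idea in isolation, here’s a minimal sketch of that kind of checksum code. (sha1Hex is just a name I made up for this example, and this version reads the whole file into memory; the full program below streams each file instead, which is a better fit for large video files.)

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// compute a file's SHA-1 checksum as a lowercase hex string.
// note: this reads the entire file into memory, so it's only a demo
def sha1Hex(filepath: String): String = {
    val bytes = Files.readAllBytes(Paths.get(filepath))
    MessageDigest.getInstance("SHA-1")
                 .digest(bytes)
                 .map(b => f"$b%02x")
                 .mkString
}

Two files are duplicates exactly when a function like that returns the same string for both of them, regardless of their filenames or locations.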

The source code

The source code is admittedly crappy, but I hope it can serve as a starting point in case you want to find and move (or remove) duplicate files on your own hard drive.

A few keys to understanding the code are:

  • I created the /Users/al/Pictures/all_picture_files.txt file by running find . -type f > all_picture_files.txt in the /Users/al/Pictures directory.
  • That command creates filenames like ./00913_idyll_1440x900.jpg, so the code removes those first two characters.
  • Early in my process, when a checksum algorithm I found on the internet ran very slowly, I found that a Scala ParVector worked much faster than a Vector in Step 2, and I kept that ParVector in the final code shown here. I didn’t test whether it still helps with the much faster checksum algorithm in this code. (There’s a small ParVector sketch just after this list.)
  • In Step 3 I create a mutable Map where the checksum is the key and the value is a list of the filenames that have that checksum. (An immutable groupBy alternative is also sketched after this list.)
  • In Step 4 I write all of the checksums and filenames to a data file. That step is totally optional, and I used it initially for debugging/verifying my code.
  • Step 5 contains some very dangerous code. For each record in the Map it looks at the list of filenames, and if there’s more than one filename in the list, it keeps the first entry and moves every remaining entry to another directory. I could have deleted the duplicate files at this point, but I wanted to manually confirm that the files were indeed duplicates, so I used this approach.
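
To show what that ParVector change looks like on its own, here’s a tiny sketch. The filenames and the toUpperCase call are just stand-ins for the real, slow checksum work:

import scala.collection.parallel.immutable.ParVector

// calling .par on a Vector returns a ParVector, and operations like `map`
// then run across multiple threads. (this works out of the box in Scala 2.12;
// see the note about Scala 2.13 and newer at the end of this article)
val results: ParVector[String] =
    Vector("a.jpg", "b.jpg", "c.jpg").par
                                     .map(f => f.toUpperCase)  // stand-in for the checksum call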
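
And regarding the mutable Map in Step 3, the same checksum-to-filenames Map can be built immutably with groupBy. I didn’t use this in the final program, but a sketch of it, using the same variable names as the code below, looks like this:

// group the (checksum, filename) tuples by checksum, then keep just
// the filenames in each group
val checksumFilenamesMap: Map[String, Seq[String]] =
    checksumFilenameTuples.toVector
                          .groupBy { case (cksum, _) => cksum }
                          .map { case (cksum, tuples) => (cksum, tuples.map(_._2)) }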

As I mentioned, in the end this algorithm helped me remove nearly 5,000 files and I was able to reclaim more than 18GB of storage space.

As I also mentioned, this code is very crappy in many ways, so use it at your own risk. I don’t handle errors in several places because I never ran into those problems, and because I don’t plan to use this code again any time soon, I didn’t want to spend any more time on it.

Given all of that discussion and those warnings, here’s the Scala source code I wrote to find duplicate files and move them to another directory:

import java.io.{BufferedInputStream, File, FileInputStream, IOException, PrintWriter}
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.security.{DigestInputStream, MessageDigest}
import scala.collection.parallel.immutable.ParVector
import scala.io.Source

/**
  * WARNING: Moving files is a dangerous operation, and this code does not
  *          rigorously check for errors. Use this code at your own risk.
  *
  * Notes
  * -----
  * - SHA-1 is what Git uses
  * - this code assumes you have a list of files generated by `find . -type f` inside
  *   your pictures folder on a Mac computer
  */
object FindDuplicateFiles extends App {

    println("Getting all filenames ...")


    // (1) get a list of all picture files
    val basePictureDir = "/Users/al/Pictures"
    val inputFile = "/Users/al/Pictures/all_picture_files.txt" //mine has 28k lines
    val rawListOfPictureFiles = Source.fromFile(inputFile).getLines.toVector
    // filenames begin with "./", so drop those first two characters
    val canonListOfPhotos = rawListOfPictureFiles.map(fname => basePictureDir + "/" + fname.drop(2))
    println(s"Done reading ${canonListOfPhotos.size} filenames ...")


    // (2) get checksum for each file
    // early tests showed that a parallel vector was much faster than a regular vector
    // (though that was when the SHA-1 algorithm was very slow)
    val (checksumFilenameTuples: ParVector[(String, String)], time) = timer {
        println("Getting checksums ...")
        val parallelListOfPhotos = canonListOfPhotos.par
        val listOfChecksumsAndFilenames: ParVector[(String, String)] = for {
            f <- parallelListOfPhotos
            checksum = fileChecksum(f, MessageDigest.getInstance("SHA-1"))
        } yield (checksum, f)
        listOfChecksumsAndFilenames
    }
    println(s"Checksum run time: $time seconds")


    // (3) now I have a sequence of (checksum, filename) tuples, where a checksum
    // may appear multiple times; convert that to a Map(checksum -> Seq[filename])
    println("creating the map ...")
    val checksumFilenamesMap = scala.collection.mutable.Map[String, Seq[String]]()
    for ((cksum, fname) <- checksumFilenameTuples.toVector) {
        if (checksumFilenamesMap.contains(cksum)) {
            println("FOUND A DUPLICATE")
            val oldSeq = checksumFilenamesMap(cksum)  // get the old seq
            val newSeq = oldSeq :+ fname              // add the new fname to that seq
            checksumFilenamesMap(cksum) = newSeq      // update that element in the map
        } else {
            // the map doesn't contain the checksum, so just add a new entry
            checksumFilenamesMap(cksum) = Seq(fname)
        }
    }


    // (4) (optional) write all the `Map(checksum, Seq[Filename])` to a file
    println("writing to the file ...")
    val pw = new PrintWriter(new File("/Users/al/Projects/Scala/DuplicateFiles/dups.dat"))
    for ((cksum,fnames) <- checksumFilenamesMap) {
        pw.write(s"cksum: $cksum, files: $fnames\n")
    }
    pw.close()


    // (5) keep one copy of the file, and move the duplicates to my "dups" folder
    val dupsFolder = "/Users/al/Desktop/DuplicateImages"
    for ((cksum,fnames) <- checksumFilenamesMap) {
        if (fnames.size > 1) {
            // there are duplicate files
            val filenamesToMove = fnames.drop(1)  //keep the first file, mv the rest
            for (originalCanonFilename <- filenamesToMove) {
                val baseFilename = (new File(originalCanonFilename)).getName  //foo.jpg
                // prefix the new filename with nanoTime so two duplicates that
                // have the same basename don't collide in the dups folder
                val nanoTime = System.nanoTime
                val newFilename = nanoTime + "-" + baseFilename        //123-foo.jpg
                val newCanonFilename = dupsFolder + "/" + newFilename
                // note: Files.move never returns null; it throws an exception
                // if the move fails, so catch that instead of testing its result
                try {
                    Files.move(
                        Paths.get(originalCanonFilename),
                        Paths.get(newCanonFilename),
                        StandardCopyOption.REPLACE_EXISTING
                    )
                    //println(s"moved the file $baseFilename successfully")
                } catch {
                    case ioe: IOException =>
                        System.err.println(s"could NOT move the file $baseFilename: ${ioe.getMessage}")
                }
            }
        }
    }


    // run a block of code, returning the block's result and its run time in seconds
    private def timer[A](blockOfCode: => A) = {
        val startTime = System.nanoTime
        val result = blockOfCode
        val stopTime = System.nanoTime
        val delta = stopTime - startTime
        (result, delta/1000000000d)
    }

    //TODO convert this method to use Option (or Try)
    private def fileChecksum(filepath: String, md: MessageDigest) = {
        // file hashing with DigestInputStream: the stream feeds every byte it
        // reads into the given MessageDigest. note that an IOException is
        // deliberately allowed to propagate here, rather than being swallowed
        // and silently returning the checksum of a partially-read file
        var dis: DigestInputStream = null
        try {
            // code is 132x slower w/o BufferedInputStream
            dis = new DigestInputStream(new BufferedInputStream(new FileInputStream(filepath)), md)
            // per the DigestInputStream javadoc, the digest is only updated by
            // bytes that are actually read, so read the entire stream
            while (dis.read != -1) {}
        }
        finally {
            if (dis != null) dis.close()
        }
        // return value
        convertBytesToHex(md.digest)
    }

    def convertBytesToHex(bytes: Seq[Byte]): String = {
        val sb = new StringBuilder
        for (b <- bytes) {
            sb.append(String.format("%02x", Byte.box(b)))
        }
        sb.toString
    }

}

That code doesn’t have any dependencies, so I ran it from my IDE without any build process, and therefore I don’t have any build files (such as a build.sbt) to share.
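
A related note: this code was written back when the parallel collections were part of the Scala standard library. My understanding is that with Scala 2.13 and newer, ParVector lives in the separate scala-parallel-collections module, and you also need import scala.collection.parallel.CollectionConverters._ to get the .par method. If you run the code with a tool like scala-cli, I believe directives along these lines at the top of the file would pull in that module (the version numbers are only examples):

//> using scala "2.13.12"
//> using dep "org.scala-lang.modules::scala-parallel-collections:1.0.4"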

The end

I can’t think of anything else to add to this discussion at this time, so I’ll leave it at that. If you need to find duplicate files on your filesystem, I hope this code can at least help point you in the right direction and save you a little time. I can confirm that it works as-is on my Mac, but I have no idea how it will work on Windows and Linux systems as I didn’t test it on those. Enjoy, and happy hacking!