# README
## Duplicate file finder

This project consists of a duplicate file finder written in Go. The tool compares the MD5 hashes of the files in a given location to find duplicates.
## Versions

As a learning project, it also introduces different approaches, listed from the most basic to the most advanced:
### 1. Sequential processing

In this approach the files are processed sequentially; no concurrency is used.
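A minimal sketch of what this could look like (the function names and error handling here are illustrative assumptions, not the repository's exact code): walk the tree, hash each file with MD5 and group the paths by hash.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// hashFile returns the hex-encoded MD5 hash of a single file.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	root := os.Args[1]
	// Map from hash to every path that produced it; more than one path means duplicates.
	byHash := make(map[string][]string)

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		sum, hashErr := hashFile(path)
		if hashErr != nil {
			return hashErr
		}
		byHash[sum] = append(byHash[sum], path)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for sum, paths := range byHash {
		if len(paths) > 1 {
			fmt.Println(sum, paths)
		}
	}
}
```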
### 2. Concurrent approach

Channels are used to pass data between the goroutines and the main goroutine:

- `filePathMetaDataChan`
- `pairChan`
- `doneChan`
- `resultChan`
The goroutines are:
`getPairs`

- It gets started as many times as the number of workers.
- It gets a file location from `filePathMetaDataChan`, calculates the hash and sends it to `pairChan` along with the `filePathMetadata`.
- When all files are processed, each goroutine sends a done signal to `doneChan`.
`collectHashMap`

- Collects the pairs from `pairChan` and puts them in a map.
- When `pairChan` is closed, the results are sent back to `resultChan`.
Explanation (see the sketch after these lists):

- A number of workers is created based on the number of CPUs.
- The main goroutine starts the goroutines (`getPairs` is started as many times as there are workers), which wait until `filePathMetaDataChan` receives input.
- It then starts the `searchTree` function to send the file locations to `filePathMetaDataChan`.
- Once all the files have been sent, the `filePathMetaDataChan` channel is closed.
- This causes the `getPairs` goroutines to finish their work and send the done signal to `doneChan`.
- Once a number of done signals equal to the number of workers has been received, `pairChan` is closed.
- This causes the `collectHashMap` goroutine to finish its work and send the results to `resultChan`.
In short:

- Concurrent file discovery with wait groups.
- Channels to pass the discovered files down to the hash calculation.
- A channel for receiving a file, calculating its hash and passing the hash on.
- A channel for putting each hash in a map, waiting for every hash to be calculated, and sending done.
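A rough, self-contained sketch of how these channels and goroutines could fit together. The channel and function names follow the description above, but details such as the `hashFile` helper and the error handling are assumptions rather than the repository's exact code.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"runtime"
)

// pair couples a file path with its computed hash.
type pair struct {
	hash string
	path string
}

// hashFile returns the hex-encoded MD5 hash of one file (assumed helper).
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// getPairs drains the paths channel, hashing each file and emitting a pair.
func getPairs(paths <-chan string, pairs chan<- pair, done chan<- bool) {
	for p := range paths {
		if h, err := hashFile(p); err == nil {
			pairs <- pair{hash: h, path: p}
		}
	}
	done <- true // this worker has seen the paths channel close
}

// collectHashMap groups paths by hash and returns the map once pairs is closed.
func collectHashMap(pairs <-chan pair, result chan<- map[string][]string) {
	m := make(map[string][]string)
	for p := range pairs {
		m[p.hash] = append(m[p.hash], p.path)
	}
	result <- m
}

func main() {
	workers := runtime.NumCPU()
	paths := make(chan string)               // plays the role of filePathMetaDataChan
	pairs := make(chan pair)                 // pairChan
	done := make(chan bool)                  // doneChan
	result := make(chan map[string][]string) // resultChan

	for i := 0; i < workers; i++ {
		go getPairs(paths, pairs, done)
	}
	go collectHashMap(pairs, result)

	// searchTree equivalent: walk the tree and feed every file path to the workers.
	filepath.WalkDir(os.Args[1], func(p string, d fs.DirEntry, err error) error {
		if err == nil && !d.IsDir() {
			paths <- p
		}
		return nil
	})
	close(paths)

	for i := 0; i < workers; i++ {
		<-done // wait for every worker before closing pairChan
	}
	close(pairs)

	for hash, files := range <-result {
		if len(files) > 1 {
			fmt.Println(hash, files)
		}
	}
}
```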
### 3. Improved concurrent approach

In this approach some improvements are made to the previous concurrent approach:

- The `filePathMetaDataChan` becomes a buffered channel, so that the main goroutine is not blocked until the next hash is calculated.
- `searchTree` starts a goroutine per subdirectory. As the number of directories is unknown, the channel is closed after all those goroutines have finished, by using a `sync.WaitGroup`.
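A sketch of these two changes, with an assumed buffer size and an assumed recursive directory walk; the real `searchTree` may differ in its details.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

const bufferSize = 128 // assumed buffer capacity, not taken from the repository

// searchTree sends every file under dir to paths and spawns one goroutine per subdirectory.
func searchTree(dir string, paths chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()

	entries, err := os.ReadDir(dir)
	if err != nil {
		return
	}
	for _, e := range entries {
		full := filepath.Join(dir, e.Name())
		if e.IsDir() {
			wg.Add(1)
			go searchTree(full, paths, wg) // one goroutine per subdirectory
		} else {
			paths <- full
		}
	}
}

func main() {
	// Buffered, so the directory walkers are not blocked on every single file.
	paths := make(chan string, bufferSize)

	var wg sync.WaitGroup
	wg.Add(1)
	go searchTree(os.Args[1], paths, &wg)

	// Close the channel only after every directory goroutine has finished.
	go func() {
		wg.Wait()
		close(paths)
	}()

	for p := range paths {
		fmt.Println(p) // in the real tool these paths would feed the hash workers
	}
}
```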
### 4. Improved concurrent approach with many more workers
This approach is similar to the previous one, but with a higher number of workers.
- A channel `limitChan` is implemented so that the number of workers is limited.
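A minimal sketch of this idea, using a buffered channel as a counting semaphore; the limit value and the placeholder work inside the goroutine are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const limit = 32 // assumed upper bound on simultaneously running workers
	limitChan := make(chan struct{}, limit)

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()

			limitChan <- struct{}{}        // acquire a slot; blocks while `limit` workers are busy
			defer func() { <-limitChan }() // release the slot when this worker finishes

			fmt.Println("processing item", n) // placeholder for hashing a file
		}(i)
	}
	wg.Wait()
}
```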
## Benchmark
Finally, a benchmark is performed to compare the different approaches. It can easily be run by executing the `run.sh` script.
## Additional resources
The project comes with a bash script, `create_random_files.sh`, to create random files for testing purposes.