Some people collect different kinds of files. Some may collect viruses. Some may collect pictures of fine art, buildings or stamps. I've even heard rumors that some people collect pictures of nude women(!).
After a while, such collections may contain large numbers of files, and some of the content may be duplicated in different locations or under different file-names.
Dupemgr is meant to be an aid for collectors to identify duplicate files, and to handle them. It can also be used to identify missing files in a collection (by comparing the data-file of one collection with the data-file(s) of other collections).
- Copy only new (unique) files to a collection
- Identify and delete duplicates in a collection
- Identify and softlink duplicates in a collection
- Manipulation of its own data-files
(Collections in this sense are any files stored in one or more directories and sub-directories).
- Easy to use, command-line utility.
- Handling of existing file-collections, where duplicates can be identified and then replaced by soft-links or deleted.
- Handling of new files, where only unique files are copied to the destination-folder(s).
- Identification of files based on calculated hash-values rather than file-names.
- Since it can take considerable time to build hash-values for a large collection of files, the internal cache can be stored on disk in an open format (XML) and then re-used.
- To help collectors exchange unique files, the program must be able to handle several cache-files (data-files), so that you can compare your collection with someone else's collection simply by exchanging the cache-files. The same mechanism can be used to build blacklists of unwanted files, or of files you have stored off-line.
- Fast. The program must be fast and efficient. To accomplish that, it's written in C++, and it stores its working-data in memory rather than in a traditional database. The time-consuming operations are scanning large directories and reading all the files to calculate their hash-values.
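The cache-file comparison described in the list above boils down to set operations on hash-values. Here is a minimal sketch in Python (not dupemgr's own C++ code), assuming each cache has already been reduced to a dictionary mapping hash-value to path. The function names `missing_from` and `filter_new` are made up for illustration and are not part of dupemgr.

```python
def missing_from(mine, theirs):
    """Return the entries of `theirs` whose hash-value is absent from `mine`.

    Both arguments are dicts mapping hash-value -> path (an assumed layout).
    """
    return {h: path for h, path in theirs.items() if h not in mine}


def filter_new(candidates, *known_caches):
    """Keep only candidates whose hash-value appears in none of the known caches.

    The known caches can be your own collection, another collector's
    cache, or a blacklist of unwanted files.
    """
    known = set().union(*known_caches)  # union of all known hash-values
    return {h: path for h, path in candidates.items() if h not in known}


if __name__ == "__main__":
    mine = {"a1": "art/monet.jpg", "b2": "art/degas.jpg"}
    theirs = {"b2": "pics/degas_copy.jpg", "c3": "pics/klimt.jpg"}
    blacklist = {"c3": "unwanted/klimt.jpg"}
    print(missing_from(mine, theirs))           # files they have that I lack
    print(filter_new(theirs, mine, blacklist))  # same, minus blacklisted files
```

Because only hash-values are compared, the files themselves never need to be exchanged, and a differing file-name does not hide a duplicate.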
How it works
Dupemgr's core is a fast lookup-cache of MD5 (or optionally, SHA1) hash-values representing the content of files on disk. When you run the program, you give it one or more paths to work with. It searches for all the files within those paths and reads them one by one to calculate a hash-value that identifies each file's content. It is theoretically possible for two different files to give the same hash-value, but in practice it is very unlikely to happen. (If this is a concern for you, use SHA1 rather than MD5.)
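To make the hashing step concrete, here is a minimal sketch in Python using the standard `hashlib` module. It is an illustration of the idea, not dupemgr's C++ implementation; the function name `file_hash` and the 1 MiB chunk size are assumptions.

```python
import hashlib


def file_hash(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file's content.

    The file is read in chunks so that large files do not have to fit
    in memory. `algorithm` may be "md5" or "sha1".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical content produce the same digest regardless of their names or locations, which is what makes the hash-value usable as an identity for the content.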
The cache also contains the path, size and last modification time for each file that was found. Once the cache is built, dupemgr can perform operations on it. It can remove duplicate files, or soft-link duplicate files to one “master” file. It can also copy only unique files from one or more source-folders to a destination-folder.
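The cache and the duplicate operations can be sketched along these lines in Python. This is a simplified illustration, not dupemgr's actual code: the entry layout, the function names, and the rule that the first path in sorted order becomes the "master" are all assumptions, and files are read whole for brevity.

```python
import hashlib
import os
from collections import defaultdict


def build_cache(root):
    """Scan `root` and return one entry (path, size, mtime, hash) per file."""
    cache = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # existing soft-links are not duplicates to handle
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            st = os.stat(path)
            cache.append({"path": path, "size": st.st_size,
                          "mtime": st.st_mtime, "hash": digest})
    return cache


def find_duplicates(cache):
    """Group paths by hash-value; keep only groups with more than one path."""
    groups = defaultdict(list)
    for entry in cache:
        groups[entry["hash"]].append(entry["path"])
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}


def softlink_duplicates(duplicates):
    """Replace every duplicate except the first with a soft-link to the first.

    This deletes files, so try it on a copy of your collection first.
    """
    for paths in duplicates.values():
        master, *copies = paths
        for copy in copies:
            os.remove(copy)
            os.symlink(os.path.abspath(master), copy)
```

Deleting duplicates instead of linking them would be the same loop without the `os.symlink` call.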
In addition to this, dupemgr can save the cache to an XML data-file of your choice. Later on, you can tell it to load this file:
- To skip the time-consuming scanning and hashing process.
- To list the files in your cache.
- To search for a file by its content or hash-value.
- To manipulate the cache itself.
- To avoid copying files that exist off-line or in another collector's collection.
The modified cache can of course be saved to a disk-file as well.
Since the data-files are stored in XML-format, it's trivial to manipulate them with other programs as well.
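As an illustration of how little code such a program needs, here is a sketch in Python that saves and loads a cache with the standard `xml.etree.ElementTree` module. The element and attribute names (`cache`, `file`, `path`, `size`, `mtime`, `hash`) are invented for this example; dupemgr's real data-file layout is not described here and may differ, so check an actual data-file before adapting it.

```python
import xml.etree.ElementTree as ET


def save_cache(cache, filename):
    """Write cache entries to an XML file (hypothetical layout)."""
    root = ET.Element("cache")
    for entry in cache:
        ET.SubElement(root, "file",
                      path=entry["path"],
                      size=str(entry["size"]),
                      mtime=str(entry["mtime"]),
                      hash=entry["hash"])
    ET.ElementTree(root).write(filename, encoding="utf-8",
                               xml_declaration=True)


def load_cache(filename):
    """Read the entries back into a list of dicts."""
    return [{"path": el.get("path"),
             "size": int(el.get("size")),
             "mtime": float(el.get("mtime")),
             "hash": el.get("hash")}
            for el in ET.parse(filename).getroot().iter("file")]
```

Once loaded, the entries are ordinary dictionaries, so listing the cache or searching it by hash-value is a one-line filter over the list.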
Note that this program can delete files on your hard-disk. If you are a collector, the files probably are precious to you. You should therefore always have a backup of your files somewhere else before running dupemgr or any other program that may delete parts of your collection. In my own experience, it's also a good idea to drink beer after messing with programs like this...
- Linux and other *nix platforms with a decent C++ compiler and the boost and libgcrypt libraries.
Dupemgr is free, licensed under the GPL (version 3). See www.fsf.org/licensing/licenses/gpl.html for details. The source code is available for download.
You can contribute to the project if you want.
- Spread the word. Free software is not advertised on TV. An application's success depends on its users spreading the word. If you like the program, tell people about it. Write a blog-entry about it. Link to it. Suggest it whenever someone needs an application like this.
- Report problems and suggest features. This is the most valuable contribution anyone can provide. Good products are not created by brilliant developers, but rather by ordinary people who get over their frustration with the product and suggest improvements, rather than just throwing it away.
- Fix bugs. There will always be bugs in software. If you find one, and give me the recipe to kill it, you provide a very important contribution.
- Find bugs. If you find a bug, but cannot fix it, you contribute to the project by providing me with an exact recipe for reproducing the problem.
Current release: version 0.1 [2009-02-20]