File deduplication isn't just for massive storage arrays or backup systems; it can be a practical tool for personal or server setups too. In this post, I'll explain how I use hardlinking to reduce disk usage on my Linux system, which directories are safe (and unsafe) to link, why I'm OK with the trade-offs, and how I automated it with a simple monthly cron job using a neat tool called hadori.
What Is Hardlinking?
In a traditional filesystem, every file has an inode, which is essentially its real identity: the data on disk. A hard link is a different filename that points to the same inode. That means:
- The file appears to exist in multiple places.
- But there’s only one actual copy of the data.
- Deleting one link doesn't delete the content, unless it's the last one.
Compare this to a symlink, which is just a pointer to a path. A hardlink is a pointer to the data.
So if you have 10 identical files scattered across the system, you can replace them with hardlinks, and boom: nine of them stop taking up extra space.
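To make that concrete, here is a minimal demonstration you can run in a scratch directory (the file names are just placeholders):
echo "same content" > original.txt
ln original.txt hardlink.txt                  # a second name for the same inode
ln -s original.txt symlink.txt                # a symlink, for comparison: it points to the path, not the data
ls -li original.txt hardlink.txt symlink.txt  # the first two show the same inode number in the first column
stat -c '%h links, inode %i: %n' original.txt # the hard link count is now 2
Deleting hardlink.txt afterwards leaves original.txt and its data untouched; only removing the last remaining name frees the blocks.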
Why Use Hardlinking?
My servers run a fairly standard Ubuntu install, and like most Linux machines, the root filesystem accumulates a lot of identical binaries and libraries, especially across /bin, /lib, /usr, and /opt.
That's not a problem... until you're tight on disk space, or you're just a curious nerd who enjoys squeezing every last byte.
In my case, I wanted to reduce disk usage safely, without weird side effects.
Hardlinking is a one-time cost with ongoing benefits. It's not compression. It's not archival. But it's efficient and non-invasive.
Which Directories Are Safe to Hardlink?
Hardlinking only works within the same filesystem, and not all directories are good candidates.
Safe directories:
/bin, /sbin - system binaries
/lib, /lib64 - shared libraries
/usr, /usr/bin, /usr/lib, /usr/share, /usr/local - user-space binaries, docs, etc.
/opt - optional, manually installed software
These contain mostly static files: compiled binaries, libraries, man pages... not something that changes often.
Unsafe or risky directories:
/etc - configuration files, might change frequently
/var, /tmp - logs, spools, caches, session data
/home - user files, temporary edits, live data
/dev, /proc, /sys - virtual filesystems, do not touch
If a hardlinked file is modified in place, every link sees the change, because there is only one copy of the data; ext4 and XFS do not copy-on-write here. And if a tool replaces the file instead (writing a new file and renaming it over the old name), the new file gets its own inode and the deduplication quietly disappears. Either way you're back where you started, or worse, sharing data you didn't mean to.
That's why I avoid any folders with volatile, user-specific, or auto-generated files.
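Both failure modes are easy to reproduce in a scratch directory (placeholder file names again):
echo "v1" > shared.txt
ln shared.txt other-name.txt
echo "v2" >> shared.txt           # in-place edit: other-name.txt sees the new line too
cat other-name.txt
echo "v3" > replacement.txt
mv replacement.txt shared.txt     # write-new-then-rename: shared.txt now has a fresh inode
ls -li shared.txt other-name.txt  # the inode numbers differ again, so the data is duplicated once more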
Risks and Limitations
Hardlinking is not magic. It comes with sharp edges:
- One inode, multiple names: All links are equal. Editing one changes the data for all.
- Backups: Some backup tools don't preserve hardlinks or treat them inefficiently.
  - Duplicity, which I use, does not preserve hardlinks. It backs up each linked file as a full copy, so hardlinking won't reduce backup size.
- Security: Linking files with different permissions or owners can have unexpected results.
- Limited scope: Only works within the same filesystem (e.g., you can't link across / and /mnt if they're on separate partitions).
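Checking whether candidate directories even share a filesystem is a one-liner; df prints the filesystem each path lives on, and only paths on the same one can be hardlinked together:
df /bin /usr /opt /home           # identical Filesystem / Mounted on values mean hardlinks can span those paths
stat -c '%d %n' /bin /usr /opt    # alternative: %d is the device number, which must match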
In my setup, I accept those risks because:
- I’m only linking read-only system files.
- I never link config or user data.
- I don't rely on hardlink preservation in backups.
- I test changes before deploying.
In short: I know what I'm linking, and why.
What the Critics Say About Hardlinking
Not everyone loves hardlinks, and for good reasons; thoughtful critiques have been written about them. The core arguments:
- Hardlinks violate expectations about file ownership and identity.
- They can break assumptions in software that tracks files by name or path.
- They complicate file deletion logicโdeleting one name doesn’t delete the content.
- They confuse file monitoring and logging tools, since it's hard to tell if a file is "new" or just another name.
- They increase the risk of data corruption if accidentally modified in-place by a script that assumes it owns the file.
Why I'm still OK with it:
These concerns are valid, but they mostly apply to:
- Mutable files (e.g., logs, configs, user data)
- Systems with untrusted users or dynamic scripts
- Software that relies on inode isolation or path integrity
In contrast, my approach is intentionally narrow and safe:
- I only deduplicate read-only system files in /bin, /sbin, /lib, /lib64, /usr, and /opt.
- These are owned by root, and only changed during package updates.
- I don't hardlink anything under /home, /etc, /var, or /tmp.
- I know exactly when the cron job runs and what it targets.
So yes, hardlinks can be dangerous, but only if you use them in the wrong places. In this case, I believe I'm using them correctly and conservatively.
Does Hardlinking Impact System Performance?
Good news: hardlinks have virtually no impact on system performance in everyday use.
Hardlinks are a native feature of Linux filesystems like ext4 or xfs. The OS treats a hardlinked file just like a normal file:
- Reading and writing hardlinked files is just as fast as normal files.
- Permissions, ownership, and access behave identically.
- Common tools (ls, cat, cp) don't care whether a file is hardlinked or not.
- Filesystem caches and memory management work exactly the same.
The only difference is that multiple filenames point to the exact same data.
Things to keep in mind:
- If you edit a hardlinked file, all links see that change because there's really just one file.
- Some tools (backup, disk usage) might treat hardlinked files differently.
- Debugging or auditing files can be slightly trickier since multiple paths share one inode.
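When that does come up, the standard tools are enough to untangle it; the path below is only an example:
ls -li /usr/bin/perl                      # first column is the inode number; the count after the permissions is the number of links
find /usr -xdev -samefile /usr/bin/perl   # every name on this filesystem that shares that inode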
But from a performance standpoint? Your system wonโt even notice the difference.
Tools for Hardlinking
There are a few tools out there:
fdupes - finds duplicates and optionally replaces them with hardlinks
rdfind - more sophisticated detection
hardlink - simple but limited
jdupes - high-performance fork of fdupes
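For orientation, typical invocations look roughly like the following; treat the exact flags as assumptions and check each tool's man page, since they vary between versions and distributions:
fdupes -r /opt                    # list duplicate files recursively
jdupes -r -L /opt                 # replace duplicates with hardlinks
rdfind -makehardlinks true /opt   # rdfind's equivalent
hardlink /opt                     # util-linux hardlink links duplicates in the given directory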
About Hadori
From the Debian package description:
This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it’s called “Hardlinking DOne RIght”.
Advantages over other tools:
- Predictability: arguments are scanned in order, and for each set of duplicates the first version encountered is the one that is kept
- Much lower CPU and memory consumption compared to alternatives
This makes hadori especially suited for system-wide deduplication where efficiency and reliability matter.
How I Use Hadori
I run hadori once per month with a cron job. Here's the actual command:
/usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
This scans those directories, finds duplicate files, and replaces them with hardlinks when safe.
And here's the crontab entry I installed in the file /etc/cron.d/hadori:
@monthly root /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
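To see what a run actually achieved, a rough check might look like this (not part of the cron job itself):
df -h /                                     # overall root filesystem usage, compared before and after a run
find /usr -xdev -type f -links +1 | wc -l   # how many files now share an inode with at least one other name
du -sh /usr/lib /usr/share                  # du counts each inode once, so the savings show up here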
What Are the Results?
After the first run, I saw a noticeable reduction in used disk space, especially in /usr/lib and /usr/share. On my modest VPS, that translated to about 300 to 500 MB saved; not huge, but non-trivial for a small root partition.
While this doesn't reduce my backup size (Duplicity doesn't preserve hardlinks), it still helps with local disk usage and keeps things a little tidier.
And because the job only runs monthly, it's not intrusive or performance-heavy.
Final Thoughts
Hardlinking isn't something most people need to think about. And frankly, most people probably shouldn't use it.
But if you:
- Know what youโre linking
- Limit it to static, read-only system files
- Automate it safely and sparingly
...then it can be a smart little optimization.
With a tool like hadori, it's safe, fast, and efficient. I've read the horror stories, and decided that in my case, they don't apply.
This post was brought to you by a monthly cron job and the letters i-n-o-d-e.