🧱 Let’s Get Hard (Links): Deduplicating My Linux Filesystem with Hadori
File deduplication isn’t just for massive storage arrays or backup systems—it can be a practical tool for personal or server setups too. In this post, I’ll explain how I use hardlinking to reduce disk usage on my Linux system, which directories are safe (and unsafe) to link, why I’m OK with the trade-offs, and how I automated it with a simple monthly cron job using a neat tool called hadori.
🔗 What Is Hardlinking?
In a traditional filesystem, every file has an inode, which is essentially its real identity—the data on disk. A hard link is a different filename that points to the same inode. That means:
- The file appears to exist in multiple places.
- But there’s only one actual copy of the data.
- Deleting one link doesn’t delete the content, unless it’s the last one.
Compare this to a symlink, which is just a pointer to a path. A hardlink is a pointer to the data.
So if you have 10 identical files scattered across the system, you can replace them with hardlinks, and boom—nine of them stop taking up extra space.
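If you want to see this behaviour for yourself, a throwaway experiment in a scratch directory is enough (the filenames here are arbitrary):

```bash
# Create a file and give it a second name (a hard link)
echo "hello" > original.txt
ln original.txt alias.txt

# Both names show the same inode number and a link count of 2
ls -li original.txt alias.txt

# Removing one name does not remove the data
rm original.txt
cat alias.txt    # still prints "hello"
```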
🤔 Why Use Hardlinking?
My servers run a fairly standard Ubuntu install, and like most Linux machines, the root filesystem accumulates a lot of identical binaries and libraries, especially across /bin, /lib, /usr, and /opt.
That’s not a problem… until you’re tight on disk space, or you’re just a curious nerd who enjoys squeezing every last byte.
In my case, I wanted to reduce disk usage safely, without weird side effects.
Hardlinking is a one-time cost with ongoing benefits. It’s not compression. It’s not archival. But it’s efficient and non-invasive.
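If you want a rough idea of how much duplicate data is sitting in a tree before touching anything, a coreutils-only sketch like the following works; it hashes file contents, so it can take a while on a large /usr, and hadori and friends do a proper byte-for-byte comparison instead:

```bash
# Hash every regular file under /usr on this filesystem and
# print groups of files whose contents are identical.
# (Run as root if some files aren't world-readable.)
find /usr -xdev -type f -size +0 -print0 \
  | xargs -0 sha256sum \
  | sort \
  | uniq --check-chars=64 --all-repeated=separate
```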
📁 Which Directories Are Safe to Hardlink?
Hardlinking only works within the same filesystem, and not all directories are good candidates.
✅ Safe directories:
- /bin, /sbin – system binaries
- /lib, /lib64 – shared libraries
- /usr, /usr/bin, /usr/lib, /usr/share, /usr/local – user-space binaries, docs, etc.
- /opt – optional, manually installed software
These contain mostly static files: compiled binaries, libraries, man pages… not something that changes often.
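Since hardlinks can only pair up files that live on the same filesystem, it's also worth confirming that these directories all sit on one partition before going further. On my Ubuntu layout a plain df does the job:

```bash
# Each path should report the same filesystem / mount point
df /bin /sbin /lib /lib64 /usr /opt
```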
⚠️ Unsafe or risky directories:
- /etc – configuration files, might change frequently
- /var, /tmp – logs, spools, caches, session data
- /home – user files, temporary edits, live data
- /dev, /proc, /sys – virtual filesystems, do not touch
If a hardlinked file is later replaced, for example by a package update that writes a new file and renames it into place, that name gets a fresh inode, the link is broken, and you're back where you started for that file. If it's instead modified in place, every name sees the change, and you may end up sharing data you didn't mean to.
That’s why I avoid any folders with volatile, user-specific, or auto-generated files.
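To make that distinction concrete, here's a tiny throwaway demo (hypothetical scratch files) showing why in-place edits are the dangerous case, while replace-by-rename, which is what package managers do, merely undoes the saving for that file:

```bash
# Two names, one inode
echo "v1" > a.conf
ln a.conf b.conf

# In-place write: both names see the change, because there is only one file
echo "v2" > a.conf
cat b.conf           # prints "v2"

# Replace-by-rename: a.conf gets a brand-new inode, b.conf keeps the old data
echo "v3" > a.conf.new && mv a.conf.new a.conf
ls -li a.conf b.conf # inode numbers now differ
cat b.conf           # still prints "v2"
```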
🧨 Risks and Limitations
Hardlinking is not magic. It comes with sharp edges:
- One inode, multiple names: All links are equal. Editing one changes the data for all.
- Backups: Some backup tools don’t preserve hardlinks or treat them inefficiently.
  ➤ Duplicity, which I use, does not preserve hardlinks. It backs up each linked file as a full copy, so hardlinking won't reduce backup size.
- Security: Linking files with different permissions or owners can have unexpected results.
- Limited scope: Only works within the same filesystem (e.g., you can't link / and /mnt if they're on separate partitions).
In my setup, I accept those risks because:
- I’m only linking read-only system files.
- I never link config or user data.
- I don’t rely on hardlink preservation in backups.
- I test changes before deploying.
In short: I know what I’m linking, and why.
🔍 What the Critics Say About Hardlinking
Not everyone loves hardlinks, and for good reasons. The thoughtful critiques I've read boil down to a few core arguments:
- Hardlinks violate expectations about file ownership and identity.
- They can break assumptions in software that tracks files by name or path.
- They complicate file deletion logic—deleting one name doesn’t delete the content.
- They confuse file monitoring and logging tools, since it’s hard to tell if a file is “new” or just another name.
- They increase the risk of data corruption if accidentally modified in-place by a script that assumes it owns the file.
Why I’m still OK with it:
These concerns are valid—but mostly apply to:
- Mutable files (e.g., logs, configs, user data)
- Systems with untrusted users or dynamic scripts
- Software that relies on inode isolation or path integrity
In contrast, my approach is intentionally narrow and safe:
- I only deduplicate read-only system files in /bin, /sbin, /lib, /lib64, /usr, and /opt.
- These are owned by root, and only changed during package updates.
- I don't hardlink anything under /home, /etc, /var, or /tmp.
- I know exactly when the cron job runs and what it targets.
So yes, hardlinks can be dangerous—but only if you use them in the wrong places. In this case, I believe I’m using them correctly and conservatively.
⚡ Does Hardlinking Impact System Performance?
Good news: hardlinks have virtually no impact on system performance in everyday use.
Hardlinks are a native feature of Linux filesystems like ext4 or xfs. The OS treats a hardlinked file just like a normal file:
- Reading and writing hardlinked files is just as fast as normal files.
- Permissions, ownership, and access behave identically.
- Common tools (ls, cat, cp) don't care whether a file is hardlinked or not.
- Filesystem caches and memory management work exactly the same.
The only difference is that multiple filenames point to the exact same data.
Things to keep in mind:
- If you edit a hardlinked file, all links see that change because there’s really just one file.
- Some tools (backup, disk usage) might treat hardlinked files differently.
- Debugging or auditing files can be slightly trickier since multiple paths share one inode.
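If you do need to audit which names point at the same data, standard tools answer it quickly; /usr/bin/perl below is just an arbitrary example path:

```bash
# Link count, inode number and name for one path
stat -c '%h %i %n' /usr/bin/perl

# Every path on this filesystem that shares its inode
# (-xdev stops find from crossing into other mounts)
sudo find / -xdev -samefile /usr/bin/perl
```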
But from a performance standpoint? Your system won’t even notice the difference.
🛠️ Tools for Hardlinking
There are a few tools out there:
- fdupes – finds duplicates and optionally replaces them with hardlinks
- rdfind – more sophisticated duplicate detection
- hardlink – simple but limited
- jdupes – a high-performance fork of fdupes
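For comparison, here's roughly how the rdfind and util-linux hardlink variants are invoked. I haven't standardised on these, and option names differ between versions and implementations, so treat this as a sketch and check the man pages:

```bash
# rdfind: report what would be linked, then actually create hardlinks
rdfind -dryrun true -makehardlinks true /usr
rdfind -makehardlinks true /usr

# util-linux hardlink: dry run first, then link for real
hardlink --dry-run --verbose /usr
hardlink --verbose /usr
```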
📌 About Hadori
From the Debian package description:
This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it’s called “Hardlinking DOne RIght”.
Advantages over other tools:
- Predictability: arguments are scanned in order, each first version is kept
- Much lower CPU and memory consumption compared to alternatives
This makes hadori especially suited for system-wide deduplication where efficiency and reliability matter.
⏱️ How I Use Hadori
I run hadori once per month with a cron job. I used Ansible to set it up, but that's incidental; this could just as easily be a line in /etc/cron.monthly.
Here’s the actual command:
```bash
/usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
```
This scans those directories, finds duplicate files, and replaces them with hardlinks when safe.
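If you want to see what a manual run actually buys you, wrapping it in a quick df comparison is enough (run as root; --verbose is the only hadori option I use):

```bash
df -h /                                                          # before
sudo /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
df -h /                                                          # after
```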
And here’s the crontab entry I installed (via Ansible):
```yaml
roles:
  - debops.debops.cron

cron__jobs:
  hadori:
    name: Hardlink with hadori
    special_time: monthly
    job: /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
```
Which then created the file /etc/cron.d/hadori:
```
#Ansible: Hardlink with hadori
@monthly root /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
```
📉 What Are the Results?
After the first run, I saw a noticeable reduction in used disk space, especially in /usr/lib and /usr/share. On my modest VPS, that translated to about 300–500 MB saved: not huge, but non-trivial for a small root partition.
While this doesn't reduce my backup size (Duplicity doesn't preserve hardlinks), it still helps with local disk usage and keeps things a little tidier.
And because the job only runs monthly, it’s not intrusive or performance-heavy.
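One way to sanity-check the result afterwards is to count how many files ended up with more than one name; du is also safe to use here, since it counts each hardlinked inode only once per invocation:

```bash
# Regular files under /usr (same filesystem) that have more than one link
sudo find /usr -xdev -type f -links +1 | wc -l

# Total disk usage of /usr; hardlinked inodes are counted once
sudo du -sh /usr
```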
🧼 Final Thoughts
Hardlinking isn’t something most people need to think about. And frankly, most people probably shouldn’t use it.
But if you:
- Know what you’re linking
- Limit it to static, read-only system files
- Automate it safely and sparingly
…then it can be a smart little optimization.
With a tool like hadori, it’s safe, fast, and efficient. I’ve read the horror stories—and decided that in my case, they don’t apply.
✉️ This post was brought to you by a monthly cron job and the letters i-n-o-d-e.