Let's Get Hard (Links): Deduplicating My Linux Filesystem with Hadori
File deduplication isn't just for massive storage arrays or backup systems; it can be a practical tool for personal or server setups too. In this post, I'll explain how I use hardlinking to reduce disk usage on my Linux system, which directories are safe (and unsafe) to link, why I'm OK with the trade-offs, and how I automated it with a simple monthly cron job using a neat tool called hadori.
What Is Hardlinking?
In a traditional filesystem, every file has an inode, which is essentially its real identity: the data on disk. A hard link is a different filename that points to the same inode. That means:
- The file appears to exist in multiple places.
- But there’s only one actual copy of the data.
- Deleting one link doesn't delete the content, unless it's the last one.
Compare this to a symlink, which is just a pointer to a path. A hardlink is a pointer to the data.
So if you have 10 identical files scattered across the system, you can replace them with hardlinks, and boom: nine of them stop taking up extra space.
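A quick shell session (throwaway files, nothing system-critical) makes the difference between a hard link and a symlink tangible:

```bash
# Create a file, a hard link, and a symlink, then compare them.
echo "hello" > original.txt
ln original.txt hardlink.txt       # hard link: another name for the same inode
ln -s original.txt symlink.txt     # symlink: just a pointer to the path

ls -li original.txt hardlink.txt symlink.txt
# original.txt and hardlink.txt share one inode number and show a link count of 2;
# symlink.txt has its own inode and only stores the target path.

rm original.txt
cat hardlink.txt                   # still prints "hello": the data survives
cat symlink.txt                    # fails: the symlink now points at nothing
```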
Why Use Hardlinking?
My servers run a fairly standard Ubuntu install, and like most Linux machines, the root filesystem accumulates a lot of identical binaries and libraries, especially across /bin, /lib, /usr, and /opt.
That's not a problem… until you're tight on disk space, or you're just a curious nerd who enjoys squeezing every last byte.
In my case, I wanted to reduce disk usage safely, without weird side effects.
Hardlinking is a one-time cost with ongoing benefits. It's not compression. It's not archival. But it's efficient and non-invasive.
Which Directories Are Safe to Hardlink?
Hardlinking only works within the same filesystem, and not all directories are good candidates.
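On my servers the candidate directories all sit on one root filesystem; if you're unsure about yours, a quick check looks something like this (the paths are just examples):

```bash
# Hard links only work between paths on the same filesystem, so check
# which filesystem each candidate directory actually lives on.
df -h / /bin /usr /opt

# GNU stat can print the mount point per path, too:
stat -c '%m  %n' /bin /usr /opt
```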
Safe directories:
- /bin, /sbin: system binaries
- /lib, /lib64: shared libraries
- /usr, /usr/bin, /usr/lib, /usr/share, /usr/local: user-space binaries, docs, etc.
- /opt: optional, manually installed software
These contain mostly static files: compiled binaries, libraries, man pages… not something that changes often.
Unsafe or risky directories:
- /etc: configuration files, might change frequently
- /var, /tmp: logs, spools, caches, session data
- /home: user files, temporary edits, live data
- /dev, /proc, /sys: virtual filesystems, do not touch
If a file is modified after being hardlinked, the deduplication can bite back in two ways: a tool that replaces the file with a fresh copy quietly breaks the link, and you're back where you started; an in-place edit, on the other hand, changes the one shared copy, so every link sees data you may not have meant to share.
That's why I avoid any folders with volatile, user-specific, or auto-generated files.
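To make those failure modes concrete, here's a small sketch with throwaway files rather than anything under /usr:

```bash
echo "v1" > a
ln a b                       # two names, one inode

echo "v2" >> a               # in-place edit: both names see the change,
cat b                        # because there is only one copy of the data

printf 'v3\n' > a.new
mv a.new a                   # replace-by-rename (what editors and package
ls -li a b                   # managers typically do): a gets a new inode,
                             # b keeps the old data, and the deduplication
                             # is silently undone
```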
Risks and Limitations
Hardlinking is not magic. It comes with sharp edges:
- One inode, multiple names: All links are equal. Editing one changes the data for all.
- Backups: Some backup tools don't preserve hardlinks or treat them inefficiently. Duplicity, which I use, does not preserve hardlinks: it backs up each linked file as a full copy, so hardlinking won't reduce backup size.
- Security: Linking files with different permissions or owners can have unexpected results.
- Limited scope: Only works within the same filesystem (e.g., can't link / and /mnt if they're on separate partitions); see the example below.
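That last limitation is easy to demonstrate; assuming /mnt/data is mounted from a separate partition, the attempt fails immediately (the exact error text may vary):

```bash
# Hard links cannot cross filesystem boundaries:
ln /bin/ls /mnt/data/ls-link
# ln: failed to create hard link '/mnt/data/ls-link' => '/bin/ls': Invalid cross-device link
```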
In my setup, I accept those risks because:
- I’m only linking read-only system files.
- I never link config or user data.
- I don't rely on hardlink preservation in backups.
- I test changes before deploying.
In short: I know what I'm linking, and why.
What the Critics Say About Hardlinking
Not everyone loves hardlinks, and for good reasons; thoughtful critiques of the practice have been written. The core arguments:
- Hardlinks violate expectations about file ownership and identity.
- They can break assumptions in software that tracks files by name or path.
- They complicate file deletion logicādeleting one name doesn’t delete the content.
- They confuse file monitoring and logging tools, since it's hard to tell if a file is "new" or just another name.
- They increase the risk of data corruption if accidentally modified in-place by a script that assumes it owns the file.
Why I'm still OK with it:
These concerns are valid, but they mostly apply to:
- Mutable files (e.g., logs, configs, user data)
- Systems with untrusted users or dynamic scripts
- Software that relies on inode isolation or path integrity
In contrast, my approach is intentionally narrow and safe:
- I only deduplicate read-only system files in /bin, /sbin, /lib, /lib64, /usr, and /opt.
- These are owned by root, and only changed during package updates.
- I don't hardlink anything under /home, /etc, /var, or /tmp.
- I know exactly when the cron job runs and what it targets.
So yes, hardlinks can be dangerous, but only if you use them in the wrong places. In this case, I believe I'm using them correctly and conservatively.
Does Hardlinking Impact System Performance?
Good news: hardlinks have virtually no impact on system performance in everyday use.
Hardlinks are a native feature of Linux filesystems like ext4 or xfs. The OS treats a hardlinked file just like a normal file:
- Reading and writing hardlinked files is just as fast as normal files.
- Permissions, ownership, and access behave identically.
- Common tools (ls, cat, cp) don't care whether a file is hardlinked or not.
- Filesystem caches and memory management work exactly the same.
The only difference is that multiple filenames point to the exact same data.
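You can see this on any system by checking link counts and inode numbers, for example:

```bash
# %h = hard link count, %i = inode number, %n = file name
stat -c '%h %i %n' /usr/bin/* | sort -rn | head

# List regular files under /usr whose inode is shared by more than one name:
find /usr -xdev -type f -links +1 | head
```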
Things to keep in mind:
- If you edit a hardlinked file, all links see that change because there's really just one file.
- Some tools (backup, disk usage) might treat hardlinked files differently.
- Debugging or auditing files can be slightly trickier since multiple paths share one inode.
But from a performance standpoint? Your system won't even notice the difference.
Tools for Hardlinking
There are a few tools out there:
- fdupes: finds duplicates and optionally replaces them with hardlinks
- rdfind: more sophisticated detection
- hardlink: simple but limited
- jdupes: high-performance fork of fdupes
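For comparison, their typical invocations look roughly like this; I don't use these myself, so treat the commands as illustrative sketches and check each tool's man page, since option names differ between versions and distributions:

```bash
# rdfind: scan a tree and turn duplicates into hard links
rdfind -makehardlinks true /usr/share

# jdupes: recurse and replace duplicates with hard links
jdupes -r -L /usr/share

# hardlink (util-linux): link identical files within the given tree
hardlink -v /usr/share
```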
About Hadori
From the Debian package description:
This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it's called "Hardlinking DOne RIght".
Advantages over other tools:
- Predictability: arguments are scanned in order, and the first version of each file that is encountered is the one that gets kept
- Much lower CPU and memory consumption compared to alternatives
This makes hadori especially suited for system-wide deduplication where efficiency and reliability matter.
How I Use Hadori
I run hadori once per month with a cron job. I used Ansible to set it up, but that's incidental: this could just as easily be a line in /etc/cron.monthly.
Here's the actual command:
/usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
This scans those directories, finds duplicate files, and replaces them with hardlinks when safe.
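If you want each run to record what it actually saved, a small wrapper script could log disk usage before and after; this is a hypothetical sketch, not part of my setup:

```bash
#!/bin/sh
# Hypothetical wrapper around the monthly run: log root filesystem usage
# before and after so the savings end up in syslog.
set -eu

before=$(df --output=used -k / | tail -n 1)
/usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
after=$(df --output=used -k / | tail -n 1)

echo "hadori: root fs used ${before}K before, ${after}K after" | logger -t hadori
```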
And here's the Ansible configuration that installs the cron job:
roles:
  - debops.debops.cron
cron__jobs:
  hadori:
    name: Hardlink with hadori
    special_time: monthly
    job: /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
Which then created the file /etc/cron.d/hadori:
#Ansible: Hardlink with hadori
@monthly root /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
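If you'd rather not involve Ansible at all, the same job works as a plain drop-in script; the path below is my suggestion, not something the role created:

```bash
#!/bin/sh
# Save as /etc/cron.monthly/hadori and make it executable (chmod +x);
# run-parts will then run it once a month, no cron.d entry needed.
exec /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt
```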
What Are the Results?
After the first run, I saw a noticeable reduction in used disk space, especially in /usr/lib and /usr/share. On my modest VPS, that translated to about 300–500 MB saved: not huge, but non-trivial for a small root partition.
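If you want a rough sense of what a run achieved on your own system, counting the files that now share an inode works (plain du or df comparisons before and after work too, since du counts each inode only once):

```bash
# Count regular files in these trees whose data is shared with at least one other name:
find /usr /opt -xdev -type f -links +1 | wc -l
```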
While this doesn't reduce my backup size (Duplicity doesn't preserve hardlinks), it still helps with local disk usage and keeps things a little tidier.
And because the job only runs monthly, it's not intrusive or performance-heavy.
Final Thoughts
Hardlinking isn't something most people need to think about. And frankly, most people probably shouldn't use it.
But if you:
- Know what you're linking
- Limit it to static, read-only system files
- Automate it safely and sparingly
…then it can be a smart little optimization.
With a tool like hadori, it's safe, fast, and efficient. I've read the horror stories, and decided that in my case, they don't apply.
This post was brought to you by a monthly cron job and the letters i-n-o-d-e.