
🧼 Pre-commit: Because “oops, forgot to format” is so last year

8 October 2025 / 24 September 2025

As a solo developer, I wear all the hats. 🎩👷‍♂️🎨
That includes the very boring Quality Assurance Hat™ — the one that says “yes, Amedee, you do need to check for trailing whitespace again.”

And honestly? I suck at remembering those little details. I’d rather be building cool stuff than remembering to run Black or fix a missing newline. So I let my robot friend handle it.

That friend is called pre-commit. And it’s the best personal assistant I never hired. 🤖

🧐 What is this thing?

Pre-commit is like a bouncer for your Git repo. Before your code gets into the club (your repo), it gets checked at the door:

“Whoa there — trailing whitespace? Not tonight.”
“Missing a newline at the end? Try again.”
“That YAML looks sketchy, pal.”
“You really just tried to commit a 200MB video file? What is this, Dropbox?”
“Leaking AWS keys now, are we? Security says nope.”
“Commit message says ‘fix’? That’s not a message, that’s a shrug.”

Pre-commit runs a bunch of little scripts called hooks to catch this stuff. You choose which ones to use — it’s modular, like Lego for grown-up devs. 🧱

When I commit, the hooks run. If they don’t like what they see, the commit gets bounced.
No exceptions. No drama. Just “fix it and try again.”

Is it annoying? Yeah, sometimes.
But has it saved my butt from pushing broken or embarrassing code? Way too many times.

🎯 Why I bother (as a hobby dev)

I don’t have teammates yelling at me in code reviews. I am the teammate.
And future-me is very forgetful. 🧓

Pre-commit helps me:

📏 Keep my code consistent
💣 Catch dumb mistakes before I make them permanent
🕒 Spend less time cleaning up
💼 Feel a little more “pro” even when I’m hacking on toy projects
🧬 Do all of that in any language. Even Bash, if you’re that kind of person.

Also, it feels kinda magical when it auto-fixes stuff and the commit just… works.

🛠 Installing it with `pipx` (because I’m not a barbarian)

I’m not a fan of polluting my Python environment, so I use pipx to keep things tidy. It installs CLI tools globally, but keeps them isolated.
If you don’t have pipx yet:

python3 -m pip install --user pipx
pipx ensurepath

Then install pre-commit like a boss:

pipx install pre-commit

Boom. It’s available everywhere for your user account without polluting your precious virtualenvs. Chef’s kiss. 👨‍🍳💋

📝 Setting it up

Inside my project (usually some weird half-finished script I’ll obsess over for 3 days and then forget for 3 months), I create a file called .pre-commit-config.yaml.

Here’s what mine usually looks like:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.28.0
    hooks:
      - id: gitleaks

  - repo: https://github.com/jorisroovers/gitlint
    rev: v0.19.1
    hooks:
      - id: gitlint

  - repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
    rev: v0.8.0
    hooks:
      - id: pre-commit-update

🧙‍♂️ What this pre-commit config actually does

You’re not just tossing some YAML in your repo and calling it a day. This thing pulls together a full-on code hygiene crew — the kind that shows up uninvited, scrubs your mess, locks up your secrets, and judges your commit messages like it’s their job. Because it is.

📦 `pre-commit-hooks` (v5.0.0)

These are the basics — the unglamorous chores that keep your repo from turning into a dumpster fire. Think lint roller, vacuum, and passive-aggressive IKEA manual rolled into one.

trailing-whitespace:
🚫 No more forgotten spaces at the end of lines. The silent killers of clean diffs.
end-of-file-fixer:
👨‍⚕️ Adds a newline at the end of each file. Why? Because some tools (and nerds) get cranky if it’s missing.
check-yaml:
🧪 Validates your YAML syntax. No more “why isn’t my config working?” only to discover you had an extra space somewhere.
check-added-large-files:
🚨 Stops you from accidentally committing that 500MB cat video or .sqlite dump. Saves your repo. Saves your dignity.
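
To make those chores less abstract, here is a minimal Python sketch of the kind of logic such hooks apply. It is not the actual pre-commit-hooks code; the function names are made up for illustration, and the 500 KB limit is an assumed default.

```python
def fix_text(text: str) -> str:
    """Strip trailing spaces/tabs from each line and end with exactly one newline."""
    lines = [line.rstrip(" \t") for line in text.splitlines()]
    # Drop blank lines at the end, then add back a single newline.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""


def is_too_large(size_bytes: int, max_kb: int = 500) -> bool:
    """Flag files above a size limit (the 500 KB default is an assumption)."""
    return size_bytes > max_kb * 1024
```

A real hook does the same thing per staged file, rewrites the file if anything changed, and fails the commit so you can review and re-stage the fix.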

🔐 `gitleaks` (v8.28.0)

Scans your code for secrets — API keys, passwords, tokens you really shouldn’t be committing.
Because we’ve all accidentally pushed our .env file at some point. (Don’t lie.)
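
If you are curious what "scanning for secrets" boils down to, here is a toy Python sketch that looks for one well-known pattern: AWS access key IDs. The real gitleaks ships a large rule set and is far smarter than this; `find_secrets` is a hypothetical helper, not part of any tool.

```python
import re

# AWS access key IDs are 20 characters: "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, match) pairs for anything that looks like an AWS key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group()))
    return hits
```

An empty list means nothing matched this one pattern, which is a much weaker guarantee than a clean gitleaks run.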

✍️ `gitlint` (v0.19.1)

Enforces good commit message style: limiting subject line length, flagging trailing punctuation, and insisting on an actual body instead of messages like “asdf”.
Great if you’re trying to look like a serious dev, even when you’re mostly committing bugfixes at 2AM.
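
A rough Python sketch of what a commit message linter checks, assuming a 72-character title limit. The rule set and the `lint_commit_message` name are illustrative only; gitlint's own rules are configurable and more thorough.

```python
def lint_commit_message(message: str, max_title_length: int = 72) -> list[str]:
    """Return a list of complaints about a commit message (empty list means it passes)."""
    lines = message.strip().splitlines()
    if not lines:
        return ["message is empty"]
    problems = []
    title = lines[0]
    if len(title) > max_title_length:
        problems.append(f"title is longer than {max_title_length} characters")
    if title.endswith((".", "!", "?", ":", ";", ",")):
        problems.append("title ends with punctuation")
    # Convention: a blank line separates the title from the body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    body = [line for line in lines[2:] if line.strip()]
    if not body:
        problems.append("body is missing")
    return problems
```

Feed it "fix" and it complains that the body is missing, which is exactly the shrug the bouncer was talking about.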

🔁 `pre-commit-update` (v0.8.0)

The responsible adult in the room. Automatically bumps your hook versions to the latest stable ones. No more living on ancient plugin versions.

🧼 In summary

This setup covers:

✅ Basic file hygiene (whitespace, newlines, YAML, large files)
🔒 Secret detection
✉️ Commit message quality
🆙 Keeping your hooks fresh

You can add more later, like linters specific to your language of choice. Think of this as your “minimum viable cleanliness.”

🧩 What else can it do?

There are hundreds of hooks. Some I’ve used, some I’ve just admired from afar:

black is a Python code formatter that says: “Shhh, I know better.”
flake8 finds bugs, smells, and style issues in Python.
isort sorts your imports so you don’t have to.
eslint for all you JavaScript kids.
shellcheck for Bash scripts.
… or write your own custom one-liner hook!

You can browse tons of them at: https://pre-commit.com/hooks.html
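
Writing your own hook is less scary than it sounds: pre-commit passes the staged file names as arguments and treats a non-zero exit code as "bounce this commit". Here is a minimal sketch in Python that rejects leftover `breakpoint()` calls. The `check_files` name and the rule itself are made up for the example.

```python
def check_files(paths: list[str]) -> int:
    """Return 1 if any given file contains a leftover breakpoint() call, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if "breakpoint()" in line:
                    print(f"{path}:{lineno}: leftover breakpoint()")
                    failed = True
    return 1 if failed else 0

# As a standalone script you would finish with:
#   import sys; raise SystemExit(check_files(sys.argv[1:]))
```

The return value doubles as the exit code, so anything other than 0 stops the commit.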

🧙‍♀️ Make Git do your bidding

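To hook it all into Git:

pre-commit install
pre-commit install --hook-type commit-msg

The second command is there for gitlint: it checks the commit message, so it runs in Git’s commit-msg hook rather than the pre-commit one.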

Now every time you commit, your code gets a spa treatment before it enters version control. 💅

Wanna retroactively clean up the whole repo? Go ahead:

pre-commit run --all-files

You’ll feel better. I promise.

🎯 TL;DR

Pre-commit is a must-have.
It’s like brushing your teeth before a date: it’s fast, polite, and avoids awkward moments later. 🪥💋
If you haven’t tried it yet: do it. Your future self (and your Git history, and your date) will thank you. 🙏

Use pipx to install it globally.
Add a .pre-commit-config.yaml.
Install the Git hook.
Enjoy cleaner commits, fewer review comments — and a commit history you’re not embarrassed to bring home to your parents. 😌💍

And if it ever annoys you too much?
You can always disable it… like cancelling the date but still showing up in their Instagram story. 😈💔

git commit --no-verify

Want help writing your first config? Or customizing it for Python, Bash, JavaScript, Kotlin, or your one-man-band side project? I’ve been there. Ask away!

Creating 10 000 Random Files & Analyzing Their Size Distribution: Because Why Not? 🧐💾

30 July 2025 / 15 July 2025

Ever wondered what it’s like to unleash 10 000 tiny little data beasts on your hard drive? No? Well, buckle up anyway — because today, we’re diving into the curious world of random file generation, and then nerding out by calculating their size distribution. Spoiler alert: it’s less fun than it sounds. 😏

Step 1: Let’s Make Some Files… Lots of Them

Our goal? Generate 10 000 files filled with random data. But not just any random sizes — we want a mean file size of roughly 68 KB and a median of about 2 KB. Sounds like a math puzzle? That’s because it kind of is.

If you just pick file sizes uniformly at random, you’ll end up with a median close to the mean — which is boring. We want a skewed distribution, where most files are small, but some are big enough to bring that average up.

The Magic Trick: Log-normal Distribution 🎩✨

Enter the log-normal distribution, a nifty way to generate lots of small numbers and a few big ones, just like real life. Using Python’s NumPy library, we draw a size for each file, then read that many bytes from good old /dev/urandom to fill it with pure randomness.
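
How do you get from "mean 68 KB, median 2 KB" to the two numbers a log-normal wants? For a log-normal with parameters mu and sigma, the median is exp(mu) and the mean is exp(mu + sigma² / 2). Solving those for mu and sigma gives the little sketch below. The `lognormal_params` helper is mine, and I'm assuming 1 KB = 1024 bytes.

```python
import math


def lognormal_params(mean: float, median: float) -> tuple[float, float]:
    """Return (mu, sigma) of the log-normal with the given mean and median."""
    if not 0 < median < mean:
        raise ValueError("need 0 < median < mean")
    mu = math.log(median)                            # median = exp(mu)
    sigma = math.sqrt(2 * math.log(mean / median))   # mean = exp(mu + sigma**2 / 2)
    return mu, sigma


mu, sigma = lognormal_params(mean=68 * 1024, median=2 * 1024)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")  # mu = 7.62, sigma = 2.66
```

A sigma that large means a very heavy tail, so the mean of any single batch of 10 000 files can land noticeably above or below the target.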

Here’s the Bash script that does the heavy lifting:

#!/bin/bash

# Directory to store the random files
output_dir="random_files"
mkdir -p "$output_dir"

# Total number of files to create
file_count=10000

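# Log-normal distribution parameters
# median = exp(mean_log), mean = exp(mean_log + stddev_log^2 / 2)
mean_log=7.62    # ln(2048): sets the ~2KB median
stddev_log=2.66  # sqrt(2 * ln(68KB / 2KB)): stretches the tail for a ~68KB mean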

# Function to generate random numbers based on log-normal distribution
generate_random_size() {
    python3 -c "import numpy as np; print(int(np.random.lognormal($mean_log, $stddev_log)))"
}

# Create files with random data
for i in $(seq 1 $file_count); do
    file_size=$(generate_random_size)
    file_path="$output_dir/file_$i.bin"
    head -c "$file_size" /dev/urandom > "$file_path"
    echo "Generated file $i with size $file_size bytes."
done

echo "Done. Files saved in $output_dir."

Easy enough, right? This creates a directory random_files and fills it with 10 000 files of sizes mostly small but occasionally wildly bigger. Don’t blame me if your disk space takes a little hit! 💥
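
One thing to know about the script above: it starts a fresh Python interpreter for every single file, which adds up over 10 000 iterations. If that bothers you, here is a sketch of the same idea done entirely in Python with the standard library. It uses `random.lognormvariate` instead of NumPy, and the `generate_files` function and its parameters are my own naming, not something from the Bash script.

```python
import os
import random
from pathlib import Path


def generate_files(output_dir, count, mu, sigma, seed=None):
    """Create `count` files of log-normally distributed sizes; return the sizes."""
    rng = random.Random(seed)  # pass a seed for reproducible sizes
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    sizes = []
    for i in range(1, count + 1):
        size = int(rng.lognormvariate(mu, sigma))
        # os.urandom builds the whole buffer in memory, fine for a sketch.
        (out / f"file_{i}.bin").write_bytes(os.urandom(size))
        sizes.append(size)
    return sizes
```

Calling `generate_files("random_files", 10000, mu, sigma)` with your chosen parameters mirrors the Bash version, minus the 10 000 interpreter launches.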

Step 2: Crunching Numbers — The File Size Distribution 📊

Okay, you’ve got the files. Now, what can we learn from their sizes? Let’s find out the:

Mean size: The average size across all files.
Median size: The middle value when sizes are sorted — because averages can lie.
Distribution breakdown: How many tiny files vs. giant files.

Here’s a handy Bash script that reads file sizes and spits out these stats with a bit of flair:

#!/bin/bash

# Input directory (default to "random_files" if not provided)
directory="${1:-random_files}"

# Check if directory exists
if [ ! -d "$directory" ]; then
    echo "Directory $directory does not exist."
    exit 1
fi

# Array to store file sizes (stat -c%s is GNU coreutils; BSD/macOS stat uses -f%z)
file_sizes=($(find "$directory" -type f -exec stat -c%s {} +))

# Check if there are files in the directory
if [ ${#file_sizes[@]} -eq 0 ]; then
    echo "No files found in the directory $directory."
    exit 1
fi

# Calculate mean
total_size=0
for size in "${file_sizes[@]}"; do
    total_size=$((total_size + size))
done
mean=$((total_size / ${#file_sizes[@]}))

# Calculate median
sorted_sizes=($(printf '%s\n' "${file_sizes[@]}" | sort -n))
mid=$(( ${#sorted_sizes[@]} / 2 ))
if (( ${#sorted_sizes[@]} % 2 == 0 )); then
    median=$(( (sorted_sizes[mid-1] + sorted_sizes[mid]) / 2 ))
else
    median=${sorted_sizes[mid]}
fi

# Display file size distribution
echo "File size distribution in directory $directory:"
echo "---------------------------------------------"
echo "Number of files: ${#file_sizes[@]}"
echo "Mean size: $mean bytes"
echo "Median size: $median bytes"

# Display detailed size distribution (optional)
echo
echo "Detailed distribution (size ranges):"
awk '{
    if ($1 < 1024) bins["< 1 KB"]++;
    else if ($1 < 10240) bins["1 KB - 10 KB"]++;
    else if ($1 < 102400) bins["10 KB - 100 KB"]++;
    else bins[">= 100 KB"]++;
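} END {
    # Print bins in a fixed order; awk does not guarantee the order of "for (k in array)".
    n = split("< 1 KB|1 KB - 10 KB|10 KB - 100 KB|>= 100 KB", order, "|");
    for (i = 1; i <= n; i++) printf "%-15s: %d\n", order[i], bins[order[i]];
}' <(printf '%s\n' "${file_sizes[@]}")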

Run it, and voilà — instant nerd satisfaction.
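
If Bash arithmetic is not your idea of a good time, the same stats fit in a few lines of Python. This is a sketch of the same calculation, not a replacement for the script; `summarize` is a hypothetical helper that takes a plain list of sizes in bytes.

```python
import statistics


def summarize(sizes: list[int]) -> dict:
    """Return count, mean, median and a size-range histogram for file sizes in bytes."""
    bins = {"< 1 KB": 0, "1 KB - 10 KB": 0, "10 KB - 100 KB": 0, ">= 100 KB": 0}
    for size in sizes:
        if size < 1024:
            bins["< 1 KB"] += 1
        elif size < 10240:
            bins["1 KB - 10 KB"] += 1
        elif size < 102400:
            bins["10 KB - 100 KB"] += 1
        else:
            bins[">= 100 KB"] += 1
    return {
        "count": len(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "bins": bins,
    }
```

You could feed it something like `[p.stat().st_size for p in Path("random_files").iterdir()]` after importing `Path` from `pathlib`. The bins come back in the order they were defined, since a dict keeps insertion order.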

Example Output:

File size distribution in directory random_files:
---------------------------------------------
Number of files: 10000
Mean size: 68987 bytes
Median size: 2048 bytes

Detailed distribution (size ranges):
< 1 KB         : 1234
1 KB - 10 KB   : 5678
10 KB - 100 KB : 2890
>= 100 KB      : 198

Why Should You Care? 🤷‍♀️

Besides the obvious geek cred, generating files like this can help:

Test backup systems — can they handle weird file size distributions?
Stress-test storage or network performance with real-world-like data.
Understand your data patterns if you’re building apps that deal with files.

Wrapping Up: Big Files, Small Files, and the Chaos In Between

So there you have it. Ten thousand random files later, and we’ve peeked behind the curtain to understand their size story. It’s a bit like hosting a party and then figuring out who ate how many snacks. 🍿

Try this yourself! Tweak the distribution parameters, generate files, crunch the numbers — and impress your friends with your mad scripting skills. Or at least have a fun weekend project that makes you sound way smarter than you actually are.

Happy hacking! 🔥