Creating 10 000 Random Files & Analyzing Their Size Distribution: Because Why Not? 🧐💾

Ever wondered what it’s like to unleash 10 000 tiny little data beasts on your hard drive? No? Well, buckle up anyway — because today, we’re diving into the curious world of random file generation, and then nerding out by calculating their size distribution. Spoiler alert: it’s less fun than it sounds. 😏

Step 1: Let’s Make Some Files… Lots of Them

Our goal? Generate 10 000 files filled with random data. But not just any random sizes — we want a mean file size of roughly 68 KB and a median of about 2 KB. Sounds like a math puzzle? That’s because it kind of is.

If you just pick file sizes uniformly at random, you’ll end up with a median close to the mean — which is boring. We want a skewed distribution, where most files are small, but some are big enough to bring that average up.

The Magic Trick: Log-normal Distribution 🎩✨

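Enter the log-normal distribution, a nifty way to generate lots of small numbers and a few big ones, just like real life. One of its two parameters (the mean of the underlying normal) fixes the median, and the other (its standard deviation) controls how far the long tail drags the mean above that median. We draw the sizes with Python’s NumPy library, then read that many bytes from good old /dev/urandom to fill each file with pure randomness.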
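If you want to work out the two parameters rather than guess them, one option is to invert the standard log-normal identities median = exp(mu) and mean = exp(mu + sigma^2 / 2). The sketch below does that with awk; the helper name lognormal_params is made up for illustration, and the targets are simply the 2 KB median and 68 KB mean from above expressed in bytes.

```bash
#!/bin/bash
# Sketch: derive log-normal parameters from a target median and mean.
# Relies on: median = exp(mu), mean = exp(mu + sigma^2 / 2).
# lognormal_params is a hypothetical helper, not part of the original script.
lognormal_params() {
    local median_bytes="$1" mean_bytes="$2"
    awk -v med="$median_bytes" -v mean="$mean_bytes" 'BEGIN {
        mu = log(med)                        # the median fixes mu
        sigma = sqrt(2 * (log(mean) - mu))   # the mean/median ratio fixes sigma
        printf "mean_log=%.2f stddev_log=%.2f\n", mu, sigma
    }'
}

# Targets from the text: median ~2 KB (2048 bytes), mean ~68 KB (69632 bytes)
lognormal_params 2048 69632
```

For those targets this prints mean_log=7.62 stddev_log=2.66. With a tail that heavy, the sample mean of 10 000 draws can still land noticeably off target from one run to the next, while the median stays much steadier.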

Here’s the Bash script that does the heavy lifting:

#!/bin/bash

# Directory to store the random files
output_dir="random_files"
mkdir -p "$output_dir"

# Total number of files to create
file_count=10000

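# Log-normal distribution parameters
# median = exp(mean_log); mean = exp(mean_log + stddev_log^2 / 2)
mean_log=7.62    # ln(2048): puts the median at ~2 KB
stddev_log=2.66  # sqrt(2 * ln(68 / 2)): pulls the mean up to ~68 KB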

# Draw all sizes in one Python call (starting Python once per file is very slow)
generate_random_sizes() {
    python3 -c "import numpy as np; print(*np.random.lognormal($mean_log, $stddev_log, $file_count).astype(int), sep='\n')"
}

# Create files with random data, one size per line of Python output
i=0
while read -r file_size; do
    i=$((i + 1))
    file_path="$output_dir/file_$i.bin"
    head -c "$file_size" /dev/urandom > "$file_path"
    echo "Generated file $i with size $file_size bytes."
done < <(generate_random_sizes)

echo "Done. Files saved in $output_dir."

Easy enough, right? This creates a directory random_files and fills it with 10 000 files, most of them small and a few wildly bigger. Don’t blame me if your disk space takes a little hit! 💥
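To get a rough idea of how big that hit might be before generating anything, you can multiply the file count by the log-normal's expected value, exp(mu + sigma^2 / 2). This is only a sketch: estimate_usage is a hypothetical helper, and the values in the example call (10 000 files, mu = 7.62, sigma = 2.66) are assumptions chosen to match a 2 KB median and 68 KB mean, so substitute whatever your script actually uses.

```bash
#!/bin/bash
# Sketch: expected disk usage for N log-normally distributed file sizes.
# estimate_usage is a hypothetical helper; arguments are count, mu, sigma.
estimate_usage() {
    awk -v n="$1" -v mu="$2" -v sigma="$3" 'BEGIN {
        per_file = exp(mu + sigma * sigma / 2)   # expected bytes per file
        printf "expected bytes per file: %.0f\n", per_file
        printf "expected total: %.1f MiB\n", n * per_file / (1024 * 1024)
    }'
}

# Example call with assumed parameters (replace with your own)
estimate_usage 10000 7.62 2.66
```

Treat the result as a ballpark figure: a handful of very large files can push the real total well above or below it.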

Step 2: Crunching Numbers — The File Size Distribution 📊

Okay, you’ve got the files. Now, what can we learn from their sizes? Let’s find out the:

  • Mean size: The average size across all files.
  • Median size: The middle value when sizes are sorted — because averages can lie.
  • Distribution breakdown: How many tiny files vs. giant files.
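To see why the median matters, here is a tiny worked example on five made-up sizes, four small and one huge. The mean_median helper is hypothetical and just illustrates the arithmetic; it uses integer output like the full script does.

```bash
#!/bin/bash
# Sketch: mean vs. median on a skewed list of sizes read from stdin.
# mean_median is a hypothetical helper; the sizes below are made up.
mean_median() {
    sort -n | awk '{ a[NR] = $1; sum += $1 } END {
        mean = sum / NR
        # Middle element for odd counts, average of the two middle ones otherwise
        median = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
        printf "mean=%d median=%d\n", mean, median
    }'
}

printf '%s\n' 100 200 300 400 100000 | mean_median
```

This prints mean=20200 median=300: one large file drags the mean far above what a typical file looks like, while the median barely notices.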

Here’s a handy Bash script that reads file sizes and spits out these stats with a bit of flair:

#!/bin/bash

# Input directory (default to "random_files" if not provided)
directory="${1:-random_files}"

# Check if directory exists
if [ ! -d "$directory" ]; then
    echo "Directory $directory does not exist."
    exit 1
fi

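# Array to store file sizes (stat -c%s is GNU coreutils; on macOS/BSD use stat -f%z)
# "{} +" passes many files to each stat call instead of running stat once per file
file_sizes=($(find "$directory" -type f -exec stat -c%s {} +))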

# Check if there are files in the directory
if [ ${#file_sizes[@]} -eq 0 ]; then
    echo "No files found in the directory $directory."
    exit 1
fi

# Calculate mean
total_size=0
for size in "${file_sizes[@]}"; do
    total_size=$((total_size + size))
done
mean=$((total_size / ${#file_sizes[@]}))

# Calculate median
sorted_sizes=($(printf '%s\n' "${file_sizes[@]}" | sort -n))
mid=$(( ${#sorted_sizes[@]} / 2 ))
if (( ${#sorted_sizes[@]} % 2 == 0 )); then
    median=$(( (sorted_sizes[mid-1] + sorted_sizes[mid]) / 2 ))
else
    median=${sorted_sizes[mid]}
fi

# Display file size distribution
echo "File size distribution in directory $directory:"
echo "---------------------------------------------"
echo "Number of files: ${#file_sizes[@]}"
echo "Mean size: $mean bytes"
echo "Median size: $median bytes"

# Display detailed size distribution (optional)
echo
echo "Detailed distribution (size ranges):"
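awk '{
    if ($1 < 1024) bins["< 1 KB"]++;
    else if ($1 < 10240) bins["1 KB - 10 KB"]++;
    else if ($1 < 102400) bins["10 KB - 100 KB"]++;
    else bins[">= 100 KB"]++;
} END {
    # Print bins in a fixed order; "for (range in bins)" iterates in unspecified order
    n = split("< 1 KB|1 KB - 10 KB|10 KB - 100 KB|>= 100 KB", order, "|");
    for (i = 1; i <= n; i++) printf "%-15s: %d\n", order[i], bins[order[i]];
}' <(printf '%s\n' "${file_sizes[@]}")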

Run it, and voilà — instant nerd satisfaction.

Example output (illustrative numbers; yours will differ from run to run):

File size distribution in directory random_files:
---------------------------------------------
Number of files: 10000
Mean size: 68987 bytes
Median size: 2048 bytes

Detailed distribution (size ranges):
< 1 KB         : 1234
1 KB - 10 KB   : 5678
10 KB - 100 KB : 2890
>= 100 KB      : 198

Why Should You Care? 🤷‍♀️

Besides the obvious geek cred, generating files like this can help:

  • Test backup systems — can they handle weird file size distributions?
  • Stress-test storage or network performance with real-world-like data.
  • Understand your data patterns if you’re building apps that deal with files.

Wrapping Up: Big Files, Small Files, and the Chaos In Between

So there you have it. Ten thousand random files later, and we’ve peeked behind the curtain to understand their size story. It’s a bit like hosting a party and then figuring out who ate how many snacks. 🍿

Try this yourself! Tweak the distribution parameters, generate files, crunch the numbers — and impress your friends with your mad scripting skills. Or at least have a fun weekend project that makes you sound way smarter than you actually are.

Happy hacking! 🔥