This was something I’ve wanted to do for a long time… since I was messing around with everything else on the server, I figured now would be a good time.
The rationale behind it is that if you have pre-compressed versions of a file available and use the "gzip_static" directive in nginx, it'll save nginx from having to compress it on-the-fly. So less server load, and visitors hopefully get served the page just a tiny bit faster. This also makes the max compression in gzip (-9) a little more palatable, since you pay the compression cost once up front instead of on every request.
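For reference, the nginx side of it is just one directive (assuming your nginx was built with the gzip_static module); a minimal, purely illustrative example:

location / {
    gzip_static on;   # if foo.css.gz exists next to foo.css, serve it instead of compressing on the fly
}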
The problems I needed to solve were as follows:
1) I needed to be able to specify which files to compress based on the extension. Only compress CSS, HTML, JS, etc.
2) I wanted it set up in a way such that it could be automated. In other words, so that I could either manually run a script that would look-at-and-take-care-of-everything, or handle it in a cron job that would look to see if any of the files changed, and update the .gz version if so.
3) This was being done on an Ubuntu 14.04 server, so of the zillion ways you might do something in a bash script, whatever I used had to work on Ubuntu 14.04.
Solving #1 was a fairly simple one-liner, though note that you should BACK UP first and make sure your version of gzip is new enough to support the "-k" (keep the original file) option; without -k, gzip's default behavior is to delete the original once the .gz is created, which would be unfortunate here:
find /var/www/testing.com -type f -regextype posix-extended -regex '.*\.(htm|css|html|js)' -exec gzip -k -9 {} \;
That’s a quick and dirty (not so great) way of manually gzipping it all and getting prompts asking if you want to overwrite any previous .gz versions. Kinda terrible, but one line.
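If you'd rather not get prompted at all, gzip's "-f" flag forces it to overwrite any existing .gz files (blindly, with no check on whether they're stale or current, which is part of why the script below is the better route):

find /var/www/testing.com -type f -regextype posix-extended -regex '.*\.(htm|css|html|js)' -exec gzip -f -k -9 {} \;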
A better way that addressed #2 is via a shell script. This one’s a little messy, but hopefully easy enough to follow:
#!/bin/bash
# Easily tweakable settings:
LOCATION="/var/www/testing.com"     # directory to scan
FILES="htm|css|html|js"             # extensions to compress

process() {
    DEBUG=1          # 1 = report what happened for each file, 0 = silent
    SLEEP_DELAY=0.1  # pause after each compression to keep server load down
    FILE="$1"
    if [ -f "$FILE".gz ]
    then
        # A .gz already exists - compare modification times
        FILE_ORIG=$(stat -c %Y "$FILE")
        FILE_GZIP=$(stat -c %Y "$FILE".gz)
        if [ "$FILE_ORIG" -gt "$FILE_GZIP" ]
        then
            # Original is newer - replace the stale .gz
            rm "$FILE".gz
            gzip -k -9 "$FILE"
            if [ "$DEBUG" == 1 ]
            then
                echo "Deleted old .gz and created new one at: $FILE.gz"
            fi
            sleep "$SLEEP_DELAY"
        else
            if [ "$DEBUG" == 1 ]
            then
                echo "Skipping - Already up to date: $FILE.gz"
            fi
        fi
    else
        # No .gz yet - create one
        gzip -k -9 "$FILE"
        if [ "$DEBUG" == 1 ]
        then
            echo "Created new: $FILE.gz"
        fi
        sleep "$SLEEP_DELAY"
    fi
}
export -f process

# Hand each matching file to process() as an argument rather than splicing
# the name into the command string (safer with odd file names).
find "$LOCATION" -type f -regextype posix-extended -regex ".*\.($FILES)" -exec /bin/bash -c 'process "$1"' _ {} \;
The stuff meant to be easily tweakable sits at the top: LOCATION and FILES at the top of the script, and DEBUG and SLEEP_DELAY at the top of the process() function.
What it essentially does:
- Does a find for everything with an html/htm/css/js extension.
- Checks to see if a gzipped version already exists.
- If a gzipped version DOES already exist, it compares the timestamp against the original. If the timestamp is the same, it does nothing. If the original is newer, it deletes the old gzip and creates a new one.
- If a gzipped version DOESN’T already exist, it creates a gzip.
A few things to note:
- Again, this works on this particular Linux distro with a version of gzip new enough to have the -k option. Do a test run on some non-important data before trying to use this.
- DEBUG=1 spits out data about what happened for each file – whether it was gzipped, skipped, or an old gzip was replaced. After you’ve run it successfully once and are sure that nothing-crazy-happened, you can probably set DEBUG=0.
- SLEEP_DELAY creates a small delay after each compression (it won't add a delay on skipped files), with the intent being that it doesn't cause the server load to skyrocket if you have a lot of files and the server's already heavily loaded. The delay alone adds up to roughly 28 hours across 1 million files (1,000,000 × 0.1 s), so you may want to tweak it.
- You could tweak this in a number of ways to better suit your uses. A couple of examples (there's a rough sketch of both after this list):
- adding another check so that you don't compress files under a certain size.
- saving everything in the process() function to its own script and then calling it with something like find /var/www/testing.com -type f -regextype posix-extended -regex '.*\.(htm|php|html)' -exec ./mynewscript.sh {} \; to make it a little more portable if you want to use the same script on different file extensions or directories.
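To give a rough idea of both tweaks at once, a standalone version of process() with a minimum-size check might look something like this (the MIN_SIZE value is just an example and the whole thing is an untested sketch, so treat it accordingly):

#!/bin/bash
# mynewscript.sh - compress the single file passed in as $1
DEBUG=1
SLEEP_DELAY=0.1
MIN_SIZE=256   # skip files smaller than this many bytes

FILE="$1"

# Skip tiny files - the .gz clutter isn't worth the few bytes saved
if [ "$(stat -c %s "$FILE")" -lt "$MIN_SIZE" ]
then
    [ "$DEBUG" == 1 ] && echo "Skipping - Under $MIN_SIZE bytes: $FILE"
    exit 0
fi

if [ -f "$FILE".gz ]
then
    # Only re-compress if the original is newer than the existing .gz
    if [ "$(stat -c %Y "$FILE")" -gt "$(stat -c %Y "$FILE".gz)" ]
    then
        rm "$FILE".gz
        gzip -k -9 "$FILE"
        [ "$DEBUG" == 1 ] && echo "Replaced stale .gz: $FILE.gz"
        sleep "$SLEEP_DELAY"
    fi
else
    gzip -k -9 "$FILE"
    [ "$DEBUG" == 1 ] && echo "Created new: $FILE.gz"
    sleep "$SLEEP_DELAY"
fi

Calling it from find as shown above keeps the file-matching logic (extensions, directory) out of the script itself, which is what makes it easy to reuse.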
Warnings:
- Don’t use it on php files unless you absolutely 100% know what you’re doing. A gzipped version wouldn’t work anyway, and they often contain sensitive data that will magically become available to the world via the new .gz file.
- Back up before testing, and test it on a duplicate. I can’t stress this enough. I could have made an error above, or there might be something funky that happens on your system.
- If it’s possible for someone else (visitors) to create files on your system somehow, you may want to carefully comb through the code and make sure there’s no room for injection. The find command hands each file name to process() as an argument rather than splicing it into the command string, which should cope with odd file/directory names (stuff with spaces and/or special chars in it), but I haven’t gone out of my way to break it with nefariously crafted names. Use at your own risk!
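And for the cron-job angle mentioned at the start, an entry along these lines would run it nightly during a quiet hour (the script path is just an example, wherever you end up saving it):

15 3 * * * /bin/bash /usr/local/bin/precompress.sh >/dev/null 2>&1

The redirect keeps cron from emailing you all the DEBUG output; drop it if you'd rather get those reports.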
As for clearing out .gz files whose originals have since been removed: there are a lot of ways you could potentially do this. I'd be inclined to use a separate script (or a separate loop within the first script), triggered before the main script/loop, that does a "find" for all .gz files in the desired location and then pulls out the basename. For example, if you completely gutted/modified the script so that "FILEZ" pointed at files like "my.opinion.on.cats.vs.dogs.html.gz" (matching .gz extensions instead of the html/js/etc. extensions), then:
basename $FILEZ .gz
(note the space between $FILEZ and .gz)
...would output "my.opinion.on.cats.vs.dogs.html".
So a line like:
CHECKFILE=$(basename "$FILEZ" .gz)
(note the space between "$FILEZ" and .gz)
...would give you the file name without the gz extension and save it to a new CHECKFILE variable. Then you can do a check to see if $CHECKFILE exists (with the "if [ -f ... ]" bit). If it doesn't exist, delete the .gz version.
Once any extraneous .gz files are removed, let the normal script run.
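Pulling that together, the pre-pass might look something like the sketch below. Same caveats as everything else here: it reuses the LOCATION value from the main script, it's untested, and it will happily remove my.archive.tar.gz if my.archive.tar isn't sitting next to it (see note 2 below).

#!/bin/bash
# Pre-pass: remove .gz files whose original no longer exists
LOCATION="/var/www/testing.com"

cleanup() {
    FILEZ="$1"
    CHECKFILE=$(basename "$FILEZ" .gz)            # file name with the directory and .gz stripped off
    CHECKPATH="$(dirname "$FILEZ")/$CHECKFILE"    # same directory, minus the .gz extension
    if [ ! -f "$CHECKPATH" ]
    then
        echo "Removing orphaned gzip: $FILEZ"
        rm "$FILEZ"
    fi
}
export -f cleanup

find "$LOCATION" -type f -name '*.gz' -exec /bin/bash -c 'cleanup "$1"' _ {} \;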
---
A few notes:
1) I use basename because it handles file names that have multiple periods in them pretty easily. There are other ways to do it, and someone can certainly chime in if they've got a better method.
2) This will obviously cause a problem if you intentionally uploaded a .tar.gz file and it happens to be in the path you're checking (since you probably don't have an original .tar in there, it'll delete your original). Thus, I tend to just delete stray .gz variants manually if need be.
3) I've tested the behavior of BASENAME, but didn't actually test my syntax above. So it may need tweaking.
4) Obviously TEST CAREFULLY on a copy of your main site just in case something goes wrong. Make sure you have backups!
Hopefully something in there helps. Good luck!