I know, you almost felt a little dirty searching for this…. But rest assured, website scraping isn’t just for the nefarious!
In the past I’ve had to convert sites from php to html for various reasons. Sometimes temporary, sometimes a little more permanent. In any case, it always takes me a while to look up exactly the way I did it the last time, so now it’s here for me. Oh, and it’s here for you too. Exciting.
Note that while I’ve used it to convert php to static html, it’s also just dandy for scraping/mirroring a site that’s already in html.
PART 1: grab wget
Mac OS X: http://www.merenbach.com/software/wget/ (it includes a package installer)
Windows: http://users.ugent.be/~bpuype/wget/#download
Others (also OSX and Windows if the above links break one day – UPDATE – YES THEY DID BREAK):
http://wget.addictivecode.org/FrequentlyAskedQuestions#download
PART 2: install, and use it!
I’ve only used wget on Mac OS X – the above package installer makes it really easy. Install it and you can then access it from anywhere as soon as you open Terminal.
If you’re on Windows, I believe it’s just a simple executable – if you want to use it from anywhere, throw it somewhere in your %path% and then jump into Command Prompt.
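Before going any further, it’s worth confirming the install actually worked. Here’s a quick check for Terminal (just a sketch, and the wget_status variable name is my own; on Windows Command Prompt, typing "where wget" does roughly the same job):

```shell
# Check whether wget is reachable from the current PATH.
if command -v wget >/dev/null 2>&1; then
  wget_status="wget found"
else
  wget_status="wget not found - check the install or your PATH"
fi
echo "$wget_status"
```

If it reports "not found", open a fresh Terminal window first; a newly installed program sometimes isn’t picked up until you do.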
So, whether you’re in Terminal or Command Prompt, the usage is pretty simple. For a single page (say, the main page of a site), you’d type in:
wget http://www.website.com
PART 3: Making it useful (using switches)
Of course, doing an entire website that way would take forever, so here’s the set of switches I typically use as a start-point:
wget -m -k -K -E -l 7 -t 6 -w 5 http://www.website.com
**note** if you don’t want to read much further, the -m and -w 5 are the critical ones (and of course the http://www.website.com ). Use them even if you’re going to just trial-and-error with a CTRL-C to cancel/tweak once you’ve seen what it looks like.
A little clarification regarding each of the switches, as you will undoubtedly want to tweak a bit. Like seriously, before you copy/paste and spend an hour scraping your site, read this or you might end up with *something* you didn’t want.
-m Essentially, this means “mirror the site”, and it recursively grabs pages & images as it spiders through the site. It checks the timestamp, so if you run wget a 2nd time with this switch, it will only update files/pages that are newer than the previous time.
-k This will modify links in the html to point to local files. If instead of using things like “page2.html” as links throughout your site you were actually using a full “http://www.website.com/page2.html” you’ll probably need/want this. I turn it on just to be on the safe side – chances are at least 1 link will cause a problem otherwise.
-K The option above (lowercase k) edits the html. If you want the “untouched” version as well, use this switch and it will save both the changed version and the original. It’s just good practice in case something is awry and you want to compare both versions. You can always delete the one you didn’t want later.
-E This saves HTML & CSS with “proper extensions”. Careful with this one – if a page on your site doesn’t already end in “.html” (or “.htm”), this will add “.html” to the saved file. That’s exactly what you want when converting php to static html, but it does mean that something like “page2.php” ends up saved as “page2.php.html”. Pages that already end in “.htm” or “.html” are left alone.
-l 7 By default, the -m we used above will recurse/spider through the entire site. Usually that’s ok. But sometimes your site will have an infinite loop in which case wget will download forever. Think of the typical website.com/products/jellybeans/sort-by-/name/price/name/price/name/price example. It’s somewhat rare nowadays – most sites behave well and won’t do this, but to be on the safe side, figure out the most clicks it should possibly take to get anywhere from the main page to reach any real page on the website, pad it a little (it would suck if you used a value of 7 and found out an hour later that your site was 8 levels deep!) and use that #. Of course, if you know your site has a structure that will behave, there’s nothing wrong with omitting this and having the comfort of knowing that the 1 hidden page on your site that was 50 levels deep was actually found.
-t 6 If trying to access/download a certain page or file fails, this sets the number of retries before it gives up on that file and moves on. You usually do want it to *eventually* give up (set it to 0 if you want it to try forever), but you also don’t want it to give up if the site was just being wonky for a second or two. I find 6 to be reasonable.
-w 5 This tells wget to wait a few seconds (5 seconds in this case) before grabbing the next file. It’s often critical to use something here (at least 1 second). Let me explain. By default, wget will grab pages as fast as it possibly can. This can easily be multiple requests per second, which has the potential to put huge load on the server (particularly if the site is written in PHP, makes MySQL accesses on each request, and doesn’t utilize a cache). If the website is on shared hosting, that load can get someone kicked off their host. Even on a VPS it can bring some sites to their knees. And even if the site itself survives, being bombarded with an insane number of requests within a few seconds can look like a DoS attack, which could very well get your IP auto-blocked. If you don’t know for certain that the site can handle a massive influx of traffic, use the -w # switch. 5 is usually quite safe. Even 1 is probably ok most of the time. But use something.
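If the single-letter switches are hard to remember, each one also has a long-form name. Here’s the same command spelled out, as a sketch: the SITE and OPTS variable names are just mine for illustration, and the leading echo only prints the command so you can look it over before running it for real. (On older versions of wget, --adjust-extension was called --html-extension.)

```shell
# Placeholder URL from the examples above; swap in your own site.
SITE="http://www.website.com"

# Long-form equivalents of: -m -k -K -E -l 7 -t 6 -w 5
OPTS="--mirror --convert-links --backup-converted --adjust-extension --level=7 --tries=6 --wait=5"

# Dry run: print the full command. Remove "echo" to actually start the mirror.
echo wget $OPTS "$SITE"
```

The order matters a little: --mirror turns on unlimited depth by itself, so --level=7 needs to come after it to take effect.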
PART 4: A few other notes….
wget is incredibly versatile. If you run it without any parameters it’ll spit out the command to print out the help, and it’s worth a look if you want to do something a little more specific (I looked at 7 options above… there are 40 “simple” switches, and over 100 options in total).
One edge-case where I’ve had issues in the past is when some images were implemented through CSS. If you’ve used a lot of “background-image:picture.png” in your theme, you may be in for the same headache. Basically, the last time I scraped a site, wget just didn’t seem to see/download those images. The good news is that CSS images tend to only be used for the theme itself, and since you’re presumably scraping your own site, you should have them all clumped in a folder somewhere anyway, and should be able to plunk those images back in manually if need be. Regular images (implemented through the typical <img src=”something.img” />) were grabbed by wget just fine for me.
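If you do end up hunting down theme images by hand, a quick way to see which ones your stylesheets reference is to search the mirrored CSS for url(...) entries. This is only a rough sketch: list_css_urls is a made-up helper name, the folder you pass in is whatever wget created for your site, and the pattern will miss anything unusual (paths containing parentheses, or extra spaces inside the brackets).

```shell
# Hypothetical helper: print each unique url(...) target found in the
# .css files under the given folder, with surrounding quotes stripped.
list_css_urls() {
  find "$1" -type f -name '*.css' -exec grep -ohE 'url\([^)]*\)' {} + \
    | sed -e 's/^url(//' -e 's/)$//' -e "s/^[\"']//" -e "s/[\"']\$//" \
    | sort -u
}

# Example: list_css_urls www.website.com
```

Anything in that list that isn’t sitting in the mirrored folder is an image you’ll need to copy over yourself.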
CTRL-C is generally the shortcut to cancel part-way through. You’ll see the progress as it grabs the files, so if something looks awry or you realize that the “wait” value you used wasn’t a great balance between not-killing-the-server and being-time-efficient, feel free to cancel, adjust, and try again.
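To get a feel for that balance before you start, you can do the arithmetic up front: the wait alone costs roughly (number of files) x (wait in seconds). A small sketch, where estimate_wait_minutes is a made-up helper and the 600-file count is purely an illustrative guess:

```shell
# Hypothetical helper: minutes spent just waiting between requests.
# $1 = approximate number of files, $2 = the -w value in seconds.
estimate_wait_minutes() {
  echo $(( $1 * $2 / 60 ))
}

estimate_wait_minutes 600 5   # 600 files at -w 5 -> prints 50
estimate_wait_minutes 600 1   # same site at -w 1 -> prints 10
```

The actual download time comes on top of that, so treat the result as a lower bound.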
So there we have it. A simple way to scrape and/or mirror your own site. And now I’ll have the settings again for next time.