As a short note today, if you want to make an offline copy/mirror of a website using the GNU/Linux
wget command, a command like this will do the trick for you:
wget --mirror \
-o log \
Why I did this
In my case I used this command because I don’t want to use Drupal to serve my How I Sold My Business website any more, so I used wget to convert the original Drupal website into a series of static HTML files that can be served by Nginx or Apache. (There’s no need to use Drupal here, as I no longer update that website, and I don’t accept comments there.) I just did the same thing with my alaskasquirrel.com website, which is basically an online version of a children’s book that I haven’t modified in many years.
Why use the --html-extension option?
Note that you won’t always need to use the --html-extension option with wget, but because the original version of my How I Sold My Business website did not use any extensions at the end of the URLs, it was necessary in this case.
What I mean by that is that the original version of my website had URLs like this:
Notice that there is no .html extension at the end of that URL. Therefore, what happens if you use wget without the --html-extension option is that you end up with a file on your local computer with this name:
Even if you use MAMP or WAMP to serve this file from your local filesystem, they aren’t going to know that this is an HTML file, so essentially what you end up with is a worthless file.
Conversely, when you do use the --html-extension option, you end up with this file on your local filesystem:
On a Mac, that file is easily opened in a browser, and you don’t even need MAMP.
wget is also smart enough to change all the links within the offline version of the website to refer to the new filenames, so everything works.
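One way to check whether a mirror ended up with pages that lack an extension is to list the files whose names contain no dot. The following is a minimal sketch and not part of the original command; the function name list_extensionless is a hypothetical helper.

```shell
# List regular files under a directory whose names contain no dot,
# i.e. files that were saved without any extension.
# list_extensionless is a hypothetical helper name.
list_extensionless() {
  find "$1" -type f ! -name '*.*'
}
```

Run against the directory that wget created for the mirror, it should print nothing when every saved file has an extension.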
Explanation of the wget options used
Here’s a short explanation of the options I used in that wget command:
--mirror
    Turn on options suitable for mirroring. This option turns on
    recursion and time-stamping, sets infinite recursion depth,
    and keeps FTP directory listings. It is currently equivalent to
    ‘-r -N -l inf --no-remove-listing’.
--convert-links
    After the download is complete, convert the links in the document
    to make them suitable for local viewing.
-o foo
    Write log output to a file named "foo". (In my command the file is named "log".)
--wait=seconds
    Wait the specified number of seconds between the retrievals.
    Use of this option is recommended, as it lightens the server load
    by making the requests less frequent.
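Putting these options together with --html-extension, a complete command could be wrapped in a small shell function. This is only a sketch under assumptions: the function name mirror_site, the DRY_RUN switch, the example.com URL, and the two-second default wait are placeholders, not values taken from the original command.

```shell
# mirror_site URL [WAIT_SECONDS]
# Hypothetical wrapper around the wget options described above.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
  url=$1
  wait_seconds=${2:-2}   # assumed default pause between requests

  # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
  ${DRY_RUN:+echo} wget \
    --mirror \
    --convert-links \
    --html-extension \
    --wait="$wait_seconds" \
    -o log \
    "$url"
}
```

Calling `DRY_RUN=1 mirror_site http://example.com/` prints the assembled command so it can be reviewed before any requests are made. Recent versions of the wget manual list --html-extension under the name --adjust-extension (-E).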
Depending on the web server settings of the website you’re copying, you may also need to use the -U option, which works something like this:
-U Mozilla
    Masquerade as a Mozilla browser.
That option lets you set the wget user agent. (I suspect that the string you use may need to be a little more complicated than that, but I didn’t need it, and didn’t investigate it further.)
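If a server does turn away wget’s default user agent, a fuller browser-style string can be passed with -U (long form --user-agent). The sketch below only assembles and prints such a command. The agent string and URL are illustrative assumptions, and nothing here has been checked against any particular server.

```shell
# Assumed, illustrative values: substitute your own.
user_agent='Mozilla/5.0 (X11; Linux x86_64)'
site_url='http://example.com/'

# Keep the whole agent string, spaces included, in one option.
ua_option="--user-agent=$user_agent"

# Print the arguments one per line for review.
printf '%s\n' wget --mirror "$ua_option" "$site_url"

# To run it for real, uncomment the next line:
# wget --mirror "$ua_option" "$site_url"
```

Quoting "$ua_option" matters here, because an unquoted string containing spaces would be split into several arguments.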
I got most of these settings from the GNU wget manual.
An alternative approach is to use httrack, like this:
httrack --footer "" http://mywebsite:8888/
I’m currently experimenting to see which works better.
I’ll write more about wget and its options in a future blog post, but for now, if you want to make an offline mirror copy of a website, the wget command I showed should work.