Archiving a Web Page
====================

This is just a document used for my own notes.

Currently using:

```
wget --convert-links --html-extension --page-requisites https://url.here.com/whatever/
```

More reference: https://www.petekeen.net/archiving-websites-with-wget

```
$ wget \
    --mirror \
    --warc-file=YOUR_FILENAME \
    --warc-cdx \
    --page-requisites \
    --html-extension \
    --convert-links \
    --execute robots=off \
    --directory-prefix=. \
    --span-hosts \
    --domains=example.com,www.example.com,cdn.example.com \
    --user-agent="Mozilla (mailto:archiver@petekeen.net)" \
    --wait=10 \
    --random-wait \
    http://www.example.com
```

Let's go through those options:

* wget is the tool we're using
* --mirror turns on a bunch of options appropriate for mirroring a whole website
* --warc-file turns on WARC output to the specified file
* --warc-cdx tells wget to dump out an index file for our new WARC file
* --page-requisites will grab all of the linked resources necessary to render the page (images, CSS, JavaScript, etc.)
* --html-extension appends .html to the files when appropriate
* --convert-links will turn links into local links as appropriate
* --execute robots=off turns off wget's automatic robots.txt checking
* --span-hosts allows it to follow links to other domain names
* --domains takes a comma-separated list of domains that wget should include in the archive
* --user-agent overrides wget's default user agent
* --wait tells wget to wait ten seconds between each request
* --random-wait will randomize that wait to between 5 and 15 seconds
* http://www.example.com is the website we want to archive
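
Since the same long command gets reused with different targets, it can help to wrap it in a small script. The sketch below is not part of the original notes: the script name, argument order, and variable names are made up for illustration, and the user agent is just the one from the command above.

```
#!/usr/bin/env bash
# Minimal sketch of wrapping the archive command in a reusable script.
# Script name, argument order, and variable names are illustrative only.
#
# Usage: ./archive.sh my-archive "example.com,www.example.com" http://www.example.com
set -euo pipefail

WARC_NAME="$1"   # base name for the WARC/CDX output files
DOMAINS="$2"     # comma-separated domains wget is allowed to span to
START_URL="$3"   # page or site to start archiving from

wget \
  --mirror \
  --warc-file="$WARC_NAME" \
  --warc-cdx \
  --page-requisites \
  --html-extension \
  --convert-links \
  --execute robots=off \
  --directory-prefix=. \
  --span-hosts \
  --domains="$DOMAINS" \
  --user-agent="Mozilla (mailto:archiver@petekeen.net)" \
  --wait=10 \
  --random-wait \
  "$START_URL"
```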