Archiving a Web Page
====================

This is just a document used for my own notes.

Currently using:

```
wget --convert-links --html-extension --page-requisites
```

More reference:

```
$ wget \
    --mirror \
    --warc-file=YOUR_FILENAME \
    --warc-cdx \
    --page-requisites \
    --html-extension \
    --convert-links \
    --execute robots=off \
    --directory-prefix=. \
    --span-hosts \
    --domains=example.com,www.example.com \
    --user-agent=Mozilla \
    --wait=10 \
    --random-wait \
    http://www.example.com/
```

Let's go through those options:

* wget is the tool we're using
* --mirror turns on a bunch of options appropriate for mirroring a whole website
* --warc-file turns on WARC output to the specified file
* --warc-cdx tells wget to dump out an index file for our new WARC file
* --page-requisites will grab all of the linked resources necessary to render the page (images, CSS, JavaScript, etc.)
* --html-extension appends .html to the files when appropriate
* --convert-links will turn links into local links as appropriate
* --execute robots=off turns off wget's automatic robots.txt checking
* --span-hosts allows it to follow links to other domain names
* --domains takes a comma-separated list of domains that wget should include in the archive
* --user-agent overrides wget's default user agent
* --wait tells wget to wait ten seconds between each request
* --random-wait will randomize that wait to between 5 and 15 seconds
* http://www.example.com/ is the website we want to archive
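As a quick sanity check, here is what a run of the short command above looks like end to end. The URL is just a stand-in for whatever page you actually want to grab:

```
# Grab one page plus the images/CSS/JS it needs, rewrite its links to
# point at the local copies, and make sure the saved file ends in .html.
wget --convert-links --html-extension --page-requisites http://example.com/post

# wget mirrors the site's directory layout under a folder named after
# the host, so the local copy opens straight from the filesystem:
#   example.com/post.html  (with its requisites saved alongside it)
```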
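To verify that the WARC capture worked, the standard gzip tools are enough. This assumes the reference command above ran with --warc-file=YOUR_FILENAME; wget compresses WARC output by default, so the file on disk is YOUR_FILENAME.warc.gz, and --warc-cdx should leave a plain-text index next to it, typically YOUR_FILENAME.cdx:

```
# Count the record types wget wrote into the archive; a healthy
# capture shows request, response, and metadata records.
zcat YOUR_FILENAME.warc.gz | grep -a '^WARC-Type:' | sort | uniq -c

# The CDX index is plain text: one line per captured URL with its
# byte offset into the WARC file.
head YOUR_FILENAME.cdx
```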