Wayback Machine - atauenis/webone GitHub Wiki

The founders of the Internet Archive (Archive.org) launched the Wayback Machine in 2001. The service lets users see archived versions of web pages across time. The Wayback Machine lives at http://web.archive.org, so it is also unofficially known as the Web Archive. Its crawler has been indexing publicly available Internet resources since 1996, taking copies of the WWW every year.

All copies are static, so server-side scripts were not saved and only their captured output is available. Some client-side scripts are saved and adapted to work in the Web Archive environment, but many of them end up broken. However, the text and most of the graphics and linked files are saved and can be viewed.

Support of Wayback Machine in WebOne

WebOne can open archived copies of pages behind dead links. If this feature is enabled (via the configuration file), then on a 404 File Not Found error, or when the remote server's domain name cannot be resolved, WebOne looks up the requested URL in the Web Archive database. This is done via the Wayback CDX Server API. If an archived copy is available, the proxy server either redirects to that copy or returns it directly (depending on the proxy settings). Otherwise the client sees the error as is. When the Web Archive has multiple saved versions of a page, WebOne prefers the latest one that is not a redirect (or simply the latest one, if no version with an HTTP 200 code is found).
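The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WebOne's actual code: the helper names `build_cdx_query` and `pick_snapshot` are hypothetical, and the sketch assumes the CDX server is queried with `output=json`, where the first returned row is a header naming the columns.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url):
    """Return a CDX query URL asking for timestamp, original URL and status code as JSON."""
    params = {"url": url, "output": "json", "fl": "timestamp,original,statuscode"}
    return CDX_ENDPOINT + "?" + urlencode(params)

def pick_snapshot(rows):
    """Choose one capture from CDX JSON rows (the first row is the header).

    Prefer the latest capture with status 200; otherwise fall back to the
    latest capture of any status. Return None when there are no captures.
    """
    if len(rows) < 2:
        return None
    header, captures = rows[0], rows[1:]
    ts = header.index("timestamp")
    status = header.index("statuscode")
    ok = [c for c in captures if c[status] == "200"]
    # Timestamps are fixed-width digit strings, so string comparison orders them by time.
    best = max(ok or captures, key=lambda c: c[ts])
    return dict(zip(header, best))
```

The query URL would be fetched with any HTTP client and the JSON body passed to `pick_snapshot`; the network step is left out so the selection logic can be read on its own.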

The default installation of WebOne is configured to open old archived copies of some web sites, even ones that are still alive. The list includes Microsoft.com, the online services of Windows XP, Windows Media Player 6/7/8/9, IE4 Active Channels (*.cdf files), and the Netscape.com online services.

URLs of archived copies

All copies have a URL in a fixed format: https://web.archive.org/web/YYYYMMDDHHMMSS/URL. The timestamp can be shortened by removing trailing digits, in which case the nearest copy is used: the Web Archive answers with a 302 redirect to a URL with the full timestamp. Even the current year may be used to get the latest available copy (including cases where the last copy is from 2005).
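As a small illustration of the format above, the following Python sketch builds such URLs. The function name `archived_url` is hypothetical, and accepting any prefix of 1 to 14 digits is an assumption of this sketch rather than a documented rule.

```python
def archived_url(original_url, timestamp):
    """Build a Wayback Machine URL from a full or shortened YYYYMMDDHHMMSS timestamp."""
    # Assumption: any non-empty digit prefix of up to 14 characters is acceptable.
    if not (timestamp.isdigit() and len(timestamp) <= 14):
        raise ValueError("timestamp must be 1 to 14 digits")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

# Full timestamp and a year-only shortcut for the same page.
full = archived_url("http://example.com/", "20050101120000")
by_year = archived_url("http://example.com/", "2005")
```

With the shortened form, the Web Archive itself is expected to redirect to the nearest full-timestamp URL, so the client does not need to know the exact capture time.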

Other possible URLs

Web Archive addresses can contain wildcards to get a list of archived content.

All available dates

To get a list of all archived copies of a page, replace the timestamp with an asterisk (*). This is somewhat more powerful than the Timeline Bar at the top of archived pages, and it works even in old browsers that cannot display the Timeline Bar. Example: https://web.archive.org/web/*/http://google.com.

Blue links indicate successfully created copies. Orange indicates that the URL was not found at crawl time. Green indicates a redirect at crawl time.

All saved pages

To get a list of all saved files from a server's directory, put a wildcard in both the timestamp and the address part. Example: https://web.archive.org/web/*/http://web.ukonline.co.uk/cliff.lawson/*. This can be useful when searching for binary files or for URLs with arguments (like http://example.com/index.php?page=index&captcha=12345). Note that the date of the last copy does not mean that the file was removed shortly after that copy was made. It may simply mean that the Web Archive robot did not download the file again because of an uninteresting file type or for some other reason.
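Both wildcard forms (all dates of one page, and all saved files under a directory) can be sketched in Python as follows. The helper names `all_dates_url` and `all_files_url` are hypothetical and only assemble the address; the listing itself is rendered by the Web Archive.

```python
WAYBACK = "https://web.archive.org/web"

def all_dates_url(page_url):
    """URL of the calendar listing every archived copy of a single page."""
    return f"{WAYBACK}/*/{page_url}"

def all_files_url(directory_url):
    """URL of the listing of every saved file under a directory."""
    # Normalise the trailing slash so the result always ends in "/*".
    return f"{WAYBACK}/*/{directory_url.rstrip('/')}/*"

dates = all_dates_url("http://google.com")
files = all_files_url("http://web.ukonline.co.uk/cliff.lawson/")
```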

Modified and original HTML code

By default, the Wayback Machine returns a user-friendly version of the content, with the Timeline Bar on top and a modified version of the page below it. The modified version contains rewritten links, so all links point to the Web Archive instead of the real files. But sometimes it is necessary to get the original copy of a file or to hide the Timeline Bar.

The URL can contain a suffix after the timestamp, as in this example: https://web.archive.org/web/20130806040521if_/http://faq.web.archive.org/page-without-wayback-code/.

  • No suffix - full Wayback Machine page with Timeline Bar, optimized for modern browsers.
  • id_ Identity - the original file exactly as it was archived. Most links will be broken.
  • js_ JavaScript - return the document marked up as JavaScript.
  • cs_ CSS - return the document marked up as CSS.
  • im_ Image - return the document as an image.
  • if_ or fw_ In-frame - modified version, which has proper links to archived images, styles, etc., and a JavaScript patch inside.

For old browsers it is better to use the fw_ version of pages, as it contains the minimum amount of modifications. However, all hyperlinks in it will still lead to the regular version. This can be overridden by a WebOne edit set. :)
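The suffix rules listed above can be illustrated with a short Python sketch. The function name `archived_url_with_suffix` is hypothetical, and defaulting to `fw_` simply follows the old-browser advice above; it is not a claim about what WebOne does internally.

```python
# Suffixes listed in this section; the empty string means the full Wayback page.
SUFFIXES = {"", "id_", "js_", "cs_", "im_", "if_", "fw_"}

def archived_url_with_suffix(original_url, timestamp, suffix="fw_"):
    """Build a Wayback Machine URL with a content suffix placed right after the timestamp."""
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return f"https://web.archive.org/web/{timestamp}{suffix}/{original_url}"

page = "http://faq.web.archive.org/page-without-wayback-code/"
in_frame = archived_url_with_suffix(page, "20130806040521", "if_")
original = archived_url_with_suffix(page, "20130806040521", "id_")
```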

See also

https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine