Offline documentation with webdoc

Before going on a long flight, I download PDFs of reference documentation for whatever software library I will be programming with. Having the documentation handy means I won’t get stuck on an unfamiliar edge case. It would be far more convenient if documentation websites could be cached offline, and that’s why I added an offline storage option to webdoc. The documentation for PixiJS and melonJS can now be downloaded for offline use! I’ll walk you through how I did it; the technique can be replicated for any static website.

How is it done?

It’s done using a service worker!

A service worker is a script that acts as a proxy between a browser and the web server. It can intercept fetches made on the main thread and respond with a cached resource or a computed value, or allow the request to continue the “normal” way to the web server. The service worker controls a cache storage and decides which resources to add to or delete from its caches. Note that this “cache storage” is separate from the browser’s regular HTTP cache.
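To make those three outcomes concrete, here is a minimal sketch of the decision, written as a plain function so it can run outside a browser. The names answerRequest, cache, and network are illustrative assumptions, not part of webdoc; in a real service worker this logic would sit inside a "fetch" event handler and the result would be handed to event.respondWith.

```javascript
// Sketch only: `cache` is any object with an async match(url) method and
// `network` is any async function that takes a URL. Both are stand-ins for
// the Cache API and fetch() that a real service worker would use.
async function answerRequest(url, cache, network) {
  // 1. A computed value: the worker can synthesize a response itself.
  if (url.endsWith("/ping")) {
    return {source: "computed", body: "pong"};
  }

  // 2. A cached resource, if the worker stored one earlier.
  const hit = await cache.match(url);
  if (hit !== undefined) {
    return {source: "cache", body: hit};
  }

  // 3. Otherwise let the request continue to the web server.
  return {source: "network", body: await network(url)};
}
```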

If your static website is hosted on a free service like GitHub Pages, being able to control the caching policy can be very handy. GitHub Pages’ caching policy sets a 10-minute TTL for all HTTP requests; this adds redundant downloads to repeat visits. A service worker can be leveraged to evict cached resources only when a web page has been modified.

A service worker runs in a separate thread from the page’s main thread. It stays active when a device is not connected to the Internet, so it can serve cached pages even when the device is offline!

webdoc’s caching policy

webdoc’s service worker uses two separate caches:

  1. The “main cache” holds the core assets of the website: react, react-dom, mermaid.js, and material-ui icons. These assets have versioned URLs, so they never need to be evicted. The main cache is simple and has no eviction policy.
  2. The “ephemeral cache” holds all assets that might be modified when documentation is regenerated. This is the generated HTML and webdoc’s own CSS / JS. To facilitate cache eviction, webdoc generates an MD5 hash of its documentation manifest and inserts it into the website it generates.
    1. This hash is refetched on the main thread and stored in browser storage (IndexedDB) on each page load.
    2. The service worker tags cached resources with the associated hash when they are downloaded. The tagging is done by appending an “x-manifest-hash” header to responses.
    3. A mismatch between the stored hash and the hash on a cached response triggers a refetch from the web server, and the cache is updated with the new response.
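
The eviction rule in the steps above boils down to a small comparison. The sketch below assumes the cached response exposes its headers through a get method (as the Fetch API’s Headers object does) and that the current hash has already been read from storage. isCachedResponseFresh is a hypothetical helper name, not webdoc’s, and treating a missing hash as stale is my own choice for the sketch.

```javascript
// Header used to tag cached responses, as described in the list above.
const MANIFEST_HASH_HEADER = "x-manifest-hash";

// Returns true when a cached response was stored under the same manifest
// hash that the site currently reports, i.e. it can be served as-is.
// `headers` is anything with a get(name) method; `currentHash` is the hash
// saved on the latest page load (null if none has been saved yet).
function isCachedResponseFresh(headers, currentHash) {
  const taggedHash = headers.get(MANIFEST_HASH_HEADER);
  // Untagged responses and an unknown current hash are both treated as stale.
  if (!taggedHash || !currentHash) return false;
  return taggedHash === currentHash;
}
```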

Let’s dive into the code

Registration

const registration = await navigator.serviceWorker.register(getResourceURI("service-worker.js"));

The first step is to register the service worker so that the browser downloads and runs it. getResourceURI is a helper to locate a static resource in a webdoc site.
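
Service workers are only exposed in secure contexts, so navigator.serviceWorker can be missing altogether. A defensive variant of the registration call might check for it first. This is a sketch under that assumption; registerIfSupported is a hypothetical helper, and the navigator object is passed in so the function can be exercised outside a browser.

```javascript
// Registers the worker only when the API is available. `nav` is normally
// window.navigator. Resolves to the ServiceWorkerRegistration, or to null
// when service workers are not supported in this context.
async function registerIfSupported(nav, scriptURL) {
  if (!nav || !nav.serviceWorker) return null;
  return nav.serviceWorker.register(scriptURL);
}
```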

Before the main thread can communicate with the service worker, the browser must activate it, so the second step is to wait for the registration to activate.

const waitOn = navigator.serviceWorker.controller ?
    Promise.resolve() :
    new Promise((resolve, reject) => {
      const worker = registration.installing ?? registration.waiting ?? registration.active;
      if (!worker) return reject(new Error("No worker found"));
      // The worker may already be activated (e.g. after a hard reload)
      if (worker.state === "activated") return resolve();
      // ServiceWorker.state ends at "activated"; there is no "active" state
      worker.onstatechange = (e) => {
        if (e.target.state === "activated") {
          resolve();
        }
      };
    });

// On the first page load this resolves once the worker activates, but
// navigator.serviceWorker.controller can still be null until the next visit.
await waitOn;

navigator.serviceWorker.controller is the service worker that controls the current page, and it is what the main thread uses to communicate with the worker. Its value is null until a worker controls the page. Activation itself is signaled by the “statechange” event on the worker object (registration.installing, registration.waiting, or registration.active).

Note that the service worker doesn’t take control of the page on the first load: it installs and activates, but pages only come under its control from the next load onward (unless the worker calls clients.claim()). That’s why it’s important to wait for the controller to become non-null.
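
One way to wait for the controller directly is to listen for the “controllerchange” event that navigator.serviceWorker fires when a worker takes over the page. The sketch below is an assumption about how this could be wired up, not webdoc’s code. waitForController is a hypothetical helper, and on a first visit the event only fires if the worker itself calls clients.claim().

```javascript
// Resolves with the controlling worker once `container.controller` is set.
// `container` is normally navigator.serviceWorker; it is a parameter here so
// the helper can be tested with any object that supports event listeners
// and has a `controller` field.
function waitForController(container) {
  if (container.controller) return Promise.resolve(container.controller);
  return new Promise((resolve) => {
    const onChange = () => {
      if (!container.controller) return;
      container.removeEventListener("controllerchange", onChange);
      resolve(container.controller);
    };
    container.addEventListener("controllerchange", onChange);
  });
}
```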

Hash verification

Once the service worker is registered, the website hash must be downloaded and compared to the one saved in IndexedDB. The local hash is null right after the service worker is registered for the first time, so a hash mismatch occurs. This is the desired behavior, since it triggers the initial indexing.

    // https://github.com/webdoc-labs/webdoc/blob/a52570c22fc3161e1f19f4997eb96081d1ea9d34/core/webdoc-default-template/src/protocol/index.js#L47
    if (!APP_MANIFEST) {
      throw new Error("The documentation manifest was not exported to the website");
    }

    // Use webdoc's IndexedDB wrapper to pull out the manifest URL & hash
    const {manifest, manifestHash, version} = await this.db.settings.get(APP_NAME);
    // Download the latest hash from the server
    const {hash: verifiedHash, offline} = await webdocService.verifyManifestHash(manifestHash);

    // If the manifest URL or hash don't match, then we need to update IndexedDB
    // and send a message to the service worker!
    if (manifest !== APP_MANIFEST ||
          manifestHash !== verifiedHash ||
          version !== VERSION) {
      console.info("Manifest change detected, reindexing");
      await this.db.settings.update(APP_NAME, {
        manifest: APP_MANIFEST,
        manifestHash: verifiedHash,
        origin: window.location.origin,
        version: VERSION,
      });
      if (typeof APP_MANIFEST === "string") {
        this.postMessage({
          type: "lifecycle:init",
          app: APP_NAME,
          manifest: APP_MANIFEST,
        });
      }
    }
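
The comparison at the heart of that snippet can be isolated into a small pure function, which also makes the first-visit case explicit: with nothing stored yet, the check always reports a mismatch. needsReindex and the shape of its arguments are assumptions for illustration, mirroring the manifest, manifestHash, and version fields used above.

```javascript
// `stored` is what was previously saved for this app (null on a first
// visit); `current` describes the site as it is being served right now.
function needsReindex(stored, current) {
  if (!stored) return true; // nothing saved yet, so treat it as a mismatch
  return stored.manifest !== current.manifest ||
    stored.manifestHash !== current.manifestHash ||
    stored.version !== current.version;
}
```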

lifecycle:init

The service worker receives the lifecycle:init message on a hash mismatch and uses it to download the manifest data and recache the website if offline storage is enabled.

  case "lifecycle:init": {
    // Parse the message
    const {app, manifest} = (message: SwInitMessage);

    try {
      // Open the database & fetch the manifest concurrently
      const [db, response] = await Promise.all([
        webdocDB.open(),
        fetch(new URL(manifest, new URL(registration.scope).origin)),
      ]);
      const data = await response.json();

      // Dump all the hyperlinks in the manifest into IndexedDB. This is used by
      // "cachePagesOffline" to locate all the pages in the website that need
      // to be downloaded for offline use.
      await db.hyperlinks.put(app, data.registry);

      // Caches the entire website if the user has enabled offline storage
      const settings = await db.settings.get(app);
      if (settings.offlineStorage) await cachePagesOffline(app);
    } catch (e) {
      console.error("fetch manifest", e);
    }

    break;
  }
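
cachePagesOffline itself is not shown here. As a rough idea of what such a step involves, the sketch below downloads a list of page URLs and stores each successful response in a cache. It is a guess at the general shape rather than webdoc’s implementation: precachePages, the flat array of URLs, and the injected fetchFn are all assumptions, whereas the real helper reads its links from IndexedDB.

```javascript
// Downloads every URL and stores successful responses in `cache`.
// `cache` needs an async put(url, response) method (like the Cache API) and
// `fetchFn` behaves like fetch(). Returns the URLs that could not be cached.
async function precachePages(urls, cache, fetchFn) {
  const failed = [];
  for (const url of urls) {
    try {
      const response = await fetchFn(url);
      if (!response.ok) {
        failed.push(url);
        continue;
      }
      await cache.put(url, response);
    } catch (err) {
      // A network error on one page should not abort the whole run.
      failed.push(url);
    }
  }
  return failed;
}
```

Returning the failed URLs lets the caller decide whether to retry them later or report that the offline copy is incomplete.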

fetch

Now let’s walk through how webdoc caches resources on the website. The caching policy is what makes the website work when a user is offline, and it also lets pages load almost instantly when the user is online. The service worker intercepts each fetch event and returns a response from the cache if one is available.

// https://github.com/webdoc-labs/webdoc/blob/a52570c22fc3161e1f19f4997eb96081d1ea9d34/core/webdoc-default-template/src/service-worker/index.js#L21
// Registers the "fetch" event handler
self.addEventListener("fetch", function(e: FetchEvent) {
  // Skip 3rd party resources like analytics scripts. webdoc only caches
  // resources from its own origin, so the browser handles these as usual
  if (new URL(e.request.url).origin !== new URL(registration.scope).origin) {
    return;
  }

The respondWith method on the event is used to provide a custom response for 1st party fetches. The caches global exposes the cache storage API used here.

  // https://github.com/webdoc-labs/webdoc/blob/a52570c22fc3161e1f19f4997eb96081d1ea9d34/core/webdoc-default-template/src/service-worker/index.js#L26
  e.respondWith(
    // Open the main & ephemeral cache together
    Promise.all([
      caches.open(MAIN_CACHE_KEY),
      caches.open(EPHEMERAL_CACHE_KEY),
    ]).then(async ([mainCache, ephemeralCache]) => {
      // Check main cache for a hit first - since we know hash verification
      // isn't required for versioned assets in that cache
      const mainCacheHit = await mainCache.match(e.request);
      if (mainCacheHit) return mainCacheHit;

      // Check the ephemeral cache for the resource and also pull out the hash
      // from IndexedDB
      const ephemeralCacheHit = await ephemeralCache.match(e.request);
      const origin = new URL(e.request.url).origin;
      const db = await webdocDB.open();
      const settings = await db.settings.findByOrigin(origin);

      if (settings && ephemeralCacheHit) {
        // Get the tagged hash on the cached response. Remember responses are
        // tagged using the x-manifest-hash header
        const manifestHash = ephemeralCacheHit.headers.get("x-manifest-hash");

        // If the hash matches, great!
        if (settings.manifestHash === manifestHash) return ephemeralCacheHit;
        // Otherwise continue and fetch the resource again    
        else {
          console.info("Invalidating ", e.request.url, " due to bad X-Manifest-Hash",
            `${manifestHash} vs ${settings.manifestHash}`);
        }
      }

If neither cache produces a usable hit, the service worker fetches the resource from the web server. A fetchVersioned helper adds the “x-manifest-hash” header to the returned response. The response is then put into the appropriate cache so that a future page load doesn’t trigger another download.

// https://github.com/webdoc-labs/webdoc/blob/a52570c22fc3161e1f19f4997eb96081d1ea9d34/core/webdoc-default-template/src/service-worker/index.js#L49
      try {
        // Fetch from the server and add "x-manifest-hash" header to response
        const response = await fetchVersioned(e.request);

        // Check if the main cache can cache this response
        if (VERSIONED_APP_SHELL_FILES.some((file) => e.request.url.endsWith(file))) {
          await mainCache.put(e.request, response.clone());
        // Check if the ephemeral cache can hold the response (all HTML pages are included)
        } else if (
          settings && (
            EPHEMERAL_APP_SHELL_FILES.some((file) => e.request.url.endsWith(file)) ||
            e.request.url.endsWith(".html"))) {
          await ephemeralCache.put(e.request, response.clone());
        }

        return response;
      } catch (err) {
        // Finish with cached response if network offline, even if we know it's stale
        console.error(err);
        if (ephemeralCacheHit) return ephemeralCacheHit;
        else throw err;
      }
    }),
  );
});

Note that at the bottom, a catch block is used to return a cached response even if we know the hash didn’t match. This happens when the resource is stale but the user isn’t connected to the Internet, so downloading the latest resource from the web server isn’t possible.
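
fetchVersioned is referenced in the handler but not shown. The headers on a response returned by fetch are immutable, so one way to tag it is to build a new Response that copies the body, status, and headers and adds “x-manifest-hash”. The following is a sketch under that assumption, not webdoc’s actual helper; tagWithManifestHash is a hypothetical name, and the hash is passed in rather than read from IndexedDB.

```javascript
// Returns a copy of `response` whose headers include the manifest hash.
// The original body stream is handed over to the copy, so `response`
// itself should not be read afterwards.
function tagWithManifestHash(response, manifestHash) {
  const headers = new Headers(response.headers);
  headers.set("x-manifest-hash", manifestHash);
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

A fetchVersioned-style wrapper could then await fetch(request), look up the current hash, and return the tagged copy.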


webdoc is the only documentation generator with offline storage that I know of. It supports JavaScript and TypeScript codebases. Give it a try and let me know what you think!