When It Comes To Cache Hit Ratio And CDNs, The Devil Is In The Details

The term “cache hit ratio” is used so widely in the industry that it’s hard to tell exactly what it means anymore, or what methodology sits behind how it’s measured. When Sandpiper Networks first invented the concept of a CDN (in 1996), and Akamai took it to the next level by distributing the caching proxy “Squid” across a network of global servers, that caching focused largely on images. But now we need to ask ourselves whether focusing on overall cache hit ratio as a success metric is the best way to measure performance on a CDN.

In the late 90s, many of the Internet’s web applications were served from enterprises with on-premises data centers, generally over much lower-bandwidth pipes. One of the core issues Akamai solved was relieving bandwidth constraints at localized enterprise data centers. Caching images was critical to moving bandwidth off local networks and bringing content closer to the end user.

But fast forward 20 years and the Internet of today is very different. Pipes are bigger, applications are more complicated, and users are more demanding with respect to the performance, availability, and security of those applications. So, in this new Internet, is the total cache hit ratio for an application a good enough metric, or is there a devil in the details? Many CDNs boast of customers achieving cache hit ratios around 90%, but what does that really mean, and is it really an indicator of good performance?

To get into cache hit ratios we must think about the elements that make up a webpage. Every webpage delivered to a browser comprises an HTML document plus other assets, including images, CSS files, JS files, and Ajax calls. HTTP Archive tells us that, on average, a web page contains about 104-108 objects, coming from 19 different domains. The average breakdown of asset types served per webpage across all HTTP Archive sites tested looks like this:

[*Average asset-type breakdown per webpage, from HTTP Archive]

Most of the assets delivered per web page are static. On average, nine may specifically be of content type HTML (and therefore potentially dynamic), but usually only one will be the initial HTML document. An overall cache hit rate for all of these objects tells us what percentage of them is served from the CDN, but it does not give developers the details they need to truly optimize caching. A modern web application should have most of its images, CSS files, and other static objects served from cache. Does a 90% cache hit ratio on the above page tell you enough about the performance and scalability of the application serving that page? Not at all.
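
To make the by-type view concrete, here is a minimal sketch of computing cache hit ratio per asset type from raw log records. The log fields shown are hypothetical; substitute your CDN’s actual log schema:

    from collections import defaultdict

    # Hypothetical log records; real CDN logs carry equivalent fields.
    logs = [
        {"content_type": "text/html", "cache_status": "MISS"},
        {"content_type": "image/png", "cache_status": "HIT"},
        {"content_type": "text/css",  "cache_status": "HIT"},
        {"content_type": "image/png", "cache_status": "HIT"},
    ]

    hits, totals = defaultdict(int), defaultdict(int)
    for record in logs:
        totals[record["content_type"]] += 1
        if record["cache_status"] == "HIT":
            hits[record["content_type"]] += 1

    for ctype in totals:
        print(f"{ctype}: {hits[ctype] / totals[ctype]:.0%} of {totals[ctype]} requests")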

The performance and scalability of a modern web application are often largely dependent on its ability to process and serve the HTML document. Producing the HTML document is very often the largest consumer of compute resources in a web application. When more HTML documents are served from cache, less compute resource is consumed, and the application therefore becomes more scalable.
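
As illustrative arithmetic (all figures below are hypothetical), origin compute for HTML scales with the miss rate, so raising the HTML cache hit ratio from 0% to 90% cuts rendering work tenfold:

    requests_per_sec = 1000   # hypothetical HTML request rate
    cpu_ms_per_render = 50    # hypothetical cost to render one document

    for chr_html in (0.0, 0.5, 0.9):
        misses = requests_per_sec * (1 - chr_html)
        print(f"HTML CHR {chr_html:.0%}: {misses:.0f} origin renders/s, "
              f"{misses * cpu_ms_per_render / 1000:.1f} CPU-seconds per second")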

HTML delivery time is also critical to page load time and start render time, since the HTML document is the first object delivered to the browser and blocks the delivery of all other resources. Serving HTML from cache can generally cut HTML delivery time to circa 100ms, significantly improving user experience and the user’s perception of page speed. Customers should seek to understand the cache hit ratio by asset type, so developers can specifically target improvements in cache hit rates per asset type. This would result in faster page load times and a more scalable application.

For example, seeking closer to 100% cache hit rates for CSS files, JS files, and possibly images would seem appropriate, as would understanding what cache hit rate is being achieved on the HTML.

[*Snapshots from the section.io portal]

While not all HTML can be served from cache, the configurability of cache solutions like Varnish Cache (commercially available through Varnish Software, section.io, and Fastly) and improved HTML management options such as HTML streaming (commercially available from Instart Logic and section.io) have made it possible to cache HTML. In addition, new developer tools such as section.io’s Developer PoP allow developers to configure and deploy HTML caching more safely, without risking incidents in production.
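
How an application opts its HTML into shared caching varies by platform, but the underlying mechanism is usually a Cache-Control header. As a minimal sketch (using Flask purely for illustration; the route and TTL are made up), an origin might mark HTML as cacheable by a CDN or Varnish while keeping browsers revalidating:

    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/")
    def home():
        resp = make_response("<html><body>Hello</body></html>")
        # s-maxage applies only to shared caches (CDN/proxy); max-age=0
        # keeps browsers revalidating so personalised variants are not reused.
        resp.headers["Cache-Control"] = "public, max-age=0, s-maxage=300"
        return resp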

Many CDNs focus on overall cache hit rate because they do not encourage their users to cache HTML. A 90% cache hit rate may sound high, but when you consider that the 10% of elements not cached are the most compute-heavy, a different picture emerges. By exposing cache hit ratio by asset type, developers can see the full picture of their caching and optimize accordingly. The result is builders and managers of web applications who can more effectively understand and improve the performance, scalability, and user experience of their applications. That is where the industry needs to head.

  • Alexander Leschinsky

    Thanks for pointing out this important metric. Exposing cache hit ratio (CHR) by asset type is a good start, but we know that our customers like to go even deeper into the details, as the ability to cache can vary greatly even within an asset type. I just looked into a random example of the CHR for 25 million CSS files per day for one of the configurations we deliver. While the weighted average CHR for a given hour was 99.87%, the individual URLs’ CHRs ranged between 68% and 100%.

    To identify bad CHRs, our customers make extensive use of the Akamai offload reports. These views show individual URL hits and CHR and can be filtered by anything from simple prefixes and suffixes up to fully fledged regular expressions. We usually export the filtered lists to CSV or XML to drill down further in Excel or Elasticsearch. Combining CHR with the actual number of hits for an asset allows for the weighted calculation I mentioned, as sketched below.
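
    A minimal sketch of that weighted calculation (the URLs and numbers here are made up, not taken from any real report): weight each URL’s CHR by its request volume, so hot URLs dominate the average while cold URLs can hide very low individual CHRs:

        # Hypothetical per-URL stats: (requests, cache hits).
        per_url = {
            "/styles/app.css":  (24_000_000, 23_995_000),
            "/styles/rare.css": (1_000, 680),
        }

        total_req = sum(req for req, _ in per_url.values())
        total_hit = sum(hit for _, hit in per_url.values())
        print(f"weighted CHR: {total_hit / total_req:.2%}")

        for url, (req, hit) in per_url.items():
            print(f"{url}: {hit / req:.2%} over {req} requests")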

    As we also work a lot with Varnish, I’d like to mention the Akamai Connector for Varnish, which was released earlier this year. Among other things, you can use it to configure caching behaviour for both Varnish and Akamai to consistently optimise your CHR.

  • Alexandre

    Hello,
    Indeed, it’s a very interesting metric, and CDN providers don’t speak much about it.
    Do you have any baseline for the hit ratio of a large image backend (TTL = 1 year, with more than 1 million files at the backend)?
    Thanks,
    Alexandre

  • Jason Hofmann

    At Limelight Networks, we agree with you! The problem of taxonomy and classification is not an easy one.

    There’s an old CS joke: “There are 2 truly difficult problems in Computer Science: 0: Naming things, 1: Cache invalidation, and 2: Off by one errors”

    When we launched SmartPurge, we tackled the famed Cache Invalidation problem and, by most accounts, beat it into submission with the world’s most flexible cache invalidator and cache evictor – invalidating *AND deleting* (that’s a big deal) up to 1,000 *PATTERNS* *PER UI OR API REQUEST* within a couple of seconds!

    Naming things is still “A Hard Problem”.

    A server has a cache hit ratio – for requests as well as for bytes, for varying types of content, and for both cacheable and non-cacheable content. Sometimes only part of a request was served out of cache. How do you count that?
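
    One way to count it (a sketch with made-up numbers, not necessarily how any given portal counts it) is to measure CHR in bytes alongside CHR in requests, so a partial hit contributes fractionally:

        # Hypothetical records: bytes served vs. bytes served from cache.
        requests = [
            {"bytes_total": 1_000_000, "bytes_from_cache": 1_000_000},  # full hit
            {"bytes_total": 1_000_000, "bytes_from_cache":   600_000},  # partial hit
            {"bytes_total": 1_000_000, "bytes_from_cache":         0},  # miss
        ]

        full_hits = sum(r["bytes_from_cache"] == r["bytes_total"] for r in requests)
        byte_chr = (sum(r["bytes_from_cache"] for r in requests)
                    / sum(r["bytes_total"] for r in requests))
        print(f"request CHR: {full_hits / len(requests):.0%}, byte CHR: {byte_chr:.0%}")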

    A Point of Presence has a cache hit ratio, but sometimes that Point of Presence is used as the root of a cache hierarchy and so gets “credited with” all the cache misses back to origin in lieu of the POP that the user connected to.

    And the CDN as a whole has a cache hit ratio (again, for requests, for bytes, and for varying types of content).

    In our Control customer portal, we already show “CDN Efficiency” (Cache Hit for the CDN as a whole) as a percentage of both bytes and requests. We also report on:

    + Cache Hit Ratio as a % of Bytes for each of the most popular URLs and URL prefixes (paths)
    + Cache Hit Ratio as a % of Bytes by file size bucket (e.g. 0-512 Bytes, 16K-32K, 1M-2M)
    + Cache Hit Ratio as a % of Bytes by Content Type (MIME Type, e.g. application/javascript, application/json, text/css, text/html)
    + Cache Hit Ratio as a % of Bytes or Requests filterable by the First Touch Server (that the request first landed on) or the Last Touch Server (the server that either had all the content or had to fetch some or all bytes from origin), further grouped by cache status (standard, refresh check, IMS, negatively cached), status code or status code family (e.g. 200, 206 or 4xx, or custom filter) and by cache result (Hit, Miss, Edge Redirect, Other)
    + Cache Hit Ratio as a % of Bytes by User Agent
    + Cache Hit Ratio as a % of Bytes by Referrer URLs

    And most of the above can be further filtered by applying a pre-defined Data Segment. Data Segments can be applied either to Published URLs or Origin URLs, and may match any portion of a URL using a Regex pattern.
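
    Generically, the idea behind such a segment is to filter log records with a URL regex before computing CHR; a minimal sketch (illustrative only, not the Control portal’s actual API):

        import re

        # Hypothetical segment: image assets under /images/.
        segment = re.compile(r"^/images/.*\.(png|jpe?g|webp)$")
        records = [
            {"url": "/images/logo.png", "cache_status": "HIT"},
            {"url": "/api/cart",        "cache_status": "MISS"},
            {"url": "/images/hero.jpg", "cache_status": "MISS"},
        ]

        matched = [r for r in records if segment.search(r["url"])]
        hits = sum(r["cache_status"] == "HIT" for r in matched)
        print(f"segment CHR: {hits}/{len(matched)} = {hits / len(matched):.0%}")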

    And we have plans to add more!

    Jason Hofmann
    VP, Architecture
    Limelight Networks

  • John Smith

    Sorry for going off topic, but Dan, have you noticed that Net Insight refer to Amazon in their latest information video about Sye? They illustrate the synced performance of Sye through Amazon Web Services’ web servers in Ireland. Do you think this might be an indication that Amazon could make Sye a part of their powerful expansion in the field? Sye would instantly give Amazon a strong upper hand over the competition in live streaming.