Classic challenge, new tool – addressing browser image caching issues with Hugo fingerprinting

Have you ever been on a support call with a service or vendor and been told: “Please, clear your browser cache!”?

Or maybe you’ve been on the development side: you upload a new image to your website, only for users to keep reporting that they still see the old one. Then you have to ask all of your users to refresh their browsers, which is not an easy thing to do.

It’s a classic challenge with the caching of assets like images, and it generally has only one solution that works in all cases.

Recently we had the same challenge with the new website at BeMe.ai and I thought: “Let’s explore a tool from the perspective of this challenge”. The tool is a static website generator called Hugo and the job to be done is to address the image caching challenge so that our parents of autistic children always see the latest image on our website developed by our illustrator – Antonio.

(Image of a parent and an autistic child. Used at https://beme.ai)

In this article I will go through what the challenge is, why there is a caching challenge, and how it could be addressed without any framework, with Rails (Ruby on Rails) and with Hugo. My hope is to give you a good understanding of how the Hugo team thinks, in what terms and in what direction. This is not a Hugo tutorial, but rather an account of how I focused on one challenge and tried to address it with Hugo.

The challenge of browser caching

When a browser renders a webpage with images, it requests the images from the server where the website is hosted. Images require much more bandwidth than text, and without caching the browser would download the same image again on every visit. Images do not change often on websites, so browsers prefer to cache them: the images are saved in the local storage of your device, be it mobile or desktop. The next time you open the same webpage, the browser will not request the image from the server, as there is already a copy of it in your local storage.

The question here is: how does the browser know that the image it has in its local storage is the same as the one on the server? What if you’ve changed the image in the meantime and there is a new version on the server? It happened to us with an image that Antonio developed.

The answer is remarkably simple: the browser treats the image in its local storage and the image on the server as the same if they have the same “url/name”.

Let me illustrate a scenario

A week ago:
A parent visited our website. The browser rendered the image called parent-with-child.png.
It stored a copy of parent-with-child.png in its local storage.

Image from BeMe.ai that had the wrong dashboard on the phone

Two days ago:
Antonio developed a new image of the parent and the child and we uploaded it to BeMe.ai. From now on the image located at parent-with-child.png is the second version.

Image improved by Antonio to have the correct dashboard on the phone. Taken from the BeMe.ai website.

Today:
The same parent visits the website again. The browser asks the server what’s on the page and the server responds that the page contains a link to parent-with-child.png. As the browser already has a local copy of parent-with-child.png, it will not request this resource from the server. It will just use the local copy. This saves bandwidth and the site opens faster. It’s a better experience, but it is the old image.

Which one will be shown to the user?

What really makes this problem difficult is the fact that different browsers behave in different ways. The internet has tried many different solutions to this challenge, including different headers in the HTTP protocol. Still, there are times when the user will just not see the new version of the image. It can drive you really crazy, because some users will see the new version while others still see the old one.

How big of a challenge is this?

Technically it is not a great challenge, yet I’ve seen experienced engineers miss this small detail. I’ve missed it a couple of times myself. Every such case is a load on the support channel of your organisation and on engineering. So it is better to avoid caching issues altogether.

What’s the solution?

There are many solutions. There is only one that works across all browsers and devices and across all versions of the HTTP protocol that power the internet.
The solution is remarkably simple:
The browser will look at the URL of the image. If there is a local copy stored for this URL, the browser will use the local copy. If there is no local copy for the URL the browser will request the new image.
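To make the lookup rule concrete, here is a deliberately simplified sketch in Python. It is a toy model and not how any real browser is implemented: it ignores HTTP caching headers entirely, and the `server` and `browser_cache` dictionaries and the `load_image` function are hypothetical names I made up for the illustration.

```python
# A toy model of a browser image cache that is keyed by URL only.
# Real browsers also consult HTTP caching headers; this sketch ignores them.

# Hypothetical "server": maps a URL to the image bytes currently stored there.
server = {"/images/parent-with-child.png": b"version-1"}

# The browser's local storage, keyed by URL.
browser_cache = {}


def load_image(url):
    """Return the image for url, preferring the local copy if one exists."""
    if url in browser_cache:
        return browser_cache[url]  # no request is made to the server
    data = server[url]  # simulate the network request
    browser_cache[url] = data
    return data


first_visit = load_image("/images/parent-with-child.png")

# The image is replaced on the server under the same URL.
server["/images/parent-with-child.png"] = b"version-2"
stale = load_image("/images/parent-with-child.png")

# The same new image, uploaded under a new name, is fetched fresh.
server["/images/parent-with-child-v2.png"] = b"version-2"
fresh = load_image("/images/parent-with-child-v2.png")

print(first_visit, stale, fresh)
```

In this model the second request for the unchanged URL returns the old bytes, while the request for the renamed file returns the new ones.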

All we have to do the next time we change the parent-with-child.png image is to upload it with a new file name. parent-with-child-v2.png is probably a good new name.
Other good names include:

  1. parent-with-child-2021-12-19-13-15.png (it has the date and time on which it was uploaded)
  2. parent-with-child-with-an-added-book.png (it is different and descriptive)
  3. parent-with-child-1639913778.png (it has in its name the time in seconds since the UNIX epoch when the file was created)
  4. parent-with-child-2f11cf241d023448c988c3fc807e7678.png (it has an MD5 hash code)
  5. parent-with-child-a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56.png (it has a SHA256 sum in its name)

All it takes to resolve the browser caching challenge is to change the name of the picture when you upload it and to change it to something unique.

That’s all it takes – no need for frameworks and fingerprinting and assets precompilation and all the other things that follow in the article.

All it takes to address and successfully resolve any browser image caching issue is to change the name of the image next time you upload a new version of it.

Why does the simple solution not work?

I think it does not work because we are humans and we forget. We create a new version of the parent-with-child.png image and we forget to change the name of the image. That’s it. We just forget.

Computers, on the other hand, are good at reminding us of things and at doing the dull work that we often forget to do. What we could ask the computer to do is to create a new name for the image every time we upload a new version. Enter fingerprinting.

Fingerprinting

Fingerprinting is the process of looking at the bits of the image (or generally any file) and calculating a checksum. For all practical purposes the checksum is unique to the content of the file: change a single bit and you get a different checksum. After we calculate the checksum, we add it to the name of the file.

Example:

  1. We upload the original parent-with-child.png image and the computer calculates the checksum a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56. Then it sets the name of the file on the server to parent-with-child-a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56.png.
  2. We upload a new version of the parent-with-child.png image and the computer calculates its checksum, which is cdc72fa95ca711271e9109b8cab292239afe5d1ba3141ec38b26d9fc8768017b. Then the computer sets the name of the file on the server to parent-with-child-cdc72fa95ca711271e9109b8cab292239afe5d1ba3141ec38b26d9fc8768017b.png.
  3. We upload another new version, a new checksum is calculated and a new name is generated. This is done with every new image.

How checksums are calculated is a topic for another article. Computers are good and fast at calculating checksums. Humans are terrible at it. It would literally take us days of manual calculation to come up with the checksum of an image file by hand.
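As a small illustration of the naming step, here is a sketch in Python that uses the standard `hashlib` module to calculate a SHA256 checksum and add it to the file name. The `fingerprinted_name` function and the placeholder bytes are assumptions made for the example. It only shows the idea and is not a suggestion to hand-roll this on a real website.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename, content):
    """Return filename with the SHA-256 checksum of content added before the extension."""
    checksum = hashlib.sha256(content).hexdigest()
    path = PurePosixPath(filename)
    return f"{path.stem}-{checksum}{path.suffix}"


# Placeholder bytes standing in for two versions of the image file.
version_1 = b"first version of the image"
version_2 = b"second version of the image"

name_1 = fingerprinted_name("parent-with-child.png", version_1)
name_2 = fingerprinted_name("parent-with-child.png", version_2)

print(name_1)
print(name_2)
```

The same content always produces the same name, and different content produces a different name, which is exactly the property we need for cache busting.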

What’s difficult with fingerprinting?

The difficult part is not the fingerprinting itself. What’s difficult is finding every HTML page on your website where the image parent-with-child-a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56.png is used and replacing this name with parent-with-child-cdc72fa95ca711271e9109b8cab292239afe5d1ba3141ec38b26d9fc8768017b.png.

This means that every occurrence on the website of

<img src="https://www.beme.ai/images/parent-with-child-a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56.png">

should be updated to contain the new URL

<img src="https://www.beme.ai/images/parent-with-child-cdc72fa95ca711271e9109b8cab292239afe5d1ba3141ec38b26d9fc8768017b.png">

The good thing is that computers are also good at this: searching through the content of many files and replacing specific parts of these files with new content.
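The replacement step can be sketched in a few lines of Python. To keep the example self-contained the pages live in an in-memory dictionary instead of files on disk, and the page paths, the shortened hashes and the `replace_references` function are all hypothetical. It is an illustration of the idea only; the frameworks discussed later do this for us.

```python
def replace_references(pages, old_name, new_name):
    """Return a copy of pages with every occurrence of old_name replaced by new_name.

    pages maps a page path to its HTML source.
    """
    return {path: html.replace(old_name, new_name) for path, html in pages.items()}


# Hypothetical site content; the short hashes are placeholders.
old_name = "parent-with-child-a6e0922d.png"
new_name = "parent-with-child-cdc72fa9.png"

pages = {
    "index.html": f'<img src="https://www.beme.ai/images/{old_name}">',
    "blog/post.html": f'<p>Intro</p><img src="/images/{old_name}">',
    "about.html": "<p>No images here</p>",
}

updated = replace_references(pages, old_name, new_name)

for path, html in updated.items():
    print(path, html)
```

Pages that do not mention the old name are left unchanged, so the function can be run over the whole site.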

What needs to be implemented then? What’s the process?

What we need from our process of deploying new versions of images to our website is the following:

  1. Ask our illustrator, Antonio in our case, to develop a new version of the parent-with-child.png image.
  2. Put this new picture on our website and the computer should magically:
      – calculate a new checksum of the image and change the name of parent-with-child.png, i.e. fingerprint the image
      – find all references on our website to the previous version of the image and replace each reference with the new name of the image
     

Pure Bash and Linux implementations

Linux provides simple tools such as sha256sum, grep, sed and mv, and by combining them we can come up with a pretty decent solution. I am not going to walk through that here, because we might then decide it is a good idea to do it this way. That could take us down a path where we are reinventing the wheel with different bash scripts all over the infrastructure and code, and there is no need for that. If you are already on this path I cannot stop you, but I don’t want to be the one guiding you along it. Been there, done that, and after many years I realised that it was not a very wise decision.

Doing it the Rails way

I am a big fan of Rails. Rails addresses the browser image caching challenge with something called the “Asset Pipeline”.

What we do in Rails is use the image_tag method in all HTML pages. The syntax below is an ERB template, where we use “<%  %>” inside the HTML that is processed on the server side.

<div class="container">
  <!-- Logo -->
  <%= link_to image_tag("logo.png", alt: "BeMe.ai", class: "img-fluid", width: 55), main_app.root_path, class: "navbar-brand" %>
  <!-- End Logo -->

Note that here we use the name “logo.png” and the image_tag method handles everything for us. The generated HTML is:

<div class="container">
  <a class="navbar-brand" href="/"><img alt="BeMe.ai" class="img-fluid" width="55" src="/assets/logo-60ffa36d48dfd362e6955b36c56058487272e3030d30f0b6b40d226b8e956a2b.png"></a>

Note how the file that we referred to as logo.png in the template now becomes /assets/logo-60ffa36d48dfd362e6955b36c56058487272e3030d30f0b6b40d226b8e956a2b.png in the HTML rendered on the client.

Rails has done everything for us: fingerprinting and replacing. Thanks to the Asset Pipeline in Rails we’ve successfully resolved the browser image caching challenge.

Doing it the Hugo way

Hugo is different from Rails. Hugo is a static website generator and it thinks in different terms from Rails. Yet it has its own asset pipeline, which I will call the Hugo Pipeline (the Hugo documentation calls it Hugo Pipes). Before we enter into the Hugo Pipeline it is good to have a small introduction to Hugo.

Hugo allows authors to create markdown documents

Hugo thinks about the authors. It makes it possible for team members to develop the content of the website without knowing how to start and deploy Rails applications. Which is good.

This means that authors could create a markdown document like this:

# Basic misconceptions about autism

In this blog post we will talk about basic misconceptions about autism.

Let's start with this picture of an autistic child and a parent

![Picture of misconceptions](parent-and-child.png)

...

Because this is a markdown document, someone with medical knowledge can develop the content on their own. There is no need for someone with both medical and HTML/Rails/web development expertise (and such people are difficult to find).

Now the author has added a new version of the parent-and-child.png image and it again has the name parent-and-child.png. We should somehow ask Hugo to add a fingerprint and replace all the references to the image with a reference to the new image.

Hugo in 1 paragraph – content, layouts, markdown hooks

In Hugo the content developers write the content in Markdown format. The engineer creates the HTML layouts. Hugo takes a layout and adds the content to it to generate the HTML page that is shown to the user. Every time a Markdown element such as an image is converted to HTML, Hugo can call a Markdown render hook. The job of the hook is to convert the Markdown to HTML, and its logic is implemented in the Go template language. Hugo has a default rendering for every Markdown element. We can override the default that converts the Markdown containing the image parent-and-child.png to HTML by creating the file layouts/_default/_markup/render-image.html.

Fingerprinting of content images in Hugo with the Hugo Pipeline

Fingerprinting of content images is not enabled by default. We have to be explicit that we want it. The Hugo Pipeline handles the rest with functions like “fingerprint”.

Here is the content of layouts/_default/_markup/render-image.html:

<!-- layouts/_default/_markup/render-image.html -->

{{/* Get a resource object from the destination of the image, which is parent-and-child.png */}}
{{/* The parent-and-child.png image is specified in the markdown and must be available in the */}}
{{/* assets folder */}}
{{ $image := resources.GetMatch .Destination }}

{{/* Calculate the fingerprint, add it to the name and get the Permalink, which is the full URL of the file */}}
{{ $url := ($image | fingerprint).Permalink }}

<img src="{{ $url }}"/>

When processed, Hugo will generate an index.html file that contains:

<img src="http://example.org/parent-and-child.a6e0922d83d80151fb73210187a9fb55ee2c5d979177c339f22cb257dead8a56.png"/>

Summary

Image fingerprinting is guaranteed to resolve the browser caching challenge in 100% of the cases.
It is a topic that is often overlooked, both by content developers and engineers.
Without it we often end up with users seeing the wrong images and having to “clear their browser cache and refresh again”.
It is easy to address with many different available tools.

We’ve looked at how to implement it with
  – Linux
  – Rails
  – Hugo
 
Asking users to “clear browser cache” and “refresh a website” is a failure of the process and the engineering organisation. It should not happen, and I am sure we could be better than this.