David Roe

Author


Using Varnish To Speed Up WhatClinic.com

Varnish makes websites fly... once you iron out a few issues.

[Here’s a bit of background about our previous cache setup. Skip ahead to “Using Varnish To Cache WhatClinic.com” if you want to jump straight into the Varnish section.]

Our main website is built on Microsoft’s IIS and we have been using its built-in page and component level caching to serve html pages for several years. This built-in caching is easy to set up and quite flexible, but it is very memory hungry.

The memory issue isn’t much of a problem on small static websites with only a couple of hundred pages. Unfortunately though, WhatClinic.com is a dynamic site with potentially millions of individual pages to serve. Typically we were getting only 12% of our pages served from the cache, and sometimes this was as low as 6%. It was almost pointless running the cache at all.

The biggest problem for us is the breadth of the website. On a typical day we have 30,000 unique visitors, but they land on 23,000 distinct URLs. Over the course of a month this balloons to 145,000 distinct landing pages.  Worse still, they look at over half a million distinct pages on the site.

To try and improve the performance of the existing IIS cache we tried writing the page cache to disk. Under test conditions with relatively small numbers of pages this worked well, but to get even 50% of our pages from one month’s visits in the cache it meant having 250,000 pages written to the disk. In the end the NT file system on our servers started grinding to a halt, not because of request volume but purely because of the number of individual files involved.

Using Varnish To Cache WhatClinic.com

We came up with some ways around the NT file system problem but decided in the end it would be better to move the cache off the main box altogether. At the same time we decided to look at Varnish as a solution, with a view to hosting it on AWS.

On the upside Varnish is lightweight and powerful, but it also introduced a number of new problems for us to overcome:

1. Varnish Caches Cookies

We use a cookie to store all kinds of information about a new visitor, including things like their country of origin, so we can display clinics’ prices in the visitor’s local currency. To get around Varnish serving up pages based on one person’s cookie all the time we had to move our cookie drop into a JavaScript call rather than doing it on the page. No big deal, but something to be aware of.

2. All Requests Go Through The Varnish Box

To determine a visitor’s location we look at their IP address, but since all requests were going through the Varnish server, our own server only ever saw one IP address: that of the Varnish box. We changed the code to pass the visitor’s original IP address along with the request so we could pick it up on our own server.
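As a rough illustration of picking the address up on the back end, here is a minimal sketch in Python (not our actual server code). It assumes the cache forwards the address in the conventional X-Forwarded-For header and that there is exactly one cache box in front of the server; the function name and the addresses are hypothetical.

```python
def client_ip(headers, remote_addr, trusted_proxies):
    """Return the visitor's IP, trusting X-Forwarded-For only from our own cache box."""
    forwarded = headers.get("X-Forwarded-For", "").strip()
    if remote_addr in trusted_proxies and forwarded:
        # The header is a comma-separated chain. With a single cache in front,
        # the last entry is the one our own cache appended, so it cannot be
        # spoofed by whatever the visitor's browser sent.
        return forwarded.split(",")[-1].strip()
    # Request did not come through the cache, or carried no header:
    # fall back to the address of the socket itself.
    return remote_addr
```

Only trusting the header when the request really came from the cache box matters: otherwise anyone hitting the server directly could claim any address they liked.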

Problem solved, except now our default access logs don’t record the proper IP address of each visitor. We use Google Analytics and our own logs for the bulk of our reporting so this isn’t a big deal, but at some point we might have to look at writing our own access logs with the visitor’s real IP address, if only to give us the peace of mind that when something does go wrong we can track it in the raw log data.

3. Altering Our Landing Pages

Depending on whether you have just landed on WhatClinic.com, or are browsing subsequent pages, we alter the layout of the page. The layout differences are quite extensive even though the data is all the same, so it isn’t efficient to make the changes on the client side. We need to cache two different versions of the same page.

The solution involved getting Varnish to pass along the referring URL and using something like (isReferringDomainWhatClinic.com) as part of the key for the cache as well as the requested URL itself. In the end this was pretty easy to do too, but it did double the number of pages in the cache. However, we were trying in particular to improve the speed of our landing pages so it is worth it to us.
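The idea of the two-part cache key can be sketched like this. It is written in Python purely for illustration (the real thing lives in the cache configuration), and the function names and the key format are assumptions.

```python
from urllib.parse import urlsplit

SITE_DOMAIN = "whatclinic.com"  # assumption: any referrer on this domain counts as internal


def is_internal_referrer(referrer):
    """True when the visitor arrived from another page on our own site."""
    host = (urlsplit(referrer or "").hostname or "").lower()
    return host == SITE_DOMAIN or host.endswith("." + SITE_DOMAIN)


def cache_key(url, referrer):
    """Build a key that keeps the landing and browsing layouts of a URL apart."""
    layout = "browse" if is_internal_referrer(referrer) else "landing"
    return f"{layout}|{url}"
```

Reducing the referrer to a yes/no flag is what keeps the cache from exploding: keying on the full referring URL would create a separate entry for every search query that led to the page.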

4. Time To Live

As we said in the intro, we have a very broad site. Our pages also change quite infrequently, so we wanted to have the maximum possible time to live for the cached pages, in the order of several months. However, some pages do change, and a change to any one of our customers’ data may have effects that ripple over hundreds of pages that their clinic might appear on.

The solution was to set our time to live to several months, and then remove pages from the cache only when they had been updated. Having implemented a means to remove the pages from our cache, we then had to determine when a change to a clinic’s data had occurred and which pages were affected by the change, so we knew which pages to remove from the cache and update.

Working out exactly which pages were affected turned out to be a little problematic but we solved it eventually and we’re reasonably happy that we’ve covered all the cases. We also coded a big red “Remove All This Clinic’s Data From The Cache” button for use in case of emergencies.
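One possible way to track which pages a clinic appears on is a simple reverse index, sketched below in Python. This is an illustration under assumptions, not a description of what we built: it assumes each page can report the clinics shown on it when it is generated, and the class and method names are hypothetical.

```python
from collections import defaultdict


class PurgeIndex:
    """Remembers which cached URLs each clinic appears on."""

    def __init__(self):
        self._pages_by_clinic = defaultdict(set)

    def record(self, url, clinic_ids):
        """Call when a page is generated, with the clinics shown on it."""
        for clinic_id in clinic_ids:
            self._pages_by_clinic[clinic_id].add(url)

    def pages_to_purge(self, clinic_id):
        """Return (and forget) every URL that must be removed for this clinic."""
        # The URLs are re-recorded the next time the pages are regenerated.
        return sorted(self._pages_by_clinic.pop(clinic_id, set()))
```

The list returned by pages_to_purge is what would be fed to the cache's removal mechanism, both for routine updates and for the emergency case.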

The Results

Overall, it has been a big win. After about three weeks of operation we have a page hit rate of around 65%, which is a huge improvement from the 12% we used to get. Cached pages are returned now somewhere in the order of 100-200ms instead of 2000-5000ms, and the load on our server has dropped dramatically, improving performance for those pages which are never going to be in the cache too.

Of course, having improved the efficiency of generating the page html, we are now looking at the speed of all our own JavaScript, our external calls to analytics, our social media buttons and other external client-side calls.

Performance improvements never end, do they?

Google Instant Adds 90Kb To Your Search

Google Instant is an amazing piece of technology. However, I imagine, like most techies, the question that first springs to mind is “Oh my god, how much data is this sucking down?!?”

The answer of course is: “it depends”. It depends to a large degree on the kind of results you’re going to see, how many results there are on the page, whether they have maps in them or images, and lots of other factors. It also depends on how accurate the query suggestions are at guessing what you are going to type, since the more accurate it is the fewer times it will have to re-fetch results from the server.

For instance, let’s say I’m going to look up train timetables from Victoria Station, London. I start typing, and when I put in the first letter ‘v’ Google makes a wild guess that I’ll be looking for Verizon and grabs down results for it. So far 13.5Kb of search result data has been sucked down, an increase of just under 13Kb over the non-instant option, which just sucks down the suggested search queries, not the results themselves.

Victoria's Secret not Victoria Station

When I type the letter ‘i’, Google realises I’m not looking for Verizon and decides I must be looking for Victoria’s Secret. That adds another 29Kb to be sucked down, which includes a couple of images. (Which are pretty tame by the way, I have safe search on at work).

Now, 29Kb is pretty small. Google have compressed the data, and since search result data is very compressible it averages about a 70% bandwidth saving, good for what is essentially pure text with some images thrown into the data.

From that point until I get all the way to ‘Victoria St’, my results stay static, since it looks increasingly likely that I’m looking for lingerie. However, there is another 10Kb pulled down, or about 1.4Kb per keystroke. This isn’t results, just different suggestions being cycled through the list as I type (vicodin, victoza, victor), but Google is still showing results for what it thinks is the most likely option – Victoria’s Secret.

This behaviour is the same as it is for the existing search suggestions so we’ll discount the data for that. When I’ve got to ‘Victoria St’ Google realises its embarrassing mistake and decides that I must be searching for Victoria Stilwell the famous dog trainer. That adds another 25Kb, again with images encoded into the results.

Victoria Station

When I get to ‘Victoria Sta’ the penny drops and Google gets Victoria station results, which weigh in at just 11Kb, with no images, and from then on to the end, the results don’t change, except for the cycling dance of other possible auto complete suggestions (victoria stafford, victoria station salem etc.)

In total then Google Instant added 89Kb in downloaded data over and above what a previously standard experience would have required. A tiny test of 20 other random queries from my own search history shows this to be pretty average. Obviously maps and image data which are not in the final result set add to this, but calling it 90Kb extra per search (with 6 queries in the search) seems to be in the ballpark.

This maps pretty well to Google’s own expected figures. They reckon they’ll see 5 to 7 extra search results fetched as a result of an Instant search, and presumably they know what they’re talking about. How much it is used and how accurate it will be is anyone’s guess.

Taking Google’s current round estimate of 1 billion searches per day, 6 as the midpoint of their estimate of extra result fetches per search, and my finger-in-the-air figure of 15Kb of data for each extra set of results, we get a pretty measly 85 Terabytes of extra data leaving Google’s server farms each day. The average UK user, who makes around 4 searches per day, gets an extra 360Kb per day down their internet connection. This is hardly a noticeable amount of data for a corporation that deals in Petabytes for its indexing of the web. Similarly, 360Kb is hardly noticeable for a user with even the slowest of broadband connections.
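For anyone who wants to check the back-of-the-envelope sums, here they are in Python. The inputs are the estimates quoted in this post; the variable names are just labels for illustration.

```python
SEARCHES_PER_DAY = 1_000_000_000   # Google's round estimate
EXTRA_FETCHES = 6                  # midpoint of the 5 to 7 extra result fetches
KB_PER_FETCH = 15                  # finger-in-the-air size of one set of results
UK_SEARCHES_PER_USER = 4           # average searches per UK user per day

extra_kb_per_search = EXTRA_FETCHES * KB_PER_FETCH              # 90 Kb per search
total_kb_per_day = SEARCHES_PER_DAY * extra_kb_per_search

total_tb_decimal = total_kb_per_day / 1000**3    # counting 1000 Kb per Mb, and so on
total_tib_binary = total_kb_per_day / 1024**3    # counting 1024 Kb per Mb, and so on

uk_user_kb_per_day = UK_SEARCHES_PER_USER * extra_kb_per_search  # 360 Kb per day
```

Depending on whether you count in powers of 1000 or 1024, the daily total comes out at 90 or roughly 84 Terabytes, which is in the same ballpark as the figure quoted above.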

But is there any point? In all my use of instant so far, it’s felt like no more than a bothersome distraction. I do use Google suggest pretty often for long tail searches, and it’s easy to see if what’s being suggested describes what you want to type.

However, looking down at the search results is a further glance away, and the information takes longer to interpret. It feels unnatural to me. If I’m typing a query string, I’m typing text, so a suggestion of what I am going to type may be helpful.

On the other hand a suggestion of search results for that query isn’t what I have in my mind. It’s another step away from the thought in my brain that millisecond.

Time will tell I suppose, but if Google Instant isn’t an instant hit I’d expect to see it become opt-in rather than opt-out for Google users pretty quickly.

How have you found using Google Instant so far – do you like it, hate it, or haven’t really noticed it at all? Share your thoughts in the comments.

Email is notoriously open to fraud. It’s built upon old protocols which tell the user very little that is definitely true about its content and source. The most basic thing about an email, the “from” address, can easily be spoofed by a sender who wants to pretend to be someone else, especially spammers.

SPF (Sender Policy Framework) is a protocol that is built on top of email. When a message arrives at the server receiving your email (Google, Yahoo, or your own company’s email server) the email claims to be from person@domain.com, but it might not really. All we know for sure is that it definitely comes from some IP address.

Using SPF, your email server can check with the domain that the email purports to be from to see if the IP address you got the email from is one that they use to send email. If domain.com says no, they don’t send emails from that IP address, then the email may be spam. If the domain says yes, then it’s probably not.

Probably; however, this isn’t certain. Not all email senders support SPF. It is voluntary but very widely used. So, just because the sender doesn’t support it doesn’t mean all emails from that domain are spam. Conversely, people send spam from perfectly respectable domain names all the time, so just because you do get a valid SPF record back that matches the “from” domain, doesn’t mean it’s definitely not spam. Still, it is a good indication, and some email servers and ISPs will mark your email as spam if the SPF record doesn’t match, or isn’t present.
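To make the check concrete, here is a heavily simplified sketch in Python of what a receiving server does once it has looked up the sender's SPF record. It only understands the ip4:/ip6: mechanisms and the closing "all" term; real SPF records also use mechanisms such as include, a and mx, which need further DNS lookups and are simply skipped here. The function name and the example record are assumptions for illustration.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record, sender_ip):
    """Evaluate a simplified SPF record (ip4:/ip6: and 'all' only) for one IP."""
    if record is None:
        return "none"                      # the domain publishes no SPF record
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"                    # a bare mechanism means "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        lowered = term.lower()
        if lowered.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip in network:
                return QUALIFIERS[qualifier]
        elif lowered == "all":
            return QUALIFIERS[qualifier]   # catch-all at the end of the record
    return "neutral"                       # nothing matched and no 'all' term
```

With a record such as "v=spf1 ip4:192.0.2.0/24 -all", mail from 192.0.2.10 passes and mail from anywhere else fails. The "none" and "neutral" results are the in-between cases described above, where the receiver has to decide for itself.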

So how can you avoid this fate?

First check if you already have a valid SPF record. Go to http://www.kitterman.com/spf/validate.html and enter the domain where you send email from. If your domain returns a valid SPF record, everything is fine. If not, you may find some email servers you send emails to may block them as spam.

Here’s what Gmail was showing for us:

Received-SPF: neutral (google.com: XX.XXX.XX.XXX is neither permitted nor denied by best guess record for domain of info@revahealth.com) client-ip= XX.XXX.XX.XXX;

Authentication-Results: mx.google.com; spf=neutral (google.com: XX.XXX.XX.XXX is neither permitted nor denied by best guess record for domain of info@revahealth.com) smtp.mail=info@revahealth.com

(To see this, simply send email to a Gmail account, and then select ‘See Original’ in the little menu at the top of the email message. You get to see all the headers for the email.)

What this is saying is that when they check the IP address we’re sending from, they get back neither a “confirm” nor a “deny” message. That is, there is no SPF record at all.

We used to have our SPF record for RevaHealth.com set correctly. I know, because I did it. I also could tell when it stopped working – when we moved our front end box from one server to another a few months ago. Of course, I couldn’t remember just what I’d done to set it correctly nearly three years ago.

The key to the answer is not who or what sends your email, but who hosts the DNS for the domain. The email receiver doesn’t check with the domain’s own server, but with its DNS (Domain Name System) records. In the example above, Gmail isn’t checking with the RevaHealth.com server, but with the name servers for RevaHealth.com. In our case, those are at Go Daddy.

Of course, I didn’t remember that at first. Thinking our own front end box would have the SPF record I looked in its own DNS entries and added it there. There’s a very handy SPF setup wizard to help you to create your SPF record and save it in your DNS. However, since our DNS is hosted by Go Daddy, this did me no good at all.

So, after going back and reading the very helpful SPF FAQ again, I realised that I should use our DNS to create the SPF record. And that’s when I realised what had gone wrong. When we moved our servers, we updated our DNS entry for RevaHealth.com and lost the SPF record in the process.

Editing your SPF record at a domain registrar depends on their interface. Thankfully for us, there is a helpful guide to creating SPF records for domains hosted with Go Daddy.

I quickly added the SPF record to our RevaHealth.com domain entry, but this wasn’t the end of the story.  We send email from our hosting server. This looks like mail.si-svXXXX.com and that’s what an IP address lookup returns. When I entered this domain as an allowed domain to send email, I got nothing. Running the validation check failed.  However, this was because the SPF record should return only the domains it supports, not the sub-domains. Dropping the “mail.” and changing the record to just si-svXXXX.com brought our SPF records back to normal.

Now, Google reports itself happy with us again.

Received-SPF: pass (google.com: domain of support@revahealth.com designates XX.XXX.XX.XXX as permitted sender) client-ip=XX.XXX.XX.XXX;

Authentication-Results: mx.google.com; spf=pass (google.com: domain of support@revahealth.com designates XX.XXX.XX.XXX as permitted sender) smtp.mail=support@revahealth.com

Did it make a difference? Is this worth bothering about?

Yes. Several of our customers had not been receiving emails from us because their ISP was blocking anything without a valid SPF record, and these emails are now getting through okay. It probably reduces the overall spam score for emails we’re sending too, but that was always very low anyway. It’s definitely worth doing, it costs nothing, and the business cost of emails not arriving where they are supposed to can be massive.

Technical Problems With Page Caching

This article takes a more technical viewpoint on the caching issues raised here.

Warning – Not for the casual non-technical reader!

The Problem

RevaHealth.com is made up of tens of millions of pages, organised as ‘pretty’ URLs such as

  • /dentists/ireland
  • /dentists/ireland/dublin
  • /dentists/ireland/dublin/crowns
  • /dentists/ireland/dublin/crowns/the-big-clinic

We cache the data for each page in an ASP.NET data cache, which holds the data you need to construct a page in memory. This works for frequently requested pages as they have a high cache hit rate. However, there is still a fairly heavy code hit to build the page, which results in a Time to First Byte of 1.2 to 1.5 seconds.

This wasn’t providing the user experience that we wanted and we were determined to lower it, so we added ASP.NET output page caching with a time to live of an hour. This holds the fully constructed page in the web server memory so it can be close to instantly returned to the user. This resulted in a Time to First Byte of 0.5 seconds.

This was great.

Or so we thought. Regular testing revealed that even frequently requested pages were rarely in our cache. Why? In fact only about one in ten pages were in the cache. This wasn’t good. Was the output page caching not working?

Why?

The answer wasn’t so simple. Firstly, RevaHealth.com is very broad and flat. Search results are divided by the type of clinic (dental, cosmetic, etc) and then further subdivided by multiple levels of location (country, county, city & neighbourhood). To make matters worse there are options for further treatment and/or specialization sub-subdivisions. A typical landing URL might well look like this:

http://www.revahealth.com/dentists/uk/west-midlands/birmingham/erdington/implants

Landing pages are almost all ‘long tail’, and the tail is very, very long. With over 50,000 locations in over 200 countries, several dozen clinic types and hundreds of procedures, we have millions of search pages and over 100,000 clinics. We knew our landing pages covered a very broad range, but only when we looked into the figures more closely did we realise just how broad and how flat the site was.

In a typical period 152,072 visitors entered the site through 36,357 pages. Only 66 of those pages had more than 100 hits and 30,000 had five or fewer hits. So in a typical one hour period only a few hundred pages were getting a hit in the output cache. The huge bulk of pages requested were not in the cache when requested.

Looking for Solutions

Clearly a simple remedy would be to extend the cache life beyond an hour. But this has business implications. Firstly, when our customers update their profile, they want to see that change reflected as soon as possible. Asking them to wait more than an hour would not be good.

More importantly, for a site like RevaHealth.com, search ordering and the appearance of results is critical, and search results order updates happen dynamically as patients contact clinics, review clinics and generally interact with the site. So, extending cache time to a level where we get more of the tail into the cache would be very problematic.

We decided to simulate traffic to the site, and to force the most frequently requested pages into the cache on an hourly cycle.

cURL seemed to be the obvious tool to use, as we had some experience with it and it is widely accepted. We generated a list of the top 100,000 most frequently requested URLs and created a cURL script to fetch them all.

Our experiences with cURL

cURL is a feature rich tool, but we wanted to use it in a pretty simple way – fetch all the pages on the list on an hourly cycle.

The first problem we encountered was that from the command line there is no way to limit the rate of page fetches. We knew we wanted to fetch them at a rate just above 12/second to ensure that the script would complete in an hour. But curl will only set a speed limit in kb/sec. Since our page size varies greatly, this made fixing that speed a case of trial and error. Obviously we didn’t want to fetch too fast and strain the server unnecessarily, or fetch too slow and not complete the list in an hour.
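For comparison, this is roughly what pacing by requests per second looks like if you do write code for it. It is a minimal sketch in Python, not something we ran: the fetch function is passed in by the caller, and all names are hypothetical.

```python
import time


def required_rate(url_count, window_seconds=3600):
    """Fetches per second needed to get through the whole list inside the window."""
    return url_count / window_seconds


def warm_cache(urls, fetch, per_second, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, never faster than per_second requests a second."""
    interval = 1.0 / per_second
    next_slot = clock()
    for url in urls:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)          # we are ahead of schedule, so wait for the slot
        fetch(url)
        next_slot += interval     # fixed slots, so a slow page does not shift the rest
```

The rate needed is just the list length divided by the window: 12 fetches a second gets through 43,200 URLs in an hour, while a list of 100,000 would need about 27.8 a second.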

We could have used libCurl in our own server code and set a rate per second there, but we were keen not to have to write code for this, and instead use the command line tool to keep it simple.

Some relatively straightforward trial-and-error tests revealed a rate which would enable the script to finish within the cache time available (one hour).

What was frustrating during this process was that there is no way for cURL to send the actual file data fetched to nul and to save  the normal stdout output to a file or even send it to the screen. We didn’t want to save the actual output files which could get potentially very large, but sending them to nul meant normal output was sent to nul too. Equally frustrating when testing was that the normal (non verbose) output does not show the URL of the page being fetched.

The progress meter shows bytes downloaded, percentage completed, etc, but rather strangely, not the file being fetched, so there’s no easy way to tell your progress through the list of pages you are fetching.

In the end though we got past all these problems and had a script that worked – or so we thought. In fact, our first run through made no difference to the cache at all. This caused a lot of head scratching until someone looked at the fetched files and we realised that, of course, they were not compressed.

We always return compressed dynamic pages. Since the output file is gzipped, and cached as a compressed file, we were only having non-zipped pages cached.

Helpfully, cURL allows the HTTP request headers to be set on the command line, so simply adding --header "Accept-Encoding: gzip,deflate" fetched our zipped pages into the cache and testing in Firebug showed that they were being requested by our script.
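The same request can be sketched with Python's standard library, for anyone scripting this without cURL. This is an illustration only; the function names are hypothetical, and the response body is read and thrown away because all we want is for the server to build and cache the compressed page.

```python
import urllib.request


def build_warm_request(url):
    """Build a request that asks for the compressed version of the page."""
    request = urllib.request.Request(url)
    # Without this header the server builds (and caches) the uncompressed page,
    # which is not the copy that real browsers ask for.
    request.add_header("Accept-Encoding", "gzip,deflate")
    return request


def fetch_and_discard(url, timeout=30):
    """Fetch the page, discard the body in chunks, and return the HTTP status."""
    with urllib.request.urlopen(build_warm_request(url), timeout=timeout) as response:
        while response.read(64 * 1024):
            pass
        return response.status
```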

We watched memory usage during the build up of the cache, and made some adjustments to allow larger physical memory to be used. At a certain number of pages requested we began to see large page usage, so we scaled back the number of pages being requested and all returned to normal.

Browsers Browsers Browsers

We thought we were done, but one of the oddest things was yet to bite us. Like most developers, we love Firebug and we were checking everything using Firefox, but before we push changes live we do a fairly rigorous check in other browsers. Disaster. Firefox and Chrome were receiving our new cached page but Internet Explorer wasn’t.

Internet Explorer was simply bypassing the cached pages and hitting the code. This was exactly what we were trying to avoid.

The problem was that we were also using GZIP to compress the HTML. It turns out that IE sends a slightly different value for the Accept-Encoding header than Firefox or Chrome does. Even though they all accepted exactly the same encoding, the web server wouldn’t serve up the cached copy.

  • FF: Accept-Encoding: gzip,deflate
  • Chrome: Accept-Encoding: bzip2
  • IE: Accept-Encoding: gzip, deflate  (note the space)

Essentially because the browsers were each requesting the same file using very slightly different parameters it resulted in the web server thinking they were different files.
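Where you have a front-end cache or filter that lets you rewrite request headers before the cache lookup, one common way round this is to collapse all the variants into a single canonical value. We did not do this at the time; the sketch below, in Python for illustration and with a hypothetical function name, just shows the idea.

```python
def normalise_accept_encoding(header_value):
    """Collapse the browsers' different Accept-Encoding values into one canonical form."""
    if not header_value:
        return "identity"
    # Split on commas, drop any ";q=..." weighting and surrounding spaces.
    # (A stricter version would also honour q=0, which means "not acceptable".)
    offered = {part.split(";")[0].strip().lower() for part in header_value.split(",")}
    if "gzip" in offered:
        return "gzip"
    if "deflate" in offered:
        return "deflate"
    return "identity"                      # serve the page uncompressed
```

With this in place "gzip,deflate" and "gzip, deflate" both become "gzip", so the two browsers share one cached copy instead of each needing their own.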

The choice was simple, either:

  1. Cut the size of the cache to 33% and increase the length of it 3x
  2. Only support some browsers

Unfortunately the commercial reality of choices like this is – ‘Provide the greatest good to the greatest number of users’. This meant only providing cached pages for IE. As a result Firefox and Chrome users have a slightly degraded experience compared to IE users, however this degradation is largely compensated by faster JavaScript engines.

Note: IIS 7 introduces some control that solves this particular issue.

Your War Stories

We’d love to hear about your  trials and tribulations getting time to first byte down. Leave a comment below.

Use RevaHealth.com Maps On Your Website

Today we’ve made our maps of clinics in the UK and Ireland freely available for use on your own website. You can easily include a snippet of code on your pages to show a map of the dentists, doctors or other health clinics in your locality.

For instance, here is the snippet of code to show a map of general practice doctors in Brighton.

<script type="text/javascript" language="javascript">

document.write("<iframe src='http://www.revahealth.com/doctors/uk/east-sussex/brighton/externalmap' width='600' height='500' frameborder='0'></iframe>");

document.write("<span>Data provided by <a title='RevaHealth.com' href='http://www.revahealth.com'>RevaHealth.com</a></span>");

</script>

And here is how the map would appear on your page.

The map pins show the locations of the clinics. The prices shown are for a standard doctor consultation in the practice. You can pan around and zoom in and out to see more detail about the location of each practice, and click on each pin to see more practice information.

Using the maps on your own website is completely free and easy to do. You just need to add a small snippet of code to your page which pulls in the map and data from the RevaHealth.com server. You don’t need to be a programmer at all; anyone who can edit their own web page can do it easily.

The snippet can easily be changed to show any of the different kinds of clinics in the thousands of locations in Ireland and the UK which are covered in the RevaHealth database. You can contact us at the address below to see what types of clinics are available. For example, you could show Laser eye clinics in Stratford, Dental Clinics in Prestwick or GPs in Cork.

To do it yourself, just search for the URL of any set of clinics on RevaHealth.com as normal, and when you find the list you want, add /externalmap to the end and replace the URL in the example snippet above with the URL of your choice. Hey presto!

The clinic data is constantly being refreshed and updated by the team at RevaHealth.com and users can look up phone numbers or contact the clinics on-line.

The API is free to use, although we do ask you to show the source of the data beside the map on your site with a link to RevaHealth.com. The code snippet above includes the link:

Data provided by RevaHealth.com

which you can change if needs be. If you are interested in putting these maps on your site or have any further enquiries, please contact us at support@revahealth.com.

I was recently asked about our experiences in taking online payments, and in particular in taking regular subscription payments. The company we chose to handle our payments is called Realex, and we’re very happy with them. 

However, thinking back over everything that we’ve done since we started, there were plenty of things that we did that made our lives difficult after the fact, especially when it came to reconciling accounts and processing refunds. Hopefully sharing our missteps and mistakes might help save you a bit of time if you plan on taking payments yourself.

The first mistake we made was to be too worried about what would happen if our payment processor’s API took a long time to respond. We coded a lot of safety nets around this, recording all the details of the transaction in case something went wrong so we could redo it at a later time if necessary.

As it turns out, we’ve never experienced a timeout or overly long delay, so all that safety net coding was a waste of time. I wouldn’t bother with it now if I was starting again. You could just log timeouts or send an email alert and deal with them by hand should it ever happen to you.

The next thing we did that made our life difficult was to store the credit transactions in one DB table, and the calls and responses to the payment processor in another. This meant that when we came to reconcile the credit transactions with our bank statements, we were missing some vital information.

Our payment processor batches payments together and we couldn’t easily work out which payment belonged to which batch without a lot of complicated work after the fact. Now we store the payment processor’s transaction and batch IDs with our own record of payment and it makes it very easy to reconcile our accounts whenever we need to. I really wish we’d done this from the start!
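As an illustration of keeping the processor's IDs alongside your own record, here is a minimal sketch in Python. The field names are hypothetical and are not Realex's; the point is only that each payment carries the transaction and batch IDs that came back from the processor.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    payment_id: int
    customer_id: int
    amount: Decimal
    processor_txn_id: str     # ID the payment processor returned for this charge
    processor_batch_id: str   # batch the processor settled it in


def totals_by_batch(payments):
    """Sum our own payment records per processor batch, ready to compare with a statement."""
    totals = defaultdict(Decimal)
    for payment in payments:
        totals[payment.processor_batch_id] += payment.amount
    return dict(totals)
```

Each total can then be matched against the corresponding batch line on the statement, and any batch that does not agree points straight at the handful of payments to look into.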

Another thing to bear in mind is that inevitably people make mistakes, and at some point you are going to have to process refunds of one sort or another. Doing the refunds themselves isn’t difficult. In our case Realex has simple online tools to handle them. However, you will want to design your system to handle refunds in such a way that you can reconcile your accounts afterwards. Just deleting or altering your own record of the payment will make this very difficult. You’d be surprised at how quickly you forget what the heck went on.

By way of example, if you take a payment and then have to refund it two weeks later, you could just delete the original transaction from your system and the books will balance, but you will be out of sync with your payment processor in two places – the original transaction and the refund. If each of these happened in different accounting months it can lead to real headaches. Ultimately, even if you don’t handle refunds directly through your transaction system, you will need to set up your code to handle refund transactions that you enter by hand.
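The example above can be sketched as follows, in Python and with hypothetical names. The refund is stored as its own negative entry pointing back at the original payment, and nothing is ever deleted or edited.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: int
    booked_on: date
    amount: Decimal                  # positive for a payment, negative for a refund
    refund_of: Optional[int] = None  # entry_id of the payment being refunded, if any


def monthly_totals(entries):
    """Net amount per (year, month), matching what the processor reports for each month."""
    totals = defaultdict(Decimal)
    for entry in entries:
        totals[(entry.booked_on.year, entry.booked_on.month)] += entry.amount
    return dict(totals)
```

A payment taken in late August and refunded in early September then shows up as a credit in one month and a debit in the next, exactly as it does on the processor's side, while the overall balance still nets to zero.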

New credit card details can also cause problems. People get issued new cards for all sorts of reasons, so the details you have today are not necessarily the details you will have tomorrow. If you ignore this and just let people overwrite their current credit card details, it can make looking back at old transactions next to impossible. You should code for multiple cards, so a new card is added rather than overwriting the current card’s details.

Finally, if and when a customer complains about a transaction to you, stop and listen. As soon as they go to their bank and ask for the transaction to be reversed, you will be punished with higher fees and a permanent flag against your account. The banks have made this very easy to do in recent times, so it is in your best interest to avoid any dispute as quickly as you can. If in any doubt, refund the transaction. Even if you are proved to be right in the long run, your account will still be flagged because you haven’t done enough quickly enough to avoid the dispute.

Hopefully the advice above is of some use to you. We’d love to hear your stories of setting up to receive online transactions too, so leave a comment.

© 2010 WhatClinic.com Blog