In just ten weeks (first commit 16th September 2013, live to customers 29th November) we fundamentally changed the approach for software development and hardware deployment of the NET-A-PORTER website.
On the 9th of February 2014 a collaboration between designer Peter Pilotto and US retailer Target brought hundreds of thousands of shoppers to the NET-A-PORTER website.
Despite operating at a sustained peak of fifteen times our normal order rate, the NET-A-PORTER website did not crash. Here’s what our customers had to say:
scottnothing2do: “@NETAPORTER completely pain free shopping experience shopping #PeterPilottoforTarget website stable, shopping cart full! #topmarks”
Style_biscuit: “it’s live and kicking on @NETAPORTER and very easy to navigate:-) #PeterPilottoforTarget”
fashpad: “that seemed almost too easy, got my pieces! @NETAPORTER #PeterPilottoforTarget”
For the more visual amongst you, here’s a graph from the 9th of Feb showing orders placed per hour against daily and weekly moving averages.
But NET-A-PORTER could not have achieved this a year ago.
Let’s rewind more than a year to the first day of a NET-A-PORTER sale on Boxing Day 2012.
Here’s a graph of requests/min, generated by AppDynamics, from one pool of application servers during this 24-hour period.
The green represents requests serviced successfully. The red represents requests that made it as far as an app server but timed out. Not included here are failed requests at the web server and CDN (content delivery network) tiers. The fluctuation of the red was caused by manual rate-limiting intervention.
On what should have been the busiest shopping day of the year, 75% of customer requests failed.
This is truly the cold turkey of Boxing Day statistics.
By April 2013 we had addressed a few basics: improving Apache Solr cache configuration and decreasing JVM garbage-collection times by preventing the allocation of unnecessary small objects. Each of these is worthy of a blog article in its own right.
But this was not enough.
Traffic to the NET-A-PORTER website during a sale in May 2013 saturated networking gear in one of our data centres. The large number of filter combinations on product listing pages meant that our attempts at caching were nearly useless.
Things had to change, drastically.
Switch to an event-driven rather than thread-per-request model
We’ve moved high-traffic pages from a world of Spring MVC running on Tomcat fronted by Apache HTTPD to a world of Node.js, Scala and nginx.
Technology we now use includes ExpressJS for external routing, Handlebars for HTML templating, Spray for internal routing, Akka for high concurrency, and Scala for aggregating (sorting, filtering) data from multiple APIs.
A request for a list of products will traverse up to three instances of nginx, the most important of which sits between our stateless data aggregation and stateful data service tiers.
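As a sketch of where that middle tier fits (addresses, zone names and TTLs here are invented, not our production config), an nginx instance between the two tiers can act as an HTTP cache in front of the data services:

```nginx
# Hypothetical fragment: the nginx instance that sits between the stateless
# aggregation tier and the stateful data service tier.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:50m max_size=1g;

upstream data_services {
    server 10.0.0.10:9000;  # placeholder address for a data service node
}

server {
    listen 8080;

    location /api/ {
        proxy_pass http://data_services;
        proxy_cache api_cache;
        # Respect Cache-Control headers set by the data services, with a
        # short fallback TTL for responses carrying no caching headers.
        proxy_cache_valid 200 1m;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
```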
By focussing on concurrency, we accidentally improved performance. We’ve seen time to first byte (TTFB, data start) reduce by at least 50% and total download time reduce by at least 20%.
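The shift from thread-per-request to event-driven can be sketched in plain Scala (the two "API calls" below are local stand-ins, not our real services): instead of parking a thread on each downstream call, the aggregation tier starts the calls concurrently and composes the results.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-ins for two internal APIs; in production these would be
// non-blocking HTTP calls to other services, not local computations.
def fetchProducts(): Future[List[String]] =
  Future(List("dress", "shoe", "bag"))
def fetchStock(): Future[Map[String, Int]] =
  Future(Map("dress" -> 3, "shoe" -> 0, "bag" -> 12))

// Start both calls before composing them, so they run concurrently:
// no thread waits idle for either response.
val productsF = fetchProducts()
val stockF    = fetchStock()

val page: Future[List[String]] = for {
  products <- productsF
  stock    <- stockF
} yield products.filter(p => stock.getOrElse(p, 0) > 0) // drop sold-out items

val result = Await.result(page, 5.seconds)
println(result) // List(dress, bag)
```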
Ensure cache lifetimes are relevant
Our cache lifetimes used to be based on the type of request: for example, cache a list of products for 60 minutes, cache details about a product for 10 minutes.
Our cache lifetimes are now a function of stock level. High stock means we can cache for longer, low stock means we should not cache for long at all.
This more intelligent caching also separates product data into different regions. So, “product is included in a list” will probably be shorter-lived than “details about a product in a list”.
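A minimal sketch of the idea (the thresholds and TTLs here are invented, not our production values): derive the cache lifetime from the stock level rather than from the request type.

```scala
import scala.concurrent.duration._

// Hypothetical TTL policy: the deeper the stock, the longer it is safe to
// cache, because the displayed availability is less likely to change.
def cacheTtl(stockLevel: Int): FiniteDuration = stockLevel match {
  case 0            => 30.seconds // sold out: revalidate almost immediately
  case n if n < 10  => 2.minutes  // low stock: short-lived cache
  case n if n < 100 => 15.minutes
  case _            => 1.hour     // deep stock: cache aggressively
}

println(cacheTtl(0))   // 30 seconds
println(cacheTtl(250)) // 1 hour
```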
Analytics evidence from previous sales indicated that customers used the colour filter least, and that use of the size filter increased during a sale. This made sense; as more popular sizes sold out, customers filtered lists to see only products in their size.
We enabled size filtering only when we were sure we could handle the reduced cache hit ratios this would bring.
Use on-demand, utility computing
We own and operate colocated servers in three London data centres. For 95% of the year this hardware runs cold, but during high-traffic periods there is not enough computing power to service the load.
Arriving fashionably late to the Amazon Web Services party is NET-A-PORTER.
Attempts at “monadic deployment”, using Puppet to build immutable Amazon Machine Images (AMIs), proved a little too complex, so pragmatism prevailed and we opted for the far simpler Amazon Elastic Beanstalk, their platform-as-a-service (PaaS).
Upgrade from Continuous Integration to Continuous Deployment
We already used Jenkins for Continuous Integration, checking and compiling code, then running unit, integration and functional tests with every git push.
With a little bit of shell scripting magic, we got Jenkins talking directly to Amazon Web Services. With the click of a button we can push Beanstalk-deployable WAR and ZIP files straight to Amazon S3 as well as retrieve log files back from it.
We’re now working towards automated deployment with every git push.
Profile code under stress test conditions
We were already using the Gatling stress testing tool for internal APIs. Gatling’s Scala-based domain-specific language (DSL) and ability to ramp-up tests meant it was perfect for our website too.
We’re using Amazon’s stateless (non-sticky) Elastic Load Balancer, which appears to route requests based on a hash of the originating IP address. This meant our first stress tests, running from a single IP, ended up hitting the same node behind the load balancer.
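The effect we observed can be modelled like this (a toy hash, not Amazon's actual algorithm): if the balancer picks a backend from a hash of the client IP, every request from a single IP lands on a single node.

```scala
// Toy model of IP-hash routing: NOT the real ELB algorithm, just an
// illustration of why a single-IP stress test exercises a single node.
val backends = Vector("node-1", "node-2", "node-3", "node-4")

def routeFor(clientIp: String): String =
  backends(math.abs(clientIp.hashCode) % backends.size)

// A stress test run from one machine always hits the same backend...
val fromOneIp = (1 to 1000).map(_ => routeFor("203.0.113.7")).toSet
println(fromOneIp.size) // 1

// ...whereas traffic from many IPs spreads across the pool.
val fromManyIps = (1 to 254).map(i => routeFor(s"203.0.113.$i")).toSet
println(fromManyIps.size)
```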
The solution? Create a stress-test-platform-as-a-service! To ensure our stress tests hit all machines behind the load balancer, we built a tool to bundle Gatling tests in a WAR for deployment to Beanstalk and run on multiple machines.
Track CDN-hosted error pages
We added our website analytics tracking to 503 error pages served by our Content Delivery Network (CDN). This meant we could see, in real-time, which parts of the global network were unable to talk to our Amazon-hosted services.
It’s really all about people and communication.
Every new technology introduced was selected by the team because we either believed in that technology or were willing to learn.
Daily stand-up meetings, cross-functional pair programming, ad-hoc discussions, ruthless re-prioritisation and good old IRC provide the glue.

Here is what we learned:
- Target concurrency before performance
- Application hosting is a solved problem
- Trust the delivery team to select the right technologies
- Profile code running under stress
- Move state management to the edges of your system, either at the front close to customers or at the back in your data stores
- If your systems talk HTTP internally, use cache response headers relevant to the response body
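On that last point, a sketch of what “relevant to the response body” can mean in practice (header values invented for illustration): a service that knows the stock level behind a response can set its own Cache-Control max-age accordingly, and intermediate HTTP caches will honour it.

```scala
// Hypothetical: a data service derives max-age from the stock level it is
// returning, so downstream HTTP caches expire fast-moving items quickly.
def cacheControlHeader(stockLevel: Int): String = {
  val maxAgeSeconds =
    if (stockLevel == 0) 30        // sold out: near-immediate revalidation
    else if (stockLevel < 10) 120  // low stock: short cache
    else 900                       // deep stock: cache for longer
  s"Cache-Control: public, max-age=$maxAgeSeconds"
}

println(cacheControlHeader(3)) // Cache-Control: public, max-age=120
```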