Page load time

I work in the software department of a company which manufactures various types of marketing materials.
Our e-commerce website sells these products in more than 20 countries and supports a variety of languages. The products we sell can be customized by users. The code base is made up of systems built over more than 2 decades.
Like many other e-commerce companies, we see our biggest traffic and sales of the year during the holiday season in the US and Europe. The effect of production issues on the website is magnified during this time. Holiday is a time of code and data freezes, with risk mitigation being the much-repeated buzzword.
I was working on a relatively new system to which several of our user-facing page types were being migrated. The original system, 'the monolith', the bane of new developers, was for the most part written in-house, and had been debugged and tuned to a high degree over time. Parts of the monolith were being moved to a service-oriented architecture, and the Windows/ASP.NET-based CMS I was working on merged many of the new micro-services onto user-facing web pages. For several months before holiday, the focus was on migration and getting new features onto the new CMS.
For various reasons, page load performance on the CMS was left to degrade over time. Consider the car door magnets page, a relatively lightweight product page with a combination of static and dynamic content:
That was terrible considering that the HTML from the CMS was actually presented in between the header and footer from the monolith, and this was a situation we should not have brought upon ourselves. We had unofficially set the server-side page load time budget of the CMS at 200ms, but over time, as new features were added, there was no real push to stay within this limit.
Holiday gave some of us an opportunity to take a break from chasing feature requests from product owners and tackle reliability and performance issues. We knew that upcoming in-house gameday load tests would push requests from the monolith to the point where we would start timing out on our current infrastructure, presenting a real risk to production traffic and revenue.
By this time we had our internal code freeze which would not allow for roto-tilling the design and code underlying the pages. So we had to look for improvements by tweaking the existing code. Fun!
The first problem was that we had stopped running regular load tests. My colleague Joe updated an old Locust-based load test which we could run locally and in our lower environments. The performance issue was obvious: we were constrained to ~10 requests per second across all pages before CPU usage started reaching its upper limits on individual VMs.
Investigation

Looking at our NewRelic, Sumologic and application logs in the load test environment did not give us clues to possible bottlenecks. Joe generated a few dump files of the IIS worker process and sent them to me for analysis.
I used windbg to look for clues. Often, half the work in windbg analysis is getting the dlls loaded correctly. I have annotated my setup with comments:

```
.echo "Prep work"
.echo "From image where load test was run get -> sos, mscordacwks and clr dlls"
.echo "Only sos should be needed most times"
.echo "Using sos.dll from localhost may not get correct results"
.echo "Ensure we have dlls/pdbs compiled for relevant release/compiler flags"
.echo "End Prep work"

.echo "set noisy symbol loading"
!sym noisy

.echo "set symbol path to: microsoft symbol server,"
.echo "and the path to our dlls"
.sympath+ srv*e:\symbols*http://msdl.microsoft.com/downloads/symbols;E:\windbg\dev_10_4\bin

.echo "load sos.dll"
.load E:\windbg\dev_10_4\sos.dll

.echo "Load sosex and psscor4 extensions"
.load E:\work\windbg\Psscor4\amd64\amd64\psscor4
.load E:\work\windbg\sosex\sosex.dll
```

Using the runaway command and looking at thread stacks did not reveal anything obvious.
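For the record, the triage commands I mean here are the standard windbg/sos ones (the exact filtering varies by taste):

```
!runaway            $$ CPU time consumed per thread
~*e !clrstack       $$ managed stack for every thread
!threadpool         $$ CLR thread pool utilization and queue lengths
```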
We did notice several GC-related threads on the stacks. We had not noticed any obvious memory leaks, but then we release to prod every day, which makes slow leaks hard to spot.
Joe set up .NET CLR memory perf counters and noticed that the Gen 0/1/2 collection counts were not in the healthy 100/10/1 ratio mentioned here. Something to investigate.
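The same counts the perf counters expose can be read from inside a process via `GC.CollectionCount`. A quick, hedged sketch (not our production code): in a healthy app Gen 0 collections should dominate, with each older generation roughly an order of magnitude less frequent.

```csharp
using System;

class GcRatioSketch
{
    static void Main()
    {
        // Churn through small, short-lived allocations so the counters
        // move; these should all die cheaply in Gen 0.
        for (int i = 0; i < 200_000; i++)
        {
            var tmp = new string('x', 100);
        }

        // Counts are cumulative per generation. A Gen 1 collection also
        // collects Gen 0, so gen0 >= gen1 >= gen2 always holds; ideally
        // the ratio is roughly 100/10/1.
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);
        Console.WriteLine($"Gen0={gen0} Gen1={gen1} Gen2={gen2}");
    }
}
```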
Going back to the dump files:
Sosex's dumpgen command on the large object heap did reveal something interesting: there were several string objects containing rendered HTML. A sample:
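If memory serves, the commands look roughly like this (sosex's `!dumpgen` takes the generation to walk, and sos's `!do` dumps a single object):

```
!dumpgen loh                       $$ walk the large object heap
!dumpgen loh -type System.String   $$ narrow it down to strings
!do <address>                      $$ inspect one of the string objects
```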
Looking into the code, this led to a method:
So, a simple but oft-repeated bug: operating on strings instead of streams, which shows up as a problem when the strings grow large and are used in different ways.
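The shape of the problem, as a hypothetical reconstruction (names invented, not our actual method): every request built the full page markup as one string. In .NET, any allocation of 85,000 bytes or more goes directly on the large object heap, which is only reclaimed by expensive Gen 2 collections.

```csharp
using System;
using System.Text;

static class PageRenderer
{
    // Buggy pattern: each fragment is small and harmless, but the
    // concatenated page body becomes one huge string per request.
    public static string RenderToString(string[] fragments)
    {
        var sb = new StringBuilder();
        foreach (var fragment in fragments)
        {
            sb.Append(fragment);
        }
        // For a typical product page the result is hundreds of KB --
        // well past the ~85,000 byte large-object-heap threshold.
        return sb.ToString();
    }

    static void Main()
    {
        var fragments = new string[200];
        for (int i = 0; i < fragments.Length; i++)
        {
            fragments[i] = new string('x', 1024);
        }
        string page = RenderToString(fragments);
        Console.WriteLine(page.Length); // 204800: an LOH allocation per request
    }
}
```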
Streams over strings

The solution already existed a few lines below, but that method had a bug, which was probably why consumers of the library were not using it.
I fixed this code based on advice from my co-worker Chris:
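I don't have the original diff to hand here, so the sketch below only shows the shape of the fix (names again invented): write each fragment to the caller's `TextWriter` as it is produced, so no full-page string is ever materialized.

```csharp
using System;
using System.IO;

static class StreamingPageRenderer
{
    // Fixed pattern: stream fragments straight to the consumer's writer
    // (in ASP.NET, ultimately the response output stream). Allocations
    // stay small and die young in Gen 0 instead of landing on the LOH.
    public static void RenderToStream(string[] fragments, TextWriter output)
    {
        foreach (var fragment in fragments)
        {
            output.Write(fragment);
        }
    }

    static void Main()
    {
        var fragments = new[] { "<header/>", "<main>car door magnets</main>", "<footer/>" };
        using (var writer = new StringWriter()) // stand-in for Response.Output
        {
            RenderToStream(fragments, writer);
            Console.WriteLine(writer.ToString());
        }
    }
}
```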
The load test showed a noticeable improvement. We could go from ~10rps to ~32rps:
After the production release the next day, we looked at the car door magnets page:
A nice boost! But still a little off from our 200ms goal, and holiday was approaching.
More in the next blog post.