Sunday, August 12, 2018

Monolith to Microservices - a debugging story


I work for an e-commerce company which manufactures customizable products*.

Over the past couple of years we have been migrating from a monolithic system towards a service-oriented one. The migration is still in progress and has been mostly successful so far. Product search was a cross-cutting concern which had to work across both the old monolith world and the new service-based one, and the monolith's search implementation naturally did not consider products introduced on the newer platform.

The opportunity

A couple of months back, I got the opportunity to move search to the new microservice based platform. We had decided to use a third party search engine vendor and the off-monolith microservices had reached a point where a search MVP was conceivable.

A couple of aims we had in mind:
1. Drop some of the bells and whistles provided by monolith search. Mainly, drop some search result types and focus on product search.
2. Try to match the monolith's performance in terms of page load times. This seemed hard since we were introducing extra hops, including a new vendor-based service hosted in a different data center.

The design was simple, and the microservices provided on the new platform lent themselves neatly to some of the aims:

Data feed/index:


Problems with response time

Things went well as we integrated the systems, but there was a problem. We had set a limit of less than 200ms response time for the HTTPS calls from the search results API. This was essential for us to keep within the performance budget for the service call while serving up search result pages.

The average response time was in the 500ms range. Traceroute showed the problem:

 1    <1 ms    <1 ms    <1 ms
22    13 ms    13 ms    13 ms [REMOVED]

Our data center is located in AWS Ireland, while the search engine's data center was in Rackspace New Jersey. This was easy enough to fix: the search engine provider relocated to the Rackspace London data center, and the response times fell within the acceptable limit. The miss seemed obvious in retrospect. For some reason I had assumed this had been considered when the contract was signed - guess engineers were not very involved at that point :)

We decided to A/B test the implementation at 50-50 traffic across our top two locales. The revenue metrics in the test were slightly down and we noticed bursts of timeout errors from the search engine in our logs. Occasionally the timeouts tripped the circuit breaker policy we had in place.

The timeouts were affecting ~2-3% of the search traffic.
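Our resilience policies were the standard circuit breaker and retry patterns. The real implementation was in .NET; a minimal Python sketch of a breaker that trips for 30 seconds after a run of consecutive failures (the thresholds we used), plus the retry wrapper we ended up adding later, might look like:

```python
import time

class CircuitBreaker:
    """Trips open after `failure_threshold` consecutive failures,
    then rejects calls for `reset_timeout` seconds."""
    def __init__(self, failure_threshold=10, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result

def with_retry(breaker, fn, attempts=2):
    """Simple retry-on-failure, layered over the breaker."""
    last_exc = None
    for _ in range(attempts):
        try:
            return breaker.call(fn)
        except Exception as exc:
            last_exc = exc
    raise last_exc
```

This is a sketch, not our production code; the point is that once the breaker trips, calls fail fast without ever reaching the search engine.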

Time to debug

There were three possibilities that came to mind:
1. The http client implementation on our side
Investigating this did not turn up much. The same http client is used across various service client calls, and there was no indication of port exhaustion or a similar problem on our servers/load balancers.

2. Something in the connection or network
Going over ping and MTR data did not show much either. I emailed one of our systems engineers with the traceroute data and got no reply. My guess was he did not think there was much to go on.

3. The vendor's service runtime
This was something I did not have visibility into. How could I debug it and convey useful information to the (external) service provider? All I had were the http client logs with timeout and circuit breaker events.

At this point, I will switch to extracts from emails.

We don’t see a drop in the number of timed out requests. Around 12:48pm EST, we had our software circuit breaker trip a few times due to a series of requests timing out (the policy set in code is to trip the circuit breaker for 30 seconds after 10 consecutive failures – the circuit breaker tripped 4 times).
We also realized that while we have a circuit breaker policy, we did not set a retry policy for calls to the search engine. I have this in code now and it should go to prod tomorrow; will let you know if this improves things.
Another option we are considering is fronting search responses with CDN...

As we cannot locate httpd logs for the sample requests that the client said were timing out, we suspect that the error happened in the first leg of the request, from client to us. And with the loss at only 2~3%, traceroute is likely not going to show where the problem is. We'd need to run multiple traceroutes to possibly detect the problem.
Here is what we can offer in terms of improving performance:
1. This timing out might be caused by port exhaustion on the servers, would you please check netstat to ensure that this is not the issue?
2. We have noticed that there is no compression requested in the request, we would recommend sending an
Accept-Encoding: gzip header to improve performance...

The search responses were paged and hence small; gzip, while it would improve things, did not seem to be a reason for the timeouts. We needed more data to help us all narrow down the possible cause.
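Requesting compression, as the vendor suggested, is a one-header change. A hedged sketch in Python follows (our production client was .NET HttpClient, where the AutomaticDecompression handler setting achieves the same thing; the URL would be the vendor endpoint):

```python
import gzip
import urllib.request

def decode_body(body, content_encoding):
    """Decompress the response body if the server applied gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body

def fetch(url):
    """Ask the server for a compressed response via Accept-Encoding: gzip,
    then decompress it if the server honored the request."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```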

The timeouts are occurring at a similar rate today as yesterday.

I had a thought about one possible cause and a way to reduce it; could you consider it:
- The timeouts seem to occur in bursts; usually a burst lasts 1-2 minutes and then there are several minutes of search queries with no timeouts.
- The timeouts occur for various locales, though the occurrence is higher in en-us; that might be a reflection of the traffic.
- The timeouts occur across all our servers, with no correspondence to increased load or network activity. They occur at similar rates on all days of the week.
- I have attached an Excel sheet with timeouts and periods of no timeouts from 1:00am to 4:30pm EST on 4/17.
Is there any process which runs at these time intervals on the server(s) which serve up our company queries? Perhaps a re-index or some such?


The timing data did the trick! The vendor was able to isolate the problem to a process on their side. Turned out they ran auto-deploys which were affecting the runtime behavior of the response system.

Thanks again for the below information, we have a short term resolution in place:
We have disabled autodeploys for the time being, which are causing these full loads to happen with each deployment. This should put a stop to the timeouts in the short term while we address the source of the issue.
Will keep you updated on the permanent solution.

Nice work, looks like that was it. No timeouts since ~6:49pm EST yesterday (4/17). Please convey my thanks to the dev team too, have attached a picture from our logging system showing the stop in timeouts.

We have made the loads asynchronous and they should no longer block search. This has been deployed to your cluster and there should no longer be timeouts caused by this particular issue.

Server-side time was within limits now; the brown section in the graph corresponds to the search engine response time of ~100ms, well within the 200ms limit we had set:


The MVP went well.

Beyond this, we don't have to depend on the legacy search system, the search engine/vendor can be replaced easily in the face of risks, and the MVP has opened the door for other industry-standard search features.

In the service world where we have limited visibility into the service provider implementations, timing data can be a valuable source of debugging information.

*Views are my own, not of my employer Cimpress.

Wednesday, February 14, 2018

Using windbg to improve page load time

Page load time

I work in the software department of a company which manufactures various types of marketing materials*.
Our e-commerce website sells these products in more than 20 countries and supports a variety of languages. The products we sell can be customized by users. The code base is made up of systems built over more than 2 decades.

As at many other e-commerce companies, the holiday season in the US and Europe is a big time of the year for us in terms of traffic and sales. The effect of any production issue on the website is magnified during this time.

I was working on a relatively newer system to which several of our user-facing page types were being migrated. The original system, 'the monolith', was for the most part written in-house. It had been debugged and tuned to a high degree over time. Parts of the monolith were being moved to a service-oriented architecture, and the Windows/ASP.NET-based CMS I was working on merged many of the new microservices onto user-facing web pages. For several months before holiday, the focus was on migration and getting new features onto the new CMS.

Due to various reasons, page load performance on the CMS became slow over time. Consider the car door magnets page which was a relatively lightweight product page with a combination of static and dynamic content:

That was not great, considering the html from the CMS was actually presented between the header and footer from the monolith. We had unofficially set the server-side page load time budget of the CMS to be within 200ms, but over time, as new features were added, we went above this limit.

Holiday gave an opportunity for some of us to take a break from chasing feature requests from product owners and tackle reliability and performance issues. We knew that upcoming in-house gameday load tests would push requests from the monolith to the point where we would start timing out on our current infrastructure, presenting a real risk to production traffic and revenue.

By this time we had our internal code freeze which would not allow for roto-tilling the design and code underlying the pages. So we had to look for improvements by tweaking the existing code. Fun!

The first problem was that we had stopped running regular load tests. My colleague Joe updated an old Locust-based load test which we could run locally and in our lower environments. The performance issue was obvious: we were constrained to ~10 requests per second across all pages before CPU started reaching its upper limit on individual VMs.


Looking at our New Relic, Sumo Logic and application logs in the load test environment did not give us clues to possible bottlenecks. Joe generated a few dump files of the IIS worker process and sent them to me for analysis.

I used windbg to look for clues. Oftentimes, half the work in windbg analysis is getting dlls loaded correctly. I have annotated my setup with comments:

.echo "Prep work"
.echo "From image where load test was run get -> sos, mscordacwks and clr dlls"
.echo "Only sos should be needed most times"
.echo "Using sos.dll from localhost may not get correct results"
.echo "Ensure we have dlls/pdbs compiled for relevant release/compiler flags"
.echo "End Prep work"

.echo "set noisy symbol loading"
!sym noisy
.echo "set symbol path to: microsoft symbol server,"
.echo "and the path to our dlls"
.sympath+ srv*e:\symbols*;E:\windbg\dev_10_4\bin
.echo "load sos.dll"
.load E:\windbg\dev_10_4\sos.dll
.echo "Load sosex and psscor4 extensions"
.load E:\work\windbg\Psscor4\amd64\amd64\psscor4
.load E:\work\windbg\sosex\sosex.dll

Using the !runaway command and looking at thread stacks did not reveal anything obvious.

We did notice several GC-related threads on the stacks. We had not noticed any obvious memory leaks, but then we release to prod every day, making slow leaks hard to spot.

Joe set up .NET CLR memory perf counters and noticed that the Gen 0/1/2 collection counts were not in the healthy 100/10/1 ratio mentioned here. Something to investigate.
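The 100/10/1 rule of thumb says Gen 0 collections should outnumber Gen 1 by roughly 10x, and Gen 1 should outnumber Gen 2 by roughly 10x. A quick sanity check on counter readings can be sketched like this (the numbers below are made up, not our actual counters):

```python
def gc_collection_ratios(gen0, gen1, gen2):
    """Ratios between successive GC generation collection counts.
    Healthy .NET processes sit near 10 and 10 (the 100/10/1 rule)."""
    return gen0 / gen1, gen1 / gen2

# Hypothetical perf counter readings:
r01, r12 = gc_collection_ratios(gen0=5000, gen1=2500, gen2=1800)
# Both ratios far below 10 would mean objects are surviving into the
# older generations, forcing too many expensive Gen 2 collections.
```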

Going back to the dump files:
Sosex's !dumpgen command on the large object heap did reveal something interesting: there were several string objects representing HTML. A sample:

Looking into the code, this led to a method:

where value was being passed in as a string.

So, a simple but oft-repeated bug: operating on strings instead of streams, which shows up as a problem when the strings grow larger and are used in different ways.

Streams over strings

The solution already existed a few lines below, but that method had a bug, which was probably why consumers of the library were not using it.

Fixed this code based on advice from my co-worker Chris:
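Our fix was in C# and I can't reproduce the exact code here, but the shape of the change — write fragments to the output stream as they are produced instead of concatenating them into one large string — can be sketched in Python:

```python
import io

def render_as_string(parts):
    """Anti-pattern: materialize the whole document as one string.
    In .NET, strings over ~85KB land on the large object heap and
    repeated concatenation reallocates as the string grows."""
    html = ""
    for part in parts:
        html += part
    return html

def render_to_stream(parts, stream):
    """Fix: write each fragment straight to the output stream, so no
    single large string is ever allocated."""
    for part in parts:
        stream.write(part)

buf = io.StringIO()
render_to_stream(["<html>", "<body>car door magnets</body>", "</html>"], buf)
```

Both produce the same bytes on the wire; the difference is purely in allocation behavior, which is exactly what showed up in the heap dumps.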

Load test showed noticeable improvement. We could go from ~10rps to ~32rps:

After production release the next day, we looked at the car door magnets page:

A nice boost! But still a little off from our 200ms goal and holiday was approaching.
More in the next blog post.

*Views are my own, not of my employer Cimpress.
Boston, MA, United States