Sunday, August 12, 2018

Monolith to Microservices - a debugging story


I work for an e-commerce company which manufactures customizable products*.

Over the past couple of years we have been migrating from a monolithic system towards a service oriented one. The migration work is still in progress and has been mostly successful so far. Product search was a cross-cutting concern which had to work across both the old monolith world and the new service-based one. The monolith-based search implementation obviously did not consider products introduced on the newer platform.

The opportunity

A couple of months back, I got the opportunity to move search to the new microservice based platform. We had decided to use a third party search engine vendor and the off-monolith microservices had reached a point where a search MVP was conceivable.

A couple of aims we had in mind:
1. Drop some of the bells and whistles provided by monolith search. Mainly, drop some search result types and focus on product search.
2. Try to match the monolith performance in terms of page load times. This seemed hard since we were introducing extra hops, including a new vendor-based service which was hosted in a different data center.

The design was simple, and the microservices provided on the new platform lent themselves neatly to some of the aims:

Data feed/index:


Problems with response time

Things went well as we integrated the systems, but there was a problem. We had set a limit of less than 200ms response time for the https calls from the search results api. This was essential for us to keep within the performance budget for the service call while serving up search result pages.

The average response time was in the 500ms range. Traceroute showed the problem:

 1    <1 ms    <1 ms    <1 ms
22    13 ms    13 ms    13 ms [REMOVED]

Our data center is located in AWS Ireland, while the search engine's data center was in Rackspace New Jersey. This was easy enough to fix: the search engine provider relocated to the Rackspace London data center and the response times fell within the acceptable limit. The miss seemed obvious in retrospect. For some reason I had assumed this had been considered when the contract was signed - guess engineers were not very involved at that point :)
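Back-of-the-envelope physics explains why the relocation had to help. A rough sketch (the coordinates are approximate, and this floor ignores routing, queuing, and processing entirely; it is only the speed-of-light-in-fiber lower bound on round-trip time):

```python
from math import radians, sin, cos, asin, sqrt

def rtt_floor_ms(lat1, lon1, lat2, lon2):
    """Theoretical minimum round-trip time between two points:
    great-circle distance traversed at ~2/3 c (signal speed in fiber)."""
    R = 6371.0  # mean earth radius, km
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    dist_km = 2 * R * asin(sqrt(a))
    c_fiber = 200_000.0  # km/s, roughly 2/3 the speed of light
    return 2 * dist_km / c_fiber * 1000  # round trip, in ms

dublin = (53.35, -6.26)       # approximate data center locations
new_jersey = (40.22, -74.76)
london = (51.51, -0.13)

print(f"Dublin -> New Jersey RTT floor: {rtt_floor_ms(*dublin, *new_jersey):.0f} ms")
print(f"Dublin -> London RTT floor:     {rtt_floor_ms(*dublin, *london):.0f} ms")
```

Real round trips run several times the floor, so a transatlantic hop eats a large slice of a 200ms budget before any server-side work happens, while an Ireland-to-London hop is nearly free.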

We decided to A/B test the implementation at 50-50 traffic across our top two locales. The revenue metrics in the test were slightly down and we noticed bursts of timeout errors from the search engine in our logs. Occasionally the timeouts tripped the circuit breaker policy we had in place.

The timeouts were affecting ~2-3% of the search traffic.

Time to debug

There were three possibilities that came to mind:
1. The http client implementation on our side
Investigating this did not turn up much. The same http client is used across various service client calls, and there was no indication of port exhaustion or a similar problem on our servers/load balancers.

2. Something in the connection or network
Going over ping and MTR data did not show much either. I emailed one of our systems engineers with the traceroute data and got no reply. My guess was he did not think there was much to go on.

3. The vendor's service runtime
This was something I did not have visibility into. How could I debug it and convey useful information to the (external) service provider? What I had were the http client logs with timeout and circuit breaker events.

At this point, I will switch to extracts from emails.

We don’t see a drop in the number of timed-out requests. Around 12:48pm EST, we had our software circuit breaker trip a few times due to a series of requests timing out (the policy set in code is to trip the circuit breaker for 30 seconds after 10 consecutive failures – the circuit breaker tripped 4 times).
We also realized that while we have a circuit breaker policy, we did not set a retry policy for calls to the search engine. I have this in code now and it should go to prod tomorrow; will let you know if this improves things.
Another option we are considering is fronting search responses with CDN...
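The policy described in that email can be sketched roughly as follows. This is a minimal Python sketch, not our production implementation (which lived in our standard http client stack); the threshold and cooldown values mirror the policy above.

```python
import time

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    """Trip after `threshold` consecutive failures; fail fast for `cooldown` seconds."""
    def __init__(self, threshold=10, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitBreakerOpen("circuit open; failing fast")
            self.opened_at = None  # half-open: allow a trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
                self.failures = 0
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result

def with_retry(breaker, fn, attempts=3):
    """The retry policy we were missing: retry transient failures,
    but never retry into an open circuit."""
    last_exc = None
    for _ in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitBreakerOpen:
            raise
        except Exception as exc:
            last_exc = exc
    raise last_exc
```

The interplay matters: without the retry wrapper, every transient timeout surfaced to the caller and counted toward tripping the breaker.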

As we cannot locate httpd logs for the sample requests that the client said were timing out, we suspect that the error happened in the first leg of the request, from client to us. And with the loss at only 2-3%, traceroute is likely not going to show where the problem is. We'd need to run multiple traceroutes to possibly detect the problem.
Here is what we can offer in terms of improving performance:
1. This timing out might be caused by port exhaustion on the servers, would you please check netstat to ensure that this is not the issue?
2. We have noticed that there is no compression requested in the request, we would recommend sending an
Accept-Encoding: gzip header to improve performance...

The search responses were paged and hence small; while gzip would improve things, it did not seem to be a reason for the timeouts. We needed more data to help us all narrow down the possible cause.
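For what it's worth, the vendor's gzip suggestion is a one-header change. A sketch of both the header and the kind of saving involved (the URL is a placeholder and the payload is made up; no request is actually sent here):

```python
import gzip
import json
import urllib.request

# Opting in to compression is a single request header.
req = urllib.request.Request(
    "https://search.example.com/v1/query?q=mug&page=1",  # hypothetical endpoint
    headers={"Accept-Encoding": "gzip"},
)

# A small paged response compresses well, but the absolute saving is modest,
# which is why compression could not explain multi-second timeouts.
page = json.dumps({
    "results": [{"id": i, "title": f"Product {i}"} for i in range(20)],
    "page": 1,
    "total": 1234,
}).encode("utf-8")
compressed = gzip.compress(page)
print(f"{len(page)} bytes -> {len(compressed)} bytes gzipped")
```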

The timeouts are occurring at a similar rate today as yesterday.

I had a thought about one possible cause and a way to reduce it, could you consider it:
- The timeouts seem to occur in bursts; usually the burst lasts 1-2 minutes and then there are several minutes of search queries with no timeouts.
- The timeouts occur for various locales, though the occurrence is higher in en-us; that might be a reflection of the traffic.
- The timeouts occur across all our servers; there is no correspondence to increased load or network activity. It occurs at similar rates on all days of the week.
- I have attached an Excel sheet with timeouts and periods of no timeouts from 1:00am to 4:30pm EST on 4/17.
Is there any process which runs at these time intervals on the server(s) which serve up our company queries? Perhaps a re-index or some such?
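The burst pattern in those bullets is easy to surface from raw timeout logs: group events into bursts whenever consecutive timeouts fall within a short gap of each other. A minimal sketch (the timestamps below are made up for illustration, not our actual log data):

```python
from datetime import datetime, timedelta

def find_bursts(timestamps, max_gap=timedelta(minutes=1)):
    """Group timeout timestamps into bursts: runs of events where each
    event is within max_gap of the previous one."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)  # continues the current burst
        else:
            bursts.append([ts])    # quiet period ended; start a new burst
    return [(b[0], b[-1], len(b)) for b in bursts]

# Illustrative data: two bursts separated by a quiet period.
t0 = datetime(2018, 4, 17, 12, 48)
events = [t0 + timedelta(seconds=s) for s in (0, 20, 45, 70)]        # burst 1
events += [t0 + timedelta(minutes=9, seconds=s) for s in (0, 30)]    # burst 2

for start, end, n in find_bursts(events):
    print(f"{n} timeouts from {start:%H:%M:%S} to {end:%H:%M:%S}")
```

A table of burst windows and quiet windows is exactly the shape of data the vendor could line up against their own cron jobs and deploy logs.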


The timing data did the trick! The vendor was able to isolate the problem to a process on their side. It turned out they ran auto-deploys which were affecting the runtime behavior of the response system.

Thanks again for the below information, we have a short term resolution in place:
We have disabled autodeploys for the time being, which are causing these full loads to happen with each deployment. This should put a stop to the timeouts in the short term while we address the source of the issue.
Will keep you updated on the permanent solution.

Nice work, looks like that was it. No timeouts since ~6:49pm EST yesterday (4/17). Please convey my thanks to the dev team too, have attached a picture from our logging system showing the stop in timeouts.

We have made the loads asynchronous and they should no longer block search. This has been deployed to your cluster and there should no longer be timeouts caused by this particular issue.

Server-side time was now within limits; the brown section in the graph, corresponding to the search engine response time, sat at ~100ms: well within the 200ms limit we had set:


The MVP went well.

Beyond this, we no longer depend on the legacy search system, the search engine/vendor can be replaced easily in the face of risks, and the MVP has opened the door for other industry-standard search features.

In the service world, where we have limited visibility into service provider implementations, timing data can be a valuable source of debugging information.

*Views are my own, not of my employer Cimpress.
Boston, MA, United States