Tuesday, January 22, 2013

Logging for fun and profit

The problem

I work for a startup, and the typical users of our web based application are school administrators and teachers.

Sometime back, my manager presented interesting data related to user activity that he had collected with Google Analytics. From the questions our non-engineering colleagues asked during the presentation, it seemed like this was not enough for non-developers to answer business questions like:

  • Did user X in school district Y edit student Z's test information?
  • User A has raised a vague help desk ticket. What did A really do?
  • Which is the most active school district for today? For this month?

Getting the others to make database queries to the audit database didn't seem like a viable solution. Firstly needed to give an intro to SQL, and then get them familiar with the schema and then...

Another option was to extend our Google Analytics integration. Since we preferred to have more control over the data, we thought of doing it differently - implement a log analysis mechanism to allow non-developers to get quantitative answers to business questions. Some of the aims while implementing a solution were:
  1. Low cost, at least till the solution has paid for itself. As a startup, this aim is usually a necessity.
  2. Have minimal impact on the application.
  3. At a later time, be able to feed back this data into the features in the main application.

The message format

Layer 1

If a standard has already been defined as an IETF RFC, it is a good idea to use it.
I based the message format on the syslog protocol, and simplified the format prescribed in the RFC to allow simpler parsing/indexing of the message data.

SYSLOG-MSG = STRUCTURED-DATA [MSG]
STRUCTURED-DATA = "[" "TIME="TIMESTAMP "PRIORITY="PRIORITY-VALUE "APP="APP-VALUE "UUID="UUIDID [SD-NAME"="SD-VALUE]* "]"
(More in the RFC)

To keep the messages short, I constrained them to a max of 2048 characters.

Layer 2

My guess was that the data search will eventually be better if the applications formatted messages in a natural language like way. For this I chose a subject-object-verb format.

[USERDATA] [SUBJECT] [VERB] [OBJECT] [FOR] [ADDITIONAL PARAMETERS]* [MESSAGE]

So an example message could be:
[ TIME="2013-01-21 16:30:22:51" PRIORITY="INFO" APP="MyAppNAme" UUID="a2383776-9bfa-ced3-b861-1349808ddba6" CONTROLLER="StudentHistory" USER="john" USERID="1090" DISTRICT="Chester County Schools" STATE="SC" SUBJ="STUDENT" VERB="VIEW" OBJ="HISTORY" FOR="EDIT" STUDENTID="248" STUDENT="Jane Doe" ] This does not really need a message.

The protocol

I decided to use UDP based on past experience. As compared to TCP, UDP is:
  1. Non-streaming i.e. 'Fire and forget'. By design, the client (i.e. our production system) is more decoupled from the logging server when compared to TCP. If there are issues with the logging server, there is less likelihood of it hurting the web servers.
  2. Faster. Minimal overhead (no 3 way handshake). Smaller header size. No recovery options etc.
  3. Works well for small messages, which is true by design for our logging.
SNMP, the popular network management protocol uses UDP as transport.

On the other hand UDP is unreliable and lacks congestion control. However these weren't big issues for us, since we didn't care if say 1% of the logging data is lost, at least not currently. And since we planned to constrain the amount of data being logged in the application, we didn't think congestion would be an issue.

Implementation

The C# implementation (our web servers run on IIS/ASP.NET/Windows) is quite simple. All that is needed is a UDP library, string manipulation and optionally a dictionary data structure for the 'structured data'.

The parts consisted of:
A configuration section with server(s) name and port.
Layer 1: Syslog class which takes a dictionary of 'structured data', formats it into a log message and sends it over UDP to the server(s)
Layer 2: A builder class which allows log messages to be created using method chaining.

The details aren't particularly interesting, so won't describe it here. The end result is that in the consumer code, an example log call would be:


new SyslogBuilder( this )
    .Subj( "USER" )
    .Verb( "VIEW" )
    .Obj( "SWITCHDISTRICT" )
    .AddData("DISTRICTNAME", district.districtName)
    .Info( Session.SessionID );



The remaining work was to insert the logging code in the application.

The log server

It was tempting to implement a UDP listener which parses the log messages, create an indexing scheme using a non-SQL store and build a UI which then... However with 10 other pending tasks, this didn't seem like the way to go. Moreover, mine would likely be a terrible implementation for a problem more intelligent programmers have solved elegantly.
(sometimes I look back at my previous job in a big company, when I had time to do more fun stuff like that - but usually ended up browsing the web :)

So I chose Splunk, an excellent tool I had used before for log analysis. The free version has:
  1. A UDP listener
  2. A schema-less system which indexes text data as it ingests it.
  3. An elegant search language based on UNIX command pipeline.
  4. A web UI with ability to create dashboards and applications.
  5. Free for up to 500MB of log data a day. Based on some quick calculations, that was more than enough till that we can move on to a real license of Splunk (i.e. we make lots of money, get taken over or some such happy scenario).
I uninstalled software unnecessary for my purpose from an Amazon EC2 Ubuntu server, installed the free version of Splunk, set it up to listen on the UDP port of choice and configured the firewall for our webservers.

The free version of Splunk does not provide access control for its web UI, so had to setup Apache as a reverse proxy to provide for this (adding some firewall rules is probably not a bad idea too).

Starting it up

After making sure the firewall rules in our data-center was okay, it was time to send some test data. Perhaps with a Powershell script, but my laziness demanded something simpler: copy over nc.exe (netcat) from my Cygwin installation, to a temp directory on the server. Ran it but it complained about missing cygwin1.dll, copied that and we are ready to test.

nc -uvv mysplunkserver 514
Hello world

Doing a search on the Splunk server gave:
12/16/12
11:44:44.000 AM
hello world
host=mysplunkserver   Options|  sourcetype=syslog   Options|  source=mysourceipaddress.mydomain.com

Ready to turn production data on!

With data from a couple of days was able to create some drill-down reports. This was useful to present the tool to the non-developers. A few screenshots:





The pretty graphs are useful for some scenarios, but the real power is usually in the search.  Splunk makes searching easy since it indexes the data based on its inherent format.

e.g. to look for the top users in a particular district:
DISTRICT="Chester County*" | stats count by USER
which I think is a simple enough language to learn for non-developers.

Soon our customer support person had used this to solve a real life problem. Our founder, who does not have an engineering background, has thought up more uses for the tool. IMO this is one of the better things of working in a startup, your implementation's usefulness is quickly verified. Can be disheartening when it fails, but great when occasionally it is useful to others.

At some point, we hope to use the data (via Splunk's RESTful API) to feed back into certain features in our main application.

Would like to hear how you implement logging in your system!


No comments:

Post a Comment

Boston, MA, United States