Operations Mise en Place, part I — Logs

Ryan Doherty
Published in EatCodePlay · Jun 9, 2017


When you first start writing a new web application, it’s often very easy to get caught up in building features. While shipping features is important, once your software is live your customers will encounter problems and you’ll need to investigate. You’ll also need to monitor if your site is working correctly, and be notified of problems before your customers contact you (or worse, stop using your product altogether).

When managing a web app, it’s critical to have visibility into what it’s doing, how much it’s doing it (whatever it is) and how it is performing. You also need the ability to quickly and easily debug issues that inevitably arise. Otherwise your software is effectively a ‘black box’, despite the fact that you wrote it!

As a full stack developer turned ops turned devops, I’ve been on both sides of the wall when dealing with large web apps. Having insight into how and why a website is behaving or misbehaving is crucial for both developers and operations to succeed. Over the past few years I’ve put together a list of visibility must-haves for any SaaS, which I have nicknamed ‘ops mise en place’.

They are, in order from most to least important:

  1. Logs
  2. Metrics
  3. Alerting
  4. Dashboards
  5. Application Performance Management (APM)

Once you have all of these systems in place, you will spot problems sooner, fix issues faster and have a much better high-level view of how your software is operating.

This is the first post in a series that will cover each item in turn, starting with the foundation of all software debugging and introspection: logs.

Logs


Unless you are the only user of your application, you don’t know how it’s being used and by whom. Information such as what data is being submitted, what types of errors are occurring and who exactly did what is critical to debugging and understanding how your system behaves and is used. How do you get this? Logs!

Logging 101

The most basic form of logging is a simple print() (or puts, p, or console.log()) in your code. Most web frameworks and tools have built-in logging libraries and often log basic information such as request data (paths, query strings, UAs) and response metadata such as timing information and status codes.

When you log in your code, the output usually ends up in a log file somewhere on the server’s local filesystem, often in your application’s directory or a system-level log directory such as /var/log. Get familiar with where that file is; you’ll be opening it often!

Request info logging is a good start, but there’s a lot more that you should log.

What should you log? Things like:

  • HTTP requests — also be sure to log extra metadata like user ids, query strings, form data (remove sensitive information like email addresses and passwords), response codes and request ids. This makes debugging a LOT easier when a customer contacts you about something that went wrong.
  • Complex or critical actions performed by the system. Log start and end events, how much data or work was done, etc.
  • Slow SQL queries (again, sanitized if necessary)
  • Errors/exceptions
  • Periodic tasks start/stopping and the amount of work they did
  • Anything else that may be useful or relevant
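
As a concrete illustration of the first bullet, here’s a minimal sketch in plain Ruby (stdlib Logger only, no framework — all names are invented for illustration) of logging a request with sanitized parameters and a generated request id:

```ruby
require "logger"
require "securerandom"

# Fields that should never reach a log file.
SENSITIVE_KEYS = %w[password email].freeze

# Strip sensitive fields before logging request parameters.
def sanitize(params)
  params.reject { |k, _| SENSITIVE_KEYS.include?(k.to_s) }
end

# Log one request with enough metadata to trace it later.
def log_request(logger, method:, path:, params:, user_id:, status:)
  request_id = SecureRandom.uuid
  logger.info(
    "request_id=#{request_id} method=#{method} path=#{path} " \
    "user_id=#{user_id} status=#{status} params=#{sanitize(params).inspect}"
  )
  request_id
end

logger = Logger.new($stdout)
log_request(logger, method: "POST", path: "/photos",
            params: { "title" => "cat", "password" => "hunter2" },
            user_id: 42, status: 201)
```

In a real app this would live in middleware, and the request id would come from (or be handed to) your framework so it appears on every line for that request.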

Examples for Rails:

Rails.logger.info("Adding 3 photos to photo gallery for user id: #{user.id}")

Rails.logger.warn("Retried #{tries} times to save photo: #{photo.id}, succeeded")

Rails.logger.error("Could not save photo, got error code #{error.code} and message #{error.message} for user: #{user.id}")

In general, the more logging the better. It’s a common habit to only log errors, but often it’s even more useful to have logs of successful tasks than just unsuccessful ones. For example, if your SaaS is unexpectedly slow, but still responding, it would be useful to see in the logs that a specific user added 10k images to their photo gallery.
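For instance, a hypothetical bulk import might log its start, size and duration even when everything succeeds (plain Ruby sketch, names invented):

```ruby
require "logger"

# Log start/end of a bulk job, including how much work it did and how long
# it took, so an unexpected slowdown shows up even though nothing "failed".
def import_photos(logger, user_id, photos)
  logger.info("photo_import start user_id=#{user_id} count=#{photos.size}")
  started = Time.now
  photos.each do |_photo|
    # ... actual import work would go here ...
  end
  elapsed = Time.now - started
  logger.info(format("photo_import done user_id=%d count=%d duration=%.3fs",
                     user_id, photos.size, elapsed))
end

import_photos(Logger.new($stdout), 42, Array.new(10_000) { |i| "photo-#{i}" })
```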

Logging 102: Centralization


Now that you’ve added logs to your app, you need a way to collect, aggregate and search them. Otherwise, every time you want to fix a problem you’ll have to ssh into your servers, grep through files and generally be confounded.

It’s incredibly difficult to debug systems when your logs are located on different systems that may all have different formats. Sometimes you’ll have a database server, web server, background job server and a few other miscellaneous servers. You can’t easily view logs on all of them and correlate without centralization. Use a tool like Splunk, Loggly or an Elasticsearch-Logstash-Kibana setup that takes logs from various sources, aggregates and normalizes them. This makes searching and viewing logs simple and easy compared to having four different ssh sessions open and trying to get them all to line up (literally and temporally).

Setting up a log aggregator is generally easy. Some have agents that run on your server(s) or you can point your logging tools directly at them, making setup even easier. Spend some time testing a few tools to figure out what works for you.

Once you have it all set up, spend some time doing practice searches, setting up automated queries and even some alerts. The better you are with your log search tool, the faster you will understand and fix problems in your app.

Logging 103: Structured Logging


If you’ve gotten this far, you have journeyed far on the path to logging enlightenment. The next step in your journey is structured logging.

When you first start logging, usually your logs are ad-hoc and often look something like this:

Rails.logger.error("Could not save photo, got error code #{code}")

Rails.logger.info("Adding 3 items to todo list")

While that is easy for a human to read and understand, parsing it by machine is tedious and unique to each logging call. You could try a regex, but writing a new regex every time you start logging something new is a Very Bad Idea. If you structure your logs appropriately, it opens up many useful tools and tricks when searching and monitoring them.

Structured logging is a combination of adding metadata to each log line and structuring your logging statements to be machine readable. The most common way to do this is to log in a structured format, like JSON or simple key=value pairs.
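Here’s a minimal sketch of both styles in plain Ruby (the field names are made up):

```ruby
require "json"

# Build one structured log line from a hash of fields.
# key=value is grep-friendly; JSON preserves nesting and value types.
def kv_line(fields)
  fields.map { |k, v| "#{k}=#{v}" }.join(" ")
end

def json_line(fields)
  fields.to_json
end

fields = { event: "photo_saved", user_id: 42, photo_id: 7, duration_ms: 13 }
puts kv_line(fields)   # => event=photo_saved user_id=42 photo_id=7 duration_ms=13
puts json_line(fields) # => {"event":"photo_saved","user_id":42,"photo_id":7,"duration_ms":13}
```

key=value lines stay readable at a glance; JSON costs a little readability but survives nesting and keeps types intact.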

For example, by default Rails logs look like this:

Started GET "/" for 127.0.0.1 at 2017-05-20 02:33:23 +0800
Processing by SiteController#index as HTML
Rendered text template within layouts/application (1.0ms)
Rendered layouts/_header.html.erb (5.0ms)
Rendered layouts/_footer.html.erb (7.6ms)
Completed 200 OK in 25ms (Views: 22.8ms | ActiveRecord: 0.0ms)

Attempting to parse this with a regex or custom code would be an exercise in futility. Instead you can use the lograge gem to generate something like this:

method=GET path=/ format=json controller=SiteController action=index status=200 duration=25 view=22.8 db=0

It is much more terse, but more importantly it is structured, meaning logging tools can easily extract information out of the log lines. Many log tools also allow for logging as JSON, which gives you nested structures and no need to write custom code to extract values. Some tools will automatically parse any key=value string in your log lines and show statistics like how often it saw specific values. Or you can do more advanced searches by specifying the key and value in your queries.
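Ruby’s standard-library Logger supports a custom formatter, so one way to emit JSON lines without any extra gems is a sketch like this (the field names are illustrative):

```ruby
require "json"
require "logger"
require "time"

# Emit every log line as a single JSON object instead of free text.
JSON_FORMATTER = proc do |severity, time, _progname, msg|
  payload = msg.is_a?(Hash) ? msg : { message: msg }
  JSON.generate({ level: severity, time: time.utc.iso8601 }.merge(payload)) + "\n"
end

logger = Logger.new($stdout)
logger.formatter = JSON_FORMATTER

logger.info(event: "photo_saved", user_id: 42, duration_ms: 13)
```

Each line comes out as one JSON object with the severity and timestamp merged in, which log tools can ingest directly.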

Here’s an example of Loggly and how it parses JSON log lines automatically:

Instead of multiple lines of hard to read text, we now have structured data that Loggly understands. We can search by fields, use ranges and more powerful search terms because Loggly understands our logs!

One more thing…

One more thing you can do to achieve enlightened logging is to have top-level application code attach metadata to each HTTP request’s (or background job’s) log output. Information like user ids, job ids and request ids is very helpful for debugging issues. You can see an example of this in the screenshot above, where the user id is included in the request information.
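Rails has this built in via config.log_tags and ActiveSupport::TaggedLogging; as a framework-free sketch of the same idea, a hypothetical wrapper could prefix every line with per-request tags:

```ruby
require "logger"

# Wrap a logger so every line carries per-request metadata (request id,
# user id, etc.) without each call site having to repeat it.
class TaggedLogger
  def initialize(logger, **tags)
    @logger = logger
    @prefix = tags.map { |k, v| "[#{k}=#{v}]" }.join(" ")
  end

  def info(msg)
    @logger.info("#{@prefix} #{msg}")
  end

  def error(msg)
    @logger.error("#{@prefix} #{msg}")
  end
end

log = TaggedLogger.new(Logger.new($stdout), request_id: "abc123", user_id: 42)
log.info("saving photo")
```

In practice you’d build one of these per request (in middleware) and pass it down, so every line for that request is instantly correlatable in your log search tool.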

Review

Phew! While it is a lot of work, adding logs to your app, funneling them into a centralized service and structuring them well will have a huge ROI. All of this work improves your visibility into your app, reduces time spent troubleshooting issues and gives you additional information about how your app is being used and by whom.

Stay tuned for part II: metrics!
