A Mild Log Situation

2018, Sep 11    

Prelude

At SAJ ADV we used to have a “log situation”. No two logs in the application followed the same pattern, parameters or format, so it was incredibly hard to filter and search through them when debugging. In one log we might have something like

[LOG_MESSAGE] [USER_INFO]

and then in another:

[LOG_TAG] [USER_INFO] [LOG_MESSAGE]

But, most importantly, way too many log messages weren’t abiding by the logging levels from any library whatsoever. We would have thousands of debug messages being outputted with the INFO level and warning messages as ERRORS that created false alerts to the team.

Given this chaotic scenario, I took upon me to bring some order to our code base and set the logs straight. I spent a whole day reading blog posts and documentations about tips for logging in Java, 10 best practices for writing logs and the like. In the end of the day, I wrote a code convention on how to write standardized logs in our code base:

Log levels

Looking at our code base, I found at least 3 different logging libraries (Java’s standard, Apache’s Log4j, and another one I can’t recall). After the whole day reading about it, I opted for Apache Log4j, mainly because it contains all the log levels I wanted to standardize:

  • FATAL: we use this level for when things have really gone catastrophically bad, like THE SYSTEM IS DOWN!. Somebody MUST look at what’s going on. Something like:
    Exception in thread "main" java.lang.OutOfMemoryError
    

    which implies that the instance is not responding to requests anymore.

  • ERROR (or SEVERE): this one should be used when the system is still running, but not quite how it should. Somebody MUST look at it as well.
    org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
    

    The instance IS responding to requests, but most likely the wrong way.0

  • WARN (or WARNING): the core system is fully functional, but not at its prime performance, or maybe some small functionalities are down. The error below portraits a failed request to the full-text search engine (GoldenGate, a Solr-based engine). It’s considered to be a warning because the functionality possess a fallback, and it queries the relational database when GoldenGate is down.
    Couldn't search with GoldenGate: javax.ejb.EJBTransactionRolledbackException: The search server's response indicates a bad request (status 400)
    
  • INFO: useful information for admin users or maybe people from other teams of the company (customer service, sales and marketing, etc).
    [Ical importing] execution time (seconds): '0', user info: [...], execution parameters: [...]].
    
  • DEBUG: useful information for developers at development time, like the steps a function call makes. Below, three distinct logs:
    Initiating Search Call.
    Initiating Count Call.
    Finished Search Call in 24ms.
    

Message standards

After clearly defining the log levels, and when they should be used, the next step was to standardize the messages that should be used when creating logs, so that no two developers would code different log messages. I created three standards for three separate situations that we have:

Logs for authenticated user

Used when an authenticated user initiated the request responsible for the log.

LOG_LEVEL SERVICE_NAME[cdTenant: 'TENANT_CODE', cdUser: 'USER_CODE', typeUser: 'USER_TYPE'] [CLASS] TAG MESSAGE

By having just this data on the user, any developer is able to further investigate an issue.

Logs for unauthenticated user

Similar to the previous standard, but instead of having user data, we try to log the requester’s IP and network data.

LOG_LEVEL SERVICE_NAME[ip: 'IP'] [CLASS] TAG MESSAGE

Logs for system jobs

And finally, a standard for jobs that are run by the system (like health checks) that aren’t user-centric.

LOG_LEVEL SERVICE_NAME[job: 'JOB_NAME', idExecution: 'ID'] [CLASS] TAG MESSAGE

Default library

As previously stated, I opted for the Apache Log4j library because of the log levels it possesses, and also because it works with our existing request filter: the filter adds most of the log message, e.g. LOG_LEVEL SERVICE_NAME[cdTenant: 'TENANT_CODE', cdUser: 'USER_CODE', typeUser: 'USER_TYPE'] [CLASS], so that the developer’s only concern when logging becomes the message itself.

Best practices

  • Never log sensitive data from your users, like login/email, passwords of any kind, addresses, name, etc. For completeness, refer to OWASP’s Logging Cheat Sheet
  • Your log methods must NOT cause side-effects, i.e., the system should behave the same with or without logs
  • Before invoking a logging method, you should test if the wanted log level is actually on. For example, you are trying to log an object, at the DEBUG level:
LOGGER.debug("Hey, this is the object: " + myObject);

Even before the variable myObject reaches the actual logging implementation, it’ll be evaluated to a String and concatenated to "Hey, this is the object: ". Then, once the actual debug() method is called, it might check and see that NO, the current application is not logging DEBUG. So your application just spent precious moments evaluating myObject’s toString() for nothing!

A better approach would be:

if (LOGGER.isDebugEnabled()) {
    LOGGER.debug("Hey, this is the object: " + myObject);
}

Further research

Two topics I want to look into, but didn’t get the chance to do so yet, are:

  • AspectJ - aspect-oriented programming for Java. This way the logs could be done by aspects that are included in methods and classes.
  • Asynchronous Log - so that logs don’t block user operations and requests.

I think these two could play a major role at improving log quality in the code base.

Final considerations

Overall, I feel that the trust the SAJ ADV’s team has on logs has improved, and it’s proving to be much easier to use them for debugging. My one-plus day researching and documenting logs was visibly worth it!