Monitoring is essential for modern applications. They are highly distributed and depend on many components such as databases, downstream services, and caches, so the whole system resembles a service mesh. Tracing and monitoring these services is essential for adhering to an SLA (Service Level Agreement). An SLA is an agreement between a service provider and its client that covers reliability, responsiveness, and other service-level metrics. Violating any part of an SLA can have serious consequences: a service that fails to meet the agreed terms risks brand reputation damage and revenue loss. Worst of all, a company may lose a customer to a competitor because it could not meet that customer’s service-level requirements.
What kind of metrics should be monitored?
- Service availability: the amount of time the service is available for use. This can be measured in terms of response time at a given percentile, abbreviated pX (for example p95, p99, or p99.999), meaning that X% of requests complete within the target time. Not all services require p99.999; systems that must be highly available, like e-commerce, search, and payment, should have a stricter SLA.
- Defect rates: Despite good efforts in system development, no system is 100% perfect. We should track counts or percentages of errors in major flows. Production failures such as server errors, database query errors, connection errors, network errors, and missed deadlines can be included in this category.
- Security: In these hyper-regulated times, application and network security breaches can be costly. Controllable security measures, such as system access, unauthorized access to a database, bulk downloads of database records, or large data movements, can be measured in this category.
- Business results: Increasingly, IT customers would like to incorporate business process metrics into their SLAs so that better business decisions can be made. In this category, different data can be collected, like the number of users who visited a page, login activity, coupon effectiveness, etc.
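To make the pX notation from the availability bullet concrete, here is a minimal sketch that computes a percentile from raw latency samples using the nearest-rank method. The class name `Percentiles` and the sample values are made up for illustration; a real metrics store usually estimates percentiles from histogram buckets rather than from raw samples.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class Percentiles {
    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank of the sample that covers p percent of the data
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

With ten samples where one request took 2000 ms, the median stays low while p95 is dominated by the slow request, which is why SLAs tend to be stated on high percentiles rather than on averages.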
Choosing a set of metrics can be non-trivial in the beginning, and it is easy to get stuck on how much to monitor. We can start with a bare minimum and add more as needed.
A typical monitoring setup has three components:
- Metrics store (generally a time series database) like InfluxDB, TimescaleDB, Prometheus, etc
- Dashboard (dashboard would be used to visualize the data stored in the metrics store)
- Applications that either keep pushing metrics to the metrics store, or expose their local state so that the metrics store can pull from it periodically.
We can have other components as well, for example alerting, where the alert channels could be email, Slack, or others. The alerting component sends alerts to the application owners or to subscribers of events. We’re going to use Grafana as the dashboard and alerting system, and Prometheus as the metrics store.
Things we need:
- Any IDE
- Java platform
- Gradle
Create a project from Spring Initializr and add whichever dependencies we need. We’re going to use the Micrometer library, an instrumentation facade that provides bindings for many metric stores such as Prometheus, Datadog, and New Relic, to name a few. Out of the box, Micrometer provides metrics for 1. HTTP requests 2. the JVM 3. databases 4. cache systems, and more. Some of these metrics are enabled by default, whereas others can be enabled, disabled, or customized. We’ll use the application.properties file to handle enabling, disabling, and customization. We also need Spring Boot Actuator, as it exposes the Prometheus endpoint.
Add these dependencies in build.gradle file
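The original dependency list is not reproduced here, so the following is a sketch of what it could look like, assuming a Spring Boot 2.x or 3.x project where the Spring Boot Gradle plugin manages versions. The web and AOP starters are assumptions based on the endpoints and aspects used later in this post.

```groovy
dependencies {
    // Web layer for the demo REST endpoints
    implementation 'org.springframework.boot:spring-boot-starter-web'
    // Actuator exposes the /actuator/* endpoints, including /actuator/prometheus
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    // Spring AOP support, needed for the profiling aspects defined later
    implementation 'org.springframework.boot:spring-boot-starter-aop'
    // Micrometer registry that renders metrics in the Prometheus text format
    implementation 'io.micrometer:micrometer-registry-prometheus'
}
```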
We can enable Prometheus export by adding the following line to the properties file.
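The property name depends on the Spring Boot version, which this post does not state, so both spellings are shown here as a hedged sketch. Spring Boot 2.x uses the first form, and Spring Boot 3.x moved it to the second.

```properties
# Spring Boot 2.x
management.metrics.export.prometheus.enabled=true
# Spring Boot 3.x equivalent (use instead of the line above)
# management.prometheus.metrics.export.enabled=true
```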
Once this line is added, Micrometer starts accumulating data about the application, which can be viewed at the actuator/prometheus endpoint. Prometheus scrapes this endpoint to fetch data from our application servers. Even with this line in the properties file, we can’t browse the Prometheus endpoint yet, because it is not exposed over HTTP by default. We expose it by including prometheus in the management endpoints exposure list (management.endpoints.web.exposure.include).
NOTE: Do not enable all Actuator endpoints, as this can open a security loophole. We should choose them selectively, especially in production. Even when an endpoint is needed, do not expose it to the whole world, since it can reveal a great deal of data about the application; put it behind a proxy or an access rule to hide it from the outside world.
Different aspects of the HTTP request metrics are customizable, such as the SLA boundaries and whether a percentile histogram should be computed; this is done using the metrics.distribution properties.
A sample application.properties can have these lines
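The original sample is not reproduced here. As a sketch, the exposure and distribution settings discussed above could look like the following in addition to the export switch. The bucket boundaries are illustrative values, not the author's, and the `sla` key is the Spring Boot 2.x spelling (it was renamed to `slo` from Spring Boot 2.3 onward).

```properties
# Expose only the endpoints we actually need over HTTP
management.endpoints.web.exposure.include=health,prometheus
# Publish histogram buckets for HTTP server requests so that
# percentiles can be computed on the Prometheus side
management.metrics.distribution.percentiles-histogram.http.server.requests=true
# Extra bucket boundaries aligned with our latency targets (illustrative values)
management.metrics.distribution.sla.http.server.requests=100ms,250ms,500ms,1s
```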
Now if we run the application and browse to http://localhost:8080/actuator/prometheus , a large amount of data is displayed.
The above data shows HTTP request details. exception=None means no exception occurred; if one did, we can use this label to filter how many requests failed due to that exception. method=GET is the HTTP method name, status=200 is the HTTP status code, and uri=/actuator/prometheus is the URL path. le=xyz is the upper bound of a histogram bucket in seconds, and N.0 is the cumulative number of requests that completed within that bound.
This data is a histogram that can be plotted in Grafana, for example to plot p95 over 5 minutes we can use the following query.
histogram_quantile(0.95,sum(rate(http_server_requests_seconds_bucket[5m])) by (le))
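To show roughly what the query does with the le buckets, here is a dependency-free sketch of the estimation: find the bucket in which the requested rank falls, then interpolate linearly inside it. This is a simplified model for illustration, not Prometheus code; the class name and the bucket values in the usage note are assumptions.

```java
// Hypothetical, simplified model of quantile estimation over cumulative buckets.
final class HistogramQuantile {
    /**
     * @param q                quantile in (0, 1], e.g. 0.95 for p95
     * @param upperBounds      ascending bucket upper bounds (the le label); last one must be +Infinity
     * @param cumulativeCounts number of observations less than or equal to the matching bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (q <= 0 || q > 1) {
            throw new IllegalArgumentException("q must be in (0, 1]");
        }
        if (n < 2 || n != cumulativeCounts.length || !Double.isInfinite(upperBounds[n - 1])) {
            throw new IllegalArgumentException("need matching buckets ending with +Inf");
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations yet
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // Rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - below;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

For buckets 0.1, 0.25, 0.5, 1.0 and +Inf with cumulative counts 50, 80, 95, 99 and 100, the p95 estimate lands at the 0.5 second boundary. The estimate can never be more precise than the bucket boundaries, which is one reason to align the configured boundaries with the SLA targets.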
We can plot other metrics in Grafana as well, using other visualizations such as pie charts.
Many times we need custom metrics; some use cases are the number of logged-in users, currently available stock, the number of orders in the order queue, etc. Several such business use cases can be solved using custom metrics. Micrometer supports different types of metrics; we’ll mainly focus on Gauge and Counter. A Gauge gives us an instantaneous value, like the length of a queue, whereas a Counter is a monotonically increasing number that starts from 0.
For this, we’re going to create a demo stock manager that stores details in memory and provides two functions: 1. add items 2. get items
In this, we’ve created one counter and one gauge in the init method. Whenever getItem is called, we increment the counter and also re-measure the stock size, whereas when addItems is called we only update the gauge.
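The behaviour described above can be sketched without any dependencies. In the real application the two fields would be a Micrometer Counter and Gauge registered with the meter registry; here they are stood in for by plain atomic fields so that the counter and gauge semantics are easy to see. The class name `StockManagerSketch` and its method signatures are assumptions, not the author's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Dependency-free model of the demo stock manager (hypothetical names).
final class StockManagerSketch {
    private final List<String> items = new ArrayList<>();
    // Counter semantics: starts at 0 and only ever increases.
    private final AtomicLong orderCreated = new AtomicLong();
    // Gauge semantics: a snapshot of the current value; it can go up or down.
    private final AtomicInteger stockSize = new AtomicInteger();

    synchronized void addItems(List<String> newItems) {
        items.addAll(newItems);
        stockSize.set(items.size()); // only the gauge changes here
    }

    synchronized List<String> getItems(int count) {
        int n = Math.min(count, items.size());
        List<String> order = new ArrayList<>(items.subList(0, n));
        items.subList(0, n).clear();  // remove the ordered items from stock
        orderCreated.incrementAndGet(); // one more order placed
        stockSize.set(items.size());    // re-measure the stock
        return order;
    }

    long orderCreatedTotal() {
        return orderCreated.get();
    }

    int stockSize() {
        return stockSize.get();
    }
}
```

Adding four and then six items and placing one order of size three leaves the gauge at 7 and the counter at 1, which matches the values discussed later in this post.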
For demonstration purposes, we’ll add two endpoints to add items and get items.
Let’s first add 10 items using two API calls
curl -X POST "http://localhost:8080/stocks?items=1,2,3,4"
curl -X POST "http://localhost:8080/stocks?items=5,6,7,8,9,10"
Now if we browse to the Prometheus endpoint, we can see the following data, which indicates that we currently have 10 items in stock.
# HELP stock_size Number of items in stocks
# TYPE stock_size gauge
Now we’re going to place an order of size 3
Again, if we browse the Prometheus endpoint, we get the following data, which indicates that the stock size has changed to 7
# HELP stock_size Number of items in stocks
# TYPE stock_size gauge
We can also see that the counter now has the value 1, which indicates that one order has been placed.
# HELP order_created_total number of orders created
# TYPE order_created_total counter
In software engineering, profiling (“program profiling”, “software profiling”) is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization. Profiling is achieved by instrumenting either the program source code or its binary executable form using a tool called a profiler (or code profiler). Profilers may use a number of different techniques, such as event-based, statistical, instrumented, and simulation methods.
Profiling is very helpful in diagnosing system problems: how much time an HTTP call takes and, if it takes N seconds, where that time was spent and how those N seconds are distributed among database queries, downstream service calls, and so on. We can use a histogram to plot the distribution in the dashboard, and a counter to measure things like the number of DB queries. For profiling, we need to inject code into many functions so that it runs as part of the method execution. The profiling code is essentially the same for each type of profiler, which means copy-pasting similar code in thousands of places and updating all of them whenever anything changes. Profiler code in every file, and likely in every function that requires profiling, increases complexity and can become a complete mess; we can avoid this mess using Aspect-Oriented Programming (AOP).
In short, AOP is built on the proxy design pattern, though it can also be implemented using bytecode modification.
Normally, when a method is called, we expect the target method to be invoked directly, without any intermediate steps. With AOP in place, the call is intercepted by a proxy method, the proxy method calls the target method, and the proxy method then returns the result to the caller, as depicted in the figure below.
The system depends on several other systems, so we may want to profile different components differently: for example database calls, HTTP requests, downstream service calls, or specific methods that are critical or that we want to inspect more closely. We can use the same Micrometer library for profiling as well, but it may not do exactly what we want out of the box, so we’ll adapt the code.
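The interception flow described above can be sketched with the JDK's built-in dynamic proxies. This is not how Spring AOP is configured in this post; it is only a minimal, dependency-free illustration of a proxy that sits between caller and target and records timing. All class and interface names here are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical service used only for this illustration.
interface StockService {
    int addItems(int count);
}

final class InMemoryStockService implements StockService {
    private int size;

    @Override
    public int addItems(int count) {
        size += count;
        return size;
    }
}

final class TimingProxy {
    // Per-method call counts and total elapsed time, keyed by method name.
    static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();
    static final Map<String, LongAdder> NANOS = new ConcurrentHashMap<>();

    /** Returns a proxy that times every call before delegating to the target. */
    static <T> T wrap(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                // The proxy calls the real method and hands its result back to the caller.
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception
            } finally {
                long elapsed = System.nanoTime() - start;
                CALLS.computeIfAbsent(method.getName(), k -> new LongAdder()).increment();
                NANOS.computeIfAbsent(method.getName(), k -> new LongAdder()).add(elapsed);
            }
        };
        Object proxy = Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }
}
```

The caller only sees a `StockService`, so the timing logic lives in one place instead of being copied into every method, which is the point of using AOP for profiling.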
Micrometer comes with a Timed annotation that can be placed on any public method and, as the name suggests, measures the execution time of that method. Instead of using this annotation directly, we’ll define our own variant of it that supports other features such as logging and retry. The Timed annotation does nothing without a TimedAspect bean, and since we’re redefining the annotation, we’ll also define our own TimedAspect class to cover needs such as logging, mass profiling (profiling all methods in a package without adding an annotation to any method or class), and retry. In this story, we’ll look at three use cases: 1. Mass profiling 2. Logging 3. Profiling a specific method
Create a Java file MonitoringTimed.java. In it we’ve added a new field called loggingEnabled, which is used to check whether logging is enabled; if it is, the method arguments and return values are logged.
This annotation is not useful without the timed aspect class, so we’ll define a new class, MonitoringTimedAspect, with all the required details. This class has one method that profiles any method given a ProceedingJoinPoint object and a MonitoringTimed object, and another that profiles methods based on the MonitoringTimed annotation.
The timedMethod method, annotated with @Around, intercepts all calls to methods annotated with MonitoringTimed.
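The annotation-driven filtering can be modelled without Spring or AspectJ. The sketch below is not the author's MonitoringTimedAspect; it uses a JDK dynamic proxy to show the same idea: only methods carrying the annotation are timed, and arguments and return values are logged when loggingEnabled is set. The `value` attribute, its default, and the helper classes are assumptions made for this illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Retention(RetentionPolicy.RUNTIME) // must be visible via reflection at runtime
@Target(ElementType.METHOD)
@interface MonitoringTimed {
    String value() default "method.timed";   // metric name (assumed attribute)
    boolean loggingEnabled() default false;  // log arguments and return value
}

interface OrderService {
    String placeOrder(int size);
    int pending();
}

final class SimpleOrderService implements OrderService {
    private int pending;

    @MonitoringTimed(loggingEnabled = true)
    public String placeOrder(int size) {
        pending++;
        return "order-of-" + size;
    }

    public int pending() { // not annotated, so it is not profiled
        return pending;
    }
}

// Not thread-safe; kept simple for illustration.
final class MonitoringTimedProxy {
    final Map<String, Long> callCounts = new HashMap<>();
    final List<String> log = new ArrayList<>();

    <T> T wrap(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Look the annotation up on the implementation, where it was declared.
            Method impl = target.getClass().getMethod(method.getName(), method.getParameterTypes());
            MonitoringTimed timed = impl.getAnnotation(MonitoringTimed.class);
            if (timed == null) {
                return call(method, target, args); // pass through untouched
            }
            long start = System.nanoTime();
            Object result = null;
            try {
                result = call(method, target, args);
                return result;
            } finally {
                long elapsedNanos = System.nanoTime() - start;
                callCounts.merge(timed.value(), 1L, Long::sum);
                if (timed.loggingEnabled()) {
                    log.add(method.getName() + Arrays.toString(args) + " -> " + result
                            + " (" + elapsedNanos + " ns)");
                }
            }
        };
        return iface.cast(Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    private static Object call(Method method, Object target, Object[] args) throws Throwable {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause();
        }
    }
}
```

In the Spring version, the @Around pointcut performs the annotation filtering and the ProceedingJoinPoint plays the role of the `call` helper.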
For mass profiling, we’ll define a profiler class that filters at the package level. For example, for HTTP request profiling we can add a ControllerProfiler, which handles the profiling of all public methods in the controller package.
An interesting line in the above code is @Pointcut("execution(* com.gitbub.sonus21.monitoring.controller..*.*(..))"), which defines a pointcut. A pointcut expression can combine conditions using the boolean operators not (!), or (||), and and (&&). Once a method matches the pointcut expression, the corresponding advice method defined with the @Around annotation is called; here that is the profile method we defined. We can also define other advice methods using the @After, @Before, etc. annotations.
If we run this application with the above code and browse the Prometheus endpoint, the following data can be seen
We can directly use MonitoringTimed annotation as well on any method to measure execution time, for example, let’s measure how much time StockManager’s method addItems takes in adding items.
Once we start the application and add a few items, we can see the following at the Prometheus endpoint
# HELP method_timed_seconds
# TYPE method_timed_seconds summary
MonitoringTimed can be customized further: for example, a retry count can be added to support retrying on failure, and function arguments can be logged on failure so that the cause can be analyzed later.
Complete code is available at https://github.com/sonus21/monitoring