Monitoring Serverless Apps Using Prometheus

Sonu Kumar
Jan 8, 2020

Prometheus is an open-source systems monitoring and alerting toolkit. Monitoring data is stored locally in RAM and LevelDB; nevertheless, it can also be shipped to other storage systems such as Elasticsearch, InfluxDB, and others. We can configure Prometheus to notify interested parties based on the severity of a failure or an SLA breach.

Prometheus Deployment

A typical deployment of Prometheus consists of 3 components.

  1. App server(s)
  2. Prometheus server
  3. Prometheus data stores like Elasticsearch, InfluxDB, etc.

Prometheus is a single-node server, which means it can be a single point of failure (SPOF); to mitigate this, we can use Thanos. Prometheus differs from monitoring systems like Datadog and Graphite in that it is a pull-based monitoring system, whereas those systems are push-based. In Prometheus, the user must configure scrape target(s); a scrape target is a web resource endpoint that provides the metrics data for that service.

Prometheus Setup

A scrape target is nothing but an HTTP endpoint that provides metrics for that server. The scrape endpoint must return a response in the text format understood by Prometheus. Generally, the target is an internal route provided by the application that is only available to Prometheus. Each app server has to respond with its current metrics, which means an app server must accumulate all metric data between two consecutive polls. Scrape targets are polled frequently, for example every second or every few seconds, which means we only need to keep data for that window; the interval itself is configured using scrape_interval. An application can store this data in its own memory instead of using some unified storage layer; storing data in application memory has its own pros and cons (a minimal scrape target sketch follows the lists below).

Pros

  • Do not require integration with other storage systems.
  • No communication overhead.
  • Easier to build a storage layer as many libraries exist.

Cons

  • On a restart of the application, data is lost.
  • The memory footprint increases with the number of data points.
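
To make this concrete, here is a minimal sketch of a scrape target built with the official Python client library, prometheus_client. The port number, metric names, and the simulated request handler are assumptions for illustration; the point is that metrics accumulate in application memory and are exposed on a /metrics endpoint that Prometheus polls every scrape_interval.

# A minimal scrape target sketch using prometheus_client.
# Port, metric names, and handler logic are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "code"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration in seconds")

def handle_request():
    # Simulate work and record metrics in application memory;
    # they are exposed on the next scrape of /metrics.
    start = time.time()
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(method="post", code="200").inc()
    LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    # Expose the /metrics endpoint that Prometheus will scrape.
    start_http_server(8000)
    while True:
        handle_request()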

Now let’s turn our attention toward HTTP request handling.

HTTP Request Handling


We can deploy our servers in many ways: behind a firewall, active-passive, across multiple availability zones, etc. The simplest deployment could be multiple servers behind a load balancer, with the load balancer connected to the outside world via the internet. In such a deployment, a client request reaches the load balancer, which forwards it to a specific server. The request handling mechanism depends on the programming language and framework used for application development, and on the server software and modules like Apache 2, Nginx, Tomcat, Gunicorn, Puma, uWSGI, etc.

Serverless Computing

Serverless computing, a.k.a. FaaS (Function as a Service), is a computing infrastructure provided by cloud computing providers. Serverless may sound like no server is involved, but internally there is one; no computation can be done without resources. Serverless resources differ from classical web servers: classical web servers run 24x7, whereas serverless instances are created on demand using triggers/events, and the lifespan of a serverless instance is very small compared to a web server's. For example, AWS provides many ways to trigger a function call, such as whenever an item changes in DynamoDB, on the addition of new items to an SQS queue, or by running a function at a scheduled time. Once an event has been handled by the function, the instance can be terminated immediately or reused. Generally, developers do not have control over serverless instance termination, unlike in HTTP request handling where a process can be terminated once the client's request has been served.

Prometheus With Pushgateway

Pushgateway is a data aggregator that aggregates data from multiple application servers and stores it in its internal data store. Prometheus is configured to scrape data from the Pushgateway. It solves the problem of application data storage, but it comes with a SPOF.

In this setup, we have two types of applications, Application 1 and Application 2. Either of them could be a serverless application or a web app.

All metrics data could be lost if the Pushgateway goes into a bad state; there could be any number of reasons, like hardware failure, a network issue, running out of memory, etc. The Pushgateway should be considered a last resort in a deployment, as explained in the docs:

We only recommend using Pushgateway in certain limited cases. There are several pitfalls when blindly using the Pushgateway instead of Prometheus’s usual pull model for general metrics collection:

When monitoring multiple instances through a single Pushgateway, the Pushgateway becomes both a single point of failure and a potential bottleneck.

You lose Prometheus’s automatic instance health monitoring via the up metric (generated on every scrape).

The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway’s API.

Reference: https://prometheus.io/docs/practices/pushing/

This approach should be used for serverless applications, where new instances are started and stopped on demand, and for applications that handle batch jobs.
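
As a sketch of how a short-lived function or batch job might push its metrics before its instance terminates, here is an example using prometheus_client; the gateway address and job name are assumptions.

# A sketch of pushing metrics from a short-lived job or serverless handler
# to a Pushgateway, using prometheus_client. The gateway address and job
# name are assumptions; in AWS Lambda this would run inside the handler.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

def handler(event=None, context=None):
    registry = CollectorRegistry()
    processed = Counter(
        "items_processed_total",
        "Items processed by this invocation",
        registry=registry,
    )
    processed.inc()  # do the real work here
    # Push the accumulated metrics before the instance is terminated.
    push_to_gateway("pushgateway.example.com:9091", job="my_batch_job", registry=registry)

if __name__ == "__main__":
    handler()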

Due to the above limitations, we should consider using other alternatives like the Node exporter, StatsD exporter, and Graphite exporter. All three exporters are deployed on the host machine along with the application code, and data is collected in the exporter instance.

Node Exporter

The Node exporter is one of the recommended exporters, as it can export host system details as well, like ARP, CPU, disk utilization, and memory info, and it provides a textfile collector. The textfile collector is the interesting part for us: our application writes data to a text file, and that text file is exported to Prometheus.

The text file must follow the Prometheus text exposition format; the application code keeps writing data to the text file, and the exporter exposes it. To use it, set the --collector.textfile.directory flag on the Node exporter. The collector will parse all files in that directory matching the glob *.prom using the text format.

Ref: https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-example

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
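
For the textfile collector, the application only has to produce a *.prom file in this format. Below is a minimal sketch using prometheus_client's write_to_textfile helper; the file path and metric name are assumptions, and the path should point into the directory passed to --collector.textfile.directory.

# A sketch of producing a *.prom file for the Node exporter's textfile collector.
# The directory path and metric name are assumptions.
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
last_run = Gauge(
    "batch_job_last_success_timestamp_seconds",
    "Unix timestamp of the last successful batch run",
    registry=registry,
)
last_run.set_to_current_time()

# write_to_textfile writes to a temporary file and renames it, so the
# Node exporter never reads a half-written file.
write_to_textfile("/var/lib/node_exporter/textfile/batch_job.prom", registry)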

StatsD Exporter

The StatsD exporter speaks the StatsD protocol, and an application can talk to it using TCP, UDP, or a UNIX datagram socket. StatsD cannot understand any data other than its predefined format, so the application must send data in StatsD format. Many client libraries can communicate with a StatsD server.

StatsD data format

<metric_name>:<metric_value>|<metric_type>|@<sampling_rate>

Some sample data might look like this:

# login users count sampled at 50% sampling rate

login.users:10|c|@0.5

I would suggest using a library to communicate with the StatsD exporter unless you're willing to take the risk of hand-rolling the protocol.
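
To show what such a library does under the hood, here is a sketch that emits a StatsD metric over UDP using only the Python standard library; the exporter host, port, and metric name are assumptions.

# A sketch of emitting a StatsD metric over UDP with the standard library.
# Host, port, and metric name are assumptions; check your StatsD exporter's
# configured listen address.
import socket

def send_statsd(metric, value, metric_type="c", sample_rate=1.0,
                host="localhost", port=9125):
    payload = f"{metric}:{value}|{metric_type}"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    # Fire-and-forget datagram; UDP sends are not acknowledged.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))

send_statsd("login.users", 10, "c", sample_rate=0.5)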

Graphite Exporter

The Graphite exporter is very similar to the StatsD exporter, but it speaks the Graphite plaintext protocol. The exporter's memory usage would keep increasing as metrics accumulate in it, which could lead to an OOM (Out of Memory) error.

To avoid OOM, metrics are garbage collected five minutes after they are last pushed to the exporter; this window is configurable with the --graphite.sample-expiry flag.

In the Graphite plaintext protocol, data is described using: <metric path> <metric value> <metric timestamp>

Graphite also supports tagging/labeling of metrics, which can be achieved by adding tags to the metric path, as in:

disk.used;datacenter=dc1;rack=a1;server=web01

In that example, the series name is disk.used and the tags are datacenter=dc1, rack=a1, and server=web01.
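
As a final sketch, here is what sending a tagged metric in the Graphite plaintext protocol over TCP could look like using only the Python standard library; the exporter host, port, and metric path are assumptions, so check your Graphite exporter's configured listen address.

# A sketch of sending a Graphite plaintext metric over TCP.
# Host, port, and metric path are assumptions.
import socket
import time

def send_graphite(path, value, host="localhost", port=9109):
    # Format: <metric path> <metric value> <metric timestamp>\n
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("utf-8"))

# Tags are appended to the metric path with semicolons.
send_graphite("disk.used;datacenter=dc1;rack=a1;server=web01", 42.5)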

If you found this post helpful, please share, like, and leave a comment!
