Interpreting FusionAuth's Prometheus metrics
-
Hello everyone,
I'm setting up a Grafana dashboard for our FusionAuth instances and I've run into some confusion about how to interpret several key metrics, particularly those exported as Dropwizard Histograms/Prometheus Summaries. I would be grateful if anyone who has successfully instrumented and analyzed these metrics could share their insights, especially concerning the units and the meaning of the quantile alignment.
One of the confusing metric families is Database_primary_pool_*. One example scrape showed
Database_primary_pool_MaxConnections: 20.0
but
Database_primary_pool_Usage{quantile="0.999"}: 45.0
and at some point I saw a peak of 1300.
Can I assume that in this case the quantile metric is the "time", specifically in milliseconds, consumed in the primary pool connection?
Is there any documentation on how to interpret all metrics exposed by /api/prometheus/metrics/?
I found this page https://fusionauth.io/docs/operate/monitor/prometheus but it contains no reference or documentation for the metrics themselves.
Thank you in advance for any hint
_
Fabio -
@fabio-venturi I am not familiar with Prometheus, but I asked the AI on the FusionAuth site and it came back with:
Database_primary_pool_Usage is a Prometheus metric exposed by FusionAuth which reports how much of the primary database connection pool is currently in use. It lets you see whether your HikariCP pool is close to exhaustion and is useful for capacity and health monitoring. [Monitor Prometheus]
In the Prometheus UI you can graph it by entering Database_primary_pool_Usage in the expression box and executing the query. [Monitor Prometheus]
It said it based the answer on the page you found, but I don't know enough to say for certain. Does this make sense to you?
-
Hello @mark-robustelli,
thank you very much for taking the time to look into this.
I've also asked the LLM and got a similar answer:
as you suggested, Database_primary_pool_Usage indicates "how much of the primary database connection pool is currently in use",
but this cannot be the number of connections, because the reported values are often well above Database_primary_pool_MaxConnections, which is set to 20. A reported value of 45 or 100, for example, makes no sense as a connection count.
For this reason I suspect that the reported quantile is the time consumed by the application in milliseconds, but this is just my best guess and I could be completely off.
Also, the metric Database_primary_pool_Usage_count seems to be a monotonic counter. Is this the size of the sample used to calculate the percentiles? If it is never reset as long as the application is running, how large is the sample pool used to generate the quantiles?
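To make the question about _count versus the quantile sample concrete, here is a toy sketch of a histogram that keeps a lifetime counter next to a fixed-size sample. It uses plain uniform reservoir sampling (Algorithm R), which is not necessarily what Dropwizard does; as far as I know its default reservoir is exponentially decaying and biased toward recent values. The class name `BoundedHistogram` and the size of 1028 are assumptions for illustration only.

```python
import random


class BoundedHistogram:
    """Toy histogram: a lifetime count plus a fixed-size uniform sample.

    ASSUMPTION: this is Algorithm R reservoir sampling, used only to show
    the idea. It is not Dropwizard's actual reservoir implementation.
    """

    def __init__(self, size=1028, seed=0):
        # size=1028 is an assumed default, chosen for illustration.
        self.size = size
        self.count = 0          # grows forever, like a *_count series
        self.sample = []        # never holds more than `size` values
        self._rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.sample) < self.size:
            self.sample.append(value)
            return
        # Keep each of the `count` values seen so far with equal probability.
        j = self._rng.randrange(self.count)
        if j < self.size:
            self.sample[j] = value

    def quantile(self, q):
        if not self.sample:
            raise ValueError("no values recorded yet")
        ordered = sorted(self.sample)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[index]


hist = BoundedHistogram(size=100)
for millis in range(10_000):
    hist.update(millis)

print(hist.count, len(hist.sample))
```

If the real implementation behaves along these lines, a monotonic _count would be compatible with quantiles computed from a much smaller, bounded set of observations.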
Sadly, the page mentioned only helps you set up Prometheus scraping; it's not a real reference for the exported metrics.
It would be very helpful to have a reference page for these metrics.
The Prometheus HELP text that comes along with the metrics does not state the unit either; it is only a generic summary:
# HELP Database_primary_pool_Usage Generated from Dropwizard metric import (metric=Database-primary.pool.Usage, type=com.codahale.metrics.Histogram)
# TYPE Database_primary_pool_Usage summary
Same problem with other similar metrics:
# HELP Database_primary_pool_Wait Generated from Dropwizard metric import (metric=Database-primary.pool.Wait, type=com.codahale.metrics.Timer)
# TYPE Database_primary_pool_Wait summary
Database_primary_pool_Wait{quantile="0.5",} 3.7480000000000004E-6
Database_primary_pool_Wait{quantile="0.75",} 0.001360131
Database_primary_pool_Wait{quantile="0.95",} 0.0014736360000000002
Database_primary_pool_Wait{quantile="0.98",} 0.001568443
Database_primary_pool_Wait{quantile="0.99",} 0.0016282640000000001
Database_primary_pool_Wait{quantile="0.999",} 0.0023040590000000002
Database_primary_pool_Wait_count 1419.0
Sorry for this rant, but it seems hard to get meaningful metrics into Grafana.
My kindest regards
_
Fabio -
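One thing the HELP lines quoted above do reveal is the underlying Dropwizard type (Histogram for Usage, Timer for Wait), and that may be enough to guess the unit. Below is a minimal Python sketch that extracts the type from the HELP text and attaches a unit guess to each quantile. The guesses are assumptions, not documented FusionAuth behavior: that the Prometheus Java client's Dropwizard bridge is in use and reports Timer values in seconds, and that HikariCP records pool.Usage as the milliseconds a connection was borrowed. The Usage sample line is adapted from the scrape in the first post, and the names `dropwizard_types`, `annotate`, and `TO_MILLIS_ASSUMED` are made up for this example.

```python
import re

# Lines adapted from the scrapes quoted in this thread.
SCRAPE = """\
# HELP Database_primary_pool_Usage Generated from Dropwizard metric import (metric=Database-primary.pool.Usage, type=com.codahale.metrics.Histogram)
# TYPE Database_primary_pool_Usage summary
Database_primary_pool_Usage{quantile="0.999",} 45.0
# HELP Database_primary_pool_Wait Generated from Dropwizard metric import (metric=Database-primary.pool.Wait, type=com.codahale.metrics.Timer)
# TYPE Database_primary_pool_Wait summary
Database_primary_pool_Wait{quantile="0.5",} 3.7480000000000004E-6
Database_primary_pool_Wait{quantile="0.999",} 0.0023040590000000002
Database_primary_pool_Wait_count 1419.0
"""

HELP_RE = re.compile(r"^# HELP (\S+) .*type=com\.codahale\.metrics\.(\w+)\)$")
SAMPLE_RE = re.compile(r'^(\w+)\{quantile="([^"]+)",?\}\s+(\S+)$')

# ASSUMPTION: factor that turns the exported value into milliseconds.
# Timer: assumed to be exported in seconds. Histogram: assumed to hold
# milliseconds already (a guess that applies to pool.Usage only).
TO_MILLIS_ASSUMED = {"Timer": 1000.0, "Histogram": 1.0}


def dropwizard_types(text):
    """Map each metric name to the Dropwizard type named in its HELP line."""
    types = {}
    for line in text.splitlines():
        match = HELP_RE.match(line)
        if match:
            types[match.group(1)] = match.group(2)
    return types


def annotate(text):
    """Return one row per quantile sample with a hedged millisecond value."""
    types = dropwizard_types(text)
    rows = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # skips comments and the *_count series
        name, value = match.group(1), float(match.group(3))
        kind = types.get(name, "unknown")
        scale = TO_MILLIS_ASSUMED.get(kind)
        rows.append({
            "metric": name,
            "quantile": float(match.group(2)),
            "value": value,
            "dropwizard_type": kind,
            "millis_if_assumption_holds": value * scale if scale else None,
        })
    return rows


rows = annotate(SCRAPE)
for row in rows:
    print(row)
```

Under those assumptions the 0.999 quantile of Wait would be roughly 2.3 ms and the 0.999 quantile of Usage 45 ms, which at least reads more plausibly than 45 connections in a pool capped at 20. Checking the HikariCP and Dropwizard exporter sources would be needed to confirm the units.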
@dalamenona I see your point about Database_primary_pool_MaxConnections being set to 20 and the value for usage being reported above that. Browsing around the web, I came across something that said Database_primary_pool_Usage is measured over the lifetime of the application, but I can't seem to find the source now. You also make a valid point about the other metric definitions. It may make sense to do a deeper dive into the HikariCP sources in general. There may be some answers there.
Anyone here familiar with these numbers?
It may also make sense for you to open an issue with FusionAuth as it is not clear to me if these numbers are coming from FusionAuth or HikariCP.