From 7c8c48cdd4d48105c28a7db850fac302fe019cd7 Mon Sep 17 00:00:00 2001
From: Tom Teichler <tom.teichler@teckids.org>
Date: Sun, 9 Jan 2022 16:49:35 +0100
Subject: [PATCH] Update monitoring docs

---
 docs/admin/05_monitoring.rst | 131 +++++++++++++++++++++++++++--------
 1 file changed, 104 insertions(+), 27 deletions(-)

diff --git a/docs/admin/05_monitoring.rst b/docs/admin/05_monitoring.rst
index 1b80a43be..5c0cf5fe0 100644
--- a/docs/admin/05_monitoring.rst
+++ b/docs/admin/05_monitoring.rst
@@ -1,39 +1,116 @@
-Monitoring
-##########
+.. _sec:Monitoring:
 
-Prometheus
-**********
+Monitoring and health checks
+============================
 
-AlekSIS provides a metric endpoint at `/metrics`, so you can scrape metrics in
-your Prometheus instance.
+Configuration
+-------------
 
-Available metrics
-=================
+Thresholds
+~~~~~~~~~~
 
-The exporter provides metrics about responses and requests, e.g.  statistics
-about response codes, request latency and requests per view.  It also
-provides data about database operations.
+Thresholds for health checks can be configured in the configuration file
+(``/etc/aleksis``).
 
-Prometheus config to get metrics
-================================
+.. code:: toml
 
-To get metrics of your AlekSIS instance, just add the following to your
-`prometheus.yml`::
+   [health]
+   disk_usage_max_percent = 90
+   memory_min_mb = 500
 
-  - job_name: aleksis
-    static_configs:
-      - targets: ['my.aleksis-instance.com']
-    metrics_path: /metrics
+   [backup.database]
+   check_seconds = 7200
 
+   [backup.media]
+   check_seconds = 7200
 
-Grafana
-*******
+Status page
+-----------
 
-Visualise metrics with Grafana
-==============================
+The AlekSIS status page shows information about the health of your AlekSIS
+instance. You can visit it via the left navigation bar (Admin → Status).
 
-If you want to visualise your AlekSIS metrics with Grafana, you can use one
-of the public available Grafana dashboards, for example the following one,
-or just write your own.
+The page shows information about debug and maintenance mode, a summary of
+your health checks and the last exit status of your Celery tasks. This
+page cannot be used as a health check because it always returns HTTP 200
+if the site is reachable.
 
-https://grafana.com/grafana/dashboards/9528
+Health check
+------------
+
+The health check can be used to verify the health of your AlekSIS
+instance. You can access it in a browser
+(https://aleksis.example.com/health/), where it shows a summary of
+your health checks. If something is wrong, it returns HTTP 500.
+
+It is also possible to get a JSON response from the health check, for
+example via ``curl``. To do so, send an
+``Accept: application/json`` header with your request.
+
+The health check can also be executed via ``aleksis-admin``:
+
+.. code:: shell
+
+   $ aleksis-admin health_check
+
+Monitoring with Icinga2
+-----------------------
+
+As already mentioned, there is a JSON endpoint at
+https://aleksis.example.com/health/. You can use a JSON check plugin to
+check individual health checks, or just use an HTTP check to verify that the
+site returns HTTP 200.
+
+Performance monitoring with Prometheus
+--------------------------------------
+
+AlekSIS provides a Prometheus exporter. The exporter provides metrics
+about responses and requests, e.g. statistics about response codes, request
+latency and requests per view. It also provides data about database
+operations.
+
+Metrics endpoint
+~~~~~~~~~~~~~~~~
+
+The metrics endpoint can be found at
+https://aleksis.example.com/metrics. In the default configuration, it can
+be scraped from anywhere. You might want to add some web server
+configuration to restrict access to this URL.
+
+To get metrics of your AlekSIS instance, add the following to the
+``scrape_configs`` section of your ``prometheus.yml``:
+
+.. code:: yaml
+
+     - job_name: aleksis
+       static_configs:
+         - targets: ['my.aleksis-instance.com']
+       metrics_path: /metrics
+
+Rules for Prometheus Alertmanager
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are using the Prometheus Alertmanager, you can create alerting
+rules so that an alert fires when, for example, your AlekSIS instance
+responds slowly.
+
+.. code:: yaml
+
+   groups:
+   - name: aleksis
+     rules:
+     - alert: HighRequestLatency
+       expr: histogram_quantile(0.999, sum(rate(django_http_requests_latency_seconds_by_view_method_bucket{instance="YOUR-INSTANCE",view!~"prometheus-django-metrics|healthcheck"}[15m])) by (job, le)) > 30
+       for: 15m
+       labels:
+         severity: page
+       annotations:
+         summary: High request latency for 15 minutes
+
+Grafana dashboard
+~~~~~~~~~~~~~~~~~
+
+There is a Grafana dashboard available to visualize the metrics.
+
+The dashboard is available at
+https://grafana.com/grafana/dashboards/9528.
-- 
GitLab