aboutsummaryrefslogtreecommitdiff
path: root/content/en/docs/Tools/metrics.md
diff options
context:
space:
mode:
authorDimitri Staessens <[email protected]>2021-07-25 12:42:59 +0200
committerDimitri Staessens <[email protected]>2021-07-25 12:42:59 +0200
commite5fff042c0af949afe5904d4138641841b504cb8 (patch)
treea1ba4ee692f524cb13a6068cd943ca8b265785bf /content/en/docs/Tools/metrics.md
parentb22970134f538af6ec3483fbb3c910d1ccbecc22 (diff)
downloadwebsite-e5fff042c0af949afe5904d4138641841b504cb8.tar.gz
website-e5fff042c0af949afe5904d4138641841b504cb8.zip
content: Add metrics exporter description
Diffstat (limited to 'content/en/docs/Tools/metrics.md')
-rw-r--r--content/en/docs/Tools/metrics.md285
1 files changed, 285 insertions, 0 deletions
diff --git a/content/en/docs/Tools/metrics.md b/content/en/docs/Tools/metrics.md
index 05f8d7e..65c9abb 100644
--- a/content/en/docs/Tools/metrics.md
+++ b/content/en/docs/Tools/metrics.md
@@ -6,3 +6,288 @@ draft: false
description: >
Realtime monitoring using a time-series database
---
+
+## Ouroboros metrics
+
+A collection of observability tools for exporting and
+visualising metrics collected from Ouroboros.
+
+Currently has one very simple exporter for InfluxDB, and provides
+additional visualization via grafana.
+
+More features will be added over time.
+
+### Requirements:
+
+Ouroboros version >= 0.18.3
+
+InfluxDB OSS 2.0, https://docs.influxdata.com/influxdb/v2.0/
+
+python influxdb-client, install via
+
+```
+pip install 'influxdb-client[ciso]'
+```
+
+### Optional requirements:
+
+Grafana, https://grafana.com/
+
+### Setup
+
+Install and run InfluxDB and create a bucket in influxDB for exporting
+Ouroboros metrics, and a token for writing to that bucket. Consult the
+InfluxDB documentation on how to do this,
+https://docs.influxdata.com/influxdb/v2.0/get-started/#set-up-influxdb.
+
+To use grafana, install and run grafana open source,
+https://grafana.com/grafana/download
+https://grafana.com/docs/grafana/latest/?pg=graf-resources&plcmt=get-started
+
+Go to the grafana UI (usually http://localhost:3000) and set up
+InfluxDB as your datasource:
+Go to Configuration -> Datasources -> Add datasource and select InfluxDB
+Set "flux" as the Query Language, and
+under "InfluxDB Details" set your Organization as in InfluxDB and set
+the copy/paste the token for the bucket to the Token field.
+
+To add the Ouroboros dashboard,
+select Dashboards -> Manage -> Import
+
+and then either upload the json file from this repository in
+
+dashboards-grafana/general.json
+
+or copy the contents of that file to the "Import via panel json"
+textbox and click "Load".
+
+### Run the exporter:
+
+Clone this repository and go to the pyExporter directory.
+
+Edit the config.ini.example file and fill out the InfluxDB
+information (token, org). Save it as config.ini.
+
+and run oexport.py
+
+```
+cd exporters-influxdb/pyExporter/
+python oexport.py
+```
+
+## Overview of Grafana general dashboard for Ouroboros
+
+The grafana dashboard allows you to explore various aspects of
+Ouroboros running on your local or remote systems. As the prototype
+matures, more and more metrics will become available.
+
+### Variables
+
+At the top, you can set a number of variables to restrict what is seen
+on the dashboard:
+
+{{<figure width="30%" src="/docs/Tools/grafana-variables.png">}}
+
+* System allows you to specify a set of host/node/devices in the network:
+
+{{<figure width="30%" src="/docs/Tools/grafana-variables-system.png">}}
+
+The list will contain all hosts that put metrics in the InfluxDB
+database in the last 5 days (Unfortunaly there seems to be no current
+option to restrict this to the current selected time range).
+
+* Type allows you to select metrics for a certain IPCP type
+
+{{<figure width="30%" src="/docs/Tools/grafana-variables-type.png">}}
+
+As you can see, all Ouroboros IPCP types are there, with unclusion of
+an UNKNOWN type. This may briefly pop up when the metric is misread by
+the exporter.
+
+* Layer allows you to restrict the metrics to a certain layer
+
+* IPCP allows to restrict metrics to a certain IPCP
+
+* Interval allows to select a window in which metrics are aggregated.
+
+{{<figure width="30%" src="/docs/Tools/grafana-variables-interval.png">}}
+
+Metrics will be aggregated from the actual exporter values (e.g. mean
+or last value) that fall in this interval. This interval should thus
+be larger than the exporter interval to ensure that each window has
+enough raw data.
+
+### Panels
+
+As you can see in the image above, the dashboard is subdivided in a
+bunch of panels, each of which focuses on some aspect of the
+prototype.
+
+#### System
+
+{{<figure width="80%" src="/docs/Tools/grafana-system.png">}}
+
+The system panel shows the number of IPCPs and known IPCP flows in all
+monitored systems as a stacked series. This system is running a small
+test with 3 IPCPs (2 unicast IPCPs and a local IPCP) with a single
+flow between oping server/client(which has one endpoint in each IPCP,
+so it shows 2 because this small test runs on a single host). The
+colors on the graphs are sometimes not matching the labels, which is a
+grafana issue that I hope will get fixed soon.
+
+#### Link State Database
+
+{{<figure width="80%" src="/docs/Tools/grafana-lsdb.png">}}
+
+The Link State Database panel shows the knowledge each IPCP has about
+the network routing area(s) it is in. The example has 2 IPCPs that are
+directly connected, so each knows 1 neighbor (the other IPCP), 2
+nodes, and two links (each unidirectional arc in the topology graph is
+counted).
+
+#### Process N-1 flows
+
+{{<figure width="80%" src="/docs/Tools/grafana-frcp.png">}}
+
+This is the first panel that deals with the [Flow-and-Retransmission
+Control
+Protocol](/docs/concepts/protocols##flow-and-retransmission-control-protocol-frcp)
+(FRCP). It shows metrics for the flows between the applications (this
+is not the same flow as the data transfer flow above, which is between
+the IPCPs). This panel shows metrics relating to retransmission. The
+first is the current retransmission timeout, i.e. the time after which
+a packet will be retranmitted. This is calculated from the smoothed
+round-trip time and its estimated deviation (well below 1ms), as
+estimated by FRCP.
+
+The flow is created by the oping application that is pinging at a 10ms
+interval with packet retransmission enabled (so basically a service
+equivalent as running ping over TCP). The main difference with TCP is
+that Ouroboros flows are between the applications themselves. The
+oping server immediately responds to the client, so the client sees a
+response time well below 1 ms[^1]. The server, however, sees the
+client sending a packet only every 10ms and its RTO is a bit over
+10ms. The ACKs from the perspective of the server are piggybacked on
+the client's next ping. (This is similar to TCP "delayed ACK", the
+timer in Ouroboros is set to 10ms, so if I would ping at 1 second
+intervals over a flow with FRCP enabled, the server would also see a
+10ms Round-trip time).
+
+#### Delta-t constants
+
+The second panel to do with FRCP are the Delta-t constants. Delta-t is
+the protocol on which FRCP is based. Right now, they are only
+configurable at compile time, but in the future they will probably be
+configurable using fccntl().
+
+{{<figure width="80%" src="/docs/Tools/grafana-frcp-constants.png">}}
+
+A quick refresher on these Delta-t timers:
+
+* **Maximum Packet Lifetime** (MPL) is the maximum time a packet can
+ live in the network, default is 1 minute.
+
+* **Retransmission timer** (R) is the maximum time which a
+ retransmission for a packet may be sent by the sender. The default
+ is 2 minutes. The first retransmission will happen after RTO,
+ then 2 * RTO, 4* RTO and so on with an exponential back-off, but
+ no packets will be sent after R has expired. If this happens, the
+ flow is considered failed / down.
+
+* **Acknowledgment timer** (A) is the maximum time which an packet may
+ be acknowledged by the receiver. Default is 10 seconds. So a
+ packet may be acknowledged immediately, or after 10 milliseconds,
+ or after 4 seconds, but not any more after 10 seconds.
+
+#### Delta-t window
+
+{{<figure width="80%" src="/docs/Tools/grafana-frcp-window.png">}}
+
+The third and (at least at this point) last panel related to FRCP is
+the window panel that shows information regarding Flow Control. FRCP
+flow control tracks the number of packets in flight. These are the
+packets that were sent by the sender, but have not been
+read/acknowledged yet by the receiver. Each packet is numbered
+sequentially starting from a random value. The default maximum window
+size is currently 256 packets.
+
+#### IPCP N+1 flows
+
+{{<figure width="80%" src="/docs/Tools/grafana-ipcp-np1.png">}}
+
+These graphs show basic statistics from the point of view of the IPCP
+that is serving the application flow. It shows upstream and downstream
+bandwidth and packet rates, and total sent and received packets/bytes.
+
+#### N+1 Flow Management
+
+{{<figure width="60%" src="/docs/Tools/grafana-ipcp-np1-fu.png">}}
+
+These 4 panels show the management traffic sent by the flow
+allocators. Currently this traffic is only related to congestion
+avoidance. The example here is taken from a jFed experiment during a
+period of congestion. The receiver IPCP monitors packets for
+congestion markers and it will send an update to the source IPCP to
+inform it to slow down. It shows the rate of flow updates for
+multi-bit Explicit Congestion Notification. As you can see, there is
+still an issue where the receiver is not receiving all the flow
+updates and there is a lot of jitter and burstiness at the receiver
+side for these (small) packets. I'm working on fixing this.
+
+#### Congestion Avoidance
+
+{{<figure width="80%" src="/docs/Tools/grafana-ipcp-np1-cc.png">}}
+
+This is a more detailed panel that shows the internals of the MB-ECN
+congestion avoidance algorithm.
+
+The left side shows the congestion window width, which is the
+timeframe over which the algorithm is averaging bandwidth. This scales
+with the packet rate, as there have to be enough packets in the window
+to make a reasonable measurement. Biggest change compared to TCP is
+that this window width is independent of RTT. The congestion
+algorithm then sets a target for the maximum number of bytes to send
+within this window (congestion window size). The division of the
+number of bytes that can be sent and the size of the windows yields
+the target bandwidth. The congestion was caused by a 100Mbit link, and
+the target set by the algorithm is quite near this value. The
+congestion level is a quantity that controls the rate at which the
+window scales down when there is congestion. This upstream/downstream
+view should be as close as possible to identical, the reason they are
+not is because of the jitter and loss in the flow updates as observed
+above. Work in progress.
+
+The graphs also show the number of packets and bytes in the current
+congestion window. The default target is set to min 8 and max 64
+packets within the congestion window before it scales up/down.
+
+And finally, the upstream packet counters shows the number of packets
+sent without receiving a congestion update from the receiver, and the
+downstream packet counter shows the number of packets received since
+the last time there was no congestion.
+
+#### Data transfer local components
+
+The last panel shows the (management) traffic sent and received by the
+IPCP internal as measured by the forwarding engine (Data transfer).
+
+{{<figure width="80%" src="/docs/Tools/grafana-ipcp-dt-dht.png">}}
+
+The components that are current shown on this panel are the DHT and
+the Flow Allocator. As you can see, the DHT didn't do much during this
+interval. That's because it is only needed for name-to-address
+resolution and it will only send/receive packets when an address is
+resolved or when it needs to refresh its state, which happens only
+once every 15 minutes or so.
+
+{{<figure width="80%" src="/docs/Tools/grafana-ipcp-dt-fa.png">}}
+
+The bottom part of the local components is dedicated to the flow
+allocator. During the monitoring period, only flow updates were sent,
+so this is the same data as shown in the flow management traffic, but
+from the viewpoint of the forwarding element in the IPCP, so it shows
+actual bandwidth in addition to the packet rates.
+
+[^1]: If this still seems high, disabling CPU "C-states" and tuning
+ the kernel for low latency can reduce this to a few
+ microseconds.