diff options
Diffstat (limited to 'content/en/blog/20201212-congestion-avoidance.md')
-rw-r--r-- | content/en/blog/20201212-congestion-avoidance.md | 358 |
1 files changed, 358 insertions, 0 deletions
diff --git a/content/en/blog/20201212-congestion-avoidance.md b/content/en/blog/20201212-congestion-avoidance.md new file mode 100644 index 0000000..0b010c5 --- /dev/null +++ b/content/en/blog/20201212-congestion-avoidance.md @@ -0,0 +1,358 @@ +--- +date: 2020-12-12 +title: "Congestion avoidance in Ouroboros" +linkTitle: "Congestion avoidance" +description: "API for congestion avoidance and the Ouroboros MB-ECN algorithm" +author: Dimitri Staessens +--- + +The upcoming 0.18 version of the prototype has a bunch of big +additions coming in, but the one that I'm most excited about is the +addition of congestion avoidance. Now that the implementation is +reaching its final shape, I just couldn't wait to share with the world +what it looks like, so here I'll talk a bit about how it works. + +# Congestion avoidance + +Congestion avoidance is a mechanism for a network to avoid situations +where the where the total traffic offered on a network element (link +or node) systemically exceeds its capacity to handle this traffic +(temporary overload due to traffic burstiness is not +congestion). While bursts can be handled with adding buffers to +network elements, the solution to congestion is to reduce the ingest +of traffic at the network endpoints that are sources for the traffic +over the congested element(s). + +I won't be going into too many details here, but there are two classes +of mechanisms to inform traffic sources of congestion. One is Explicit +Congestion Notification (ECN), where information is sent to the sender +that its traffic is traversing a congested element. This is a solution +that is, for instance, used by +[DataCenter TCP (DCTCP)](https://tools.ietf.org/html/rfc8257), +and is also supported by +[QUIC](https://www.ietf.org/archive/id/draft-ietf-quic-recovery-33.txt). +The other mechanism is implicit congestion detection, for instance by +inferring congestion from packet loss (most TCP flavors) or increases +in round-trip-time (TCP vegas). + +Once the sender is aware that its traffic is experiencing congestion, +it has to take action. A simple (proven) way is the AIMD algorithm +(Additive Increase, Multiplicative Decrease). When there is no sign of +congestion, senders will steadily increase the amount of traffic they +are sending (Additive Increase). When congestion is detected, they +will quickly back off (Multiplicative Decrease). Usually this is +augmented with a Slow Start (Multiplicative Increase) phase when the +senders begins to send, to reach the maximum bandwidth more +quickly. AIMD is used by TCP and QUIC (among others), and Ouroboros is +no different. It's been proven to work mathematically. + +Now that the main ingredients are known, we can get to the +preparation of the course. + +# Ouroboros congestion avoidance + +Congestion avoidance is in a very specific location in the Ouroboros +architecture: at the ingest point of the network; it is the +responsibility of the network, not the client application. In +OSI-layer terminology, we could say that in Ouroboros, it's in "Layer +3", not in "Layer 4". + +Congestion has to be dealt with for each individual traffic +source/destination pair. In TCP this is called a connection, in +Ouroboros we call it a _flow_. + +Ouroboros _flows_ are abstractions for the most basic of packet flows. +A flow is defined by two endpoints and all that a flow guarantees is +that there exist strings of bytes (packets) that, when offered at the +ingress endpoint, have a non-zero chance of emerging at the egress +endpoint. I say 'there exist' to allow, for instance, for maximum +packet lengths. If it helps, think of flow endpoints as an IP:UDP +address:port pair (but emphatically _NOT_ an IP:TCP address:port +pair). There is no protocol assumed for the packets that traverse the +flow. To the ingress and egress point, they are just a bunch of bytes. + +Now this has one major implication: We will need to add some +information to these packets to infer congestion indirectly or +explicitly. It should be obvious that explicit congestion notification +is the simplest solution here. The Ouroboros prototype (currently) +allows an octet for ECN. + +# Functional elements of the congestion API + +This section glances over the API in an informal way. A reference +manual for the actual C API will be added after 0.18 is in the master +branch of the prototype. The most important thing to keep in mind is +that the architecture dictates this API, not any particular algorithm +for congestion that we had in mind. In fact, to be perfectly honest, +up front I wasn't 100% sure that congestion avoidance was feasible +without adding additional fields fields to the DT protocol, such as a +packet counter, or sending some feedback for measuring the Round-Trip +Time (RTT). But as the algorithm below will show, it can be done. + +When flows are created, some state can be stored, which we call the +_congestion context_. For now it's not important to know what state is +stored in that context. If you're familiar with the inner workings of +TCP, think of it as a black-box generalization of the _tranmission +control block_. Both endpoints of a flow have such a congestion +context. + +At the sender side, the congestion context is updated for each packet +that is sent on the flow. Now, the only information that is known at +the ingress is 1) that there is a packet to be sent, and 2) the length +of this packet. The call at the ingress is thus: + +``` + update_context_at_sender <packet_length> +``` + +This function has to inform when it is allowed to actually send the +packet, for instance by blocking for a certain period. + +At the receiver flow endpoint, we have a bit more information, 1) that +a packet arrived, 2) the length of this packet, and 3) the value of +the ECN octet associated with this packet. The call at the egress is +thus: + +``` + update_context_at_receiver <packet_length, ecn> +``` + +Based on this information, receiver can decide if and when to update +the sender. We are a bit more flexible in what can be sent, at this +point, the prototype allows sending a packet (which we call +FLOW_UPDATE) with a 16-bit Explicit Congestion Experienced (ECE) field. + +This implies that the sender can get this information from the +receiver, so it knows 1) that such a packet arrived, and 2) the value +of the ECE field. + +``` + update_context_at_sender_ece <ece> +``` + +That is the API for the endpoints. In each Ouroboros IPCP (think +'router'), the value of the ECN field is updated. + +``` + update_ecn_in_router <ecn> +``` + +That's about as lean as as it gets. Now let's have a look at the +algorithm that I designed and +[implemented](https://ouroboros.rocks/cgit/ouroboros/tree/src/ipcpd/unicast/pol/ca-mb-ecn.c?h=be) +as part of the prototype. + +# The Ouroboros multi-bit Forward ECN (MB-ECN) algorithm + +The algorithm is based on the workings of DataCenter TCP +(DCTCP). Before I dig into the details, I will list the main +differences, without any judgement. + +* The rate for additive increase is the same _constant_ for all flows + (but could be made configurable for each network layer if + needed). This is achieved by having a window that is independent of + the Round-Trip Time (RTT). This may make it more fair, as congestion + avoidance in DCTCP (and in most -- if not all -- TCP variants), is + biased in favor of flows with smaller RTT[^1]. + +* Because it is operating at the _flow_ level, it estimates the + _actual_ bandwidth sent, including retransmissions, ACKs and what + not from protocols operating on the flow. DCTCP estimates bandwidth + based on which data offsets are acknowledged. + +* The algorithm uses 8 bits to indicate the queue depth in each + router, instead of a single bit (due to IP header restrictions) for + DCTCP. + +* MB-ECN sends a (small) out-of-band FLOW_UPDATE packet, DCTCP updates + in-band TCP ECN/ECE bits in acknowledgment (ACK) packets. Note that + DCTCP sends an immediate ACK with ECE set at the start of + congestion, and sends an immediate ACK with ECE not set at the end + of congestion. Otherwise, the ECE is set accordingly for any + "regular" ACKs. + +* The MB-ECN algorithm can be implemented without the need for + dividing numbers (apart from bit shifts). At least in the linux + kernel implementation, DCTCP has a division for estimating the + number of bytes that experienced congestion from the received acks + with ECE bits set. I'm not sure this can be avoided[^2]. + +Now, on to the MB-ECN algorithm. The values for some constants +presented here have only been quickly tested; a _lot_ more scientific +scrutiny is definitely needed here to make any statements about the +performance of this algorithm. I will just explain the operation, and +provide some very preliminary measurement results. + +First, like DCTCP, the routers mark the ECN field based on the +outgoing queue depth. The current minimum queue depth to trigger and +ECN is 16 packets (implemented as a bit shift of the queue size when +writing a packet). We perform a logical OR with the previous value of +the packet. If the width of the ECN field would be a single bit, this +operation would be identical to DCTCP. + +At the _receiver_ side, the context maintains two state variables. + +* The floating sum (ECE) of the value of the (8-bit) ECN field over the +last 2<sup>N</sup> packets is maintained (currently N=5, so 32 +packets). This is a value between 0 and 2<sup>8 + 5</sup> - 1. + +* The number of packets received during a period of congestion. This + is just for internal use. + +If th ECE value is 0, no actions are performed at the receiver. + +If this ECE value becomes higher than 0 (there is some indication of +start of congestion), an immediate FLOW_UPDATE is sent with this +value. If a packet arrives with ECN = 0, the ECE value is _halved_. + +For every _increase_ in the ECE value, an immediate update is sent. + +If the ECE value remains stable or decreases, an update is sent only +every M packets (currently, M = 8). This is what the counter is for. + +If the ECE value returns to 0 after a period of congestion, an +immediate FLOW_UPDATE with the value 0 is sent. + +At the _sender_ side, the context keeps track of the actual congestion +window. The sender keeps track of: + +* The current sender ECE value, which is updated when receiving a + FLOW_UPDATE. + +* A bool indicating Slow Start, which is set to false when a + FLOW_UPDATE arrives. + +* A sender_side packet counter. If this exceeds the value of N, the + ECE is reset to 0. This protects the sender from lost FLOW_UPDATES + that signal the end of congestion. + +* The window size multiplier W. For all flows, the window starts at a + predetermined size, 2<sup>W</sup> ns. Currently W = 24, starting at + about 16.8ms. The power of 2 allows us to perform operations on the + window boundaries using bit shift arithmetic. + +* The current window start time (a single integer), based on the + multiplier. + +* The number of packets sent in the current window. If this is below a + PKT_MIN threshold before the start of a window period, the new + window size is doubled. If this is above a PKT_MAX threshold before + the start of a new window period, the new window size is halved. The + thresholds are currently set to 8 and 64, scaling the window width + to average sending ~36 packets in a window. When the window scales, + the value for the allowed bytes to send in this window (see below) + scales accordingly to keep the sender bandwidth at the same + level. These values should be set with the value of N at the + receiver side in mind. + +* The number bytes sent in this window. This is updated when sending + each packet. + +* The number of allowed bytes in this window. This is calculated at + the start of a new window: doubled at Slow Start, multiplied by a + factor based on sender ECE when there is congestion, and increased + by a fixed (scaled) value when there is no congestion outside of + Slow Start. Currently, the scaled value is 64KiB per 16.8ms. + +There is one caveat: what if no FLOW_UPDATE packets arrive at all? +DCTCP (being TCP) will timeout at the Retransmission TimeOut (RTO) +value (since its ECE information comes from ACK packets), but this +algorithm has no such mechanism at this point. The answer is that we +currently do not monitor flow liveness from the flow allocator, but a +Keepalive or Bidirectional Forwarding Detection (BFD)-like mechanism +for flows should be added for QoS maintenance, and can serve to +timeout the flow and reset it (meaning a full reset of the +context). + +# MB-ECN in action + +From version 0.18 onwards[^3], the state of the flow -- including its +congestion context -- can be monitored from the flow allocator +statics: + +```bash +$ cat /tmp/ouroboros/unicast.1/flow-allocator/66 +Flow established at: 2020-12-12 09:54:27 +Remote address: 99388 +Local endpoint ID: 2111124142794211394 +Remote endpoint ID: 4329936627666255938 +Sent (packets): 1605719 +Sent (bytes): 1605719000 +Send failed (packets): 0 +Send failed (bytes): 0 +Received (packets): 0 +Received (bytes): 0 +Receive failed (packets): 0 +Receive failed (bytes): 0 +Congestion avoidance algorithm: Multi-bit ECN +Upstream congestion level: 0 +Upstream packet counter: 0 +Downstream congestion level: 48 +Downstream packet counter: 0 +Congestion window size (ns): 65536 +Packets in this window: 7 +Bytes in this window: 7000 +Max bytes in this window: 51349 +Current congestion regime: Multiplicative dec +``` + +I ran a quick test using the ocbr tool (modified to show stats every +100ms) on a jFed testbed using 3 Linux servers (2 clients and a +server) in star configuration with a 'router' (a 4th Linux server) in +the center. The clients are connected to the 'router' over Gigabit +Ethernet, the link between the 'router' and server is capped to 100Mb +using ethtool[^4]. + +Output from the ocbr tool: + +``` +Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.8946 pps, 79.0472 Mbps +Flow 64: 1001 packets ( 1001000 bytes)in 101 ms => 9904.6149 pps, 79.2369 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9882.8697 pps, 79.0630 Mbps +Flow 64: 998 packets ( 998000 bytes)in 101 ms => 9880.0143 pps, 79.0401 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9887.6627 pps, 79.1013 Mbps +Flow 64: 999 packets ( 999000 bytes)in 101 ms => 9891.0891 pps, 79.1287 Mbps +New flow. +Flow 64: 868 packets ( 868000 bytes)in 102 ms => 8490.6583 pps, 67.9253 Mbps +Flow 65: 542 packets ( 542000 bytes)in 101 ms => 5356.5781 pps, 42.8526 Mbps +Flow 64: 540 packets ( 540000 bytes)in 101 ms => 5341.5105 pps, 42.7321 Mbps +Flow 65: 534 packets ( 534000 bytes)in 101 ms => 5285.6111 pps, 42.2849 Mbps +Flow 64: 575 packets ( 575000 bytes)in 101 ms => 5691.4915 pps, 45.5319 Mbps +Flow 65: 535 packets ( 535000 bytes)in 101 ms => 5291.0053 pps, 42.3280 Mbps +Flow 64: 561 packets ( 561000 bytes)in 101 ms => 5554.3455 pps, 44.4348 Mbps +Flow 65: 533 packets ( 533000 bytes)in 101 ms => 5272.0079 pps, 42.1761 Mbps +Flow 64: 569 packets ( 569000 bytes)in 101 ms => 5631.3216 pps, 45.0506 Mbps +``` + +With only one client running, the flow is congestion controlled to +about ~80Mb/s (indicating the queue limit at 16 packets may be a bit +too low a bar). When the second client starts sending, both flows go +quite quickly (at most 100ms) to a fair state of about 42 Mb/s. + +The IO graph from wireshark shows a reasonably stable profile (i.e. no +big oscillations because of AIMD), when switching the flows on the +clients on and off which is on par with DCTCP and not unexpected +keeping in mind the similarities between the algorithms: + +{{<figure width="60%" src="/blog/20201212-congestion.png">}} + +The periodic "gaps" were not seen at the ocbr endpoint applicationand +may have been due to tcpdump not capturing everything that those +points, or possibly a bug somewhere. + +As said, a lot more work is needed analyzing this algorithm in terms +of performance and stability[^5]. But I am feeling some excitement about its +simplicity and -- dare I say it? -- elegance. + +Stay curious! + +Dimitri + +[^1]: Additive Increase increases the window size with 1 MSS each + RTT. Slow Start doubles the window size each RTT. + +[^2]: I'm pretty sure the kernel developers would if they could. +[^3]: Or the current "be" branch for the less patient. +[^4]: Using Linux traffic control (```tc```) to limit traffic adds + kernel queues and may interfere with MB-ECN. +[^5]: And the prototype implementation as a whole! |