author Thijs Paelman <[email protected]> 2022-05-20 16:57:08 +0200
committer Thijs Paelman <[email protected]> 2022-05-20 16:57:08 +0200
commit dfe6c175407975475bd61b67b3cd1da7d119f749
tree 0641aab108bd075327d4f01bd433113c7ccb973b
parent 76e1178e781d451fe3a5a53834e05b77a073c8dd
blog: Add post about flow clean-up and FLM
Signed-off-by: Thijs Paelman <[email protected]>
-rw-r--r-- content/en/blog/20220520-oping-flm.md 104
1 files changed, 104 insertions, 0 deletions
diff --git a/content/en/blog/20220520-oping-flm.md b/content/en/blog/20220520-oping-flm.md
new file mode 100644
index 0000000..eb9e25a
--- /dev/null
+++ b/content/en/blog/20220520-oping-flm.md
@@ -0,0 +1,104 @@
+---
+date: 2022-05-20
+title: "What is there to learn from oping about flow liveness monitoring?"
+linkTitle: "cleaning up flows"
+author: Thijs Paelman
+---
+
+### Cleaning up flows
+
+While I was browsing through some oping code
+(trying to get a feel for how to do [broadcast](https://ouroboros.rocks/blog/2021/04/02/how-does-ouroboros-do-anycast-and-multicast/#broadcast)),
+I stumbled upon the [cleaner thread](https://ouroboros.rocks/cgit/ouroboros/tree/src/tools/oping/oping_server.c?id=bec8f9ac7d6ebefbce6bd4c882c0f9616f561f1c#n54).
+As we can see, it was used to clean up 'stale' flows (sanitized):
+
+```C
+void * cleaner_thread(void * o)
+{
+        int deadline_ms = 10000;
+
+        while (true) {
+                for (/* all active flows i */) {
+
+                        diff = /* diff in ms between last valid ping packet and now */;
+
+                        if (diff > deadline_ms) {
+                                printf("Flow %d timed out.\n", i);
+                                flow_dealloc(i);
+                        }
+                }
+                sleep(1);
+        }
+}
+```
+
+But since version 19.x we have flow liveness monitoring (FLM), which does this for us!
+So all this code could be thrown away, right?
+
+Turns out I was semi-wrong!
+It's all about semantics, or 'what do you want to achieve'.
+
+If this thread was there to clean up flows whose peers had gone away (and thus stopped sending keep-alives),
+then by all means we can throw it away, because FLM does that job.
+
+Or was it there to clean up valid flows whose peers no longer send any ping packets (they *do* still send keep-alives, otherwise FLM kicks in)?
+Then we should of course keep it, because cutting those peers off is a server-side decision.
+This might, for example, protect against client implementations that connect, send a few pings, and then leave the flow open.
+A better illustration of the 'cleaner' thread might be to cut off peers after 100 pings,
+showing that this decision to 'clean up' has nothing to do with flow timeouts.
+
+### Keeping timed-out flows
+
+On the other side of the spectrum, we have flows that are timing out (no more keep-alives are coming in).
+This is my proposal for the server-side parsing of messages:
+
+```C
+while (/* get next fd on which an event happened */) {
+        msg_len = flow_read(fd, buf, OPING_BUF_SIZE);
+        if (msg_len < 0) {
+                /* if-statement is the only difference with before */
+                if (msg_len == -EFLOWPEER) {
+                        fset_del(server.flows, fd);
+                        flow_dealloc(fd);
+                }
+                continue;
+        }
+        /* continue with parsing and responding */
+}
+```
+
+Here we can see that the decision is taken to 'clean up' (= `flow_dealloc`) those flows that are timing out.
+But, as we can see, it's an application decision!
+We might as well decide to keep the flow open for another 10 min, e.g. to see if the client (or the network in between) recovers from an interruption.
+
+### Asymmetrical QoS
+There will probably be some more [discussion](https://ouroboros.rocks/community/)
+on asymmetric QoS[^1], but here we're only talking about asymmetric FLM.
+
+When I was working on the server code, I tried to set an FLM of 10 sec on the server side,
+such that it would time out (and [remove](#keeping-timed-out-flows))
+flows from unresponsive clients.
+I then discovered that `flow_accept(qosspec_t * qs, const struct timespec * timeo)` only writes to `qs`;
+it does not read the server's QoS expectations from it.
+Thus, at the moment, only the client sets the QoS (if the layer can provide it).
+This will change, such that the server sends its QoS to the client too.
+
+When we're talking about FLM, this might result in the server saying something like:
+> I don't trust you, I'm gone if I don't hear from you in 4 minutes, and if you want to wait 2 days for me, that suits me just fine
+
+where the server sets an FLM of 4 min and the client sets an FLM of 2 days.
+In this scenario, the client has to send a keep-alive every minute[^2] to keep the server interested ([unless it cleans up anyway](#cleaning-up-flows)).
+The server, on the other hand, only needs to send a keep-alive every 12 hours to keep the flow open (assuming no other traffic at all).
+
+There might be cases where you want to sync this timeout (like taking the smallest value),
+but it still needs to be determined whether this should be done at the application level or in Ouroboros and, if so, how.
+But for a 'raw' flow, we go with the 'none-of-your-business, just send in time' principle.
+
+Excited for my first blog post & always learning,
+
+Thijs
+
+[^1]: like for wireless links, with a potentially different BER or loss in each direction,
+    or even IPsec, where you can have encryption in one direction, but not the other
+
+[^2]: fixed at 1/4 of the time-out period at the moment, see [previous post](https://ouroboros.rocks/blog/2022/02/28/application-level-flow-liveness-monitoring/)
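
The keep-alive intervals quoted in the asymmetric FLM example of the post (one minute for the client, 12 hours for the server) follow from footnote 2: a keep-alive goes out every 1/4 of the time-out that the *other* side has set. A minimal sketch of that arithmetic, assuming the fixed 1/4 ratio from the footnote; `KEEPALIVE_DIVISOR`, `keepalive_interval_s` and `print_example` are hypothetical names used for illustration and are not part of the Ouroboros API:

```C
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption from footnote 2: keep-alives are sent at 1/4 of the time-out. */
#define KEEPALIVE_DIVISOR 4

/*
 * Hypothetical helper: given the FLM time-out (in seconds) that the peer
 * has set, return how often we have to send a keep-alive to stay alive
 * in the peer's eyes.
 */
int64_t keepalive_interval_s(int64_t peer_timeout_s)
{
        assert(peer_timeout_s > 0);

        return peer_timeout_s / KEEPALIVE_DIVISOR;
}

/* Prints the two intervals for the 4 min / 2 days example. */
void print_example(void)
{
        int64_t server_flm_s = 4 * 60;           /* server: 4 minutes */
        int64_t client_flm_s = 2 * 24 * 60 * 60; /* client: 2 days    */

        /* The client has to satisfy the server's time-out ... */
        printf("client sends every %lld s\n",
               (long long) keepalive_interval_s(server_flm_s));
        /* ... and the server has to satisfy the client's time-out. */
        printf("server sends every %lld s\n",
               (long long) keepalive_interval_s(client_flm_s));
}
```

Calling `print_example()` from a `main` of your own prints both intervals in seconds. The point of the sketch is only that each direction is computed independently from the other side's time-out, which is what makes the FLM asymmetric.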