Delay auto-tuning v2 #2337

Open
stachuman wants to merge 16 commits into meshcore-dev:dev from stachuman:delay-tuning-v2

Conversation

@stachuman

@stachuman stachuman commented Apr 19, 2026

This PR improves ACK delivery and the number of ACKs received. In theory, if a DM is delivered, at least one ACK should reach the sender (this is a statement about probability, not a guarantee).

This is based on extensive simulation: over 100k runs on a dedicated simulator built for the purpose (which can hopefully be reused for other work). The details and the road to this PR can be traced in discussion #2053.

The proposal is based on the theoretical work of KPrivitt and on my simulations (all source data used for the simulations is available on my GitHub page).

--

There are some differences compared to the previous PR:

  1. The variable is renamed to auto.tune.delays (for consistency):
    set auto.tune.delays on/off
  2. Parameter recalculation is wired into onAdvertRecv, so there is no periodic recalculation; it runs automatically when an advert is received.
  3. auto.tune.delays is off by default.
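
A minimal sketch of the event-driven recalculation described in the list above, written in Python for readability rather than in the firmware's own language. The `Repeater` class, its field names, the millisecond values, and the linear formula are all hypothetical placeholders; only the on/off switch defaulting to off and the idea of recalculating from the advert handler come from the PR description.

```python
from dataclasses import dataclass, field


@dataclass
class Repeater:
    """Hypothetical stand-in for a repeater; not the firmware's real types."""
    auto_tune_delays: bool = False      # mirrors `auto.tune.delays`, off by default
    tx_delay_ms: int = 500              # placeholder manual setting, not a real default
    direct_tx_delay_ms: int = 200       # placeholder manual setting, not a real default
    neighbors: set = field(default_factory=set)
    recalc_count: int = 0

    def on_advert_recv(self, sender_id: str) -> None:
        """Record the advert's sender and, if tuning is on, recalculate right away."""
        self.neighbors.add(sender_id)
        if self.auto_tune_delays:
            self.recalc_delays()

    def recalc_delays(self) -> None:
        """Placeholder formula: delays grow linearly with the neighbor count."""
        n = len(self.neighbors)
        self.tx_delay_ms = 500 + 100 * n
        self.direct_tx_delay_ms = 200 + 50 * n
        self.recalc_count += 1


node = Repeater()
node.on_advert_recv("A")            # tuning off: neighbor recorded, delays untouched
node.auto_tune_delays = True        # equivalent of `set auto.tune.delays on`
node.on_advert_recv("B")            # tuning on: recalculated from 2 neighbors
print(node.tx_delay_ms, node.direct_tx_delay_ms, node.recalc_count)
```

With no timer involved, the cost of recalculation is paid only when an advert actually arrives.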

Measured performance at defaults (same 4 topologies, 6 random seeds each):

| Density | Msg Delivery | Channel Delivery | ACK Delivery | Avg ACK Copies |
|---|---|---|---|---|
| sparse | 67.0% | 57.5% | 36.0% | 0.60 |
| medium | 62.7% | 78.8% | 33.7% | 0.70 |
| dense | 65.0% | 64.5% | 24.3% | 0.50 |
| very_dense | 64.3% | 43.0% | 13.0% | 0.20 |
| Mean | 64.8% | 60.9% | 26.8% | 0.50 |

Proposed changes (auto-tuned tx / direct tx delays):

| Density | Msg Delivery Δ | ACK Delivery Δ | ACK Copies Δ |
|---|---|---|---|
| sparse | −2.7pp | +11.0pp | +0.80 |
| medium | −6.3pp | +1.6pp | +1.00 |
| dense | −6.0pp | +13.7pp | +1.30 |
| very_dense | −9.3pp | +17.0pp | +1.40 |
| Mean | −6.1pp | +10.9pp | +1.13 |
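
To read the two tables together, the deltas can be added to the baseline figures. The short Python sketch below does only that arithmetic on the numbers quoted above; the variable and function names are illustrative, and channel delivery is left out because no delta is reported for it.

```python
# Baseline at defaults: (msg delivery %, ACK delivery %, avg ACK copies)
baseline = {
    "sparse": (67.0, 36.0, 0.60),
    "medium": (62.7, 33.7, 0.70),
    "dense": (65.0, 24.3, 0.50),
    "very_dense": (64.3, 13.0, 0.20),
}

# Reported change with auto-tuning: (msg delivery pp, ACK delivery pp, ACK copies)
delta = {
    "sparse": (-2.7, 11.0, 0.80),
    "medium": (-6.3, 1.6, 1.00),
    "dense": (-6.0, 13.7, 1.30),
    "very_dense": (-9.3, 17.0, 1.40),
}


def tuned(density):
    """Baseline plus delta for one density, rounded to two decimals."""
    return tuple(round(b + d, 2) for b, d in zip(baseline[density], delta[density]))


for name in baseline:
    msg, ack, copies = tuned(name)
    print(f"{name:10s} msg={msg:.1f}% ack={ack:.1f}% copies={copies:.2f}")
```

For the very_dense topology, for example, this gives 55.0% message delivery, 30.0% ACK delivery and 1.60 ACK copies per message.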

liamcottle and others added 15 commits March 24, 2026 15:38
…or changes

- Add multi byte FAQ
- Reword amped radio output setting numbers
- Clarify repeater ID collision including distance, supercede meshcore-dev#1478
- Reference awesome meshcore for community projects. Supercede meshcore-dev#1893
Removed "see note" from RAK 4631 entry in FAQ.
Fixed an extra TOC jump link inserted by VSCode Markdown All in One VS Code extension.
fixed typos and refined multibyte sections.
add multibyte FAQ, reference awesome-meshcore community projects, minor changes
Update RAK 4631 entry in FAQ on new bootloader - removed "see note"
# Conflicts:
#	docs/faq.md
… - improving ACK delivery and number of ACK received - theoretically ensuring that if DM is delivered - at least one ACK is received by sender.
@KPrivitt
Contributor

KPrivitt commented Apr 19, 2026

In the prior PR the neighbor count was recalculated every 5 minutes. That is far too frequent and consumes compute resources that could be used elsewhere.

The SNR of a received Advert can vary by several dB from message to message (pings can change on every one sent), and this can affect the neighbor count for repeaters close to the 0 dB SNR threshold. However, the number of surrounding repeaters actually changes very slowly. I believe the count should be done daily, twice a week, or once a week.
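
The effect of the SNR threshold on the neighbor count can be illustrated with a small Python sketch. It assumes that a neighbor is counted when the most recent advert heard from it has an SNR at or above 0 dB; that rule, the function name, and the sample readings are assumptions made for illustration, not the PR's actual logic.

```python
def count_neighbors(adverts, snr_threshold_db=0.0):
    """Count distinct senders whose latest advert SNR meets the threshold.

    `adverts` is an iterable of (sender_id, snr_db) pairs in arrival order.
    """
    latest = {}
    for sender_id, snr_db in adverts:
        latest[sender_id] = snr_db      # a later reading replaces an earlier one
    return sum(1 for snr in latest.values() if snr >= snr_threshold_db)


heard = [("A", 5.0), ("B", 1.5)]
print(count_neighbors(heard))                     # both senders are above 0 dB
print(count_neighbors(heard + [("B", -1.0)]))     # B's newest reading dips below 0 dB
```

Sender B sits close to the threshold, so a swing of a couple of dB changes the count from 2 to 1 even though no repeater was added or removed.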

@1nerdherder

1nerdherder commented Apr 20, 2026

I've been watching this work across the two pull request conversations. This was a heavy lift, deserving of strong consideration amongst the devs. At a minimum, the existing defaults are not optimal. The power of the autotune algorithm approach is that it makes all repeaters "good neighbors" who will adapt their settings in harmony as the mesh evolves.

@stachuman
Author

> In the prior PR the neighbor count was recalculated every 5 minutes. That is far too frequent and consumes compute resources that could be used elsewhere.
>
> The SNR of a received Advert can vary by several dB from message to message (pings can change on every one sent), and this can affect the neighbor count for repeaters close to the 0 dB SNR threshold. However, the number of surrounding repeaters actually changes very slowly. I believe the count should be done daily, twice a week, or once a week.

Correct. Recalculating every couple of minutes is not a big burden, but for the sake of clean code I have moved it into the advert reception code.

@terminalvelocity23
Contributor

terminalvelocity23 commented Apr 21, 2026

Hello, I've tested your PR in our mesh, which is very dense and there's a lot of in-band noise. The algorithm has set the delays so high that the repeater effectively stopped functioning.
Also, it didn't return the delays to their original values after disabling.

[image]

@stachuman
Author

> Hello, I've tested your PR in our mesh, which is very dense and there's a lot of in-band noise. The algorithm has set the delays so high that the repeater effectively stopped functioning. Also, it didn't return the delays to their original values after disabling.

Can you please elaborate on 'stopped functioning'? Was it a single repeater with the changed firmware, or several forming a wider network? And what exactly do you mean by 'effectively stopped functioning'? The delays are there to limit the number of collisions, but transmission still happens.

On the last point: that is very valid, let me update that.
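
One way the restore-on-disable behaviour requested above could work is to snapshot the manual delays when auto-tuning is switched on and put them back when it is switched off. The Python sketch below is only an illustration of that idea; `DelaySettings` and its methods are hypothetical names, and the PR may handle this differently.

```python
class DelaySettings:
    """Hypothetical holder for delay settings with an auto-tune switch."""

    def __init__(self, tx_delay, direct_tx_delay):
        self.tx_delay = tx_delay
        self.direct_tx_delay = direct_tx_delay
        self.auto_tune = False
        self._saved = None              # manual values captured when tuning starts

    def set_auto_tune(self, enabled):
        if enabled and not self.auto_tune:
            # Remember the operator's manual values before tuning overwrites them.
            self._saved = (self.tx_delay, self.direct_tx_delay)
        elif not enabled and self.auto_tune and self._saved is not None:
            # Put the manual values back exactly as they were.
            self.tx_delay, self.direct_tx_delay = self._saved
            self._saved = None
        self.auto_tune = enabled

    def apply_tuned(self, tx_delay, direct_tx_delay):
        """Accept values from the tuner only while auto-tuning is enabled."""
        if self.auto_tune:
            self.tx_delay = tx_delay
            self.direct_tx_delay = direct_tx_delay


settings = DelaySettings(tx_delay=1.0, direct_tx_delay=0.5)   # placeholder manual values
settings.set_auto_tune(True)
settings.apply_tuned(12.8, 40.0)        # values of the size reported in this thread
settings.set_auto_tune(False)
print(settings.tx_delay, settings.direct_tx_delay)
```

After the final call the settings are back at the manual values that were in place before tuning was enabled.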

@terminalvelocity23
Contributor

> Can you please elaborate on 'stopped functioning'? Was it a single repeater with the changed firmware, or several forming a wider network? And what exactly do you mean by 'effectively stopped functioning'? The delays are there to limit the number of collisions, but transmission still happens.

It was one repeater, to test this feature. The delays were set so high that it effectively stopped relaying packets; everything but its admin interface was handled by other repeaters around it. It stopped showing up in outbound and inbound paths.

@stachuman
Author

stachuman commented Apr 21, 2026

> > Can you please elaborate on 'stopped functioning'? Was it a single repeater with the changed firmware, or several forming a wider network? And what exactly do you mean by 'effectively stopped functioning'? The delays are there to limit the number of collisions, but transmission still happens.
>
> It was one repeater, to test this feature. The delays were set so high that it effectively stopped relaying packets; everything but its admin interface was handled by other repeaters around it. It stopped showing up in outbound and inbound paths.

What you observed is not a bad thing; it is in fact the desired effect. The purpose of a mesh network is NOT to ensure that every repeater transmits, but to ensure the effectiveness of the overall network.
To be precise: in a dense network it is NOT recommended that all repeaters transmit within the same time window, as this only increases the probability of collision, i.e. of a failed transmission.

Without going into detail: suppose all the repeaters in the area carried the transmission while yours stayed silent. If all that other traffic then fails due to collisions, your repeater retransmits after a delay, giving the message another chance to be delivered. Conversely, if the other traffic is delivered, yours is not even needed (it effectively reduces the density of the network, which is a recommended thing).

And this is the purpose of the PR: to increase the overall probability of delivery (ACK), NOT to increase the number of transmissions of any single repeater. Effectiveness here does not come from how quickly a retransmission is done, but from the probability of avoiding collisions with other repeaters.

I hope my explanation is clear.

@stachuman
Author

stachuman commented Apr 21, 2026

Here are theoretical results for 2 scenarios: 1. we upgrade the busiest routers first, in an organized way; 2. we upgrade randomly chosen routers to the auto-delay function.
(0% means only default firmware is used; 30% means 30% of repeaters run in auto-delay optimization mode.)

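Degree Strategy (upgrade busiest nodes first)

| % Optimized | N nodes | Delivery | std | ACK | Channel | F_del | P_del | F_ack | P_ack | col/lost | ack/del |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | 0 | 61.7% | 7.1 | 22% | 62% | 58% | 48% | 15% | 44% | 32.8 | 0.4 |
| 10% | 14 | 62.0% | 3.3 | 23% | 73% | 59% | 43% | 17% | 38% | 42.3 | 0.5 |
| 30% | 43 | 57.7% | 3.4 | 25% | 79% | 44% | 36% | 18% | 36% | 22.8 | 0.7 |
| 50% | 72 | 55.0% | 9.9 | 30% | 75% | 34% | 32% | 29% | 34% | 32.1 | 0.8 |
| 75% | 107 | 52.3% | 3.9 | 30% | 84% | 25% | 24% | 29% | 31% | 31.8 | 1.1 |
| 100% | 143 | 50.7% | 6.5 | 27% | 72% | 21% | 22% | 28% | 26% | 43.3 | 1.3 |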

Random Strategy (uncoordinated rollout: random repeaters use auto-delay)

| % Optimized | N nodes | Delivery | std | ACK | Channel | F_del | P_del | F_ack | P_ack | col/lost | ack/del |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | 0 | 61.7% | 7.1 | 22% | 62% | 58% | 48% | 15% | 44% | 32.8 | 0.4 |
| 10% | 14 | 58.3% | 8.0 | 27% | 74% | 51% | 46% | 21% | 38% | 34.3 | 0.6 |
| 30% | 43 | 56.0% | 4.2 | 29% | 80% | 43% | 33% | 25% | 35% | 38.4 | 0.8 |
| 50% | 72 | 51.3% | 7.8 | 22% | 79% | 42% | 23% | 23% | 21% | 26.7 | 0.8 |
| 75% | 107 | 46.3% | 5.7 | 26% | 79% | 30% | 22% | 30% | 26% | 33.9 | 1.2 |
| 100% | 143 | 50.7% | 6.5 | 27% | 72% | 21% | 22% | 28% | 26% | 43.3 | 1.3 |

Radio Efficiency (collision-free RX ratio)

| % Optimized | Degree radio_eff | Degree ackpath_eff | Random radio_eff | Random ackpath_eff |
|---|---|---|---|---|
| 0% | 62.1% | 59.2% | 62.1% | 59.2% |
| 10% | 65.3% | 63.6% | 65.1% | 62.9% |
| 30% | 71.4% | 69.9% | 69.5% | 68.1% |
| 50% | 74.1% | 72.2% | 73.6% | 71.9% |
| 75% | 77.1% | 77.1% | 77.2% | 76.8% |
| 100% | 76.8% | 76.5% | 76.8% | 76.5% |

1. Mixed firmware outperforms full optimization on ACK delivery

The best ACK rate (30%) occurs at degree 50-75%, not at 100% (27%). This comes from scenario 1, i.e. upgrading the busiest repeaters first.
My interpretation:

  • Optimized nodes reduce collision pressure in dense clusters
  • Default nodes relay faster, creating alternative paths and timing diversity
  • The combination produces more successful ACK round-trips than either firmware alone

Note: at around 75% we reach 1 ACK delivered (see the ack/del column), which is another reason to see that as a sweet-spot setting.

2. Channel (broadcast) delivery peaks at degree 75%

Channel delivery reaches 84% at degree 75%, a +22pp improvement over the 0% baseline (62%) and +12pp over full optimization (72%).

3. Degree strategy is consistently better than random

| Metric (at 75%) | Degree | Random | Delta |
|---|---|---|---|
| Delivery | 52.3% | 46.3% | +6.0pp |
| ACK | 30% | 26% | +4pp |
| Channel | 84% | 79% | +5pp |
| Std deviation | 3.9% | 5.7% | more stable |

@1nerdherder

Sorry for the late comment:
The earlier analysis showed the existing defaults to be flawed and actually making things worse, so should we not be selecting a new “default” rather than reuse the known defective one?

@terminalvelocity23
Contributor

terminalvelocity23 commented Apr 22, 2026

I'd say it's still too aggressive. I've switched on auto-tuning on two of my repeaters, pointed to the north and south of the high-rise I'm in, and dropped tx power on the companion so nobody but them will hear it.
After a few hours the delays have settled at 12.8 for flood, and 38 and 40 for direct. I mean yeah, it makes for collision avoidance, but considering that the max message length is 150/2 minus whatever bytes your name requires if you use a non-English alphabet, conveying any complex thought requires a few messages in succession. And a delay of up to 40 seconds breaks the sequence.

@stachuman
Author

> I'd say it's still too aggressive. I've switched on auto-tuning on two of my repeaters, pointed to the north and south of the high-rise I'm in, and dropped tx power on the companion so nobody but them will hear it. After a few hours the delays have settled at 12.8 for flood, and 38 and 40 for direct. I mean yeah, it makes for collision avoidance, but considering that the max message length is 150/2 minus whatever bytes your name requires if you use a non-English alphabet, conveying any complex thought requires a few messages in succession. And a delay of up to 40 seconds breaks the sequence.

Well... everything is based on probability, not on feelings. However, I admit that a scenario where a sequence of messages is sent was not tested.
Would you like to propose a scenario?

@terminalvelocity23
Contributor

@stachuman Idk about a scenario, but capping the delays at maybe 20s max isn't a bad idea.
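
A cap like the one suggested here amounts to a clamp applied after the tuner has produced its value. The Python sketch below treats the figures reported earlier in the thread (12.8, 38 and 40) as seconds, as the comment above does; the 20-second limit is the suggestion from this comment, and the function and constant names are hypothetical.

```python
MAX_DELAY_S = 20.0   # cap suggested in this thread, not a value from the PR


def cap_delay(delay_s, max_delay_s=MAX_DELAY_S):
    """Clamp a computed delay into the range [0, max_delay_s]."""
    return max(0.0, min(delay_s, max_delay_s))


reported = {"flood": 12.8, "direct_a": 38.0, "direct_b": 40.0}
capped = {name: cap_delay(value) for name, value in reported.items()}
print(capped)
```

The flood delay is already below the cap and passes through unchanged, while both direct delays are reduced to 20 seconds.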

@KPrivitt
Contributor

KPrivitt commented Apr 25, 2026

stachuman,
Thank you for your tremendous effort. Your results validate my original supposition that as the density increases, the need for additional backoff increases, and that density (which is easy to measure using a neighbor count) is a valid metric for adjusting and setting the amount.

From a theoretical point of view this is easy to see: the probability of a collision drops off as the number of slots increases, however if the number of neighbors increases it counteracts that benefit. Essentially, if the number of neighbors doubles the number of slots needs to double.
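
The slot argument above can be made concrete with a deliberately simplified model: each contending repeater independently picks one of `slots` equally likely backoff slots, and a given repeater gets through only if none of the others picks the same slot. This is a textbook-style approximation chosen for illustration, not the model used by the simulator in this PR, and the function name is hypothetical.

```python
def p_no_collision(contenders, slots):
    """Probability that one repeater's slot is picked by none of the others.

    Assumes every contender chooses uniformly and independently among `slots`.
    """
    return (1.0 - 1.0 / slots) ** (contenders - 1)


for contenders, slots in [(5, 10), (10, 10), (10, 20)]:
    p = p_no_collision(contenders, slots)
    print(f"contenders={contenders:2d} slots={slots:2d} p_clear={p:.3f}")
```

Under this model, going from 5 to 10 contenders with 10 slots drops the clear-slot probability from about 0.656 to about 0.387, and doubling the slots to 20 brings it back to about 0.630, which matches the rule of thumb that doubling the neighbors calls for doubling the slots.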

This does lead to high delay values in a very dense mesh. But remember in my comments on "success" I did mention that as this increases it does introduce a delay and when that delay becomes humanly noticeable (40 seconds definitely exceeded that threshold) it would need to be capped. A "balanced" or throttled setting is needed.

One thing is clear: automatic tuning does improve mesh performance, and the current defaults are bad settings, in that they provide insufficient backoff. The mesh works better when collisions are reduced, and backoff reduces collisions.

The question and discussion at hand is what should the table entries be for each neighbor count?

The discussion also needs to be about what the "success criteria" should be: what metric should be used? Just one? A combination? How should balancing be done. This will likely be contentious and have many different and sometimes opposing views. But the discussion is GOOD, all views should be entertained. That usually will generate a better result. I hope the dev experts will participate.

So, given the data we have: can we start with a conservative table to enable automatic tuning and finalize the table later? Let's take the low-hanging fruit that is right in front of us.

One comment regarding some comments that this is a centralized approach. It is not, it is decentralized since each repeater can have its own setting. This can be disabled and any repeater owner can set the values they choose. It is not forcing any particular setting, it allows choice and optimization.

The immediate value here is getting the defaults changed to a better setting, so that the majority of repeaters are no longer left at the insufficient default values. We need to be good neighbors to each other: just one repeater setting better values will help its neighbors, but if the neighbors continue to stomp on you... well, it's best if we are all good neighbors. Plus, with the ability to adjust based on density, "the mesh" can adapt in the future.

Optimization of the optimizer can come later.

My opinion: For what it is worth..

Last item: do we know why rxdelay does not affect the simulation results? It should...

It can easily be added to the table as another column. But with what settings? Could it be added and left at all 0's (the current default) until we have a better table in the future and understand what is going on and what the "optimum" settings are? rxdelay is a secondary benefit, but it is a tool in our chest, so why not use it?
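
A table keyed by neighbor count, with rxdelay carried as an extra column that stays at 0 for now, could be looked up as in the Python sketch below. Every number in it is a placeholder invented for illustration (the thread has not agreed on any table entries), and the bucket boundaries, names, and column order are likewise assumptions.

```python
import bisect

# Inclusive upper bound of each neighbor-count bucket; counts above the last
# bound use the final row. ALL NUMBERS HERE ARE PLACEHOLDERS, not proposals.
BUCKET_UPPER_BOUNDS = [2, 5, 10]
DELAY_ROWS = [
    # (tx_delay, direct_tx_delay, rx_delay)
    (1.0, 0.5, 0.0),    # 0-2 neighbors
    (2.0, 1.0, 0.0),    # 3-5 neighbors
    (4.0, 2.0, 0.0),    # 6-10 neighbors
    (8.0, 4.0, 0.0),    # 11 or more neighbors
]


def delays_for(neighbor_count):
    """Return the (tx_delay, direct_tx_delay, rx_delay) row for a neighbor count."""
    index = bisect.bisect_left(BUCKET_UPPER_BOUNDS, neighbor_count)
    return DELAY_ROWS[index]


for count in (0, 3, 10, 25):
    print(count, delays_for(count))
```

Keeping rx_delay in the row from the start means a later table revision can fill it in without changing the lookup code.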

8 participants