Synchronization of devices happens if end devices respond to a large-scale external event. Some examples of synchronized events that we’ve experienced are:
- Hundreds of end devices that are connected to the same power source (could be in a train, ship, building) and the power is switched off and on again
- Hundreds of end devices that are connected to the same gateway, and the firmware of the gateway needs to be updated
- Hundreds of thousands of end devices that are connected to The Things Network, and we have a database failover
Many end devices respond to these events, but if they respond in the wrong way, things can go terribly wrong.
Let’s take an example device that starts in Join mode when it powers on, and reverts to Join mode after being disconnected from the network. There are 100s of such devices in a field, and one gateway that covers this field.
The power source for the devices is switched on, and the gateway immediately receives the noise of 100s of simultaneous Join requests. LoRa gateways can deal quite well with noise, but this is just too much, and the gateway can’t make any sense of it. No Join requests are decoded, so no Join requests are forwarded to the network and no Join requests are accepted.
Exactly, or approximately 10 seconds later (the devices either have pretty accurate clocks, or they’re all equally inaccurate), the gateway again receives the noise of 100s of simultaneous Join requests, and still can’t make anything of it. This continues every 10 seconds after that, and the entire site stays offline.
Not great.
This situation can be improved by using jitter. Instead of sending a Join request every 10 seconds, the devices send a Join request 10 seconds after the previous one, plus or minus a random duration of 0-20% of this 10 seconds. This jitter percentage needs to be truly random, because if your devices all use the same pseudorandom number generator, they will still be synchronized, as they will all pick the same “random” number.
With these improved devices, the Join requests will no longer all be sent at exactly the same time, and the gateway will have a better chance of decoding the Join requests.
Much better. Especially if also the initial Join request was sent after a random delay.
But what if you have another site with 1000s of these devices instead of your site with 100s of them? Then the 10 seconds between Join messages may not be enough. This is where backoff comes in. Instead of having a delay of 10s±20%, you increase the delay after each attempt, so you do the second attempt after 20s±20%, the third after 30s±20%, and you keep increasing the delay until you have, say, 1h±20% between Join requests.
An implementation like this prevents persistent failures of sites and the network as a whole and helps speed up recovery after outages.