Does a node ever need to rejoin after OTAA?

htdvisser · September 18, 2020, 2:22pm

The ReJoin procedure is mostly useful for roaming. When a device leaves the service area of its network operator, and enters the service area of another operator, the device can send a ReJoin request to “ask” the home network operator if it should start a roaming session with the other network operator. The home network operator can also send MAC commands to the device telling it to (periodically) send such ReJoin requests.

In practice, we don’t see this being used yet, and we also don’t have guidelines for it yet.

Since the Network Server can tell a device (through MAC commands) if, when and how often it should send ReJoin requests, I think that’s sufficient for most devices, so I wouldn’t implement custom behavior for it if I were developing devices, and only use regular Join requests.

htdvisser · September 18, 2020, 2:59pm

I think it may also be good to come back to the initial question asked in this thread, since we’ve learned quite a bit since 2016.

Here’s a quick summary of what Joins do:

Every Join request contains a unique DevNonce to keep the join procedure secure
There can be 64k (65536) different DevNonces for the same AppKey/NwkKey
Assuming that the root keys don’t change, a node can therefore send 64k Join requests in its lifetime
In LoRaWAN 1.0.x versions prior to 1.0.4, the DevNonce was random, and therefore the probability of picking an unused one decreased over time. In LoRaWAN 1.0.4 and 1.1, the DevNonce is a counter, which requires a bit of persistent memory on the device to keep track of the counter, but does not have increasing Join difficulty.
When a Join is accepted, a new Session is started
When the network receives the first message with the new Session, the old one is discarded
Every Uplink and Downlink message in a Session uses a unique Frame Counter (FCnt)
Frame Counters are 32 bits wide (allowing for 4G, 4.294.967.296, messages in a Session). Older versions of LoRaWAN used 16-bit Frame Counters (allowing for 64k, 65536, messages in a Session)
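
To make the "decreasing probability" of random DevNonces concrete, here is a small arithmetic sketch. It assumes the network remembers every DevNonce ever used for the root keys and that the device picks uniformly at random; the function names are made up for illustration.

```python
DEVNONCE_SPACE = 2 ** 16  # 65536 possible 16-bit DevNonce values


def collision_probability(used):
    """Chance that one uniformly random DevNonce has already been used."""
    if not 0 <= used < DEVNONCE_SPACE:
        raise ValueError("used must be between 0 and 65535")
    return used / DEVNONCE_SPACE


def expected_attempts(used):
    """Average number of random picks until an unused DevNonce comes up."""
    if not 0 <= used < DEVNONCE_SPACE:
        raise ValueError("used must be between 0 and 65535")
    return DEVNONCE_SPACE / (DEVNONCE_SPACE - used)


for used in (0, 1_000, 32_768, 60_000):
    print(used, collision_probability(used), expected_attempts(used))
```

With half of the values used up (32768), every second Join request is rejected on average. A counter-based DevNonce never hits this, as long as the counter survives resets.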

There are no real rules for when a device should transmit Join requests, but generally speaking we see devices that Join:

When the device doesn’t have a Session
- when it is activated for the first time
- when it loses its Session on reset
When the device thinks it has lost connection to the network
- after following the usual reconnection steps (TODO: link to guideline)
When the application tells it to
- always good to be able to send a downlink to the device to reset it
Periodically
- resetting a device once every week

Since most devices send Join requests when they reset, it is EXTREMELY important to avoid synchronization by always using backoff and jitter in the implementation of the Join mechanism of devices.

bluejedi · September 18, 2020, 3:07pm

Can you explain what exactly is meant by the above?

htdvisser · September 18, 2020, 4:19pm

Synchronization of devices happens if end devices respond to a large-scale external event. Some examples of synchronized events that we’ve experienced are:

Hundreds of end devices that are connected to the same power source (could be in a train, ship, building) and the power is switched off and on again
Hundreds of end devices that are connected to the same gateway, and the firmware of the gateway needs to be updated
Hundreds of thousands of end devices that are connected to The Things Network, and we have a database failover

Many end devices respond to these events, but if they respond in the wrong way, things can go terribly wrong.

Let’s take an example device that starts in Join mode when it powers on, and reverts to Join mode after being disconnected from the network. There are 100s of such devices in a field, and one gateway that covers this field.

The power source for the devices is switched on, and the gateway immediately receives the noise of 100s of simultaneous Join requests. LoRa gateways can deal quite well with noise, but this is just too much, and the gateway can’t make any sense of it. No Join requests are decoded, so no Join requests are forwarded to the network and no Join requests are accepted.

Exactly or approximately 10 seconds later (the devices either have pretty accurate clocks, or they’re all equally inaccurate), the gateway again receives the noise of 100s of simultaneous Join requests, and still can’t make any sense of it. This continues every 10 seconds after that, and the entire site stays offline.

Not great.

This situation can be improved by using jitter. Instead of sending a Join request every 10 seconds, the devices send a Join request 10 seconds after the previous one, plus or minus a random duration of 0-20% of this 10 seconds. This jitter percentage needs to be truly random, because if your devices all use the same pseudorandom number generator, they will still be synchronized, as they will all pick the same “random” number.

With these improved devices, the Join requests will no longer all be sent at exactly the same time, and the gateway will have a better chance of decoding the Join requests.

Much better. Especially if the initial Join request is also sent after a random delay.
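
A minimal sketch of the jitter described above: a 10 second interval, plus or minus a random 0-20%, and a random delay before the very first Join request. The function names and the 10 second cap on the initial delay are assumptions for illustration. `random.SystemRandom` draws from the operating system's entropy source, so two devices do not share a sequence; on a microcontroller you would substitute whatever entropy source the platform offers.

```python
import random

# OS-backed randomness, so identical devices do not produce identical "random" numbers.
_rng = random.SystemRandom()


def jittered_delay(base_s=10.0, jitter_fraction=0.2, rng=_rng):
    """Return base_s plus or minus a random share (0..jitter_fraction) of base_s."""
    offset = rng.uniform(-jitter_fraction, jitter_fraction) * base_s
    return base_s + offset


def initial_delay(max_s=10.0, rng=_rng):
    """Random wait before the first Join request after power-on (max_s is an assumed cap)."""
    return rng.uniform(0.0, max_s)


if __name__ == "__main__":
    print("first Join after", round(initial_delay(), 2), "s")
    print("retry after", round(jittered_delay(), 2), "s")
```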

But what if you have another site with 1000s of these devices instead of your site with 100s of them? Then the 10 seconds between Join messages may not be enough. This is where backoff comes in. Instead of having a delay of 10s±20%, you increase the delay after each attempt, so you do the second attempt after 20s±20%, the third after 30s±20%, and you keep increasing the delay until you have, say, 1h±20% between Join requests.

An implementation like this prevents persistent failures of sites and the network as a whole and helps speed up recovery after outages.
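
The backoff schedule from this example (20s, 30s, ... up to 1h, each ±20%) could be sketched as follows. This is only an illustration under the numbers used above; the function names, the linear 10 second step and the 1 hour cap are taken from the example or assumed, not from any official guideline.

```python
import random

_rng = random.SystemRandom()  # OS entropy, not a shared fixed seed


def nominal_delay(attempt, step_s=10.0, max_s=3600.0):
    """Delay before Join attempt number `attempt`: 20 s for the 2nd, 30 s for the 3rd, capped at max_s."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(attempt * step_s, max_s)


def backoff_delay(attempt, jitter_fraction=0.2, rng=_rng, **kwargs):
    """Nominal delay for this attempt, plus or minus up to jitter_fraction of it."""
    base = nominal_delay(attempt, **kwargs)
    return base * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))


if __name__ == "__main__":
    for attempt in (2, 3, 10, 500):
        print(attempt, nominal_delay(attempt), round(backoff_delay(attempt), 1))
```

Devices that need to come back quickly can pass a smaller `max_s`; devices that can wait can let it grow further, as long as the jitter stays in place.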

bluejedi · September 18, 2020, 5:44pm

Power off/on related synchronized events can also be caused by power-outages in geographic regions (e.g. districts, cities).

One usually has no control over other LoRaWAN applications in an area and (depending on the application) the RF signals usually reach a larger area than where they are needed.
For many locations there is no guarantee that there will not be many end devices in the area and the number of devices may change/increase over time. Therefore, in theory, each gateway and each end device is prone to such large-scale external events.

So the ‘backoff and jitter’ strategy should actually be implemented in each LoRaWAN end device that performs OTAA joins.

Does randomizing of the delays have any impact on how spreading factors are/should be changed during join retries?

Will a ‘jitter and backoff’ strategy cause unnecessary join delays for devices in areas with only limited number of devices?

A ‘backoff and jitter’ strategy will probably need to be implemented in LoRaWAN libraries like LMIC and LoRaMac-node, because retries of failed joins are automatically performed by those libraries as part of a join request.

kersing · September 18, 2020, 6:14pm

One way to keep a pseudorandom generator from producing the same results on all devices is to seed it with a unique number. The DevEUI (maybe combined with AppEUI) comes to mind.
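
A rough sketch of that idea, using Python's `random.Random` as a stand-in for the device's pseudorandom generator. The helper name and the example EUIs are made up for illustration.

```python
import random


def make_device_rng(dev_eui: bytes, app_eui: bytes = b""):
    """Build a pseudorandom generator seeded from the device's unique identifiers."""
    seed = int.from_bytes(dev_eui + app_eui, "big")
    return random.Random(seed)


# Two hypothetical devices with different DevEUIs
rng_a = make_device_rng(bytes.fromhex("0000000000000001"))
rng_b = make_device_rng(bytes.fromhex("0000000000000002"))

# Each device draws its own jitter of up to ±20%
print(rng_a.uniform(-0.2, 0.2), rng_b.uniform(-0.2, 0.2))
```

Note that a given device will still produce the same sequence after every reset. That is enough to desynchronize devices from each other, but it is not a substitute for real entropy.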

sjphilip · September 18, 2020, 7:29pm

This makes sense. Thank you!

descartes · September 18, 2020, 8:41pm

Ideally, a node knows what rate it was communicating at and can adjust its rejoin appropriately, lower SF trying more often.

descartes · September 18, 2020, 8:45pm

Not if you use a random ± 20% and 8 channels for lower SF’s. This would be a good candidate for stochastic modelling.

bluejedi · September 18, 2020, 8:47pm

This already happens in current LMIC (and probably LoRaMac-node) implementations but the intervals are predefined and of fixed length. AFAIK no randomization is applied.

Since current LMIC implementations already lengthen the retry intervals automatically, and also increase the spreading factor during retries, I was wondering whether it suffices to only randomize the length of the retry intervals.

I know at least one LoRaWAN library implementation that works the opposite way: it tries joining using the highest SF first and gradually decrements it if the join fails. IIRC it is still unclear whether the latter conforms to the LoRaWAN specifications or not.

So it actually depends on the implemented algorithm.
Practical guidance for implementing ‘jitter and backoff’ would therefore be useful.

descartes · September 19, 2020, 8:28am

Not for power loss, unless the user saves this info, which I sort of doubt. I know I haven’t.

bluejedi · September 19, 2020, 11:48am

I meant after a power cycle (reset) when the device has not stored the keys received from a previous join and does a fresh join after the restart.

htdvisser · September 29, 2020, 10:54am

Good question. I don’t think we’ve actually ever given “official” recommendations on using different spreading factors during the Join procedure. @benolayinka is currently working on a “best practices” document that includes some of the content from my earlier posts here. Maybe we can also write something about this.

It’s always a tradeoff. For the “jitter” part, no, because that’s just randomization. For backoff you’ll need to set a maximum delay. For some devices it’s perfectly acceptable to have the retry delay slowly increase to a maximum of 24 hours ±20%. For other devices you may want them to retry more frequently.

arjanvanb · September 29, 2020, 11:00am

In case it helps, and I assume you know, Semtech has an opinion too:

Note: It is important to vary the DR. If you always choose a low DR, join requests will take much more time on air. Join requests will also have a much higher chance of interfering with other join attempts as well as with regular message traffic from other devices. Conversely, if you always use a high DR and the device trying to join the network is far away from the LoRaWAN gateway or sitting in an RF-obstructed or null region, the gateway may not receive a device’s join request. Given these realities, randomly vary the DR and frequency to defend against low signals while balancing against on-air time for join requests.
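
A minimal sketch of the "randomly vary the DR and frequency" advice, assuming the EU868 default join channels (868.1, 868.3 and 868.5 MHz) and DR0 to DR5. Other regions use different channel plans, and the function name is made up for illustration.

```python
import random

# Assumption: EU868 default join channels and 125 kHz data rates DR0 (SF12) .. DR5 (SF7)
EU868_JOIN_FREQS_HZ = (868_100_000, 868_300_000, 868_500_000)
EU868_JOIN_DATA_RATES = range(0, 6)

_rng = random.SystemRandom()


def pick_join_parameters(rng=_rng):
    """Pick a random (frequency, data rate) pair for the next Join request."""
    freq_hz = rng.choice(EU868_JOIN_FREQS_HZ)
    data_rate = rng.choice(EU868_JOIN_DATA_RATES)
    return freq_hz, data_rate


if __name__ == "__main__":
    for _ in range(3):
        print(pick_join_parameters())
```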

cjhdev · September 29, 2020, 1:55pm

In LDL, OTAA rotates through all spreading factors from most efficient to least efficient. LDL will keep retrying until OTAA is successful or the application cancels the operation.

The JoinRequest transmit time is dithered by up to 30 seconds on each attempt. LDL will also gradually reduce the duty cycle so that it does not exceed 0.0001 over 24 hours as described in the specification.

The source of randomness used for dither depends on how LDL has been integrated. It can be pseudorandom (i.e. rand()) seeded by entropy gathered by the radio driver, which is made available to the application on startup.
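
To get a feel for what a duty cycle of 0.0001 over 24 hours means for Join requests, here is a back-of-the-envelope sketch. It is not LDL code. It assumes the commonly used LoRa time-on-air formula from the Semtech SX127x datasheets, 125 kHz bandwidth, coding rate 4/5, an 8-symbol preamble, explicit header, CRC on, low data rate optimization for SF11 and SF12, and a 23-byte Join-request frame.

```python
import math

JOIN_REQUEST_LEN = 23  # bytes: MHDR 1 + JoinEUI 8 + DevEUI 8 + DevNonce 2 + MIC 4
DAILY_BUDGET_S = 0.0001 * 24 * 3600  # 8.64 s of airtime per 24 hours


def lora_airtime_s(payload_len, sf, bw_hz=125_000, cr=1, preamble_syms=8):
    """Time on air in seconds (explicit header, CRC enabled)."""
    t_sym = (2 ** sf) / bw_hz
    de = 1 if sf >= 11 and bw_hz == 125_000 else 0  # low data rate optimization
    numerator = 8 * payload_len - 4 * sf + 28 + 16
    payload_syms = 8 + max(math.ceil(numerator / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble_syms + 4.25 + payload_syms) * t_sym


def max_joins_per_day(sf):
    """How many Join requests at this spreading factor fit in the daily budget."""
    return int(DAILY_BUDGET_S // lora_airtime_s(JOIN_REQUEST_LEN, sf))


for sf in range(7, 13):
    print(sf, round(lora_airtime_s(JOIN_REQUEST_LEN, sf), 3), max_joins_per_day(sf))
```

Under these assumptions a Join request at SF12 takes about 1.48 s, so only 5 of them fit in the 8.64 s budget, against 140 at SF7.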

cslorabox · September 29, 2020, 4:15pm

This can get “fun” when someone cheats building “less immediately needed” functionality during a port to a new platform. The system behaves oddly (maybe always starts with the same already used-up join nonce…) and on investigation it is found that:

https://imgs.xkcd.com/comics/random_number.png

matthiasdg · June 13, 2023, 5:13pm

I have a Dragino LT-22222-L device and I think it’s at the outer limits of the nearest gateway. I was under the impression, from this thread and also from ABP vs OTAA | The Things Stack for LoRaWAN, that once it joined successfully, it should not need to reconnect because of bad reception or something.
While connecting works, it’s not very smooth; I often end up in a join request / join accept loop (probably issues with downlink). Once connected, I always see new join requests within a few days at the most (no power outage or anything), which is pretty annoying. I switched to ABP, which keeps working, but I was wondering what could cause this and if there are better solutions?

kersing · June 13, 2023, 6:02pm

The newer LoRaWAN standards include a kind of keep alive handshake which makes the node rejoin if it doesn’t get any response from the network for an extended period.

What is your uplink period and what spreading factor is being used? Is your device within the fair use limits of TTN? (Average of 30 seconds of airtime a day)

matthiasdg · June 14, 2023, 12:13pm

I set the uplink period to slightly less than 2 hours (imagining there might be other devices sending at very regular intervals), because it uses SF12 (the default is 10 minutes for this node). Not sure if even half the messages get through, but I can live with that. Still, during such a join request / join accept loop, it did seem to send more frequently than my uplink period, so the keep alive seems counterproductive in case of a crappy connection.

descartes · June 14, 2023, 12:44pm

Um, you appear to be saying that you get about 1 every 2 hours but you are trying once every 10 minutes. If this is the case, then you are breaching the Fair Use Policy by a wide margin and very likely the local legal duty cycle.

LoRa Alliance members (like wot TTI is) are required to restrict routine SF11 & SF12 for the very reasons you are discovering - it’s just far too marginal for regular use & takes up seconds of air time.

It looks like you need to review your gateway antenna, perhaps the device antenna or add another gateway.
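
The Fair Use arithmetic behind this can be sketched quickly. The 30 seconds per day comes from the thread above; the 1.5 s airtime per uplink is an assumption (roughly what a small frame takes at SF12 on 125 kHz), and the function names are made up for illustration.

```python
FAIR_USE_AIRTIME_S_PER_DAY = 30.0  # TTN Fair Use figure quoted in this thread
SECONDS_PER_DAY = 24 * 3600


def daily_airtime_s(interval_s, airtime_per_uplink_s):
    """Total uplink airtime per day when sending once every interval_s seconds."""
    return (SECONDS_PER_DAY / interval_s) * airtime_per_uplink_s


def min_interval_s(airtime_per_uplink_s, budget_s=FAIR_USE_AIRTIME_S_PER_DAY):
    """Shortest uplink interval that stays within the daily airtime budget."""
    return SECONDS_PER_DAY * airtime_per_uplink_s / budget_s


ASSUMED_SF12_AIRTIME_S = 1.5  # assumption, not measured on the device in question

print(daily_airtime_s(10 * 60, ASSUMED_SF12_AIRTIME_S))    # every 10 minutes
print(daily_airtime_s(2 * 3600, ASSUMED_SF12_AIRTIME_S))   # every 2 hours
print(min_interval_s(ASSUMED_SF12_AIRTIME_S) / 60)         # minimum interval in minutes
```

With that assumed airtime, sending every 10 minutes uses about 216 s per day, sending every 2 hours uses about 18 s, and the shortest interval that stays inside 30 s per day is 72 minutes. Join requests and retries come on top of this.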