Join-accept message sent by gateway, but node says join failed

ianmercer · July 8, 2021, 3:40am

There are several threads here with similar issues and I’ve tried all the options proposed there. I have two gateways, a TTOG and a TTIG, both on V3. I have a node (SparkFun ESP32 single channel LoRa gateway running in [not-a-gateway] [not-single-channel] LoRaWAN device mode). Sometimes my node will run fine for hours, joining the closest gateway (OTAA) and sending data, but sometimes it just sits there trying to join and failing.

The gateway is showing Join Accept but the node shows it failing to join.

BUT when I unplug the nearest TTIG and then plug it back in the join works right away.

What could cause the node to fail to receive the join accept message that could also be cured by simply restarting the gateway?

wolfp · July 8, 2021, 5:32am

Does that mean that the node uses only one channel (frequency)?

ianmercer · July 8, 2021, 6:11am

Updated question to be clearer: it’s called a 1-channel gateway but you can run it in device mode instead. It behaves just like any other RFM95W LoRa node. I’m running it with a typical US915 config and it works just fine most of the time, except when it decides it cannot join.

descartes · July 8, 2021, 7:13am

Why does it join more than once?
Which MAC (software) are you using - name, source & version number?
How far away from the gateway(s) is this device?

kersing · July 8, 2021, 7:58am

It is a device that some people try to abuse for gateway functionality. It is not a gateway and there is no such thing as a 1 channel gateway in LoRaWAN. BTW, SparkFun calls it a LoRa gateway, not LoRaWAN…

Jeff-UK · July 8, 2021, 8:31am

When you unplug and replug the TTIG, is the join req/accept process handled entirely by the TTIG after it has restarted and is fully back online, or is the req/ack handled by the TTOG in whole or in part in the interim? If the TTOG is involved, it may be that you are too close to the TTIG and either it is overloaded by the node’s signal or vice versa, causing the handshake process to fail… move the TTIG a few meters further away from the node or behind an additional wall just in case…

ianmercer · July 8, 2021, 1:29pm

When I unplug and replug the TTIG the process is totally handled by the TTIG.

The TTIG is about 10m away through two walls. The TTOG is about 1km away but with poor line of sight where I’m testing so unlikely to get a connection. When the device is in its intended location it’s about 0.5km from each and uses the TTOG but I don’t have debug output when it’s in the field so I don’t get to see why it stops sending. And, to be clear, the stopping sending may be a crashing bug, but after reboot the failure to reconnect over OTAA is the issue.

I did have the TTIG closer initially but moved it at least two walls away per advice here and that appeared to cure the problem, but then it happened again, and now, in its ~10m distance position, simply unplugging and replugging it appears to cure the problem. So I don’t think it’s the “too-close-distance” issue.

It joins more than once because (i) it gets a failed to join response so it retries; or (ii) the ESP32 watchdog timer trips because any one component in the system is unresponsive (sensors, bluetooth, …). Once it has joined it carries on running indefinitely until it crashes or hangs, and then it reboots and tries again. And when it does that, sometimes it just fails to reconnect over and over with the gateway claiming it’s accepted it but the device thinking it hasn’t.

I didn’t choose the name.

descartes · July 8, 2021, 3:36pm

As we’ve established an ESP32 device is the right distance away and the reason it needs to join a lot is because it crashes a lot, answering this becomes even more important?

If you can do two answers, my supplementary question is what is the device & what sensors does it have on it?

ianmercer · July 9, 2021, 3:02am

It doesn’t need to “join a lot” but sometimes when it does try to join after a reboot it fails and sits in a retry loop. I’m developing s/w for it and trying it in the field (literally) on a battery, so it does get rebooted more than it will once it’s finished.

The library is:

mcci-catena/MCCI LoRaWAN LMIC library@^4.0.0

Also:
LMIC_LORAWAN_SPEC_VERSION=LMIC_LORAWAN_SPEC_VERSION_1_0_3

The ESP32 device is as specified in the question. It has no additional sensors on it. It is using Bluetooth LE and LoRaWAN only.

My question is why rebooting the gateway fixes the problem. What could explain that?

petertoome · July 9, 2021, 4:48am

After days struggling with similar issues: if the Join is accepted and you can see the Join Accept message going out on the Gateway, failure of the node to Join will come down to one of the following:

(1) Incorrect settings of the RX1 window, which has started to pop up with TTN V3 now using 5 sec as the default. But since OTAA nodes join with the default 1 sec setting, this issue will affect ABP nodes rather than OTAA nodes.
(2) Incorrect frequency & DR settings on the node. You should be able to check these in the Node settings.
(3) Most commonly, incorrect LoRaWAN MAC version selection. You need to check with your Node to find out what LoRaWAN version is being used (on the RisingHF modems you use AT+LW=VER, and other modem brands will have a similar command; if you are using LMIC, check the library). This last step is the one most commonly missed. Make sure you check all the advanced options offered for the device when you add a device manually.

kersing · July 9, 2021, 8:38am

For OTAA responses the RX1 window is 5 seconds. It has been that value from the start of TTN.
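
To make the timing concrete, here is a minimal sketch of when a Class A node should open its receive windows after a join request versus after a normal uplink. The delays are the LoRaWAN 1.0.x defaults (JOIN_ACCEPT_DELAY1 = 5 s, JOIN_ACCEPT_DELAY2 = 6 s, RECEIVE_DELAY1 = 1 s, with RX2 one second after RX1). The function name `rx_window_starts` and its parameters are illustrative assumptions, not part of LMIC or any other library.

```python
# LoRaWAN 1.0.x default delays, in seconds, measured from the end of the uplink.
JOIN_ACCEPT_DELAY1 = 5.0
JOIN_ACCEPT_DELAY2 = 6.0
RECEIVE_DELAY1 = 1.0


def rx_window_starts(uplink_end_s, is_join_request, rx1_delay_s=RECEIVE_DELAY1):
    """Return (rx1_start, rx2_start) in seconds for a Class A device.

    Hypothetical helper for illustration only. rx1_delay_s is the data-uplink
    RX1 delay, which the network may change from the default.
    """
    if is_join_request:
        # Join accepts always use the fixed join-accept delays.
        return (uplink_end_s + JOIN_ACCEPT_DELAY1,
                uplink_end_s + JOIN_ACCEPT_DELAY2)
    # For data uplinks, RX2 opens one second after RX1.
    return (uplink_end_s + rx1_delay_s, uplink_end_s + rx1_delay_s + 1.0)
```

For example, `rx_window_starts(0.0, True)` gives windows at 5 s and 6 s regardless of the RX1 delay configured for data traffic, which is why the RX1 delay setting matters for ABP nodes but should not affect an OTAA join.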

jimorg123 · August 27, 2021, 11:20am

I get the same issue with a TTIG and several end nodes. There is definitely something wrong with the connection keep-alive between the TTIG and TTN. Rebooting the TTIG resolves it, since the new connection pokes through the firewall and is healthy again, just not permanently. It’s very annoying, and TTIGs are of no use for any long-term applications at the moment.

gniezen · August 27, 2021, 1:23pm

I’m having the same issue, where everything works fine when the TTIG is rebooted, but as soon as you power cycle a node and have it attempt to reconnect, the node says Join failed, even though the TTIG says Accept join-request.

I’m using a LoRa-E5 module and have checked:

(1) 5s Join Accept RXWIN1 delay (even if regular RXWIN1 delay is 1s)
(2) freq and DR settings
(3) LoRaWAN MAC version

For (3), I originally thought that the MAC version was v1.0.2 (as that’s the default in the datasheet), but using AT+LW=VER I discovered that it was set to v1.0.3. Changing it to v1.0.3 in TTN didn’t help though, neither did switching to v1.0.2B on both sides.

Why does it work without any issues after a TTIG reboot, but then fails whenever the node tries to join again? Could it be some kind of packet/frame counter that gets reset on the TTIG when it’s rebooted? I wish I had another gateway nearby so I could verify if it’s the TTIG that’s at fault.

cslorabox · August 27, 2021, 3:06pm

Gateways are not supposed to have any state, but there used to be problems with the TTIG falling off WiFi and the reconnection taking too much time, causing it to not get the downlink / join accept messages back from the servers in time to transmit for the window the node expects.
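
As a rough illustration of that latency budget, the sketch below checks whether a join accept that reaches the gateway some time after the end of the join-request uplink can still be transmitted in RX1 (5 s) or RX2 (6 s). It is a simplification: in a real network the server picks the window when it schedules the downlink. The function name `join_accept_window` and the 0.1 s scheduling margin are assumptions made up for this example.

```python
# LoRaWAN 1.0.x join-accept delays, in seconds after the end of the uplink.
JOIN_ACCEPT_DELAY1 = 5.0
JOIN_ACCEPT_DELAY2 = 6.0


def join_accept_window(downlink_arrival_s, scheduling_margin_s=0.1):
    """Return "RX1", "RX2", or None if the join accept arrives too late.

    downlink_arrival_s: seconds after the end of the join-request uplink at
    which the gateway receives the join accept from the server.
    scheduling_margin_s: assumed time the gateway needs to queue the
    transmission (hypothetical value).
    """
    windows = (("RX1", JOIN_ACCEPT_DELAY1), ("RX2", JOIN_ACCEPT_DELAY2))
    for name, delay in windows:
        if downlink_arrival_s + scheduling_margin_s <= delay:
            return name
    return None  # Both windows missed; the node reports a failed join.
```

With a healthy backhaul the downlink arrives well inside the first window, for example `join_accept_window(0.8)`. If a stalled connection delays it past about 6 s, the function returns `None`: the gateway log can still show a Join Accept while the node never hears it.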

Nodes with buggy firmware that don’t properly open receive windows at the right time / settings are an issue, too.
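
On the node side, one common mitigation for imprecise timing is to open the receive window slightly early and keep it open longer, in proportion to the expected clock error. The MCCI LMIC library has a setting for this (`LMIC_setClockError`); the Python below is not that implementation, only a sketch of the arithmetic, and the function name and parameters are hypothetical.

```python
def widened_rx_window(delay_s, nominal_window_s, clock_error_fraction):
    """Return (open_at_s, duration_s) for a receive window widened for drift.

    delay_s: nominal delay from end of uplink to window start (5.0 for a
    join accept in RX1).
    nominal_window_s: how long the window would stay open with a perfect clock.
    clock_error_fraction: assumed worst-case clock error, e.g. 0.01 for 1%.
    """
    # Worst-case drift accumulated while waiting for the window.
    drift_s = delay_s * clock_error_fraction
    # Open early by the drift and extend by twice the drift, so the window
    # covers a clock that runs either fast or slow.
    return (delay_s - drift_s, nominal_window_s + 2.0 * drift_s)
```

For a 5 s join-accept delay and an assumed 1% clock error, the window opens about 50 ms early and stays open about 100 ms longer than nominal.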