Need help with MCCI LMIC and TTN / "JOIN_WAIT" issue

I need help with LMIC and TTN.

In my paxcounter project I ran into a problem that I can’t fix, and as a result the paxcounter software is currently unusable with TTN.

With the current version 3.0.99.x of MCCI LMIC, joins get stuck in a loop of join requests.

This is not a hardware or device dependent problem: the same test devices joined fine with previous MCCI LMIC versions. The bug appeared at the start of September 2019. I discussed this issue with the maintainer of MCCI LMIC; the result so far is that this doesn’t look like a bug in LMIC.

Sometimes joins are successful after a couple of minutes, when LMIC has climbed to higher SF levels.

The problem can be reproduced with the OTAA compliance test script found in the MCCI LMIC repository.

After a gateway reset the next joins seem to be immediately successful, but after a while the issue recurs. This could be a pointer to a duty cycle related problem.

Could some of you with LMIC experience please install this compliance script on an ESP32 board, and run it on top of the arduino-esp32 core?

My assumption is that this issue appears with the EU868 frequency scheme only, and maybe only with TTN, not with e.g. LoRa-Server.

I spent a lot of time analyzing this, but didn’t find any evidence of where to look. I need some help.

Thanks.

Output of OTAA compliance script:

main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select ‘Line Ending: Newline’ at the bottom of the monitor window.

calibration not supported
Packet queued
316928 (5070 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0
316943 (5071 ms): EV_JOINING
741091 (11857 ms): EV_TXSTART: ch=1 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=741141, avail=0
745010 (11920 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
1055259 (16884 ms): EV_RXSTART: freq=868.3 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=745006, avail=0, delta ms 4964
1059947 (16959 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
1119947 (17919 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=745006, avail=0, delta ms 5999
1134447 (18151 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
1134573 (18153 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80

EV_JOIN_TXCOMPLETE means the LMIC stack has completed sending a join request, but did not get (or did not finish processing) a join accept message in time, and thus could not join the network.
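To see when the node actually opened its receive windows, the tick values in the log above can be turned into delays relative to the end of the transmission. The following is a minimal sketch, not part of LMIC: the function name is made up for illustration, and the 16 µs tick length is an assumption inferred from the log itself (316928 ticks is printed as 5070 ms).

```python
import re

# Assumption: 16 microseconds per OS tick, inferred from the log
# (316928 ticks is printed as 5070 ms); not read from the LMIC configuration.
US_PER_TICK = 16

# Three lines copied from the compliance script output above.
LOG = """\
741091 (11857 ms): EV_TXSTART: ch=1 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=741141, avail=0
1055259 (16884 ms): EV_RXSTART: freq=868.3 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=745006, avail=0, delta ms 4964
1119947 (17919 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=745006, avail=0, delta ms 5999
"""

RXSTART = re.compile(r"^(\d+) \(\d+ ms\): EV_RXSTART: freq=([\d.]+) .*txend=(\d+)")


def rx_window_delays(log_text):
    """Return (freq_MHz, delay_ms) for every EV_RXSTART line, measured from txend."""
    delays = []
    for line in log_text.splitlines():
        m = RXSTART.match(line)
        if m:
            tick, freq, txend = int(m.group(1)), float(m.group(2)), int(m.group(3))
            delays.append((freq, (tick - txend) * US_PER_TICK / 1000))
    return delays


for freq, delay_ms in rx_window_delays(LOG):
    print(f"RX window on {freq} MHz opened {delay_ms:.1f} ms after end of TX")
```

The computed delays agree with the "delta ms" values LMIC prints, and they sit close to the LoRaWAN default join accept delays of 5 s (RX1) and 6 s (RX2). This only shows when the node started listening; it says nothing about when, or whether, the gateway transmitted.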

First, you need to try to figure out:

  • Is no join accept transmitted?

Or

  • Is the join accept transmitted but not received?

Or when it has walked, one number at a time, out of the range of previously used join nonces which will be ignored, and now for the first time uses one that has not previously been used.

This is why you need to clarify if a join response is or is not being generated, irrespective of if one is being received.
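The join nonce effect mentioned above can be illustrated with a toy model. This is a sketch under stated assumptions, not how TTN or LMIC is actually implemented: it assumes the server simply ignores any DevNonce it has seen before and that the device increments its nonce by one per attempt. The function name and the example numbers are hypothetical.

```python
def attempts_until_accepted(start_nonce, used_nonces):
    """Count join attempts until the device sends a DevNonce the server has not seen.

    Toy model: the server silently ignores every nonce in `used_nonces`,
    and the device walks through the 16-bit nonce space one step at a time.
    """
    nonce = start_nonce & 0xFFFF
    attempts = 1
    while nonce in used_nonces:
        if attempts >= 0x10000:
            raise RuntimeError("every 16-bit DevNonce has already been used")
        nonce = (nonce + 1) & 0xFFFF  # next attempt uses the next nonce
        attempts += 1
    return attempts, nonce


# Hypothetical example: nonces 100..109 were used in earlier sessions,
# and the device restarts its counter at 105.
print(attempts_until_accepted(105, set(range(100, 110))))
```

In this model the first five join requests get no answer at all, and the sixth is accepted, which from the node's side looks the same as a join that only succeeds "after a while".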

After a gateway reset the next joins seem to be immediately successful, but after a while the issue recurs. This could be a pointer to a duty cycle related problem.

For that you need to examine the gateway’s internal logs to determine if it actually transmitted.
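As a rough cross-check of the duty cycle hypothesis, here is a minimal sketch that estimates the time on air of a join accept and the off-time it costs the gateway. It uses the time-on-air formula from the Semtech SX127x datasheet. The 33-byte join accept (17 bytes plus a 16-byte CFList) and the 1 % and 10 % limits for the 868.0-868.6 MHz and 869.4-869.65 MHz sub-bands are assumptions about this setup, and the function names are illustrative only.

```python
import math


def time_on_air_ms(payload_bytes, sf, bw_hz=125_000, cr=1,
                   preamble_symbols=8, crc=False, implicit_header=False):
    """LoRa time on air in ms (Semtech SX127x datasheet formula).

    cr=1 means coding rate 4/5. Downlinks are sent without CRC,
    as the "NoCrc" in the log shows, hence crc=False by default.
    """
    t_sym_ms = (2 ** sf) / bw_hz * 1000
    # Low data rate optimisation applies to SF11 and SF12 at 125 kHz.
    de = 1 if (sf >= 11 and bw_hz == 125_000) else 0
    numerator = (8 * payload_bytes - 4 * sf + 28
                 + (16 if crc else 0) - (20 if implicit_header else 0))
    n_payload = 8 + max(math.ceil(numerator / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble_symbols + 4.25 + n_payload) * t_sym_ms


def min_off_time_ms(airtime_ms, duty_cycle):
    """Silence required after one transmission to stay within the duty cycle."""
    return airtime_ms * (1 / duty_cycle - 1)


JOIN_ACCEPT_BYTES = 33  # assumption: join accept including CFList

for sf, duty in ((7, 0.01), (12, 0.10)):
    toa = time_on_air_ms(JOIN_ACCEPT_BYTES, sf)
    print(f"SF{sf}: {toa:.1f} ms on air, {min_off_time_ms(toa, duty) / 1000:.1f} s off-time")
```

Keep in mind that a gateway's duty cycle budget is shared by all downlinks to all nodes it serves, so these figures are only a lower bound on how long a busy gateway may be unable to answer.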

Yes, I don’t see join accept messages on the gateway console, but what if TTN backend sent them to another gateway?

I already thought of a DevNonce problem, too. To verify this, I created a new device with fresh keys. Same problem.

Update: Looks like there are two stacked issues. During testing I accidentally changed an AppKey. This explains why the backend did not answer the join request: it was simply the wrong key.

But now I am back to the original issue. Join request and join accept are shown on the gateway. But the device with the MCCI LMIC OTAA compliance script does not join reliably.

The timing situation on the ESP32 Arduino platform looks nasty, and I’m not sure it’s truly resolved.
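To get a feeling for how tight the timing is, it helps to look at symbol and preamble durations per spreading factor. This is a small sketch; the 8-symbol preamble and 125 kHz bandwidth are the usual LoRaWAN EU868 values and are assumed here, and the function names are illustrative.

```python
def symbol_time_ms(sf, bw_hz=125_000):
    """Duration of one LoRa symbol in ms."""
    return (2 ** sf) / bw_hz * 1000


def preamble_time_ms(sf, preamble_symbols=8, bw_hz=125_000):
    """Duration of the programmed preamble in ms (sync symbols not included)."""
    return preamble_symbols * symbol_time_ms(sf, bw_hz)


for sf in range(7, 13):
    print(f"SF{sf}: symbol {symbol_time_ms(sf):7.3f} ms, "
          f"preamble {preamble_time_ms(sf):8.3f} ms")
```

At SF7 the whole preamble lasts only about 8 ms, while at SF12 it lasts about 262 ms. If the receive window is off by a few milliseconds, that could be enough to miss an SF7 downlink but not an SF12 one, which would be one possible explanation for joins succeeding only after LMIC has moved to higher spreading factors.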

If you control the gateway, I think the most useful thing to do is modify the node code to set a GPIO on entering the receive window, and then clear it on success or giving up. Put one scope probe on that and trigger off of it. Then put the other scope probe on the gateway’s transmit LED, and see how they match up. (If the LED voltages are suitable you can also use a cheap USB logic analyzer, which is good for recording longer sequences, such as a bunch of join attempts that don’t work concluding with one that does.)

If you can’t get at the gateway signal, you can use one GPIO to signal the end of transmit and another for the receive window and at least see how that timing compares, but it’s much better if you can use the gateway’s signal. The gateway LED might blip at other times for other nodes, but by triggering off your node’s transmission you will mostly see the ones belonging to you - in particular, something consistently just too early or just too late is likely meant for you, and missed.
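Once you have edge timestamps from the scope or logic analyzer, the comparison can be automated. The sketch below is purely illustrative: the function, its parameters and the example numbers are hypothetical, and all times are assumed to be in milliseconds relative to the same trigger (for example the end of the node's transmission).

```python
def classify_downlink(gw_tx_start_ms, rx_open_ms, rx_close_ms, preamble_ms):
    """Classify one gateway transmission relative to the node's receive window.

    "early":   the preamble had already ended when the node started listening
    "late":    the gateway started after the node had given up
    "overlap": the preamble and the receive window overlap at least partly
    """
    if gw_tx_start_ms + preamble_ms <= rx_open_ms:
        return "early"
    if gw_tx_start_ms >= rx_close_ms:
        return "late"
    return "overlap"


# Made-up measurements: (gateway TX start, RX window open, RX window close),
# classified against an 8.192 ms preamble (SF7, 8 symbols, 125 kHz).
captures = [(4990.0, 4995.0, 5010.0), (4980.0, 4995.0, 5010.0), (5012.0, 4995.0, 5010.0)]
for gw, rx_open, rx_close in captures:
    print(gw, classify_downlink(gw, rx_open, rx_close, preamble_ms=8.192))
```

An "overlap" is necessary but not sufficient, since the radio needs several preamble symbols to lock on. A run of consistently "early" or consistently "late" results is the pattern to look for.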

Debugging might also be easier with a private server set to always use one receive window or the other, which would temporarily remove the issue of TTN for EU868 having a custom RX2 but sometimes falling back to the standard one; i.e., sort things out in known circumstances first, before worrying about variable ones.

Thanks for this recipe. I will try to get a timing measurement reference setup.

Do you have any suggestions on how to improve the real-time behaviour of the arduino-esp32 core, i.e. by switching off / deactivating certain components which cause jitter?

Program the ESP32 natively without Arduino, or if you have to stay with Arduino switch to a platform like ESP8266, STM32, or SAMD where the Arduino core is more reasonable, debugged, and tested by people maintaining LMiC ports.

(That said, the last time I tried to use Arduino-ESP8266 for a “simple” project it was broken because of a breaking change between the Arduino IDE and the ESP8266 port… something that seems to happen regularly in that world and remove whatever advantage there might have been, at least if all the effort put into contributing Arduino-bound examples were put into native examples instead)

I would prefer native ESP32 IDF, but did not find a native ESP32 LMIC port yet.

It’s not really a big deal to de-Arduino-ify a current LMiC repo; the code itself wasn’t originally intended for Arduino at all. Mostly what you have to do is turn the silly debug print functions back into something normal (but there aren’t that many to begin with, since they are CPP methods and most of LMiC is C files; in some cases C files with illicit CPP code in them, but that’s actually excluded by the preprocessor unless you set some very odd debug defines).

ttn-esp32 is a port of LMIC to ESP-IDF. The lmic-early-v3 branch even integrates the latest MCCI LMIC development.

It has an entirely ESP-IDF specific hardware abstraction layer (HAL) and most likely no timing issues as the LMIC core runs in its own high-priority task and cannot be blocked by other code. Does this count as a “native ESP32 LMIC port”?

As the ESP32 Arduino Core runs on top of the ESP32 IDF, would it be possible to use ttn-esp32 with ESP32 Arduino Core to get a better LMIC implementation for Arduino on ESP32?

Some of the code could certainly be reused but it would need to be a separate library specifically built for the combination ESP32 / Arduino framework.

The ttn-esp32 does not fit well into the Arduino framework:

  • In ttn-esp32, join() and transmitMessage(...) are blocking calls. That’s a natural choice if you have an RTOS with multiple tasks. But it’s not what you’d want in an Arduino program. The code could possibly stop for minutes.

  • Scheduling operations such as os_setTimedCallback are probably required for Arduino apps but in ttn-esp32 they may not be used for user code as the jobs run on a task not suitable for user code.

  • Since the Arduino framework bypasses IDF for SPI communication and uses code that does not seem to be multi-tasking capable, the SPI bus could probably not be shared with other devices.

  • The LMIC data structure cannot be accessed from user code since the user code and the LMIC task run concurrently and are not synchronized.

So the resulting library:

  • would be reliable in the sense that the RX/TX timing cannot easily be disturbed by other code,
  • would not require calls to os_runloop_once() every so often,
  • would not restrict the amount of work performed in callback / event handler functions,
  • would require exclusive access to the SPI bus where the SX12xx chip is connected,
  • would restrict the use of os_setTimedCallback() and possibly further functions,
  • would disallow access to the LMIC data structure,
  • would provide alternative functions for the most common use cases for accessing and modifying values in the LMIC data structure,
  • would not provide any means for scheduling future tasks,
  • would only have limited compatibility with existing code built for the LMIC library.

Would such a library - despite all this - still be helpful?

Thanks for your thorough explanation.

I understand the implications you describe but had actually hoped for a simple ‘yes’. :blush:
But of course the reality is much less simple.

I have seen several posts lately about timing issues related to the combination of ESP32 and ‘Arduino LMIC’ (LMIC-Arduino or the MCCI LoRaWAN LMIC library). It would have been nice if ttn-esp32 could have been the solution for that, with the added benefit that event handler/callback functions would be much less restricted in the amount of work/time they may take.

I noticed your ttn-esp32 remarks that 26MHz crystals are used (instead of the default 40MHz) on several of the popular ESP32 LoRa boards and that the crystal frequency has to be set as parameter for the library. Where does the 40MHz default come from, from Espressif’s reference design(s)?

Where is the crystal frequency parameter used in your library (where does it play a role)?
I haven’t noticed something similar for Arduino LMIC. Could the crystal speed (difference) be related to those timing issues?

Based on the implications you described, I guess not.

In my paxcounter app LMIC also runs on top of arduino-esp32 in its own task with high priority. That’s not difficult, but nevertheless I’m struggling with downlink problems, which probably points to a timing issue. Maybe it’s something with the SPI bus; I haven’t looked at this aspect yet. But I never got any radio errors from the HAL, which I would expect from SPI communication issues.

It’s still totally unclear to me how and why I suddenly ran into that nasty JOIN_WAIT problem.

I also noticed this problem on my nodes. Lately the modules won’t join using OTAA, but after resetting the gateway the module joins without a problem. For join I set SF10, and after join I go back to SF7. I have an Ideetron gateway and a RAK outdoor gateway within 500 m, so received signal strength can’t be the problem. I also have a Things Gateway and a RAK831 gateway, so I’m going to test if they have the same problem.
Strange problem, as I have been using these gateways for some time now and never had the problem before.

Exactly the same situation here. So the question is, what triggered this? Changes in LMIC, changes in the backend of TTNv2, or maybe a network congestion effect due to the growth of the TTN community?