TTN GATEWAY central

Quite right, my gateway works fine for hours at a time, so this is perhaps a different problem than never being able to activate the gateway, run it stably for more than a few seconds.

I assumed my problem was also a reboot problem since when it stops working, invariable the second led is flashing rapidly indicating failure to connect. Why would it disconnect once connected except upon some kind of restart/reboot event?

"Why would it disconnect once connected except upon some kind of restart/reboot event?"
My sense is that it’s not that rare for client devices to at least briefly lose connection to an access point and a major aspect of the problem you are seeing, that one hopes will be cured via a firmware update, is that once the connection is lost it isn’t quite rapidly reestablished.

I have been running my MTCAP-915 gateway for 48 hours straight with no touble but this is on ethernet. I suppose to be fair, I need to swap it with the TTN gateway ;>

Yeah, I’m optimistic that your TTN device will be very stable on Ethernet… Mine are. And I’d guess the MTCAP device will do fine on the Wifi. Do let us know the results if you try it.

I’m located in the Netherlands and the device isn’t able to get activated. The status is still “not connected”27

Thanks for the info. If you haven’t yet had a chance to do much troubleshooting, here are some questions you might ask…
Is it stable in that condition? Or is it rebooting repeatedly? Is it connected to the Internet via Ethernet? What are the condition of the LEDs? One on and the second slowly blinking? There’s a helpful list of reasons associated with different patterns here:

Do you know its IP address? Can you browse to the IP address adding /info after the IP address? What does it say? Is there any kind of firewall that might be blocking UDP traffic?

Also one of my gateways has the same problem - stuck in a loop.

But two other gateways from an other pledge, which i configure at the same day, same time & same place and conditions, run without any problems .

2 Likes

OK, whatever caused the outage has been resolved and I have the TTN gateway working again now connected to ethernet. I will report its behavior over the next day or two to see if it has changed.

Are you able and willing to swap power supply and lora board between working and non working gw?

And make some detailed pictures of both circuit boards where chip ids are readable?

Peeking in the firmware code I finally understand the weird values in HTTP: Got 1232 bytes: the URLs printed in the logs are not the ones actually used, but ?filter=ttn is appended. And indeed, https://account.thethingsnetwork.org/api/v2/frequency-plans/EU_863_870?filter=ttn is much smaller than https://account.thethingsnetwork.org/api/v2/frequency-plans/EU_863_870

I’m also trying to enable debug logging and add much, much more logging, but then the UART output really starts losing many bytes, making it very hard to decipher my new logging. I tried to double DEBUG_PRINT_BUFFER_SIZE, but then the compiler fails to allocate space. To be continued…

The good news: I do see responses from the LoRa card, like LORA: recv_rpl: 0x23 0x35 0x1 0x0 0x0 0x0 0x0 0x59 0xd.

The bad news: the very first step in the configuration already fails:

status *= configureRXChain(0, appGWActivationData.configuration_sx1301.rfchain[0].enable,
                           appGWActivationData.configuration_sx1301.rfchain[0].freq);
3 Likes

do we already have the ‘programming’ i.e. the command set details for the LoraCard?

UPDATE: one of the good working gateways, now also in an booting loop. Sometimes it receive an sensor packages, but most of the send sensor packages are lost due to rebooting cyclus. For this gateway, i had switch on the automatic update in the console. I changes the power supply, but no change.

At the moment , the gateway without problems, i have switch-off the automatic updates in the console.

So, my situation, only 1 of the three gateways is usable.

Increasing the value of DRV_USART_BAUD_RATE_IDX0 to, say, 921600 helps making the logs a bit more readable, though still not perfect. These problems might also be caused by my UART-to-USB cable. And of course, the same setting must be used when capturing the output, like in LOG="ttn-gateway-date +’%Y%m%d-%H%M%S’.log"; pio serialports monitor --raw -b 921600 | while read l; do echo "[date +’%F %T’] $l" | tee -a $LOG; done, or when using a Raspberry Pi to capture the output.

This allows for changing the log level to SYS_ERROR_DEBUG in:

…and for enabling some disabled logging in the original code, and adding many more calls to SYS_DEBUG(SYS_ERROR_DEBUG, ...).

I added the following at the end of sendCommand in src/app_lora.c to know when things go wrong (not sure why the code uses status *= ... which will still try all next configuration steps if a previous one has already failed):

SYS_DEBUG(SYS_ERROR_WARNING, "LORA: sendCommand %s\r\n", gotresponse ? "OK" : "ERROR");

…yielding:

CNFG: Configuring LoRa module
LORA: Changing state from 2 to 4
LORA: Starting reconfiguration

LORA: send_cmd: 0x23 0x31 0x1 0x0 0x0 0x55 0xd 
LORA: recv_rpl: 0x23 0x31 0x1 0x0 0x0 0x55 0xd
LORA: sendCommand OK

LORA: send_cmd: 0x23 0x3a 0x1 0x0 0x0 0x5e 0xd 
LORA: recv_rpl: 0x23 0x3a 0x10 0x0 0x1 0x1 0x4c 0x47 0x38 0x35 0x30 0x31 0x36 0x30 0x31 0x37 0x38 0x32 0x4 0x1 0xd
LORA: sendCommand OK
LORA: version: 01

LORA: configureRXChain(0, ...)
RF: 0,1,867500000
LORA: send_cmd: 0x23 0x34 0x6 0x0 0x0 0x1 0xe0 0xff 0xb4 0x33 0x24 0xd 
LORA: recv_rpl: 0xd
LORA: sendCommand ERROR

LORA: configureRXChain(1, ...)
RF: 1,1,868500000
LORA: send_cmd: 0x23 0x34 0x6 0x0 0x1 0x1 0x20 0x42 0xc4 0x33 0xb8 0xd 
LORA: recv_rpl: 0x23 0x34 0x1 0x0 0x0 0x58 0xd
LORA: sendCommand OK

LORA: configureIFChainX(0, ...)

Above, the call to configureRXChain(1, ...) actually consistently succeeds. Or maybe the if(appData.rx_uart_buffer[1] == command) in the call to configureRXChain(1, ...) is seeing the response of the earlier call to configureRXChain(0, ...), as it somehow gets in too late? The if(TIMEOUT(4)) is not being hit. After the above lines the log is a mess again, but I see both OKs and ERRORs.

I need to run now; just posting my early findings in case it helps someone right now; will try to trigger the default configuration later, to ensure the config that is fetched from the internet is okay. To be continued…

1 Like

Would it make sense, after the gateway that stopped working properly has been turned on and powered up for you to press the reset button for five seconds so that it needs to be reactivated?. And then reactivate it selecting to not have automatic updates?

Yes! :tada:

Seeing LORA: recv_rpl: 0xd above, which is a newline, I wondered what would happen if I always first read any pending RX from the LoRa module’s UART before sending new commands.

Guess what: in two occasions there was indeed such excessive newline pending in the receive buffer while the firmware was about to send a new command. Even better: discarding those makes the LoRa configuration complete without any error. Next, the activation simply succeeds and it even downloads new firmware, which obviously no longer includes my fixes, but somehow is not booted (yet) as the SD card with my own firmware is still in the gateway, my own firmware is loaded again after the reboot. So, using my own firmware it’s now activated and happily forwarding packets…! And the debug logs no longer show garbage either.

To be continued, but: in my case the reboot loop during activation is apparently totally fixable by just using new firmware.

And I’m not even alarmed by this weird double occurrence of frame counter 9, on two frequencies…

image

(Well, I am…)

10 Likes

@arjanvanb Can you share the image you created, curious to see of this fixes the reboots I have every so many hours.

Here goes, based on today’s develop branch: https://drive.google.com/open?id=15UMxp7voWhCAHY0_xvZDkWDPHf74pPuX

I will create a PR tomorrow have created a PR, so if you can wait a few days for TTP/TWTG to validate it: just wait :wink:

If you don’t want to wait:

  • I don’t have the factory firmware for you, so: no way back!
  • No need to wipe any existing configuration.
  • Unpack the /update folder to the root of a FAT32 formatted SD card.
  • Remove power, insert the SD card, attach power.
  • If you’re using a serial cable for logging: set the baudrate of your monitor to 921600.
  • Leave the SD card in your gateway to avoid any downloaded firmware from overwriting it.
  • After the updated firmware was downloaded, when the SD card is still in place, you’ll see:
    FIRM: Starting download
    FIRM: available bytes: 79
    FIRM: (Downloaded FOTA key) 69 AE B7 78 1F 49 4E 7F BC B6 C7 CD 9C 59 4F 5D FA AA 3D 81 D4 9C 56 90 A6 83 81 98 FF 18 88 6A 
    FIRM: (Stored FOTA key) 69 AE B7 78 1F 49 4E 7F BC B6 C7 CD 9C 59 4F 5D FA AA 3D 81 D4 9C 56 90 A6 83 81 98 FF 18 88 6A 
    FIRM: Firmware is already downloaded
    MAIN: No new firmware available
    
    …which makes me think that even when removing the SD card after that, it won’t overwrite the firmware until something new is released. But I did not test.

This basically adds a bit more logging, plus:

// flushUart removes any pending bytes from the receive buffer.
void flushUart(DRV_HANDLE handle)
{
    bool flushing = false;
    uint8_t buffer[1]; 
    while(DRV_USART_Read(handle, buffer, 1) > 0) 
    {
        if(!flushing)
        {
            SYS_DEBUG(SYS_ERROR_DEBUG, "LORA: flushing: ");
            flushing = true;
        }
        SYS_DEBUG(SYS_ERROR_DEBUG, "%02x ", buffer[0]);
    }
    if(flushing)
        SYS_DEBUG(SYS_ERROR_DEBUG, "\r\n");
}

(Trying to adhere to the existing code style… Also, it would be nice to use, e.g., DRV_USART_ReceiverBufferIsEmpty, but the tooling refuses to find that?)

The above is then invoked at the start of:

bool sendCommand(uint8_t command, uint8_t* payload, uint16_t len)
{  
    flushUart(appData.USARTHandle);

    bool     gotresponse   = false;
    
    ...

    SYS_DEBUG(SYS_ERROR_DEBUG, "LORA: sendCommand %s\r\n", gotresponse ? "OK" : "ERROR");
    return gotresponse;
}

I only get LORA: flushing: 0d (being an empty line) once now.

However, the mystery continues: I doubt getting two lines as a reply is normal, and I don’t understand how the code could ever print the following!? (See followup post for an explanation.)

RF: 1,1,868500000
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d 
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

Okay, the code in my firmware uses "%02x" to print the hexadecimal values, which is different from the "%#x" in the code on GitHub. But how could it ever (consistently!) print two subsequent lines for LORA: recv_rpl for this very command…!?

My logging

This is complete; see next post for an explanation.

CNFG: Configuring LoRa module
LORA: Changing state from 2 to 4
LORA: Starting reconfiguration

LORA: send_cmd: 23 31 01 00 00 55 0d 
LORA: recv_rpl: 23 31 01 00 00 55 0d
LORA: sendCommand OK

LORA: send_cmd: 23 3a 01 00 00 5e 0d 
LORA: recv_rpl: 23 3a 10 00 01 01 4c 47 38 35 30 31 36 30 31 37 38 32 04 01 0d
LORA: sendCommand OK
LORA: version: 01

LORA: configureRXChain(0, ...)
RF: 0,1,867500000
LORA: flushing: 0d 
LORA: send_cmd: 23 34 06 00 00 01 e0 ff b4 33 24 0d 
LORA: recv_rpl: 23 34 01 00 00 58 0d
LORA: sendCommand OK

LORA: configureRXChain(1, ...)
RF: 1,1,868500000
LORA: send_cmd: 23 34 06 00 01 01 20 42 c4 33 b8 0d 
LORA: recv_rpl: 23 34 07 00 00 01 01 80 e5 f9 ff be 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(1, ...)
IF: 1,1,1,-200000
LORA: send_cmd: 23 35 07 00 01 01 01 c0 f2 fc ff 0f 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(2, ...)
IF: 2,1,1,0
LORA: send_cmd: 23 35 07 00 02 01 01 00 00 00 00 63 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(3, ...)
IF: 3,1,0,-400000
LORA: send_cmd: 23 35 07 00 03 01 00 80 e5 f9 ff c0 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(4, ...)
IF: 4,1,0,-200000
LORA: send_cmd: 23 35 07 00 04 01 00 c0 f2 fc ff 11 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(5, ...)
IF: 5,1,0,0
LORA: send_cmd: 23 35 07 00 05 01 00 00 00 00 00 65 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(6, ...)
IF: 6,1,0,200000
LORA: send_cmd: 23 35 07 00 06 01 00 40 0d 03 00 b6 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChainX(7, ...)
IF: 7,1,0,400000
LORA: send_cmd: 23 35 07 00 07 01 00 80 1a 06 00 07 0d 
LORA: recv_rpl: 23 35 01 00 00 59 0d
LORA: sendCommand OK

LORA: configureIFChain8(...)
IF8: 1,1,-200000,250000,7
LORA: send_cmd: 23 36 08 00 01 01 c0 f2 fc ff 02 02 14 0d 
LORA: recv_rpl: 23 36 01 00 00 5a 0d
LORA: sendCommand OK

LORA: configureIFChain9(...)
IF9: 1,1,300000,125000,50000
LORA: send_cmd: 23 37 0b 00 01 01 e0 93 04 00 03 50 c3 00 00 f4 0d 
LORA: recv_rpl: 23 37 01 00 00 5b 0d
LORA: sendCommand OK

LORA: send_cmd: 23 40 01 00 34 98 0d 
LORA: recv_rpl: 23 40 01 00 00 64 0d
LORA: sendCommand OK

LORA: send_cmd: 23 31 01 00 00 55 0d 
LORA: recv_rpl: 23 31 01 00 00 55 0d
LORA: sendCommand OK

LORA: send_cmd: 23 30 01 00 00 54 0d 
MON: SYS Stack size: 2870
MON: heap usage: 152KB (233KB), free: 187KB
LORA: recv_rpl: 23 30 01 00 00 54 0d
LORA: sendCommand OK

LORA: configLora OK
LORA: Configuration succeeded
LORA: Starting operation

Hi Arjan,
I was in for an experiment, so I downloaded your firmware.
It did not work right a way however. I tried reset, new activation etc.
At a certain point I pushed the LORA card and I felt it went a bit further into the connector.
And now it suddenly works.
Could it be a combination of problems?
It is a pity I did not try to push the card before I changed firmware. And since there is no way back…
But at least it seams to work now and that is very nice.
Thanks for your excellent work to find the cause.

The original code is also full of timed waits/delays with comments that they are there to give time to print something. :open_mouth: doesn’t look very stabil.

Do you have a debug cable? Then I wonder if you see the line LORA: flushing. (Using 921600 baud.)