Multitech - TTN connection lost after bad connectivity

Paul_Stewart · February 5, 2018, 12:19pm

Hi All,

I have a few Multitech Conduits deployed on internet connections that aren’t exactly reliable (remote farms using 3G and 4G). All have @kersing’s “new” packet forwarder installed, following this guide: https://www.thethingsnetwork.org/docs/gateways/multitech/aep.html

Sometimes the internet connection drops out for an hour or more. When it comes back up, the Multitech deviceshq.com console shows that the gateways have an internet connection again, but they don’t reappear in the TTN console.

It normally takes a reboot of the gateway, or a complete power cycle, to get it connected again.

Could anybody explain what might be happening?

@kersing does it retry several times at different intervals and then give up and won’t try again?

Any help appreciated.

cheers
Paul

pe1mew · February 5, 2018, 12:41pm

I have a similar problem here.

kersing · February 5, 2018, 4:59pm

No, it continuously retries. At least it should. However there is an obscure issue I’ve been trying to solve for 6 months now that results in this behavior. As I can’t reproduce it myself I’m working with community members to get to the bottom of it. (I’m pushing new sources with additional debugging while typing this)

pferland · February 5, 2018, 6:43pm

A similar rare, intermittent issue occurs with the “Semtech UDP” packet forwarder built into the Conduit. A workaround in that packet forwarder is to set an “autoquit_threshold” in global_conf.json under gateway_conf. For example:
{
  […]
  "gateway_conf": {
    "autoquit_threshold": 5,
    "server_address": "router.eu.thethings.network",
    "serv_port_up": 1700,
    "serv_port_down": 1700,
    […]
  }
}

After 5 failures the packet forwarder will exit and the watchdog included in the AEP OS will start another instance of the packet forwarder a few seconds later.

If the connectivity problem is only related to the PPP connection, you could add a line to /etc/ppp/ip-up that restarts the packet forwarder. For the built-in packet forwarder, this means adding the line "/etc/init.d/lora-packet-forwarder restart" to the bottom of the file.
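The config change described above can also be scripted. The following is only a minimal sketch in Python: the helper name `set_autoquit_threshold` is made up for illustration, and the example dictionary contains just the keys shown in the post, not a complete global_conf.json. Whether Python is available on the gateway itself is an assumption; the same edit can be made by hand.

```python
import json


def set_autoquit_threshold(config, threshold=5):
    """Return a copy of the config with gateway_conf.autoquit_threshold set.

    The input dictionary is left untouched so the original can be kept
    as a backup before anything is written back to disk.
    """
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    updated.setdefault("gateway_conf", {})["autoquit_threshold"] = threshold
    return updated


# Only the keys quoted in the post; a real global_conf.json has more.
example = {
    "gateway_conf": {
        "server_address": "router.eu.thethings.network",
        "serv_port_up": 1700,
        "serv_port_down": 1700,
    }
}

patched = set_autoquit_threshold(example, 5)
print(json.dumps(patched, indent=2))
```

On a real gateway you would load the file with `json.load`, apply the helper, and write the result back with `json.dump`, then restart the packet forwarder so it picks up the new value.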

pe1mew · February 6, 2018, 6:02am

I would love to have such a feature, because it would save me from checking the gateway daily and resetting it manually. However, I would prefer to know the cause.

It seems as if the software is suffering from the same bug?

ecoadapt · February 6, 2018, 8:27am

Hi guys,
I have experienced the same issue several times with all three of my TTN Multitech Conduits.
Regards,
Sylvain

cultsdotelecomatgmai · February 6, 2018, 8:44am

Hi @Paul_Stewart, I run a variety of unattended systems on RPi/Linux including a number of TTN gateways and a number of edge-computing “things”.
I maintain availability using systemd to restart software and watchdog to restart the OS.
I have not used the packet forwarder from @kersing, but automating restarts is easy if the software exits with a non-zero status when it hits an abnormal situation, uses a PID file, and has an option to regularly touch a file in /var/run while operating normally. With those in place, hangs and exits are easy to detect.
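The "touch a file while healthy" idea can be sketched as a staleness check. This is a hedged illustration in Python, not something either packet forwarder is known to support: the heartbeat file name and the 120-second limit are hypothetical, and the demo writes to a temporary directory rather than /var/run.

```python
import os
import tempfile
import time


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or too old.

    A monitored process is assumed to touch `path` regularly while it is
    healthy; a missing or old file suggests it has exited or hung.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # no file at all counts as stale
    return (now - mtime) > max_age_seconds


# Demo with a throwaway file (hypothetical name, not a real forwarder path).
with tempfile.TemporaryDirectory() as tmp:
    heartbeat = os.path.join(tmp, "pkt-fwd.heartbeat")
    missing_is_stale = heartbeat_is_stale(heartbeat, 120)
    with open(heartbeat, "w"):
        pass  # creating the file stands in for the process touching it
    fresh_is_stale = heartbeat_is_stale(heartbeat, 120)
    later_is_stale = heartbeat_is_stale(heartbeat, 120, now=time.time() + 600)

print(missing_is_stale, fresh_is_stale, later_is_stale)
```

A supervisor (systemd, cron, or a small loop) would call the check periodically and restart the service when it returns True.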

jreiss · February 6, 2018, 3:14pm

When the PPP connection is restored, the IP address may have changed.
This breaks the UDP socket connection between the gateway and network server.

The packet forwarder has a keepalive_interval setting: it sends packets to the network server at that interval so that downlinks can be sent at any time. But this keepalive is only half configured without an autoquit_threshold to bring down the socket or exit the process.

Using the autoquit_threshold setting, as Peter mentioned, allows the process to exit when the network server is unreachable. When the process is restarted, a new socket is created on the changed interface.

This is also available in Kersing’s packet forwarder, as it is based on the Semtech code.

This can be tested over Ethernet by changing the IP address of the gateway.

Without the autoquit_threshold the packet forwarder will keep receiving packets and send them over the broken socket.

When autoquit_threshold is specified, the process quits. If there is an angel or monitor process to restart the packet forwarder, the socket is then recreated and the connection is restored automatically.
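The interaction between keepalives and autoquit_threshold can be modelled in a few lines. This is a simplified sketch in Python, not the forwarder’s actual code: it assumes the counter goes up for each keepalive that gets no acknowledgement, resets on any acknowledgement, and that a threshold of 0 means "never quit". The function name `should_quit` is made up for illustration.

```python
def should_quit(ack_history, autoquit_threshold):
    """Model the autoquit counter over a sequence of keepalives.

    ack_history: iterable of booleans, True if that keepalive was
    acknowledged by the server, False if it went unanswered.
    Returns True once `autoquit_threshold` keepalives in a row have gone
    unanswered; a threshold of 0 is treated as "disabled" (assumption).
    """
    missed = 0
    for acked in ack_history:
        if acked:
            missed = 0  # any acknowledgement resets the counter
        else:
            missed += 1
            if autoquit_threshold > 0 and missed >= autoquit_threshold:
                return True
    return False


# Occasional losses never reach the threshold, so the process keeps running.
print(should_quit([True, False, False, True, False], 3))
# A dead socket after an IP change: every keepalive goes unanswered.
print(should_quit([True] + [False] * 5, 5))
# Without a threshold the process never exits, however long the outage.
print(should_quit([False] * 100, 0))
```

The third case corresponds to the situation described above: with no threshold, the process keeps sending over a socket that will never work again.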

Paul_Stewart · February 7, 2018, 5:35am

@kersing do you think this would work with your forwarder (as a workaround)?

I have one gateway in particular where this is a pretty regular problem, and if I could simply do this for now, that would be great.

regards
Paul

Paul_Stewart · February 7, 2018, 6:04am

@cultsdotelecomatgmai sounds like a good solution to the problem but it would be good to diagnose the original issue.

With my level of expertise (I am already in over my head), I’ll see if the autoquit line works first.

cheers
Paul

pe1mew · February 7, 2018, 7:20am

Same here. In a couple of weeks I will contact @kersing to see if I can help find the root cause.

kersing · February 7, 2018, 11:51am

I’m working on solving the issue. Temporarily on hold because I’ve got influenza. Reading a few mails can be done on a tablet in bed, coding not…

pe1mew · February 7, 2018, 11:55am

Get well soon! Take it easy, there is no rush. It’s only a hobby.

Paul_Stewart · February 7, 2018, 12:34pm

Get better soon @kersing

We all really appreciate the work you do. I don’t think I would have 7 gateways operating on TTN if it wasn’t for you.

Cheers Paul

ecoadapt · February 19, 2018, 2:23pm

Hi all,

My Multitech AEP gateways still disconnect from the TTN server, and I have to reboot the gateway each time it happens. I don’t know what the root cause is; it is probably not even a connectivity issue.
Below are the last packet forwarder logs. The disconnection occurred at this time or a bit before. There were no logs after this until I rebooted 3 hours later.
Does anyone have an idea of what the issue could be and/or how to recover from it automatically?

Regards,
Sylvain

admin@mtcdt:~# tail -f /var/log/lora-pkt-fwd.log
12:09:42 INFO: Flush output after statistic is disabled
12:09:42 INFO: Flush after each line of output is disabled
12:09:42 INFO: Watchdog is disabled
12:09:42 INFO: Contact email configured to ""
12:09:42 INFO: Description configured to ""
12:09:42 INFO: [Transports] Initializing protocol for 1 servers
12:09:58 ERROR: [TTN] Connection to server "bridge.eu.thethings.network" failed, retry in 30 seconds
12:11:02 ERROR: [TTN] Connection to server "bridge.eu.thethings.network" failed, retry in 60 seconds
12:11:50 ERROR: [TTN] Connection to server "bridge.eu.thethings.network" failed, retry in 30 seconds
12:12:33 INFO: [TTN] server "bridge.eu.thethings.network" connected
^C
admin@mtcdt:~# date
Mon Feb 19 14:54:14 CET 2018
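A log like the one above can be checked mechanically for "last line says connected, then silence". The sketch below, in Python, is only an illustration: the regular expression is derived from the four [TTN] lines shown and may not match other forwarder versions, and the time comparison assumes both timestamps are from the same day and the same time zone.

```python
import re
from datetime import datetime

# Matches lines such as: 12:12:33 INFO: [TTN] server "..." connected
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(INFO|ERROR):\s+\[TTN\]\s+(.*)$")


def last_ttn_event(lines):
    """Return (timestamp, state) for the last [TTN] connect/fail line, or None."""
    last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, _level, message = match.groups()
        if message.endswith("connected"):
            last = (stamp, "connected")
        elif "failed" in message:
            last = (stamp, "failed")
    return last


def seconds_between(earlier, later):
    """Seconds between two HH:MM:SS stamps, assuming same day and time zone."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds())


sample = [
    '12:09:42 INFO: [Transports] Initializing protocol for 1 servers',
    '12:09:58 ERROR: [TTN] Connection to server "bridge.eu.thethings.network" failed, retry in 30 seconds',
    '12:11:02 ERROR: [TTN] Connection to server "bridge.eu.thethings.network" failed, retry in 60 seconds',
    '12:12:33 INFO: [TTN] server "bridge.eu.thethings.network" connected',
]

event = last_ttn_event(sample)
silence = seconds_between(event[0], "14:54:14")
print(event, silence)
```

For the log in this post, the last event is a successful connection, followed by well over two hours with no output at all, which points at a hang rather than at repeated failed retries.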

ecoadapt · February 19, 2018, 3:10pm

Actually, I saw the same symptoms today at roughly the same time on all 3 of my TTN gateways. They are located at different places, so this is not an issue with local connectivity but rather points to a root cause on the TTN server side. How can I gather information on server faults? In any case, whatever the server issue, the gateways should have recovered from this…

kersing · February 19, 2018, 6:08pm

Please be patient for a couple more days. I’m running (hopefully) final tests on an update that should solve these issues.

Paul_Stewart · February 19, 2018, 10:07pm

Great work!

ecoadapt · February 20, 2018, 8:28am

Thanks, I look forward to testing it!

ecoadapt · February 21, 2018, 10:52am

Hi,

Once again, one of my TTN gateways lost connectivity 18 hours ago. But this time, unlike the last issue, the logs are still being written (see below). There is no message regarding the connection to TTN, though.
Do you think it is another symptom of the same problem?

##### 2018-02-21 10:43:45 GMT #####
### [UPSTREAM] ###
# RF packets received by concentrator: 0
# CRC_OK: 0.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
# RF packets forwarded: 0 (0 bytes)
# PUSH_DATA datagrams sent: 0 (0 bytes)
# PUSH_DATA acknowledged: 0.00%
### [DOWNSTREAM] ###
# PULL_DATA sent: 0 (0.00% acknowledged)
# PULL_RESP(onse) datagrams received: 0 (0 bytes)
# RF packets sent to concentrator: 0 (0 bytes)
# TX errors: 0
### BEACON IS DISABLED!
### [JIT] ###
# INFO: JIT queue contains 0 packets.
# INFO: JIT queue contains 0 beacons.
### GPS IS DISABLED!
### [PERFORMANCE] ###
# Upstream radio packet quality: 0.00%.
# Semtech status report send.
##### END #####

##### 2018-02-21 10:44:16 GMT #####
### [UPSTREAM] ###
# RF packets received by concentrator: 5
# CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
# RF packets forwarded: 5 (108 bytes)
# PUSH_DATA datagrams sent: 0 (0 bytes)
# PUSH_DATA acknowledged: 0.00%
### [DOWNSTREAM] ###
# PULL_DATA sent: 0 (0.00% acknowledged)
# PULL_RESP(onse) datagrams received: 0 (0 bytes)
# RF packets sent to concentrator: 0 (0 bytes)
# TX errors: 0
### BEACON IS DISABLED!
### [JIT] ###
# INFO: JIT queue contains 0 packets.
# INFO: JIT queue contains 0 beacons.
### GPS IS DISABLED!
### [PERFORMANCE] ###
# Upstream radio packet quality: 100.00%.
# Semtech status report send.
##### END #####
11:44:42  INFO: Disabling GPS mode for concentrator's counter...
11:44:42  INFO: host/sx1301 time offset=(1519207867s:282111µs) - drift=-805µs
11:44:42  INFO: Enabling GPS mode for concentrator's counter.

lgw_receive:1165: FIFO content: 1 99 0 5 22
lgw_receive:1184: [2 17]
Note: LoRa packet
lgw_receive:1165: FIFO content: 1 cb 0 5 1d
lgw_receive:1184: [1 17]
Note: LoRa packet
lgw_receive:1165: FIFO content: 1 f8 0 7 61
lgw_receive:1184: [3 17]
Note: LoRa packet

##### 2018-02-21 10:44:46 GMT #####
### [UPSTREAM] ###
# RF packets received by concentrator: 2
# CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
# RF packets forwarded: 2 (48 bytes)
# PUSH_DATA datagrams sent: 0 (0 bytes)
# PUSH_DATA acknowledged: 0.00%
### [DOWNSTREAM] ###
# PULL_DATA sent: 0 (0.00% acknowledged)
# PULL_RESP(onse) datagrams received: 0 (0 bytes)
# RF packets sent to concentrator: 0 (0 bytes)
# TX errors: 0
### BEACON IS DISABLED!
### [JIT] ###
# INFO: JIT queue contains 0 packets.
# INFO: JIT queue contains 0 beacons.
### GPS IS DISABLED!
### [PERFORMANCE] ###
# Upstream radio packet quality: 100.00%.
# Semtech status report send.
##### END #####
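The upstream counters in these status reports can be pulled out with a small parser, which makes it easier to compare many reports over time. This Python sketch only extracts the three counters named in the regular expression, based on the report format shown above. What the numbers mean depends on which transport the forwarder is using, so treating "packets forwarded but no PUSH_DATA sent" as a fault is an assumption, not a confirmed diagnosis.

```python
import re

# Counter names copied from the status report format shown in this post.
COUNTER_RE = re.compile(
    r"^#\s+(RF packets received by concentrator"
    r"|RF packets forwarded"
    r"|PUSH_DATA datagrams sent):\s+(\d+)"
)


def parse_stats_block(lines):
    """Return a dict of the upstream counters found in one status report."""
    counters = {}
    for line in lines:
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


block = """\
### [UPSTREAM] ###
# RF packets received by concentrator: 5
# CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
# RF packets forwarded: 5 (108 bytes)
# PUSH_DATA datagrams sent: 0 (0 bytes)
# PUSH_DATA acknowledged: 0.00%
""".splitlines()

counters = parse_stats_block(block)
print(counters)
```

Running this over every report in a log would show when the counters last changed, which helps narrow down when the gateway stopped delivering traffic.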