Message lora-dl: [TTN] Keepalive timed out

These messages could indicate downlink issues. Are you using downlinks successfully?

It’s still the same. I have forwarded port 1700 UDP in UniFi, but I still receive the message on the MikroTik router. I have connected the router with a Cat5 cable, so no wifi. Do you know if Ziggo blocks the traffic?

They didn’t when I was a Ziggo customer a few years back.

Apparently others have noticed in the past that if you have more than one UDP gateway sitting behind a particular NAT, some NAT implementations don’t track these well and so do not end up returning the downstream packets (acks or downlinks) to the correct client.

In theory, NAT is something that could be applied by your provider, not just your own little router/firewall/NAT appliance in your home or office.

One idea for debugging, if you have access to a mobile phone with tethering capability (or can simply take the gateway somewhere else temporarily), would be to briefly try it with a fundamentally different internet connection and see if you get a different result.

Another would be to run tcpdump or tshark or something similar on the router it is connected through and see what is (and isn’t) actually flowing through. Note that it has to be either the gateway itself or the router; another system on the same network will not see the traffic.

It also wouldn’t be very hard to create a little logging proxy for the UDP protocol, a couple of hours work adapting some generic python UDP example.
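For anyone who wants to try that, here is a rough sketch of such a logging proxy. It assumes the usual Semtech UDP header layout (one version byte, a two-byte token, one identifier byte, then an 8-byte gateway EUI on PUSH_DATA, PULL_DATA and TX_ACK). The names `describe_packet` and `run_proxy` are made up for this example, and it has not been tested against a real gateway.

```python
import select
import socket
import time

# Identifier byte values from the Semtech UDP packet forwarder protocol.
PACKET_TYPES = {
    0: "PUSH_DATA", 1: "PUSH_ACK", 2: "PULL_DATA",
    3: "PULL_RESP", 4: "PULL_ACK", 5: "TX_ACK",
}


def describe_packet(data: bytes) -> str:
    """Summarise the 4-byte header (and the gateway EUI when present)."""
    if len(data) < 4:
        return f"short packet ({len(data)} bytes)"
    kind = PACKET_TYPES.get(data[3], f"UNKNOWN({data[3]})")
    text = f"v{data[0]} token={data[1:3].hex()} {kind}"
    # Only packets sent by the gateway carry its EUI after the header.
    if data[3] in (0, 2, 5) and len(data) >= 12:
        text += f" gateway={data[4:12].hex()}"
    return text


def run_proxy(listen_port: int, upstream_host: str, upstream_port: int) -> None:
    """Relay UDP between local forwarders and one upstream server, logging both ways."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("0.0.0.0", listen_port))
    by_client = {}  # client (ip, port) -> dedicated upstream socket
    by_socket = {}  # upstream socket -> client (ip, port)
    while True:
        readable, _, _ = select.select([listener, *by_socket], [], [])
        for sock in readable:
            data, addr = sock.recvfrom(65535)
            stamp = time.strftime("%H:%M:%S")
            if sock is listener:
                # One upstream socket per client source port, so replies
                # can be matched back to the stream they belong to.
                upstream = by_client.get(addr)
                if upstream is None:
                    upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
                    by_client[addr] = upstream
                    by_socket[upstream] = addr
                print(f"{stamp} UP   {addr} {describe_packet(data)}")
                upstream.sendto(data, (upstream_host, upstream_port))
            else:
                client = by_socket[sock]
                print(f"{stamp} DOWN {client} {describe_packet(data)}")
                listener.sendto(data, client)
```

To use it you would call something like `run_proxy(1700, "<your server address>", 1700)` on a machine in the LAN and point the packet forwarder at that machine instead of the real server. A PULL_DATA line without a matching PULL_ACK line shortly after would show that the ack never made it back.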

That remark triggered something. Ziggo has been moving to IPv6 with DS-Lite, which uses carrier-grade NAT for IPv4. (Dutch article - https://community.ziggo.nl/internetverbinding-102/ipv6-ds-lite-fupc-gebied-en-bridge-mode-waarom-zou-dat-niet-kunnen-30135)

That would probably be hard, as the Ziggo hardware is closed. But RouterOS provides a basic packet sniffer in the tools menu (IIRC), so @Thunderwolf007 should be able to check if any port 1700 packets arrive on the gateway itself.

My network knowledge is basic, but I think I see traffic coming in on port 1700 UDP and I also see outgoing traffic on port 1700 UDP. See also the attached screenshot of the sniffer.

@Thunderwolf007 judging from your last screenshot, it seems to be working properly.

Been running an open gateway for months, forwarding around 450 msgs/hour, 8000-10000 msgs per day.

Everything is wired, symmetric 600Mbps fiber line, whose health is monitored constantly.

I found that MikroTik TTN gateways fail a lot, but so does TTN (the worst) and, to a lesser degree, DigitalCatapultUK, the most reliable I have found, which still fails a handful of times per day.

The Semtech forwarder UDP protocol, and UDP per se, is far from bombproof. Bear in mind it is a lightweight protocol with a “fire and forget” approach, so lost packets are part of the equation.

@kersing Most default, modern firewalls will be stateful, with “allow any traffic from LAN -> WAN, AND any responses to such connections” to flow through.

Unless TTN is expected to initiate a connection towards the LoRaWAN gateway (which AFAIK isn’t the case), there is no need to open any ports in the ROS default firewall rules; they allow local traffic to go to the internet, and any reply/related traffic to flow back.

Downlinks work fine for me.

A different scenario would be several LoRaWAN gateways/UDP forwarders behind the same internet connection connecting to the same TTN gw. I haven’t come across such a scenario yet, but it may require programming some sort of NAT “helper” or traffic mangle.

ROS already has one that may be repurposed (IP > firewall > services > UDPLite).

Again I haven’t checked this scenario, so just guessing.

This bothers me.

Gateways shouldn’t “fail”.

If there is an incident, it should be a specific identifiable failure.

Most of the software and electrical engineering work put into making our own (and the overwhelming motivation) has not been about getting it to be a gateway, but rather about making sure that when something goes wrong, we know what happened, and can come up with a plan to make sure it doesn’t happen again, or that it is automatically corrected, or else that we automatically notify someone quickly to go address it in person.

A different scenario would be several LoRaWan devices behind the same internet connection connecting to the same TTN gw, haven’t come across such scenario yet, but it may be required to program some sort of NAT “helper” or traffic mangle.

If you mean several UDP gateways connecting through the same router, there’s been concern in the past that some NAT implementations didn’t deal with that well, but I didn’t have much luck finding the origin of this the last time I searched the forum for it.

I feel you… and think like you. But to ensure that on an open network, the only option is setting up your own.

Was on the verge of setting my own chirpstack to run my apps in fact, but the error % is so low vs the messages trafficked I didn’t look back.

Public services exposed to the internet get used (and abused) a lot. And then there are the scans, attacks… mix that with UDP and errors are to be expected… what counts is the UDP forwarder implementation retrying the transmission if the gateway doesn’t respond to the keepalive.

200% agree with you, I would like to have good monitoring, but also understand that would mean a substantial investment in resources.

I wouldn’t mind giving a hand if someone is in such a scenario with a MikroTik router.

Strange, not seeing a lot of downtime impact on my (monitored) gateways and applications.

Yep, that was what I assumed my MikroTik gateway would be (stateful); however, two UDP gateways behind one firewalling router didn’t work reliably.

It’s debatable whether there’s time for a retry that would still allow hitting a 1 second RX1 delay.

Given the RX1 delay, the de-duplication delay also needs to be very short, so if any other gateway received the packet, it’s going out to the application sooner than the one having trouble could get its signal report added.

Patching up the UDP protocol isn’t really the answer; switching to a more sophisticated protocol is. The only gateways that cannot use something else are those where the user can’t change the software.

@cslorabox

Couldn’t agree more… I could understand making the protocol so simple (and the choice of UDP) if the devices acting as TTN gateways were embedded or low-powered, but this obviously won’t be the case, as the natural place to run them is on internet-connected routers.

I cannot understand the choice of UDP and such a simplistic approach either.

The NAT issues are to be expected whether or not ports are open, unless a helper is developed or some sort of programming/workaround is applied so that NAT doesn’t get in the way. The problem most probably isn’t that the ports aren’t open, but that when the router receives the responses it thinks the message is for one forwarder when it is in fact for a different one.

Were both forwarders connecting to the same TTN gw? Did you try using a different TTN gw for each forwarder?

In other words, the NAT implementation needs to be correct, i.e. it needs to track the source port vs internal IP even when the destination port and IP are the same, such that when it gets a response to that port it can route it back to the correct internal IP.
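To make that concrete, here is a toy model of the two behaviours. It is only an illustration: the addresses, ports and function names are invented, and the "broken" variant is a guess at one possible failure mode, not a description of what any particular router does.

```python
# Four outbound flows: two gateways, each with an uplink and a downlink socket.
# All addresses and port numbers here are invented for the illustration.
FLOWS = [
    ("192.168.1.10", 40001), ("192.168.1.10", 40002),
    ("192.168.1.11", 40001), ("192.168.1.11", 40002),
]
TTN = ("TTNEurope", 1700)  # same destination IP and port for every flow


def build_nat_table(flows, first_public_port=50000):
    """Correct tracking: every (internal IP, source port) gets its own public port."""
    outbound = {}  # (internal_ip, src_port) -> public port
    inbound = {}   # public port -> (internal_ip, src_port)
    for offset, flow in enumerate(dict.fromkeys(flows)):
        public_port = first_public_port + offset
        outbound[flow] = public_port
        inbound[public_port] = flow
    return outbound, inbound


def build_broken_table(flows, destination):
    """Hypothetical broken tracking: only one client remembered per destination."""
    last_client = {}
    for flow in flows:
        last_client[destination] = flow  # each new flow overwrites the previous one
    return last_client


outbound, inbound = build_nat_table(FLOWS)
for flow in FLOWS:
    # A reply arriving on the flow's public port is routed back to that same flow.
    print(flow, "-> public port", outbound[flow], "-> back to", inbound[outbound[flow]])

print("broken table delivers every reply to", build_broken_table(FLOWS, TTN)[TTN])
```

With the correct table all four streams get their replies even though the destination is identical. With the broken one only a single stream ever hears back, which is the kind of symptom where one forwarder works and the other sees keepalive timeouts.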


UDP and a simple protocol because the software was originally intended as a tech demo to be used with the Semtech demo server. It was never intended for widespread production use. At least that is what I’ve been told at some point during the last 5 years. However, as with all demo software, it has taken on a life of its own.

The helper is a requirement; connection tracking provides that. However, the firewall still needs to allow established/related traffic, and that rule might not be in place. In that case, allowing UDP traffic from port 1700 would help.

I can’t quite make out what you are asking but let me describe the setup:
What I was attempting at that moment was to have two gateways, both running a UDP packet forwarder, forward traffic to the TTN back-end for EU. So four UDP streams from two different internal IPs connect to one destination IP/port using masquerading. (Each packet forwarder has two UDP connections, one uplink and one downlink and TTN uses the same port (1700) for both.)
I have had this running at my home office, but at another location it failed. One of the gateways would not get ACK packets when the other (started first) was running. Once the first one was stopped the second one had no issues. Upon restart the first one had issues as long as the second one was running.
As we didn’t need to run the two gateways behind that single router (just noticed the issues while setting up the gateways) I didn’t get to the bottom of it but my suspicion was a NAT/connection tracking issue. However that was two major ROS releases ago so retesting would be required to see if it can be reproduced.
Around that time at least one other user mentioned he was unable to run two gateways (with UDP packet forwarders) within one NATted network and both connecting to the TTN EU back-end as well.

Agree.

Been going through the ROS changelogs, LoRa is being actively developed/fixed and found this for 6.46.5

*) lora - properly update source address for packets when routing table is changed;

I suspect the problem isn’t the ROS NAT but the MikroTik forwarder, which is MikroTik’s own implementation, or maybe a combination.

Has anyone tested this with a recent ROS version? I’d use a stable channel version, 6.47.2 today.

Heheh yes… it happens.

But I think that’s technical debt, and being a key component it should be properly implemented. Payloads are super small, and routers running forwarders won’t have a problem using TCP or a better-suited protocol for reliable signalling.

The default ROS firewall allows established/related connections.

Was this two MikroTik packet forwarders using a MikroTik as the router towards the internet?

… after quite a bit of testing, it seems that the logged errors have no influence on TTN traffic. Downlinks are reliable too … even on a wifi backbone “community network”. Btw: I have two MikroTik gateways on the same network and they are NATed. So far no problems observed. Btw: IPv6 would be a solution for CGN networks, but unfortunately TTN is hosted on a non-IPv6 platform :frowning: . The MikroTiks connect happily to a private TheThingsStack instance over IPv6.


No, this was years before MikroTik released LoRaWAN-compatible hardware. (Three years ago from memory, might be longer.)

TL;DR: it should work, but this is a partly known problem and mostly debuggable…

That should not be a problem. UDP is robust. As long as you have valid triples for every connection, the NAT device should track the connections statefully.

Gateway1/RND-SRCportA → NAT-DevicePub/portG → TTNEurope/DSTPort1700
Gateway1/RND-SRCportB → NAT-DevicePub/portH → TTNEurope/DSTPort1700
Gateway2/RND-SRCportC → NAT-DevicePub/portI → TTNEurope/DSTPort1700
Gateway2/RND-SRCportD → NAT-DevicePub/portJ → TTNEurope/DSTPort1700

Every reply from TTNEurope/DSTPort1700 is trackable, so replies to every source are possible. With DNS this works perfectly, but those are queries with immediate replies. Many firewalls, even professional ones, face the dilemma of a limited pool of source ports on the public side versus keeping sessions reserved. A 15 s session timeout for UDP is quite common, and that is where the problem lies: once the session has expired, the NAT device creates a new session with a new translated source port. If you are able to dump packets on both the LAN and the WAN side, I am quite convinced you will find that the sessions are not consistent, and that the NAT device shuffles source ports and confuses the receiving server about its sessions. With this knowledge and a configurable NAT/firewall, there is a chance to set stickier session timeouts for the protocol and the NAT tables.

Or switch to TCP :wink:
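The session-timeout effect described above can be sketched with a small simulation. It is deliberately simplified: only outbound packets refresh a session, the 15 s timeout is the figure mentioned above, the 10 s and 30 s keepalive intervals are assumptions picked for the example, and `NatTable` and `public_ports_seen` are made-up names.

```python
class NatTable:
    """Toy NAT that forgets a UDP session after timeout_s seconds of silence."""

    def __init__(self, timeout_s, first_public_port=50000):
        self.timeout_s = timeout_s
        self.next_port = first_public_port
        self.sessions = {}  # (internal_ip, src_port) -> session dict

    def outbound(self, src, now):
        """Return the public source port used for a packet sent by src at time now."""
        entry = self.sessions.get(src)
        if entry is None or now - entry["last_seen"] > self.timeout_s:
            # No session, or the old one expired: allocate a fresh public port.
            entry = {"public_port": self.next_port, "last_seen": now}
            self.next_port += 1
            self.sessions[src] = entry
        entry["last_seen"] = now
        return entry["public_port"]


def public_ports_seen(keepalive_s, timeout_s, count=4):
    """Public ports the server would see for `count` keepalives from one gateway."""
    nat = NatTable(timeout_s)
    gateway = ("192.168.1.10", 40001)  # invented internal address and port
    return [nat.outbound(gateway, now=i * keepalive_s) for i in range(count)]


print(public_ports_seen(keepalive_s=10, timeout_s=15))  # [50000, 50000, 50000, 50000]
print(public_ports_seen(keepalive_s=30, timeout_s=15))  # [50000, 50001, 50002, 50003]
```

When the keepalive interval is shorter than the timeout, the server keeps seeing the same public port. When it is longer, every keepalive looks like it comes from a new source, so a downlink sent to the previously seen port has nowhere to go. Either shortening the keepalive or raising the NAT's UDP timeout closes that gap.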
