How reliable are downlinks?

dajt · June 12, 2020, 6:19am

I’m developing an application to control a water pump. My dev setup is:

A Multitech Conduit gateway that uses the mobile phone network to talk to the internet;
A Feather M0 915 node;
The MCCI LMIC library, and the as923 frequency.

I have no control over the gateway, and cannot see its logs etc - it’s owned by the customer. I am almost certainly the only node using the gateway. The gateway is probably 6 metres or so away from my desk in the workshop (a garage), so further if I’m in the house.

I am finding joins take a long time, sometimes up to an hour. After a fresh install of MCCI LMIC they seem better, but still up to 10 minutes.

I am finding downlinks to be what I consider for this project hopelessly unreliable. They were much worse, but since the reinstall of MCCI LMIC they’ve improved to the point it takes about 4 or 5 uplinks to get a downlink through.

The feather sends an uplink of a single byte every 10 minutes, these are bit flags to give the status of the pump. But for testing I can send them any time, so they are much more frequent.

I expect the downlinks to be 2 messages every few days - turn the pump on, then off a few hours later when the tank is full. But that’s really 8+ downlinks on that day, and could easily be 20+ given how many attempts it takes for one to get through.

Using TTN console I was manually adding downlinks after each uplink until one got through, but I have changed to using the confirmed delivery flag to get this done automatically. Is this going to cause problems because so many of them have to be sent?

If the downlinks take 4 or 5 uplinks before they are received properly, then a command to switch the pump on or off could take 40+ minutes, and that is an open-ended interval because depending on I don’t know what, it could take many more uplinks until the downlink is received. This is pretty poor, but workable. Not much of a user experience though, if they ask the pump to turn on manually and it doesn’t happen for n hours.
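To put rough numbers on that open-ended interval, here is a minimal sketch that treats every uplink as an independent chance for the queued downlink to get through. The independence assumption, the per-uplink success probability and the function names are illustrative only, not measurements; the 10-minute uplink interval is the one described above.

```python
import math

UPLINK_INTERVAL_MIN = 10  # status uplink period described in this post


def expected_delay_min(p_success, interval_min=UPLINK_INTERVAL_MIN):
    """Mean wait in minutes until a queued downlink is received.

    Assumes each uplink gives an independent chance p_success of receiving
    the downlink, and that the first chance comes one full interval after
    the downlink was queued.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return interval_min / p_success


def uplinks_for_confidence(p_success, confidence=0.95):
    """Smallest number of uplinks n with P(received within n) >= confidence."""
    if not 0 < p_success <= 1 or not 0 < confidence < 1:
        raise ValueError("p_success must be in (0, 1], confidence in (0, 1)")
    if p_success == 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_success))
```

With a success chance of 0.2 per uplink (about one in five, in line with the 4 or 5 uplinks seen so far), the mean wait is 50 minutes, but 14 uplinks (140 minutes) are needed to be 95% sure the command has arrived.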

So:

Are downlinks just this bad, or is it some local problem I have? Could it be caused by sending too many uplinks while I’m testing?
If TTN console says a downlink has been sent, does that mean the gateway will definitely try to send it to the node, or does the gateway have the freedom to drop it if it feels like I’m taking up too much air time? I’m wondering if the node is having trouble reading the data from the gateway, if the gateway is dropping the downlink, or if the mobile network latency from TTN to the gateway means the gateway gets the downlink too late for the node’s rx window.
Is the problem caused by the Feather/LMIC combination? Uplinks are very reliable.
Is the long joining time caused by the fact downlinks are so bad? I’m assuming there is a downlink for key exchange etc.
Does a gateway using the mobile phone network suffer from latency in the mobile network as opposed to one connected to a wifi or ethernet LAN? Does this latency impact on the 2 second downlink window?
Is LoRaWAN just the wrong technology for control projects? If so, what is the use of downlinks?

Regards, David.

descartes · June 12, 2020, 8:44am

There is definitely something wrong - joins should be a matter of a minute or two, downlinks may miss on an uplink but usually make it on the next uplink. This does imply the GSM connection isn’t all that it should be as the timings are quite sensitive.

LoRaWAN is fine for control projects, but I’d personally stick to sending settings and occasionally non-time-sensitive commands. Commands like open the door or deploy the rocket booster or change the traffic lights wouldn’t be in scope.

In theory your application is fine as long as you have something to shut off the pump if the tank gets full and the device can’t hear the turn off command.

Can you set up a test rig with a gateway that’s on a wired internet connection that’s not rubbish? This will be a quick elimination exercise of which bit is the problem.

kersing · June 12, 2020, 11:21am

A LoRaWAN class A network (which the community network is at this point in time) is definitely not the best solution for control projects. Class C would be ideal, at the moment that requires a TTI instance.
Keep in mind class C nodes use a lot more power because they are continuously receiving, whereas class A nodes sleep a lot of the time.

cslorabox · June 12, 2020, 12:18pm

I have to agree that if LoRaWAN is going to be used here, it should be class C. The pump sounds like it probably consumes a fair amount of power so there’s likely either a mains power supply or at least a hefty battery/solar one that probably wouldn’t be drained by running a LoRa receiver continuously.

That said, if at all possible it’s better to close the control loop locally - running the pump until the tank is full is something that should be done via local control logic.

More generally, while class A operation is not really suitable here it doesn’t sound like the current implementation is really working even for that. Downlinks require timing to be right, and timing with LMiC can definitely be an issue, though the Feather M0 is one of the better supported platforms in the MCCI repo. A mixup of uplink/downlink frequency or other air settings when moving to a less common bandplan could also be at fault.

The asker really, really needs to gain access to the gateway to effectively debug, or even buy another one or hang a concentrator on a pi (make a cost argument, it’s not worth wasting even a few hours of work over this!). There’s a very narrow window of time for the server to get a downlink request back to the gateway (even for a join accept, the server has to wait until just before the deadline as it assumes a queue-less packet forwarder) so moving the gateway to wired Ethernet would be good.

If LMiC is modified so that the transmit frequency is known in advance, then the downlink frequency can be figured out as well and an RTL-SDR dongle tuned there to try to catch the downlink.

But really working out timing bugs should be done with a scope or logic analyzer watching both a GPIO on the node, and the gateway’s transmit LED. Or in theory a simple RF powermeter type receiver could catch both uplink and downlink from nearby sources irrespective of frequency, letting you check that the interval between the blips is exactly the receive window delay.
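Once the two blips have been captured, the comparison itself is simple. A minimal sketch follows; the function name and the 5 ms tolerance are hypothetical. The default delays are the LoRaWAN specification defaults (RX1 opens 1 s after the end of the uplink and RX2 one second after RX1, with 5 s and 6 s for a join accept), and a network can configure different values.

```python
def classify_downlink(uplink_end_s, downlink_start_s,
                      rx1_delay_s=1.0, tolerance_s=0.005):
    """Return "RX1", "RX2" or None for a captured uplink/downlink pair.

    Timestamps are in seconds on the same clock (e.g. the logic analyzer's).
    RX2 is taken to open one second after RX1.
    """
    gap = downlink_start_s - uplink_end_s
    if abs(gap - rx1_delay_s) <= tolerance_s:
        return "RX1"
    if abs(gap - (rx1_delay_s + 1.0)) <= tolerance_s:
        return "RX2"
    return None  # downlink missed both windows: the node cannot hear it


# Hypothetical captures: a data downlink and a join accept.
print(classify_downlink(10.0, 11.001))                  # lands in RX1
print(classify_downlink(10.0, 11.3))                    # outside both windows
print(classify_downlink(0.0, 6.0, rx1_delay_s=5.0))     # join accept in RX2
```

A result of None for a downlink the gateway really did transmit points at timing (late delivery to the gateway or a drifting node clock) rather than at signal quality.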

mfalkvidd · June 12, 2020, 7:39pm

At only 6m from the gateway, the signal strength of the downlink can overpower the node’s radio front end.

It might be worth trying to move the node further away.

dajt · June 12, 2020, 10:33pm

Thanks for the ideas everyone. It seems this level of performance is not expected which is encouraging.

My testing over the last couple of days has shown it to be pretty good with the latest version of MCCI LMIC as given by the Arduino IDE. I think all the downlinks have arrived within 6 uplinks. It’s just annoying not knowing if the problems are due to the gateway or not, so I wanted to see if others were having the same downlink problems, which would mean that is just how it is.

This is a university project and while the customer really does want automatic control for their pump, I don’t think they’ll spend money on it over and above what they already have buying the gateways and feathers etc. they’re using in other projects.

The virus situation over the last few months has not helped - they’ve generously loaned me the GSM gateway to use at home, but on-site is a different one I hope to get access to when I can go and visit them. I am hoping that solves the downlink problem.

Class C would be good - we have power all the time because the pump is connected to mains electricity. But LMIC doesn’t support class C operation and it looks like that would also require paying for a special ThingsNetwork server which also rules it out.

The pump does have existing automatic switch-off mechanisms so we’re not going to damage it if we don’t get the switch-off command on time. We also have hours to receive the switch-on command because the low-water mark is 50% of tank capacity. So in practice I expect even the lousy downlink performance we have now would work, but it’s annoying not knowing whether the pump will switch on within 10 minutes (first timed status message from the controller), 1 hour (within 6 messages), or some longer time if things go really badly.

Regards, David.

arjanvanb · June 13, 2020, 8:20am

I’d call that really bad. If you find that you need multiple uplinks for your confirmed downlink with the non-GSM gateway as well, then I really feel you should fix it, or look for alternatives. Don’t take a non-optimal system into production. (I’d not even start at all until you can use Class C.)

Did you confirm that the nearby region is used for the gateway’s router? And that the same region is used for your application’s handler? (I’m not sure which component keeps the downlink queue.)

The gateway owner could add you as a “collaborator” for the gateway in TTN Console, so you could at least see the gateway’s Traffic page in TTN Console, and tell if TTN has commanded the gateway to (re-)transmit the downlink. It probably has, so then access to the gateway’s raw logs is invaluable to determine if latency is the issue. Well, @cslorabox explained more above.

Note that a downlink that is scheduled while handling an uplink with that measurement, is likely not even transmitted until the next uplink: My application's downlink is always queued for next uplink.

As for confirmed downlinks, you may also want to check what happens if you replace it while it was not yet confirmed. (Say, the “switch on” downlink was transmitted but not yet confirmed, and meanwhile you determined that a “switch off” downlink needs to be scheduled instead. Does TTN delete the non-confirmed downlink from the downlink queue, or does it wait forever for the confirmation?)

When using the TTN Community Network which is operated on best effort only, you should also take long network outages into account. And maybe in a few years your LoRaWAN gateway or device will just break.

Controlling a pump to fill a tank doesn’t feel like a good LoRaWAN use case to me, not even for Class C. I very much agree with:

Curious: how far away is the tank (or its water level sensor) from the pump?

ame · June 13, 2020, 9:07am

I’m planning something similar, but instead of turning the output on there will be some local smarts in the node that accepts commands to turn on and off, but starts a timer when an “on” command is received. Then, if the network dies the timer will time out and the output will turn off automatically. But, if everything is good and I want the device to be on longer I’ll re-send the “on” command before the timer expires which will re-set it.
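A minimal sketch of that timer logic, written in Python for readability rather than as node firmware. The class and method names are hypothetical, and the clock is injectable only so the behaviour can be tested without waiting.

```python
import time


class TimedOutput:
    """Output that switches itself off unless the "on" command is refreshed."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._deadline = None  # None means the output is off

    def command_on(self):
        # Every "on" command starts the timer, or restarts it if already on.
        self._deadline = self._clock() + self.timeout_s

    def command_off(self):
        self._deadline = None

    def is_on(self):
        # Poll this from the main loop and drive the relay with the result.
        if self._deadline is not None and self._clock() >= self._deadline:
            self._deadline = None  # no refresh arrived in time: fail safe to off
        return self._deadline is not None
```

If the network dies after an “on” command, the output drops out by itself after `timeout_s` seconds; a repeated “on” before then simply pushes the deadline back.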

arjanvanb · June 13, 2020, 10:13am

Ah, the following was actually confirmed to work:

So, I’m quite sure confirmed downlinks can be replaced with a different confirmed downlink just fine.

dajt · June 13, 2020, 11:35am

Replacing unack’d confirmed downlinks is an interesting case I had not considered! Glad it works.

I agree taking 4-6 attempts to get a downlink is terrible, but given I only managed to get about 3 in a month when this project started my expectations are pretty low at this point and I’m very pleased anything happens at all.

I have written the code to work with either a wifi or lorawan feather because testing with wifi is a lot easier, and in case this just doesn’t work with lorawan I can suggest we use wifi. But I think the project sponsor wants to prove lorawan can either do this or can’t for cases where farms have tanks further away from the house than wifi can reach. GSM is another option but there are already solutions using that. The wifi and lorawan code do not co-exist - you get one or the other so there is no time wasted on a stack not being used. Class C would require us either to write the class C code or to move to a different device that does support it; both are way outside the spec for this project.

At the moment there is no “on” automation for the pump so even if the gateway or the feather dies they’re no worse off. The tank level readings are taken from a separate sensor - nothing to do with the feather. There is already at least one automated “off” mechanism, we’re going to allow for timeouts to be sent with the “on” messages, and we also check a couple of input signals that will make the feather switch the pump off without a downlink message. We’re pretty good for switching it off I think. The annoyance will be wondering why it hasn’t switched on when it should have, and sending too many downlinks if the performance isn’t any better with the gateway on-site.

Right now, it’s 10x better than it was 3 days ago and I can demo it without being too embarrassed. We have next semester to hopefully get on-site and see how it goes in place.

cslorabox · June 13, 2020, 2:49pm

It still doesn’t make sense why you are putting the radio in the turn-on/turn-off path at all instead of having that locally automatic and using LoRaWAN only to report status and possibly change control-loop settings.

Even if your water level sensor is remote from the pump, you probably want to connect them via a shorter range point-to-point link (LoRa or otherwise) and not through the LoRaWAN gateway.

That’s wholly apart from how your LoRaWAN node-gateway interactions don’t yet seem to be working as they should.

ame · June 14, 2020, 12:29am

I am not the OP. Sorry for the confusion. The system I am preparing will be locally controlled, with LoRaWAN used for reporting the state, but I am hoping to use downlink commands to control a relay to modify the state, but even if the downlink fails, or is delayed, or the relay fails, the system will still operate safely.

Besides, what’s the point of a downlink if it can’t be used?

My installation is slightly different to the OP: I have a tank on a hill. There is no LoRaWAN coverage there, but there is cellphone coverage. I have a pump in a valley, 1km away. There is no LoRaWAN or cellphone coverage there.

Phase one is to install a water level sensor in the tank connected to a LoRaWAN analogue node. Next to the tank I will install a LoRaWAN gateway with a cellular modem. The sensor, node, gateway, and modem will be powered by a small solar array and battery bank. The node will be only a few metres from the gateway, so it’s a bit pointless, but it allows me to start getting data in a consistent way.

Phase two is to install a sensor on the pump in the valley and connect it to a LoRaWAN digital node. It will report the status of the pump (running/not running) by connecting to the gateway on the hill that was installed in phase one. 1km is still not that far for LoRaWAN. A digital output on the node will do something with the pump (but I don’t recall what it is just now…), but it’s not critical as the pump operates automatically based on a pressure switch. In other words, my use of the downlink to control an output is not part of a control loop.

arjanvanb · June 14, 2020, 8:45am

Downlinks are very useful: for OTAA, for ADR, for remote configuration, maybe even for a remote reboot to allow for joining a different network. (Unfortunately, downlinks for confirmed uplinks apparently have a design or implementation flaw.) Downlinks should work properly, and if not then one should fix that. But controlling things is just not a good use case for Class A LoRaWAN.

Just for the sake of completeness, though already mentioned in many other places: even if downlinks work fine, there’s also the limit of 10 downlinks per day. I’d assume that retries for confirmed downlinks count against the Fair Access Policy as well, as the network cannot be blamed for that.
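As a rough check of that budget against the figures given earlier in this thread (two commands on a pumping day, 4 to 6 attempts per command), a small sketch. It assumes every retry counts as a downlink, which is the assumption stated above rather than a confirmed rule, and the function name is made up.

```python
FAIR_ACCESS_DOWNLINKS_PER_DAY = 10  # limit quoted in this post


def downlinks_needed(commands_per_day, attempts_per_command):
    """Downlinks transmitted per day if every attempt counts."""
    return commands_per_day * attempts_per_command


for attempts in (1, 4, 6):
    needed = downlinks_needed(2, attempts)
    verdict = "within" if needed <= FAIR_ACCESS_DOWNLINKS_PER_DAY else "over"
    print(f"{attempts} attempt(s) per command: "
          f"{needed} downlinks/day, {verdict} the limit")
```

At 4 attempts per command the pumping day uses 8 of the 10 downlinks; at 6 attempts it needs 12 and goes over.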

ame · June 14, 2020, 9:34am

Yes, I have borne that in mind. The node is a class C device, and we plan to turn it on (or off) at most once a day, but probably not very often at all.

arjanvanb · June 14, 2020, 10:20am

Class C devices are just Class A on the TTN Community network:

cslorabox · June 14, 2020, 2:25pm

Beware that LoRaWAN gateways are quite power hungry due to the multichannel DSP baseband receiver chip (some sellers will claim the 8-channel cards are 49-channel due to the number of distinct combinations that could be demodulated; ironically, their power consumption isn’t far off from 49x that of a node radio). You may need a bit larger solar setup than you’d expect to keep this up across cloudy days.

Personally I’d look into a custom point-to-point LoRa link using lower power node-class radios. Your box on the hill can wake up the mobile data modem periodically and report in, along with water level. And then it can command the pump via LoRa, probably something like “this repeatable message means run for 15 minutes”.

The one possibly tricky part is having the two node-class radios “find each other” if you conclude you need to use multiple channels, but you presumably have a fair amount of power at the pump and can keep that radio receiving during searches. Once you establish communication you can keep a schedule of windows which grow wider if a transmission is missed.
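One way the widening schedule could look, as a sketch only: the window is centred on the expected arrival time and doubles in width for every consecutive miss, up to a cap. The function name, the growth factor and all the numbers are hypothetical and would need tuning to the real clock drift.

```python
def next_window(expected_s, base_width_s, missed, growth=2.0, max_width_s=60.0):
    """Return (open_time, close_time) of the next receive window, in seconds.

    expected_s: when the next transmission is scheduled to arrive
    missed:     number of consecutive windows in which nothing was heard
    The window widens with each miss to absorb accumulated clock drift and
    goes back to base_width_s once the caller resets missed to 0.
    """
    width = min(base_width_s * growth ** missed, max_width_s)
    return expected_s - width / 2, expected_s + width / 2


# A 2 s window around t=600 s, after 0, 2 and 10 consecutive misses.
for missed in (0, 2, 10):
    print(missed, next_window(600.0, 2.0, missed))
```

The cap keeps a long outage from turning the receiver on permanently, at the cost of possibly needing a fresh search if the clocks have drifted further than the cap allows.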

bluejedi · June 14, 2020, 9:01pm

FYI: I have updated the topic title.

Downlinks themselves are not unreliable. It is a question of what you want to use them for, and whether LoRaWAN as a technology is suitable for that application. “How unreliable are downlinks?” implies that downlinks are unreliable as a matter of course, which is not correct.

ame · June 14, 2020, 9:51pm

This is all great information. I am using TTN for test/development, and it might be appropriate for deployment. If I need something “better” I can pay for TTI, or use our incumbent telecom operator’s offerings.

The MikroTik gateway I am using consumes 7W maximum. The Teltonika modem uses <5W. The Ursalink node uses <2.5W and the sensor about 1W. The solar power system will be sized accordingly.
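For the sizing, a back-of-envelope sketch using the figures above as continuous worst-case loads. The number of cloudy days, the usable battery fraction, the peak sun hours and the system efficiency are placeholder assumptions, not site data, and the function names are made up.

```python
# Worst-case continuous loads in watts, taken from the figures above.
LOADS_W = {"gateway": 7.0, "modem": 5.0, "node": 2.5, "sensor": 1.0}


def daily_energy_wh(loads_w):
    """Energy used in 24 hours if every load runs at its maximum."""
    return sum(loads_w.values()) * 24


def battery_wh(loads_w, cloudy_days, usable_fraction):
    """Battery capacity needed to ride through cloudy_days with no sun."""
    return daily_energy_wh(loads_w) * cloudy_days / usable_fraction


def panel_w(loads_w, peak_sun_hours, system_efficiency):
    """Panel rating needed to replace one day's energy on an average day."""
    return daily_energy_wh(loads_w) / (peak_sun_hours * system_efficiency)


print(daily_energy_wh(LOADS_W))                              # Wh per day
print(battery_wh(LOADS_W, cloudy_days=3, usable_fraction=0.5))
print(panel_w(LOADS_W, peak_sun_hours=4, system_efficiency=0.7))
```

With these inputs the worst-case load is 15.5 W, or 372 Wh per day; real consumption should be lower because the quoted figures are maximums.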

ame · June 15, 2020, 10:26pm

Ok. I have clarified what we want the digital output for. It is to prevent the pump from running when we are not expecting to use water.

If the tank is full the pump will automatically shut-off (because of a pressure switch). But, it will attempt to restart every 15 minutes. To reduce wear and tear we want to have an override switch on the pump. This will be turned on or off at most once a day, probably with a few days between each switching event, i.e. when we turn it off it’s because we don’t want water for a while, and when we turn it on, we’ll leave it on for a while. Basically this output is permitting the system to run (with its own automatic controls) or not, and is not part of the control loop.

Is this an acceptable use case?

descartes · June 16, 2020, 9:59pm

But it is if it overrides the automatic filling of the tank.

Only you can decide if this is an acceptable use case - if you send an OFF command and then someone uses the water but the override is still OFF and for whatever reason downlinks aren’t getting through, is this OK?