Need Best Method to receive 1000 node data on gateway

Hello,

In our application, about 1000 nodes are powered on at the same time and remain on for 1-2 hours at most. How should I manage to get all the data from the devices?

Please suggest the best mechanism to handle and receive all the data without missing any.

Thank You

Job one is to avoid such a situation. Depending on how many GWs can service the nodes you will run into several problems; it is easiest to explain assuming a single GW. I also recommend you read up on the documentation and check out the LoRaWAN 101, as the problems then become obvious.

Basically, when a node starts up it will initialise and then start its join process (assuming OTAA, as best practice here). The node initiates a join request and, all being well, a short time later the NS will issue a join accept via the GW. 1000 nodes all literally doing the same thing at the same time is impractical due to conflicts - a GW can only listen to, and react to, a limited number of signals at a time.

The next issue - even if a given node does get its message through - is that the response will be a join accept downlink. Transmitting that downlink renders the GW briefly deaf to ALL other nodes that may be sending uplinks - including, by then, likely retries of the join request from ALL the not-yet-joined nodes. So you can see that all the non-joined nodes will back off and try again shortly thereafter, and so on… you get a cascading blockage where the nodes, if they are lucky, will gradually succeed in joining, but where the GW also struggles to send all the accepts back in a reasonable time.

The third issue is that the GW is itself a transmitter and - depending on where in the world you are - will also have restrictions on its operation. In EU regions we have a duty-cycle limit, and with lots of nodes joining at the same time there is a good chance the GW will be legally throttled from sending further join accepts until it falls back within its DC limits. So the drawn-out process above gets stretched even further.

In the meantime, your 1000 nodes are also effectively mounting a DDoS attack on all the other nodes in range of that community GW - and that is not a good thing. Also, depending on your location and the density and activity of the other community nodes, your nodes will have to share access: they won't just be competing with each other, they will be competing with the community, so even the drawn-out process above is an ideal, and the community nodes will likely draw out your overall join process even further!

Ok, that's the problem; what is best practice?

  1. Limit the number joining in a close time frame.
  2. Ensure that any given group joins with a truly random delay & dither - many nodes running the same firmware/code, perhaps with the same pseudo-random mechanisms, will still end up largely synchronised - bad!
  3. Ensure that if the initial join request fails they don't all have the same retry timing - again, use a truly random delay before retrying, and even then add a random 'dither' if you think the timing will be close.
  4. Use an extended back-off period - make the time between each retry gradually longer and longer (with a random dither element) to give the GW time to breathe and recover from DC restrictions and to limit clashing nodes (see the sketch after this list).
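
Points 2-4 can be combined in firmware. Below is a minimal sketch, assuming the stack exposes the DevEUI, a blocking delay and a single-shot join attempt - get_dev_eui(), lorawan_try_join() and delay_ms() are hypothetical names, so map them to whatever your stack (LMIC, LoRaMac-node, vendor SDK, …) actually provides. Seeding the PRNG from the DevEUI stops identical firmware images producing identical "random" delays.

```c
/* Sketch of points 2-4: per-device randomised start delay plus
 * exponential back-off with jitter for join retries.
 * All extern functions below are hypothetical placeholders. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

extern void get_dev_eui(uint8_t eui[8]);   /* unique 8-byte DevEUI        */
extern bool lorawan_try_join(void);        /* one OTAA attempt, true = ok */
extern void delay_ms(uint32_t ms);         /* blocking delay              */

static uint32_t seed_from_dev_eui(void)
{
    uint8_t eui[8];
    uint32_t seed = 2166136261u;           /* FNV-1a hash: identical      */
    get_dev_eui(eui);                      /* firmware still diverges     */
    for (int i = 0; i < 8; i++)            /* per device                  */
        seed = (seed ^ eui[i]) * 16777619u;
    return seed;
}

void join_with_backoff(void)
{
    srand(seed_from_dev_eui());

    /* Point 2: spread the very first attempt over 0-10 minutes. */
    delay_ms((uint32_t)(rand() % 600) * 1000u);

    uint32_t backoff_s = 30;               /* first retry window */
    while (!lorawan_try_join()) {
        /* Points 3-4: back-off plus up to 100% random dither. */
        uint32_t dither_s = (uint32_t)rand() % (backoff_s + 1u);
        delay_ms((backoff_s + dither_s) * 1000u);

        if (backoff_s < 3600u)             /* cap growth at about an hour */
            backoff_s *= 2u;
    }
}
```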

Note ABP may solve part of the problem (at least part of the GW TX capacity problem) but will cause others, and if the assumption is "I transmit and therefore I will be heard", with no way to manage at the node or via the application, it can be far worse… even where some random timings and dither are introduced.

TL;DR: 1000 nodes all turning on at the same time cannot be handled in one hit with just one or only a few GWs. Further, with normal LoRaWAN timings and event handling, if they start up and only run for 1-2 hours it is possible some stragglers may never join in time to send a useful message before turning off again! (Do the maths once you understand all the individual timings.)

One other thought: if deploying such a fleet for a short period of operation before shutdown - like your 1-2 hrs - that doesn't allow much time for the network to balance and optimise operation (SF & TX power), as mechanisms like ADR take time to adjust. It may be worth setting a random range of SFs for initial operation for any given joining group (note the LoRa Alliance, who define the LoRaWAN specs, don't allow networks to accept and join devices at SF12, so avoid that, and in the US SF11/12 can't be used anyhow due to dwell-time limits) - a rough sketch of that follows below. If running for more time, be sure to enable ADR, however, as the NS can then start the process of optimising operation… (if running for, say, a minimum of 30-100 TXs per node, as it usually takes 20-25 uplinks on TTN for ADR to start to cut in and improve/optimise node behaviour).
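
As a rough illustration of picking a randomised starting data rate, here is a minimal sketch using EU868-style DR numbering (DR0 = SF12 … DR5 = SF7); lorawan_set_datarate() is a hypothetical name, and the permitted DR range will differ per region and per stack:

```c
/* Sketch: randomise the initial data rate per node so a joining group
 * does not all start at the same SF. Avoids DR0/DR1 (SF12/SF11), and
 * assumes srand() has already been seeded per device (e.g. from DevEUI). */
#include <stdint.h>
#include <stdlib.h>

extern void lorawan_set_datarate(uint8_t dr);   /* hypothetical setter */

void set_random_initial_dr(void)
{
    uint8_t dr = (uint8_t)(2 + rand() % 4);     /* DR2..DR5 = SF10..SF7 */
    lorawan_set_datarate(dr);
}
```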

1 Like

A lot of people might wonder what circumstance requires that 1000 nodes are all turned on at the exact same time …

1 Like

Context is everything - if you can tell us what the application is, that would give us some ideas.

If not the big picture, some of the details, like payload size etc.

And how do you manage to turn on 1,000 devices at the same time? Are they mains or command line powered?

It should be possible to write firmware that can appropriately stagger ABP transmissions but more information would help inform that view.

1 Like

I am using ABP, and the 1000 nodes are spread over a large area where the mains AC power comes on and they all start at the same time.

Will CAD work here?

With larger installations, especially when mains powered, nodes should wait some random amount of time before starting to transmit (several different schemes/solutions are possible).

This will however (depending on your nodes) require modification of the firmware on the nodes (end devices).
If not, race conditions and congestion will occur.
1000 nodes all powering up and starting to transmit at the same time is comparable to a DDOS attack.
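
One such scheme, as a minimal sketch for ABP end devices: derive the start-up delay from something already unique per node, such as the device address, so identical firmware images still land in different slots. lorawan_get_devaddr() and delay_ms() are hypothetical names to map onto your own stack.

```c
/* Sketch: stagger ABP start-up using the device address, so mains-powered
 * nodes that all regain power together do not all transmit together. */
#include <stdint.h>

extern uint32_t lorawan_get_devaddr(void);  /* 32-bit ABP device address */
extern void     delay_ms(uint32_t ms);      /* blocking delay            */

void staggered_start(void)
{
    /* Spread first transmissions over a 10-minute window (600 slots of 1 s),
       deterministic per device: different devices get different slots even
       when running identical firmware with identical PRNG seeds. */
    uint32_t slot_s = lorawan_get_devaddr() % 600u;
    delay_ms(slot_s * 1000u);
}
```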

Also, use more than one gateway to increase concurrent handling capacity (and to avoid a single point of failure).

1 Like

Well, yes, but only because that’s how the concentrator works anyway.

Good side-step of the other question about payload size - that makes it hard to offer ideas!

I'll add another: can you update the devices' firmware?

1 Like

Maybe we should make consultancy a paid option on the forum, unless people at least provide all the relevant information needed…

1 Like

Or a refundable deposit with a standard deduction per missing answer (which can be “I don’t know” or “I’m not allowed to say”).

We’ll make millions!

1 Like

It has been hard since TTN stopped funding the necessary maintenance on our crystal balls.

2 Likes

Theoretically, 1000 nodes each sending one 50-byte DR5 packet, perfectly time-synchronized over 8 channels, takes 12.5 seconds.

1000 nodes each assigned one 3.6-second TX window can all transmit within one hour.

Reduce that by a factor of 8 to 7.5 minutes using synchronised pseudo-random channel assignment with the same 3.6-second window: rotate through the channel list with an assigned starting position.

Reduce the 3.6-second window to 1 second for completion in about 2 minutes.

However, this approaches 100% utilization and would not be friendly to others wanting to use the ISM band.

But compared to 1000 nodes sending blindly into the ether for 2 hours, there appears to be room for compromise.
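
For anyone wanting to sanity-check those figures, a quick back-of-envelope calculation (assuming roughly 0.1 s of airtime for a 50-byte DR5/SF7 uplink; the exact figure depends on preamble, header and coding rate) reproduces them:

```c
/* Back-of-envelope check of the slot/channel arithmetic above.
 * Assumes ~0.1 s airtime per 50-byte DR5 uplink. */
#include <stdio.h>

int main(void)
{
    const int    nodes     = 1000;
    const int    channels  = 8;
    const double airtime_s = 0.1;

    /* Perfectly packed, 8 channels in parallel */
    printf("ideal:       %.1f s\n", nodes / (double)channels * airtime_s);  /* 12.5 s  */
    /* One 3.6 s slot per node, single shared sequence */
    printf("3.6 s slots: %.0f min\n", nodes * 3.6 / 60.0);                  /* 60 min  */
    /* Same slots rotated over 8 channels */
    printf("8 channels:  %.1f min\n", nodes * 3.6 / channels / 60.0);       /* 7.5 min */
    /* 1 s slots over 8 channels */
    printf("1 s slots:   %.1f min\n", nodes * 1.0 / channels / 60.0);       /* ~2.1 min */
    return 0;
}
```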

Local regulation may prohibit transmitter cooperation and LoRaWAN is traditionally an ALOHA protocol.

See page 25, FDMA and TDMA

Nice outline, but without @dovov giving us the payload size and, to try to factor in SF/DR, a rough geographical spread, it is indeed theoretical.

If half the nodes need to run at DR4 and 100 of them run at DR3, the calculations get stupid messy.

I’ve no doubt I could come up with a scheme for some scenarios, trouble is I can’t think of one that has 1,000 nodes on mains power that only runs for 2 hours so it’s not just theoretical, for me it’s academic.

I have this picture in my head of someone running round a site plugging in 1,000 nodes, using the big on switch on the site power, recovering and then 2 hours later running round retrieving the nodes!

Nice ‘theoretical’ back-of-envelope paper analysis, but sadly nothing like what happens in a real-world deployment! :wink:

A few spanners in the works, off the top of my head:

In a class A node deployment it's never going to happen - clock drift on the nodes (assuming ABP, to limit the loss of GW capacity to the join process, MAC commands, etc.) means they will gradually drift away from each other over time in the field, the timing discipline will be lost and conflicts will start to happen.

A class B deployment (using GW beacons to correct/reset timing) might help discipline, but it is an improbable deployment and few TTN-deployed GWs can support class B; some support class C. As the nodes appear to be mains powered, that might be an option?

  1. Most probably the nodes will be distributed over varying distances from the GW, so one fixed DR is not realistic; and even if they were theoretically at suitable distances, real-world issues - reflections, absorbers, building clutter, walking sacks of salt water, moving vehicles, topology, etc. - will perturb RF propagation and actual usable reception at both ends of the link.

  2. One of the characteristics of LoRa (and LoRaWAN) is that overall spectrum and GW capacity is best utilised when running with an idealised model of the distribution of channels and particularly DRs for a given GW (think of the area around a GW as like an archery target or dart board, with the GW at the bullseye), and that idealised model will vary by node configuration and external factors out of your control (* such as the items below). GWs never run with idealised capacity :frowning:

4*) Other nodes in range of your GW will also be transmitting, potentially causing uncontrolled conflicts/collisions, even with the resilience of LoRa modulation.

5*) Other nodes in the area may be initiating or be the recipients of GW downlinks - join processes, ADR updates, MAC commands (especially for misbehaving nodes outside your control), commanded downlinks, etc. - all of which will turn the GW deaf to your nicely scheduled TXs and wipe out timeslots.

6*) Given enough other nodes in the area, and even allowing for you limiting the need for DLs for your own use, there is a severe risk that in higher-traffic areas the GW may max out its duty cycle - whether for RX1 responses on the same channel & DR as the TXing nodes (generally a 1% limit where applicable) or even the higher-duty-cycle RX2 window (10% on TTN where applicable).

7*) Rogue/badly behaved nodes can really bugger things up, in line with the above!

8*) Even if the network appears to settle into a reasonably steady state (though less performant than your ideal, by definition), it can still be further perturbed and upset by transitory nodes passing through your area for a period, from which the system will then have to recover back to its prior stable state - think cold-chain/logistics/asset-tracking type nodes passing by your GW, especially if they are having to do a rejoin due to a link loss and all rejoin close to your GW, forcing it to send downlinks. I see that regularly with trains/rolling stock and cargo coming through some areas. It is also a common sight with road-freight truck rolls.

9*) How resilient is the system/application to packet loss? Across my GW and node estate - at least what I have bothered to monitor at various times - I typically see 0.1-1.5% loss, though some deployments can peak at 2-3%; TTI suggest you should actually model for up to 10% packet loss! (Due to some of the issues flagged above, but also due to external RF phenomena and interferers, plus packets received OK by the GW but then subsequently lost over the backhaul/internet en route to the LNS back end!) How do the numbers look if you start to model for such losses? (See the rough sketch after this list.)
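
As a rough feel for that, here is a tiny sketch, assuming for illustration one uplink per minute over a 2-hour window and treating losses as independent (real RF losses tend to be bursty, so the numbers are on the optimistic side):

```c
/* Sketch: expected deliveries for 120 uplinks (2 h at an assumed 1/min)
 * at various loss rates, plus the chance of a 3-uplink gap at any given
 * point. Independent-loss assumption is optimistic. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int    uplinks = 120;
    const double loss[]  = { 0.01, 0.03, 0.10 };

    for (int i = 0; i < 3; i++) {
        double p = loss[i];
        printf("loss %4.0f%%: ~%.0f of %d uplinks received, "
               "P(3 consecutive losses) = %.1e\n",
               p * 100.0, uplinks * (1.0 - p), uplinks, pow(p, 3));
    }
    return 0;
}
```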

Of course none of this reflects a mode of operation where 1000 devices come on together and then run for just 1-2 hrs… Per my original post, I know from experience that there will be some nodes (possibly many!) that never get a useful payload received during the operating window - even if they are able to join! (They may send, but the GW, and hence the back end, may never see it…) Densifying the network with many more GWs is the only hope for this…

There are some legitimate use cases that might fall in scope, but they are unlikely to use this model and more likely will stay live for much longer, and will likely retain state and not have to rejoin once they have joined - I'm thinking e.g. wide-area street-lighting deployments after first switch-on, etc. But they use lots of GWs and 'bring up' by the dozen, or a couple of hundred at most, if they want a quick, reliable first start-up - they have learnt the hard way not to do it 1,000 at a time! :wink: Some years back I saw a potential application using hundreds, and potentially thousands, but again with several GWs to monitor a very wide area - microclimate monitoring, fire-path evaluation and personnel/asset tracking & monitoring for forest-fire management; you need as many sensors as possible as quickly as possible… but they used other mechanisms to manage start-up rather than a big bang.

I am still curious, even despite the total lack of information.

If you were planning for a situation where suddenly you want 1000 nodes to connect, and presumably pass on some information (another guess), surely you would design the system so that exactly this did not happen in the first place?

The challenge being that if using ABP they need to use a lower DR / higher SF for the initial messages to be sure, depending on their location, of getting a message through - which means the first few uplinks will largely be lost.

In the long term, no.

But the OP described a situation where power is restored to a bunch of nodes all at the same time.

Without intentional randomization, with such a common trigger they're all going to transmit at the same time too, assuming they're all running the same firmware - divergence would only happen over time, and with LoRaWAN packet durations it could take a while to naturally de-conflict. And probably with the same spreading factor. So the only diversity is <=8 channels to hopefully randomly pick from - assuming that even that spec-required randomness actually works.

Normally, channel activity detection could help some, but when things are as synchronous as they could readily be for the first transmission after a common trigger without any intentional randomness, it won't - channel activity detection only works if one party started using a channel far enough in advance that another can notice before also trying to use that channel. It fails in the case of a truly simultaneous start, as you'd have with the same firmware on multiple copies of the same hardware triggered by the same power restoration.

Needless to say, in addition to the random hold-off uniquely needed by this particular application, the basic fact remains that all LoRaWAN nodes subject to frequent power failure must remember their session details and continue forward from their past history, without trying to do a new OTAA join or an illicit reset of ABP frame counters each time they get power again. (A minimal sketch of persisting the counters is below.)
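
What that persistence can look like, as a minimal sketch for an ABP device: nvm_read()/nvm_write() and the frame-counter getters/setters are hypothetical names, to be mapped onto your MCU's EEPROM/flash driver and your LoRaWAN stack.

```c
/* Sketch: persist ABP frame counters across power loss, so a node never
 * restarts its counters from zero on the next power-up. */
#include <stdbool.h>
#include <stdint.h>

extern bool nvm_read(uint32_t addr, void *buf, uint32_t len);        /* hypothetical */
extern bool nvm_write(uint32_t addr, const void *buf, uint32_t len); /* hypothetical */
extern uint32_t lorawan_get_fcnt_up(void);
extern uint32_t lorawan_get_fcnt_down(void);
extern void     lorawan_set_fcnt_up(uint32_t fcnt);
extern void     lorawan_set_fcnt_down(uint32_t fcnt);

#define FCNT_NVM_ADDR 0x0000u
#define FCNT_MAGIC    0xA5C3F00Du

struct fcnt_record {
    uint32_t magic;      /* marks the record as valid */
    uint32_t fcnt_up;
    uint32_t fcnt_down;
};

/* Call after every uplink, or every N uplinks to limit flash wear. */
void save_fcnts(void)
{
    struct fcnt_record rec = {
        .magic     = FCNT_MAGIC,
        .fcnt_up   = lorawan_get_fcnt_up(),
        .fcnt_down = lorawan_get_fcnt_down(),
    };
    nvm_write(FCNT_NVM_ADDR, &rec, sizeof rec);
}

/* Call once at boot, before the first uplink. */
void restore_fcnts(void)
{
    struct fcnt_record rec;
    if (nvm_read(FCNT_NVM_ADDR, &rec, sizeof rec) && rec.magic == FCNT_MAGIC) {
        /* Add headroom in case uplinks happened after the last save. */
        lorawan_set_fcnt_up(rec.fcnt_up + 16u);
        lorawan_set_fcnt_down(rec.fcnt_down);
    }
}
```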

Given that lack of information, the activity in this thread is surprisingly high.

Hello guys,

Thanks for your replies; I am sorry that I didn't give all the information needed to suggest the best approach.
Here is the application:

In a village, I am doing some sensor measurement and power monitoring for about 1000 farms. Here in the village, the farms get 2 hours of electricity a day, all at the same time.
Payload length: 37 bytes.
Upload interval: 1 min. (No duty-cycle limit for IN865.)
ADR enabled.
And using ABP.
Thank You

Thanks for the information.

1000 farms suggests a big area.

What are the distances between the gateway and the farms?

I will arrange the gateways and nodes in such a way that the distance between any node and the nearest gateway is at most 4 km. So I will use 5-10 gateways. The total area of the 1000 farms would be within roughly a 20 km radius.