LMIC plus RFM95W staying awake for long periods

Ah, good point. Now I know what you mean. Well, I will have to deal with it and reinitialize the I2C to avoid further problems.

Ah, that code explains much of the logging you posted earlier!

Sounds silly, but what if you’d print a bit more to see if changing timer0_overflow_count makes the expected difference? Untested:

Serial.print(micros());
Serial.print(F(" / "));
Serial.println(os_getTime() / OSTICKS_PER_SEC);

Serial.println(F("-RTC-"));
cli();
timer0_overflow_count += 60 * 64 * clockCyclesPerMicrosecond(); //give back 60 seconds of sleep
sei();

Serial.print(micros());
Serial.print(F(" / "));
Serial.println(os_getTime() / OSTICKS_PER_SEC);

Also note the following from @matthijskooijman, which also changes timer0_millis:

A while ago, I implemented a workaround by updating the Arduino timer value. This uses internal Arduino variables, so is likely to break at some point if Arduino changes its implementation. It is also specific to the AVR Arduino core, other architectures likely need something else. Still, here it is (from here):

ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    extern volatile unsigned long timer0_millis;
    extern volatile unsigned long timer0_overflow_count;
    timer0_millis += slept;
    // timer0 uses a /64 prescaler and overflows every 256 timer ticks
    timer0_overflow_count += microsecondsToClockCycles((uint32_t)slept * 1000) / (64 * 256);
}

I’ve no idea if that’s needed… The code referenced by Matthijs mentions:

// Update the millis() and micros() counters, so duty cycle
// calculations remain correct. This is a hack, fiddling with
// Arduino's internal variables, which is needed until
// https://github.com/arduino/Arduino/issues/5087 is fixed.

As silly as it may sound, I had a similar suspicion and checked that. It does change os_time.
Anyway, I added the code to get a better view.

Here is what the debug output looks like:

-RTC-
1605335192 / 1605
2136670000 / 2136
28:0 … go to sleep
woken … IRQ
-RTC-
2136679776 / 2136
2668014600 / 2668
29:0 … go to sleep
woken … IRQ

The really interesting thing is the other part you dug out. :face_with_raised_eyebrow:
What makes me wonder is that it changes not only timer0_overflow_count but also timer0_millis. One could suspect that not updating one of the two might get something out of sync.
This is definitely worth a try… test is running.

As far as I understand, LMiC only uses micros(), for which updating timer0_overflow_count might suffice. But your other code might trigger usage of millis() too?
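For reference, here is a simplified view of how those two clocks hang together on the stock AVR Arduino core (a sketch from memory, not the exact wiring.c source):

// Simplified sketch: millis() reads timer0_millis, while micros() is rebuilt
// from timer0_overflow_count plus the live TCNT0 register, so patching only
// one of the two after sleep leaves the two clocks disagreeing.
extern volatile unsigned long timer0_millis;
extern volatile unsigned long timer0_overflow_count;

unsigned long sketchMillis() {
  // Incremented (roughly once per millisecond) in the TIMER0_OVF interrupt.
  return timer0_millis;
}

unsigned long sketchMicros() {
  // Each timer0 overflow is 256 ticks at a /64 prescaler; the real micros()
  // also handles a pending, not-yet-serviced overflow with interrupts off.
  return ((timer0_overflow_count << 8) + TCNT0) * (64 / clockCyclesPerMicrosecond());
}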

I was expecting a difference of 60 seconds there? You’re seeing 2,668,014,600 - 2,136,679,776 = 531,334,824 microseconds, hence 531 seconds? (Same for 2,668 - 2,136 = 532 seconds.)

:roll_eyes:
You are good! You really are!

Oops, this is what happens when you are careless and assign a value of 60000 milliseconds to a 16-bit integer :rofl:
Fixed it, now it’s 60 seconds:

-RTC-
72687736 / 72
132687248 / 132

I ran it for a couple of minutes, adding 600 seconds per minute. This might be nonsense, but doing so without also updating millis caused the problem to occur within minutes instead of hours. So far, safe and sound. I will monitor the proper time correction overnight and keep you posted.

Update: The problem is still there.

It took some time to show up, but LMIC still starts to delay the sending occasionally. This keeps the RFM95 awake for anything between 2 and 5 minutes while drawing 8 mA or so. With this behaviour, really low-power applications with the 328P are impossible.
I can see here https://www.thethingsnetwork.org/forum/t/full-arduino-mini-lorawan-below-1ua-sleep-mode/8059 that @Charles obviously did something similar, and I am wondering what I am doing fundamentally wrong.

You’re not seeing this when not sleeping at all, right? (Not sleeping, hence no code to fix the timers either.) What’s the payload length and SF? (I wonder if we can figure out if the 2 to 5 minutes is related to duty cycles.)

Also: do you get any larger downlinks (larger in size, hence including some server-side MAC command that’s piggy-backing on the downlink of your timestamp)? (I wonder if some earlier MAC command might have messed with LMiC’s state.)

For your delay, I think you should track when LMIC really wants to send data. For that, you can add some printf in “engineUpdate” for the variable “txbeg”.

Something like

#endif // !DISABLE_BEACONS

        // Trace to show the delay due to duty cycle. 
        LMIC_DEBUG_PRINTF("%ul: engineUpdate, tx will happen at %ul\n", os_getTime(), txbeg);

        // Earliest possible time vs overhead to setup radio
        if( txbeg - (now + TX_RAMPUP) < 0 ) {

And if the delay is due to txbeg, then you will be able to track where it comes from.

I fancy I need to go back two steps here, at least checking more basic things.
I have to admit I never let the “same” code run for long on the same node without the sleep. I did not feel the necessity, though it seems the better approach now.
I have now set up a very simplified node, sending 2 bytes every 5 minutes and not getting downlinks.
The first test will be with a delay(2000), 150 times between the sends: no sleep and no tinkering with the timer.

Edit:
You also asked for the SF. This is what the data says for one example:

{
  "time": "2019-01-06T17:44:41.269366308Z",
  "frequency": 868.5,
  "modulation": "LORA",
  "data_rate": "SF7BW125",
  "coding_rate": "4/5",
  "gateways": [
    {
      "gtw_id": "eui-b827ebfffe032416",
      "timestamp": 1130906507,
      "time": "",
      "channel": 2,
      "rssi": -83,
      "snr": 7.8,
      "rf_chain": 1,
      "latitude": 47.53158,
      "longitude": 7.81542,
      "altitude": 395
    }
  ]
}

And from the gateway:

{
  "gw_id": "eui-b827ebfffe032416",
  "payload": "QCQuASaAAgABmJwwYjTD",
  "f_cnt": 2,
  "lora": {
    "spreading_factor": 7,
    "bandwidth": 125,
    "air_time": 46336000
  },
  "coding_rate": "4/5",
  "timestamp": "2019-01-06T17:53:34.110Z",
  "rssi": -78,
  "snr": 9.2,
  "dev_addr": "26012E24",
  "frequency": 868300000
}

Here I am a bit at a loss.
How do I get LMIC_DEBUG_PRINTF to output anything to serial?
I can increase debug_level to 1, but that gives me out of memory in variable space. Bad.
So I changed most of the #if LMIC_DEBUG_LEVEL > 0 to #if LMIC_DEBUG_LEVEL > 5 just to make them not print.
Now it compiles, but none of the PRINTFs are sent to serial. :thinking:

To get LMIC_DEBUG_PRINTF to work, you have to uncomment

//#define LMIC_PRINTF_TO Serial

Or you can replace the LMIC_DEBUG_PRINTF with some Serial.print().

Serial.print(os_getTime());
Serial.print(F(" : engineUpdate => tx will happen at "));
Serial.println(txbeg);

OK, got it. Thank you a lot for helping me here.
I had to change the line to the following, though, to get the actual values:

LMIC_DEBUG_PRINTF("%lu: engineUpdate, tx will happen at %lu\n", os_getTime(), txbeg);

Btw, I did another test: if I send packets with a payload of 25 bytes as fast as I can, LMIC soon stretches the spacing to 12 seconds between the packets. They have an airtime of about 100 ms, so that’s about 1% airtime.
Now, reverting to one send every 5 minutes without sleep, I can sit and wait to see what happens.
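As a rough sanity check of that 1% figure (using the ~100 ms airtime quoted above; the names are just for illustration):

// Back-of-the-envelope check: at a 1% duty cycle, ~100 ms of airtime per
// uplink allows one packet roughly every 10 seconds, so LMIC spacing them
// about 12 seconds apart is consistent with its duty-cycle bookkeeping.
const float airtimeSeconds = 0.100f;                     // ~25-byte payload at SF7
const float dutyCycle      = 0.01f;                      // 1% band limit
const float minSpacing     = airtimeSeconds / dutyCycle; // = 10 s between uplinks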

After having run some code for 6h and 9h respectively, I have more information but nothing conclusive yet.
I changed the time output to be divided by OSTICKS_PER_SEC to get seconds, which are easier to read.

So, 6h of sending every 5 minutes without sleep (only 150x 2s delay) and no tinkering with the timers:
Debug looks like this from beginning to end:
go to sleep … … woke up …sending…

22345: Queue packet
22345: engineUpdate, tx at 21738; now is: 22345
22345: EV_TXSTART
22345: tx done
22351: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep … … woke up …sending…
22651: Queue packet
22651: engineUpdate, tx at 22085; now is: 22651
22651: EV_TXSTART
22651: tx done
22656: EV_TXCOMPLETE (includes waiting for RX windows)
sent

Then I changed to sleeping for 2 seconds instead of the delay. For the first 6h, no difference and no unexpected delays:

go to sleep … … woke up …sending…
33976: Queue packet
33976: engineUpdate, tx at 33369; now is: 33976
33976: EV_TXSTART
33976: tx done
33982: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep … … woke up …sending…
34282: Queue packet
34282: engineUpdate, tx at 33716; now is: 34282
34282: EV_TXSTART
34282: tx done
34288: EV_TXCOMPLETE (includes waiting for RX windows)
sent

Then, after 9.52 hours, this happens:

go to sleep … … woke up …sending…
34282: Queue packet
34282: engineUpdate, tx at 33716; now is: 34282
34282: EV_TXSTART
34282: tx done
34288: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep … … woke up …sending…
-34131: Queue packet
4294933165: engineUpdate, tx at 33981; now is: 4294933165
-34131: EV_TXSTART
4294933165: tx done
-34126: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep … … woke up …sending…
-33826: Queue packet
4294933470: engineUpdate, tx at 34328; now is: 4294933470
-33826: EV_TXSTART
4294933470: tx done
-33819: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep … … woke up …sending…
-33519: Queue packet
4294933777: engineUpdate, tx at 4294933170; now is: 4294933777
-33519: EV_TXSTART
4294933777: tx done
-33513: EV_TXCOMPLETE (includes waiting for RX windows)
sent

So, rollover happened, first for now, then for txbeg. No problem so far, as there are still no send delays.
The next test will be sending not every 5 minutes but every 1 hour. Maybe this will cause the problem to show up.

I don’t know if rollover occurred yet; it seems the debug print is just treating it as a signed value rather than an unsigned value. For an unsigned number, the value seems to have passed half its maximum by then. But sure: at some point it will roll over.

Ah, forget the negative printed values, these are the lines that you’re referring to:

34282: engineUpdate, tx at 33716; now is: 34282
...
4294933165: engineUpdate, tx at 33981; now is: 4294933165
...
4294933470: engineUpdate, tx at 34328; now is: 4294933470
...
4294933777: engineUpdate, tx at 4294933170; now is: 4294933777

Are all times in each line divided by OSTICKS_PER_SEC?

Something I noticed which is a bit weird, but might not affect things: os_getTime is defined as:

ostime_t os_getTime () {
    return hal_ticks();
}

…with hal_ticks expected to return an unsigned 32-bit number:

u4_t hal_ticks () {
    ...
}

…but ostime_t being a signed 32-bit number:

typedef uint32_t           u4_t;
typedef int32_t            s4_t;
...
// the HAL needs to give us ticks, so it ought to know the right type.
typedef              s4_t  ostime_t;
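A tiny illustration of that mismatch (illustrative snippet, not library code; it assumes the 62500 ticks per second that the numbers above imply):

// Once the unsigned tick count passes 2^31, converting it to the signed
// ostime_t comes out negative on AVR GCC. At 62500 ticks per second that is
// after about 9.5 hours, which matches where the log above turned negative.
uint32_t ticks = 2147483648UL + 10000UL; // just past 2^31
int32_t  osNow = (int32_t)ticks;         // what os_getTime() hands back
Serial.println(ticks);                   // prints 2147493648
Serial.println(osNow);                   // prints -2147473648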

Yes, all are divided by OSTICKS_PER_SEC.
Regarding the negative numbers: the next time I stop the script, I will check whether I can/have to cast some printouts to unsigned long so they do not look different because of flipping over to negative.
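Something like this should do (just casting the existing printout; assumed statement):

// Printing the signed ostime_t as unsigned long avoids the negative-looking
// values in the log while leaving LMIC's own arithmetic untouched.
Serial.println((unsigned long)(os_getTime() / OSTICKS_PER_SEC));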

And now, having it run with one wakeup per hour (1800 times a 2-second sleep, with the same number of 2-second “corrections” on the timer), I can see “bad things happening” after only a couple of hours:

Starting
 ...sending...
0: Queue packet
0: EV_JOINING
0: engineUpdate, tx at 3; now is: 0
0: next engine update in 3
3: engineUpdate, tx at 3; now is: 3
3: EV_TXSTART
3: tx done
8: EV_JOINED
8: engineUpdate, tx at 8; now is: 8
8: EV_TXSTART
8: tx done
13: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep ... ... woke up ...sending...
3612: Queue packet
3612: engineUpdate, tx at 8; now is: 3612
3612: EV_TXSTART
3612: tx done
3617: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep ... ... woke up ...sending...
2921: Queue packet
2921: engineUpdate, tx at 54; now is: 2921
2921: EV_TXSTART
2921: tx done
2928: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep ... ... woke up ...sending...
2231: Queue packet
2231: engineUpdate, tx at 2967; now is: 2231
2231: next engine update in 2967
2967: engineUpdate, tx at 2967; now is: 2967
2967: EV_TXSTART
2967: tx done
2973: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep ... ... woke up ...sending...
2277: Queue packet
2277: engineUpdate, tx at 3014; now is: 2277
2277: next engine update in 3014
3014: engineUpdate, tx at 3014; now is: 3014
3014: EV_TXSTART
3014: tx done
3019: EV_TXCOMPLETE (includes waiting for RX windows)
sent
go to sleep ... 

Time seems to be, well, running very slowly or even backwards in between. And all of a sudden, tx is after now and it delays.

Again, this is the code used instead of a simple delay(2000):

  // Requires the LowPower library and <util/atomic.h>
  LowPower.powerDown(SLEEP_2S, ADC_OFF, BOD_OFF);
  // Give the AVR back the 2 slept seconds
  unsigned long slept = 2 * 1000;
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    extern volatile unsigned long timer0_millis;
    extern volatile unsigned long timer0_overflow_count;
    timer0_millis += slept;
    // timer0 uses a /64 prescaler and overflows every 256 timer ticks
    timer0_overflow_count += microsecondsToClockCycles((uint32_t)slept * 1000) / (64 * 256);
  }

Strange, isn’t it?

I tried a different approach with a different node running on an AA battery (no serial, though):
If I do not try to give it back all the time it slept, but only just enough so that the duty cycle stays within limits (e.g. 1 minute per send or per hour; probably 10 seconds would also suffice), it seems to run nicely.
I do not know if I am just delaying the problem by doing so, or if it is now gone completely. But the node has been running for >24h now without showing any problems. It might take weeks to find out, but with the power-hungry delays no longer showing up, the battery should last about a year on that node, according to my estimates :wink:
I know this is just a nasty patch, but unless somebody has a good idea of what causes these time effects, I am at a loss as to how to fix it properly.
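In code, the “give back only part of it” idea looks roughly like this (a sketch with an assumed cap and an assumed sleptMs variable holding the milliseconds slept; not the exact code running on that node):

// Variant of the earlier correction: credit the Arduino timers with at most
// a capped amount instead of the full slept time.
const unsigned long MAX_CREDIT_MS = 60UL * 1000UL; // assumed cap: 1 minute per send
unsigned long credit = (sleptMs < MAX_CREDIT_MS) ? sleptMs : MAX_CREDIT_MS;
ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
  extern volatile unsigned long timer0_millis;
  extern volatile unsigned long timer0_overflow_count;
  timer0_millis += credit;
  // timer0 uses a /64 prescaler and overflows every 256 timer ticks
  timer0_overflow_count += microsecondsToClockCycles(credit * 1000UL) / (64 * 256);
}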

As an aside: at least for EU868, it seems the value for tx at 8 is taken from the earliest time any of the channels is available, which indeed can be in the past. So that looks okay.

Some theory…

os_time is based on micros(); it is a scaled-down version of micros(), so it does not overflow too often.
To keep the os_time counter accurate, it needs to detect the micros() overflow, which happens after approx 71 minutes.
The detection algorithm looks at a flip of the most significant bit of micros().

Now if you sleep one hour, most of the time (roughly 5 out of 6 wake-ups) you will have a micros() overflow during the sleep, and it won’t be detected, as the increment is big enough to keep the MSB unchanged!
And if you are out of luck, it will go backwards…

This is not a proven theory, it is just my gut feeling from reading the code, but you could validate it easily by calling os_getTime() each time you wake up (every 2 secs) to allow overflow detection.
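For reference, the overflow detection in question lives in hal_ticks(); a simplified sketch of the idea (paraphrased, not the verbatim AVR HAL source):

// micros() is shifted down by US_PER_OSTICK_EXPONENT, and an 8-bit 'overflow'
// byte supplies the upper bits of the tick count. That byte only advances when
// a flip of the shared bit (bit 31 of micros()) is observed, so hal_ticks()
// has to be called often enough to see every flip: micros() wraps after about
// 71 minutes.
#define US_PER_OSTICK_EXPONENT 4 // 16 us per tick, the default

static uint8_t overflow = 0;

uint32_t sketch_hal_ticks() {
  uint32_t scaled = micros() >> US_PER_OSTICK_EXPONENT;
  uint8_t msb = scaled >> 24;
  // Bit shared between 'msb' and 'overflow'; it mirrors bit 31 of micros().
  const uint8_t mask = 1 << (7 - US_PER_OSTICK_EXPONENT);
  // If the shared bit changed since the last call, bump 'overflow' so that its
  // upper bits keep counting completed micros() wraps.
  overflow += (msb ^ overflow) & mask;
  return scaled | ((uint32_t)overflow << 24);
}
// If micros() jumps by a whole hour between two calls, the shared bit can end
// up unchanged even though it flipped twice in between; the wrap is then
// missed and os_getTime() appears to run slow or even go backwards.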

Nice find! And that also explains why the correction code might work just fine when waking up from code, rather than from an external interrupt, as then the maximum sleep time is so much smaller? (Often people need a loop to sleep repeatedly to achieve longer times.)

I guess if you really sleep (that is: delay()), you would get the same behaviour, as the IBM co-routine mechanism does not benefit from yield().
The thing is, in a non-hardware sleep you would just re-schedule a task, not sleep.

But we see that in the 5-minute test the clocks behave better because we call os_getTime() more often.

As the overflow is about 71 minutes, we need to ensure we update at least twice in that timeframe, so if we update every 30 minutes, we should be good…

(again: to be tested…)
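A sketch of what that could look like in the sleep loop (hypothetical helper names; assumes the Rocket Scream LowPower library and the timer correction quoted earlier in this thread):

#include <LowPower.h>
#include <util/atomic.h>
#include <lmic.h>

// Credit the Arduino timers with the milliseconds spent in power-down sleep.
static void addSleptMillis(unsigned long slept) {
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    extern volatile unsigned long timer0_millis;
    extern volatile unsigned long timer0_overflow_count;
    timer0_millis += slept;
    // timer0 uses a /64 prescaler and overflows every 256 timer ticks
    timer0_overflow_count += microsecondsToClockCycles(slept * 1000UL) / (64 * 256);
  }
}

// Sleep in 8-second chunks and poke os_getTime() after every chunk, so the
// HAL reads the tick counter far more often than once per 30 minutes and
// never misses a micros() overflow.
void sleepSeconds(uint32_t seconds) {
  while (seconds >= 8) {
    LowPower.powerDown(SLEEP_8S, ADC_OFF, BOD_OFF);
    addSleptMillis(8000UL);
    (void)os_getTime(); // lets the overflow detection run
    seconds -= 8;
  }
}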

The code you quoted is essentially the basis for my code and it works as long as the total sleep time between sends does not exceed a certain limit. This limit is based on the explanations from @Amedee in the previous post. I will summarise my findings in a separate post now.