Hi @johan - all good. Thanks to everyone that helped.
Great to hear, thanks for letting me know!
Hi Johan, re "we had a configuration issue since Thursday": I was watching the TTN status page closely but didn't see any posting about the configuration issue you were working on during the outage. Is the status page meant to record these types of outages? If not, where is the best place to stay aware of system-wide issues being worked on? @Maj and @descartes also seemed to be unaware there was an outage. Rgds Nick
No one was aware there was an outage because @SteveLeigh-AUS & myself were still figuring out the details. At the time two, possibly three, individuals had reported a problem, so it was not identifiable as an outage as such, which I’d define as being system wide (to a region at the least) and impacting all users in a clearly identifiable way. Once we’d concluded we’d exhausted all reasonable tests, it was escalated to ops.
It is a curious situation - a sort of Heisenberg / Schrödinger moment when you cross the threshold of investigation to conclude that there is an issue - in this instance we could only come to a Holmesian conclusion that, in the absence of evidence, the only logical explanation was that there was an issue. My namesake would no doubt be pleased with this kind of zen of debugging: I think there is an issue, therefore there might be. As opposed to “it’s not working so it must be the servers”.
I’ll leave @johan to go into the details of when an issue becomes an outage and the circumstances that trigger a post to the status page.
PS, Whilst you were watching the status page, did you see the Data Storage retrieve issue on v2 that was system wide - it was fixed whilst I was still providing the details …
@descartes Hi Nick, mate, you managed to put a lot of big words in your response and I’m not really smart enough to understand them all. But even I could see instantly that you were making the classic debugging mistake of “assuming the most convenient option first”. We all do it, all the time. If there’s a fancy word for it let me know, and I’ll be sure to use it. (“McClouding the issue” maybe?)
That’s my narky little jab done. Aside from that, I do appreciate your help and we did get to a solution within a few days. Actually I took Johan’s response to mean he’d been aware of and working on the configuration issue since Thursday. Reading more closely, I may have got that wrong, so apologies for my misunderstanding.
I don’t understand this “most convenient option” - it’s not hugely convenient as a volunteer to debug an issue I can’t replicate with someone on the other side of the planet. But I get plenty of support from others, so it’s just giving back.
How is it the most convenient option to do some testing before coming to a conclusion that there is a problem, let alone declare the problem serious? If you hang out more on the forum, you’ll see a lot of posts that start out with someone assuming that the servers are wrong when it is something they’ve done or not done.
But I did enjoy the concept that I wasn’t aware of the issue because TTI hadn’t told me about it, when they weren’t aware of it either.
I think that’s the crux of the matter: there was a subtle config issue that manifested itself on Thursday but wasn’t uncovered until later.
PS, Please be aware that we like the discussions to remain ad hominem free.
Sorry Nick, I take it back. I promise no ad hominems in the future :) I didn't realise you were a volunteer, I thought you said you were a paid troubleshooter or something. Bad move on my part. (I had to google ad hominems btw)
Expanding on this for clarification - I can touch type so I can bang out suggestions on debugging strategies and pointers to solutions. Sometimes I lose track of the details (like “Do you have or can create v2 devices”), forgetting at that point that v2 is now read-only. I try to facilitate rather than be prescriptive as material discovered is learnt, material handed out is just acted on, so I rattle off some ideas as avenues of investigation.
My “day” job is varied but is essentially IoT / Security Electronics & supporting systems. Some of it is relatively safety critical so the docs for testing can be quite detailed. Hence the contrasting comment about being paid to supply the detail.
It’s all good, but time for sleep.
My two cents - have been developing software since 1977 and it took a while but I now always believe the fault is in my software before I look elsewhere… school of hard knocks!
The status page is updated when the service is seriously degraded in performance or availability. That could be message throughput or the availability of certain features. So mostly, if an issue doesn’t fall into this category, we don’t update the status page. But there’s a big gray area here.
The reason we didn’t update the status page this time is that this was an issue only with downlink and only within one V2 region (Australia), which is not operated by TTI. The configuration issue was on us though, just for the record, but we never had an Australia component on the TTN status page because TTI doesn’t operate it. Now, this was one of the operational transparency issues that we had with V2, which is why V3 is being operated by TTI and things would be different there.
We received some reports and we could see that some metrics were not entirely healthy (that is, less downlink in one region, but still flowing). We could then trace that back to Thursday, when we made some configuration changes globally. So we started digging and found the issue quite quickly. As usual, it was a combination of several factors. Had we known before the weekend that something was not right and needed investigation, especially with infrastructure TTI manages, we might have updated the status page though.
In the 1980s I wrote hundreds of thousands of lines of Macro 32 assembler code that were 100% perfect until they were issued to the users. Since then I’ve decided to only code for myself.