Is there a TTN status page?

This morning all our gateways appeared disconnected in the TTN console. Yet, the gateway log files didn’t indicate any obvious errors. That made me wonder: is there a TTN status page? Put differently, what resource would have told me right then whether TTN was experiencing issues that might be related to the behavior I was seeing?
Side note: a few hours later, all gateways were listed as connected again, without us intervening in any way.


I suggested a long time ago to make the blue cloud in the logo (top left corner) turn red when something’s down or not working as it should :innocent:

The closest is probably the #ops channel on Slack. (If you don’t have access, then request an invite on https://account.thethingsnetwork.org)


Today @egourlao and I deployed a status page to https://status.thethings.network. We’re still working on hooking up our automated system checks, but from now on we’ll start posting incidents there.


Great work! You two deserve a cold beer :slight_smile:

Need a ‘back to forum’ button.

And it’s really nice that you’re showing the history too.

Well done, really nice work!

Except… that 1860 × 480 banner at the top. Why inline it with a data URI, making it not cacheable by user agents?
100 extra points for offering RSS, thanks.

Oh, this should read “really nice work, cachethq.io”. Anyway, thanks for setting this up :smile:

Can’t get my nodes to join at the moment. I’ve tried rebooting the gateway and the nodes.

I also tried my TTN node and noticed that the data page in the console only updated a few minutes later, when it used to be instant.
Anyone else seeing the same this afternoon?

I see this in my Multitech TTN logs:

13:19:04  INFO: [down] TTN received downlink with payload
13:19:04  INFO: [TTN] downlink 33 bytes
src/jitqueue.c:323:jit_enqueue(): ERROR: Packet REJECTED, timestamp seems wrong, too much in advance (current=96132780, packet=705017259, type=0)
13:19:04  ERROR: [down] Packet REJECTED (jit error=2 Too early to queue this packet)
lgw_receive:1165: FIFO content: 1 52 1 5 17
lgw_receive:1184: [2 17]

This is presumably the gateway receiving Join responses, as I’m seeing join failures on a node I’m trying to test.

I think some clock has got out of whack on the servers.

Andrew

Timestamps in the response packets are not (wall) clock based. They’re calculated based on the timestamps in the request packets.
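
To make that concrete, here’s a minimal sketch, assuming the standard 1-second RX1 delay and the Semtech concentrator’s 32-bit microsecond counter (the function name is made up for illustration):

```python
# Minimal sketch (not TTN's actual code): the downlink timestamp is the
# uplink's concentrator counter value plus the receive-window delay.
# The counter is 32-bit and counts microseconds, wrapping every ~72 min.

COUNTER_WRAP = 2 ** 32

def rx1_timestamp(uplink_count_us, rx1_delay_s=1):
    """RX1 downlink timestamp = uplink counter value + receive delay."""
    return (uplink_count_us + rx1_delay_s * 1_000_000) % COUNTER_WRAP

# With the values from the log above, the packet is ~10 minutes ahead of
# the gateway's current counter, which is why the JIT queue rejects it:
current, packet = 96_132_780, 705_017_259
print((packet - current) / 1e6, "seconds in advance")  # ~608.9
```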

I think the central infrastructure is suffering from delays; the last time this happened, things were overloaded.

@htdvisser it might be a good idea to base the status on end-to-end processing checks for the back-end. Based on the logs I’ve seen this week, we’ve had two instances where the EU region was not able to process data in a timely manner while no incidents were listed on the status page. Do you agree? If so, we (== the community) could look into creating them.


My RAK811 based node is not able to join via OTAA at the moment.

excellent idea …

Back to normal now :)
I know it’s a lot of work for someone, but it would answer a lot of questions if this “time delay” metric could be measured and the hourly averages plotted on a status page graph (spanning the last 7 days).

I know the status page has to be updated manually and that there is some delay before a known network problem gets reported.
Yesterday there was definitely something wrong.

I was in the middle of testing something and noticed it… the first thing I do is visit the status page.
Nope… everything fine, so I restarted my gateway(s) and checked the nodes’ batteries/circuits/code etc.

No incidents yesterday?

In the end I came to the conclusion that there was something wrong, and not on my side.
The invisible NOC people fixed it very fast… kudos for that.

My hope is that in the future there will be a more reliable way to detect network problems and to update the status page as fast as possible.


Sure. Any volunteers who want to build that?

This could easily be done by looking at the message traces in the bridge. JoinAccept and Downlink messages often contain traces that also include the first “receive” of the bridge. If that bridge was the same instance (compare service_id/service_name), you can calculate the full uplink-downlink time. Unfortunately, the clocks in our backend are not always sufficiently in sync to compare timestamps between different instances.
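
A rough sketch of that calculation, assuming a hypothetical trace shape with a per-entry service_id, step name, and timestamp (the real bridge trace format may differ):

```python
# Hypothetical trace entries: each has a service_id, a step name, and a
# Unix timestamp. Real bridge traces may be structured differently.

def uplink_downlink_latency(trace):
    receive = next((t for t in trace if t["step"] == "receive"), None)
    downlink = next((t for t in trace if t["step"] == "downlink"), None)
    if receive is None or downlink is None:
        return None
    # Only compare timestamps from the same bridge instance; clocks on
    # different instances are not always sufficiently in sync.
    if receive["service_id"] != downlink["service_id"]:
        return None
    return downlink["time"] - receive["time"]

trace = [
    {"service_id": "bridge-eu-1", "step": "receive", "time": 1500000000.12},
    {"service_id": "bridge-eu-1", "step": "downlink", "time": 1500000000.87},
]
print(uplink_downlink_latency(trace))  # ~0.75 seconds
```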

Another idea could be to build a tool for collecting end-to-end metrics. That tool could re-use code from ttnctl to create a device, join that device, receive the event on MQTT, schedule a downlink, send an uplink, receive the event on MQTT, receive the downlink, and delete the device, measuring the latency of each step.
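
A skeleton of such a probe might look like this (a sketch only; the step names below are placeholders for the real actions, which would go through ttnctl or the handler APIs):

```python
import time

# Placeholder step names; in a real probe each would call the TTN
# backend (e.g. re-using code from ttnctl). Purely illustrative.
STEPS = [
    "create_device",
    "join_device",
    "send_uplink",
    "receive_uplink_on_mqtt",
    "schedule_downlink",
    "receive_downlink",
    "delete_device",
]

def run_probe(execute_step):
    """Run each step and record how long it took, in seconds."""
    latencies = {}
    for step in STEPS:
        start = time.monotonic()
        execute_step(step)  # placeholder for the real network action
        latencies[step] = time.monotonic() - start
    return latencies

# Dummy executor that just sleeps, to show the shape of the output:
print(run_probe(lambda step: time.sleep(0.01)))
```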

If those metrics could be scraped with Prometheus, I don’t think it would be a lot of work to add a graph to the status page.
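
For instance, a probe like the one sketched above could expose its step latencies with the Python prometheus_client library (the metric name is made up):

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric: latency of each end-to-end probe step, in seconds.
STEP_LATENCY = Gauge(
    "ttn_probe_step_latency_seconds",
    "Latency of each end-to-end probe step",
    ["step"],
)

start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics

while True:
    # In a real probe this value would come from run_probe() above; here
    # we fake a measurement so the example is self-contained.
    STEP_LATENCY.labels(step="join_device").set(random.uniform(0.5, 6.0))
    time.sleep(60)
```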

The Things Network Foundation unfortunately doesn’t have the financial or personnel resources to have someone on call 24/7 to jump onto these issues. If there are severe issues requiring someone to take action, we try to consistently update the status page. If the issues resolve themselves, this may indeed not happen.

Keep in mind that the public community network is provided for free and on a best-effort basis. Although we do our best to make everything work smoothly, we can’t fully control what people do on our network, and sometimes this causes incidents. In such cases, we can’t always guarantee that (1) we spot the incident in a timely manner and (2) there is someone available who can fix the issue. This is why there are no guarantees on service level or incident resolution time on the public community network. If that is something you do need, you can consider the private network solutions offered by The Things Industries or any of the other commercial partners.