Dealing with incidents within TTN

First of all, apologies for the late response from my side. I was traveling last week and off yesterday, but I’ve been following internal and public (#ops, Twitter, here) messages closely. I appreciate the constructive feedback, kind words and support a lot. Thanks @LanMarc77 for raising this, and “special thanks” to the person who spent a few hours on a Twitter post with the TTN logo in flames and a skull, suggesting we’re hacked. It shows, positively and negatively, that people care. And that’s a good thing after all.

There’s an unfortunate mix of circumstances that caused the long downtime:

  • what: the busiest cluster (EU, operated by The Things Network Foundation *)
  • why: issues with component interconnections that are notoriously fragile and hard to get right due to the design of V2; add to that the network doubling year-on-year and the shift of focus to V3 over the past two years
  • when: during a weekend (which we were actually enjoying)

* A small note on operations: yes, the TTN core team is on the TTI payroll, but TTN is not a TTI-operated service and will not be. We allow for spending time on non-profit Foundation activities, including operations, but it is best effort.

That being said, I see a task for me in bridging the gap between the limited operational availability of TTI staff and the seemingly unlimited willingness of the TTN community to maximize service uptime.

I’m replying to a few suggestions above to steer the discussion in that direction. Please feel free to make other suggestions.

This assumption is correct; I would say that “fixing” over 90% of operational issues (in V2) is simply a matter of restarting Docker containers in the right order. Unfortunately, we cannot give direct access to Docker, as that implies system root access and access to keys and passwords, which implies access to user data, which results in privacy and GDPR issues, etc.
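To make “restarting in the right order” concrete, here is a minimal sketch of what such a predefined restart sequence could look like. The container names and the 30-second settle time are assumptions for illustration; the real V2 component names and the correct order are specific to the deployment.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// Hypothetical V2 component names, in a hypothetical dependency order;
// the actual names and order are internal to the TTN EU deployment.
var restartOrder = []string{
	"ttn-discovery",
	"ttn-router",
	"ttn-broker",
	"ttn-networkserver",
	"ttn-handler",
}

func main() {
	for _, name := range restartOrder {
		log.Printf("restarting %s", name)
		out, err := exec.Command("docker", "restart", name).CombinedOutput()
		if err != nil {
			log.Fatalf("restart of %s failed: %v\n%s", name, err, out)
		}
		// Give each component time to come up before restarting the next.
		time.Sleep(30 * time.Second)
	}
	log.Print("restart sequence completed")
}
```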

How about selecting 10-15 active and trusted community members spread across timezones who get access to a custom Slackbot command (e.g. /restart ttn-eu) that triggers a predefined restart sequence? Only if, say, 3 individual members execute that command independently within 15 minutes does it actually happen. And to avoid abuse, at most once every hour. I know it’s monkey patching, but it gives control. A sketch of that quorum logic follows below.
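A minimal sketch of the quorum gate described above, assuming the Slackbot hands us the member name per command; the member names and thresholds are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RestartGate implements the quorum described above: a restart fires
// only when 3 distinct trusted members issue the command within a
// 15-minute window, and at most once per hour.
type RestartGate struct {
	votes       map[string]time.Time // member -> time of their latest vote
	lastRestart time.Time
}

func NewRestartGate() *RestartGate {
	return &RestartGate{votes: make(map[string]time.Time)}
}

// Vote registers a /restart command from a member and reports whether
// the predefined restart sequence should be triggered now.
func (g *RestartGate) Vote(member string, now time.Time) (bool, error) {
	if !g.lastRestart.IsZero() && now.Sub(g.lastRestart) < time.Hour {
		return false, errors.New("rate limited: at most one restart per hour")
	}
	g.votes[member] = now

	// Count distinct members that voted within the last 15 minutes,
	// expiring stale votes along the way.
	quorum := 0
	for m, t := range g.votes {
		if now.Sub(t) <= 15*time.Minute {
			quorum++
		} else {
			delete(g.votes, m)
		}
	}
	if quorum >= 3 {
		g.lastRestart = now
		g.votes = make(map[string]time.Time) // reset for the next incident
		return true, nil
	}
	return false, nil
}

func main() {
	g := NewRestartGate()
	now := time.Now()
	for _, member := range []string{"alice", "bob", "carol"} { // hypothetical members
		fire, err := g.Vote(member, now)
		fmt.Printf("%s voted: trigger=%v err=%v\n", member, fire, err)
	}
}
```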

Yes, and I’m happy to cover this topic at the other 2019 Conferences where I’ll be (India and Adelaide).

Right. We need to improve this as well. So, as we speak, we’re spending time improving (real-time) communications on TTN operations, including a new status page. If automated reporting works well, we’ll be using the @ttnstatus Twitter account too.

TTN is a special network and is operationally very different from TTI networks: TTN clusters are fully connected, and the TTN Foundation only operates a part of it (a big part, but not all); there’s heavy load, and there’s lots of experimentation and testing at the device and gateway level that would flag alerts in an enterprise network (and hence put unnecessary burden on TTI ops), etc. That being said, as we migrate to V3, we will converge the operational setup of TTN Foundation’s clusters and TTI Cloud Hosted. It will not be the same infrastructure, but it will be more manageable.

Yes. This is what Packet Broker is going to provide. In V3, Packet Broker is the backbone of LoRaWAN traffic between public TTN clusters, private TTI clusters, anyone running the open source The Things Stack, and even other network servers (through LoRaWAN stateless passive roaming or by implementing the Packet Broker protocol).

As we’ve been promising for a long time, it will become easier for the community to operate TTN clusters as well (cc @kersing). I have received several requests from all parts of the world to operate one. There is nothing wrong with overlapping regions. However, we do need Packet Broker for this. We also need to start structurally measuring the uptime of all TTN clusters, in a way that lets the community choose which TTN cluster to pick for connecting a gateway and registering devices (see the sketch below). If it turns out that the public TTN Groningen community cluster has better uptime than the TTN Foundation’s EU cluster, so be it. We’ll hand out an award for the best community-contributed cluster at the Conference. We’ll figure it out; willingness is not the issue, but the technology is not there yet.
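On structurally measuring uptime: a minimal sketch of an external probe that any community member could run against each cluster. The health-check URLs are hypothetical (V2 does not expose a standard public health endpoint, so this assumes each cluster publishes one), and the polling parameters are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// measureUptime polls a cluster health endpoint and returns the
// percentage of successful probes over the sampling window.
func measureUptime(url string, interval time.Duration, samples int) float64 {
	client := &http.Client{Timeout: 5 * time.Second}
	up := 0
	for i := 0; i < samples; i++ {
		resp, err := client.Get(url)
		if err == nil {
			if resp.StatusCode == http.StatusOK {
				up++
			}
			resp.Body.Close()
		}
		time.Sleep(interval)
	}
	return 100 * float64(up) / float64(samples)
}

func main() {
	// Hypothetical probe targets, for illustration only.
	clusters := map[string]string{
		"ttn-eu":        "https://ttn-eu.example.org/healthz",
		"ttn-groningen": "https://ttn-groningen.example.org/healthz",
	}
	for name, url := range clusters {
		fmt.Printf("%s: %.1f%% of probes OK\n", name, measureUptime(url, 10*time.Second, 30))
	}
}
```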

Now, I hear you thinking: “but who operates Packet Broker, and how can we make that redundant?” PB will be TTI operated. But the APIs are open, and V3 peering allows for redundancy (i.e. multiple independent upstreams; see the sketch below). We are open to working with neutral parties (e.g. universities, internet exchanges, domain registrars, etc.) to provide redundancy for peering in the TTN community network. That is regardless of community-contributed TTN clusters as described above: both provide decentralization.
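To illustrate what “multiple independent upstreams” buys you, here is a minimal sketch of fan-out forwarding where delivery succeeds if at least one upstream accepts the packet. This does not model Packet Broker’s actual protocol or APIs; the Upstream interface, the stub, and the names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Upstream is a stand-in for a peering connection; the real Packet
// Broker protocol is not modeled here.
type Upstream interface {
	Forward(packet []byte) error
	Name() string
}

// stubUpstream simulates an upstream that is either up or down.
type stubUpstream struct {
	name string
	up   bool
}

func (s stubUpstream) Name() string { return s.name }
func (s stubUpstream) Forward(packet []byte) error {
	if !s.up {
		return errors.New(s.name + " unreachable")
	}
	return nil
}

// forwardAll sends the same uplink to every configured upstream in
// parallel; delivery succeeds if at least one upstream accepts it,
// which is the redundancy that independent upstreams provide.
func forwardAll(upstreams []Upstream, packet []byte) error {
	var wg sync.WaitGroup
	results := make(chan error, len(upstreams))
	for _, u := range upstreams {
		wg.Add(1)
		go func(u Upstream) {
			defer wg.Done()
			results <- u.Forward(packet)
		}(u)
	}
	wg.Wait()
	close(results)
	for err := range results {
		if err == nil {
			return nil // at least one upstream accepted the packet
		}
	}
	return fmt.Errorf("all %d upstreams failed", len(upstreams))
}

func main() {
	upstreams := []Upstream{
		stubUpstream{name: "packet-broker-primary", up: false},
		stubUpstream{name: "neutral-party-mirror", up: true},
	}
	if err := forwardAll(upstreams, []byte("uplink")); err != nil {
		fmt.Println("delivery failed:", err)
		return
	}
	fmt.Println("uplink delivered via at least one upstream")
}
```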

I don’t think that throwing money at it will help. Also, to whom would TTI provide the SLA? What if we don’t meet the SLA? This, plus the operational differences outlined above, makes me believe that we should keep TTN truly a community initiative, where TTI is a contributor on a best effort basis, like, hopefully, many other contributors that will offer a public TTN cluster. Sure, TTN Foundation operated clusters are a showcase of TTI’s commercial abilities, but only to a certain extent, just as we do and will allow other community network contributors to optionally (!) upsell services.

I’m not against donations though; in fact, I think it’s good to consider them to cover the operational costs of the TTN Foundation. Currently, TTI pays the exponentially increasing Azure bills and pays for the time spent on ops, and it would be great if TTN could become financially self-sustaining (through donations) and operationally independent (through decentralization and potentially Slackbots?).


I think Reading UK comes up first; can we get started there with TTN Core/TTI delegate(s), even if you aren’t there yourself? It may be useful to continue the process (from this thread) of gathering ideas/needs/concerns/issues/strawman solution suggestions.

Sure, please do, and post any findings or things to discuss here. @wienkegiezeman is attending and is a good delegate of the TTN core team and TTI as well :wink: I’ll keep an eye on this thread too.

This has been an interesting and shocking thread. I’ve not paid much attention in recent months, but to me this thread clearly states that I cannot rely on TTN at all. It’s useful for doing experiments, and that’s it.

As someone who has built and operated a large cloud service with a significant free offering, I’m rather shocked by the statement “our internal on-call system only alerts the team about incidents with the public community network during working hours, and not at night or during weekends”. It would never have occurred to us to leave the free offering broken for a whole weekend and then just say “hey, best effort!”.

What I would expect is for the free offering to be the lowest priority, and of course there’s no recourse/compensation, nor a 24x7 support line to call, etc. But to leave it completely abandoned, wow.

I suppose what it really means is that TTN has become a pure liability to TTI, i.e., it provides zero benefits, and that’s why TTI can’t provide a minimum of commitment?

The best thing that could come out of this discussion would be a true community network that is operated by volunteers who care about the mission and that peers with TTN. Then get a majority of GWs to switch to the community network and let TTN die…

OK, I’m sure I’m now going to get flamed off the forum…

And you see that happening?

I am still impressed by how this community grows and how TTI grows worldwide. I see it as an accomplishment, and I accept the ‘growing pains’; that’s the reason I am still a volunteer. :sunglasses:

In the sense that the alternative, for those who need to be sure they can fix anything that may break, is local networks that don’t peer with TTN, you are correct: peering is the best outcome.
