TryStack Outage
Wednesday Oct 28, 2015
Impact -
Earlier Wednesday morning, TryStack (
http://x86.trystack.org/ )
experienced an outage lasting several hours. The outage impacted all
tenants and appears to have been caused by exhaustion of system
resources tied to tenant networks, which had built up over the course
of several months. To return services to normal, resources (networks,
router ports, etc.) belonging to tenants without any running VMs were
manually deleted, freeing up system resources on our neutron host and
returning TryStack to normal operations.
Per Tenant Fix -
If you use TryStack as a sandbox environment, you may need to delete
and recreate the router in your tenant if your launched guests are not
acquiring a DHCP address correctly or cannot be reached over an
associated floating IP address.
Ongoing Resource Management -
To prevent exhaustion of system resources, we have been automatically
deleting VMs 24 hours after they are created. Additionally, we clear
router gateways and floating IP allocations 12 hours after they are
set (the public subnet is a /24 network and anyone with an account can
use it free of charge, hence the need to cull resources aggressively).
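
As a rough illustration of the age-based policy above, here is a minimal
Python sketch. It is not the script TryStack runs: the resource records,
the `expired` helper, and the 203.0.113.0/24 prefix (a documentation
range standing in for the real public subnet) are all assumptions made
for the example. Only the 24-hour and 12-hour lifetimes and the /24 size
come from the text.

```python
import ipaddress
from datetime import datetime, timedelta, timezone

# Lifetimes taken from the policy described in the text.
VM_TTL = timedelta(hours=24)
FLOATING_IP_TTL = timedelta(hours=12)


def expired(resources, ttl, now):
    """Return the resources whose 'created' timestamp is at least ttl old."""
    return [r for r in resources if now - r["created"] >= ttl]


# Hypothetical sample data; a real job would query the cloud for these.
now = datetime(2015, 10, 28, 12, 0, tzinfo=timezone.utc)
vms = [
    {"id": "vm-a", "created": now - timedelta(hours=30)},
    {"id": "vm-b", "created": now - timedelta(hours=2)},
]
floating_ips = [
    {"id": "fip-a", "created": now - timedelta(hours=13)},
    {"id": "fip-b", "created": now - timedelta(hours=11)},
]

expired_vm_ids = [r["id"] for r in expired(vms, VM_TTL, now)]
expired_fip_ids = [r["id"] for r in expired(floating_ips, FLOATING_IP_TTL, now)]

# A /24 has 256 addresses; excluding the network and broadcast addresses
# leaves at most 254 that could ever be handed out.
public_subnet = ipaddress.ip_network("203.0.113.0/24")  # placeholder prefix
max_usable_addresses = public_subnet.num_addresses - 2

print(expired_vm_ids, expired_fip_ids, max_usable_addresses)
```

With so few public addresses to share among all accounts, a short
lifetime on gateways and floating IPs is what keeps the pool from
running dry.
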
Until today we had not been purging other resources, and over the course
of the last three to four months, the tenant/project count has grown to
just over 1300 tenants. Many users log in a few times, create their
networks and routers, launch some test VMs, and may not revisit
TryStack for some time. Each time, qrouter and qdhcp network namespaces
are created, along with ports in OVS and a dnsmasq process for each
subnet the tenant creates, and these remain in place after the VMs are
gone. We are adding management and culling of these additional resource
types using the ospurge utility ( see:
https://github.com/openstack/ospurge )
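
The selection rule used in the manual cleanup (tenants that hold network
resources but have no running VMs) could be sketched in Python roughly
as follows. This is only an illustration with made-up data: the
`purge_candidates` helper, the tenant IDs, and the dictionary layout are
hypothetical, and the sketch does not call ospurge or any OpenStack API.

```python
def purge_candidates(tenants, running_vm_counts):
    """Return sorted IDs of tenants that own network resources
    but currently have no running VMs."""
    return sorted(
        tenant_id
        for tenant_id, resources in tenants.items()
        if running_vm_counts.get(tenant_id, 0) == 0
        and any(resources.values())
    )


# Hypothetical inventory: tenant ID -> network resources it owns.
tenants = {
    "tenant-a": {"networks": ["net-1"], "routers": ["router-1"]},
    "tenant-b": {"networks": ["net-2"], "routers": []},
    "tenant-c": {"networks": [], "routers": []},
}
# Hypothetical count of running VMs per tenant; missing means zero.
running_vm_counts = {"tenant-a": 2}

candidates = purge_candidates(tenants, running_vm_counts)
print(candidates)
```

Here tenant-a is skipped because it still has running VMs, and tenant-c
is skipped because it has nothing left to clean up.
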
IRC Alerting -
We have also added IRC bots that can announce alerts in the #trystack
channel on Freenode. Alerts are sent to the IRC bot by a Nagios
instance monitoring the environment.
Grafana / Graphite -
We are currently building dashboards in Grafana, with a Graphite
backend and collectd agents sending data to Graphite. Will Foster has
built an initial dashboard to see resource utilization and trending at
a glance (Thanks Will!). The dashboard(s) are not yet ready for public
consumption, but we plan on making a read-only Grafana interface
available in the near future. For a sample of what the dashboard will
look like, see:
http://ibin.co/2Kf8i9WxsWIl
(The image is only a screenshot, so it shows just part of the
dashboard.)
--
Red Hat, Inc.
100 East Davie Street
Raleigh, NC 27601
"All tyranny needs to gain a foothold is for people of good conscience
to remain silent." --Thomas Jefferson