Interesting report.
I was discussing this with Joe Talerico a few days ago: starting with
Liberty, you could also leverage the Neutron QoS service to cap the
tenant traffic.
https://www.openstack.org/summit/tokyo-2015/videos/presentation/qos-a-neu...
http://www.ajo.es/post/126667247769/neutron-qos-service-plugin
It's still in its early stages, and will only let you limit egress
traffic from ports, either by attaching a network to a policy or by
attaching a port to a policy.
You can't set up a default policy yet, but you could, for example,
periodically list newly created networks and attach those networks to
a policy limiting egress to 500kbps or less.
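A rough sketch of that periodic pass, in Python. It only shows the
selection logic and the request bodies. The dict shapes assume what the
Neutron API returns for networks when the QoS extension is loaded (a
qos_policy_id field), and the policy ID used with it would be a
placeholder of your own. The actual calls to Neutron (for example
list_networks() and update_network() in python-neutronclient) are left
out.

```python
# Sketch only: the network dicts mimic Neutron API network resources with
# the QoS extension enabled. Function names here are illustrative.

def networks_without_policy(networks):
    """Return the networks that have no QoS policy attached yet."""
    return [net for net in networks if not net.get("qos_policy_id")]


def bandwidth_limit_rule_body(max_kbps=500):
    """Request body for a bandwidth limit rule (500kbps by default)."""
    return {"bandwidth_limit_rule": {"max_kbps": max_kbps}}


def attach_policy_body(policy_id):
    """Request body for a network update that attaches the given policy."""
    return {"network": {"qos_policy_id": policy_id}}
```

Run from cron or a similar scheduler, each pass would update every
network returned by networks_without_policy() with the body from
attach_policy_body().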
Another strategy could be to watch for router creation and create a
limit on the internal ports of the routers (effectively limiting the
tenant ingress from the public network).
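As a sketch of the router variant, again in Python and again only the
selection part: it assumes port dicts shaped like Neutron API port
resources, where router internal ports carry the device_owner value
"network:router_interface" (distributed routers use a different owner
string, not handled here). The function names are illustrative.

```python
# Sketch only: picks the router internal ports that still lack a QoS policy.

ROUTER_INTERFACE_OWNER = "network:router_interface"


def unlimited_router_ports(ports):
    """Return router internal ports with no QoS policy attached."""
    return [
        port for port in ports
        if port.get("device_owner") == ROUTER_INTERFACE_OWNER
        and not port.get("qos_policy_id")
    ]


def attach_port_policy_body(policy_id):
    """Request body for a port update that attaches the given policy."""
    return {"port": {"qos_policy_id": policy_id}}
```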
That would prevent the "chatty neighbor" cases, or abuse of TryStack,
but of course it would have to be noted somewhere, otherwise people
could think OpenStack performs horribly on networking ;)
Cheers,
Kambiz Aghaiepour wrote:
TryStack Outage
Wednesday Oct 28, 2015
Impact -
Earlier Wednesday morning, TryStack ( http://x86.trystack.org/ )
experienced an outage for several hours beginning in the early hours of
the day. The outage impacted all tenants and appears to have been
caused by exhaustion of services related to tenant networks building
up over the course of several months. In order to return services to
normal, resources (networks, router ports, etc) for tenants without any
running VMs were manually deleted, freeing up system resources on our
neutron host and returning TryStack to normal operations.
Per Tenant Fix -
If you have occasion to use TryStack as a sandbox environment, you may
need to delete and recreate the router in your tenant if you find your
launched guests are not acquiring a DHCP address correctly or cannot be
reached over an associated floating IP address.
Ongoing Resource Management -
In order to prevent exhaustion of system resources, we have been
automatically deleting VMs 24 hours after they are created.
Additionally, we clear router gateways as well as floating IP
allocations 12 hours after they are set (the public subnet is a /24
network and anyone with an account can use the public subnet free of
charge, hence the need for aggressively culling resources).
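To make the two lifetimes concrete, here is a minimal Python sketch of
the expiry check. It is only an illustration of the rule as described
above, not the script actually running on TryStack. The constant and
function names are made up for the example.

```python
from datetime import datetime, timedelta

# Lifetimes taken from the policy described above.
VM_LIFETIME = timedelta(hours=24)
GATEWAY_AND_FLOATING_IP_LIFETIME = timedelta(hours=12)


def is_expired(set_at, lifetime, now):
    """True once at least `lifetime` has passed since `set_at`."""
    return now - set_at >= lifetime


# Example: a VM created at 06:00 is due for deletion at 06:00 the next day.
created = datetime(2015, 10, 27, 6, 0)
due = is_expired(created, VM_LIFETIME, datetime(2015, 10, 28, 6, 0))
```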
Until today we had not been purging other resources, and over the course
of the last three to four months, the tenant/project count has grown to
just over 1300 tenants. Many users log in a few times, create their
networks and routers, launch some test VMs, and may not revisit
TryStack for some time. As a result, the qrouter and qdhcp network
namespaces and the ports created in OVS stay in place, along with the
associated dnsmasq processes for each subnet the tenant creates. We are
adding management and culling of these additional resource types using
the ospurge utility ( see: https://github.com/openstack/ospurge ).
IRC Alerting -
We have also added IRC bots that can announce alerts in the #trystack
channel on Freenode. Alerts are sent to the IRC bot by a Nagios
instance monitoring the environment.
Grafana / Graphite -
We are currently working on building dashboards using grafana, using a
graphite backend, and collectd agents sending data to graphite. Will
Foster has built an initial dashboard to see resource utilization and
trending at a glance (Thanks Will!). The dashboard(s) are not yet ready
for public consumption, but we plan on making a read-only grafana
interface available in the near future. For a sample of what the
dashboard will look like, see: http://ibin.co/2Kf8i9WxsWIl
(The image shows only part of the dashboard, as it is just a
screenshot.)
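For anyone who wants to feed extra data points into the same graphite
backend, its plaintext protocol is one line per sample in the form
"path value timestamp". A minimal Python sketch of building such a line
follows. The metric path in the comment is invented for the example and
is not one of the metrics collected on TryStack.

```python
def graphite_line(path, value, timestamp):
    """Format one sample for graphite's plaintext protocol."""
    return "{} {} {}\n".format(path, value, int(timestamp))


# Hypothetical metric name, for illustration only:
# graphite_line("trystack.neutron.qrouter_namespaces", 42, 1446012000)
```

The resulting string would be written to the carbon plaintext listener
(TCP port 2003 by default).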