[Rdo-list] TryStack Outage Report 2015-10-28

Miguel Angel Ajo mangelajo at redhat.com
Fri Oct 30 09:23:12 UTC 2015


Interesting report,

     I was discussing this with Joe Talerico a few days ago: starting
with Liberty, you could also leverage the neutron QoS service to cap
tenant traffic.

https://www.openstack.org/summit/tokyo-2015/videos/presentation/qos-a-neutron-n00bie

http://www.ajo.es/post/126667247769/neutron-qos-service-plugin

    It's still in its early stages, and will only let you limit egress
traffic from ports, either by attaching a network to a policy or by
attaching a port to a policy.

    You can't set up a default policy yet, but you could, for example,
periodically list newly created networks and attach those networks to a
policy limiting egress to 500kbps or less.
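
    A rough sketch of that periodic job, in Python against the neutron
REST API. The policy UUID, endpoint URL and token are placeholders, and
skipping external networks is my own assumption, not something the QoS
service requires. I'm also assuming the QoS extension is enabled, so that
networks expose a qos_policy_id attribute:

```python
import json
import urllib.request

# Hypothetical UUID of a pre-created policy holding a 500kbps limit rule.
POLICY_ID = "00000000-0000-0000-0000-000000000000"


def networks_needing_policy(networks):
    """Return IDs of tenant networks with no QoS policy attached.

    `networks` is the list found under the "networks" key of a
    GET /v2.0/networks response. External networks are skipped
    (an assumption: we only want to cap tenant networks).
    """
    return [
        net["id"]
        for net in networks
        if not net.get("qos_policy_id") and not net.get("router:external")
    ]


def attach_policy(neutron_url, token, network_id, policy_id=POLICY_ID):
    """Attach the policy to one network via PUT /v2.0/networks/{id}."""
    body = json.dumps({"network": {"qos_policy_id": policy_id}}).encode()
    req = urllib.request.Request(
        f"{neutron_url}/v2.0/networks/{network_id}",
        data=body,
        method="PUT",
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

    Run from cron, the job would fetch the network list, pass it through
networks_needing_policy() and call attach_policy() for each ID returned.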

     Another strategy could be watching for router creation, and
creating a limit on the internal ports of the routers (effectively
limiting the tenant ingress from the public network).
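
    The router variant could look something like the sketch below. It
assumes internal router ports can be recognised by the device_owner value
"network:router_interface" (distributed routers use a different owner
string and are not handled here), and the helper names are made up for
illustration:

```python
# device_owner value neutron sets on a router's internal (tenant-side) ports.
ROUTER_INTERFACE = "network:router_interface"


def router_ports_needing_policy(ports):
    """Return IDs of internal router ports with no QoS policy attached.

    `ports` is the list found under the "ports" key of a
    GET /v2.0/ports response.
    """
    return [
        port["id"]
        for port in ports
        if port.get("device_owner") == ROUTER_INTERFACE
        and not port.get("qos_policy_id")
    ]


def port_update_body(policy_id):
    """Request body for PUT /v2.0/ports/{id} that attaches the policy."""
    return {"port": {"qos_policy_id": policy_id}}
```

    Since the policy only limits egress from a port, applying it to the
router's internal port caps what the router sends towards the tenant
network, which is the tenant's ingress.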

    That would prevent the "chatty neighbor" cases, or the abuse of
trystack, but of course that would have to be noted somewhere, otherwise
people could think openstack performs horribly on networking ;)


    Cheers,


Kambiz Aghaiepour wrote:
> TryStack Outage
> Wednesday Oct 28, 2015
>
> Impact -
>
> Earlier Wednesday morning, TryStack ( http://x86.trystack.org/ )
> experienced an outage for several hours beginning in the early hours of
> the day.  The outage impacted all tenants and appears to have been
> caused by exhaustion of services related to tenant networks building
> up over the course of several months.  In order to return services to
> normal, resources (networks, router ports, etc) for tenants without any
> running VMs were manually deleted freeing up system resources on our
> neutron host and returning TryStack back to normal operations.
>
> Per Tenant Fix -
>
> If you have occasion to use TryStack as a sandbox environment, you may
> need to delete and recreate your router in your tenant if you find your
> launched guests are not acquiring a DHCP address correctly or cannot be
> reached over an associated floating IP address.
>
> Ongoing Resource Management -
>
> In order to prevent exhaustion of system resources, we have been
> automatically deleting VMs 24 hours after they are
> created. Additionally, we clear router gateways as well as floating IP
> allocations 12 hours after they are set (the public subnet is a /24
> network and anyone with an account can use the public subnet free of
> charge, hence the need for aggressively culling resources)
>
> Until today we had not been purging other resources, and over the course
> of the last three to four months, the tenant/project count has grown to
> just over 1300 tenants.  Many users log in a few times, create their
> networks and routers, and launch some test VMs and may not revisit
> TryStack for some time.  As such the qrouter and qdhcp network
> namespaces are created, and ports created in OVS, along with associated
> dnsmasq processes for each subnet the tenant creates.  We are adding
> management and culling of these additional resource types using the
> ospurge utility ( see: https://github.com/openstack/ospurge )
>
> IRC Alerting -
>
> We have also added IRC bots that can announce alerts in the #trystack
> channel in Freenode.  Alerts are sent to the IRC bot via a nagios
> instance monitoring the environment.
>
> Grafana / Graphite -
>
> We are currently working on building dashboards using grafana, using a
> graphite backend, and collectd agents sending data to graphite. Will
> Foster has built an initial dashboard to see resource utilization and
> trending at a glance (Thanks Will!).  The dashboard(s) are not yet ready
> for public consumption, but we plan on making a read-only grafana
> interface available in the near future.  For a sample of what the
> dashboard will look like, see :
>
>  http://ibin.co/2Kf8i9WxsWIl
>
> (The image is only depicting part of the dashboard as it is only a
> screenshot).
>
>
>



