On 3/11/2016 8:01 AM, Alan Pevec wrote:
10.03.2016 20:54, Tristan Cacqueray wrote:
> Hello folks,
>
> here is a little note about what happened... The underlying cloud
> (called rcip-dev) experienced an outage related to rabbitmq cluster
> inconsitency:
> * No service was not able to connect to rabbitmq on port 5672
> * Restarting controllers didn't helped until the rabbitmq cluster was
> killed and recreated.
> * From that point, qrouter lost the VIP and no VRRP packets was sent.
> * Restarting the controller one by one after cleaning rabbitmq cluster
> solved that issue.
>
> Timeline:
> * 00:00 UTC: API started to timeout
> * 17:30 UTC: Network connections was lost
> * 18:15 UTC: Network connectivity restored
>
> Services' API and instances are now nominals.
> Special kudos to Cédric Lecomte for fixing the cloud and saving the day!
Thanks Tristan and Cedric!
Dogfooding is important so we can improve what we ship.
Do you do have any logs that we could give to Fabio and his team to
analyze what happened with those crazy wabbits, so we might get it fixed
and avoided in the future?
If you could please escalate the issue in a bugzilla and assign it to Petr.
We need sosreports from the nodes involved in the accident and there
should be potentially rabbitmq core dumps (but I can´t remember the
location tho).
Thanks
Fabio