[rdo-users] TripleO Monitoring Tool/Method

Fri Oct 23 13:36:56 UTC 2020

Hi,

Yes, of course I'm using STF, and it's not complicated.
It's always a good idea to separate your monitoring stack from the
monitored infrastructure. How would you know your stack is down, if
notifications are also sent from that stack?

With the tripleo-heat-templates you linked, you basically enable legacy
telemetry (ceilometer, aodh, gnocchi).

If you are running 40 computes, that is not a small stack anymore. I
would recommend using Ceph as the Gnocchi storage backend.

Also, depending on your use case and your collectd settings, you may
want to poll less often, i.e. raise the interval. The parameter is
CollectdDefaultPollingInterval. I have set it here to something like 5
seconds, but in your case I would suggest 600 (the same as for Ceilometer).
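
As a minimal sketch of the two suggestions above, an extra environment
file might look like the following. The file name is hypothetical, and
the value "rbd" for GnocchiBackend is an assumption about how the Ceph
backend is selected; please check both parameters against your
tripleo-heat-templates version. It also presumes a Ceph cluster is
already part of the deployment.

```yaml
# collectd-tuning.yaml (hypothetical file name)
parameter_defaults:
  # Poll every 600 seconds instead of every few seconds.
  CollectdDefaultPollingInterval: 600
  # Assumption: "rbd" selects Ceph as the Gnocchi storage backend.
  GnocchiBackend: rbd
```

Such a file would be passed with an additional -e argument, after the
environment files already on the overcloud deploy command.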

Matthias

On 23/10/2020 11:09, Khodayar Doustar wrote:
> Matthias,
> 
> Thanks a lot for your answer.
> Yes, you win the bet :) I used swift and am currently struggling to
> disable collectd to make my cloud usable again! :))
> 
> I've seen this STF (Service Telemetry Framework) but it seems a little
> bit too complicated. I should implement an OKD cluster to monitor my
> openstack, isn't it too much work?
> Have you tried it yourself?
> 
> If I understand correctly, your first and main suggestion means
> adding these files to my overcloud deploy command:
> 
> /usr/share/openstack-tripleo-heat-templates/environments/enable-legacy-telemetry.yaml
> /usr/share/openstack-tripleo-heat-templates/environments/services/collectd.yaml
> 
> and for performance tuning I've checked this page:
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/deployment_recommendations_for_specific_red_hat_openstack_platform_services/config-recommend-telemetry_config-recommend-telemetry#config_telemetry-small-overcloud_config-recommend-telemetry
> 
> Is that what you mean?
> If so, I should make my cloud usable again and just change GnocchiBackend
> to a path to a file on a shared file system (e.g. NFS), since I have 4
> controller nodes; the rest is exactly what I've done up to now.
> 
> Thanks a lot,
> Khodayar
> 
> On Fri, Oct 23, 2020 at 10:01 AM Matthias Runge <mrunge at redhat.com> wrote:
> 
>     On 22/10/2020 17:46, Khodayar Doustar wrote:
>     > Hi everybody,
>     >
>     > I am searching for a good and useful method to monitor my 40-node
>     > cloud.
>     >
>     > I have tried
>     >
>     > - Prometheus + Grafana (with
>     > https://github.com/openstack-exporter/openstack-exporter) but it
>     > cannot monitor node load, CPU usage, etc.
>     > and
>     > - Gnocchi + Collectd + Grafana, but it puts an unbelievable load on
>     > the nodes and makes the whole cloud completely unusable!
>     >
>     > I've tried to use Graphite + Grafana but I failed.
>     >
>     > Do you have any suggestions?
> 
> 
>     Hi,
> 
>     yes, I have some opinions here.
> 
>     My proposal here is:
> 
>     - use collectd to collect low level metrics from your baremetal machines
>     - use ceilometer to collect OpenStack related info, like project usage,
>     etc. That is nothing you'd get by using node-exporter
>     - hook them both together and send metrics over to something called
>     Service Telemetry Framework. The configuration *is* included in tripleo.
>     The website has documentation available
>     https://infrawatch.github.io/documentation
>     - graphite + grafana (plus collectd) is also a single node setup and
>     won't provide you reliability.
>     - collectd also provides the ability to send events, which can be acted
>     on. That is not included if you use node-exporter, openstack-exporter
>     etc. Prometheus monitoring creates events from metrics, but will be slow
>     to detect failed components.
> 
>     Since Prometheus is meant to be a single server, there is no HA per se
>     in Prometheus. That makes handling Prometheus on standalone machines a
>     bit awkward, unless you have an infrastructure taking care of that.
> 
>     In your tests with gnocchi, collectd and grafana, I bet you used swift
>     as backend for gnocchi storage. That is not a good idea and may lead to
>     bad performance.
> 
>     Matthias
> 
>     -- 
>     Matthias Runge <mrunge at redhat.com>
> 
>     Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
>     Commercial register: Amtsgericht Muenchen, HRB 153243,
>     Man.Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neil
> 

-- 
Matthias Runge <mrunge at redhat.com>

Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Man.Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neil