Monitoring VMware with InfluxDB and Grafana

Grafana dashboards offer a visually appealing and efficient way to track server metrics. In this guide, we’ll walk through setting up Grafana with VMware to monitor your servers effectively.

Last year, I shared a guide on setting up Grafana for Proxmox. Having since transitioned primarily to a VMware stack, it’s the perfect time to explore a similar setup for VMware.

Please Note: This guide assumes that you have vCenter operational, as it relies on its API.

At the end of this post, you’ll find links to some fantastic premade Grafana dashboards to kickstart your VMWare monitoring.

Creating Your Container Stack

I run my monitoring stack using Docker containers, orchestrated with Docker Compose. We'll set up three containers:

  • Telegraf: Collects data from vCenter's API and sends it to InfluxDB.
  • InfluxDB: The database that stores your metrics.
  • Grafana: Visualizes metrics from InfluxDB.


Populate a docker-compose.yml file with the following contents:

version: "3"
services:
  grafana:
    image: grafana/grafana
    container_name: grafana_container
    restart: always
    ports:
      - 3000:3000
    networks:
      - monitoring_network
    volumes:
      - grafana-volume:/var/lib/grafana

  influxdb:
    image: influxdb
    container_name: influxdb_container
    restart: always
    ports:
      - 8086:8086
      - 8089:8089/udp
    networks:
      - monitoring_network
    volumes:
      - influxdb-volume:/var/lib/influxdb

  telegraf:
    image: telegraf
    container_name: telegraf_container
    restart: always
    networks:
      - monitoring_network
    volumes:
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro

networks:
  monitoring_network:
    external: true

volumes:
  grafana-volume:
    external: true
  influxdb-volume:
    external: true

Create the necessary Docker volumes and network by running:

docker volume create influxdb-volume
docker volume create grafana-volume
docker network create monitoring_network

I prefer separate, externally managed volumes for InfluxDB and Grafana, mainly so each service's data persists independently of the stack and can be backed up or recreated on its own.

Start the containers with docker compose up -d in the same directory as your docker-compose.yml file.

Addressing Startup Errors:
Upon starting the containers, you might see Telegraf crash-looping with a configuration error. This usually happens because the bind-mount source did not exist yet, so Docker created telegraf.conf as a directory instead of a file. Here's how to fix it:

  1. Navigate to the telegraf directory.
  2. Remove the stray telegraf.conf directory and create an empty file in its place:
rm -rf telegraf.conf
sudo touch telegraf.conf
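The fix above can also be scripted so it is safe to run repeatedly. This is a sketch that assumes the ./telegraf/telegraf.conf bind-mount path from the compose file:

```shell
#!/bin/sh
# Ensure telegraf.conf exists as a *file*. If Docker created it as a
# directory (because the bind-mount source was missing), remove it first.
CONF="./telegraf/telegraf.conf"
if [ -d "$CONF" ]; then
  rm -rf "$CONF"               # stray directory created by Docker
fi
mkdir -p "$(dirname "$CONF")"  # make sure ./telegraf exists
touch "$CONF"                  # create the (empty) config file
```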

Configuring InfluxDB

Start by visiting http://your_host:8086

Follow the straightforward setup process, making note of all the details for use in later steps.

Once submitted, you will be presented with an API token. Stash this away as well.

Note that this InfluxDB instance is for demonstration purposes and will be destroyed after this guide is published. Under normal circumstances, you would absolutely not want to share this token with anyone.
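Before moving on, you can sanity-check that the InfluxDB API is reachable. A minimal probe, assuming the port from the compose file (substitute your own host):

```shell
#!/bin/sh
# Hit InfluxDB 2.x's /health endpoint; it returns {"status":"pass",...} when healthy.
HOST="http://localhost:8086"   # assumption: adjust to your InfluxDB host
curl -s --max-time 5 "$HOST/health" || echo "InfluxDB is not reachable at $HOST"
```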

Configuring Telegraf

Now it is time to configure Telegraf. Return to your terminal and open the telegraf.conf file we created earlier. Start by populating it with the following:

[[outputs.influxdb_v2]]
   urls = ["http://your_influxdb_host:8086"]
   ## Token for authentication.
   token = "your_token_here"
   ## Organization is the name of the organization you wish to write to; must exist.
   organization = "Homelab"
   ## Destination bucket to write into.
   bucket = "vmware"

Additionally, after the outputs block, add the following inputs block:

# Read metrics from VMware vCenter
 [[inputs.vsphere]]
 ## List of vCenter URLs to be monitored. These three lines must be uncommented
 ## and edited for the plugin to work.
 vcenters = [ "https://<vcenter_hostname>/sdk" ]
 username = "[email protected]"
 password = "password_here"
 #
 ## VMs
 ## Typical VM metrics (if omitted or empty, all metrics are collected)
 vm_metric_include = [
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.run.summation",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.wait.summation",
      "mem.active.average",
      "mem.granted.average",
      "mem.latency.average",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.usage.average",
      "power.power.average",
      "virtualDisk.numberReadAveraged.average",
      "virtualDisk.numberWriteAveraged.average",
      "virtualDisk.read.average",
      "virtualDisk.readOIO.latest",
      "virtualDisk.throughput.usage.average",
      "virtualDisk.totalReadLatency.average",
      "virtualDisk.totalWriteLatency.average",
      "virtualDisk.write.average",
      "virtualDisk.writeOIO.latest",
      "sys.uptime.latest",
    ]
 # vm_metric_exclude = [] ## Nothing is excluded by default
 # vm_instances = true ## true by default
 #
 ## Hosts
 ## Typical host metrics (if omitted or empty, all metrics are collected)
 host_metric_include = [
      "cpu.coreUtilization.average",
      "cpu.costop.summation",
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.swapwait.summation",
      "cpu.usage.average",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.utilization.average",
      "cpu.wait.summation",
      "disk.deviceReadLatency.average",
      "disk.deviceWriteLatency.average",
      "disk.kernelReadLatency.average",
      "disk.kernelWriteLatency.average",
      "disk.numberReadAveraged.average",
      "disk.numberWriteAveraged.average",
      "disk.read.average",
      "disk.totalReadLatency.average",
      "disk.totalWriteLatency.average",
      "disk.write.average",
      "mem.active.average",
      "mem.latency.average",
      "mem.state.latest",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.totalCapacity.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.errorsRx.summation",
      "net.errorsTx.summation",
      "net.usage.average",
      "power.power.average",
      "storageAdapter.numberReadAveraged.average",
      "storageAdapter.numberWriteAveraged.average",
      "storageAdapter.read.average",
      "storageAdapter.write.average",
      "sys.uptime.latest",
    ]
 # host_metric_exclude = [] ## Nothing excluded by default
 # host_instances = true ## true by default
 #
 ## Clusters
 cluster_metric_include = [] ## if omitted or empty, all metrics are collected
 # cluster_metric_exclude = [] ## Nothing excluded by default
 # cluster_instances = false ## false by default
 #
 ## Datastores
 datastore_metric_include = [] ## if omitted or empty, all metrics are collected
 # datastore_metric_exclude = [] ## Nothing excluded by default
 # datastore_instances = false ## false by default for Datastores only
 #
 ## Datacenters
 datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
# datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
 # datacenter_instances = false ## false by default for Datacenters
 #
 ## Plugin Settings
 ## separator character to use for measurement and field names (default: "_")
 # separator = "_"
 #
 ## number of objects to retrieve per query for realtime resources (vms and hosts)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_objects = 256
 #
 ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_metrics = 256
 #
 ## number of go routines to use for collection and discovery of objects and metrics
 # collect_concurrency = 1
 # discover_concurrency = 1
 #
 ## whether or not to force discovery of new objects on initial gather call before collecting metrics
 ## when true for large environments this may cause errors for time elapsed while collecting metrics
 ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
 # force_discover_on_init = false
 #
 ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
 # object_discovery_interval = "300s"
 #
 ## timeout applies to any of the api request made to vcenter
 # timeout = "60s"
 #
 ## Optional SSL Config
 # ssl_ca = "/path/to/cafile"
 # ssl_cert = "/path/to/certfile"
 # ssl_key = "/path/to/keyfile"
 ## Use SSL but skip chain & host verification
 insecure_skip_verify = true

Note that insecure_skip_verify = true is intended for environments using a self-signed certificate. If your vCenter presents a certificate from a trusted CA, set this to false.
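If you would rather verify the certificate than skip verification, you can export the certificate vCenter presents and point the ssl_ca option at it. A sketch, with vcenter.example.com as a placeholder hostname:

```shell
#!/bin/sh
# Fetch the certificate vCenter presents and save it as a PEM file that
# Telegraf's ssl_ca option can reference. The hostname is a placeholder.
VCENTER="vcenter.example.com"
openssl s_client -connect "$VCENTER:443" -showcerts </dev/null 2>/dev/null \
  | openssl x509 -outform PEM > vcenter-ca.pem \
  || echo "could not retrieve a certificate from $VCENTER"
```

You would then mount vcenter-ca.pem into the Telegraf container, set ssl_ca to that path, and flip insecure_skip_verify to false.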

Restart the Telegraf container to apply the new configuration:
docker compose restart telegraf

Let's check to see if data is flowing correctly. To do this, you can run docker ps to get the container ID of Telegraf:

tcude@monitoring02:~/monitoring$ docker ps
CONTAINER ID   IMAGE             COMMAND                  CREATED          STATUS          PORTS                                                                                  NAMES
7d9bf5c686ce   telegraf          "/entrypoint.sh tele…"   32 minutes ago   Up 16 seconds   8092/udp, 8125/udp, 8094/tcp                                                           telegraf_container
718bb1181c55   grafana/grafana   "/run.sh"                32 minutes ago   Up 32 minutes   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                              grafana_container
55949c1ea88b   influxdb          "/entrypoint.sh infl…"   32 minutes ago   Up 32 minutes   0.0.0.0:8086->8086/tcp, :::8086->8086/tcp, 0.0.0.0:8089->8089/udp, :::8089->8089/udp   influxdb_container

For my setup, Telegraf has an ID of 7d9bf5c686ce. Knowing the container ID, you can now run docker logs <container_id>. You should see something similar to this:

2023-12-08T15:28:09Z I! Loading config: /etc/telegraf/telegraf.conf
2023-12-08T15:28:09Z W! DeprecationWarning: Option "force_discover_on_init" of plugin "inputs.vsphere" deprecated since version 1.14.0 and will be removed in 2.0.0: option is ignored
2023-12-08T15:28:09Z I! Starting Telegraf 1.28.5 brought to you by InfluxData the makers of InfluxDB
2023-12-08T15:28:09Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-12-08T15:28:09Z I! Loaded inputs: vsphere
2023-12-08T15:28:09Z I! Loaded aggregators:
2023-12-08T15:28:09Z I! Loaded processors:
2023-12-08T15:28:09Z I! Loaded secretstores:
2023-12-08T15:28:09Z I! Loaded outputs: influxdb_v2
2023-12-08T15:28:09Z I! Tags enabled: host=7d9bf5c686ce
2023-12-08T15:28:09Z W! Deprecated inputs: 0 and 1 options
2023-12-08T15:28:09Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"7d9bf5c686ce", Flush Interval:10s
2023-12-08T15:28:09Z I! [inputs.vsphere] Starting plugin

Telegraf picked up the telegraf.conf file and everything looks good.
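You can also confirm metrics are landing in the bucket by querying InfluxDB's HTTP API directly. A sketch, assuming the org, bucket, and token from earlier (vsphere measurements can also be filtered by name, e.g. vsphere_vm_cpu):

```shell
#!/bin/sh
# Run a small Flux query against the InfluxDB 2.x query API.
# HOST, ORG, and TOKEN are assumptions -- substitute your own values.
HOST="http://localhost:8086"
ORG="Homelab"
TOKEN="your_token_here"
curl -s --max-time 5 -X POST "$HOST/api/v2/query?org=$ORG" \
  -H "Authorization: Token $TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  -d 'from(bucket: "vmware") |> range(start: -15m) |> limit(n: 5)' \
  || echo "query failed -- is InfluxDB reachable at $HOST?"
```

If rows come back (the API returns CSV), Telegraf is writing successfully.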

With InfluxDB up and Telegraf providing data to it, we can now set up Grafana!

Setting up Grafana

Access Grafana at http://<your_hostname>:3000 (default credentials: admin/admin).

Change the default password for security.

Once logged in, open the left-hand menu on the home page. Under the Connections section, select Data sources, then Add data source.

You will now be presented with a list of possible data source options. Select InfluxDB.

You will now want to configure your InfluxDB data source using the values from the earlier steps.

There are a couple of things worth noting here:

The Query language will default to InfluxQL; I have changed this to Flux. InfluxQL is a SQL-like query language for InfluxDB that focuses on simplicity and ease of use, whereas Flux is a newer, more powerful functional scripting language designed for complex data processing, transformations, and analytics.

Other than that, you will just need to provide your InfluxDB username and password in the Basic Auth Details section.

Under InfluxDB Details, you will use the same values we used earlier. The Token field needs to be populated with the API token value that we received earlier.
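If you manage Grafana as code, the same data source can also be defined through Grafana's file-based provisioning instead of the UI. A sketch, with values assumed from this guide (the file path and field names follow Grafana's provisioning format):

```yaml
# e.g. /etc/grafana/provisioning/datasources/influxdb.yml
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://influxdb_container:8086   # container name from the compose file
    jsonData:
      version: Flux
      organization: Homelab
      defaultBucket: vmware
    secureJsonData:
      token: your_token_here              # the API token from the InfluxDB setup
```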

With that all out of the way, you should now be able to click Save & test.

Assuming everything is configured correctly, you will see a confirmation that the data source is working.

Setting up a Dashboard in Grafana

With the hard part out of the way, it is time to load up a dashboard! To start, using the same left-hand menu, select Dashboards.

Select Create Dashboard.

While you can manually create a dashboard from scratch, there are also many pre-made dashboards that will get you up and running immediately. These can be viewed by clicking Import dashboard.

From here, you can search for the ID of the dashboard you want to use. For my purposes, I'm going to start with this dashboard, which can be found using the ID 8159.

All that is left is to select your InfluxDB data source under InfluxDB, then click Import.

Conclusion

With that, you should now have a working Grafana dashboard!

Here are some of the other great VMware dashboards I have found:

Special thanks to Jorge de la Cruz for creating these fantastic dashboards! Check out their blog here:

The Blog of Jorge de la Cruz
Everything about VMware, Veeam, InfluxData, Grafana, Zimbra, etc.