Troubleshoot Sensu

Service logging

Logs produced by Sensu services (sensu-backend and sensu-agent) are often the best place to start when troubleshooting a variety of issues.

Log levels

Each log message is associated with a log level that indicates the relative severity of the event being logged:

Log level Description
panic Severe errors that cause the service to shut down in an unexpected state
fatal Fatal errors that cause the service to shut down (status 1)
error Non-fatal service error messages
warn Warning messages that indicate potential issues
info Information messages that represent service actions
debug Detailed service operation messages to help troubleshoot issues
trace Confirmation messages about whether a rule authorized a request

You can configure these log levels by specifying the desired log level as the value of log-level in the service configuration file (agent.yml or backend.yml) or as an argument to the --log-level command line flag:

sensu-agent start --log-level debug

You must restart the service after you change log levels via configuration files or command line arguments. For help with restarting a service, see the agent reference or backend reference.
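
For example, here is a minimal sketch of setting the agent log level via the configuration file (the path /etc/sensu/agent.yml assumes a standard Linux package install):

# In /etc/sensu/agent.yml (default path on Linux package installs), set:
#   log-level: debug
# Then apply the change by restarting the service:
sudo systemctl restart sensu-agent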

Increment log level verbosity

To increment the log level verbosity at runtime for the backend, run:

kill -s SIGUSR1 $(pidof sensu-backend)

To increment the log level verbosity at runtime for the agent, run:

kill -s SIGUSR1 $(pidof sensu-agent)

When you increment the log level while it is already at trace (the most verbose level), the log level wraps around to the error level.

Log file locations

Linux

Sensu services print structured log messages to standard output. To capture these log messages to disk or another logging facility, Sensu services use capabilities provided by the underlying operating system’s service management. For example, logs are sent to the journald when systemd is the service manager, whereas log messages are redirected to /var/log/sensu when running under sysv init schemes. If you are running systemd as your service manager and would rather have logs written to /var/log/sensu/, see forwarding logs from journald to syslog.
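
As a rough sketch of that forwarding, the following example assumes rsyslog is your syslog daemon and routes sensu-backend messages to /var/log/sensu/ (the filename and paths are assumptions; adapt them to your environment):

# /etc/rsyslog.d/99-sensu-backend.conf (assumed filename)
# Route sensu-backend messages to a dedicated file and stop further processing.
sudo tee /etc/rsyslog.d/99-sensu-backend.conf <<'EOF'
if $programname == 'sensu-backend' then /var/log/sensu/sensu-backend.log
& stop
EOF
sudo systemctl restart rsyslog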

For journald targets, use these commands to follow the logs. Replace the <service> variable with the name of the desired service (for example, backend or agent).

journalctl --follow --unit sensu-<service>

For log file targets, use these commands to follow the logs. Replace the <service> variable with the name of the desired service (for example, backend or agent).

tail --follow /var/log/sensu/sensu-<service>

Narrow your search to a specific timeframe

Use the journalctl --since and --until options to refine the basic journalctl commands and narrow your search by timeframe.

Retrieve all the logs for sensu-backend since yesterday:

journalctl -u sensu-backend --since yesterday | tee sensu-backend-$(date +%Y-%m-%d).log

Retrieve all the logs for sensu-agent since a specific time:

journalctl -u sensu-agent --since 09:00 --until "1 hour ago" | tee sensu-agent-$(date +%Y-%m-%d).log

Retrieve all the logs for sensu-backend for a specific date range:

journalctl -u sensu-backend --since "2015-01-10" --until "2015-01-11 03:00" | tee sensu-backend-$(date +%Y-%m-%d).log
Logging edge cases

If a Sensu service experiences a panic crash, the service may seem to start and stop without producing any output in journalctl. This is due to a bug in systemd.

In these cases, try using the _COMM variable instead of the -u flag to access additional log entries:

journalctl _COMM=sensu-backend.service --since yesterday

Windows

The Sensu agent writes service logs to the location specified by the log-file configuration flag (default %ALLUSERSPROFILE%\sensu\log\sensu-agent.log, C:\ProgramData\sensu\log\sensu-agent.log on standard Windows installations). For more information about managing the Sensu agent for Windows, see the agent reference. You can also view agent events using the Windows Event Viewer, under Windows Logs, as events with source SensuAgent.

If you’re running a binary-only distribution of the Sensu agent for Windows, you can follow the service log with this PowerShell command (adjust the path if you changed the log-file location):

Get-Content -Path "C:\ProgramData\sensu\log\sensu-agent.log" -Wait

Sensu backend startup errors

The following errors are expected when starting up a Sensu backend with the default configuration:

{"component":"etcd","level":"warning","msg":"simple token is not cryptographically signed","pkg":"auth","time":"2019-11-04T10:26:31-05:00"}
{"component":"etcd","level":"warning","msg":"set the initial cluster version to 3.3","pkg":"etcdserver/membership","time":"2019-11-04T10:26:31-05:00"}
{"component":"etcd","level":"warning","msg":"serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!","pkg":"embed","time":"2019-11-04T10:26:33-05:00"}

The serving insecure client requests warning is an expected warning from the embedded etcd database. TLS configuration is recommended but not required. For more information, see etcd security documentation.

Permission issues

The Sensu user and group must own files and folders within /var/cache/sensu/ and /var/lib/sensu/. You will see a logged error like those listed here if there is a permission issue with either the sensu-backend or the sensu-agent:

{"component":"agent","error":"open /var/cache/sensu/sensu-agent/assets.db: permission denied","level":"fatal","msg":"error executing sensu-agent","time":"2019-02-21T22:01:04Z"}
{"component":"backend","level":"fatal","msg":"error starting etcd: mkdir /var/lib/sensu: permission denied","time":"2019-03-05T20:24:01Z"}

Use a recursive chown to resolve permission issues with the sensu-backend:

sudo chown -R sensu:sensu /var/cache/sensu/sensu-backend

or the sensu-agent:

sudo chown -R sensu:sensu /var/cache/sensu/sensu-agent
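
If the logged error references /var/lib/sensu instead, as in the backend example above, apply the same ownership fix to that path:

sudo chown -R sensu:sensu /var/lib/sensu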

Handlers and event filters

Whether implementing new workflows or modifying existing workflows, you may need to troubleshoot various stages of the event pipeline.

Create an agent API test event

In many cases, generating events using the agent API will save you time and effort over modifying existing check configurations.

Here’s an example that uses cURL with the API of a local sensu-agent process to generate test-event check results:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "test-event"
    },
    "status": 2,
    "output": "this is a test event targeting the email_ops handler",
    "handlers": [ "email_ops" ]
  }
}' \
http://127.0.0.1:3031/events
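
After the agent accepts the event, you can confirm that it reached the backend. This check assumes the agent's entity name is the local hostname (the default) and that sensuctl is configured on the same host:

sensuctl event info $(hostname) test-event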

Use a debug handler

It may also be helpful to see the complete event object being passed to your workflows. We recommend using a debug handler like this one to write an event to disk as JSON data:

---
type: Handler
api_version: core/v2
metadata:
  name: debug
spec:
  type: pipe
  command: cat > /var/log/sensu/debug-event.json
  timeout: 2

{
  "type": "Handler",
  "api_version": "core/v2",
  "metadata": {
    "name": "debug"
  },
  "spec": {
    "type": "pipe",
    "command": "cat > /var/log/sensu/debug-event.json",
    "timeout": 2
  }
}
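
With the definition saved to a local file (debug-handler.yml is an assumed filename), you can register it with sensuctl:

sensuctl create --file debug-handler.yml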

With this handler definition installed in your Sensu backend, you can add the debug handler to the list of handlers in your test event:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "test-event"
    },
    "status": 2,
    "output": "this is a test event targeting the email_ops handler",
    "handlers": [ "email_ops", "debug" ]
  }
}' \
http://127.0.0.1:3031/events

The observability event data should be written to /var/log/sensu/debug-event.json for inspection. The contents of this file will be overwritten by every event sent to the debug handler.
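
To inspect the captured event, pretty-print the file (this assumes the jq utility is installed; plain cat works as well):

jq . /var/log/sensu/debug-event.json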

NOTE: When multiple Sensu backends are configured in a cluster, event processing is distributed across all members. You may need to check the filesystem of each Sensu backend to locate the debug output for your test event.

Manually execute a handler

If you are not receiving events via a handler even though a check is generating events as expected, follow these steps to manually execute the handler and confirm whether the handler is working properly.

  1. List all events:

    sensuctl event list

    Choose an event from the list to use for troubleshooting and note the event’s check and entity names.

  2. Navigate to the /var/cache/sensu/sensu-backend/ directory:

    cd /var/cache/sensu/sensu-backend/

  3. Run ls to list the contents of the /var/cache/sensu/sensu-backend/ directory. In the list, identify the handler’s dynamic runtime asset SHA.

    NOTE: If the list includes more than one SHA, run sensuctl asset list. In the response, the Hash column contains the first seven characters for each asset build’s SHA. Note the hash for your build of the handler asset and compare it with the SHAs listed in the /var/cache/sensu/sensu-backend/ directory to find the correct handler asset SHA.

  4. Navigate to the bin directory for the handler asset SHA. Before you run the command below, replace <handler_asset_sha> with the SHA you identified in the previous step.

    cd <handler_asset_sha>/bin

  5. Run the command to manually execute the handler. Before you run the command below, replace the following text:

    • <entity_name>: Replace with the entity name for the event you are using to troubleshoot.
    • <check_name>: Replace with the check name for the event you are using to troubleshoot.
    • <handler_command>: Replace with the command value for the handler you are troubleshooting.
    sensuctl event info <entity_name> <check_name> --format json | ./<handler_command>

If your handler is working properly, you will receive an alert for the event via the handler. The response for your manual execution command will also include a message to confirm notification was sent. In this case, your Sensu pipeline is not causing the problem with missing events.

If you do not receive an alert for the event, the handler is not working properly. In this case, the manual execution response will include the message Error executing <handler_asset_name>: followed by a description of the specific error to help you correct the problem.
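
As a hypothetical illustration of step 5, for an event from an entity named webserver01 with a check named check-nginx, and a pipe handler whose command value you retrieved with sensuctl handler info <handler_name>, the manual execution would look similar to this:

# Replace the placeholder command with the real command value for your handler.
sensuctl event info webserver01 check-nginx --format json | ./handler-command --example-flag value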

Dynamic runtime assets

Use the information in this section to troubleshoot error messages related to dynamic runtime assets.

Incorrect asset filter

Dynamic runtime asset filters allow you to scope an asset to a particular operating system or architecture. You can see an example in the asset reference. An improperly applied asset filter can prevent the asset from being downloaded by the desired entity, resulting in error messages on both the agent and the backend that indicate the command was not found:

Agent log entry

{
  "asset": "check-disk-space",
  "component": "asset-manager",
  "entity": "sensu-centos",
  "filters": [
    "true == false"
  ],
  "level": "debug",
  "msg": "entity not filtered, not installing asset",
  "time": "2019-09-12T18:28:05Z"
}

Backend event

---
timestamp: 1568148292
check:
  command: check-disk-space
  handlers: []
  high_flap_threshold: 0
  interval: 10
  low_flap_threshold: 0
  publish: true
  runtime_assets:
  - sensu-plugins-disk-checks
  subscriptions:
  - caching_servers
  proxy_entity_name: ''
  check_hooks:
  stdin: false
  subdue:
  ttl: 0
  timeout: 0
  round_robin: false
  duration: 0.001795508
  executed: 1568148292
  history:
  - status: 127
    executed: 1568148092
  issued: 1568148292
  output: 'sh: check-disk-space: command not found'
  state: failing
  status: 127
  total_state_change: 0
  last_ok: 0
  occurrences: 645
  occurrences_watermark: 645
  output_metric_format: ''
  output_metric_handlers:
  output_metric_tags:
  env_vars:
  metadata:
    name: failing-disk-check
    namespace: default
metadata:
  namespace: default

{
  "timestamp": 1568148292,
  "check": {
    "command": "check-disk-space",
    "handlers": [],
    "high_flap_threshold": 0,
    "interval": 10,
    "low_flap_threshold": 0,
    "publish": true,
    "runtime_assets": [
      "sensu-plugins-disk-checks"
    ],
    "subscriptions": [
      "caching_servers"
    ],
    "proxy_entity_name": "",
    "check_hooks": null,
    "stdin": false,
    "subdue": null,
    "ttl": 0,
    "timeout": 0,
    "round_robin": false,
    "duration": 0.001795508,
    "executed": 1568148292,
    "history": [
      {
        "status": 127,
        "executed": 1568148092
      }
    ],
    "issued": 1568148292,
    "output": "sh: check-disk-space: command not found\n",
    "state": "failing",
    "status": 127,
    "total_state_change": 0,
    "last_ok": 0,
    "occurrences": 645,
    "occurrences_watermark": 645,
    "output_metric_format": "",
    "output_metric_handlers": null,
    "output_metric_tags": null,
    "env_vars": null,
    "metadata": {
      "name": "failing-disk-check",
      "namespace": "default"
    }
  },
  "metadata": {
    "namespace": "default"
  }
}

If you see a message like this, the entity was not able to download the required asset because of the asset's filter restrictions. Review your asset definition to confirm its filters. To review the filters for an asset, use the sensuctl asset info command with a --format flag:

sensuctl asset info sensu-plugins-disk-checks --format yaml
sensuctl asset info sensu-plugins-disk-checks --format wrapped-json

Conflating operating systems with families

A common asset filter issue is conflating operating systems with the family they’re a part of. For example, although Ubuntu is part of the Debian family of Linux distributions, Ubuntu is not the same as Debian. A practical example might be:

filters:
- entity.system.platform == 'debian'
- entity.system.arch == 'amd64'

{
  "filters": [
    "entity.system.platform == 'debian'",
    "entity.system.arch == 'amd64'"
  ]
}

This would not allow an Ubuntu system to run the asset.

Instead, the asset filter should look like this:

filters:
- entity.system.platform_family == 'debian'
- entity.system.arch == 'amd64'

{
  "filters": [
    "entity.system.platform_family == 'debian'",
    "entity.system.arch == 'amd64'"
  ]
}

or

filters:
- entity.system.platform == 'ubuntu'
- entity.system.arch == 'amd64'

{
  "filters": [
    "entity.system.platform == 'ubuntu'",
    "entity.system.arch == 'amd64'"
  ]
}

This would allow the asset to be downloaded onto the target entity.

Running the agent on an unsupported Linux platform

If you run the Sensu agent on an unsupported Linux platform, the agent might fail to correctly identify your version of Linux and could download the wrong version of an asset.

This issue affects Linux distributions that do not include the lsb_release package in their default installations. In this case, gopsutil may try to open /etc/lsb-release or try to run /usr/bin/lsb_release to find system information, including the Linux version. Since the lsb_release package is not installed, the agent will not be able to discover the Linux version as expected.

To resolve this problem, install the lsb_release package for your Linux distribution.
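
The package name varies by distribution. For example, on Debian- and Ubuntu-family systems, the package that provides /usr/bin/lsb_release is typically named lsb-release:

# Debian/Ubuntu-family example; package names differ on other distributions.
sudo apt-get install lsb-release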

Etcd clusters

Some issues require you to investigate the state of the etcd cluster or data stored within etcd. In these cases, we suggest using the etcdctl tool to query and manage the etcd database.

Sensu’s supported packages do not include the etcdctl executable, so you must get it from a compatible etcd release.
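
For example, here is a sketch of installing etcdctl from an upstream etcd release (the version and platform are assumptions; match the etcd version your Sensu backend embeds):

curl -LO https://github.com/etcd-io/etcd/releases/download/v3.3.22/etcd-v3.3.22-linux-amd64.tar.gz
tar -xzf etcd-v3.3.22-linux-amd64.tar.gz
sudo cp etcd-v3.3.22-linux-amd64/etcdctl /usr/local/bin/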

Configure etcdctl environment variables

To use etcdctl to investigate etcd cluster and data storage issues, first run these commands to configure etcdctl environment variables:

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/sensu/ca.pem
export ETCDCTL_ENDPOINTS="https://backend01:2379,https://backend02:2379,https://backend03:2379"

If your etcd uses client certificate authentication, run these commands too:

export ETCDCTL_CERT=/etc/sensu/cert.pem
export ETCDCTL_KEY=/etc/sensu/key.pem

View cluster status and alarms

Use the commands listed here to retrieve etcd cluster status and list and clear alarms.

To retrieve etcd cluster status:

etcdctl endpoint status

To retrieve a list of etcd alarms:

etcdctl alarm list

To clear etcd alarms:

etcdctl alarm disarm

Restore a cluster with an oversized database

The etcd default maximum database size is 2 GB. If you suspect your etcd database exceeds the maximum size, run this command to confirm cluster size:

etcdctl endpoint status

The response will list the current cluster status and database size:

https://backend01:2379, 88db026f7feb72b4, 3.3.22, 2.1GB, false, 144, 18619245
https://backend02:2379, e98ad7a888d16bd6, 3.3.22, 2.1GB, true, 144, 18619245
https://backend03:2379, bc4e39432cbb36d, 3.3.22, 2.1GB, false, 144, 18619245

To restore an etcd cluster with a database size that exceeds 2 GB:

  1. Get the current revision number:

    etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'

  2. Compact the keyspace to that revision, substituting the revision number from step 1 for $rev (a combined sketch of steps 1-3 appears after these steps):

    etcdctl compact $rev

  3. Defragment to free up space:

    etcdctl defrag

  4. Confirm that the cluster is restored:

    etcdctl endpoint status

    The response should list the current cluster status and database size:

    https://backend01:2379, 88db026f7feb72b4, 3.3.22, 1.0 MB, false, 144, 18619245
    https://backend02:2379, e98ad7a888d16bd6, 3.3.22, 1.0 MB, true, 144, 18619245
    https://backend03:2379, bc4e39432cbb36d, 3.3.22, 1.0 MB, false, 144, 18619245
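
Here is a combined sketch of steps 1 through 3 that captures the revision in a shell variable before compacting and defragmenting:

rev=$(etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*' | head -n1)
etcdctl compact "$rev"
etcdctl defrag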

Datastore performance

In a default deployment, Sensu uses the etcd datastore for both configuration and state. As the number of checks and entities in your Sensu installation increases, so does the volume of read and write requests to the etcd database.

One trade-off in etcd’s design is its sensitivity to disk and CPU latency. When certain latency tolerances are regularly exceeded, failures will cascade. Sensu will attempt to recover from these conditions when it can, but this may not be successful.

To maximize Sensu Go performance as your deployments grow, prevent issues associated with poor datastore performance through ongoing collection and review of Sensu time-series performance metrics.

Symptoms of poor performance

At the default “warn” log level, you may see messages like these from your Sensu backend:

{"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/handlers/default/keepalive\\\" limit:1 \" with result \"range_response_count:0 size:6\" took too long (169.767546ms) to execute","pkg":"etcdserver","time":"..."}

The above message indicates that a database query (“read-only range request”) exceeded a 100-millisecond threshold hard-coded into etcd. Messages like these are helpful because they can alert you to a trend, but these occasional warnings don’t necessarily indicate a problem.

However, a trend of increasingly long-running database transactions will eventually lead to decreased reliability. You may experience symptoms of these conditions as inconsistent check execution behavior or configuration updates that are not applied as expected.
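
To gauge how often these slow-request warnings occur, you can count them in the backend log. This example assumes systemd/journald, as described in the Linux logging section above:

journalctl -u sensu-backend --since "1 hour ago" | grep -c "took too long"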

As the etcd tuning documentation states:

An etcd cluster is very sensitive to disk latencies. Since etcd must persist proposals to its log, disk activity from other processes may cause long fsync latencies. […] etcd may miss heartbeats, causing request timeouts and temporary leader loss.

When Sensu’s etcd component doesn’t receive sufficient CPU cycles or its file system can’t sustain a sufficient number of IOPS, transactions will begin to time out, leading to cascading failures.

A message like this indicates that syncing the etcd database to disk exceeded another threshold:

{"component":"etcd","level":"warning","msg":"sync duration of 1.031759056s, expected less than 1s","pkg":"wal","time":"..."}

These subsequent “retrying of unary invoker failed” messages indicate failing requests to etcd:

{"level":"warn","ts":"...","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6f6bfc7e-cf31-4498-a564-78d6b7b3a44e/localhost:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}

On busy systems you may also see output like “message repeated 5 times” indicating that failing requests were retried multiple times.

In many cases, the backend service detects and attempts to recover from errors like these, so you may see a message like this:

{"component":"backend","error":"error from keepalived: internal error: etcdserver: request timed out","level":"error","msg":"backend stopped working and is restarting","time":"..."}

This may result in a crash loop that is difficult to recover from. You may observe that the Sensu backend process continues running but is not listening for connections on the agent websocket, API, or web UI ports. The backend will stop listening on those ports when the etcd database is unavailable.