# Diagnostics Service

The `tbot` process can optionally expose a diagnostics service. It is disabled by default. Once enabled, it lets you query information about the running `tbot` process over HTTP.

## Configuration

To enable the diagnostics service, you must specify an address and port for it to listen on.

For security reasons, you should ensure that access to this listener is restricted. In most cases, the most secure option is to bind the listener to `127.0.0.1`, which allows access only from the local machine.

You can configure the diagnostics service using the `--diag-addr` CLI parameter:

```
$ tbot start -c my-config.yaml --diag-addr 127.0.0.1:3001
```

Or directly within the configuration file using `diag_addr`:

```
diag_addr: 127.0.0.1:3001
```

## Endpoints

The diagnostics service exposes the following HTTP endpoints.

### `/livez`

The `/livez` endpoint always returns with a 200 status code. This can be used to determine if the `tbot` process is running and has not crashed or hung.

If you are deploying to Kubernetes, we recommend using this endpoint for your liveness probe.
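As a minimal sketch, a liveness probe for a `tbot` container might look like the fragment below. The port (`3001`) and the timing values are assumptions chosen for illustration, not recommended defaults. The kubelet sends `httpGet` probes to the Pod IP, so this also assumes the diagnostics service listens on an address reachable on that interface (for example `0.0.0.0:3001`) rather than on `127.0.0.1`.

```yaml
# Fragment of a container spec. Port and timings are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /livez
    port: 3001          # must match the port in --diag-addr / diag_addr
  periodSeconds: 10     # how often the kubelet probes
  failureThreshold: 6   # consecutive failures before the container is restarted
```

If you bind to a non-loopback address for this purpose, restrict access to the port by other means, such as a NetworkPolicy.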

### `/readyz` and `/readyz/{service}`

The `/readyz` endpoint returns the overall health of `tbot`, including all of its internal and user-defined services. If all services are healthy, it will respond with a 200 status code. If any service is unhealthy, it will respond with a 503 status code.

```
$ curl -v http://127.0.0.1:3001/readyz

HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "unhealthy",
  "services": {
    "ca-rotation": {
      "status": "healthy"
    },
    "heartbeat": {
      "status": "healthy"
    },
    "identity": {
      "status": "healthy"
    },
    "aws-roles-anywhere": {
      "status": "unhealthy",
      "reason": "access denied to perform action \"read\" on \"workload_identity\""
    }
  },
  "pid": 42344
}
```

If you are deploying to Kubernetes, we recommend using this endpoint for your readiness probe.
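A readiness probe built on this endpoint could be sketched as follows. Kubernetes treats any HTTP status from 200 to 399 as success, so the 503 returned for an unhealthy service marks the Pod as not ready. The port and timing values are assumptions for illustration, and the fragment assumes the diagnostics listener is reachable on the Pod IP rather than bound to `127.0.0.1`.

```yaml
# Fragment of a container spec. Port and timings are illustrative assumptions.
readinessProbe:
  httpGet:
    path: /readyz
    port: 3001          # must match the port in --diag-addr / diag_addr
  periodSeconds: 10     # how often the kubelet probes
  failureThreshold: 3   # consecutive failures before the Pod is marked not ready
```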

You can also use the `/readyz/{service}` endpoint to query the health of a specific service.

```
$ curl -v http://127.0.0.1:3001/readyz/aws-roles-anywhere

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy"
}
```

By default, `tbot` generates service names based on their type (e.g. `application-output-1`). You can override this by providing your own name in the `tbot` configuration file.

```
services:
  - type: identity
    name: my-service-123
```

### `/metrics`

The `/metrics` endpoint returns a Prometheus-compatible metrics snapshot.

See [Prometheus Metrics](#prometheus-metrics) below for more information.

### `/debug/pprof`

These endpoints allow the collection of pprof profiles for debugging purposes. You may be asked by a Teleport engineer to collect these if you are experiencing performance issues.

These endpoints are enabled only if the `-d`/`--debug` flag is provided when starting `tbot`. This is known as **debug mode**.

### `/wait` and `/wait/{service}`

The `/wait` endpoint returns the same content as `/readyz`, but delays its response until the bot or service has reported its initial status, regardless of whether that status is healthy or unhealthy. As with `/readyz`, if the bot or service is not healthy, it responds with a 503 status code. The endpoint may wait indefinitely for the initial status report, so clients should configure a reasonable timeout.

This endpoint is useful when integrating synchronously with bots. It lets a workflow confirm that the bot outputs or services it depends on are ready, without implementing a readiness-checking loop in your own application. It can be useful in CI/CD environments, or in any other situation where you need to be certain that `tbot` is fully initialized.

The [`tbot wait`](https://goteleport.com/docs/reference/cli/tbot.md#tbot-wait) helper command uses this endpoint internally, but also handles additional cases: it copes with `tbot` not having started yet, and it waits explicitly for services to become healthy. Most users should prefer the CLI helper when possible.

```
$ curl -v http://127.0.0.1:3001/wait

HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "unhealthy",
  "services": {
    "ca-rotation": {
      "status": "healthy"
    },
    "heartbeat": {
      "status": "healthy"
    },
    "identity": {
      "status": "healthy"
    },
    "aws-roles-anywhere": {
      "status": "unhealthy",
      "reason": "access denied to perform action \"read\" on \"workload_identity\""
    }
  },
  "pid": 42344
}
```

You can also use the `/wait/{service}` endpoint to wait for a particular service to report its initial status.

```
$ curl -v http://127.0.0.1:3001/wait/aws-roles-anywhere

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy"
}
```

## Prometheus metrics

The `tbot` process exposes a number of Prometheus metrics via the `/metrics` endpoint of the diagnostics service.

In addition to exporting the standard Go runtime metrics, `tbot` also exports custom metrics that reflect the health and performance of the various configurable services.
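As a rough sketch, a Prometheus server running on the same host as `tbot` could scrape the endpoint with a configuration along these lines. The job name and scrape interval are assumptions for illustration, and the target matches the `127.0.0.1:3001` address used in the examples on this page.

```yaml
# Fragment of prometheus.yml. Job name and interval are illustrative assumptions.
scrape_configs:
  - job_name: tbot              # hypothetical job name
    metrics_path: /metrics      # the Prometheus default, shown for clarity
    scrape_interval: 30s
    static_configs:
      - targets: ["127.0.0.1:3001"]   # the diagnostics service address
```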

## Advice

When monitoring the health of `tbot`, there are three categories of metrics you should consider:

- The health of the `tbot` process itself. For example, how much CPU time and memory is it using? These can be strong indicators of overall health and provide early warning signs of potential issues (e.g. memory leaks).
- The health of the internal services that `tbot` relies on. For example, has `tbot` been able to successfully renew its internal identity? If these internal services have become unhealthy, then it is likely that user-defined services within `tbot` will also become unhealthy.
- The health of the services you configured within `tbot`. This will indicate whether `tbot` has been able to successfully perform its intended functions.

For monitoring the health of the `tbot` process itself, a large number of metrics are provided by the Go runtime.

For monitoring the health of internal and user-defined services, there are two key metrics:

- `tbot_task_iterations_failed`: the total number of task iterations that have failed. This will have a `service` label indicating which service within the `tbot` process the task belongs to.
- `tbot_task_iterations_successful`: the total number of task iterations that have succeeded. This will also have a `service` label. This metric is a histogram, and will also indicate the number of retries that were required before the task succeeded. For a perfectly healthy service, you would expect this number of retries to be zero, or close to zero.
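One way to act on the failure counter is a Prometheus alerting rule such as the sketch below. It assumes `tbot_task_iterations_failed` is exported as a counter under exactly that name with a `service` label, as described above. The alert name, time windows, threshold, and severity are hypothetical values that you would tune for your environment.

```yaml
# Prometheus rule file fragment. Names, windows, and threshold are illustrative.
groups:
  - name: tbot
    rules:
      - alert: TbotTaskIterationsFailing   # hypothetical alert name
        # Fires when any service has recorded failed iterations in the last 15 minutes.
        expr: sum by (service) (increase(tbot_task_iterations_failed[15m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "tbot service {{ $labels.service }} has failing task iterations"
```

A stricter threshold may be appropriate for services where occasional retries are expected.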

## Metrics

### Generic

These metrics are generated by more than one service within `tbot` or may be generated by the core supervisor within `tbot` itself.

| Name                                    | Description                                                                                                                                                                                                                                        |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tbot_task_iterations_total`            | The total number of task iterations that have been performed. This will have a `service` and `name` label to specify which task.                                                                                                                   |
| `tbot_task_iterations_failed`           | The total number of task iterations that have failed. This will have a `service` and `name` label to specify which task.                                                                                                                           |
| `tbot_task_iterations_successful`       | The total number of task iterations that have succeeded. This will have a `service` and `name` label to specify which task. This metric is a histogram, and will also indicate the number of retries that were required before the task succeeded. |
| `tbot_task_iterations_duration_seconds` | The duration of the time taken to perform an iteration of the task. This will have a `service` and `name` label to specify which task. This metric is a histogram.                                                                                 |

### `ssh-multiplexer`

These metrics are generated by the SSH multiplexer service.

| Name                                          | Description                                                                                                                                                                   |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tbot_ssh_multiplexer_requests_started_total` | The total number of SSH multiplexing requests that have been started.                                                                                                         |
| `tbot_ssh_multiplexer_requests_handled_total` | The total number of SSH multiplexing requests that have completed. The `status` label indicates whether the request completed successfully (`OK`) or with an error (`ERROR`). |
| `tbot_ssh_multiplexer_requests_in_flight`     | The number of SSH multiplexing requests that are currently in progress.                                                                                                       |
