Diagnosing and troubleshooting issues when writing metrics via the Remote API
The most common issue is that data sent to remote storage lags behind data written to the WAL. This usually happens after you restart Prometheus or reload its configuration.
If you are experiencing a lag when writing metrics via the Remote API:
- Analyze the collected metrics to find what causes the lag.
- Try using one of the methods for troubleshooting common issues.
Diagnosing issues when writing metrics via the Remote API
If you suspect that your data goes to remote storage with a lag:
- Check if the prometheus_wal_watcher_current_segment and prometheus_tsdb_wal_segment_current metrics match. If they do not, the data sent to remote storage lags behind the new data written to the WAL.
- Compare prometheus_remote_storage_queue_highest_sent_timestamp_seconds to prometheus_remote_storage_highest_timestamp_in_seconds. This will give you an idea of how big the lag is.
- If there is a lag, compare prometheus_remote_storage_shards to prometheus_remote_storage_shards_max. If their values are equal, the remote write uses the maximum possible number of shards.
- Compare prometheus_remote_storage_shards_desired to prometheus_remote_storage_shards_max. If the desired number of shards is greater than the maximum, the current maximum number of shards is not enough for an efficient remote write operation.
- Analyze the prometheus_remote_storage_samples_retried_total metric values. Long-term high readings of this metric may indicate network or remote storage issues. In this case, you may need to reduce the remote write throughput to reduce the load.
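The PromQL expressions below show one way to run these checks from the expression browser. They are a sketch: depending on your setup, you may need to adjust the label matching, since the queue metrics carry remote_name and url labels.

    # Approximate remote write lag in seconds.
    prometheus_remote_storage_highest_timestamp_in_seconds
      - ignoring(remote_name, url) group_right
        prometheus_remote_storage_queue_highest_sent_timestamp_seconds

    # Is remote write already using the maximum possible number of shards?
    prometheus_remote_storage_shards == prometheus_remote_storage_shards_max

    # Per-second rate of resend attempts over the last five minutes.
    rate(prometheus_remote_storage_samples_retried_total[5m])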
Troubleshooting common issues when writing metrics via the Remote API
You can configure the remote write operation by changing the Prometheus configuration parameters in the queue_config section inside the remote_write section.
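For reference, a remote_write block with an explicit queue_config might look like the sketch below. The URL is a placeholder, and the values are illustrative rather than the defaults of any particular Prometheus version.

    remote_write:
      - url: "https://remote-storage.example.com/api/v1/write"  # placeholder endpoint
        queue_config:
          min_shards: 1               # minimum number of shards
          max_shards: 50              # maximum number of shards
          max_samples_per_send: 2000  # samples per data packet
          capacity: 10000             # per-shard buffer size
          batch_send_deadline: 5s     # send deadline for each shard
          min_backoff: 30ms           # minimum delay before resending
          max_backoff: 5s             # maximum delay before resending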
Non-optimal limits on the number of shards
Prometheus remote write uses sharding to increase data throughput. Data is distributed across a set of shards, the number of which is calculated from recent throughput and the amount of incoming data. If remote write cannot use the optimal number of shards, this may cause delays in sending data to remote storage.
Changing the max_shards parameter allows you to increase or decrease the maximum data transfer throughput to remote storage. If there is a delay, check if the desired number of shards exceeds the maximum number of shards. If so, you should increase max_shards in the Prometheus configuration.
The min_shards parameter is responsible for the minimum number of shards used by a remote write operation. Usually there is no need to change it. The exception is when your remote write operation always uses a particular minimum number of shards; in that case, increasing the min_shards parameter can improve recovery time after a Prometheus restart.
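For example, if prometheus_remote_storage_shards_desired stays above prometheus_remote_storage_shards_max, you might raise the limits along these lines (the numbers are illustrative and depend on your throughput):

    queue_config:
      max_shards: 100  # raise when the desired number of shards exceeds the current maximum
      min_shards: 4    # optionally raise if remote write always runs with at least this many shards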
Insufficient buffer size
Before it reaches remote storage, WAL data is sharded, and each shard then sends its data to remote storage. If the memory buffer of one of the shards fills up, Prometheus suspends WAL reads for all shards.
The capacity parameter is responsible for the shard buffer size. The recommended capacity value is between three and ten times the max_samples_per_send value. You can raise the capacity parameter to increase each shard's throughput, but if set too high, it can lead to excessive memory consumption.
If the network load is high, try increasing capacity while simultaneously decreasing max_shards. This way you can reduce the network load without affecting throughput.
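A sketch of that trade-off, assuming max_samples_per_send stays at 2000 (all numbers illustrative):

    queue_config:
      max_samples_per_send: 2000
      capacity: 10000  # five times max_samples_per_send, within the three-to-ten-times guideline
      max_shards: 20   # fewer shards, each with a larger buffer, to reduce network load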
Deadline for sending data to remote storage is too short
There is a send deadline for data that goes to remote storage. Once this deadline is reached, the data is sent even if the data packet is not completely full. You can change the deadline via the batch_send_deadline parameter, which sets the maximum time between data send operations for each shard.
Generally, you do not want to change this setting, but it can be useful for diagnosing some remote write problems. If the deadline gets missed on a regular basis, this may indicate that the throughput is too high given the current configuration.
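If you do adjust it while debugging, the parameter lives in the same queue_config block (the value below is illustrative):

    queue_config:
      batch_send_deadline: 10s  # maximum wait before a partially filled data packet is sent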
Data packet size is too small
If the maximum number of points in a data packet is too small, it may cause the network to get overloaded with many small data packets. If this problem occurs, increase the max_samples_per_send parameter. This will increase throughput for both shards and data packets.
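For example, you might double the packet size while keeping capacity within the three-to-ten-times guideline discussed above (values illustrative):

    queue_config:
      max_samples_per_send: 4000  # larger data packets, fewer network round trips
      capacity: 20000             # kept at five times max_samples_per_send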
Too many retries to send data to remote storage
If a recoverable error occurs when sending data to remote storage, an attempt to resend will be made. To avoid data loss, the resend attempts continue indefinitely. The min_backoff and max_backoff parameters are responsible for the minimum and maximum delays before resending.
If the resend attempts are too frequent, they may overload the network. You can diagnose this condition using the prometheus_remote_storage_samples_retried_total metric. In that case, increase the maximum delay before resending by raising the max_backoff parameter.
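A sketch of such an adjustment (durations illustrative; the defaults in your Prometheus version may differ):

    queue_config:
      min_backoff: 30ms  # delay before the first resend attempt
      max_backoff: 30s   # upper bound on the growing delay between resend attempts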