Troubleshooting issues when writing metrics via the Remote API
The most common issue is that data sent to remote storage lags behind data written to the WAL. This usually happens after you restart Prometheus or reload the configuration.
If you are experiencing a lag when writing metrics via the Remote API:
- Analyze the collected metrics to find what causes the lag.
- Try using one of the methods for troubleshooting common issues.
Diagnosing issues when writing metrics via the Remote API
If you suspect that your data goes to remote storage with a lag:
- Check if the prometheus_wal_watcher_current_segment and prometheus_tsdb_wal_segment_current values match. If they do not, it means the data sent to remote storage lags behind the new data written to the WAL.
- Compare prometheus_remote_storage_queue_highest_sent_timestamp_seconds to prometheus_remote_storage_highest_timestamp_in_seconds. This will give you an idea of how big the lag is.
- If there is a lag, check the prometheus_remote_storage_shards and prometheus_remote_storage_shards_max values. If they match, the remote write is using the maximum possible number of shards.
- Compare prometheus_remote_storage_shards_desired to prometheus_remote_storage_shards_max. If the number of shards you need (prometheus_remote_storage_shards_desired) is greater than the maximum (prometheus_remote_storage_shards_max), the current maximum number of shards is not enough for efficient remote writing.
- Analyze the prometheus_remote_storage_samples_retried_total values. Long-term high readings of this metric may indicate network or remote storage issues. In this case, you may need to reduce the remote write throughput to reduce the load.
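If you already scrape Prometheus's own metrics, the lag comparison above can also be codified as an alerting rule. The sketch below is only an illustration: the group name, alert name, threshold, and duration are assumptions, not values prescribed by the Remote API.

```yaml
groups:
  - name: remote-write-lag            # illustrative group name
    rules:
      - alert: RemoteWriteBehind      # hypothetical alert name
        # Seconds between the newest sample written to the WAL and the newest
        # sample successfully sent to remote storage, per remote write queue.
        expr: >
          prometheus_remote_storage_highest_timestamp_in_seconds
          - ignoring(remote_name, url) group_right
          prometheus_remote_storage_queue_highest_sent_timestamp_seconds
          > 120
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Remote write lags behind the WAL by more than two minutes.
```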
Troubleshooting common issues when writing metrics via the Remote API
To configure remote writing, update the Prometheus configuration parameters in the queue_config section under remote_write.
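All of the parameters discussed below live in this part of the configuration file. The fragment is a sketch: the endpoint URL is a placeholder, and the values shown are illustrative rather than recommended defaults (defaults differ between Prometheus versions).

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      # Parameters described in the sections below; values are illustrative.
      min_shards: 1
      max_shards: 50
      capacity: 10000
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
```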
Suboptimal shard allocation limits
Prometheus remote writes use sharding to increase data throughput. Data is distributed across shards, and the number of these shards is calculated based on recent throughput and incoming data size. If a remote write cannot use the optimal number of shards, this may cause delays in sending data to remote storage.
Updating max_shards allows you to increase or decrease the maximum throughput for data transfer to remote storage. If there is a delay, check if the number of shards you need exceeds the maximum value. If so, you should increase max_shards in the Prometheus configuration.
The min_shards parameter defines the minimum number of shards for remote writes. Usually, there is no need to change it. However, if your remote writes always use a specific minimum number of shards, increasing the min_shards value may help reduce recovery time after restarting Prometheus.
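For example, if prometheus_remote_storage_shards_desired consistently exceeds prometheus_remote_storage_shards_max, you might raise max_shards roughly as follows; the values are illustrative and should be chosen based on the observed desired shard count.

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      max_shards: 100   # raise above the observed prometheus_remote_storage_shards_desired
      min_shards: 4     # optional: can shorten catch-up time after a Prometheus restart
```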
Insufficient buffer size
Before data is sent to remote storage, it is distributed across shards, and each shard sends its portion independently. If the memory buffer of one of the shards gets full, Prometheus will suspend WAL reads for all shards.
The capacity parameter sets the shard buffer size. The recommended capacity value should be between three and ten times the max_samples_per_send value. You can raise the capacity value to increase the throughput of each shard; however, setting it too high may cause excessive memory consumption.
If the network load is high, try increasing capacity while simultaneously decreasing max_shards. This way you can reduce the network load without affecting the throughput.
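A sketch of that trade-off with illustrative values: a larger per-shard buffer combined with a lower shard cap keeps overall throughput while reducing the number of concurrent senders.

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      capacity: 20000            # larger per-shard buffer, here 10x max_samples_per_send
      max_shards: 25             # fewer concurrent shards to ease the network load
      max_samples_per_send: 2000
```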
Remote storage send deadline is too short
There is a send deadline for data going to remote storage. Once this deadline is reached, the data will be sent even if the data packet is not full. You can change the deadline via the batch_send_deadline parameter, which sets the maximum time between data sends for each shard.
Generally, you should avoid modifying this parameter; however, it can be helpful for diagnosing certain remote write issues. If the deadline gets missed on a regular basis, this may indicate that the throughput is too high for the current configuration.
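If you do need to adjust it, the parameter takes a duration in the queue_config section; the value below is only an example.

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      batch_send_deadline: 10s   # example: send at least every 10 seconds, even if the batch is not full
```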
Data packet size is too small
If the maximum number of points in a data packet is too small, it may cause the network to get overloaded with many small data packets. If this issue occurs, increase the max_samples_per_send value. This will increase the throughput for both shards and data packets.
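For example, you might raise max_samples_per_send and keep capacity within the recommended three-to-tenfold band; the values are illustrative.

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      max_samples_per_send: 5000   # larger batches, fewer small packets on the network
      capacity: 25000              # kept at 5x max_samples_per_send
```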
Too many retries to send data to remote storage
If a recoverable error occurs when sending data to remote storage, it will trigger a retry. To avoid data loss, retries will continue indefinitely. The min_backoff and max_backoff parameters define the minimum and maximum delays before a retry.
Too frequent retries may lead to network overload. You can diagnose this condition using the prometheus_remote_storage_samples_retried_total metric. If this is the case, increase the max_backoff value to extend the maximum delay before a retry.
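For example, raising max_backoff spaces retries further apart when the remote endpoint keeps failing; the durations below are illustrative.

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # placeholder endpoint
    queue_config:
      min_backoff: 100ms   # initial delay before the first retry
      max_backoff: 30s     # cap on the backoff between subsequent retries
```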