About Monium
Monium is a platform you can use to monitor and analyze Yandex Cloud services or your own infrastructure and applications.
System monitoring and analysis are based on collecting telemetry, visualizing it on dashboards, and setting up alerts for automated notification of issues and anomalies. A unified tool helps you navigate from anomalies on a chart to error logs and traces of specific requests, allowing you to quickly identify the root cause of incidents.
Telemetry types
Monium collects the following types of telemetry from applications and Yandex Cloud resources:
- Metrics: Numerical indicators measured over time (e.g., RPS, CPU load). Used for charts and alerts.
  Yandex Cloud resource metrics are sent to Monium automatically. Many services also provide dashboards with ready-made widgets showing the status of your cloud resources.
- Logs: Structured records of events in an application or infrastructure (e.g., start messages, error messages). Used for system diagnostics.
  To collect Yandex Cloud resource logs, you will need to enable logging when creating or modifying a resource.
- Traces: Linked chain of operations for a specific request, showing the path and execution time of each step. Used to monitor distributed systems.
Telemetry transmission
Telemetry transmission is pre-configured for Yandex Cloud resources, but you need to manually configure the delivery of data from your applications and third-party infrastructure.
Monium accepts data in the OpenTelemetry (OTLP) format.
You can use the following for telemetry transmission:
- OpenTelemetry-compatible agents, e.g., OTel Collector for all telemetry types (recommended) or Fluent Bit for logs and metrics.
- Unified Agent, Yandex's data collection and delivery agent (currently works with metrics only).
  Note: Support for the OpenTelemetry format for metrics, logs, and traces will be added to Unified Agent later.
- Sending directly from the application via the OpenTelemetry SDK.
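To give a rough idea of the data an application hands over when sending directly, here is a minimal sketch that builds a metrics payload following the general shape of the OTLP/JSON encoding, using only the Python standard library. In practice the OpenTelemetry SDK or a collector builds and sends this for you. The service name, metric name, and scope name are hypothetical, and the sketch stops before sending because the Monium endpoint and authentication are not covered in this section.

```python
import json
import time


def build_gauge_payload(service_name, metric_name, value, time_unix_nano=None):
    """Build an OTLP/JSON-style metrics payload with a single gauge data point."""
    if time_unix_nano is None:
        time_unix_nano = time.time_ns()
    return {
        "resourceMetrics": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeMetrics": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "metrics": [{
                    "name": metric_name,
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": float(value),
                            # 64-bit integers are written as strings in OTLP/JSON
                            "timeUnixNano": str(time_unix_nano),
                        }]
                    },
                }],
            }],
        }]
    }


# Hypothetical service and metric names, fixed timestamp for reproducibility
payload = build_gauge_payload(
    "checkout", "cpu.load", 0.42, time_unix_nano=1_700_000_000_000_000_000
)
body = json.dumps(payload)  # request body an exporter would send
```

The resource attributes describe the sender, and each data point carries a value and a timestamp.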
For metric collection, Prometheus integration is supported via Yandex Managed Service for Prometheus®.
More observability tools will be added to the platform in the future.
Telemetry distribution
Monium achieves logical separation of telemetry data using the following concepts:
- Project: Top-level logical entity. You can use projects to aggregate telemetry from associated applications and microservices and restrict data access for development teams. Some examples include an online store, billing system, or security services.
- Cluster: Allows you to isolate an environment or independent service installations, e.g., production and test clusters, clusters in different regions.
- Service: Standalone client application generating telemetry data. This may be a microservice or its component, e.g., Nginx, Envoy, or Compute Cloud VM instance.
- Shard: Container for data of a specific service-cluster pair and data storage settings, e.g., TTL.
The project, cluster, and service objects define the data source, and the shard defines the storage rules.
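The relationship between these objects can be sketched as a small data model. This is an illustration only, not Monium's API: the class names, the `shard_for` helper, the TTL values, and the assumption that a shard is scoped to a project are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Source:
    """Identifies where telemetry comes from."""
    project: str
    cluster: str
    service: str


@dataclass(frozen=True)
class Shard:
    """Storage settings for one service-cluster pair (assumed to live in a project)."""
    project: str
    cluster: str
    service: str
    ttl: timedelta  # how long the data is kept


def shard_for(source, shards):
    """Return the shard matching the source's service-cluster pair, or None."""
    for shard in shards:
        if (shard.project, shard.cluster, shard.service) == (
            source.project, source.cluster, source.service
        ):
            return shard
    return None


# Hypothetical shards: production data is kept longer than test data
shards = [
    Shard("online-store", "production", "nginx", ttl=timedelta(days=30)),
    Shard("online-store", "test", "nginx", ttl=timedelta(days=7)),
]
source = Source("online-store", "test", "nginx")
match = shard_for(source, shards)
```

Here the same service in two clusters gets two shards, so each environment can have its own storage settings.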
Find the description of other Monium objects and terms in the Basic terms section.
Platform feature overview
The platform provides a full cycle of telemetry management, from data collection to visualization and alerting.
Data delivery
The platform supports flexible configuration of telemetry delivery:
- Automatic collection of metrics for Yandex Cloud resources.
- Integration with applications via OpenTelemetry.
- Support for Prometheus via Yandex Managed Service for Prometheus®.
Learn more about telemetry transmission.
Metrics
Metrics are real-time numerical indicators of system performance. They are commonly used for:
- Monitoring CPU, memory, and network usage.
- Analyzing trends and performance.
- Detecting anomalies and bottlenecks.
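As a simple illustration of a metric, the sketch below derives requests per second (RPS) from raw request timestamps. The timestamps and the `requests_per_second` helper are made up for the example; a real application would normally rely on a metrics library or agent to do this.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests in each whole-second bucket (timestamps in seconds, non-negative)."""
    counts = Counter(int(ts) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical request arrival times, in seconds
timestamps = [100.1, 100.5, 100.9, 101.2, 103.0, 103.7]
rps = requests_per_second(timestamps)
```

Each key is a second and each value is the number of requests that arrived during it, which is the kind of time series a chart or alert works with.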
Logs
Logs are structured records of events and messages that help you:
- Investigate specific incidents in detail.
- Analyze errors and exceptions.
- Audit user and system activity.
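A minimal sketch of producing structured log records with Python's standard `logging` module is shown below. The field names (`level`, `logger`, `message`) and the `payments` logger are assumptions chosen for the example, not a format required by Monium.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stands in for stdout or a file that an agent would read
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our stream only

logger.error("charge failed for order %s", "A-17")
entry = json.loads(stream.getvalue())
```

Because every record is a JSON object with named fields, records can be filtered by level or searched by message instead of being parsed as free text.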
Traces
Traces are visual representations of request paths in distributed applications, which enable you to:
- Identify bottlenecks in microservice chains.
- Analyze latency between system components.
- Understand dependencies in complex architectures.
- Examine requests and responses during LLM agent monitoring.
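To show how a trace can point to a bottleneck, the sketch below models a trace as a list of spans and computes the time spent in each span excluding its direct children. The `Span` class, span names, and timings are hypothetical, and the calculation assumes that child spans of the same parent do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span, excluding the time of its direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


# Hypothetical trace of a single checkout request
spans = [
    Span("a", None, "GET /checkout", 0, 120),
    Span("b", "a", "auth", 5, 25),
    Span("c", "a", "charge-card", 30, 110),
    Span("d", "c", "db.insert", 40, 55),
]
times = self_times(spans)
bottleneck = max(spans, key=lambda s: times[s.span_id])
```

In this example the root span takes the longest overall, but most of that time belongs to `charge-card`, which is the step worth investigating.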
Alerts
Alerts are automated notifications for critical events, for which you can set up the following:
- Event trigger rules, e.g., a sudden change in a metric.
- Notifications via messaging apps, email, phone call, or cloud function.
Alerts help you respond to issues before they affect users or at least minimize that impact.
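A trigger rule such as "a sudden change in a metric" can be sketched as follows. This is only an illustration of the idea, not how Monium evaluates alerts: the `sudden_change` function, the 50% relative threshold, and the sample values are assumptions.

```python
def sudden_change(values, threshold=0.5):
    """Return indices where a value differs from the previous one
    by more than `threshold` as a fraction of the previous value."""
    hits = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev == 0:
            continue  # relative change is undefined for a zero baseline
        if abs(cur - prev) / abs(prev) > threshold:
            hits.append(i)
    return hits


# Hypothetical CPU load samples taken at regular intervals
cpu_load = [0.30, 0.32, 0.31, 0.80, 0.78]
alerts = sudden_change(cpu_load)
```

Only the jump from 0.31 to 0.80 exceeds the threshold, so a single notification would fire for the fourth sample.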
Visualization
Create dashboards for system monitoring:
- Bring together metrics, logs, and traces in a single dashboard.
- Use ready-made service dashboards for Yandex Cloud resources.
- Customize charts, tables, and other widgets.
- Drill down to analyze issues in detail.
Data integration
Link different types of telemetry for comprehensive analysis:
- Navigate from metrics to logs and traces within a single interface.
- Use trace_id and span_id to link logs to traces.
- Leverage data from multiple sources to analyze incidents.
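Linking logs to traces comes down to each log record carrying the identifiers of the trace and span it was written in. The sketch below groups log records by `trace_id` and then narrows them to one span. The record layout and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def logs_by_trace(log_records):
    """Index log records by trace_id, skipping records without one."""
    index = defaultdict(list)
    for record in log_records:
        trace_id = record.get("trace_id")
        if trace_id:
            index[trace_id].append(record)
    return index


# Hypothetical structured log records
logs = [
    {"message": "request started", "trace_id": "t1", "span_id": "s1"},
    {"message": "card declined", "trace_id": "t1", "span_id": "s2"},
    {"message": "health check ok"},
    {"message": "request started", "trace_id": "t2", "span_id": "s9"},
]
index = logs_by_trace(logs)

# All log messages written inside span "s2" of trace "t1"
span_logs = [r["message"] for r in index["t1"] if r["span_id"] == "s2"]
```

With this kind of index, opening a slow or failed span in a trace leads straight to the log lines written while that span was running.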