Smart log processing
Apps generate logs to enable diagnostics. However, it is not enough just to have logs for analysis: you also need a convenient way to store and process them. For this purpose, logs are sent to storage systems such as Hadoop or ClickHouse®, or to specialized cloud services such as Cloud Logging.
Applications do not usually write logs to storage systems directly. Instead, they send them to intermediate aggregator applications. These aggregators can receive logs in many ways: by intercepting stdout/stderr streams, reading log files from disk, or accepting them via syslog or over HTTP.
After receiving logs, aggregator applications accumulate them and then send them to different targets using plugins. This approach lets application developers focus on writing code and delegate log delivery to dedicated systems.
Standard log delivery systems are fluentd and Logstash.
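To make the accumulate-and-forward pattern concrete, here is a toy sketch in Python. It is not how fluentd or Logstash are implemented; the sink interface, batch size, and flush delay are illustrative assumptions:

```python
import sys
import time

class Sink:
    """Illustrative output plugin interface; real aggregators ship many of these."""
    def write_batch(self, records: list[str]) -> None:
        raise NotImplementedError

class StdoutSink(Sink):
    def write_batch(self, records: list[str]) -> None:
        for record in records:
            print(record)

def aggregate(source, sinks: list[Sink], max_batch: int = 100, max_delay: float = 5.0):
    """Accumulate log records, then flush them to every configured sink."""
    batch, last_flush = [], time.monotonic()
    for line in source:
        batch.append(line.rstrip("\n"))
        # Flush when the batch is large enough or enough time has passed.
        if len(batch) >= max_batch or time.monotonic() - last_flush >= max_delay:
            for sink in sinks:
                sink.write_batch(batch)
            batch, last_flush = [], time.monotonic()
    if batch:  # flush the remaining tail
        for sink in sinks:
            sink.write_batch(batch)

if __name__ == "__main__":
    # Read log lines from stdin, as an aggregator intercepting an app's output would.
    aggregate(sys.stdin, [StdoutSink()])
```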
Though aggregator applications can write data to storage systems directly, for higher reliability the data is first sent to an intermediate buffer (a data streaming bus or message broker), in this case Yandex Data Streams, and only then delivered from the buffer to a storage system.
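Because Data Streams supports the Kinesis Data Streams API (see the setup steps below), an aggregator, or any application, can write to the bus through a standard Kinesis client. A minimal sketch in Python with boto3; the endpoint URL, stream name, and partition key are placeholders to be taken from your own Data Streams configuration:

```python
import json
import boto3

# Data Streams exposes a Kinesis-compatible API, so a standard Kinesis client works.
# The endpoint URL and stream name below are placeholders for your own values.
client = boto3.client(
    "kinesis",
    endpoint_url="https://example-data-streams-endpoint",  # placeholder
    region_name="ru-central1",
)

def send_log(record: dict) -> None:
    """Send one log record to the data bus."""
    client.put_record(
        StreamName="example-stream",  # placeholder
        Data=json.dumps(record).encode(),
        PartitionKey=record.get("host", "default"),  # same key keeps record order
    )

send_log({"host": "web-1", "level": "INFO", "message": "request handled"})
```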
Logs often contain too much data or include restricted information. You can mask confidential data or drop unnecessary fields by adding extra processing steps, e.g., in Cloud Functions.
Benefits
Reliability
To increase reliability, an application only needs its log aggregator configured to deliver data to the bus as quickly as possible; the bus then ensures reliable storage of the data until it is processed and written to the storage systems.
Multiple storage systems
The same logs are often stored in multiple storage systems at once: in ClickHouse® for fast analysis and in Object Storage for long-term storage. To implement this, you can set up your aggregator applications so that they send two data streams: one to ClickHouse® and the other one to Object Storage.
With a data bus, this is even simpler: send the log to the bus once, then run two data transfer processes from it inside Yandex Cloud. This solution also lets you add a third storage system, such as Greenplum® or Elasticsearch, at any time.
The multiple storage system approach is very convenient for ensuring compliance with FZ-152, PCI DSS, and other standards that require storing logs for at least one year. In that case, the last month's logs can go to one storage system for quick access, and the rest of the data to cold storage in Object Storage for long-term retention.
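In Yandex Cloud the parallel deliveries are run by Data Transfer, so no consumer code is needed. Purely to illustrate that one stream can feed several independent readers, here is a sketch of such a reader using the Kinesis-compatible API; the endpoint and stream name are again placeholders:

```python
import boto3

# Placeholder endpoint; each storage target gets its own independent consumer.
client = boto3.client(
    "kinesis",
    endpoint_url="https://example-data-streams-endpoint",
    region_name="ru-central1",
)

def read_stream(stream_name: str) -> None:
    """Read one shard from the start; each consumer tracks its own position."""
    shard_id = client.list_shards(StreamName=stream_name)["Shards"][0]["ShardId"]
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
    )["ShardIterator"]
    while iterator:
        response = client.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            print(record["Data"])  # a real consumer would write to its storage system
        if not response["Records"] and response.get("MillisBehindLatest", 0) == 0:
            break  # caught up with the stream
        iterator = response.get("NextShardIterator")
```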
Masking data and processing logs
Some logs should not be accessible to all employees. For example, logs may contain users' personal data, access to which must be restricted.
Transmitted logs can be sent to Cloud Functions, where they can be masked or processed in any other way.
Once processed, the logs can be sent to multiple target systems at once: access to logs with masked personal data can be granted to all employees, while access to the full logs is granted to administrators only.
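As an illustration, here is a minimal masking sketch in Python. The field handling and regex patterns are assumptions for the example, and the handler only follows the general Cloud Functions `handler(event, context)` convention, not the exact event contract of any particular trigger or transfer:

```python
import json
import re

# Patterns for values that must not reach the general-access log storage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{8,}\d")

def mask_record(record: dict) -> dict:
    """Replace personal data in every string field with a fixed placeholder."""
    masked = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = EMAIL_RE.sub("***", value)
            value = PHONE_RE.sub("***", value)
        masked[key] = value
    return masked

def handler(event, context):
    # Hypothetical event layout: a list of JSON-encoded log records under "messages".
    # Check the actual contract of your trigger or transfer before reusing this.
    records = [json.loads(m) for m in event.get("messages", [])]
    return [mask_record(r) for r in records]
```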
Setup
To configure smart log processing:
- Create a Data Streams data stream.
- Set up a log aggregator: fluentd, Logstash, or any other aggregator that supports the Kinesis Data Streams API.
- Set up Yandex Data Transfer to move the data to the selected storage system. An example of setting up data delivery from Data Streams is given in the tutorial on saving data to ClickHouse®.
- Connect an arbitrary data processing function to Yandex Data Transfer. The example shows sample function code; a masking sketch is also given in the section above.
ClickHouse® is a registered trademark of ClickHouse, Inc.