Smart log processing
Apps generate logs to enable diagnostics. However, these logs alone are not enough for analysis: you need to be able to store and process them in a convenient way. This is why logs go to storage systems, such as Hadoop, ClickHouse®, or specialized cloud systems like Cloud Logging.
Applications do not usually write logs to storage systems directly. Instead, they send them to intermediate log aggregators. These aggregators can capture logs from stdout/stderr, read log files from disk, get them via syslog or over HTTP, and in many other ways.
After receiving logs, aggregators buffer them and then send them to different targets via plugins. This approach enables app developers to stay focused on coding while delegating log delivery to dedicated systems.
The standard log delivery systems are fluentd and logstash.
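Below is a minimal fluentd configuration sketch of this flow: it tails an application log file and flushes buffered records to an output plugin. The file paths and tag are hypothetical placeholders, and the file output stands in for whatever target plugin you actually use.

```
# Capture: tail an application log file (paths are hypothetical)
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  <parse>
    @type json
  </parse>
</source>

# Deliver: buffer records and flush them to an output plugin
<match app.logs>
  @type file            # stand-in target; any output plugin works here
  path /var/log/fluentd/out/app
  <buffer>
    flush_interval 5s   # flush buffered chunks every 5 seconds
  </buffer>
</match>
```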
Aggregators can write data directly to storage systems; however, for greater reliability, the data first goes to an intermediate buffer (a data streaming bus or message broker), such as Yandex Data Streams, and only from there to the storage systems.
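Since Data Streams supports the Kinesis Data Streams API (see the Setup section below), an application or aggregator plugin can write to the bus with any Kinesis client. Here is a minimal Python sketch using boto3; the endpoint URL, stream name, and partition key are placeholder assumptions, and credentials are read from the environment.

```python
# Minimal sketch: write one log record to a Kinesis-compatible data bus.
# The endpoint and stream name are placeholders; substitute the values
# for your stream. Credentials come from the standard AWS env variables.
import json

import boto3

client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless.yandexcloud.net",  # assumed endpoint
    region_name="ru-central1",
)

record = {"level": "INFO", "message": "user signed in", "ts": "2024-05-01T12:00:00Z"}

client.put_record(
    StreamName="/ru-central1/<folder_id>/<database_id>/app-logs",  # placeholder
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey="app-logs",  # records with the same key go to the same shard
)
```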
Logs often contain excessive data or information with restricted access. You can mask irrelevant or confidential information by adding extra processing steps, e.g., in Cloud Functions.
Benefits
Reliability
To increase reliability, the application only needs its log aggregator configured to deliver data to the bus as quickly as possible; the bus then ensures reliable data storage until the data is processed and written to the storage systems.
Multiple storage systems
The same logs are often stored in multiple storage systems at once: ClickHouse® for fast analytics and Object Storage for long-term retention. To implement this, you can set up your aggregators to send two data streams: one to ClickHouse® and the other to Object Storage.
Using a data bus makes this easier: you send each log record to the bus only once and, from there, run two data transfer processes within Yandex Cloud. This solution also lets you add a third storage system, such as Greenplum® or Elasticsearch, at any time.
This multi-storage approach is also convenient for complying with FZ-152, PCI DSS, and other standards that require logs to be retained for at least a year. In that case, the last month's logs go to a quick-access storage system, while the rest of the data is sent to long-term "cold" storage in Object Storage.
Masking data and processing logs
Access to logs usually varies across employees: for example, certain logs may include users' personal data, access to which must be restricted.
You can send logs to Cloud Functions for masking or any other processing you need.
Once processed, the logs can be sent to several target systems at once: all employees can be granted access to the logs with masked personal data, while only administrators get access to the full logs.
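As an illustration, here is a minimal Python sketch of a masking step written in the style of a Cloud Functions handler. The event shape and field names are assumptions made for illustration; the GitHub example referenced in the Setup section below shows actual function code.

```python
# Minimal masking sketch for a Cloud Functions-style handler.
# The event shape and field names are illustrative assumptions,
# not the actual Data Transfer contract.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(record: dict) -> dict:
    """Replace personal data in a log record with placeholders."""
    masked = dict(record)
    if "user_email" in masked:
        masked["user_email"] = "***"
    if "message" in masked:
        # Scrub email-like substrings from free-form message text
        masked["message"] = EMAIL_RE.sub("***@***", masked["message"])
    return masked

def handler(event, context):
    # Assume the event carries a list of JSON-encoded log records
    records = [json.loads(m) for m in event.get("messages", [])]
    return [mask(r) for r in records]
```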
Setup
To configure smart log processing:
- Create a data stream in Data Streams.
- Set up a log aggregator: fluentd, logstash, or any other aggregator that supports the Kinesis Data Streams API. A sample fluentd output configuration is shown after this list.
- Configure Yandex Data Transfer to transfer data to the selected storage system. For an example of setting up data delivery from Data Streams, see the tutorial on how to save data to ClickHouse®.
- Connect any data processing function to Yandex Data Transfer. This GitHub example illustrates the function code.
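For the aggregator step, here is a sketch of a fluentd output section that forwards buffered records to a stream over the Kinesis Data Streams API via the fluent-plugin-kinesis plugin. The endpoint, stream name, and credential values are placeholder assumptions; substitute the identifiers for your stream and the static access key of your service account.

```
# Forward buffered logs to a Kinesis-compatible stream (values assumed)
<match app.logs>
  @type kinesis_streams
  # Kinesis-compatible endpoint of the data bus
  endpoint https://yds.serverless.yandexcloud.net
  region ru-central1
  # Full stream name is a placeholder; use your stream's identifier
  stream_name /ru-central1/<folder_id>/<database_id>/app-logs
  aws_key_id <access_key_id>
  aws_sec_key <secret_access_key>
  <buffer>
    flush_interval 5s
  </buffer>
</match>
```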
ClickHouse® is a registered trademark of ClickHouse, Inc.