Ingesting data into storage systems
Mobile phones, smart devices, and external services are increasingly replacing application components as data sources.
Such sources supply data as a massive number of small batches. The communication channels they use are often slow, and connection time may be limited. Under these conditions, you want to save incoming data quickly; processing can wait until later. That is why the data is first sent to a data streaming bus and then pulled from there for processing.
Acting like a streaming bus, Yandex Data Streams provides optimal operating conditions for both sources and targets:
- Receives high-frequency and high-speed incoming data without blocking the sources.
- Saves the received data in its own storage.
- Groups data into batches and sends them to the target systems, thus reducing their load.
Benefits
When working with external devices or services, you want to quickly save the data you receive. You can fetch the saved data from Data Streams through direct reads or by setting up data delivery to Yandex Cloud storage systems using Yandex Data Transfer.
Receiving data
Data Streams receives data over HTTP. Using Yandex API Gateway, you can implement any other data ingestion protocol: data received by API Gateway is then forwarded to Data Streams.
Data Streams is highly scalable and can accept data from thousands of data sources at the same time.
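As a sketch, ingestion through the Kinesis-compatible API might look like the following in Python with boto3. The endpoint value, the stream path placeholders, and the `build_record`/`send_record` helper names are illustrative assumptions, not fixed names from the service:

```python
import json

# Placeholder endpoint and stream path; substitute your own values.
ENDPOINT = "https://yds.serverless.yandexcloud.net"
STREAM_NAME = "/ru-central1/<folder_id>/<database_id>/<stream_name>"

def build_record(device_id: str, payload: dict) -> dict:
    """Serialize one device reading into the arguments expected by PutRecord."""
    return {
        "StreamName": STREAM_NAME,
        "Data": json.dumps({"device_id": device_id, **payload}).encode(),
        "PartitionKey": device_id,  # records from one device land in one shard
    }

def send_record(record: dict) -> None:
    """Send a single record; returns as soon as Data Streams persists it."""
    # boto3 is imported lazily so the pure helper above works offline too.
    import boto3
    client = boto3.client("kinesis", endpoint_url=ENDPOINT)
    client.put_record(**record)
```

Because the source only waits for the bus to acknowledge the write, slow downstream storage systems never block the device.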
Reliability
A data streaming bus is a critical infrastructure component, so it must keep working through failures of individual Yandex Cloud components. Data Streams stores ingested data in at least three Yandex Cloud availability zones.
Batching
Data storage and processing systems perform best when data is written in batches. Data batching is most effective at a single point where all your data flows together. Data buses typically serve as that single point.
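A minimal sketch of batching at that single point, assuming a boto3-style Kinesis client: records are grouped into batches of at most 500, the per-request limit of the Kinesis `PutRecords` call (the `chunk` and `flush` helper names are hypothetical):

```python
from typing import Iterable, List

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request

def chunk(records: List[dict], size: int = MAX_BATCH) -> Iterable[List[dict]]:
    """Split a record list into batches no larger than `size`."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def flush(client, stream_name: str, records: List[dict]) -> None:
    """One PutRecords call per batch instead of one PutRecord per record."""
    for batch in chunk(records):
        client.put_records(StreamName=stream_name, Records=batch)
```

Writing 1,200 accumulated records this way costs three API calls rather than 1,200, which is exactly the load reduction the target systems benefit from.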
Rewinding data
Unlike message queues, data buses store data until the retention period expires without deleting the data after it is read. This allows you to move across the stored data in any direction: from the oldest to the most recent. For example, if a new data format appears and gets written to the target system incorrectly, you can rewind the data stored in the bus to the beginning and then reread and rewrite it to the target system correctly.
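In the Kinesis-compatible API, rewinding to the beginning corresponds to requesting a shard iterator of type `TRIM_HORIZON` (the oldest retained record). A sketch, assuming a boto3-style client object:

```python
def rewind_and_read(client, stream_name: str):
    """Re-read a stream from the oldest retained record in every shard."""
    shards = client.list_shards(StreamName=stream_name)["Shards"]
    for shard in shards:
        iterator = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # start at the oldest record
        )["ShardIterator"]
        while iterator:
            resp = client.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                yield record["Data"]
            iterator = resp.get("NextShardIterator")
            if not resp["Records"]:
                break  # caught up with the tip of the shard
```

Re-running this reader and writing its output to the target system is all a "reread and rewrite" repair amounts to; the data in the bus itself is never modified.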
Multiple storage systems
The same data is often stored in multiple storage systems at once: for example, ClickHouse® for fast analytics and Object Storage for long-term storage. Data buses make this easy: since different apps can read the same data concurrently, you can set up delivery of the same data to both ClickHouse® and Object Storage. This solution also lets you add a third storage system, such as Greenplum® or Elasticsearch, at any time.
The multiple-storage approach is also convenient for complying with FZ-152, PCI DSS, and other standards that require retaining data for at least one year. In this case, the last month's data goes to a quick-access storage system, while the rest is sent to long-term "cold" storage in Object Storage.
Masking data and processing logs
Different employees need different levels of access to data. For example, some data may include users' personal information, access to which must be restricted.
You can send the data to Cloud Functions for masking or any additional processing as needed.
Once processed, the data can be sent to multiple target systems at once: all employees can be granted access to the data with personal information masked, while only administrators can access the full data.
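A masking function might look like the sketch below. The `event` shape, the `handler` name, and the `SENSITIVE` field list are illustrative assumptions, not the actual Cloud Functions trigger format; check the trigger documentation for the real message layout:

```python
import hashlib
import json

SENSITIVE = {"phone", "email", "passport"}  # hypothetical field names

def mask(value: str) -> str:
    """Replace a sensitive value with a short, stable one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def handler(event, context=None):
    """Mask sensitive fields in each incoming message (assumed event shape)."""
    masked = []
    for msg in event["messages"]:
        record = json.loads(msg["data"])
        for key in SENSITIVE & record.keys():
            record[key] = mask(record[key])
        masked.append(record)
    return masked
```

Hashing rather than deleting keeps the field usable as a join key (the same phone number always masks to the same token) while hiding the original value.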
Reading data
You can process the stored data programmatically. Data Streams is compatible with the Amazon Kinesis Data Streams API, so you can use the Kinesis SDKs for different programming languages: C++, Java, Go, Python, and more.
Setup
To set up data ingestion into storage systems:
- Create a data stream in Data Streams.
- Configure the AWS SDK.
- Configure Yandex Data Transfer to transfer data to the selected storage system. For an example of setting up data delivery from Data Streams, see the tutorial on how to save data to ClickHouse®.
- Connect any data processing function to Yandex Data Transfer. This GitHub example illustrates the function code. Alternatively, you can use the SDK to read data directly from Data Streams.
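The AWS SDK step above boils down to pointing the Kinesis client at the Data Streams endpoint. A minimal configuration sketch with boto3, where the endpoint, region, and key placeholders are assumptions to verify against the current service settings for your stream:

```python
import boto3

# Credentials come from a Yandex Cloud service account static access key;
# the values below are placeholders, not working examples.
client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless.yandexcloud.net",
    region_name="ru-central1",
    aws_access_key_id="<static_key_id>",
    aws_secret_access_key="<static_key_secret>",
)
```

With this client configured, the usual Kinesis calls (`put_record`, `get_records`, and so on) operate on your Data Streams stream.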
ClickHouse® is a registered trademark of ClickHouse, Inc.