Entering data into storage systems
Mobile phones, various smart devices, and external services are increasingly replacing application components as data sources.
Data from these sources arrives as a large number of small batches, often over slow communication channels and within a limited connection window. Under these conditions, it is important to save the incoming data promptly; processing can be performed later. This is why data is first sent to a data streaming bus and only then picked up from there for processing.
As a data streaming bus, Yandex Data Streams provides operation modes that suit both data sources and targets:
- Accepts incoming data with high frequency and speed without blocking the sources.
- Saves the received data in its own storage.
- Generates data batches and sends them to target systems, reducing the load on them.
Benefits
When working with external devices or services, it is important to save the incoming data promptly. The saved data can be fetched from Data Streams through direct reads or delivered to Yandex Cloud storage systems using Yandex Data Transfer.
Receiving data
Data is transmitted to Data Streams over HTTP. If your sources use a different protocol, you can implement it with Yandex API Gateway: data received by API Gateway can be forwarded to Data Streams as well.
Data Streams provides high scalability and can accept data from thousands of data sources at the same time.
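Since Data Streams exposes an Amazon Kinesis-compatible API (see Reading data below), a source can push records with a standard AWS SDK over HTTP. Below is a minimal sketch in Python with boto3; the endpoint, stream path, and credentials are placeholders to be taken from your own stream settings, and the record payload is an arbitrary example.

```python
import boto3

# Kinesis-compatible client pointed at Data Streams.
# The endpoint and stream path are placeholders: take the real values
# from your stream settings.
client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless-cloud.yandex.net",
    region_name="ru-central1",
    aws_access_key_id="<access key ID>",
    aws_secret_access_key="<secret access key>",
)

# Send a single record; PartitionKey determines the shard it lands in.
client.put_record(
    StreamName="/ru-central1/<folder ID>/<database ID>/<stream name>",
    Data=b'{"device_id": "sensor-1", "value": 23.4}',
    PartitionKey="sensor-1",
)
```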
Reliability
A data streaming bus is an important infrastructure component, so it is resilient to Yandex Cloud failures: data written to Data Streams is saved to at least three Yandex Cloud availability zones.
Batching
Data storage and processing systems are most efficient when data is written to them in batches. The most convenient place to generate batches is a single entry point where all the data arrives, and this is exactly the role a data bus plays.
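On the producer side, the same idea can be applied through the Kinesis-compatible API: many records can be submitted in a single call instead of one request per record. A minimal sketch, assuming credentials are configured in the environment; the endpoint, stream path, and record layout are placeholders.

```python
import json
import boto3

# Credentials are expected in the environment; the endpoint and stream path
# are placeholders from your stream settings.
client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless-cloud.yandex.net",
    region_name="ru-central1",
)

# Accumulate events and write them to the bus as a single batch request.
events = [{"device_id": f"sensor-{i}", "value": i * 1.5} for i in range(100)]
response = client.put_records(
    StreamName="/ru-central1/<folder ID>/<database ID>/<stream name>",
    Records=[
        {"Data": json.dumps(event).encode(), "PartitionKey": event["device_id"]}
        for event in events
    ],
)

# Records rejected by the service should be retried by the producer.
print("Failed records:", response["FailedRecordCount"])
```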
Rewinding data
Unlike message queues, data buses store data until the retention period expires and do not delete it after it is read. This lets you move through the stored data in any direction, from the oldest records to the most recent ones. For example, if a new data format appears and is written to a target system incorrectly, you can rewind the data stored in the bus to the beginning, then reread it and write it to the target system correctly.
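In terms of the Kinesis-compatible API, rewinding comes down to choosing where the shard iterator starts: TRIM_HORIZON reads from the oldest retained record, LATEST reads only new records, and AT_TIMESTAMP starts from a specific moment. A minimal sketch with placeholder stream settings and credentials taken from the environment:

```python
import boto3

client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless-cloud.yandex.net",  # placeholder endpoint
    region_name="ru-central1",
)
stream = "/ru-central1/<folder ID>/<database ID>/<stream name>"  # placeholder

# Rewind the first shard to the beginning of the retained data.
shard_id = client.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = client.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # or "LATEST", or "AT_TIMESTAMP" with Timestamp=...
)["ShardIterator"]
```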
Multiple storage systems
The same data is often stored in multiple storage systems at once: in ClickHouse® for fast analysis and in Object Storage for long-term storage. Data buses make this easy: since different applications can read the same data concurrently, you can set up delivery of the same stream to both ClickHouse® and Object Storage. This solution also lets you add a third storage system, such as Greenplum® or Elasticsearch, at any time.
Using multiple storage systems is also convenient for complying with Federal Law No. 152-FZ, PCI DSS, and other standards that require data to be stored for at least a year. In this case, data for the last month can be sent to one storage system for quick access whenever required, while data that must be kept long-term can be sent to Object Storage cold storage.
Masking data and processing logs
Some data must not be accessible to all employees, for example, data containing users' personal information, access to which must be restricted.
The transmitted data can be sent to Cloud Functions for processing, where it can be masked or transformed in any other way.
Once processed, the data can be sent to multiple target systems at once: the stream with masked personal data can be made available to all employees, while the full data is available to administrators only.
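The masking step itself can be a small function. The sketch below shows only the transformation logic: the record layout and the phone and email field names are assumptions, and how the function is invoked (as a Cloud Functions handler or a transformation in Yandex Data Transfer) follows the corresponding service documentation.

```python
import hashlib
import json

# Hypothetical names of fields that contain personal data.
SENSITIVE_FIELDS = ("phone", "email")

def mask_record(raw: bytes) -> bytes:
    """Replace sensitive fields with an irreversible hash and keep the rest as is."""
    record = json.loads(raw)
    for field in SENSITIVE_FIELDS:
        if field in record:
            digest = hashlib.sha256(record[field].encode()).hexdigest()
            record[field] = "masked:" + digest[:12]
    return json.dumps(record).encode()

# The masked copy goes to the stream available to all employees,
# while the original record goes to the administrators-only stream.
original = b'{"user_id": 42, "phone": "+7 900 000-00-00", "email": "user@example.com"}'
print(mask_record(original))
```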
Reading data
Data stored in Data Streams can be processed programmatically. Data Streams is compatible with the Amazon Kinesis Data Streams API, which lets you use SDKs for different programming languages such as C++, Java, Go, and Python.
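Because the API is Kinesis-compatible, the regular AWS SDK can be pointed at a Data Streams endpoint to inspect a stream before reading it. A minimal sketch in Python with boto3, assuming the DescribeStream operation is available in your setup and using placeholder stream settings:

```python
import boto3

client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless-cloud.yandex.net",  # placeholder endpoint
    region_name="ru-central1",
)

# The stream looks like a regular Kinesis stream to the SDK:
# check its status and list its shards before reading.
info = client.describe_stream(
    StreamName="/ru-central1/<folder ID>/<database ID>/<stream name>"
)["StreamDescription"]
print(info["StreamStatus"], [shard["ShardId"] for shard in info["Shards"]])
```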
Setup
To set up data input to storage systems:
- Create a data stream in Data Streams.
- Set up the AWS SDK.
- Set up Yandex Data Transfer to transfer data to the selected storage system. An example of setting up data delivery from Data Streams is given in the tutorial on how to save data to ClickHouse®.
- Connect an arbitrary data processing function to Yandex Data Transfer (see the function code in the example), or use SDKs to read data directly from Data Streams, as in the sketch after this list.
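A minimal direct-read sketch with boto3, using the same placeholder endpoint and stream path as above; error handling, multi-shard reading, and checkpointing are omitted.

```python
import time
import boto3

client = boto3.client(
    "kinesis",
    endpoint_url="https://yds.serverless-cloud.yandex.net",  # placeholder endpoint
    region_name="ru-central1",
)
stream = "/ru-central1/<folder ID>/<database ID>/<stream name>"  # placeholder

# Read the first shard from the oldest retained record onwards.
shard_id = client.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = client.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    batch = client.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # avoid hot-looping when the shard has no new data
```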
ClickHouse® is a registered trademark of ClickHouse, Inc.