Unified analysis of streaming and analytical data
In this example, you will calculate the cost of taxi rides in specific locations using a single query against analytical and streaming data.
The text of the SQL query used to process both data types is the same: only the connections and data bindings for the bucket and stream are different.
Data for analytical processing was placed in the Yandex Object Storage bucket, in Parquet
In both cases, we use a reference stored in Object Storage to filter our query data.
To run this example:
- Get things ready.
- Analyze the data from Object Storage.
- Analyze the streaming data from Data Streams.
Note
Yandex Cloud provides the New York City taxi trips dataset as is. Yandex Cloud makes no representations, express or implied, warranties, or conditions pertaining to your use of the specified dataset. To the extent allowed by your local laws, Yandex Cloud shall not be liable for any loss or damage, including direct, consequential, special, indirect, incidental, or exemplary, resulting from your use of the dataset.
NYC Taxi and Limousine Commission (TLC):
The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The taxi trip data is not generated by the TLC, and the TLC makes no representations whatsoever about the accuracy of this data.
Take a look at the dataset source
Get started
- Log in or sign up to the management console
. If not signed up yet, navigate to the management console and follow the instructions. - On the Yandex Cloud Billing
page, make sure you have a billing account linked and it has theACTIVE
orTRIAL_ACTIVE
status. If you do not have a billing account yet, create one. - If you do not have a folder yet, create one.
- We will connect to our data stream using a service account. Create a service account named
datastream-connection-account
with theydb.editor
role. - Data streams use Yandex Managed Service for YDB. You will need to create a serverless database.
Analyze the data from Object Storage
Connect to analytical data
-
In the management console
, select the folder where you want to create a connection. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select
Tutorial. -
Under Create infrastructure for tutorial, click Create connection.
A new connection creation page will open. View the default parameter values, but do not edit them.
-
Click Create.
The data binding page will open. View the default parameter values, but do not edit them.
-
Click Create.
Run the query
-
In the query editor in the Query interface, click New analytics query.
-
Enter the query text in the text field:
$data = SELECT * FROM `tutorial-analytics`; $locations = SELECT PULocationID FROM `tutorial-analytics`.`nyc_taxi_sample/example_locations.csv` WITH ( format=csv_with_names, SCHEMA ( PULocationID String ) ); $time = SELECT HOP_END() AS time, rides.PULocationID AS PULocationID, SUM(total_amount) AS total_amount FROM $data AS rides INNER JOIN $locations AS locations ON rides.PULocationID=locations.PULocationID GROUP BY HOP(CAST(tpep_pickup_datetime AS Timestamp?), "PT1M", "PT1M", "PT1M"), rides.PULocationID; SELECT * FROM $time;
-
Click Run.
Review the result
Once the analytical query is complete, you will see the result: distribution of taxi ride costs in specific locations.
# | time | PULocationID | total_amount |
---|---|---|---|
1 | 2017-12-31T22:24:00.000000Z | 120 | 7.54 |
2 | 2018-01-01T00:13:00.000000Z | 120 | 48.8 |
3 | 2018-01-01T03:25:00.000000Z | 120 | 30.8 |
4 | 2018-01-01T11:29:00.000000Z | 120 | 32.88 |
5 | 2018-01-01T15:13:00.000000Z | 120 | 9.8 |
6 | 2018-01-01T22:03:00.000000Z | 120 | 14.8 |
7 | 2018-01-02T19:28:00.000000Z | 120 | 7.3 |
8 | 2018-01-03T10:17:00.000000Z | 120 | 81.3 |
Analyze the Data Streams streaming data
Create a data stream
- In the management console
, select the folder where you need to create a data stream. - Select Data Streams.
- Click Create stream.
- Specify the Yandex Managed Service for YDB database created previously.
- Enter the name of the stream:
yellow-taxi
. - Click Create.
Set up data generation
-
Create a connection.
- In the management console
, select the folder where you want to create a connection. - In the list of services, select Yandex Query.
- In the left-hand panel, select
Tutorial. - Go to Streaming.
- Under Create infrastructure for tutorial, click Create connection.
- In the window that opens, under Connection type parameters, select the database and service account that you created previously.
- Click Create.
- In the management console
-
Create a data binding:
- A page for creating a data binding will open.
- Under Binding parameters, select the
yellow-taxi
stream created previously. - Click Create.
Data generation to the yellow-taxi
stream will start. Use the Stop and Start buttons to control the data generator.
Run the query
-
In the query editor in the Query interface, click New streaming query.
-
Enter the query text in the text field:
$data = SELECT * FROM bindings.`tutorial-streaming`; $locations = SELECT PULocationID FROM `tutorial-analytics`.`nyc_taxi_sample/example_locations.csv` WITH ( format=csv_with_names, SCHEMA ( PULocationID String ) ); $time = SELECT HOP_END() AS time, rides.PULocationID AS PULocationID, SUM(total_amount) AS total_amount FROM $data AS rides INNER JOIN $locations AS locations ON rides.PULocationID=locations.PULocationID GROUP BY HOP(cast(tpep_pickup_datetime AS Timestamp?), "PT1M", "PT1M", "PT1M"), rides.PULocationID; SELECT * FROM $time;
-
Click Run.
Review the result
Once you run the query to the streaming data, you will see the result with the total cost of rides (total_amount
) in specific PULocationID
locations taken after running the query.
# | PULocationID | time | total_amount |
---|---|---|---|
1 | 125 | 2022-02-15T12:03:00.000000Z | 1275.4084 |
2 | 129 | 2022-02-15T12:03:00.000000Z | 1073.0449 |
3 | 126 | 2022-02-15T12:03:00.000000Z | 202.85883 |
4 | 121 | 2022-02-15T12:03:00.000000Z | 636.8784 |
5 | 124 | 2022-02-15T12:03:00.000000Z | 923.87805 |
6 | 127 | 2022-02-15T12:04:00.000000Z | 2105.3125 |
... |