Running computations on a schedule in DataSphere
You can set up regular run scenarios in Yandex DataSphere.

In this tutorial, you will collect information about the most discussed stocks on Reddit. The information is collected and analyzed in DataSphere, and regular cell execution is triggered by a timer created in Cloud Functions.
To set up regular runs of Jupyter Notebook:
- Prepare your infrastructure.
- Create a notebook.
- Upload and process data.
- Create a Cloud Functions function.
- Create a timer.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO).
- Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
The cost of implementing regular runs includes:
- Fee for DataSphere computing resource usage.
- Fee for the number of Cloud Functions function calls.
Prepare the infrastructure
Log in to the Yandex Cloud management console. If you have an active billing account, you can create or select a folder to deploy your infrastructure in on the cloud page.
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Create a folder
- In the management console, select a cloud and click Create folder.
- Name your folder, e.g., data-folder.
- Click Create.
Create a service account for the DataSphere project
To access a DataSphere project from a Cloud Functions function, you need a service account with the datasphere.community-projects.editor and functions.functionInvoker roles.
- Go to data-folder.
- In the list of services, select Identity and Access Management.
- Click Create service account.
- Enter a name for the service account, e.g., reddit-user.
- Click Add role and assign the datasphere.community-projects.editor and functions.functionInvoker roles to the service account.
- Click Create.
Add the service account to a project
To enable the service account to run a DataSphere project, add it to the list of project members.
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- In the Members tab, click Add member.
- Select the reddit-user account and click Add.
Configure the project
To reduce DataSphere usage costs, configure the time to release the VM attached to the project.
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Go to the Settings tab.
- Under General settings, click Edit.
- In the Stop inactive VM after field, select Custom and specify 5 minutes.
- Click Save.
Create a notebook
- Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
- Click Open project in JupyterLab and wait for the loading to complete.
- In the top panel, click File and select New → Notebook.
- Select a kernel and click Select.
- Right-click the notebook and select Rename. Enter the name: test_classifier.ipynb.
Upload and process data
To upload information on the most discussed stocks on Reddit and the sentiment of the discussion, paste the code into the test_classifier.ipynb notebook cells. You will use it to select the three most discussed stocks and save them to a CSV file in project storage.
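Before wiring up the notebook, it may help to see the shape of the data the code below expects. The following is a hypothetical sample of the Tradestie API response: the field names are inferred from the processing code in this tutorial, and the values are made up.

```python
import pandas as pd

# Hypothetical sample of the Tradestie API response. The field names match
# those used by the processing code in this tutorial; the values are made up.
sample_response = [
    {"no_of_comments": 150, "sentiment": "Bullish", "sentiment_score": 0.25, "ticker": "NVDA"},
    {"no_of_comments": 90, "sentiment": "Bearish", "sentiment_score": -0.10, "ticker": "GME"},
    {"no_of_comments": 70, "sentiment": "Bullish", "sentiment_score": 0.15, "ticker": "TSLA"},
]

stocks = pd.DataFrame(sample_response)
# Keep only the tickers the tutorial tracks
stocks = stocks[stocks["ticker"].isin(["NVDA", "TSLA", "AAPL"])]
print(stocks["ticker"].tolist())  # ['NVDA', 'TSLA']
```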
- Open the DataSphere project:
  - Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
  - Click Open project in JupyterLab and wait for the loading to complete.
  - Open the notebook tab.

- Import the libraries:

  ```python
  import pandas as pd
  import requests
  import os.path
  ```
- Initialize the variables:

  ```python
  REQUEST_URL = "https://tradestie.com/api/v1/apps/reddit"
  FILE_NAME = "/home/jupyter/datasphere/project/stock_sentiments_data.csv"
  TICKERS = ['NVDA', 'TSLA', 'AAPL']
  ```
- Create a function that sends a request to the Tradestie API and returns the response as a pandas.DataFrame:

  ```python
  def load_data():
      response = requests.get(REQUEST_URL)
      stocks = pd.DataFrame(response.json())
      stocks = stocks[stocks['ticker'].isin(TICKERS)]
      stocks.drop('sentiment', inplace=True, axis=1)
      return stocks
  ```
- Set a condition that defines the file to write stock information to:

  ```python
  if os.path.isfile(FILE_NAME):
      stocks = pd.read_csv(FILE_NAME)
  else:
      stocks = load_data()
      stocks['count'] = 1
      stocks.to_csv(FILE_NAME, index=False)
  ```
- Load the updated stock data:

  ```python
  stocks_update = load_data()
  ```
- Merge the updated data with the existing data, filling gaps with the existing values:

  ```python
  stocks = stocks.merge(stocks_update, how='left', on='ticker')
  stocks['no_of_comments_y'] = stocks['no_of_comments_y'].fillna(stocks['no_of_comments_x'])
  stocks['sentiment_score_y'] = stocks['sentiment_score_y'].fillna(stocks['sentiment_score_x'])
  ```
- Update the running average comment counts and sentiment scores:

  ```python
  stocks['count'] += 1
  stocks['no_of_comments_x'] += (stocks['no_of_comments_y'] - stocks['no_of_comments_x']) / stocks['count']
  stocks['sentiment_score_x'] += (stocks['sentiment_score_y'] - stocks['sentiment_score_x']) / stocks['count']
  stocks = stocks[['no_of_comments_x', 'sentiment_score_x', 'ticker', 'count']]
  stocks.columns = ['no_of_comments', 'sentiment_score', 'ticker', 'count']
  print(stocks)
  ```
- Save the results to the file:

  ```python
  stocks.to_csv(FILE_NAME, index=False)
  ```
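The update in the steps above maintains an incremental arithmetic mean: new_avg = old_avg + (new_value - old_avg) / count, which equals the average of all observations seen so far without storing them. A minimal, self-contained sketch of this logic with made-up numbers:

```python
import pandas as pd

# Existing averages after one observation (count = 1); values are made up.
stocks = pd.DataFrame({
    "ticker": ["NVDA", "TSLA"],
    "no_of_comments": [100.0, 50.0],
    "sentiment_score": [0.2, -0.1],
    "count": [1, 1],
})

# A new observation arrives.
update = pd.DataFrame({
    "ticker": ["NVDA", "TSLA"],
    "no_of_comments": [200.0, 150.0],
    "sentiment_score": [0.4, 0.1],
})

merged = stocks.merge(update, how="left", on="ticker", suffixes=("_x", "_y"))
merged["count"] += 1
# Incremental mean: new_avg = old_avg + (new_value - old_avg) / count
merged["no_of_comments_x"] += (merged["no_of_comments_y"] - merged["no_of_comments_x"]) / merged["count"]
merged["sentiment_score_x"] += (merged["sentiment_score_y"] - merged["sentiment_score_x"]) / merged["count"]

# NVDA average comment count is now (100 + 200) / 2 = 150.0
print(merged[["ticker", "no_of_comments_x", "count"]])
```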
Create a Cloud Functions function
To start computations without opening JupyterLab, you need a Cloud Functions function that will trigger computations in a notebook via the API.
- In the management console, select the folder where you want to create a function.
- Select Cloud Functions.
- Click Create function.
- Enter a name for the function, e.g., my-function.
- Click Create function.
Create a function version
Versions contain the function code, run parameters, and all required dependencies.
- In the management console, select the folder containing the function.
- Select Cloud Functions.
- Select the function to create a version of.
- Under Last version, click Create in editor.
- Select the Python runtime environment. Do not select the Add files with code examples option.
- Choose the Code editor method.
- Click Create file and specify a file name, e.g., index.
- Enter the function code, inserting your project ID and the absolute path to the project notebook:

  ```python
  import requests

  def handler(event, context):
      url = 'https://datasphere.api.cloud.yandex.net/datasphere/v2/projects/<project_ID>:execute'
      body = {"notebookId": "/home/jupyter/datasphere/project/test_classifier.ipynb"}
      headers = {
          "Content-Type": "application/json",
          "Authorization": "Bearer {}".format(context.token['access_token'])
      }
      resp = requests.post(url, json=body, headers=headers)
      return {
          'body': resp.json(),
      }
  ```
  Where:
  - <project_ID>: ID of the DataSphere project, displayed on the project page under its name.
  - notebookId: Absolute path to the project notebook.
- Under Parameters, set the version parameters:
  - Entry point: index.handler.
  - Service account: reddit-user.
- Click Save changes.
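The request the handler sends can be sanity-checked locally by factoring out its construction. In this sketch, build_request and FakeContext are hypothetical helpers for illustration only; they are not part of the Cloud Functions runtime, and the project ID placeholder is left as-is.

```python
# Illustrative sketch: factor out how the handler builds its request so it
# can be inspected locally. FakeContext stands in for the runtime's context
# object, which carries the service account's IAM token.
class FakeContext:
    token = {"access_token": "test-token"}

def build_request(project_id, notebook_path, context):
    url = ("https://datasphere.api.cloud.yandex.net/datasphere/v2/projects/"
           f"{project_id}:execute")
    body = {"notebookId": notebook_path}
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(context.token["access_token"]),
    }
    return url, body, headers

url, body, headers = build_request(
    "<project_ID>",  # placeholder: substitute your DataSphere project ID
    "/home/jupyter/datasphere/project/test_classifier.ipynb",
    FakeContext(),
)
print(headers["Authorization"])  # Bearer test-token
```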
Create a timer
To run a function every 15 minutes, you will need a timer.
- In the management console, select the folder where you want to create a timer.
- Select Cloud Functions.
- In the left-hand panel, select Triggers.
- Click Create trigger.
- Under Basic settings:
  - Enter a name and description for the trigger.
  - In the Type field, select Timer.
  - In the Launched resource field, select Function.
- Under Timer settings, set the function invocation schedule to Every 15 minutes.
- Under Function settings, select the function and specify:
  - The function version tag.
  - The service account used to call the function: reddit-user.
- Click Create trigger.
From now on, the stock_sentiments_data.csv file will be updated every 15 minutes. You can find it next to the notebook.
How to delete the resources you created
To stop paying for the resources you created: