Migrating databases from Google BigQuery to Yandex Managed Service for ClickHouse®
You can migrate a database from Google BigQuery to Yandex Managed Service for ClickHouse® and then use Yandex DataLens to analyze it.
A table is compressed and moved to a Google Storage bucket, and from there it is transferred to a Yandex Object Storage bucket. After that, the data is imported into a Managed Service for ClickHouse® cluster, where you can analyze it using Yandex DataLens.
This migration method offers the following benefits:
- You can specify the export format and the compression ratio.
- Significantly reduced data volume leads to faster migration and lower costs.
However, with this method the data is migrated as is: it is not transformed along the way, and updated increments are not copied.
To migrate a database from Google BigQuery to Managed Service for ClickHouse®:
- Transfer data from Google BigQuery to Yandex Object Storage.
- Configure data mapping from Yandex Object Storage to the Managed Service for ClickHouse® cluster.
- Analyze your data with Yandex DataLens.
If you no longer need the resources you created, delete them.
Required paid resources
The support cost for this solution includes:
- Managed Service for ClickHouse® cluster fee, which covers the use of computing resources allocated to hosts (including ZooKeeper hosts) and disk space (see Managed Service for ClickHouse® pricing).
- Fee for public IP addresses if public access is enabled for cluster hosts (see Virtual Private Cloud pricing).
- Object Storage bucket fee, which covers data storage and data operations (see Object Storage pricing).
- Fee for using Yandex DataLens (see DataLens pricing).
Getting started
To migrate your database, first create the following Google Cloud and Yandex Cloud resources.
Create Google Cloud resources
- Create a Google Cloud service account with the `BigQuery Data Editor` and `Storage Object Admin` roles.
- Create an access key for the service account and save it as a `.json` file.
- Install the Google BigQuery Python SDK. This package requires Python 3.7 or higher.
- Prepare a dataset for Google BigQuery. This example uses the public Google BigQuery dataset `google_trends`, which includes the `international_top_terms` table with the following columns:
  - `rank`
  - `country_name`
  - `country_code`
  - `region_name`
  - `week`
  - `score`
  - `region_code`
  - `term`
  - `refresh_date`
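As a convenience, the table's columns can be written down together with the ClickHouse® types they are mapped to in the view definition used later in this tutorial. The snippet below is only an illustration; the schema dictionary and helper function are not part of any official tooling.

```python
# Columns of the international_top_terms table and the ClickHouse® types
# they are mapped to in the s3Cluster() structure string used later in
# this tutorial.
INTERNATIONAL_TOP_TERMS_SCHEMA = {
    "rank": "Int32",
    "country_name": "String",
    "country_code": "String",
    "region_name": "String",
    "week": "Timestamp",
    "score": "Nullable(Int32)",
    "region_code": "String",
    "term": "String",
    "refresh_date": "Timestamp",
}


def clickhouse_structure(schema: dict) -> str:
    """Render a column map as the structure string expected by s3Cluster()."""
    return ", ".join("{} {}".format(name, ch_type) for name, ch_type in schema.items())


print(clickhouse_structure(INTERNATIONAL_TOP_TERMS_SCHEMA))
```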
Create Yandex Cloud resources
- Create a service account with the `storage.uploader` role to access the Object Storage bucket.
- Create a static access key for the service account. Save the key ID and secret key, as you will need them later.
- Create a Managed Service for ClickHouse® cluster of any suitable configuration. When creating your cluster:
  - Specify the service account you created earlier.
  - Enable DataLens access.
  - Add ZooKeeper hosts to ensure high cluster availability.
- Create an Object Storage bucket. When creating the bucket, enable public read access to objects and to the bucket listing.
Migrate data from Google BigQuery to Yandex Object Storage
- Create a file named `credentials.boto` with access credentials for Google Cloud and Yandex Cloud resources:

  ```ini
  [Credentials]
  gs_service_client_id = <Google_Cloud_service_account>
  gs_service_key_file = <absolute_path_to_JSON_file>
  aws_access_key_id = <service_account_key_ID>
  aws_secret_access_key = <service_account_secret_key>

  [GSUtil]
  default_project_id = <Google_Cloud_project_ID>

  [s3]
  calling_format = boto.s3.connection.OrdinaryCallingFormat
  host = storage.yandexcloud.net
  ```

  Where:

  - `gs_service_client_id`: Google Cloud service account name in `service-account-name@project-id.iam.gserviceaccount.com` format.
  - `gs_service_key_file`: Absolute path to the JSON file containing your Google Cloud service account's access key.
  - `aws_access_key_id`: Yandex Cloud service account key ID.
  - `aws_secret_access_key`: Yandex Cloud service account secret key.
  - `default_project_id`: Google Cloud project ID.
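  Before running the migration, it can help to sanity-check this file. The snippet below is a minimal sketch, not part of any official tooling: `gsutil` reads `.boto` as an INI file, so Python's `configparser` can confirm that the required keys exist and that no `<placeholder>` values were left unfilled.

  ```python
  import configparser

  # Sections and keys the credentials.boto file is expected to contain.
  REQUIRED = {
      "Credentials": [
          "gs_service_client_id",
          "gs_service_key_file",
          "aws_access_key_id",
          "aws_secret_access_key",
      ],
      "GSUtil": ["default_project_id"],
      "s3": ["calling_format", "host"],
  }


  def check_boto_config(text: str) -> list:
      """Return a list of problems found in the config text (empty if OK)."""
      parser = configparser.ConfigParser()
      parser.read_string(text)
      problems = []
      for section, keys in REQUIRED.items():
          if not parser.has_section(section):
              problems.append("missing section [{}]".format(section))
              continue
          for key in keys:
              value = parser.get(section, key, fallback="")
              # Unreplaced placeholders such as <service_account_key_ID> start with "<".
              if not value or value.startswith("<"):
                  problems.append("[{}] {} is missing or unfilled".format(section, key))
      return problems


  # Example: an unfilled placeholder is reported as a problem.
  print(check_boto_config("[Credentials]\naws_access_key_id = <service_account_key_ID>\n"))
  ```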
- Create a `main.py` script for data compression and migration:

  ```python
  from google.cloud import bigquery
  import sys
  import argparse
  import time
  import subprocess
  import os

  os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "<absolute_path_to_JSON_file_with_Google_Cloud_service_account_access_key>"
  os.environ["BOTO_CONFIG"] = "<absolute_path_to_credentials.boto_file>"


  def parse_args():
      parser = argparse.ArgumentParser(description='Export data from Google Big Query to Yandex Cloud object storage')
      parser.add_argument('--bq_project', type=str, help='GBQ project ID')
      parser.add_argument('--bq_location', type=str, help='GBQ table AND GS location')
      parser.add_argument('--gs_bucket', type=str, help='GS export destination bucket')
      parser.add_argument('--yc_bucket', type=str, help='YC copy destination bucket')
      parser.add_argument('--gsutil_path', type=str, help='GSutil exec path', default='gsutil')
      return parser.parse_args()


  def select_from_list(message, elements):
      print(message)
      print("\t{}. {}".format(0, "Export all"))
      for ind in range(len(elements)):
          if isinstance(elements[ind].reference, bigquery.DatasetReference):
              print("\t{}. {}".format(ind + 1, elements[ind].reference.dataset_id))
          elif isinstance(elements[ind].reference, bigquery.TableReference):
              print("\t{}. {}".format(ind + 1, elements[ind].reference.table_id))
      try:
          return int(input("(any letter for cancel) >> "))
      except ValueError:
          print("Exiting")
          sys.exit()


  if __name__ == '__main__':
      args = parse_args()
      client = bigquery.Client()

      # Let the user pick a dataset (or all datasets) to export.
      datasets = list(client.list_datasets(args.bq_project))
      dataset_selector = select_from_list("Datasets in project {}".format(args.bq_project), datasets)

      export_list = []
      for i in range(len(datasets)):
          dataset_ref = datasets[i].reference
          if dataset_selector == 0:
              export_list += list(client.list_tables(dataset_ref))
          elif i == dataset_selector - 1:
              tables = list(client.list_tables(dataset_ref))
              table_selector = select_from_list("Tables in dataset {}".format(dataset_ref.dataset_id), tables)
              for j in range(len(tables)):
                  if table_selector == 0 or j == table_selector - 1:
                      export_list.append(tables[j])

      print("Starting tables export")
      for n in range(len(export_list)):
          table_ref = export_list[n].reference
          # Create the extract job config: select the compression level and data format.
          job_config = bigquery.job.ExtractJobConfig()
          job_config.compression = bigquery.Compression.GZIP
          job_config.destination_format = bigquery.DestinationFormat.PARQUET
          print("Exporting {} table".format(table_ref.table_id))
          extract_job = client.extract_table(
              source=table_ref,
              destination_uris="gs://{}/{}".format(args.gs_bucket, "{}-*".format(table_ref.table_id)),
              job_id="export-job-{}-{}".format(table_ref.table_id, round(time.time() * 1000)),
              location=args.bq_location,
              job_config=job_config)
          extract_job.result()
      print("Tables export done")

      # Call gsutil rsync to synchronize the source and destination buckets.
      source_uri = "gs://{}/".format(args.gs_bucket)
      destination_uri = "s3://{}/".format(args.yc_bucket)
      print("Synchronizing {} with {}...".format(source_uri, destination_uri))
      proc = subprocess.Popen([args.gsutil_path, "-m", "rsync", source_uri, destination_uri],
                              stdout=sys.stdout, stderr=sys.stderr)
      proc.communicate()
      print("Buckets synchronization done")
  ```
- Run the `main.py` script to export data from Google BigQuery to Google Storage and then transfer it to the Yandex Object Storage bucket:

  ```bash
  python main.py \
      --bq_project=<Google_Cloud_project_ID> \
      --bq_location=US \
      --gs_bucket=<Google_Cloud_Storage_bucket_name> \
      --yc_bucket=<Object_Storage_bucket_name>
  ```

  Wait for the data migration to complete.
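When checking the buckets afterwards, it helps to know what object names to expect. The helper below is a hypothetical convenience, not part of `main.py`; it simply mirrors the naming logic the script uses for its extract jobs, so you can predict the destination wildcard URI and job ID for a given table.

```python
import time


def export_names(gs_bucket: str, table_id: str, ts_ms: int = None) -> dict:
    """Mirror the destination-URI and job-ID patterns used by main.py."""
    if ts_ms is None:
        ts_ms = round(time.time() * 1000)
    return {
        # BigQuery expands "*" into zero-padded shard numbers, producing
        # objects such as top_terms-000000000000, top_terms-000000000001, ...
        "destination_uri": "gs://{}/{}-*".format(gs_bucket, table_id),
        "job_id": "export-job-{}-{}".format(table_id, ts_ms),
    }


print(export_names("my-gs-bucket", "top_terms"))
```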
Configure data mapping from Yandex Object Storage to the Managed Service for ClickHouse® cluster
- To create a view of the imported data, connect to the Managed Service for ClickHouse® cluster database and run the following SQL query:

  ```sql
  CREATE VIEW db1.v$google_top_rising_terms ON CLUSTER '{cluster}' AS
  (
      SELECT
          term,
          score,
          rank,
          country_name,
          country_code,
          region_name,
          region_code,
          week,
          refresh_date
      FROM s3Cluster(
          '<cluster_ID>',
          'https://storage.yandexcloud.net/<Object_Storage_bucket_name>/top_terms-*',
          'Parquet',
          'rank Int32, country_name String, country_code String, region_name String, week Timestamp, score Nullable(Int32), region_code String, term String, refresh_date Timestamp')
  )
  ```

  Where:

  - `db1`: Name of the database in the Managed Service for ClickHouse® cluster where you want to create the view.
  - `v$google_top_rising_terms`: Name of the view for the imported data.
  - `<cluster_ID>`: Managed Service for ClickHouse® cluster ID. You can get it from your folder's cluster list.
  - `top_terms-*`: Naming prefix for the Object Storage bucket objects. For example, if you migrate a table named `top_terms`, the corresponding objects in the Object Storage bucket will be named `top_terms-000000000001`, `top_terms-000000000002`, and so on. To include all objects of this table in the view, use the `top_terms-*` pattern in the SQL query.
- To retrieve the first 100 records from the view, run the following SQL query (this example uses the `db1` database and its `v$google_top_rising_terms` view):

  ```sql
  SELECT * FROM db1.v$google_top_rising_terms LIMIT 100
  ```
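To double-check which bucket objects a pattern such as `top_terms-*` will pull into the view, a local dry run with Python's `fnmatch` is a reasonable approximation of the wildcard matching. The object keys below are illustrative examples, not real bucket contents.

```python
from fnmatch import fnmatch

# Hypothetical object keys, as produced by the sharded BigQuery export.
keys = [
    "top_terms-000000000001",
    "top_terms-000000000002",
    "other_table-000000000001",
]

# Keep only the keys the view's 'top_terms-*' pattern would select.
matched = [k for k in keys if fnmatch(k, "top_terms-*")]
print(matched)
```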
Analyze your data with Yandex DataLens
- Connect the Managed Service for ClickHouse® cluster to DataLens.
- Create a dataset from the `db1.v$google_top_rising_terms` table. For the `score` field, select the average aggregation.
- Create a chart:
  - Drag the `country_name` field to the X section.
  - Drag the `score` field to the Y section.
  - Drag the `term` field to the Filters section. In the form that opens, specify the following settings:
    - Operation: Belongs to a set.
    - Available: Select a term from the list of available terms, then click Apply filter.
  - Drag the `term` field to the Sorting section.

The popularity of the selected search term will be analyzed, and the result displayed as a country-by-country bar chart.
Delete the resources you created
Delete the resources you no longer need to avoid paying for them:
- Delete the Managed Service for ClickHouse® cluster.
- Delete all objects from the Object Storage bucket, then delete the bucket.
- Delete the Google Storage bucket.
ClickHouse® is a registered trademark of ClickHouse, Inc.