Transferring data to a MongoDB target endpoint
Yandex Data Transfer enables you to migrate data to a MongoDB database and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Configure one of the supported data sources.
- Prepare the MongoDB database for the transfer.
- Configure the target endpoint in Yandex Data Transfer.
- Create a transfer and start it.
- Perform the required operations with the database and see how the transfer is going.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data to MongoDB
- Migration: moving data from one storage to another. Migration often means moving a database from obsolete local databases to managed cloud ones.
- Data delivery: delivering arbitrary data to the target storage. It includes retrieving data from a queue, deserializing it, and transforming it into the target storage format.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Configuring the data source
Configure one of the supported data sources. For a complete list of supported sources and targets in Yandex Data Transfer, see Available transfers.
Preparing the target database
- Create a user with the readWrite role for the new database.
- To shard the migrated collections in the Yandex Managed Service for MongoDB target cluster:
  - Use this guide to create and configure empty sharded collections in the target database.
    Data Transfer does not automatically shard the migrated collections. Sharding large collections may take a long time and slow down the transfer.
  - If sharding uses any key other than _id (default), assign the mdbShardingManager role to the user.
  - When creating a target endpoint, select DISABLED or TRUNCATE as your cleanup policy.
    Selecting the DROP policy will result in the service deleting all the data from the target database, including sharded collections, and replacing them with new unsharded ones when a transfer is activated.
  Learn more about sharding in the MongoDB documentation.
- If you do not plan to use Cloud Interconnect or a VPN to connect to an external cluster, make the cluster accessible from the internet from the IP addresses used by Data Transfer.
  For details on linking your network up with external resources, see this concept.
- Make sure the MongoDB version on the target is not lower than that on the source.
- Make sure the MongoDB cluster is configured so that it returns correctly resolving IP addresses or FQDNs (fully qualified domain names) in response to requests.
- Configure the target cluster to allow connections from the internet:
  - In the configuration file, change net.bindIp from 127.0.0.1 to 0.0.0.0:
      # network interfaces
      net:
        port: 27017
        bindIp: 0.0.0.0
  - Restart mongod:
      sudo systemctl restart mongod.service
- If the target cluster does not use replication, enable it:
  - Add the replication settings to the /etc/mongod.conf configuration file:
      replication:
        replSetName: <replica_set_name>
  - Restart mongod:
      sudo systemctl restart mongod.service
  - Connect to MongoDB and initialize the replica set with this command:
      rs.initiate({
        _id: "<replica_set_name>",
        members: [{ _id: 0, host: "<IP_address_listened_by_MongoDB>:<port>" }]
      });
- Connect to the cluster and create a target database:
    use <database_name>
- Create a user with the readWrite role for the target database:
    use admin;
    db.createUser({
      user: "<username>",
      pwd: "<password>",
      mechanisms: ["SCRAM-SHA-1"],
      roles: [
        { db: "<target_database_name>", role: "readWrite" }
      ]
    });
  Once started, the transfer will connect to the target on behalf of this user.
- To shard the migrated collections in the target cluster:
  - Set up a database and populate it with empty collections with the same names as those in the source.
    Data Transfer does not automatically shard the migrated collections. Sharding large collections may take a long time and slow down the transfer.
  - Enable target database sharding:
      sh.enableSharding("<target_database_name>")
  - Shard every collection based on its namespace:
      sh.shardCollection("<target_database_name>.<collection_name>", { <field_name>: <1|"hashed">, ... });
    For the shardCollection() function description, see the MongoDB documentation.
  - To make sure that sharding is set up and enabled, get a list of available shards:
      sh.status()
  - If sharding uses any key other than _id (default), assign the clusterManager system role to the user Data Transfer will use for connection to the target cluster:
      use admin;
      db.grantRolesToUser("<username>", ["clusterManager"]);
  - When creating a target endpoint, select DISABLED or TRUNCATE as your cleanup policy.
    Selecting the DROP policy will result in the service deleting all the data from the target database, including sharded collections, and replacing them with new unsharded ones when a transfer is activated.
  Learn more about sharding in the MongoDB documentation.
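The sharding steps for a custom installation can also be scripted from a driver. The following is a minimal sketch, assuming a PyMongo-style client connected to a mongos router. The function name shard_target_collections and the shard_keys mapping are illustrative and not part of Data Transfer. It issues the enableSharding and shardCollection admin commands, which the sh.enableSharding() and sh.shardCollection() shell helpers wrap.

```python
def shard_target_collections(client, db_name, shard_keys):
    """Enable sharding for db_name and shard each listed collection.

    client     -- a PyMongo-style client connected to a mongos router (assumption)
    shard_keys -- hypothetical mapping: collection name -> shard key spec,
                  e.g. {"orders": {"customer_id": "hashed"}}
    Returns the list of namespaces that were sharded.
    """
    admin = client.admin
    # Equivalent of sh.enableSharding("<target_database_name>")
    admin.command("enableSharding", db_name)

    sharded = []
    for collection, key in shard_keys.items():
        namespace = f"{db_name}.{collection}"
        # Equivalent of sh.shardCollection("<db>.<collection>", {<field>: 1 | "hashed"})
        admin.command("shardCollection", namespace, key=key)
        sharded.append(namespace)
    return sharded
```

With PyMongo installed, this could be called with a MongoClient pointed at the target cluster before the transfer is activated. Running it once per target database keeps the cleanup policy advice above relevant: the collections it prepares are the ones a DROP policy would delete.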
Configuring the MongoDB target endpoint
Data Transfer supports transfers from MongoDB starting with version 3.6.
When creating or updating an endpoint, you can define:
- Yandex Managed Service for MongoDB cluster connection or custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Managed Service for MongoDB cluster
Warning
To create or edit an endpoint of a managed database, you need to have the managed-mongodb.viewer role or the viewer primitive role assigned for the folder where this managed database cluster resides.
Connecting to the database with the cluster ID specified in Yandex Cloud.
- Managed Service for MongoDB cluster: Specify the ID of the cluster to connect to.
- Authentication source: Specify the database name in the cluster.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
  Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
- Endpoint type: mongo-target.
- --cluster-id: ID of the cluster you need to connect to.
- --database: Database name.
- --user: Username that Data Transfer will use to connect to the database.
- --security-group: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- To set a user password to access the database, use one of the parameters:
  - --raw-password: Password as text.
  - --password-file: The path to the password file.
- Endpoint type: mongo_target.
- connection.connection_options.mdb_cluster_id: ID of the cluster to connect to.
- subnet_id: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- security_groups: Security groups for network traffic.
  Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the cluster. For more information, see Networking in Yandex Data Transfer.
  Security groups and the subnet_id subnet, if the latter is specified, must belong to the same network as the cluster.
  Note
  In Terraform, it is not required to specify a network for security groups.
- auth_source: Name of the cluster database.
- connection.connection_options.user: Username that Data Transfer will use to connect to the database.
- connection.connection_options.password.raw: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
  name = "<endpoint_name>"
  settings {
    mongo_target {
      security_groups = ["<list_of_security_group_IDs>"]
      subnet_id       = "<subnet_ID>"
      connection {
        connection_options {
          mdb_cluster_id = "<cluster_ID>"
          auth_source    = "<DB_name>"
          user           = "<username>"
          password {
            raw = "<user_password>"
          }
        }
      }
      <additional_endpoint_settings>
    }
  }
}
For more information, see the Terraform provider documentation.
- securityGroups: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- mdbClusterId: ID of the cluster you need to connect to.
- database: Database name.
- user: Username that Data Transfer will use to connect to the database.
- password.raw: Database user password (in text form).
Custom installation
Connecting to the database with explicitly specified network addresses and ports.
- Hosts: Specify the IPs or FQDNs of the hosts to connect to.
- Replica set: Specify the name of the replica set.
- Port: Set the number of the port that Data Transfer will use for the connection.
- CA certificate: To encrypt transmitted data, upload the PEM certificate or add its contents as text.
- Subnet ID: Select or create a subnet in the required availability zone. The transfer will use this subnet to access the cluster.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- Authentication source: Specify the database name in the cluster.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
  Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
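Before creating the endpoint, it can help to check the same connection details with an ordinary MongoDB client. The sketch below shows how the host, port, replica set, authentication source, user, and password fields would map onto a standard MongoDB connection string. The function name build_mongodb_uri is hypothetical, and the source does not describe how Data Transfer itself builds its connection, so treat this only as a manual verification aid.

```python
from urllib.parse import quote_plus, urlencode


def build_mongodb_uri(hosts, port, user, password, auth_source,
                      replica_set=None, tls=False):
    """Assemble a standard MongoDB connection string from endpoint-style fields.

    IPv6 literals are not handled in this sketch.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # Usernames and passwords must be percent-encoded inside a MongoDB URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    host_list = ",".join(f"{host}:{port}" for host in hosts)
    options = {"authSource": auth_source}
    if replica_set:
        options["replicaSet"] = replica_set
    if tls:
        # Corresponds to supplying a CA certificate for encrypted connections.
        options["tls"] = "true"
    return f"mongodb://{credentials}@{host_list}/?{urlencode(options)}"
```

The resulting string can be passed to mongosh or a driver to confirm that the hosts are reachable and the user can authenticate against the chosen database.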
- Endpoint type: mongo-target.
- --host: IP address or FQDN of the master host you want to connect to.
- --port: Number of the port that Data Transfer will use for the connection.
- --ca-certificate: CA certificate, used if the transmitted data needs to be encrypted, for example, to meet the requirements of PCI DSS.
- --subnet-id: ID of the subnet the host is in. The transfer will use that subnet to access the host.
- --database: Database name.
- --user: Username that Data Transfer will use to connect to the database.
- --security-group: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- To set a user password to access the database, use one of the parameters:
  - --raw-password: Password as text.
  - --password-file: The path to the password file.
- Endpoint type: mongo_target.
- on_premise.port: Port number that Data Transfer will use for connections.
- connection.connection_options.on_premise.tls_mode.enabled.ca_certificate: CA certificate used if the data being transferred must be encrypted to comply with the PCI DSS requirements.
- security_groups: Security groups for network traffic.
  Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the VM with the database. For more information, see Networking in Yandex Data Transfer.
  Security groups must belong to the same network as the subnet_id subnet, if the latter is specified.
  Note
  In Terraform, it is not required to specify a network for security groups.
- subnet_id: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- connection.connection_options.on_premise.replica_set: Specify the name of the replica set.
- connection.connection_options.on_premise.hosts: Specify the IP addresses or FQDNs of the hosts to connect to.
- auth_source: Name of the cluster database.
- connection.connection_options.user: Username that Data Transfer will use to connect to the database.
- connection.connection_options.password.raw: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
  name = "<endpoint_name>"
  settings {
    mongo_target {
      security_groups = ["<list_of_security_group_IDs>"]
      subnet_id       = "<subnet_ID>"
      connection {
        connection_options {
          on_premise {
            hosts       = ["list of replica set hosts"]
            port        = "<port_for_connection>"
            replica_set = "<replica_set_name>"
            tls_mode {
              enabled {
                ca_certificate = "<certificate_in_PEM_format>"
              }
            }
          }
          auth_source = "<DB_name>"
          user        = "<username>"
          password {
            raw = "<user_password>"
          }
        }
      }
      <additional_endpoint_settings>
    }
  }
}
For more information, see the Terraform provider documentation.
- onPremise: Database connection parameters:
  - hosts: IP address or FQDN of the master host to connect to.
  - port: The number of the port that Data Transfer will use for the connection.
  - tlsMode: Parameters for encrypting transmitted data, if encryption is required, for example, to meet the requirements of PCI DSS.
  - subnetId: ID of the subnet the host is in. The transfer will use that subnet to access the host.
- securityGroups: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- database: Database name.
- user: Username that Data Transfer will use to connect to the database.
- password.raw: Database user password (in text form).
Additional settings
- Database: Enter the database name. If you do not specify any name, the source database name will be used.
- Cleanup policy: Select a way to clean up data in the target database before the transfer:
  - Don't cleanup: Select this option if you are only going to do replication without copying data.
  - Drop: Completely delete tables included in the transfer (used by default).
    Use this option so that the latest version of the table schema is always transferred to the target database from the source whenever the transfer is activated.
  - Truncate: Delete only the data from the tables included in the transfer but keep the schema.
    Use this option if the schema in the target database differs from the one that would have been transferred from the source during the transfer.
- --target-database: Specify the database name if you want to create collections in a database that is different from the source database.
- database: Specify the database name if you want to create collections in a database that is different from the source database.
- cleanup_policy: Select a way to clean up data in the target database before the transfer:
  - DISABLED: Do not clean up (default).
    Select this option if only replication without copying data is performed.
  - DROP: Completely delete the collections included in the transfer.
    Use this option so that the latest collection version is always transferred to the target database from the source whenever the transfer is activated.
  - TRUNCATE: Delete only the data from the collections included in the transfer but keep the collections.
    Use this option if the structure of collections in the target database differs from the one that would have been transferred from the source during the transfer.
- database: Specify the database name if you want to create collections in a database that is different from the source database.
- cleanupPolicy: Select a way to clean up data in the target database before the transfer:
  - DISABLED: Do not clean up (default).
    Select this option if only replication without copying data is performed.
  - DROP: Completely delete the collections included in the transfer.
    Use this option so that the latest collection version is always transferred to the target database from the source whenever the transfer is activated.
  - TRUNCATE: Delete only the data from the collections included in the transfer but keep the collections.
    Use this option if the structure of collections in the target database differs from the one that would have been transferred from the source during the transfer.
Warning
By default, Data Transfer transfers collections without sharding. If you are transferring data to a sharded target cluster and want your collections to be sharded:
- Prepare the target cluster to shard the collections.
- Select DISABLED or TRUNCATE for cleanup policy.
Selecting the DROP policy will cause the service to delete all the data, including sharded collections, from the target database and replace them with new unsharded ones when activating a transfer.
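The rule in this warning is easy to encode as a pre-flight check in deployment tooling. The sketch below is illustrative only: check_cleanup_policy and its target_presharded flag are hypothetical names, not a Data Transfer API. It rejects the DROP policy when the target collections were sharded in advance, since DROP would replace them with unsharded ones.

```python
ALLOWED_POLICIES = {"DISABLED", "DROP", "TRUNCATE"}


def check_cleanup_policy(policy, target_presharded):
    """Return the policy if it is safe for the target, otherwise raise ValueError.

    target_presharded -- True if the target collections were created and
                         sharded before the transfer (hypothetical flag).
    """
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown cleanup policy: {policy!r}")
    if target_presharded and policy == "DROP":
        raise ValueError(
            "DROP would delete the pre-sharded collections and recreate them "
            "unsharded; use DISABLED or TRUNCATE instead"
        )
    return policy
```

A check like this could run before an endpoint definition is applied, so that a DROP policy never reaches a cluster whose collections were sharded by hand.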
After configuring the data source and target, create and start the transfer.
Operations with the database during transfer
- For transfers in the Copying status, do not perform any actions that reduce the source's operation log (oplog) time window. In particular, do not add, delete, or reconfigure shards during copying.
- For transfers in the Replicating status, you may encounter key duplication when the target is a sharded MongoDB cluster with a sharding index other than _id. While a transfer is underway, we caution against creating clusters with sharding indexes other than _id on the target.
Troubleshooting data transfer issues
Known issues when using a MongoDB endpoint:
- Collection key size exceeds 5 MB.
- Collection object size exceeds 16 MB.
- No tables found.
- Error when transferring a sharded cluster.
- Error when transferring timeseries collections.
- Unable to recognize an external cluster IP address or FQDN.
For more troubleshooting tips, see Troubleshooting.
Collection key size exceeds 5 MB
Error message:
Warn(replication): Usage of bulk objects in 'database <DB_name>'
breaks change event log, transfer is stopping.
Reason: (Location<item_number>) Tried to create string longer than 16MB.
If the collection key size exceeds 5 MB, transfers of the Replication type crash due to MongoDB internal limits.
Solution: exclude any collections that exceed MongoDB limits from the transfer and reactivate it.
Collection object size exceeds 16 MB
Error message:
Warn(replication): Usage of bulk objects in 'collection '<DB_name>.<collection_name>''
breaks change event log, transfer is stopping.
Reason: (BSONObjectTooLarge) BSONObj size: <object_size> (<object_size_in_hex>) is invalid.
Size must be between 0 and 16793600(16MB).
If the collection object size exceeds 16 MB, transfers of the Replication type crash due to MongoDB internal limits.
Solution: exclude any collections that exceed MongoDB limits from the transfer and reactivate it.
No tables found
Error message:
Unable to find any tables
No collections were extracted from the database. The user might be missing permissions for the database used in the transfer.
Solution: grant the readWrite permissions for the database to be transferred to the user the transfer uses to connect to the source.
Error when transferring a sharded cluster
Solution: In the Snapshot settings → Parallel snapshot settings transfer parameter, specify the number of workers equal to the number of collections being transferred.
Error when transferring timeseries collections
Error messages:
Unable to find any tables
Cannot execute mongo activate hook:
Failed in accordance with configuration:
some tables from include list are missing in the source database: [<collection_name>]
The service does not support transfers of Time Series collections.
Solution: exclude any Time Series collections from the transfer and reactivate it.
Unable to recognize an external cluster IP address or FQDN
The transfer fails with the error message:
server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: <unresolved_FQDN>, Type: Unknown, Last error: connection() error occurred during connection handshake: dial tcp: lookup <unresolved_FQDN> on <IP address>: no such host }, ] }"
The error is caused by the MongoDB cluster configuration, for example, when shard descriptions use internal names that cannot be resolved.
Solution:
Make sure the MongoDB cluster is configured so that it returns correctly resolving IP addresses or FQDNs (fully qualified domain names) in response to requests.
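One way to check this ahead of time is to take the member addresses the cluster advertises (for example, the host fields shown by rs.conf()) and try to resolve each of them from a machine outside the cluster's private network. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the resolver parameter exists only so the check can be exercised without real DNS.

```python
import socket


def split_host_port(member, default_port=27017):
    """Split "host:port" into (host, port); fall back to default_port.

    IPv6 literals are not handled in this sketch.
    """
    host, sep, port = member.rpartition(":")
    if not sep:
        return member, default_port
    return host, int(port)


def unresolved_members(members, resolver=socket.getaddrinfo):
    """Return the members whose host name the resolver cannot resolve."""
    failed = []
    for member in members:
        host, port = split_host_port(member)
        try:
            resolver(host, port)
        except socket.gaierror:
            # Same class of failure as "no such host" in the transfer error.
            failed.append(member)
    return failed
```

Any address this returns is a candidate for the unresolved FQDN reported in the error, and points to a member name that needs to be replaced with a resolvable FQDN or IP address in the cluster configuration.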