Transferring data from a MongoDB source endpoint
Yandex Data Transfer enables you to migrate data from a MongoDB database and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the MongoDB database for the transfer.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform the required operations with the database and see how the transfer is going.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from MongoDB
- Migration: Moving data from one storage to another. Migration often means moving a database from an obsolete on-premises installation to a managed cloud one.
- Uploading data to scalable Object Storage allows you to save on data storage and simplifies the exchange with contractors.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Preparing the source database
- Estimate the total number of databases for transfer and the total Managed Service for MongoDB workload. If the workload on the database exceeds 10,000 writes per second, create multiple endpoints and transfers. For more information, see Transferring data from a MongoDB source endpoint.
- Create a user with the readWrite role for each source database to replicate. The readWrite role is required to enable the transfer to write data to the __data_transfer.__dt_cluster_time service collection.
- Estimate the total number of databases for transfer and the total MongoDB workload. If the workload on the database exceeds 10,000 writes per second, create multiple endpoints and transfers. For more information, see Transferring data from a MongoDB source endpoint.
- If you do not plan to use Cloud Interconnect or VPN to connect to an external cluster, make the cluster accessible from the internet from the IP addresses used by Data Transfer. For details on linking your network up with external resources, see this concept.
- Make sure the MongoDB version on the target is 4.0 or higher.
- Make sure the MongoDB cluster is configured so that it returns correctly resolving IP addresses or FQDNs (fully qualified domain names) in response to requests.
- Configure access to the source cluster from Yandex Cloud. To configure a source cluster for connections from the internet:
  - In the configuration file, change net.bindIp from 127.0.0.1 to 0.0.0.0:
      # network interfaces
      net:
        port: 27017
        bindIp: 0.0.0.0
  - Restart mongod:
      sudo systemctl restart mongod.service
- If the source cluster does not use replication, enable it:
  - Add the replication settings to the /etc/mongod.conf configuration file:
      replication:
        replSetName: <replica_set_name>
  - Restart mongod:
      sudo systemctl restart mongod.service
  - Connect to MongoDB and initialize the replica set with this command:
      rs.initiate({
        _id: "<replica_set_name>",
        members: [{ _id: 0, host: "<IP_address_listened_by_MongoDB>:<port>" }]
      });
- Create a user with the readWrite role for all the source databases to replicate:
    use admin
    db.createUser({
      user: "<username>",
      pwd: "<password>",
      mechanisms: ["SCRAM-SHA-1"],
      roles: [
        { db: "<source_database_1_name>", role: "readWrite" },
        { db: "<source_database_2_name>", role: "readWrite" },
        ...
      ]
    });
  Once started, the transfer will connect to the source on behalf of this user. The readWrite role is required to enable the transfer to write data to the __data_transfer.__dt_cluster_time service collection.
  Note
  For MongoDB 3.6 or higher, you only need to assign the created user the read role for the databases to replicate.
- When using MongoDB 3.6 or higher, to run the transfer, the user must have the read permission for the local.oplog.rs collection and the read and write permissions for the __data_transfer.__dt_cluster_time collection. To assign a user the clusterAdmin role granting these permissions, connect to MongoDB and run the following commands:
    use admin;
    db.grantRolesToUser("<username>", ["clusterAdmin"]);
  To grant more granular permissions, you can assign the clusterMonitor role required for reading the local.oplog.rs collection and grant read and write access to the __data_transfer.__dt_cluster_time system collection.
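The granular alternative to clusterAdmin can be sketched as a list of role documents. This is a minimal sketch, not the service's prescribed procedure: the helper name granularTransferRoles is hypothetical, and it assumes that clusterMonitor is granted on the admin database and that read and write access to the service collection is given through readWrite on the __data_transfer database.

```javascript
// Hypothetical helper: builds role documents for a transfer user
// without granting clusterAdmin.
// Assumptions: clusterMonitor is granted on "admin"; readWrite on the
// "__data_transfer" database covers the __dt_cluster_time collection.
function granularTransferRoles(sourceDatabases) {
  // MongoDB 3.6 or higher: read is enough for the databases to replicate.
  const roles = sourceDatabases.map((name) => ({ role: "read", db: name }));
  // Needed to read local.oplog.rs.
  roles.push({ role: "clusterMonitor", db: "admin" });
  // Needed to write to __data_transfer.__dt_cluster_time.
  roles.push({ role: "readWrite", db: "__data_transfer" });
  return roles;
}

const roles = granularTransferRoles([
  "<source_database_1_name>",
  "<source_database_2_name>",
]);
console.log(JSON.stringify(roles, null, 2));

// In the MongoDB shell, after switching to the admin database:
//   db.grantRolesToUser("<username>", roles);
```

The function itself only builds plain objects, so you can review the resulting list before passing it to db.grantRolesToUser.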
Configuring the MongoDB source endpoint
Data Transfer supports transfers from MongoDB starting with version 3.6.
When creating or updating an endpoint, you can define:
- Yandex Managed Service for MongoDB cluster connection or custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Managed Service for MongoDB cluster
Warning
To create or edit an endpoint of a managed database, you need to have the managed-mongodb.viewer role or the viewer primitive role assigned for the folder where this managed database cluster resides.
Connecting to the database with the cluster ID specified in Yandex Cloud.
- Managed Service for MongoDB cluster: Specify the ID of the cluster to connect to.
- Authentication source: Specify the database name in the cluster.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
  Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
- Endpoint type: mongo-source.
- --cluster-id: ID of the cluster you need to connect to.
- --database: Database name.
- --user: Username that Data Transfer will use to connect to the database.
- --security-group: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- To set a user password to access the database, use one of the parameters:
  - --raw-password: Password as text.
  - --password-file: The path to the password file.
- Endpoint type: mongo_source.
- connection.connection_options.mdb_cluster_id: ID of the cluster to connect to.
- subnet_id: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- security_groups: Security groups for network traffic.
  Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the cluster. For more information, see Networking in Yandex Data Transfer.
  Security groups and the subnet_id subnet, if the latter is specified, must belong to the same network as the cluster.
  Note
  In Terraform, it is not required to specify a network for security groups.
- auth_source: Name of the cluster database.
- connection.connection_options.user: Username that Data Transfer will use to connect to the database.
- connection.connection_options.password.raw: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
  name = "<endpoint_name>"
  settings {
    mongo_source {
      security_groups = ["<list_of_security_group_IDs>"]
      subnet_id       = "<subnet_ID>"
      connection {
        connection_options {
          mdb_cluster_id = "<cluster_ID>"
          auth_source    = "<DB_name>"
          user           = "<username>"
          password {
            raw = "<user_password>"
          }
        }
      }
      <additional_endpoint_settings>
    }
  }
}
For more information, see the Terraform provider documentation.
- securityGroups: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- mdbClusterId: ID of the cluster you need to connect to.
- database: Database name.
- user: Username that Data Transfer will use to connect to the database.
- password.raw: Database user password (in text form).
Custom installation
The settings are given for the OnPremise use case when all fields are filled in manually.
- Hosts: Specify the IPs or FQDNs of the hosts to connect to.
- Replica set: Specify the name of the replica set.
- Port: Set the number of the port that Data Transfer will use for the connection.
- CA certificate: To encrypt transmitted data, upload the PEM certificate or add its contents as text.
- Subnet ID: Select or create a subnet in the required availability zone. The transfer will use this subnet to access the cluster.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- Authentication source: Specify the database name in the cluster.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
  Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
- Endpoint type: mongo-source.
- --host: IP address or FQDN of the master host you want to connect to.
- --port: Number of the port that Data Transfer will use for the connection.
- --ca-certificate: CA certificate used if the transmitted data needs to be encrypted, for example, to meet the requirements of PCI DSS.
- --subnet-id: ID of the subnet the host is in. The transfer will use that subnet to access the host.
- --database: Database name.
- --user: Username that Data Transfer will use to connect to the database.
- --security-group: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- To set a user password to access the database, use one of the parameters:
  - --raw-password: Password as text.
  - --password-file: The path to the password file.
- Endpoint type: mongo_source.
- on_premise.port: Port number that Data Transfer will use for connections.
- connection.connection_options.on_premise.tls_mode.enabled.ca_certificate: CA certificate used if the data being transferred must be encrypted to comply with the PCI DSS requirements.
- security_groups: Security groups for network traffic.
  Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the VM with the database. For more information, see Networking in Yandex Data Transfer.
  Security groups must belong to the same network as the subnet_id subnet, if the latter is specified.
  Note
  In Terraform, it is not required to specify a network for security groups.
- subnet_id: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.
  If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- connection.connection_options.on_premise.replica_set: Specify the name of the replica set.
- connection.connection_options.on_premise.hosts: Specify the IP addresses or FQDNs of the hosts to connect to.
- auth_source: Name of the cluster database.
- connection.connection_options.user: Username that Data Transfer will use to connect to the database.
- connection.connection_options.password.raw: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
  name = "<endpoint_name>"
  settings {
    mongo_source {
      security_groups = ["<list_of_security_group_IDs>"]
      subnet_id       = "<subnet_ID>"
      connection {
        connection_options {
          on_premise {
            hosts       = ["<list_of_replica_set_hosts>"]
            port        = "<port_for_connection>"
            replica_set = "<replica_set_name>"
            tls_mode {
              enabled {
                ca_certificate = "<certificate_in_PEM_format>"
              }
            }
          }
          auth_source = "<DB_name>"
          user        = "<username>"
          password {
            raw = "<user_password>"
          }
        }
      }
      <additional_endpoint_settings>
    }
  }
}
For more information, see the Terraform provider documentation.
- onPremise: Database connection parameters:
  - hosts: IP address or FQDN of the master host to connect to.
  - port: The number of the port that Data Transfer will use for the connection.
  - tlsMode: Parameters for encrypting the transmitted data if this is required, for example, to meet the requirements of PCI DSS.
  - subnetId: ID of the subnet the host is in. The transfer will use that subnet to access the host.
- securityGroups: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.
- database: Database name.
- user: Username that Data Transfer will use to connect to the database.
- password.raw: Database user password (in text form).
Collection filter
- Included collections: Data is only transferred from listed collections. All collections are transferred by default.
  When you add new collections while editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these collections will not get uploaded. To add a collection with its historical data, use the List of objects for transfer field in the transfer settings.
- Excluded collections: Data is transferred from all collections except the specified ones.
Included and excluded collection names must meet the ID naming rules in MongoDB. Escaping double quotes is not required.
- --include-collection: Transfer data only from the listed collections. The values are specified in <database_name>.<collection_name> format. All collections are transferred by default.
  When you add new collections while editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these collections will not get uploaded. To add a collection with its historical data, use the List of objects for transfer field in the transfer settings.
- --exclude-collection: Transfer data from all collections except the specified ones. The values are specified in <database_name>.<collection_name> format.
- --prefer-secondary: Set to true to use replicas (if there are any in the cluster) instead of the master host to read data.
- collections: Transfer data only from the listed collections. All collections are transferred by default.
  When you add new collections while editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these collections will not get uploaded. To add a collection with its historical data, use the List of objects for transfer field in the transfer settings.
- excluded_collections: Data is transferred from all collections except the specified ones.
- secondary_preferred_mode: Set to true to use replicas (if there are any in the cluster) instead of the master host to read data.
- collections: Transfer data only from the listed collections. You need to specify the following for each collection:
  - databaseName: Database name
  - collectionName: Collection name
  All collections are transferred by default.
  When you add new collections while editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these collections will not get uploaded. To add a collection with its historical data, use the List of objects for transfer field in the transfer settings.
- excludedCollections: Transfer data from all collections except the specified ones. You need to specify the following for each collection:
  - databaseName: Database name
  - collectionName: Collection name
- secondaryPreferredMode: Set to true to use replicas (if there are any in the cluster) instead of the master host to read data.
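The CLI takes collections in <database_name>.<collection_name> format, while the API expects separate databaseName and collectionName values. A minimal sketch of converting one form into the other is shown here; the toCollectionSpec helper is hypothetical and not part of any Yandex Cloud tool. It splits on the first dot because MongoDB database names cannot contain a dot, whereas collection names can.

```javascript
// Hypothetical helper: converts "<database_name>.<collection_name>"
// into the { databaseName, collectionName } pair used in the API fields.
function toCollectionSpec(qualifiedName) {
  // Split on the FIRST dot: database names cannot contain dots,
  // collection names can (for example, "fs.files").
  const dot = qualifiedName.indexOf(".");
  if (dot <= 0 || dot === qualifiedName.length - 1) {
    throw new Error(
      `Expected <database_name>.<collection_name>, got "${qualifiedName}"`
    );
  }
  return {
    databaseName: qualifiedName.slice(0, dot),
    collectionName: qualifiedName.slice(dot + 1),
  };
}

const excludedCollections = ["shop.orders", "media.fs.files"].map(toCollectionSpec);
console.log(excludedCollections);
```

The sample names shop.orders and media.fs.files are placeholders for illustration only.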
If a source workload is high (over 10,000 write transactions per second), we recommend that you select these settings to have no more than ten different databases at each endpoint. This will help avoid database connection errors while the transfer is ongoing.
Note
- If you use several endpoints, you need to create a separate transfer for each one.
- As transfers of timeseries collections are not supported, you should exclude such collections.
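The recommendation of no more than ten databases per endpoint under a high write workload can be sketched as a simple grouping step. This is an illustrative sketch only: the splitIntoEndpoints helper and the sample database names are hypothetical, and each resulting group still needs its own endpoint and its own transfer.

```javascript
// Hypothetical helper: splits a list of database names into groups,
// one group per endpoint, with at most maxPerEndpoint databases in each.
function splitIntoEndpoints(databases, maxPerEndpoint = 10) {
  if (!Number.isInteger(maxPerEndpoint) || maxPerEndpoint < 1) {
    throw new Error("maxPerEndpoint must be a positive integer");
  }
  const groups = [];
  for (let i = 0; i < databases.length; i += maxPerEndpoint) {
    groups.push(databases.slice(i, i + maxPerEndpoint));
  }
  return groups;
}

// Example: 23 placeholder databases end up in three groups.
const databases = Array.from({ length: 23 }, (_, i) => `db_${i + 1}`);
splitIntoEndpoints(databases).forEach((group, index) => {
  console.log(`endpoint ${index + 1}: ${group.length} databases`);
});
```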
Configuring the data target
Configure one of the supported data targets:
For a complete list of supported sources and targets in Yandex Data Transfer, see Available transfers.
After configuring the data source and target, create and start the transfer.
Operations with the database during transfer
- For transfers with the Copying status, you cannot perform any actions reducing the origin's operation log (oplog) time window. You should not add, delete, or reconfigure shards in any way during copying or perform any other actions resulting in a shorter operation log time window.
- In transfers in Replicating status, you may encounter the key duplication problem when a sharded MongoDB cluster with a sharding index other than _id is the target. While a transfer is underway, we caution against creating clusters with sharding indexes other than _id on the target.
Troubleshooting data transfer issues
Known issues when using a MongoDB endpoint:
- Collection key size exceeds 5 MB.
- Collection object size exceeds 16 MB.
- No tables found.
- Error when transferring a sharded cluster.
- Error when transferring timeseries collections.
- Unable to recognize an external cluster IP address or FQDN.
- Error at data copying stage.
- Source data cannot be sharded.
For more troubleshooting tips, see Troubleshooting.
Collection key size exceeds 5 MB
Error message:
Warn(replication): Usage of bulk objects in 'database <DB_name>'
breaks change event log, transfer is stopping.
Reason: (Location<item_number>) Tried to create string longer than 16MB.
If the collection key size exceeds 5 MB, transfers of the Replication type crash due to MongoDB internal limits.
Solution: exclude any collections that exceed MongoDB limits from the transfer and reactivate it.
Collection object size exceeds 16 MB
Error message:
Warn(replication): Usage of bulk objects in 'collection '<DB_name>.<collection_name>''
breaks change event log, transfer is stopping.
Reason: (BSONObjectTooLarge) BSONObj size: <object_size> (<object_size_in_hex>) is invalid.
Size must be between 0 and 16793600(16MB).
If the collection object size exceeds 16 MB, transfers of the Replication type crash due to MongoDB internal limits.
Solution: exclude any collections that exceed MongoDB limits from the transfer and reactivate it.
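To decide which collections to exclude, it may help to look for unusually large documents on the source. The sketch below builds an aggregation pipeline for that purpose. It rests on several assumptions: the largeDocumentsPipeline helper and the 8 MB threshold are hypothetical, the $bsonSize operator requires MongoDB 4.4 or higher, and documents approaching the limit are treated as likely candidates rather than confirmed causes of the error.

```javascript
// Hypothetical helper: builds an aggregation pipeline that lists the
// largest documents in a collection by their BSON size.
// Assumption: the source runs MongoDB 4.4 or higher ($bsonSize).
function largeDocumentsPipeline(thresholdBytes, limit = 10) {
  return [
    // Keep only _id and the computed size of the whole document.
    { $project: { _id: 1, sizeBytes: { $bsonSize: "$$ROOT" } } },
    // Keep documents at or above the threshold.
    { $match: { sizeBytes: { $gte: thresholdBytes } } },
    // Largest documents first.
    { $sort: { sizeBytes: -1 } },
    { $limit: limit },
  ];
}

// The 8 MB threshold is an arbitrary value chosen for illustration.
const pipeline = largeDocumentsPipeline(8 * 1024 * 1024);
console.log(JSON.stringify(pipeline));

// In the MongoDB shell:
//   db.getSiblingDB("<DB_name>").getCollection("<collection_name>").aggregate(pipeline);
```

The pipeline scans the whole collection, so on a loaded source it is safer to run it against a secondary.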
No tables found
Error message:
Unable to find any tables
No collections were extracted from the database. The user might be missing permissions for the database used in the transfer.
Solution: grant the readWrite permissions for the database to be transferred to the user that the transfer uses to connect to the source.
Error when transferring a sharded cluster
Solution: In the Snapshot settings → Parallel snapshot settings transfer parameter, specify the number of workers equal to the number of collections being transferred.
Error when transferring timeseries collections
Error messages:
Unable to find any tables
Cannot execute mongo activate hook:
Failed in accordance with configuration:
some tables from include list are missing in the source database: [<collection_name>]
The service does not support transfers of Time Series collections.
Solution: exclude any Time Series collections from the transfer and reactivate it.
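One way to build the exclusion list is to filter the collection metadata by type. This is a sketch under assumptions: the timeSeriesCollections helper is hypothetical, and it relies on the collection information returned by the MongoDB shell method db.getCollectionInfos(), where Time Series collections are reported with the timeseries type.

```javascript
// Hypothetical helper: picks Time Series collections out of the
// collection metadata and returns them in
// <database_name>.<collection_name> format.
function timeSeriesCollections(databaseName, collectionInfos) {
  return collectionInfos
    .filter((info) => info.type === "timeseries")
    .map((info) => `${databaseName}.${info.name}`);
}

// Sample metadata for illustration; in the MongoDB shell, use
//   timeSeriesCollections(db.getName(), db.getCollectionInfos());
const sampleInfos = [
  { name: "orders", type: "collection" },
  { name: "sensor_readings", type: "timeseries" },
  { name: "orders_view", type: "view" },
];
console.log(timeSeriesCollections("shop", sampleInfos));
```

The resulting names can then be added to the excluded collections of the endpoint.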
Unable to recognize an external cluster IP address or FQDN
The transfer fails with the error message:
server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: <unresolved_FQDN>, Type: Unknown, Last error: connection() error occurred during connection handshake: dial tcp: lookup <unresolved_FQDN> on <IP address>: no such host }, ] }"
The transfer error is due to the MongoDB cluster configuration. For example, when unresolved internal names are used in shard descriptions.
Solution:
Make sure the MongoDB cluster is configured so that it returns correctly resolving IP addresses or FQDNs (fully qualified domain names) in response to requests.
Error at data copying stage
The Snapshot and increment type transfer terminates with the following error at the copying stage:
encountered non-recoverable resume token error. Sync cannot be resumed from this state and must be terminated and re-enabled to continue functioning: (ChangeStreamHistoryLost) Resume of change stream was not possible, as the resume point may no longer be in the oplog.
The ChangeStreamHistoryLost error occurs when the total copy time of the MongoDB origin cluster data exceeds the operation log (oplog) time window size. You can check the current time window size in the management console. See the Oplog window graph of the cluster monitoring page.
For more information on oplog, see the MongoDB documentation.
Solution:
- Increase the oplog size (10% of the cluster disk size by default). To increase the oplog size in a Managed Service for MongoDB origin cluster, contact technical support. To change the oplog size if using a custom origin installation, see the MongoDB documentation.
- Enable parallel data copying to speed up the copying stage.
- Limit the list of transferable objects in the transfer settings.
Once that is done, activate the transfer again.
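Before reactivating, you can roughly compare the expected copy time with the oplog window. The sketch below is back-of-the-envelope arithmetic, not a guarantee: the copyFitsOplogWindow helper, the copy rate, and the safety factor are all hypothetical values you would need to measure or choose yourself. On a custom installation, the window in hours can be read in the MongoDB shell from db.getReplicationInfo().timeDiffHours.

```javascript
// Hypothetical helper: estimates whether the copying stage is likely
// to finish within the oplog time window.
// All inputs are assumptions supplied by the user.
function copyFitsOplogWindow(dataSizeGb, copyRateGbPerHour, oplogWindowHours, safetyFactor = 1.5) {
  if (copyRateGbPerHour <= 0) {
    throw new Error("copyRateGbPerHour must be positive");
  }
  const estimatedCopyHours = dataSizeGb / copyRateGbPerHour;
  return {
    estimatedCopyHours,
    // Leave a margin, since the copy rate may drop under load.
    fits: estimatedCopyHours * safetyFactor <= oplogWindowHours,
  };
}

// 600 GB at an assumed 50 GB/h takes about 12 hours.
console.log(copyFitsOplogWindow(600, 50, 24)); // fits a 24-hour window
console.log(copyFitsOplogWindow(600, 50, 12)); // does not fit a 12-hour window
```

If the estimate does not fit, the options listed above apply: a larger oplog, parallel copying, or fewer objects in the transfer.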
Source data cannot be sharded
The transfer from a MongoDB source fails with the following error message:
ERROR: Unable to Activate
error: "failed to execute mongo activate hook: Snapshot loading failed: unable to shard upload tables: unable to shard upload (main worker) tables: unable to shard tables for operation ID: unable to split table, err: cannot get delimiters: there are two or more types of objects in the sharding index"
The cannot get delimiters: there are two or more types of objects in the sharding index error means that the source collection id field contains different data types, making the source unsuitable for sharding.
Solution:
In the Snapshot settings → Parallel snapshot settings transfer settings, specify one worker and one stream to disable sharding.
Once that is done, activate the transfer again.
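To confirm that mixed key types are the cause, you can count the documents of a source collection by the type of their key. This is a minimal sketch: the helper names are hypothetical, and it assumes that the field in question is the _id field of the collection.

```javascript
// Hypothetical helper: builds an aggregation pipeline that counts
// documents per BSON type of the _id field.
// Assumption: the sharding key of the snapshot is _id.
function idTypesPipeline() {
  return [
    { $group: { _id: { $type: "$_id" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } },
  ];
}

// More than one row in the result means mixed _id types.
function hasMixedIdTypes(rows) {
  return rows.length > 1;
}

// Sample result for illustration; in the MongoDB shell, use
//   db.getCollection("<collection_name>").aggregate(idTypesPipeline()).toArray();
const sampleRows = [
  { _id: "objectId", count: 120000 },
  { _id: "string", count: 35 },
];
console.log(hasMixedIdTypes(sampleRows));
```

A single row in the result suggests the key types are uniform and the cause lies elsewhere.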