After watching a Yandex Cloud stream about Yandex Data Proc, we decided to create a course on big data analysis using Apache Spark™ clusters. The idea was to give our students an opportunity to work with distributed data and computation systems using real-world tools that are widely used in the industry.

Tools such as Yandex Data Proc and Apache Spark™ are quite advanced, and many ML projects do not need that much data. However, if you want to work at a high level, you should know how to use them.

4. Yandex DataSphere

At Yandex Cloud, we also build our own tools for ML development. One of them is Yandex DataSphere, a serverless platform that keeps hardware utilization close to 100%. This means you only pay for compute time, with no idle resources.
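To make the billing argument concrete, here is a minimal sketch that compares an always-on VM with a pay-per-compute model. Everything in it is an assumption made for illustration: the `Session` class, the helper functions, the 8-hour session with 2 hours of actual computation, and the hourly rate of 1.0 are hypothetical and do not reflect real Yandex DataSphere pricing or its API.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """A hypothetical notebook session (illustrative numbers only)."""
    wall_clock_hours: float  # how long the notebook stayed open
    compute_hours: float     # how long code was actually executing


def always_on_cost(session: Session, rate_per_hour: float) -> float:
    # A dedicated VM is billed for the whole session, idle time included.
    return session.wall_clock_hours * rate_per_hour


def pay_per_compute_cost(session: Session, rate_per_hour: float) -> float:
    # A pay-per-compute model bills only the time spent executing code.
    return session.compute_hours * rate_per_hour


def utilization(session: Session) -> float:
    # Share of the session during which the hardware did useful work.
    return session.compute_hours / session.wall_clock_hours


# Hypothetical session: open for 8 hours, computing for 2 of them.
session = Session(wall_clock_hours=8.0, compute_hours=2.0)
RATE = 1.0  # assumed price per hour, in arbitrary currency units

print(f"always-on VM:        {always_on_cost(session, RATE):.2f}")
print(f"pay-per-compute:     {pay_per_compute_cost(session, RATE):.2f}")
print(f"always-on usage:     {utilization(session):.0%}")
```

Under these assumed numbers, the always-on VM sits idle for three quarters of the session, and that idle time is what a pay-per-compute model avoids charging for.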

Its integration with other essential ML tools, like Yandex Data Proc, is another reason it stands out.

Yandex DataSphere works well for teams, too. Its flexible entity model allows setting up contexts for both development teams and student groups. You can also fine-tune visibility settings: for example, assign shared tasks to everyone while keeping each student’s project hidden from others.

Anton Naumov from the School of Data Analysis (SDA) pointed to the native file system as a major advantage of Yandex DataSphere. It saves both the results and the entire project, so you do not have to set up your environment again each time, as you do in Colab.

Still, there are some drawbacks. Anton mentioned that it was sometimes tricky to add all required packages, leading to workarounds. He also noted occasional problems accessing VMs during peak usage. We are already working on speeding up machine allocation and making the environment setup easier.