
34 Things I Wish I Knew Before My Databricks ML Associate Exam

Databricks is one of the most powerful data platforms out there. It combines a Data Platform and an AI Platform into the Databricks Data Intelligence Platform, which enables organizations to capture the true potential of their data and their data teams.

To enable users to make the most of its countless features, Databricks has opened free overview courses to partners and customers in the Databricks Academy. These courses are linked to paid certifications targeting different roles, including for ML professionals. These courses and certification paths are a great way of learning more about Databricks.

There are two relevant Databricks certifications for Machine Learning:

  • Machine Learning Associate
  • Machine Learning Professional

These are often pursued by Machine Learning Engineers (of all flavours, including MLOps) and Data Scientists who want to prove their Databricks expertise and gain a more comprehensive view of what the platform offers.

This was the case for me. I am a Machine Learning Engineer. While preparing for the Machine Learning Associate certification, I noticed a lack of good public material to help candidates prepare. I have since passed the exam, and I wrote this article to help you earn your badge too.

This article covers relevant topics for the Associate certification. This is not an introduction to Machine Learning or Databricks; rather, this is a topic review for the Databricks Machine Learning Associate certification. Be sharp on the following topics before taking the exam.

The Exam

To obtain the Databricks Machine Learning Associate certification, you need to pass an online-proctored exam. The certification is intended for professionals with at least six months of experience with the platform, and it is valid for 2 years.

In under 90 minutes, you will need to achieve a 70% mark on 45 questions. Some of these questions focus on code and the Databricks platform, others on modelling and theoretical Machine Learning concepts.

The exam is only available in English.

It is divided into four sections with the following weights and topics:

  1. Databricks Machine Learning (29%): Databricks ML, Databricks Runtime for Machine Learning, AutoML, Feature Store, Managed MLflow
  2. ML Workflows (29%): Exploratory Data Analysis, Feature Engineering, Training, Evaluation and Selection
  3. Spark ML (33%): Distributed ML Concepts, Spark ML Modelling APIs, Hyperopt, Pandas API on Spark, Pandas UDFs/Function APIs
  4. Scaling ML Models (9%): Model Distribution, Ensembling Distribution

More details are available in the official Exam Guide.
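
To make the numbers above concrete, here is a small Python sketch of the arithmetic. It assumes that all 45 questions are scored and weighted equally, and that section weights translate into question counts by simple rounding. Neither assumption is confirmed here, so treat the results as rough estimates rather than official figures. The names `section_weights`, `min_correct`, and `approx_questions` are purely illustrative.

```python
# Rough exam arithmetic. ASSUMPTION: every question is scored and weighted
# equally; check the official Exam Guide for the actual scoring rules.
TOTAL_QUESTIONS = 45
PASS_MARK_PCT = 70
TIME_LIMIT_MIN = 90

# Section weights in percent, as listed above.
section_weights = {
    "Databricks Machine Learning": 29,
    "ML Workflows": 29,
    "Spark ML": 33,
    "Scaling ML Models": 9,
}

# Smallest number of correct answers that reaches the pass mark
# (integer ceiling of 70% of 45).
min_correct = (TOTAL_QUESTIONS * PASS_MARK_PCT + 99) // 100

# Approximate questions per section, rounded to the nearest whole question.
approx_questions = {
    name: round(TOTAL_QUESTIONS * pct / 100)
    for name, pct in section_weights.items()
}

# Average time budget per question.
minutes_per_question = TIME_LIMIT_MIN / TOTAL_QUESTIONS

print(f"Correct answers needed: {min_correct} of {TOTAL_QUESTIONS}")
print(f"Average time per question: {minutes_per_question:.0f} minutes")
for name, count in approx_questions.items():
    print(f"{name}: about {count} questions")
```

Under these assumptions, you need roughly 32 correct answers and have about two minutes per question, with Spark ML likely contributing the most questions.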

Databricks ML Associate Exam: 34 useful things to know

Below are all 34 topics, each linking to the corresponding section of the blog post, which was originally published on Medium.com.

Databricks Machine Learning

  1. How to integrate Git and Databricks to deliver CI/CD?
  2. When is a single-node cluster preferred?
  3. Which libraries are included in Databricks Runtime ML?
  4. How to install additional libraries to a cluster?
  5. What can Databricks AutoML do?
  6. What evaluation metrics are optimized by Databricks AutoML?
  7. How to orchestrate multi-task Databricks jobs?
  8. What is a Feature Store?
  9. How to take a feature from Feature Store?
  10. How to log an artifact using MLflow?
  11. How to get the best run of a model using MLflow?

ML Workflows

  12. How to filter a Spark DataFrame on a given column?
  13. How to use dbutils.data.summarize?
  14. How to use summary?
  15. How to impute missing values?
  16. How to deal with missing data in tree algorithms?
  17. How to index and one hot encode categorical variables?
  18. How to use the VectorAssembler?
  19. How do you interpret level and logs in regressions?
  20. How to evaluate a regression model?
  21. How to perform cross-validation when fitting a model?
  22. How to evaluate a binary classifier?
  23. How to orchestrate runs with MLflow?

Spark ML

  24. What is an estimator?
  25. How may scikit-learn and Spark deliver different models if the same data and parameters are used?
  26. What are the hardships of parallelization?
  27. How to optimize with Hyperopt?
  28. How to convert Pandas to Spark and back?
  29. What is the Pandas API on Spark?
  30. What is the Pandas UDF?
  31. How to implement a Series to Series Pandas UDF?
  32. When to use applyInPandas and mapInPandas?

Scaling ML Models

  33. What are ensemble methods?
  34. What is Gradient Boosting?

Good luck!

Enroll now for the Databricks Machine Learning Associate certification and go get your badge.

If you found this article useful, please like, comment, share, or buy me a coffee.