  • Databricks-Certified-Professional-Data-Engineer Exam Question 1

    A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
    The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
    Which approach would simplify the identification of these changed records?
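    One widely used pattern for this kind of requirement (a sketch, not the exam's answer key) is Delta Lake Change Data Feed: switching the nightly blind overwrite to an incremental MERGE and enabling CDF lets downstream consumers read exactly the rows that changed. The timestamp literal below is illustrative, not from the question.

    ```sql
    -- Enable Change Data Feed so row-level changes are recorded
    -- (assumption: the nightly job is also rewritten as a MERGE rather than
    -- an overwrite, since a blind overwrite marks every row as changed)
    ALTER TABLE customer_churn_params
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

    -- Read only the rows that changed since a given point in time
    -- (the literal timestamp is illustrative; a real job would compute it)
    SELECT * FROM table_changes('customer_churn_params', '2024-01-01 00:00:00');
    ```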
  • Databricks-Certified-Professional-Data-Engineer Exam Question 2

    A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using calls to confirm that the code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
    Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
  • Databricks-Certified-Professional-Data-Engineer Exam Question 3

    The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

    The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
    Which code block accomplishes this task while minimizing potential compute costs?
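    Since predictions are produced at most once per day and every run must remain queryable, a plain append is typically the cheapest write pattern: unlike MERGE or overwrite, it rewrites no existing files. A minimal sketch, assuming the preds DataFrame has been registered as a temporary view; the target table name churn_preds is a hypothetical:

    ```sql
    -- One-time setup: target table matching the preds schema
    -- (the name churn_preds is an assumption, not from the question)
    CREATE TABLE IF NOT EXISTS churn_preds
      (customer_id LONG, predictions DOUBLE, date DATE)
    USING DELTA;

    -- Daily: append the new predictions, preserving all prior runs
    INSERT INTO churn_preds
    SELECT customer_id, predictions, date FROM preds;
    ```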
  • Databricks-Certified-Professional-Data-Engineer Exam Question 4

    A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
    Streaming DataFrame df has the following schema:
    "device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
    Code block:

    Choose the response that correctly fills in the blank within the code block to complete this task.
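    Non-overlapping intervals call for a tumbling window, i.e. a time window whose slide equals its duration. The original code block is not reproduced above; purely as an illustration, the equivalent aggregation in Spark SQL (assuming df is exposed as a view named events, and assuming the averages are also grouped per device) would look like:

    ```sql
    -- Tumbling five-minute window: duration only, no slide interval
    SELECT
      device_id,
      window(event_time, '5 minutes') AS time_window,
      avg(temp)     AS avg_temp,
      avg(humidity) AS avg_humidity
    FROM events
    GROUP BY device_id, window(event_time, '5 minutes');
    ```

    In the DataFrame API the same grouping is written with pyspark.sql.functions.window("event_time", "5 minutes"); a streaming query would normally also set a watermark on event_time so that state for old windows can eventually be dropped.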
  • Databricks-Certified-Professional-Data-Engineer Exam Question 5

    A Delta Lake table was created with the query below:

    Realizing that the original query had a typographical error, the code below was executed:
    ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
    Which result will occur after running the second command?
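    For a managed Delta table, RENAME TO updates the table's entry in the metastore; the transaction log and data files themselves are not rewritten, so the table's contents and history remain available under the new name while the old name stops resolving. A minimal reproduction sketch (the column list is an illustrative assumption, since the original CREATE statement is not shown):

    ```sql
    -- Hypothetical recreation of the scenario
    CREATE TABLE prod.sales_by_stor (store_id INT, total DOUBLE) USING DELTA;
    ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store;

    SELECT * FROM prod.sales_by_store;  -- succeeds: same data, new name
    SELECT * FROM prod.sales_by_stor;   -- fails: no table under the old name
    ```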