The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])
Correct Answer: E
Explanation

Correct code block:

transactionsDfMonday.unionByName(transactionsDfTuesday, True)

Output of correct code block:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
|            1|     null|    4|     25|        1|null|
|            2|     null|    7|      2|        2|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
+-------------+---------+-----+-------+---------+----+

To solve this question, you need to know the difference between the DataFrame.union() and DataFrame.unionByName() methods. The first matches columns by position only, regardless of their names. The second matches columns by name, which is what the question asks for. It also takes an optional argument, allowMissingColumns, passed as True in the code block above. With it enabled, you can merge DataFrames whose column sets differ, as in this example; columns missing from one side are filled with null.

sc stands for SparkContext and is provided automatically when executing code on Databricks. While sc.union() lets you combine RDDs, it is not the right choice for combining DataFrames. The question hints away from sc.union() where it talks about merging rows "into a new DataFrame".

concat is a function in pyspark.sql.functions. It is great for consolidating values from different columns, but has no place when trying to combine the rows of multiple DataFrames.

Finally, the join method is a contender here. However, its default is an inner join, which does not get us closer to the goal of matching the two DataFrames as instructed, especially given that with the default arguments we cannot define a join condition.

More info:
- pyspark.sql.DataFrame.unionByName - PySpark 3.1.2 documentation
- pyspark.SparkContext.union - PySpark 3.1.2 documentation
- pyspark.sql.functions.concat - PySpark 3.1.2 documentation
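The difference between matching by position and matching by name can be modelled in plain Python. The sketch below is not Spark code: a "frame" is a hypothetical (columns, rows) pair, and union_by_position and union_by_name are illustrative helpers that only imitate the matching rules of DataFrame.union() and DataFrame.unionByName() as described above, with None standing in for null.

```python
# Plain-Python model of the two matching rules; this is NOT Spark code.
# A "frame" is a hypothetical (columns, rows) pair standing in for a DataFrame.

def union_by_position(left, right):
    """Model of DataFrame.union(): columns are paired by order, names ignored."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("union requires the same number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left, right, allow_missing_columns=False):
    """Model of DataFrame.unionByName(): columns are paired by name."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if not allow_missing_columns and set(left_cols) != set(right_cols):
        raise ValueError("column names differ and allow_missing_columns is False")
    # Keep the left column order, then append columns that only the right side has.
    out_cols = left_cols + [c for c in right_cols if c not in left_cols]

    def realign(cols, rows):
        index = {name: i for i, name in enumerate(cols)}
        # None fills every column that this side does not have.
        return [tuple(row[index[c]] if c in index else None for c in out_cols)
                for row in rows]

    return out_cols, realign(left_cols, left_rows) + realign(right_cols, right_rows)


monday = (["transactionId", "predError", "value", "storeId", "productId", "f"],
          [(5, None, None, None, 2, None), (6, 3, 2, 25, 2, None)])
tuesday = (["storeId", "transactionId", "productId", "value"],
           [(25, 1, 1, 4), (2, 2, 2, 7), (3, 4, 2, None), (None, 5, 2, None)])

merged_cols, merged_rows = union_by_name(monday, tuesday, allow_missing_columns=True)
for row in merged_rows:
    print(row)
```

In this model, the first Tuesday row (25, 1, 1, 4) is realigned to (1, None, 4, 25, 1, None), which corresponds to the third row of the output table. Calling union_by_position(monday, tuesday) raises an error because the two frames have six and four columns.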
Associate-Developer-Apache-Spark Exam Question 42
Which of the following DataFrame methods is classified as a transformation?
Correct Answer: C
Explanation

DataFrame.select()
Correct, DataFrame.select() is a transformation. It is evaluated lazily: the call returns a new DataFrame, and no computation takes place until an action triggers it.

DataFrame.foreach()
Incorrect, DataFrame.foreach() is not a transformation, but an action. The intention of foreach() is to apply code to each element of a DataFrame to update accumulator variables or write the elements to external storage. It does not return a new DataFrame.

DataFrame.first()
Wrong. As an action, DataFrame.first() is executed immediately and returns the first row of a DataFrame.

DataFrame.count()
Incorrect. DataFrame.count() is an action and returns the number of rows in a DataFrame.

DataFrame.show()
No, DataFrame.show() is an action and displays the DataFrame upon execution of the command.
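The split between lazy transformations and eager actions can be illustrated with a small plain-Python analogy. This is not Spark code: LazyFrame is a hypothetical class invented for this sketch, and log is a helper list that records when a select step actually processes a row.

```python
# Plain-Python analogy for lazy transformations and eager actions; NOT Spark code.
# LazyFrame and log are hypothetical names invented for this illustration.

log = []  # records every row that a select step actually processes


class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = list(rows)
        self._plan = tuple(plan)  # recorded steps that have not run yet

    def select(self, *names):
        """Transformation: records a step and returns a new LazyFrame."""
        def step(row):
            log.append(row)
            return {name: row[name] for name in names}
        return LazyFrame(self._rows, self._plan + (step,))

    def _run(self):
        # Apply every recorded step to each row, one row at a time.
        for row in self._rows:
            for step in self._plan:
                row = step(row)
            yield row

    def count(self):
        """Action: runs the recorded plan and returns a plain value."""
        return sum(1 for _ in self._run())

    def first(self):
        """Action: runs the plan until the first row is available."""
        return next(self._run(), None)


rows = [{"transactionId": 1, "value": 4}, {"transactionId": 2, "value": 7}]
selected = LazyFrame(rows).select("value")
work_after_select = len(log)  # select() alone has processed nothing
n = selected.count()          # the action triggers the recorded step
work_after_count = len(log)
```

Here select() only extends the recorded plan, so log stays empty until count() runs. Spark's planner and scheduler are far more involved, but the division of labor is the same: transformations describe work, actions trigger it.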
Associate-Developer-Apache-Spark Exam Question 43
Which of the following statements about DAGs is correct?
Correct Answer: E
Explanation

DAG stands for "Directing Acyclic Graph".
No, DAG stands for "Directed Acyclic Graph".

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
No, quite the opposite. You can access DAGs through the Spark UI and they can be of great help when optimizing queries manually.

In contrast to transformations, DAGs are never lazily executed.
Incorrect. DAGs represent the execution plan in Spark and as such are lazily executed when the driver requests the data processed in the DAG.
Associate-Developer-Apache-Spark Exam Question 44
Which of the following describes the characteristics of accumulators?
Correct Answer: E
Explanation

If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Correct, when Spark tries to rerun a failed action that includes an accumulator, it will only update the accumulator if the action succeeded.

Accumulators are immutable.
No. Although accumulators behave like write-only variables towards the executors and can only be read by the driver, they are not immutable.

All accumulators used in a Spark application are listed in the Spark UI.
Incorrect. For Scala, only named accumulators are listed in the Spark UI; unnamed ones are not. For PySpark, no accumulators are listed in the Spark UI, since this feature is not yet implemented.

Accumulators are used to pass around lookup tables across the cluster.
Wrong, this is what broadcast variables do.

Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
Wrong, accumulators are instantiated via the accumulator(n) method of the SparkContext, for example: counter = spark.sparkContext.accumulator(0).

More info:
- python - In Spark, RDDs are immutable, then how Accumulators are implemented? - Stack Overflow
- apache spark - When are accumulators truly reliable? - Stack Overflow
- Spark - The Definitive Guide, Chapter 14
Associate-Developer-Apache-Spark Exam Question 45
Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
Correct Answer: E
Explanation

spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the correct method to use here, and the type FloatType(), which is passed in for the command's schema argument, is correctly instantiated using the parentheses. Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. In Databricks, spark refers to a SparkSession object.

spark.createDataFrame((throughputRates), FloatType)
No. While wrapping throughputRates in parentheses does not change how this command executes, not instantiating FloatType with parentheses, as done in the previous answer, makes this command fail.

spark.createDataFrame(throughputRates, FloatType)
Incorrect. It does not matter whether you wrap throughputRates in parentheses or not, so this command fails for the same reason as the previous answer: FloatType is not instantiated.

spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.

spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), inference expects data made up of records: the documentation speaks of an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). Since throughputRates is a Python list of bare float values, Spark's schema inference will fail.

More info:
- pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation