Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?
Correct Answer: D
Explanation transactionsDf.select('value').union(transactionsDf.select('productId')).distinct() Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark documentation for the union command (link below). transactionsDf.select('value', 'productId').distinct() Wrong. This code block returns unique rows, but not unique values. transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'}) Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls). transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'}) No. This command will count the number of rows, but will not return unique values. transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer') Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read up on the difference between union and join, a link is posted below. More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation, sql - What is the difference between JOIN and UNION? - Stack Overflow Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 22
Which of the following DataFrame operators is never classified as a wide transformation?
Correct Answer: D
Explanation As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide transformation. DataFrame.select() Correct! A wide transformation includes a shuffle, meaning that an input partition maps to one or more output partitions. This is expensive and causes traffic across the cluster. With the select() operation however, you pass commands to Spark that tell Spark to perform an operation on a specific slice of any partition. For this, Spark does not need to exchange data across partitions, each partition can be worked on independently. Thus, you do not cause a wide transformation. DataFrame.repartition() Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation. DataFrame.aggregate() No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation. DataFrame.join() Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation. DataFrame.sort() False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation. More info: Understanding Apache Spark Shuffle | Philipp Brunenberg
Associate-Developer-Apache-Spark Exam Question 23
The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to transactionNumber. Find the error. Code block: transactionsDf.withColumn("transactionNumber", "transactionId")
Correct Answer: E
Explanation Correct code block: transactionsDf.withColumnRenamed("transactionId", "transactionNumber") Note that in Spark, a copy is returned by default. So, there is no need to append copy() to the code block. More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 24
Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?
Correct Answer: D
Explanation In this question it is important to realize that you are asked to sort transactionDf by two columns. This means that the sorting of the second column depends on the sorting of the first column. So, any option that sorts the entire DataFrame (through chaining sort statements) will not work. The two columns need to be channeled through the same call to sort(). Also, order_by is not a valid DataFrame API method. More info: pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 25
Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
Correct Answer: F
Explanation from pyspark import StorageLevel transactionsDf.persist(StorageLevel.MEMORY_ONLY) Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed. transactionsDf.cache() This is wrong because the default storage level of DataFrame.cache() is MEMORY_AND_DISK, meaning that partitions that do not fit into memory are stored on disk. transactionsDf.persist() This is wrong because the default storage level of DataFrame.persist() is MEMORY_AND_DISK. transactionsDf.clear_persist() Incorrect, since clear_persist() is not a method of DataFrame. transactionsDf.storage_level('MEMORY_ONLY') Wrong. storage_level is not a method of DataFrame. More info: RDD Programming Guide - Spark 3.0.0 Documentation, pyspark.sql.DataFrame.persist - PySpark 3.0.0 documentation (https://bit.ly/3sxHLVC , https://bit.ly/3j2N6B9)