Explanation

The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation has neither a printSchema nor a formatSchema method (see the RDD documentation linked below for the available methods). The output of print(transactionsDf.schema) is this:

StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructFiel...

It contains the same information as the nicely formatted original output, but on a single unformatted line. Lastly, the DataFrame's schema attribute does not have a print() method.

More info:
- pyspark.RDD (PySpark 3.1.2 documentation)
- pyspark.sql.DataFrame.printSchema (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 22
Which of the following statements about stages is correct?
Correct Answer: D
Explanation

Tasks in a stage may be executed by multiple machines at the same time.
This is correct. Within a single stage, tasks do not depend on each other. Executors on multiple machines may, at the same time, execute tasks belonging to the same stage on the partitions they are holding.

Different stages in a job may be executed in parallel.
No. Different stages in a job depend on each other and cannot be executed in parallel. The nuance is that every task in a stage may be executed in parallel by multiple machines. For example, if a job consists of Stage A and Stage B, tasks belonging to those stages may not be executed in parallel. However, tasks from Stage A may be executed on multiple machines at the same time, with each machine running its task on a different partition of the same dataset. Afterwards, tasks from Stage B may be executed on multiple machines at the same time.

Stages may contain multiple actions, narrow, and wide transformations.
No, stages may not contain multiple wide transformations. A wide transformation requires shuffling, and shuffling typically terminates a stage because data needs to be exchanged across the cluster. This data exchange often causes partitions to change and rearrange, making it impossible to perform tasks in parallel on the same dataset.

Stages ephemerally store transactions, before they are committed through actions.
No, this does not make sense. Stages do not "store" any data, and transactions are not "committed" in Spark.

Stages consist of one or more jobs.
No, it is the other way around: jobs consist of one or more stages.

More info: Spark: The Definitive Guide, Chapter 15.
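The Stage A / Stage B example above can be sketched as a toy model in plain Python. This is not Spark's scheduler: threads stand in for executors, lists stand in for partitions, and all function names are made up for illustration. The point is only the ordering: tasks inside one stage run concurrently, while Stage B starts after the shuffle of Stage A's output has finished.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model only: each inner list plays the role of one partition.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]


def stage_a_task(partition):
    # Narrow work: a task only touches its own partition.
    return [(value % 2, value) for value in partition]


def shuffle(mapped_partitions, num_partitions=2):
    # Stage boundary: records are regrouped by key across all partitions.
    shuffled = [[] for _ in range(num_partitions)]
    for partition in mapped_partitions:
        for key, value in partition:
            shuffled[key % num_partitions].append((key, value))
    return shuffled


def stage_b_task(partition):
    # Runs only after the shuffle: sums the values of one key group.
    return (partition[0][0], sum(value for _, value in partition))


with ThreadPoolExecutor(max_workers=4) as pool:
    # Stage A: one task per partition, executed concurrently.
    stage_a_output = list(pool.map(stage_a_task, partitions))
    # The shuffle needs all Stage A results, so it acts as a barrier.
    stage_b_input = shuffle(stage_a_output)
    # Stage B: again one task per (new) partition, executed concurrently.
    stage_b_output = sorted(pool.map(stage_b_task, stage_b_input))

print(stage_b_output)  # [(0, 42), (1, 36)]
```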
Associate-Developer-Apache-Spark Exam Question 23
The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
Correct Answer: B
Explanation

Correct code block:

transactionsDf.drop("productId", "f")

This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that contains only a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
Correct! Here, you need to figure out the many things that are wrong with the initial code block. While the question could be solved with a select statement, a drop statement is the correct one given the answer options. Then, you can read in the documentation that drop does not take a list as an argument, but just the names of the columns that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.

The column names should be listed directly as arguments to the operator and not as a list.
Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named productId instead of telling Spark to use the column productId. For that, you need to express the column name as a string.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
No. This still leaves you with Python trying to interpret the column names as Python variables (see above).

The select operator should be replaced by a drop operator.
Wrong, this is not enough to solve the question. If you do this, you will still face problems, since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).

More info: pyspark.sql.DataFrame.drop (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 24
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?
Correct Answer: B
Explanation

distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.

More info: pyspark.sql.DataFrame.distinct (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 25
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?
Correct Answer: B
Explanation

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.

transactionsDf.coalesce(10)
No, after this command transactionsDf will continue to have only 8 partitions. This is because coalesce() can only decrease the number of partitions, but not increase it.

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.

transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.

More info:
- pyspark.sql.DataFrame.repartition (PySpark 3.1.2 documentation)
- pyspark.RDD.getNumPartitions (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3