Which of the following is the deepest level in Spark's execution hierarchy?
Correct Answer: B
Explanation: The hierarchy is, from top to bottom: Job, Stage, Task, so the task is the deepest level. Executors and slots facilitate the execution of tasks, but they are not part of the hierarchy themselves. Executors are launched on worker nodes, by the cluster manager on behalf of the driver, to run a specific Spark application. Slots help Spark parallelize work: an executor can have multiple slots, which enables it to process multiple tasks in parallel.
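As a rough illustration of how slots bound parallelism, the sketch below estimates how many slots a cluster offers and how many rounds of tasks a single stage would need. The function names and the example numbers are hypothetical, and the model assumes that every slot is free and that each task needs one CPU core (the default for spark.task.cpus). It is a back-of-the-envelope model, not a description of how Spark's scheduler is implemented.

```python
import math


def total_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Upper bound on the number of tasks that can run at the same time."""
    return num_executors * (cores_per_executor // cpus_per_task)


def task_waves(num_tasks, slots):
    """Rounds needed to run all tasks of one stage if every slot is busy each round."""
    return math.ceil(num_tasks / slots)


# Hypothetical cluster: 4 executors with 2 cores each.
slots = total_slots(num_executors=4, cores_per_executor=2)
# Hypothetical stage with 20 tasks (one task per partition).
waves = task_waves(num_tasks=20, slots=slots)
print(slots, waves)
```

With these made-up numbers, 8 tasks can run in parallel and the stage needs 3 rounds, the last of which leaves some slots idle.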
Associate-Developer-Apache-Spark Exam Question 27
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?
Correct Answer: D
Explanation: The argument passed to schema() must be a StructType or a string, so every option that passes a list is incorrect. Since all values are whole numbers, IntegerType() is the correct data type here. NumberType() is not a valid data type, and StringType() would fail: the parquet file is stored in the "most appropriate format for this kind of data", which most likely means an integer type, and Spark does not convert data types when a schema is provided. Also note that StructType accepts only a single argument (a list of StructFields), so passing multiple arguments is invalid. Finally, Spark needs to know which format the file is in. None of the options fails on this point, though, since Spark assumes parquet by default when no file format is specified. More info: pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation and StructType - PySpark 3.1.2 documentation
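Because schema() also accepts a string, one way to sketch the read is with a DDL-formatted schema string instead of a StructType. The helper names and the column names in the usage note are hypothetical (the question does not list the file's columns), and `spark` is assumed to be an existing SparkSession. This is a minimal sketch under those assumptions, not the exam's answer option.

```python
def int_schema_ddl(column_names):
    """Build a DDL schema string that declares every column as INT.

    Column names containing spaces or special characters would need
    quoting, which this sketch does not handle.
    """
    return ", ".join(f"{name} INT" for name in column_names)


def read_whole_number_parquet(spark, file_path, column_names):
    """Read a parquet file with an explicit all-integer schema.

    No format() call is made: load() falls back to the default data
    source, which is parquet unless spark.sql.sources.default was changed.
    """
    return spark.read.schema(int_schema_ddl(column_names)).load(file_path)
```

For example, `read_whole_number_parquet(spark, filePath, ["itemId", "value"])` would pass the string `itemId INT, value INT` as the schema. The equivalent StructType would wrap one StructField with IntegerType() per column in a single list.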
Associate-Developer-Apache-Spark Exam Question 28
The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error. Code block: transactionsDf.orderBy('value', asc_nulls_first(col('predError')))
Correct Answer: C
Explanation: Correct code block: transactionsDf.orderBy('value', desc_nulls_last('predError'))
"Column predError should be sorted in a descending way, putting nulls last." Correct! By default, Spark sorts ascending, putting nulls first. Column value uses this default, so the inverse order for predError is indeed desc_nulls_last.
"Instead of orderBy, sort should be used." No. For DataFrames, orderBy() is an alias of sort(), so swapping one for the other changes nothing. (The operator that sorts only within each partition is sortWithinPartitions().)
"Column value should be wrapped by the col() operator." Incorrect. DataFrame.orderBy() accepts both column-name strings and Column objects.
"Column predError should be sorted by desc_nulls_first() instead." Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls have to come last when the order is inverted.
"Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement." No, this would just sort the DataFrame by the very last column, and would not take information from both columns into account, as required by the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
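To make the target ordering concrete, here is a plain-Python model (not Spark code) of the row order that transactionsDf.orderBy('value', desc_nulls_last('predError')) is expected to produce. The sample rows and the two key functions are made up for illustration, and None stands in for null.

```python
# Made-up sample rows; None plays the role of a null value.
rows = [
    {"value": 2, "predError": 3},
    {"value": 1, "predError": None},
    {"value": 1, "predError": 6},
    {"value": 2, "predError": None},
    {"value": 1, "predError": 3},
    {"value": None, "predError": 1},
]


def asc_nulls_first_key(x):
    """Sort key: None before every number, numbers ascending."""
    return (0, 0) if x is None else (1, x)


def desc_nulls_last_key(x):
    """Sort key: numbers descending (via negation), None after every number."""
    return (1, 0) if x is None else (0, -x)


def order_like_question(rows):
    """Order by value (default: ascending, nulls first), then predError inverted."""
    return sorted(
        rows,
        key=lambda r: (asc_nulls_first_key(r["value"]),
                       desc_nulls_last_key(r["predError"])),
    )


ordered = [(r["value"], r["predError"]) for r in order_like_question(rows)]
for pair in ordered:
    print(pair)
```

Within each group of equal value, the larger predError comes first and the missing predError comes last, which is the mirror image of how value itself is ordered.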
Associate-Developer-Apache-Spark Exam Question 29
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
Correct Answer: A
Explanation:
transactionsDf.select("storeId").dropDuplicates().count(): Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count("storeId")).dropDuplicates(): No. transactionsDf.select(count("storeId")) just returns a single-row DataFrame showing the number of non-null values in storeId. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count("storeId")): Incorrect. transactionsDf.dropDuplicates() removes only full-row duplicates from transactionsDf, without looking at column storeId in isolation, so repeated storeId values can remain and get counted.
transactionsDf.distinct().select("storeId").count(): Wrong. transactionsDf.distinct() identifies unique rows across all columns, not unique rows with respect to column storeId alone. This may leave duplicate values in the column, so the count does not represent the number of unique values in that column.
transactionsDf.select(distinct("storeId")).count(): False. There is no distinct method in pyspark.sql.functions.
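The difference between the options is easier to see on a tiny example. The following plain-Python model (not Spark code) mimics what four of the expressions compute on a made-up two-column table. The rows are invented for illustration, None stands in for null, and the model assumes that dropDuplicates() keeps one row for null, so null counts as one value.

```python
# Made-up rows of (transactionId, storeId); None plays the role of null.
rows = [
    (1, 25),
    (2, 25),
    (3, 3),
    (3, 3),      # full-row duplicate of the row above
    (4, None),
    (5, None),
    (6, 3),
]
store_ids = [store_id for _, store_id in rows]

# select("storeId").dropDuplicates().count(): unique values in the column
unique_store_ids = len(set(store_ids))

# select(count("storeId")): number of non-null values, duplicates included
non_null_count = sum(1 for s in store_ids if s is not None)

# distinct().select("storeId").count(): number of unique full rows
distinct_row_count = len(set(rows))

# dropDuplicates().agg(count("storeId")): non-null storeIds among unique rows
dedup_then_count = sum(1 for _, s in set(rows) if s is not None)

print(unique_store_ids, non_null_count, distinct_row_count, dedup_then_count)
```

On this sample all four results differ, and only the first one matches the number of unique values in the column.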
Associate-Developer-Apache-Spark Exam Question 30
Which of the following statements about executors is correct?
Correct Answer: B
Explanation:
"Executors stop upon application completion by default." Correct. Executors only persist during the lifetime of an application. A notable exception is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, regardless of whether the application has completed.
"An executor can serve multiple applications." Wrong. An executor is always specific to one application. It is terminated when the application completes, or earlier if Dynamic Resource Allocation removes it while idle.
"Each node hosts a single executor." No. Each node can host one or more executors.
"Executors store data in memory only." No. Executors can store data in memory or on disk.
"Executors are launched by the driver." Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation, How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT, and Spark Jargon for Starters. This blog is to clear some of the... | by Mageswaran D | Medium
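Since the correct answer depends on Dynamic Resource Allocation being off by default, here is a small sketch of the settings that would turn it on, rendered as spark-submit arguments. The helper name is hypothetical and the values are examples only. Dynamic allocation also needs either an external shuffle service or shuffle tracking, so the exact set of options should be checked against the configuration documentation for the Spark version in use.

```python
# Example settings only; spark.dynamicAllocation.enabled defaults to false.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Idle executors are released after this timeout.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # One way to satisfy the shuffle requirement (available in Spark 3.x).
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(dynamic_allocation_conf)
print(" ".join(submit_args))
```

Without these settings, the executors acquired at startup stay with the application until it finishes, which is the default behaviour the correct answer describes.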