Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?
Correct Answer: C
Explanation
While only one of these code blocks works, the DataFrame API is flexible about how columns are passed to the pow() function. The following code blocks would also work:
    transactionsDf.withColumn("predErrorSquared", pow("predError", 2))
    transactionsDf.withColumn("predErrorSquared", pow("predError", lit(2)))
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/26.html, https://bit.ly/sparkpracticeexams_import_instructions)
Associate-Developer-Apache-Spark Exam Question 32
Which of the following statements about DAGs is correct?
Correct Answer: E
Explanation
DAG stands for "Directing Acyclic Graph".
No, DAG stands for "Directed Acyclic Graph".

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
No, quite the opposite. You can access DAGs through the Spark UI and they can be of great help when optimizing queries manually.

In contrast to transformations, DAGs are never lazily executed.
DAGs represent the execution plan in Spark and as such are lazily executed when the driver requests the data processed in the DAG.
Associate-Developer-Apache-Spark Exam Question 33
The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.
    from pyspark import StorageLevel
    transactionsDf.__1__(StorageLevel.__2__).__3__
Correct Answer: E
Explanation
Correct code block:
    from pyspark import StorageLevel
    transactionsDf.persist(StorageLevel.MEMORY_ONLY_2).count()
Only persist() accepts a storage level argument, so any option using cache() cannot be correct. persist() is evaluated lazily, so an action needs to follow this command. select() is not an action, but count() is, so all options using select() are incorrect.
Finally, the question states that "the executors' memory should be utilized as much as possible, but not writing anything to disk". This points to a MEMORY_ONLY storage level. In this storage level, partitions that do not fit into memory are recomputed when they are needed, instead of being written to disk as with the storage level MEMORY_AND_DISK. Since the data needs to be duplicated across two executors, _2 needs to be appended to the storage level.
Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 34
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
    root
     |-- itemId: integer (nullable = true)
     |-- attributes: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- supplier: string (nullable = true)
Code block:
    schema = StructType([
        StructType("itemId", IntegerType(), True),
        StructType("attributes", ArrayType(StringType(), True), True),
        StructType("supplier", StringType(), True)
    ])

    spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)
Correct Answer: D
Explanation
Correct code block:
    schema = StructType([
        StructField("itemId", IntegerType(), True),
        StructField("attributes", ArrayType(StringType(), True), True),
        StructField("supplier", StringType(), True)
    ])

    spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)
This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below), so nesting StructType inside StructType as shown in the question is wrong. The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not by two consecutive non-keyword arguments as in the original code block (see documentation linked below). Spark cannot identify the file format, because the format has to be specified either through DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block. It is correct, however, that the modification date threshold is specified incorrectly (see above).

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see the correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect. The columns in the schema definition do use the wrong object type (see the correct answer above), but the DataFrameReader is called correctly through the SparkSession spark.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType, which is an accepted data type for the DataFrameReader.schema() method. It is correct, however, that the modification date threshold is specified incorrectly (see the correct answer above).
Associate-Developer-Apache-Spark Exam Question 35
Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?
Correct Answer: E
Explanation
The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command calculates various statistics (see documentation linked below), including standard deviation and minimum. Note that the answer that lists many options in the summary() parentheses does not include the minimum, which is asked for in the question. Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
More info:
- pyspark.sql.DataFrame.summary - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3