The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error. Code block:
Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first in the alphabet?
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Correct Answer: D
Explanation
Output of correct code block:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity of sort_array has to be considered: the sort direction is given by its second argument (asc), not by the desc method. This behavior is described in the documentation (see below). To solve this question, you also need to understand the difference between sort and sort_array. sort orders the rows of a DataFrame; it cannot sort the values inside arrays. In addition, sort is a method of DataFrame, while sort_array is a function in the pyspark.sql.functions module.
More info: pyspark.sql.functions.sort_array - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
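The point about sort_array taking its direction from the second argument can be sketched as follows. This is only an illustration, not the exam's answer option: the helper name reorder_attributes and the rows list are assumptions made for this example, and the PySpark import is placed inside the function so that the plain-Python cross-check at the bottom runs even where Spark is not installed.

```python
# Sample rows mirroring the itemsDf table from the question.
rows = [
    (1, ["blue", "winter", "cozy"], "Sports Company Inc."),
    (2, ["red", "summer", "fresh", "cooling"], "YetiX"),
    (3, ["green", "summer", "travel"], "Sports Company Inc."),
]


def reorder_attributes(itemsDf):
    """Sort each attributes array from last to first in the alphabet.

    Hypothetical helper; itemsDf is assumed to have an array column
    named "attributes".
    """
    # Imported lazily so the plain-Python check below runs without Spark.
    from pyspark.sql.functions import col, sort_array

    # The direction is set by asc=False; calling .desc() on the column
    # would not control the order inside the arrays.
    return itemsDf.withColumn(
        "attributes", sort_array(col("attributes"), asc=False)
    )


# Plain-Python cross-check of the ordering that asc=False should produce.
expected = {item_id: sorted(attrs, reverse=True) for item_id, attrs, _ in rows}
```

Assuming an active SparkSession named spark, the DataFrame could be built with spark.createDataFrame(rows, ["itemId", "attributes", "supplier"]) and passed to reorder_attributes; the resulting arrays should match the values in expected.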
Associate-Developer-Apache-Spark Exam Question 48
Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?
Code block:
json_schema = """
{"type": "struct",
 "fields": [
    {
      "name": "itemId",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "supplier",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
"""
Correct Answer: B
Explanation
Spark provides a way to digest JSON-formatted strings as a schema. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you familiarize yourself with the concept of enforcing schemas on data you are reading in, a topic within the scope of the exam.
The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the operator expects type pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this format, so this answer option must be wrong.
With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how to transform json_schema into a pyspark.sql.types.StructType. This also eliminates the option where schema=json_schema.
The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame and not a pyspark.sql.types.StructType.
Ruling out the option that includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL format". This use case is slightly different from the case at hand: json_schema already is a schema definition, so it does not make sense to "infer" a schema from it. The documentation includes an example use case that helps you understand the difference better. There, you pass the string '{a: 1}' to schema_of_json() and the method infers the DDL-format schema STRUCT<a: BIGINT> from it. In our case, the output of schema_of_json() would describe the schema of the JSON schema document instead of the schema it defines. This is not the right answer option.
That leaves the StructType.fromJson() method. It returns an object of type StructType, exactly the type that the schema parameter of spark.read.json expects. Note that StructType.fromJson() takes the schema as a Python dictionary, so the JSON string has to be parsed first, for example with json.loads().
Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.
More info:
- pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation
- pyspark.sql.functions.schema_of_json - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
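The route through StructType.fromJson() described above might look like the sketch below. It is a minimal illustration under stated assumptions rather than the exact text of the correct answer option: the helper name read_with_schema and its spark parameter are hypothetical, and the PySpark import sits inside the function so that the JSON-parsing step can be run without Spark installed. StructType.fromJson() works on a dictionary, which is why the string is first passed through json.loads().

```python
import json

# Schema definition from the question, expressed as a JSON string.
json_schema = """
{"type": "struct",
 "fields": [
   {"name": "itemId", "type": "integer", "nullable": true, "metadata": {}},
   {"name": "supplier", "type": "string", "nullable": true, "metadata": {}}
 ]
}
"""

# StructType.fromJson() takes a dict, so parse the JSON string first.
schema_dict = json.loads(json_schema)


def read_with_schema(spark, filePath):
    """Read the JSON file at filePath, enforcing the schema in json_schema.

    Hypothetical helper; spark is assumed to be an active SparkSession.
    """
    # Imported lazily so the parsing step above runs without Spark installed.
    from pyspark.sql.types import StructType

    # Build the StructType that the schema parameter expects.
    schema = StructType.fromJson(schema_dict)
    return spark.read.json(filePath, schema=schema)
```

Passing json_schema directly as schema would not work here, because a plain string is interpreted as a DDL-formatted schema such as "itemId INT, supplier STRING".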
Associate-Developer-Apache-Spark Exam Question 49
The code block displayed below contains an error. The code block should return all rows of DataFrame transactionsDf, but including only columns storeId and predError. Find the error.
Code block:
spark.collect(transactionsDf.select("storeId", "predError"))
Correct Answer: E
Explanation
Correct code block:
transactionsDf.select("storeId", "predError").collect()
collect() is a method of the DataFrame object, not of the SparkSession, so it has to be called on the DataFrame returned by select().
More info: pyspark.sql.DataFrame.collect - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 50
Which of the following describes Spark's standalone deployment mode?
Correct Answer: D
Explanation
Standalone mode uses only a single executor per worker per application.
This is correct and a limitation of Spark's standalone mode in its default configuration.
Standalone mode is a viable solution for clusters that run multiple frameworks.
Incorrect. A limitation of standalone mode is that Apache Spark must be the only framework running on the cluster. If you wanted to run multiple frameworks on the same cluster in parallel, for example Apache Spark and Apache Flink, you would consider the YARN deployment mode.
Standalone mode uses a single JVM to run Spark driver and executor processes.
No, this is what local mode does.
Standalone mode is how Spark runs on YARN and Mesos clusters.
No. YARN and Mesos modes are two deployment modes that are different from standalone mode. These modes allow Spark to run alongside other frameworks on a cluster. When Spark is run in standalone mode, only the Spark framework can run on the cluster.
Standalone mode means that the cluster does not contain the driver.
Incorrect. The cluster does not contain the driver in client mode, but in standalone mode the driver can run on a node in the cluster.
More info: Learning Spark, 2nd Edition, Chapter 1