The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+

Code block:

transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")
Correct Answer: E
Explanation

This question requires a lot of thinking to get right. To solve it, you may take advantage of the digital notepad that is provided to you during the test. The code block includes multiple errors. In the test, you are usually confronted with a code block that contains only a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.

Column transactionDate should be dropped only after transactionTimestamp has been written, because Spark needs to read the values from column transactionDate to generate column transactionTimestamp.

Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. To convert those correctly, you would have to pass yyyy-MM-dd HH:mm. In other words, the string indicating the date format should be adjusted.

While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to use here.

Also, there is no DataFrame.withColumnReplaced() operator. A similar operator that exists is DataFrame.withColumnRenamed().

Whether you use col() or not is irrelevant with unix_timestamp(): the command accepts both a column name and a Column object.

Finally, you cannot assign a column like transactionsDf["columnName"] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported by Spark DataFrames. You need to use Spark's DataFrame.withColumn() syntax instead.

More info: pyspark.sql.functions.unix_timestamp - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 57
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error. Code block: transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
Correct Answer: C
Explanation

The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.

The code block uses the wrong command for caching.
Wrong. In this case, DataFrame.persist() needs to be used, since this operator supports passing a storage level. DataFrame.cache() does not support passing a storage level.

Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important component of Spark, since it can accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.

Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is accessed through either DataFrame.cache() or DataFrame.persist().

The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, we find keywords such as "cache" and "executor memory" that point us away from using external data stores. We aim to save data to memory to accelerate the reading process, since reading from disk is comparatively slower. The DataFrameWriter does not write to memory, so we cannot use it here.

More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
Associate-Developer-Apache-Spark Exam Question 58
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?

Sample of itemsDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Correct Answer: D
Explanation

The challenge in this question comes from there being an array column in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is returned by spark.read.

The correct way to define an array of strings in a schema is through ArrayType(StringType()). A schema can be passed to the DataFrameReader by chaining schema(structType) onto spark.read. Alternatively, you can also define a schema as a string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.

A thing to keep in mind is that in schema definitions, you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.

Another concern with schemas is whether columns should be nullable, that is, allowed to contain null values. In the case at hand, this is not a concern, since the question just asks for a "valid" schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.

More info: Learning Spark, 2nd Edition, Chapter 3
Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 59
The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error. Code block: transactionsDf.filter(col('predError').in([3, 6])).count()
Correct Answer: C
Explanation

Correct code block:

transactionsDf.filter(col('predError').isin([3, 6])).count()

The isin method is the correct one to use here. The in method does not exist for the Column object, and in is a reserved keyword in Python, so col('predError').in(...) is a syntax error.

More info: pyspark.sql.Column.isin - PySpark 3.1.2 documentation
Associate-Developer-Apache-Spark Exam Question 60
Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?
Correct Answer: B
Explanation

itemsDf.withColumn("itemId", col("itemId").cast("string"))
Correct. You can convert the data type of a column using the cast method of the Column class. Also note that you have to use the withColumn method on itemsDf to replace the existing itemId column with the new version that contains strings.

itemsDf.withColumn("itemId", col("itemId").convert("string"))
Incorrect. The Column object that col("itemId") returns does not have a convert method.

itemsDf.withColumn("itemId", convert("itemId", "string"))
Wrong. The pyspark.sql.functions module does not have a convert function. The question is trying to mislead you by using the word "converted". Type conversion is also called "type casting". This may help you remember to look for a cast method instead of a convert method (see correct answer).

itemsDf.select(astype("itemId", "string"))
False. While astype is a method of Column (and an alias of Column.cast), it is not a function of pyspark.sql.functions (which is what the code block implies). In addition, the question asks to return a full DataFrame that matches the multi-column DataFrame itemsDf. Selecting just one column from itemsDf, as in the code block, would return only a single-column DataFrame.

spark.cast(itemsDf, "itemId", "string")
No, the Spark session (called by spark) does not have a cast method. You can find a list of all methods available for the Spark session linked in the documentation below.

More info:
- pyspark.sql.Column.cast - PySpark 3.1.2 documentation
- pyspark.sql.Column.astype - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3