Which of the following code blocks returns a single-column DataFrame that lists all items in column attributes of DataFrame itemsDf that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Correct Answer: D
Explanation
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+
To solve this question, you need to know about explode(). This function splits an array column into one row per array element. If you have not had a chance to familiarize yourself with it yet, find more examples in the documentation (link below). Note that explode() is a function made available through pyspark.sql.functions. It is not available as a method of a DataFrame or a Column, as written in some of the answer options.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 42
Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
Correct Answer: D
Explanation
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
Correct. The DataFrame.withColumn() operator adds a new column to a DataFrame. It takes two arguments: the name of the new column (here: predErrorSqrt) and a Column expression that defines the new column. The question asks for the square root. sqrt() is a function in pyspark.sql.functions that calculates the square root. It takes a Column or a column name as input, so you can refer to the column through col("predError"), through transactionsDf.predError, or just by its name as a string, "predError". Here, the predError column of DataFrame transactionsDf is expressed through col("predError").
transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
Incorrect. In this expression, sqrt(predError) does not work. You cannot refer to predError in this way: to Python it looks as if you are trying to refer to a non-existent variable named predError. You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead.
transactionsDf.select(sqrt(predError))
Wrong. The explanation just above this one about how to refer to predError applies here as well.
transactionsDf.select(sqrt("predError"))
No. While this is correct syntax, it returns a single-column DataFrame that contains only the square root of column predError. However, the question asks for a column to be added to the original DataFrame transactionsDf.
transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
No. The issue with this statement is that column col("predError") has no sqrt() method. sqrt() is a member of pyspark.sql.functions, but not of pyspark.sql.Column.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 43
The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error. Code block: transactionsDf.orderBy('value', asc_nulls_first(col('predError')))
Correct Answer: C
Explanation
Correct code block: transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse of the default sort is indeed desc_nulls_last.
Instead of orderBy, sort should be used.
No. In the PySpark DataFrame API, DataFrame.sort() and DataFrame.orderBy() are aliases of each other, so swapping one for the other does not change the result and does not fix the sort direction of column predError. (The operator that only sorts within each partition is DataFrame.sortWithinPartitions().)
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() accepts both strings and Column objects.
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls would have to come last when inverted.
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No, this would just sort the DataFrame by the column in the last orderBy call and would not take information from both columns into account, as required by the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 44
Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?
Correct Answer: E
Explanation
todayTransactionsDf.union(yesterdayTransactionsDf)
Correct. The union command appends the rows of yesterdayTransactionsDf to the rows of todayTransactionsDf by column position, ignoring that both DataFrames have different column names. The resulting DataFrame will have the column names of DataFrame todayTransactionsDf.
todayTransactionsDf.unionByName(yesterdayTransactionsDf)
No. unionByName specifically tries to match columns in the two DataFrames by name and only appends values in columns with identical names across the two DataFrames. In the form presented above, the command is a great fit for combining DataFrames that have exactly the same columns, but in a different order. In this case though, the command will fail because the two DataFrames have different columns.
todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
No. The unionByName command is described in the previous explanation. However, with the allowMissingColumns argument set to True, it is no longer an issue that the two DataFrames have different column names. Any column that does not have a match in the other DataFrame will be filled with null where there is no value. In the case at hand, the resulting DataFrame will have 7 or more columns though, so this command is not the right answer.
union(todayTransactionsDf, yesterdayTransactionsDf)
No, there is no union method in pyspark.sql.functions.
todayTransactionsDf.concat(yesterdayTransactionsDf)
Wrong, the DataFrame class does not have a concat method.
More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.unionByName - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
Associate-Developer-Apache-Spark Exam Question 45
The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: transactionsDf.__1__(itemsDf, __2__).__3__(__4__)
Correct Answer: C
Explanation
This question is pretty complex and, in its complexity, is probably above what you would encounter in the exam. However, by reading the question carefully, you can use logic to weed out the wrong answers.
First, examine the join statement, which is common to all answers. The first argument of the join() operator (documentation linked below) is the DataFrame to be joined with. Where join is in gap 3, the first argument of gap 4 should therefore be another DataFrame. This is not the case for any of the answer options that have join in the third gap, so you can immediately discard two answers.
For all other answers, join is in gap 1, followed by (itemsDf, according to the code block. Given how the join() operator is called, there are now three remaining candidates.
Looking further at the join() statement, the second argument (on=) expects "a string for the join column name, a list of column names, a join expression (Column), or a list of Columns", according to the documentation. As one answer option includes a list of join expressions (transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId), which is unsupported according to the documentation, we can discard that answer, leaving us with two remaining candidates.
Both candidates have valid syntax, but only one of them fulfills the condition in the question: "only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf". So, this one remaining answer option has to be the correct one!
As you can see, even complex questions that seem overwhelming at first can be figured out by rigorously applying the knowledge you can gain from the documentation during the exam.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3