Which of the following describes characteristics of the Dataset API?
Correct Answer: D
Explanation

The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API is statically typed and is typically used for object-oriented programming. It is available when Spark is used with Scala (or Java), but not with Python. In Python, you use the DataFrame API, which is built on top of the Dataset API.

The Dataset API does not provide compile-time type safety.
No. In fact, depending on the use case, the compile-time type safety that the Dataset API provides is an advantage.

The Dataset API does not support unstructured data.
Wrong, the Dataset API supports structured and unstructured data.

In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable since the Dataset API is not available in Python.

In Python, the Dataset API mainly resembles Pandas' DataFrame API.
No. The Dataset API does not exist in Python, only in Scala and Java.
Associate-Developer-Apache-Spark Exam Question 57
Which of the following describes Spark's way of managing memory?
Correct Answer: B
Explanation

Spark's memory usage can be divided into three categories: execution, transaction, and storage.
No, there are only two categories: execution and storage.

As a general rule for garbage collection, Spark performs better on many small objects than few big objects.
No, Spark's garbage collection runs faster on fewer big objects than on many small objects.

Disabling serialization potentially greatly reduces the memory footprint of a Spark application.
The opposite is true: serialization reduces the memory footprint, but may have a negative impact on performance.

Spark uses a subset of the reserved system memory.
No, the reserved system memory is separate from Spark memory. Reserved memory stores Spark's internal objects.

More info: Tuning - Spark 3.1.2 Documentation, Spark Memory Management | Distributed Systems Architecture, Learning Spark, 2nd Edition, Chapter 7
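To make the execution/storage split more concrete, here is a rough arithmetic sketch of how a JVM heap might be divided under Spark's unified memory model. The 300 MB reserved figure and the defaults of 0.6 for spark.memory.fraction and 0.5 for spark.memory.storageFraction follow the Spark 3.x tuning and configuration documentation; the function name and the 4096 MB heap are hypothetical and only serve as an illustration.

```python
RESERVED_MB = 300  # reserved memory, kept separate from Spark memory


def memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate region sizes in MB for a given heap (illustrative only)."""
    usable = heap_mb - RESERVED_MB               # heap left after the reserved part
    spark_memory = usable * memory_fraction      # shared by execution and storage
    storage = spark_memory * storage_fraction    # cached data, protected from eviction
    execution = spark_memory - storage           # shuffles, joins, sorts, aggregations
    user = usable - spark_memory                 # user data structures and metadata
    return {
        "reserved": float(RESERVED_MB),
        "user": user,
        "storage": storage,
        "execution": execution,
    }


# Hypothetical 4 GB executor heap
for name, size_mb in memory_regions(4096).items():
    print(f"{name:>10}: {size_mb:8.1f} MB")
```

These numbers are only starting sizes: the boundary between execution and storage is soft, so each side can borrow unused space from the other.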
Associate-Developer-Apache-Spark Exam Question 58
The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively. Find the error.

Code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Instead of calling spark.createDataFrame, just DataFrame should be called.
Correct Answer: D
Explanation

Correct code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])

The column names must be passed as a list. When the schema argument is a plain string such as "color", Spark tries to interpret it as a data type string rather than as a column name. The createDataFrame syntax is not exactly straightforward, but the documentation (linked below) provides several examples of how to use it, including one very similar to the code block presented here.

More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
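As a minimal sketch of the two schema forms that createDataFrame accepts for this data, the functions below pass either a list of column names or a data type string in the "name: type" style used in the PySpark documentation examples. The function names and the demo() driver are hypothetical; `spark` is assumed to be an existing SparkSession.

```python
ROWS = [("red",), ("blue",), ("green",)]  # one single-element tuple per row


def colors_from_name_list(spark):
    """Schema given as a list of column names; Spark infers the column type."""
    return spark.createDataFrame(ROWS, ["color"])


def colors_from_type_string(spark):
    """Schema given as a data type string naming the column and its type."""
    return spark.createDataFrame(ROWS, "color: string")


def demo():
    """Hypothetical driver; needs PySpark and a local Java runtime."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.master("local[1]").appName("color-demo").getOrCreate()
    colors_from_name_list(spark).show()
    colors_from_type_string(spark).printSchema()
    spark.stop()
```

Calling demo() in an environment with PySpark installed should show a single column named color for both variants.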
Associate-Developer-Apache-Spark Exam Question 59
Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?
Correct Answer: D
Explanation

transactionsDf.groupBy("productId").count()
Correct. This code block first groups DataFrame transactionsDf by column productId and then counts the rows in each group.

transactionsDf.groupBy("productId").select(count("value"))
Incorrect. You cannot call select on a GroupedData object (the output of a groupBy statement).

transactionsDf.count("productId")
No. DataFrame.count() does not take any arguments.

transactionsDf.count("productId").distinct()
Wrong. Since DataFrame.count() does not take any arguments, this option cannot be right.

transactionsDf.groupBy("productId").agg(col("value").count())
False. A Column object, as returned by col("value"), does not have a count() method. All methods available on a Column object are listed in the Spark documentation linked below.

More info: pyspark.sql.DataFrame.count - PySpark 3.1.2 documentation, pyspark.sql.Column - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
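The correct option can be tried out with a small sketch like the following. The sample rows, the transactionId and value columns, and the function names are hypothetical, since the question does not show the contents of transactionsDf.

```python
def rows_per_product(transactions_df):
    """Return one row per distinct productId with a `count` column."""
    return transactions_df.groupBy("productId").count()


def demo():
    """Hypothetical driver; needs PySpark and a local Java runtime."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.master("local[1]").appName("count-demo").getOrCreate()
    # Made-up sample data: two transactions for product 3, one for product 5
    transactions_df = spark.createDataFrame(
        [(1, 3, 4.0), (2, 3, 7.0), (3, 5, 1.0)],
        ["transactionId", "productId", "value"],
    )
    rows_per_product(transactions_df).orderBy("productId").show()
    spark.stop()
```

The result has exactly two columns, productId and count, which is what the question asks for.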
Associate-Developer-Apache-Spark Exam Question 60
Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?
Correct Answer: B
Explanation

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
Correct. This command uses the SparkSession's createDataFrame method to create a new DataFrame. Notice how rows, column values, and column names are passed in here: The rows are specified as a Python list, and every entry in the list is a new row. Each row is a Python tuple (for example ("summer", 4.5)), and every entry in the tuple is the value for one column. The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be inferred from data" (the first argument). Since the values 4.5 and 7.5 are both Python floats, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column season are strings, Spark will infer the string type for that column.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
No, the SparkSession does not have a newDataFrame method.

from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))
No. In PySpark 3.1, the version referenced here, pyspark.sql.types does not have a CharType type. In addition, the second field is also named season rather than wind_speed_ms. See the link below for available data types in Spark.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, this is not correct Spark syntax. If you considered this option to be correct, you may have some experience with Python's pandas package, in which this would be correct syntax. To create a Spark DataFrame from a pandas DataFrame, you can simply use spark.createDataFrame(pandasDf), where pandasDf is the pandas DataFrame. Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, the SparkSession (indicated by spark in the code above) does not have a DataFrame method.

More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation
Static notebook | Dynamic notebook: See test 1
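To check the column types in practice, a sketch along these lines compares the inferred schema with an explicitly declared one. The explicit variant uses StringType and DoubleType, which do exist in pyspark.sql.types; the function names and the demo() driver are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
ROWS = [("summer", 4.5), ("winter", 7.5)]  # one tuple per row: (season, wind speed)


def weather_inferred(spark):
    """Column names only; Spark infers string and double from the data."""
    return spark.createDataFrame(ROWS, ["season", "wind_speed_ms"])


def weather_explicit(spark):
    """Same data with the column types declared explicitly."""
    from pyspark.sql import types as T  # imported lazily on purpose

    schema = T.StructType([
        T.StructField("season", T.StringType()),
        T.StructField("wind_speed_ms", T.DoubleType()),
    ])
    return spark.createDataFrame(ROWS, schema)


def demo():
    """Hypothetical driver; needs PySpark and a local Java runtime."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("weather-demo").getOrCreate()
    print(weather_inferred(spark).dtypes)
    print(weather_explicit(spark).dtypes)
    spark.stop()
```

Both variants should report season as string and wind_speed_ms as double. The explicit schema is mainly useful when the inferred types would not be the ones you want.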