The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join. Find the error. Code block:
Correct Answer: B
Explanation
This question is hard. Let's assess the answers one by one.

Spark will only broadcast DataFrames that are much smaller than the default value.
This is correct. The default value is 10 MB (10485760 bytes). Because spark.sql.autoBroadcastJoinThreshold expects a number in bytes (not megabytes), the code block sets the limit to merely 20 bytes instead of the requested 20 * 1024 * 1024 (= 20971520) bytes.

The command is evaluated lazily and needs to be followed by an action.
No, this command is evaluated right away!

Spark will only apply the limit to threshold joins and not to other joins.
There are no "threshold joins", so this option does not make any sense.

The correct option to write configurations is through spark.config and not spark.conf.
No, it is indeed spark.conf!

The passed limit has the wrong variable type.
The configuration expects the number of bytes, a number, as input. So the type of the 20 provided in the code block is fine; only its magnitude is wrong.
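The byte arithmetic from the explanation above can be checked with a few lines of plain Python. This is only a sketch: the helper name megabytes_to_bytes is made up for illustration, and the final spark.conf.set call is left as a comment because it assumes an already running SparkSession named spark.

```python
# Convert a size in megabytes to the byte count that
# spark.sql.autoBroadcastJoinThreshold expects.
BYTES_PER_MB = 1024 * 1024


def megabytes_to_bytes(megabytes: int) -> int:
    """Hypothetical helper: return the number of bytes in `megabytes` MB."""
    return megabytes * BYTES_PER_MB


default_threshold = megabytes_to_bytes(10)    # 10485760, the default value
requested_threshold = megabytes_to_bytes(20)  # 20971520, the value the question asks for

print(default_threshold, requested_threshold)

# Assumption: `spark` is an existing SparkSession (not created in this sketch).
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", requested_threshold)
```

Passing the computed byte count instead of the literal 20 is what fixes the code block described in the question.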
Associate-Developer-Apache-Spark Exam Question 2
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?
Correct Answer: B
Explanation
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
Associate-Developer-Apache-Spark Exam Question 3
Which of the following statements about Spark's execution hierarchy is correct?
Correct Answer: A
Explanation
In Spark's execution hierarchy, a job may reach over multiple stage boundaries.
Correct. A job is a sequence of stages, and thus may reach over multiple stage boundaries.

In Spark's execution hierarchy, tasks are one layer above slots.
Incorrect. Slots are not a part of the execution hierarchy. Tasks are the lowest layer.

In Spark's execution hierarchy, a stage comprises multiple jobs.
No. It is the other way around - a job consists of one or multiple stages.

In Spark's execution hierarchy, executors are the smallest unit.
False. Executors are not a part of the execution hierarchy. Tasks are the smallest unit!

In Spark's execution hierarchy, manifests are one layer above jobs.
Wrong. Manifests are not a part of the Spark ecosystem.
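The containment relationship described above (a job consists of one or more stages, and tasks are the smallest unit) can be pictured with a toy model. The classes below are purely illustrative and are not Spark's internal classes; all names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Smallest unit in the hierarchy (toy stand-in, not a Spark class)."""
    partition_id: int


@dataclass
class Stage:
    """A stage groups tasks."""
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    """A job is a sequence of one or more stages."""
    stages: List[Stage] = field(default_factory=list)

    def task_count(self) -> int:
        # Sum the tasks of every stage that belongs to this job.
        return sum(len(stage.tasks) for stage in self.stages)


# Hypothetical job made up of two stages
job = Job(stages=[
    Stage(tasks=[Task(0), Task(1), Task(2)]),
    Stage(tasks=[Task(0), Task(1)]),
])

print(len(job.stages), job.task_count())
```

Note that neither slots nor executors appear in this model, which matches the explanation: they are not layers of the execution hierarchy.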
Associate-Developer-Apache-Spark Exam Question 4
Which of the following describes characteristics of the Spark driver?
Correct Answer: D
Explanation
The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
No, the Spark driver transforms operations into DAG computations itself.

If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
No. There is always a single driver per application, but one or more executors.

The Spark driver processes partitions in an optimized, distributed fashion.
No, this is what executors do.

In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
Wrong. In a non-interactive Spark application, you need to create the SparkSession object yourself. In an interactive Spark shell, the Spark driver instantiates the object for you.
Associate-Developer-Apache-Spark Exam Question 5
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))
Correct Answer: D
Explanation
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))

Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+

This question tries to make you think about the string functions Spark provides and the order in which they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review them before the exam.

The size operator and the length operator are easily confused: size works on arrays, while length works on strings. Luckily, this is something you can read up on in the documentation.

The code block works by first converting all uppercase letters in column itemName to lowercase (the lower() part). Then it replaces all vowels and whitespace characters with "nothing", an empty string "" (the regexp_replace() part). Now only lowercase consonants remain in each value. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any digits or other non-letter characters, so we do not need to make any provisions for these. Finally, the alias() operator renames the resulting column to consonant_ct.

More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
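To sanity-check the returned counts without a Spark cluster, the same three steps (lowercase, strip vowels and whitespace, count what is left) can be mirrored in plain Python. This is a sketch that uses Python's re module in place of Spark's regexp_replace; for this simple pattern the two should behave the same, but that equivalence is an assumption. The function name consonant_count is made up for illustration.

```python
import re

# Values of column itemName, copied from DataFrame itemsDf in the question
item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]


def consonant_count(name: str) -> int:
    """Hypothetical helper mirroring length(regexp_replace(lower(...), ...))."""
    lowered = name.lower()                            # lower()
    stripped = re.sub(r"a|e|i|o|u|\s", "", lowered)   # regexp_replace()
    return len(stripped)                              # length()


counts = [consonant_count(name) for name in item_names]
print(counts)  # [19, 16, 10]
```

The result matches the consonant_ct column of the returned DataFrame shown in the explanation.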