persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
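A minimal sketch of the difference, assuming an existing SparkContext named sc and a placeholder input file:

import org.apache.spark.storage.StorageLevel

// cache() always uses the default storage level (MEMORY_ONLY for RDDs)
val cached = sc.textFile("data.txt").cache()

// persist() lets you choose the storage level explicitly
val persisted = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)

// persist() with no argument is equivalent to cache()
val alsoCached = sc.textFile("data.txt").persist()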
Posted Date:- 2021-10-21 21:56:15
No. Apache Spark works well only for simple machine learning algorithms such as clustering, regression, and classification.
Posted Date:- 2021-10-21 21:55:36
GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are contained in the org.apache.spark.graphx.lib package and can be accessed directly as methods on Graph via GraphOps.
PageRank: PageRank is a graph parallel computation that measures the importance of each vertex in a graph. Example: You can run PageRank to evaluate what the most important pages in Wikipedia are.
Connected Components: The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters.
Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object that determines the number of triangles passing through each vertex, providing a measure of clustering.
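As a brief illustration of the first of these algorithms, here is a minimal PageRank sketch, assuming an existing SparkContext sc and a placeholder edge-list file (one "srcId dstId" pair per line):

import org.apache.spark.graphx.GraphLoader

// Load a graph from an edge-list file (the path is a placeholder)
val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")

// Run PageRank until convergence with the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the five most "important" vertices
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)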
Posted Date:- 2021-10-21 21:55:01
In such Spark interview questions, try giving an explanation too (not just the names of the operators).
Property Operator: Property operators modify the vertex or edge properties using a user-defined map function and produce a new graph.
Structural Operator: Structure operators operate on the structure of an input graph and produce a new graph.
Join Operator: Join operators add data to graphs and generate new graphs.
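A minimal sketch showing one property operator and one structural operator on a tiny graph, assuming an existing SparkContext sc:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)

// Property operator: mapVertices transforms vertex attributes and returns a new graph
val upperCased = graph.mapVertices((_, name) => name.toUpperCase)

// Structural operator: subgraph keeps only the vertices (and their edges) that pass the predicate
val sub = upperCased.subgraph(vpred = (id, _) => id <= 2L)
println(sub.vertices.count())   // 2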
Posted Date:- 2021-10-21 21:54:21
Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax still takes time. Spark has interactive APIs for different languages like Java, Python, and Scala, and also includes Spark SQL (which grew out of Shark) for SQL users, making it comparatively easier to use than Hadoop.
Posted Date:- 2021-10-21 21:50:30
DISK_ONLY - Stores the RDD partitions only on the disk
MEMORY_ONLY_SER - Stores the RDD as serialized Java objects (one byte array per partition)
MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won’t be cached
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk
MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk
Posted Date:- 2021-10-21 21:49:53
This is one of the most frequently asked Spark interview questions, where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.
Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and resume from wherever it stopped (a minimal sketch follows the list below).
There are two types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the generated RDDs to reliable storage because some stateful transformations require it: the RDDs of an upcoming batch depend on the RDDs of previous batches.
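A minimal Spark Streaming checkpointing sketch (the checkpoint directory, host, and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/spark-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointExample").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // enables checkpointing to this directory
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

// On restart, rebuild the context from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()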
Posted Date:- 2021-10-21 21:49:13
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:
>> Sample building engine: determines the stratified samples to be built based on workload history and data distribution.
>> Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
Posted Date:- 2021-10-21 21:48:36
The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by applying new optimization rules, building a faster processing system.
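One way to see Catalyst at work is to print the plans it produces for a query; a minimal sketch with a toy DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CatalystExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// (Catalyst's output) along with the physical plan
df.filter($"id" > 1).select($"value").explain(true)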
Posted Date:- 2021-10-21 21:47:45
1. map(func)
2. transform(func)
3. filter(func)
4. count()
The correct answer is option 3, filter(func).
Posted Date:- 2021-10-21 21:47:06
Spark SQL is a component on top of the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.
Posted Date:- 2021-10-21 21:46:11
RDDs support two types of operations (a short sketch follows the list):
Transformations: Transformations are operations that are performed on an RDD to create a new RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation on an RDD (Example: reduce, first, count)
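A minimal sketch contrasting the two, assuming an existing SparkContext sc:

val numbers = sc.parallelize(1 to 10)

// Transformations: return a new RDD, nothing is computed yet (lazy)
val evens   = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: trigger the computation and return a value to the driver
val total   = doubled.reduce(_ + _)   // 60
val howMany = doubled.count()         // 5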
Posted Date:- 2021-10-21 21:45:42
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing until asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
Posted Date:- 2021-10-21 21:45:10
Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.
Posted Date:- 2021-10-21 21:44:42
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Posted Date:- 2021-10-21 21:44:13
Spark provides a powerful API called GraphX that extends Spark RDD for supporting graphs and graph-based computations. The extended property of Spark RDD is called the Resilient Distributed Property Graph, which is a directed multi-graph that can have multiple parallel edges. Each edge and vertex has associated user-defined properties. The presence of parallel edges indicates multiple relationships between the same set of vertices. GraphX has a set of operators such as subgraph, mapReduceTriplets, joinVertices, etc. that support graph computation. It also includes a large collection of graph builders and algorithms for simplifying tasks related to graph analytics.
Posted Date:- 2021-10-21 21:43:12
Apache Spark provides the pipe() method on RDDs, which makes it possible to compose different parts of a job in any language that can read and write the UNIX standard streams. Using the pipe() method, an RDD transformation can be written that reads each element of the RDD as a String, passes it to an external process, and returns the results as Strings.
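A minimal sketch, assuming an existing SparkContext sc and that the Unix tr command is available on every worker node:

val words = sc.parallelize(Seq("spark", "pipe", "example"))

// Each element is written to the external process's stdin as one line;
// each line of the process's stdout becomes an element of the result RDD
val upper = words.pipe("tr 'a-z' 'A-Z'")

upper.collect().foreach(println)   // SPARK, PIPE, EXAMPLE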
Posted Date:- 2021-10-21 21:42:42
The clean-up tasks can be triggered automatically either by setting the spark.cleaner.ttl parameter or by dividing long-running jobs into batches and writing the intermediary results to disk.
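A minimal sketch of the first option (spark.cleaner.ttl applies to older Spark releases; the value is in seconds):

import org.apache.spark.SparkConf

// Periodically forget metadata and persisted data older than one hour
val conf = new SparkConf()
  .setAppName("CleanupExample")
  .set("spark.cleaner.ttl", "3600")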
Posted Date:- 2021-10-21 21:42:10
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
Posted Date:- 2021-10-21 21:41:34
Broadcast variables are read-only variables cached in memory on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
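A minimal sketch, assuming an existing SparkContext sc and a small made-up lookup table:

val countryNames = Map("IN" -> "India", "US" -> "United States")

// Ship the lookup table to every executor once, instead of once per task
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val names = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
names.collect().foreach(println)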
Posted Date:- 2021-10-21 21:41:05
There are 2 ways to convert a Spark RDD into a DataFrame:
* Using the helper function - toDF
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()
* Using SparkSession.createDataFrame
You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
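For comparison, a self-contained sketch of the second approach with a made-up two-column schema:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("RddToDataFrame").master("local[*]").getOrCreate()

val rowRDD = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val df = spark.createDataFrame(rowRDD, schema)
df.show()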
Posted Date:- 2021-10-21 21:40:07
In networking, a sliding window controls the transmission of data packets between hosts. The Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
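A minimal word-count sketch over a sliding window (the socket host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()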
Posted Date:- 2021-10-21 21:39:34
There are two types of transformations on DStreams:
1) Stateless transformation : In stateless transformation, the processing of each batch does not depend on the data of its previous batches. Each stateless transformation applies separately to each RDD.
Examples: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey().
2) Stateful transformation: Stateful transformations use data or intermediate results from previous batches to compute the result of the current batch; they allow combining data across time.
Examples: updateStateByKey() and mapWithState() (see the sketch below).
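A minimal stateful sketch, assuming an existing StreamingContext ssc with checkpointing enabled and a DStream of (word, 1) pairs named pairs:

// Running count per key, carried across batches
val updateCount = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateCount)
runningCounts.print()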
Posted Date:- 2021-10-21 21:39:09
Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, the RDD. RDDs achieve fault tolerance through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition. This removes the need for replication to achieve fault tolerance.
Posted Date:- 2021-10-21 21:38:36
Caching is an optimization technique for iterative and interactive Spark computations. It helps save interim partial results so they can be reused in subsequent stages, which speeds up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it.
There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() will cache the RDD into memory, whereas persist(level) can cache it in memory, on disk, or in off-heap memory according to the caching strategy specified by level. persist() without an argument is equivalent to cache().
Posted Date:- 2021-10-21 21:38:09
The VectorAssembler is a tool that is used in nearly every single pipeline API you generate. It helps concatenate all your features into one big vector you can then pass into an estimator. It’s used typically in the last step of a machine learning pipeline and takes as input a number of columns of Boolean, Double, or Vector. This is particularly helpful if you’re going to perform a number of manipulations using a variety of transformers and need to gather all of those results together.
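A minimal sketch with two made-up numeric columns:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("VectorAssemblerExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1.70, 65.0, 1), (1.82, 80.0, 0)).toDF("height", "weight", "label")

// Concatenate the feature columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("height", "weight"))
  .setOutputCol("features")

assembler.transform(df).select("features", "label").show()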
Posted Date:- 2021-10-21 21:37:48
DStreams, or discretized streams, are the high-level abstraction provided by Spark Streaming that represents a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume, or Kinesis, or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
Posted Date:- 2021-10-21 21:37:13
In networking, a sliding window controls the transmission of data packets between hosts. The Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
Posted Date:- 2021-10-21 21:36:47
Machine learning is carried out in Spark with the help of MLlib. It’s a scalable machine learning library provided by Spark.
Posted Date:- 2021-10-21 21:36:20
Spark can be integrated with the following languages:
Python, using the Spark Python API
R, using the R on Spark API
Java, using the Spark Java API
Scala, using the Spark Scala API
Posted Date:- 2021-10-21 21:36:00
When SparkContext connects to Cluster Manager, it acquires an executor on the nodes in the cluster. Executors are Spark processes that run computations and store data on worker nodes. The final tasks by SparkContext are transferred to executors for their execution.
Posted Date:- 2021-10-21 21:35:29
* Due to the availability of in-memory processing, Spark performs data processing 10–100x faster than Hadoop MapReduce, which relies on persistent storage for its data processing tasks.
* Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks using batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, however, only supports batch processing.
* Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
* Spark is capable of performing computations multiple times on the same dataset, which is called iterative computation, whereas Hadoop does not implement iterative computation.
Posted Date:- 2021-10-21 21:34:48
MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
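A minimal clustering sketch with MLlib's DataFrame-based API (the toy data is made up):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibExample").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq((0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)).toDF("x", "y")
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(data)

// Fit a 2-cluster KMeans model and print the learned cluster centers
val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)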
Posted Date:- 2021-10-21 21:34:06
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.
Posted Date:- 2021-10-21 21:33:47
In in-memory computing, we retain data in random access memory (RAM) instead of slower disk drives.
Posted Date:- 2021-10-21 21:33:12
A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. Due to its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query.
An RDD (Resilient Distributed Dataset) is more of a black box of data that cannot be optimized as well, because the operations that can be performed against it are not as constrained.
However, one can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
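A minimal sketch of moving in both directions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DfRddExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// DataFrame -> RDD[Row]
val rowRdd = df.rdd

// RDD of tuples -> DataFrame, supplying column names to toDF
val df2 = spark.sparkContext.parallelize(Seq(("Carol", 28))).toDF("name", "age")
df2.show()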
Posted Date:- 2021-10-21 21:32:52
The Spark UI is available on port 4040 of the driver node. If you are running in local mode, this will be http://localhost:4040. The Spark UI displays information on the state of your Spark jobs, their environment, and the cluster. It’s very useful, especially for tuning and debugging.
Posted Date:- 2021-10-21 21:32:33
It provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
Posted Date:- 2021-10-21 21:32:15
>> Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
>> Spark Streaming - This library is used to process real-time streaming data.
>> Spark GraphX - Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
>> Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
Posted Date:- 2021-10-21 21:31:56
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
Posted Date:- 2021-10-21 21:31:17
RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.
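A quick way to inspect an RDD's lineage, assuming an existing SparkContext sc:

val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 50)

// Prints the chain of parent RDDs this RDD was derived from (its lineage),
// which Spark uses to recompute lost partitions
println(rdd.toDebugString)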
Posted Date:- 2021-10-21 21:31:02
These are read-only variables cached in memory on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
Posted Date:- 2021-10-21 21:30:41
A sparse vector has two parallel arrays; one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
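A minimal sketch using the ml.linalg API:

import org.apache.spark.ml.linalg.Vectors

// A vector of size 5 with non-zero values only at indices 0 and 3
val sparse = Vectors.sparse(5, Array(0, 3), Array(1.5, 2.0))

println(sparse)                                   // (5,[0,3],[1.5,2.0])
println(sparse.toArray.mkString("[", ",", "]"))   // [1.5,0.0,0.0,2.0,0.0]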
Posted Date:- 2021-10-21 21:30:10
The following are some of the demerits of using Apache Spark:
1. Since Spark utilizes more storage space compared to Hadoop and MapReduce, certain problems may arise.
2. Developers need to be careful while running their applications in Spark.
3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
5. Spark consumes a huge amount of memory when compared to Hadoop.
Posted Date:- 2021-10-21 21:29:51
Broadcast variables let the developers maintain read-only variables cached on each machine instead of shipping a copy of them with tasks. They are used to give every node a copy of a large input dataset efficiently. These variables are broadcast to the nodes using efficient broadcast algorithms to reduce the cost of communication.
Posted Date:- 2021-10-21 21:29:12
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
The worker node is basically the slave node. The master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.
Posted Date:- 2021-10-21 21:28:33
Apache Spark stores data in-memory for faster processing and for building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse all the nodes and edges to generate a graph. These low-latency workloads that need multiple iterations benefit greatly from in-memory processing, which leads to increased performance.
Posted Date:- 2021-10-21 21:28:18
When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Posted Date:- 2021-10-21 21:28:01
Spark SQL is capable of the following (a short sketch follows the list):
>> Loading data from a variety of structured sources.
>> Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau.
>> Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
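A minimal sketch of querying a DataFrame with SQL from inside a Spark program:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SparkSqlExample").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Standard SQL against the registered view
spark.sql("SELECT name FROM people WHERE age >= 30").show()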
Posted Date:- 2021-10-21 21:27:41
Data transfers correspond to the process of shuffling. Minimizing these transfers results in faster and more reliable Spark applications. There are various ways in which they can be minimized:
>> Usage of broadcast variables: broadcast variables increase the efficiency of joins between large and small RDDs.
>> Usage of accumulators: these help update variable values in parallel during execution (see the sketch after this list).
>> Another common way is to avoid the operations that trigger these reshuffles.
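A minimal accumulator sketch, assuming an existing SparkContext sc:

// Count malformed records on the executors; read the total back on the driver
val badRecords = sc.longAccumulator("badRecords")

sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)
}

println(badRecords.value)   // 1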
Posted Date:- 2021-10-21 21:27:00