persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
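A minimal sketch of the difference, assuming an existing SparkContext named sc and a placeholder input file:

import org.apache.spark.storage.StorageLevel

// cache() always uses the default storage level (MEMORY_ONLY for RDDs)
val cached = sc.textFile("data.txt").cache()

// persist() lets you choose the storage level explicitly
val persisted = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)

// persist() with no argument is equivalent to cache()
val alsoCached = sc.textFile("data.txt").persist()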
Posted Date:- 2021-10-21 21:56:15
No. Apache Spark works well only for simple machine learning algorithms such as clustering, regression, and classification.
Posted Date:- 2021-10-21 21:55:36
GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are contained in the org.apache.spark.graphx.lib package and can be accessed directly as methods on Graph via GraphOps.
PageRank: PageRank is a graph parallel computation that measures the importance of each vertex in a graph. Example: You can run PageRank to evaluate what the most important pages in Wikipedia are.
Connected Components: The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters.
Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object that determines the number of triangles passing through each vertex, providing a measure of clustering.
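As a brief illustration of the first of these algorithms, here is a minimal PageRank sketch, assuming an existing SparkContext sc and a placeholder edge-list file (one "srcId dstId" pair per line):

import org.apache.spark.graphx.GraphLoader

// Load a graph from an edge-list file (the path is a placeholder)
val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")

// Run PageRank until convergence with the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the five most "important" vertices
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)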
Posted Date:- 2021-10-21 21:55:01
In such Spark interview questions, try giving an explanation too (not just the names of the operators).
Property Operator: Property operators modify the vertex or edge properties using a user-defined map function and produce a new graph.
Structural Operator: Structure operators operate on the structure of an input graph and produce a new graph.
Join Operator: Join operators add data to graphs and generate new graphs.
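A minimal sketch showing one property operator and one structural operator on a tiny graph, assuming an existing SparkContext sc:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)

// Property operator: mapVertices transforms vertex attributes and returns a new graph
val upperCased = graph.mapVertices((_, name) => name.toUpperCase)

// Structural operator: subgraph keeps only the vertices (and their edges) that pass the predicate
val sub = upperCased.subgraph(vpred = (id, _) => id <= 2L)
println(sub.vertices.count())   // 2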
Posted Date:- 2021-10-21 21:54:21
Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax still takes time. Spark has interactive APIs for different languages like Java, Python, and Scala, and also includes Spark SQL (which grew out of Shark) for SQL users, making it comparatively easier to use than Hadoop.
Posted Date:- 2021-10-21 21:50:30
DISK_ONLY - Stores the RDD partitions only on the disk
MEMORY_ONLY_SER - Stores the RDD as serialized Java objects (one byte array per partition)
MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won’t be cached
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk
MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk
Posted Date:- 2021-10-21 21:49:53
This is one of the most frequently asked Spark interview questions, where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.
Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and resume from wherever it stopped (a minimal sketch follows the list below).
There are two types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the generated RDDs to reliable storage because some stateful transformations require it: the RDDs of an upcoming batch depend on the RDDs of previous batches.
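A minimal Spark Streaming checkpointing sketch (the checkpoint directory, host, and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/spark-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointExample").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // enables checkpointing to this directory
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

// On restart, rebuild the context from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()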
Posted Date:- 2021-10-21 21:49:13
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:
>> Sample building engine: determines the stratified samples to be built based on workload history and data distribution.
>> Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
Posted Date:- 2021-10-21 21:48:36
The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by applying new optimization rules, building a faster processing system.
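One way to see Catalyst at work is to print the plans it produces for a query; a minimal sketch with a toy DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CatalystExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// (Catalyst's output) along with the physical plan
df.filter($"id" > 1).select($"value").explain(true)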
Posted Date:- 2021-10-21 21:47:45
1. map(func)
2. transform(func)
3. filter(func)
4. count()
The correct answer is option 3, filter(func).
Posted Date:- 2021-10-21 21:47:06
Spark SQL is a component on top of the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.
Posted Date:- 2021-10-21 21:46:11
RDDs support two types of operations (a short sketch follows the list):
Transformations: Transformations are operations that are performed on an RDD to create a new RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation on an RDD (Example: reduce, first, count)
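A minimal sketch contrasting the two, assuming an existing SparkContext sc:

val numbers = sc.parallelize(1 to 10)

// Transformations: return a new RDD, nothing is computed yet (lazy)
val evens   = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: trigger the computation and return a value to the driver
val total   = doubled.reduce(_ + _)   // 60
val howMany = doubled.count()         // 5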
Posted Date:- 2021-10-21 21:45:42
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing until asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
Posted Date:- 2021-10-21 21:45:10
Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.
Posted Date:- 2021-10-21 21:44:42
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Posted Date:- 2021-10-21 21:44:13
Spark provides a powerful API called GraphX that extends Spark RDD for supporting graphs and graph-based computations. The extended property of Spark RDD is called the Resilient Distributed Property Graph, which is a directed multi-graph that can have multiple parallel edges. Each edge and vertex has associated user-defined properties. The presence of parallel edges indicates multiple relationships between the same set of vertices. GraphX has a set of operators such as subgraph, mapReduceTriplets, joinVertices, etc. that support graph computation. It also includes a large collection of graph builders and algorithms for simplifying tasks related to graph analytics.
Posted Date:- 2021-10-21 21:43:12
Apache Spark provides the pipe() method on RDDs, which makes it possible to compose different parts of a job in any language that can read and write the UNIX standard streams. Using the pipe() method, an RDD transformation can be written that reads each element of the RDD as a String, passes it to an external process, and returns the results as Strings.
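A minimal sketch, assuming an existing SparkContext sc and that the Unix tr command is available on every worker node:

val words = sc.parallelize(Seq("spark", "pipe", "example"))

// Each element is written to the external process's stdin as one line;
// each line of the process's stdout becomes an element of the result RDD
val upper = words.pipe("tr 'a-z' 'A-Z'")

upper.collect().foreach(println)   // SPARK, PIPE, EXAMPLE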
Posted Date:- 2021-10-21 21:42:42
The clean-up tasks can be triggered automatically either by setting the spark.cleaner.ttl parameter or by dividing long-running jobs into batches and writing the intermediary results to disk.
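A minimal sketch of the first option (spark.cleaner.ttl applies to older Spark releases; the value is in seconds):

import org.apache.spark.SparkConf

// Periodically forget metadata and persisted data older than one hour
val conf = new SparkConf()
  .setAppName("CleanupExample")
  .set("spark.cleaner.ttl", "3600")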
Posted Date:- 2021-10-21 21:42:10
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
Posted Date:- 2021-10-21 21:41:34
Broadcast variables are read-only variables cached in memory on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
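A minimal sketch, assuming an existing SparkContext sc and a small made-up lookup table:

val countryNames = Map("IN" -> "India", "US" -> "United States")

// Ship the lookup table to every executor once, instead of once per task
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val names = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
names.collect().foreach(println)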
Posted Date:- 2021-10-21 21:41:05
There are 2 ways to convert a Spark RDD into a DataFrame:
* Using the helper function - toDF
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()
* Using SparkSession.createDataFrame
You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
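For comparison, a self-contained sketch of the second approach with a made-up two-column schema:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("RddToDataFrame").master("local[*]").getOrCreate()

val rowRDD = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val df = spark.createDataFrame(rowRDD, schema)
df.show()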
Posted Date:- 2021-10-21 21:40:07
In networking, a sliding window controls the transmission of data packets between hosts. The Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
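A minimal word-count sketch over a sliding window (the socket host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()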
Posted Date:- 2021-10-21 21:39:34
There are two types of transformations on DStreams:
1) Stateless transformation : In stateless transformation, the processing of each batch does not depend on the data of its previous batches. Each stateless transformation applies separately to each RDD.
Examples: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey().
2) Stateful transformation: Stateful transformations use data or intermediate results from previous batches to compute the result of the current batch; they allow combining data across time.
Examples: updateStateByKey() and mapWithState() (see the sketch below).
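A minimal stateful sketch, assuming an existing StreamingContext ssc with checkpointing enabled and a DStream of (word, 1) pairs named pairs:

// Running count per key, carried across batches
val updateCount = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateCount)
runningCounts.print()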
Posted Date:- 2021-10-21 21:39:09
Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, the RDD. RDDs achieve fault tolerance through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition. This removes the need for replication to achieve fault tolerance.
Posted Date:- 2021-10-21 21:38:36
Caching is an optimization technique for iterative and interactive Spark computations. It helps save interim partial results so they can be reused in subsequent stages, which speeds up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it.
There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() will cache the RDD into memory, whereas persist(level) can cache it in memory, on disk, or in off-heap memory according to the caching strategy specified by level. persist() without an argument is equivalent to cache().
Posted Date:- 2021-10-21 21:38:09
The VectorAssembler is a tool that is used in nearly every single pipeline API you generate. It helps concatenate all your features into one big vector you can then pass into an estimator. It’s used typically in the last step of a machine learning pipeline and takes as input a number of columns of Boolean, Double, or Vector. This is particularly helpful if you’re going to perform a number of manipulations using a variety of transformers and need to gather all of those results together.
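A minimal sketch with two made-up numeric columns:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("VectorAssemblerExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1.70, 65.0, 1), (1.82, 80.0, 0)).toDF("height", "weight", "label")

// Concatenate the feature columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("height", "weight"))
  .setOutputCol("features")

assembler.transform(df).select("features", "label").show()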
Posted Date:- 2021-10-21 21:37:48
DStreams, or discretized streams, are the high-level abstraction provided by Spark Streaming that represents a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume, or Kinesis, or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
Posted Date:- 2021-10-21 21:37:13
In networking, a sliding window controls the transmission of data packets between hosts. The Spark Streaming library provides windowed computations in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
Posted Date:- 2021-10-21 21:36:47
Machine learning is carried out in Spark with the help of MLlib. It’s a scalable machine learning library provided by Spark.
Posted Date:- 2021-10-21 21:36:20
Spark can be integrated with the following languages:
Python, using the Spark Python API
R, using the R on Spark API
Java, using the Spark Java API
Scala, using the Spark Scala API
Posted Date:- 2021-10-21 21:36:00
When SparkContext connects to Cluster Manager, it acquires an executor on the nodes in the cluster. Executors are Spark processes that run computations and store data on worker nodes. The final tasks by SparkContext are transferred to executors for their execution.
Posted Date:- 2021-10-21 21:35:29
* Due to the availability of in-memory processing, Spark performs data processing 10–100x faster than Hadoop MapReduce, which relies on persistent storage for its data processing tasks.
* Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks using batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, however, only supports batch processing.
* Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
* Spark is capable of performing computations multiple times on the same dataset, which is called iterative computation, whereas Hadoop does not implement iterative computation.
Posted Date:- 2021-10-21 21:34:48
MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
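A minimal clustering sketch with MLlib's DataFrame-based API (the toy data is made up):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibExample").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq((0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)).toDF("x", "y")
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(data)

// Fit a 2-cluster KMeans model and print the learned cluster centers
val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)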
Posted Date:- 2021-10-21 21:34:06
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.
Posted Date:- 2021-10-21 21:33:47
In in-memory computing, we retain data in random access memory (RAM) instead of slower disk drives.
Posted Date:- 2021-10-21 21:33:12
A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. Due to its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query.
An RDD (Resilient Distributed Dataset) is more of a black box of data that cannot be optimized as well, because the operations that can be performed against it are not as constrained.
However, one can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
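A minimal sketch of moving in both directions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DfRddExample").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// DataFrame -> RDD[Row]
val rowRdd = df.rdd

// RDD of tuples -> DataFrame, supplying column names to toDF
val df2 = spark.sparkContext.parallelize(Seq(("Carol", 28))).toDF("name", "age")
df2.show()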
Posted Date:- 2021-10-21 21:32:52
The Spark UI is available on port 4040 of the driver node. If you are running in local mode, this will be http://localhost:4040. The Spark UI displays information on the state of your Spark jobs, their environment, and the cluster. It’s very useful, especially for tuning and debugging.
Posted Date:- 2021-10-21 21:32:33
It provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
Posted Date:- 2021-10-21 21:32:15
>> Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
>> Spark Streaming - This library is used to process real-time streaming data.
>> Spark GraphX - Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
>> Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
Posted Date:- 2021-10-21 21:31:56
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
Posted Date:- 2021-10-21 21:31:17
RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.
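A quick way to inspect an RDD's lineage, assuming an existing SparkContext sc:

val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 50)

// Prints the chain of parent RDDs this RDD was derived from (its lineage),
// which Spark uses to recompute lost partitions
println(rdd.toDebugString)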
Posted Date:- 2021-10-21 21:31:02
These are read-only variables cached in memory on every machine. When working with Spark, the use of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().
Posted Date:- 2021-10-21 21:30:41
A sparse vector has two parallel arrays; one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
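A minimal sketch using the ml.linalg API:

import org.apache.spark.ml.linalg.Vectors

// A vector of size 5 with non-zero values only at indices 0 and 3
val sparse = Vectors.sparse(5, Array(0, 3), Array(1.5, 2.0))

println(sparse)                                   // (5,[0,3],[1.5,2.0])
println(sparse.toArray.mkString("[", ",", "]"))   // [1.5,0.0,0.0,2.0,0.0]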
Posted Date:- 2021-10-21 21:30:10
The following are some of the demerits of using Apache Spark:
1. Since Spark utilizes more storage space compared to Hadoop and MapReduce, certain problems may arise.
2. Developers need to be careful while running their applications in Spark.
3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
5. Spark consumes a huge amount of memory when compared to Hadoop.
Posted Date:- 2021-10-21 21:29:51
Broadcast variables let the developers maintain read-only variables cached on each machine instead of shipping a copy of them with tasks. They are used to give every node a copy of a large input dataset efficiently. These variables are broadcast to the nodes using efficient broadcast algorithms to reduce the cost of communication.
Posted Date:- 2021-10-21 21:29:12
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
The worker node is basically the slave node. The master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.
Posted Date:- 2021-10-21 21:28:33
Apache Spark stores data in-memory for faster processing and for building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse all the nodes and edges to generate a graph. These low-latency workloads that need multiple iterations benefit greatly from in-memory processing, which leads to increased performance.
Posted Date:- 2021-10-21 21:28:18
When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Posted Date:- 2021-10-21 21:28:01
Spark SQL is capable of the following (a short sketch follows the list):
>> Loading data from a variety of structured sources.
>> Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau.
>> Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
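A minimal sketch of querying a DataFrame with SQL from inside a Spark program:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SparkSqlExample").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Standard SQL against the registered view
spark.sql("SELECT name FROM people WHERE age >= 30").show()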
Posted Date:- 2021-10-21 21:27:41
Data transfers correspond to the process of shuffling. Minimizing these transfers results in faster and more reliable Spark applications. There are various ways in which they can be minimized:
>> Usage of broadcast variables: broadcast variables increase the efficiency of joins between large and small RDDs.
>> Usage of accumulators: these help update variable values in parallel during execution (see the sketch after this list).
>> Another common way is to avoid the operations that trigger these reshuffles.
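A minimal accumulator sketch, assuming an existing SparkContext sc:

// Count malformed records on the executors; read the total back on the driver
val badRecords = sc.longAccumulator("badRecords")

sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)
}

println(badRecords.value)   // 1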
Posted Date:- 2021-10-21 21:27:00