ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format. The columnar layout compresses very well, which greatly reduces storage cost, and a query reads only the columns it needs instead of whole rows. ORC also keeps an index on every block with column statistics (min, max, sum, count), so a query can skip blocks that cannot contain matching rows.
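For example, a table can be stored as ORC at creation time; a minimal sketch, with illustrative table and column names:
-- ORC table with ZLIB compression
CREATE TABLE employees_orc (
  id INT,
  name STRING,
  salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');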
Yes, we can run UNIX shell commands from Hive by prefixing the command with a ‘!’ mark. For example, !pwd at the Hive prompt will display the current directory. We can also execute Hive queries from script files using the source command.
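A minimal sketch at the Hive prompt (the script path is illustrative):
-- Run a UNIX shell command from Hive
!pwd;
-- Execute Hive queries from a script file
source /home/user/queries.hql;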
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose of each option: for example, the JDBC Driver supports the JDBC protocol.
* Command Line Interface (cli)
* Hive Web Interface (hwi)
* HiveServer (hiveserver)
* rcfilecat (a tool for printing the contents of an RC file)
* Jar
* Metastore
With the help of partitioning, a subdirectory is created for each value of the partitioned column, and when you query with a WHERE clause on that column, only the matching subdirectories are scanned instead of the whole table. This gives you faster query execution.
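A minimal sketch with illustrative names; the query below scans only the country=US sub-directory:
CREATE TABLE sales (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (country STRING);

SELECT SUM(amount) FROM sales WHERE country = 'US';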
Partitioning provides granularity in a Hive table. It reduces query latency by scanning only the relevant partitioned data instead of the whole dataset.
Partitioning prunes data during a query, which minimizes query time. Partitions are created when data is inserted into the table. A static partition insert writes rows into a partition named explicitly in the statement, whereas a dynamic partition insert lets Hive distribute rows across partitions based on the value of a particular column. In strict mode, at least one static partition column is required before any dynamic ones. If you are partitioning large datasets as part of an ETL flow, dynamic partitioning is recommended.
Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and built-in functions of Hive to achieve the effect of those DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.
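A hedged sketch of simulating a record-level update by rewriting the table (table and column names are illustrative):
-- Simulate: UPDATE users SET status = 'inactive' WHERE id = 42
INSERT OVERWRITE TABLE users
SELECT id,
       name,
       CASE WHEN id = 42 THEN 'inactive' ELSE status END AS status
FROM users;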
Yes. A partition can be archived. The advantage is that it decreases the number of files the NameNode has to track, and the archived partition can still be queried through Hive. The disadvantage is that queries against an archived partition are less efficient, and archiving does not offer any space savings.
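A minimal sketch, assuming a table partitioned by ds and the hive.archive.enabled setting turned on (names illustrative):
set hive.archive.enabled=true;
-- Pack the partition into a single HAR file (fewer files for the NameNode)
ALTER TABLE page_views ARCHIVE PARTITION (ds = '2021-10-22');
-- Reverse the operation
ALTER TABLE page_views UNARCHIVE PARTITION (ds = '2021-10-22');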
Hive provides no additional capabilities beyond MapReduce. The programs are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, which the framework then submits to the cluster.
When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error even if the table being dropped does not exist; the IF EXISTS clause suppresses it. Without IF EXISTS, dropping a non-existent table raises an error.
Partitioning in Hive helps prune the data when executing queries, which speeds up processing. Partitions are created when data is inserted into the table. In a static partition, the name of the partition is hardcoded into the insert statement, whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Whether to choose a static or dynamic partition depends on how data is loaded into the table, the requirements for the data, and the format in which the data is produced at the source. In a dynamic partition, the complete data in the file is read and partitioned through a MapReduce job into the tables based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
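A minimal sketch contrasting the two, with illustrative names:
-- Static partition: the partition value is hardcoded in the statement
INSERT OVERWRITE TABLE sales PARTITION (country = 'US')
SELECT id, amount FROM staging_sales WHERE country = 'US';

-- Dynamic partition: Hive derives the partition from the last selected column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT id, amount, country FROM staging_sales;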
Regex stands for regular expression. Whenever you need pattern matching to decide how fields are extracted and stored, you can use RegexSerDe, which is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SERDEPROPERTIES, you define your input pattern and output fields. For example, to get the column values from a line like xyz/pq@def, you would capture xyz, pq, and def separately.
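A hedged sketch for the xyz/pq@def example above; the regex is an assumption about the line layout:
CREATE TABLE parsed_lines (
  xyz STRING,
  pq STRING,
  def STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(.*)/(.*)@(.*)',
  'output.format.string' = '%1$s %2$s %3$s'
);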
While reading data, Hive first communicates with the InputFormat, which uses a RecordReader to read records from the underlying files. The SerDe then deserializes each record into a row object, and an ObjectInspector is used to access the individual fields of that row. On the write path, the same components work in reverse: the ObjectInspector walks the row object and the SerDe serializes it back into the storage format.
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The main reason for choosing an RDBMS is to achieve low latency, because HDFS read/write operations are time-consuming.
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists. This metastore property is defined in the configuration file hive-site.xml as javax.jdo.option.ConnectionURL, with the default value jdbc:derby:;databaseName=metastore_db;create=true. To use the metastore from a fixed location, change this value to point to an absolute path.
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
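Note that this clause was removed in Hive 2.0; on earlier versions, the statement looks roughly like this (table and partition names are illustrative):
-- Stop the partition from being queried
ALTER TABLE page_views PARTITION (ds = '2021-10-22') ENABLE OFFLINE;
-- Make it queryable again
ALTER TABLE page_views PARTITION (ds = '2021-10-22') DISABLE OFFLINE;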
Yes, Hive uses the SerDe interface for I/O operations. Different SerDe implementations can read and write different types of data: the default SerDe can directly process plain delimited data, whereas other data formats stored in Hadoop require a matching SerDe implementation.
When running Hive as a server, an application can connect in one of the following three ways:
ODBC Driver - supports the ODBC protocol
JDBC Driver - supports the JDBC protocol
Thrift Client - can be used to make calls to all Hive commands using different programming languages such as PHP, Python, Java, C++, and Ruby.
This can be achieved by setting MapReduce jobs to execute in strict mode: set hive.mapred.mode=strict; Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that limits the partitions scanned.
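A minimal sketch, assuming a sales table partitioned by country:
set hive.mapred.mode=strict;
-- Fails in strict mode: no partition predicate
-- SELECT * FROM sales;
-- Succeeds: the WHERE clause limits the partitions scanned
SELECT * FROM sales WHERE country = 'US';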
When you perform a "select * from <table>", Hive fetches the whole data from the file as a FetchTask rather than a MapReduce task, simply dumping the data as-is without doing anything to it. This is similar to "hadoop dfs -text". However, with "select <column> from <table>", Hive requires a MapReduce job, since it needs to extract the column from each row by parsing the file it loads.
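A minimal sketch (in later Hive versions the hive.fetch.task.conversion setting controls how far this optimization goes):
-- Served as a simple fetch task: no MapReduce job
SELECT * FROM sales;
-- Needs a MapReduce job to project the column out of each row (older Hive versions)
SELECT amount FROM sales;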
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets (or multiples thereof) and are also sorted on the same column, a sort-merge-bucket map join (SMB map join) is possible, in which all the joins take place in the map phase itself by matching the corresponding buckets.
Buckets are basically files that are created inside the HDFS directory.
There are different properties which you need to set for bucket map joins and they are as follows:
set hive.enforce.sortmergebucketmapjoin = false;
set hive.auto.convert.sortmerge.join = false;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
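A hedged sketch of two tables that qualify for an SMB map join (names and bucket count are illustrative):
CREATE TABLE orders (
  user_id INT,
  amount DOUBLE
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

CREATE TABLE users (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

-- With the properties above set, this join can run bucket-by-bucket in the map phase
SELECT o.user_id, u.name, o.amount
FROM orders o JOIN users u ON o.user_id = u.user_id;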
Hive uses ObjectInspector to analyze the internal structure of rows, columns, and complex objects. Additionally, it gives us ways to access the internal fields inside an object. It processes not only common data types like int, bigint, and STRING, but also complex data types like arrays, maps, structs, and unions.
ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, such as:
>> An instance of a Java class.
>> A standard Java object.
>> A lazily initialized object.
ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.
The following classes are used by Hive to read and write HDFS files:
>> TextInputFormat/HiveIgnoreKeyTextOutputFormat: These two classes read/write data in plain text file format.
>> SequenceFileInputFormat/SequenceFileOutputFormat: These two classes read/write data in the Hadoop SequenceFile format.
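For example, STORED AS clauses expand to the matching input/output format classes (table names are illustrative):
-- Equivalent to STORED AS TEXTFILE
CREATE TABLE t_text (line STRING)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Uses SequenceFileInputFormat/SequenceFileOutputFormat
CREATE TABLE t_seq (line STRING) STORED AS SEQUENCEFILE;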
Local Metastore: The metastore service runs in the same JVM in which the Hive service is running, and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore: In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
>> SORT BY – Data is ordered at each of ‘N’ reducers, where the reducers can have overlapping ranges of data.
>> ORDER BY – This is similar to ORDER BY in SQL, where total ordering of data takes place by passing it to a single reducer.
>> DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns will go to the same reducer.
>> CLUSTER BY – It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of data, which is then sorted by those ranges at the respective reducers. Minimal query sketches for each clause follow.
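Table and column names below are illustrative:
-- Per-reducer ordering; reducer ranges may overlap
SELECT * FROM sales SORT BY amount;
-- Total ordering through a single reducer
SELECT * FROM sales ORDER BY amount;
-- Rows with the same country go to the same reducer
SELECT * FROM sales DISTRIBUTE BY country;
-- Shorthand for DISTRIBUTE BY country SORT BY country
SELECT * FROM sales CLUSTER BY country;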
The Hive query processor converts a HiveQL query into a graph of MapReduce jobs, and the execution-time framework runs those jobs in the order of their dependencies.
By default, a Hive table is stored in the HDFS directory /user/hive/warehouse. One can change this by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
If data is already present in HDFS, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/. Instead, the user simply defines the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore while leaving the data where it is:
CREATE EXTERNAL TABLE table_name (
  id INT,
  myfields STRING
)
LOCATION '/my/location/in/hdfs';
In a managed table, both the metadata information and the table data are deleted from the Hive warehouse directory if you drop the table. However, for an external table, only the metadata information associated with the table is deleted, while the table data is retained in HDFS.
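A minimal sketch of the difference (table names are illustrative):
-- Managed table: DROP removes the metadata AND the data in the warehouse
DROP TABLE managed_sales;
-- External table: DROP removes only the metadata; the HDFS files remain
DROP TABLE external_sales;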
The key differentiating points between Hive and HBase are:
* Hive is a data warehouse framework whereas HBase is a NoSQL database.
* While Hive can run most SQL queries, HBase does not allow SQL queries.
* Hive doesn’t support record-level insert, update, and delete operations on a table, but HBase supports these functions.
* Hive runs on top of MapReduce, but HBase runs on top of HDFS.
In Hive, ObjectInspector helps to analyze the internal structure of a row object and the individual structure of columns. Furthermore, it also offers ways to access complex objects that can be stored in different formats in memory.
Using the ORC (Optimized Row Columnar) file format, you can store Hive data efficiently, as it helps overcome numerous limitations of the other Hive file formats.
HiveServer2 is a server interface, part of Hive Services, that enables remote clients to execute queries against Hive and retrieve the results. The current implementation (HS2) is based on Thrift RPC, is an improved version of HiveServer1, and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC drivers.
The main purpose of the Hive Thrift server is to allow access to Hive over a single port.
The Hive server is also known as the Thrift server. Thrift is a software framework for scalable cross-language services development; it allows clients written in languages including Java, C++, Ruby, and many others to programmatically access Hive remotely.
The main aim of both partitioning and bucketing is to execute queries more efficiently. The difference is that with bucketing, the number of slices (buckets) is fixed when you create the table, whereas new partitions can keep being added as data arrives.
When performing queries on large datasets in Hive, bucketing can offer better structure to Hive tables. You’ll also want to take your answer a step further by explaining some of the specific bucketing features, as well as some of the advantages of bucketing in Hive. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.
Sometimes interviewers like to ask these basic questions to see how confident you are in your Hive knowledge. Answer by saying that Hive can operate in two modes, MapReduce mode and local mode, and explain that the choice depends on the size of the data nodes in Hadoop.
In Hive, the analysis of the internal structure of rows, columns, and complex objects is done using the ObjectInspector functionality. ObjectInspector also provides access to the internal fields present inside the objects.
HCatalog is required for sharing data structures with external systems. It provides access to the Hive metastore, so you can read/write data to the Hive data warehouse.
In Hive, you can enable buckets by using the following command: set hive.enforce.bucketing=true;
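Once enabled, buckets are declared at table creation and populated through an insert; a minimal sketch with illustrative names:
CREATE TABLE users_bucketed (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, name FROM users_staging;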
A Hive index is a Hive query optimization method. It is used to speed up access to a specific column or set of columns in a Hive database. With a Hive index, the database system does not need to read all rows in a table to find the chosen data.
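Note that indexes were removed in Hive 3.0; on earlier versions, a compact index might be created like this (names illustrative):
CREATE INDEX idx_sales_country
ON TABLE sales (country)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or rebuild) the index data
ALTER INDEX idx_sales_country ON sales REBUILD;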
In Hive, tables are classified and organized into partitions to group similar types of data together, according to a column or partition key. A partition is actually a sub-directory inside the table directory. A table may have more than one partition key.
Through partitioning, you can achieve granularity in a Hive table. This helps to reduce the query latency as it only scans relevant partitioned data instead of the whole dataset.
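For example, partitions can be added and inspected explicitly (names illustrative):
-- Adds the sub-directory .../sales/country=US under the table directory
ALTER TABLE sales ADD PARTITION (country = 'US');
-- List the table's partitions
SHOW PARTITIONS sales;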
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid.
HCatalog can be used to share data structures with external systems. HCatalog provides access to the Hive metastore to users of other tools on Hadoop so that they can read and write data to the Hive data warehouse.
Yes, with the help of the LOCATION keyword we can change the default location of a managed table while creating it in Hive. To do so, the user specifies the desired storage path of the managed table as the value of the LOCATION keyword.
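A minimal sketch (the path is illustrative):
-- Managed table stored outside the default warehouse directory
CREATE TABLE managed_sales (
  id INT,
  amount DOUBLE
)
LOCATION '/data/hive/managed_sales';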
SerDe is short for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it does NOT own the Hadoop file system format in which data is stored. Users are able to write files to HDFS with whatever tools or mechanism they prefer ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH") and use Hive to correctly "parse" that file format in a way that Hive can use. A SerDe is a powerful (and customizable) mechanism that Hive uses to "parse" data stored in HDFS for Hive's use.
There are three ways of interacting with Hive:
I. Hive Thrift Client: With any programming language that supports Thrift, we can interact with Hive.
II. JDBC Driver: The BeeLine CLI uses the JDBC Driver to connect to the Hive server.
III. ODBC Driver: Applications that support ODBC can use the ODBC Driver to connect to the Hive server.