The hive.fetch.task.conversion property reduces latency by avoiding MapReduce overhead: when executing simple queries such as a SELECT with a filter or a LIMIT, Hive fetches the data directly and skips the MapReduce job.
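As a minimal sketch, the property can be set for the current session before running a simple query. The table transactions and its columns are hypothetical and used only for illustration.

```sql
-- Allowed values are none, minimal, and more.
SET hive.fetch.task.conversion=more;

-- Hypothetical table: a plain projection with a filter and a LIMIT
-- is a candidate for a direct fetch instead of a MapReduce job.
SELECT id, amount
FROM transactions
WHERE amount > 100
LIMIT 10;
```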
Posted Date:- 2021-10-21 22:42:23
It controls how the map output is divided among the reducers. It is useful in the case of streaming data.
Posted Date:- 2021-10-21 22:41:27
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
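The ordering can be sketched as follows, assuming three hypothetical tables (countries, customers, orders) of increasing size. The usual rationale is that Hive buffers the earlier tables of a join and streams the last one, so the largest table is best placed last.

```sql
-- Hypothetical tables, listed from smallest to largest.
SELECT o.order_id, cu.customer_name, c.country_name
FROM countries c                                   -- smallest table first
JOIN customers cu ON cu.country_id = c.country_id
JOIN orders o     ON o.customer_id = cu.customer_id;  -- largest table last
```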
Posted Date:- 2021-10-21 22:40:50
No, we cannot use the metastore in sharing mode. Instead, it is possible to use a standalone “real” database such as MySQL or PostgreSQL.
Posted Date:- 2021-10-21 22:39:27
There is a precedence hierarchy for setting properties. In the list below, lower numbers take precedence over higher numbers:
1. The SET command in Hive
2. The command-line --hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml
6. hadoop-default.xml
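A short sketch of the top of this hierarchy, run inside a Hive session. The property hive.exec.parallel is used only as an example; any property could be substituted.

```sql
-- Print the current value, which may come from hive-site.xml
-- or from a --hiveconf option given on the command line.
SET hive.exec.parallel;

-- Override it for this session; SET sits at the top of the hierarchy.
SET hive.exec.parallel=true;

-- Print it again to confirm the session-level value.
SET hive.exec.parallel;
```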
Posted Date:- 2021-10-21 22:38:47
When we run Hive in embedded mode, it creates a local metastore, after first checking whether a metastore already exists. This behavior is defined in the configuration file hive-site.xml by the property “javax.jdo.option.ConnectionURL”, whose default value is “jdbc:derby:;databaseName=metastore_db;create=true”. To change the behavior, change the location to an absolute path, so that the metastore is used from that location.
Posted Date:- 2021-10-21 22:36:05
If the data is already present in HDFS, the user does not need LOAD DATA, which moves the files to /user/hive/warehouse/. Instead, the user just has to define the table using the keyword EXTERNAL, which creates the table definition in the Hive metastore.
CREATE EXTERNAL TABLE table_name (
id INT,
myfields STRING
)
LOCATION '/my/location/in/hdfs';
Posted Date:- 2021-10-21 22:35:12
Yes, by using the LOCATION keyword while creating the managed table, we can change the default location of managed tables. The only condition is that the user has to specify the storage path of the managed table as the value of the LOCATION keyword.
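A minimal sketch of such a statement. The table name, columns, and HDFS path are hypothetical.

```sql
-- Managed table (no EXTERNAL keyword) stored outside the default
-- warehouse directory; the path below is only an illustration.
CREATE TABLE sales_managed (
  id INT,
  amount DOUBLE
)
LOCATION '/data/hive/sales_managed';
```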
Posted Date:- 2021-10-21 22:34:22
For adding a new partition in the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';
Posted Date:- 2021-10-21 22:33:46
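To configure the metastore in Hive, the hive-site.xml file has to be configured with the property below, where node1 is the hostname or IP address of the metastore host:
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1:9083</value>
<description>IP address and port of the metastore host</description>
</property>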
Posted Date:- 2021-10-21 22:33:03
Yes, one can run shell commands in Hive by adding a '!' before the command and ending it with a semicolon, for example: !pwd;
Posted Date:- 2021-10-21 22:30:17
Yes, you can override Hadoop MapReduce configuration in Hive.
Posted Date:- 2021-10-21 22:29:03
Usually, while reading or writing data, the user first communicates with the InputFormat, which then connects with the RecordReader to read or write records. To serialize the data, the data goes to the row. Here, the custom SerDe uses the ObjectInspector to deserialize the data into fields.
Posted Date:- 2021-10-21 22:28:25
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose of each option: for example, the JDBC driver supports the JDBC protocol.
Posted Date:- 2021-10-21 22:27:00
No. The name of a view must be unique among all other tables and views present in the same database.
Posted Date:- 2021-10-21 22:25:57
We use ObjectInspector to analyze the structure of individual columns and the internal structure of the row objects. It provides access to complex objects, which can be stored in multiple formats in Hive.
Posted Date:- 2021-10-21 22:25:17
Partitioning provides granularity in a Hive table and therefore reduces query latency by scanning only the relevant partitioned data instead of the whole data set.
For example, we can partition a transaction log of an e-commerce website by month, such as January, February, etc. Any analytics regarding a particular month, say January, then has to scan only the January partition (sub-directory) instead of the whole table data.
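The monthly example can be sketched as follows. The table transaction_log and its columns are hypothetical; only the partition column month comes from the description above.

```sql
-- Hypothetical transaction log, partitioned by month.
-- Each month value becomes its own sub-directory under the table directory.
CREATE TABLE transaction_log (
  txn_id INT,
  amount DOUBLE
)
PARTITIONED BY (month STRING);

-- Filtering on the partition column lets Hive read only the 'Jan'
-- sub-directory instead of scanning the whole table.
SELECT SUM(amount)
FROM transaction_log
WHERE month = 'Jan';
```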
Posted Date:- 2021-10-21 22:24:40
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.
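The difference can be sketched with a hypothetical table transaction_log (txn_id, amount, month). Note that SORT BY orders rows only within each reducer, so the overall output is not totally ordered unless a single reducer is used.

```sql
-- Total ordering: every row passes through a single reducer.
SELECT txn_id, amount
FROM transaction_log
ORDER BY amount DESC;

-- Per-reducer ordering: the work is spread across multiple reducers,
-- and each reducer's output is sorted independently.
SELECT txn_id, amount
FROM transaction_log
SORT BY amount DESC;

-- DISTRIBUTE BY sends all rows of a month to the same reducer,
-- so each month's rows come out sorted together.
SELECT txn_id, month, amount
FROM transaction_log
DISTRIBUTE BY month
SORT BY month, amount DESC;
```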
Posted Date:- 2021-10-21 22:23:53
A Hive index is a Hive query optimization technique. We use it to speed up access to a column or set of columns in a Hive database, since with an index the database system does not need to read all rows in the table to find the selected data.
Posted Date:- 2021-10-21 22:23:05
Hive determines the bucket number for a row by using the formula: hash_function(bucketing_column) modulo (num_of_buckets). The hash_function depends on the column data type. For example, the hash_function for the integer data type is:
hash_function(int_type_column) = value of int_type_column
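As a worked case of the formula, an integer key of 10 with 4 buckets gives 10 modulo 4 = 2. The sketch below mirrors the formula using Hive's built-in hash() and pmod() functions on a hypothetical table users_bucketed; it illustrates the stated formula and is not necessarily identical to Hive's internal implementation.

```sql
-- Hypothetical table bucketed on an integer column into 4 buckets.
CREATE TABLE users_bucketed (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- Apply the formula hash_function(bucketing_column) modulo num_of_buckets
-- to each row to see which bucket the formula assigns it to.
SELECT user_id,
       pmod(hash(user_id), 4) AS formula_bucket
FROM users_bucketed;
```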
Posted Date:- 2021-10-21 22:20:03
There are two main reasons for performing bucketing on a partition:
* A map-side join requires the data belonging to a unique join key to be present in the same partition.
* It decreases the query time and makes the sampling process more efficient.
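The sampling benefit can be sketched with TABLESAMPLE, assuming a hypothetical table users_bucketed that was created with CLUSTERED BY (user_id) INTO 4 BUCKETS.

```sql
-- Read roughly one quarter of the data by sampling a single bucket.
-- Because the table is already bucketed on user_id, Hive can pick
-- the matching bucket rather than scanning every row.
SELECT *
FROM users_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id);
```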
Posted Date:- 2021-10-21 22:19:32
ObjectInspector functionality in Hive is used to analyze the internal structure of columns, rows, and complex objects. It allows access to the internal fields inside the objects.
Posted Date:- 2021-10-21 22:18:01
A Hive variable is a variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
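A minimal sketch of defining and referencing a variable. The variable name target_month and the table transaction_log are hypothetical.

```sql
-- Define a variable for this session. The same variable could instead be
-- passed from the command line with: hive --hivevar target_month=Jan -f script.hql
SET hivevar:target_month=Jan;

-- Hive substitutes the variable's value before the query is executed.
SELECT *
FROM transaction_log
WHERE month = '${hivevar:target_month}';
```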
Posted Date:- 2021-10-21 22:17:25
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION '<hdfs_path>'.
Posted Date:- 2021-10-21 22:16:56
The default read and write classes in Hive are:
1. TextInputFormat/HiveIgnoreKeyTextOutputFormat
2. SequenceFileInputFormat/SequenceFileOutputFormat
Posted Date:- 2021-10-21 22:16:30
For single-user metadata storage, Hive uses the Derby database, and for multi-user or shared metadata, Hive uses MySQL.
Posted Date:- 2021-10-21 22:15:47
Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
Posted Date:- 2021-10-21 22:15:15
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The reason for choosing an RDBMS is to achieve low latency, as HDFS read/write operations are time-consuming processes.
Posted Date:- 2021-10-21 22:14:51
The metastore in Hive stores the metadata information using an RDBMS and an open source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.
Posted Date:- 2021-10-21 22:14:30
In dynamic partitioning, the values for partition columns are known only at runtime. In other words, they become known during the loading of data into a Hive table.
Usage:
* When we load data from an existing non-partitioned table, in order to improve the sampling and thus decrease the query latency.
* When we do not know all the values of the partitions beforehand, since finding these partition values manually from a huge dataset is a tedious task.
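The first usage case can be sketched as a dynamic-partition insert. The tables transaction_staging (non-partitioned) and transaction_log (partitioned by month) are hypothetical.

```sql
-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
-- nonstrict mode allows all partition columns to be dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition value is not written in the statement; Hive derives it
-- at runtime from the last column of the SELECT list (month).
INSERT OVERWRITE TABLE transaction_log PARTITION (month)
SELECT txn_id, amount, month
FROM transaction_staging;
```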
Posted Date:- 2021-10-21 22:13:58
In a Hive table, partitioning provides granularity. By scanning only the relevant partitioned data instead of the whole dataset, it reduces the query latency.
Posted Date:- 2021-10-21 22:13:13
MapReduce mode is used when:
* Queries run on large data sets and need to execute in parallel.
* Hadoop has multiple data nodes and the data is distributed across different nodes.
* Better performance needs to be achieved when processing large data sets.
Posted Date:- 2021-10-21 22:11:30
To group similar types of data together on the basis of a column or partition key, Hive organizes tables into partitions.
Each table can have one or more partition keys to identify a particular partition. In other words, a Hive partition is a sub-directory in the table directory.
Posted Date:- 2021-10-21 22:10:41
We should use SORT BY instead of ORDER BY, especially when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer.
Hence, using ORDER BY on a large number of inputs will take a lot of time to execute.
Posted Date:- 2021-10-21 22:09:44
Managed table
The metadata information along with the table data is deleted from the Hive warehouse directory if one drops a managed table.
External table
If one drops an external table, Hive just deletes the metadata information regarding the table. It leaves the table data present in HDFS untouched.
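The contrast can be sketched with two hypothetical tables, sales_managed (a managed table) and sales_external (created with the EXTERNAL keyword).

```sql
-- Managed table: both the metadata and the underlying data files are removed.
DROP TABLE sales_managed;

-- External table: only the metadata is removed; the files at the
-- table's LOCATION in HDFS are left in place.
DROP TABLE sales_external;
```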
Posted Date:- 2021-10-21 22:09:19
Local metastores run in the same Java Virtual Machine (JVM) as the Hive service, whereas remote metastores run in a separate, distinct JVM.
Posted Date:- 2021-10-21 22:08:50
By using the REPLACE COLUMNS option:
ALTER TABLE table_name REPLACE COLUMNS ……
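A complete statement might look like the sketch below. The table employees and its column list are hypothetical and are not the columns elided above. REPLACE COLUMNS swaps the table's entire column list for the new one and changes only the metadata, not the stored data.

```sql
-- Hypothetical table: after this statement the table schema consists
-- of exactly these two columns; any column not listed is dropped
-- from the schema.
ALTER TABLE employees REPLACE COLUMNS (
  emp_id INT,
  full_name STRING
);
```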
Posted Date:- 2021-10-21 22:08:23
By default, Hive offers an embedded Derby database instance backed by the local disk for the metastore. This is what we call the embedded metastore configuration.
Posted Date:- 2021-10-21 22:07:55
Local Metastore:
In this configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
Posted Date:- 2021-10-21 22:07:19
Yes, the default managed table location can be changed in Hive by using the LOCATION '<hdfs_path>' clause.
Posted Date:- 2021-10-21 22:06:51
By default, Hive table data is stored in the HDFS directory /user/hive/warehouse. This can be altered.
Posted Date:- 2021-10-21 22:06:33
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. We use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming processes.
Posted Date:- 2021-10-21 22:06:13
ALTER TABLE table_name RENAME TO new_name;
Posted Date:- 2021-10-21 22:05:51
No. Hive does not provide insert and update at the row level, so it is not suitable for OLTP systems.
Posted Date:- 2021-10-21 22:05:13
There are two types: managed tables and external tables. In a managed table, both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
Posted Date:- 2021-10-21 22:04:50
Yes, you can change a table name in Hive. You can rename a table by using: ALTER TABLE table_name RENAME TO new_name;
Posted Date:- 2021-10-21 22:04:25
We use the metastore to store the metadata information in Hive. This is made possible by an RDBMS and an open source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into the relational schema and vice versa.
Posted Date:- 2021-10-21 22:04:06
By default, a Hive table is stored in the HDFS directory /user/hive/warehouse. One can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter present in hive-site.xml.
Posted Date:- 2021-10-21 22:03:49
By exposing its Thrift server, Hive supports all client applications written in Java, PHP, Python, C++, or Ruby.
Posted Date:- 2021-10-21 22:03:36
Hive is a data warehousing tool. It provides SQL queries to perform analysis, as well as an abstraction. Although Hive is not a database, it gives you a logical abstraction over the databases and the tables.
Posted Date:- 2021-10-21 22:03:15