How to delete and update a record in Hive
Yes, rightly said. Hive does not support UPDATE option. But the following alternative could be used to achieve the result:
Update records in a
partitioned Hive table:
- The main table is assumed to be partitioned by some key.
- Load the incremental data (the data to be updated) to a staging table partitioned with the same keys as the main table.
Join the two tables (main & staging tables) using a
LEFT OUTER JOINoperation as below:
insert overwrite table main_table partition (c,d) select t2.a, t2.b, t2.c,t2.d from staging_table t2 left outer join main_table t1 on t1.a=t2.a;
In the above example, the
main_table & the
staging_table are partitioned using the
(c,d) keys. The tables are joined via a
LEFT OUTER JOIN and the result is used to
OVERWRITE the partitions in the
A similar approach could be used in the case of
un-partitioned Hive table
UPDATE operations too.
As of Hive version 0.14.0: INSERT...VALUES, UPDATE, and DELETE are now available with full ACID support.
INSERT ... VALUES Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is: ( value [, value ...] ) where a value is either null or any valid SQL literal
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE FROM tablename [WHERE expression]
Additionally, from the Hive Transactions doc:
If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional" must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.
Hive DML reference:
Hive Transactions reference:
You should not think about Hive as a regular RDBMS, Hive is better suited for batch processing over very large sets of immutable data.
The following applies to versions prior to Hive 0.14, see the answer by ashtonium for later versions.
There is no operation supported for deletion or update of a particular record or particular set of records, and to me this is more a sign of a poor schema.
Here is what you can find in the official documentation:
Hadoop is a batch processing system and Hadoop jobs tend to have high latency and
incur substantial overheads in job submission and scheduling. As a result -
latency for Hive queries is generally very high (minutes) even when data sets
involved are very small (say a few hundred megabytes). As a result it cannot be
compared with systems such as Oracle where analyses are conducted on a
significantly smaller amount of data but the analyses proceed much more
iteratively with the response times between iterations being less than a few
minutes. Hive aims to provide acceptable (but not optimal) latency for
interactive data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not offer
real-time queries and row level updates. It is best used for batch jobs over
large sets of immutable data (like web logs).
A way to work around this limitation is to use partitions: I don't know what you id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop partitions for the ids you want to get rid of.