可以通過(guò)多種方式將數(shù)據(jù)導(dǎo)入hive表
1.通過(guò)外部表導(dǎo)入
用戶(hù)在hive上建external表,建表的同時(shí)指定hdfs路徑,在數(shù)據(jù)拷貝到指定hdfs路徑的同時(shí),也同時(shí)完成數(shù)據(jù)插入external表。
例如:
編輯文件test.txt
$ cat test.txt
1 hello
2 world
3 test
4 case
字段之間以'\t'分割
啟動(dòng)hive:
$ hive
建external表:
hive> CREATE EXTERNAL TABLE MYTEST(num INT, name STRING)
> COMMENT 'this is a test'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/data/test';
OK
Time taken: 0.714 seconds
hive> show tables;
OK
mytest
partition_test
partition_test_input
test
Time taken: 0.07 seconds
hive> desc mytest ;
OK
num int
name string
Time taken: 0.121 seconds|
數(shù)據(jù)拷貝到hdfs:
$ hadoop fs -put test.txt /data/test
查看hive表數(shù)據(jù):
hive> select * from mytest;
OK
1 hello
2 world
3 test
4 case
Time taken: 0.375 seconds
hive> select num from mytest;
Total MapReduce jobs = 1
Launching Job 1 out of 1
......
Total MapReduce CPU Time Spent: 510 msec
OK
1
2
3
4
Time taken: 27.157 seconds
這種方式常常用于當(dāng)hdfs上有一些歷史數(shù)據(jù),而我們需要在這些數(shù)據(jù)上做一些hive的操作時(shí)使用。這種方式避免了數(shù)據(jù)拷貝開(kāi)銷(xiāo)
2.從本地導(dǎo)入
數(shù)據(jù)不在hdfs上,直接從本地導(dǎo)入hive表
文件/home/work/test.txt內(nèi)容同上
建表:
hive> CREATE TABLE MYTEST2(num INT, name STRING)
> COMMENT 'this is a test2'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
OK
Time taken: 0.077 seconds
導(dǎo)數(shù)據(jù)入表:
hive> LOAD DATA LOCAL INPATH '/home/work/test.txt' INTO TABLE MYTEST2;
Copying data from file:/home/work/test.txt
Copying file: file:/home/work/test.txt
Loading data to table default.mytest2
OK
Time taken: 0.24 seconds
查看數(shù)據(jù):
hive> select * from MYTEST2;
OK
1 hello
2 world
3 test
4 case
Time taken: 0.11 seconds
這種方式導(dǎo)入的本地?cái)?shù)據(jù)可以是一個(gè)文件,一個(gè)文件夾或者通配符,需要注意的是,如果是文件夾,文件夾內(nèi)不能包含子目錄,同樣,通配符只能通配文件。
3.從hdfs導(dǎo)入
上述test.txt文件已經(jīng)導(dǎo)入/data/test
則可以使用下述命令直接將數(shù)據(jù)導(dǎo)入hive表:
hive> CREATE TABLE MYTEST3(num INT, name STRING)
> COMMENT "this is a test3"
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
OK
Time taken: 4.735 seconds
hive> LOAD DATA INPATH '/data/test/test.txt' INTO TABLE MYTEST3;
Loading data to table default.mytest3
OK
Time taken: 0.337 seconds
hive> select * from MYTEST3 ;
OK
1 hello
2 world
3 test
4 case
Time taken: 0.227 seconds
4. 從其它表導(dǎo)入數(shù)據(jù):
hive> CREATE EXTERNAL TABLE MYTEST4(num INT) ;
OK
Time taken: 0.091 seconds
hive> FROM MYTEST3 test3
> INSERT OVERWRITE TABLE MYTEST4
> select test3.num where name='world';
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201207230024_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201207230024_0002
Kill Command = /home/work/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201207230024_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-07-23 18:59:02,365 Stage-1 map = 0%, reduce = 0%
2012-07-23 18:59:08,417 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:09,435 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:10,445 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:11,455 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:12,470 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:13,489 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.62 sec
2012-07-23 18:59:14,508 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 0.62 sec
MapReduce Total cumulative CPU time: 620 msec
Ended Job = job_201207230024_0002
Ended Job = -174856900, job is filtered out (removed at runtime).
Moving data to: hdfs://localhost:9000/tmp/hive-work/hive_2012-07-23_18-58-44_166_189728317691010041/-ext-10000
Loading data to table default.mytest4
Deleted hdfs://localhost:9000/user/hive/warehouse/mytest4
Table default.mytest4 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 2, raw_data_size: 0]
1 Rows loaded to mytest4
MapReduce Jobs Launched:
Job 0: Map: 1 Accumulative CPU: 0.62 sec HDFS Read: 242 HDFS Write: 2 SUCESS
Total MapReduce CPU Time Spent: 620 msec
OK
Time taken: 30.663 seconds
hive> select * from mytest4;
OK
2
Time taken: 0.103 seconds