Hive SQL Data Definition Language (DDL)

Posted on 2022-10-29, filed under hive

Introduction

Embedded-mode installation (overview only; for the detailed installation and configuration steps, see the earlier Hive installation notes).

Local-mode installation (overview only; for the detailed installation and configuration steps, see the earlier Hive installation notes).

Remote-mode installation.

Option 3, remote-mode installation, is Hive's main deployment scenario (overview only; for the detailed installation and configuration steps, see the earlier Hive installation notes).

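When we use Hive to create a table and insert data into it directly, the operation is very slow, because Hive actually runs a Java program behind the scenes: opening the YARN web UI at that moment shows a MapReduce job running.

Inserting a single row took 182 seconds, which shows that Hive really is not a conventional SQL database.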
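
For reference, a minimal sketch of the kind of statement that triggers this behaviour. The table name t_demo and its columns are assumptions made for illustration, not taken from the original notes, and the comment assumes MapReduce is the execution engine, as in these notes.

```sql
-- Hypothetical table, used only to illustrate the point above.
create table t_demo(id int, name string);

-- Even a single-row insert is compiled into a job and submitted to YARN,
-- so it takes far longer than the same statement in a traditional RDBMS.
insert into table t_demo values (1, 'zhangsan');
```

While the insert is running, the job should be visible in the YARN web UI.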

Mapping a file to a table is the right approach

vim user.txt
1,zhangsan
2,lisi

bin/hadoop fs -put user.txt / # Upload user.txt to the HDFS root directory with the hadoop command.

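Browsing the HDFS root directory through the web UI on port 9870 now shows a new user.txt file.

By default, Hive creates databases and tables under /user/hive/warehouse. Move the user.txt we just put in the HDFS root directory into the directory of the corresponding table under /user/hive/warehouse.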

bin/hadoop fs -mv /user.txt /user/hive/warehouse/test.db/table_test

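The table now contains rows, but every new value is displayed as NULL. The table was created without a field delimiter, so Hive cannot split the comma-separated lines into columns.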
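
One way to check where a table lives and how it splits fields is to inspect its metadata. The sketch below assumes the database test and the table table_test implied by the path used above.

```sql
-- Assumes the database `test` and table `table_test` from the path above.
use test;

-- Prints the table's HDFS location and its storage parameters.
-- If no "field.delim" entry is listed, the table uses Hive's default
-- field delimiter '\001', so comma-separated lines are not split into columns.
describe formatted table_test;
```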

Specify the field delimiter when creating the table in Hive

create table t_user(id int,name varchar(20)) row format delimited fields terminated by ',';

Put user.txt into the directory of that table so that it is mapped to the table

bin/hadoop fs -put user.txt /user/hive/warehouse/test.db/t_user

Now run select * from t_user; in Hive: the rows from user.txt are returned, so the mapping succeeded.

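Note that when a file is mapped to a table, Hive tries to convert each field to the declared column type. If a field in the file is an int and the table column is varchar, Hive converts it automatically. But if the field in the file is a string and the column is declared as int, Hive cannot convert the string to an int and the value shows up as NULL. This resembles the widening rules of Java's automatic type conversion.

So when creating a table, keep the column order and column types consistent with the file.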
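
A small sketch of the failure case described above. The table name t_user_mismatch and the local path are assumptions for illustration; the file is the user.txt from earlier.

```sql
-- Hypothetical table whose second column type does not match user.txt.
create table t_user_mismatch(id int, name int)
row format delimited fields terminated by ',';

-- Load the same file (replace the placeholder path with the real one).
load data local inpath '/path/to/user.txt' into table t_user_mismatch;

-- id is parsed as int; name ('zhangsan', 'lisi') cannot be parsed as int
-- and comes back as NULL instead of raising an error.
select id, name from t_user_mismatch;
```

Because the mismatch is silent, a column full of NULLs is usually the first hint that the declared types or the column order do not match the file.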

Hive SQL Data Definition Language (DDL)

After creating the SQL file, remember to add a data source (Configure Data Source) of type Apache Hive. Set the host name and user name, test the connection, then connect.

As soon as the file is placed in the directory of the Hive table, the mapping takes effect.

So prefer \001 as the field delimiter: it is Hive's default and rarely appears in real data.
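
A minimal sketch of the two equivalent ways to get a \001-delimited table. The table names are hypothetical.

```sql
-- With no ROW FORMAT clause, Hive falls back to its default
-- field delimiter '\001' (Ctrl-A).
create table t_default_delim(id int, name string);

-- Equivalent, with the delimiter spelled out.
create table t_explicit_delim(id int, name string)
row format delimited fields terminated by '\001';
```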

Note that the / above is the HDFS root directory, not the root of the local file system.

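hadoop fs -chmod -R 777 / # Grant full permissions so that HDFS directories can be modified from the browser at node01:9870.

# The more traditional approach is the command line
# List files
hadoop fs -ls /user/hive/warehouse
# Delete a file or directory (-r is required for a directory)
hadoop fs -rm -r /data

# Once the new HDFS directory exists, a Hive table can be created with location pointing to it. Uploading a file there with hadoop fs -put <file> <new location> maps it to the table.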
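
A sketch of a table created with an explicit location. The table name and the HDFS path /data/t_user_location are assumptions for illustration; the columns follow user.txt.

```sql
-- Hypothetical table stored outside the default warehouse directory.
-- The HDFS path is an assumption; use a directory that exists in your cluster.
create table t_user_location(id int, name string)
row format delimited fields terminated by ','
location '/data/t_user_location';

-- Any delimited file uploaded into that directory becomes visible here.
select * from t_user_location;
```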

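Creating a table with external but without location still stores it under /user/hive/warehouse. It is nevertheless an external table: dropping it keeps the underlying files.

Whether a table is internal (managed) or external is decided solely by the presence of the external keyword; where location points has nothing to do with it.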
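
The distinction can be checked from the table metadata. A minimal sketch, with a hypothetical table name:

```sql
-- Hypothetical external table created without a LOCATION clause.
create external table t_user_ext(id int, name string)
row format delimited fields terminated by ',';

-- The "Table Type" field in the output reads EXTERNAL_TABLE here
-- and MANAGED_TABLE for an internal table; "Location" shows the data directory.
describe formatted t_user_ext;

-- Dropping an external table removes only the metadata; its files stay in HDFS.
drop table t_user_ext;
```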

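Data for a partitioned table cannot simply be put into a directory and mapped as before. Each source file has to be loaded statically, from the Hive console, into the matching partition of the table in HDFS.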
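
A sketch of static partition loading. The table name t_all_hero_part, the file paths, and the partition values are assumptions for illustration; the columns mirror the t_all_hero table used in these notes.

```sql
-- Hypothetical partitioned table; the partition column `role` is not stored in the files.
create table t_all_hero_part(
    id int, name string, hp_max int, mp_max int, attack_max int,
    defense_man int, attack_range string, role_main string, role_assist string
) partitioned by (role string)
row format delimited fields terminated by "\t";

-- Static loading: the partition value is given by hand for each source file.
-- Replace the placeholder paths with the real local files.
load data local inpath '/path/to/archer.txt'
    into table t_all_hero_part partition(role='sheshou');
load data local inpath '/path/to/tank.txt'
    into table t_all_hero_part partition(role='tanke');
```

Each load creates a subdirectory such as role=sheshou under the table directory and registers it as a partition.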

In my test, the query on the partitioned table took 1 min 59 s, versus 2 min 7 s on the unpartitioned table.
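
The two kinds of query being compared look roughly like this. The tables are the ones defined in these notes; the filter values 'archer' and 6000 are assumptions for illustration.

```sql
-- Filter on the partition column: only the matching partition directory is read.
select count(*) from t_all_hero_part_dynamic
where role = 'archer' and hp_max > 6000;

-- Same logical filter on an ordinary column of the unpartitioned table:
-- every file of the table has to be scanned.
select count(*) from t_all_hero
where role_main = 'archer' and hp_max > 6000;
```

On a data set this small, job start-up time likely dominates both runs, which would explain why the measured difference is only a few seconds.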

Note that the underlying source data is not changed: the partition column that appears in query results is generated by Hive.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table t_all_hero_part_dynamic(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_man int,
    attack_range string,
    role_main string,
    role_assist string
) partitioned by (role string) row format delimited fields terminated by "\t";

create table t_all_hero(
    id int,
    name string,
    hp_max int,
    mp_max int,
    attack_max int,
    defense_man int,
    attack_range string,
    role_main string,
    role_assist string
) row format delimited fields terminated by "\t";

-- Dynamic partition insert: the partition value is taken from the last column of the select.
insert into table t_all_hero_part_dynamic partition(role)
select tmp.*,tmp.role_main from t_all_hero tmp;
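
To verify the result of the dynamic insert, something like the following can be run afterwards. It uses only the table created above.

```sql
-- Lists the partitions created by the insert, one per distinct role_main value.
show partitions t_all_hero_part_dynamic;

-- Row count per partition.
select role, count(*) as cnt
from t_all_hero_part_dynamic
group by role;
```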

This is a good point to review the hash table from the algorithms and data structures material studied earlier.

Bucketed tables can also make sampling more efficient.
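
A sketch of bucketing and bucket-based sampling. The table name t_hero_bucket and the bucket count 5 are assumptions for illustration; the source table t_all_hero is the one defined in these notes.

```sql
-- Hypothetical bucketed table: each row goes to one of 5 bucket files,
-- chosen by the hash of id modulo 5.
create table t_hero_bucket(id int, name string, role_main string)
clustered by (id) into 5 buckets;

-- Buckets are filled by an insert ... select, not by putting files in place.
insert into table t_hero_bucket
select id, name, role_main from t_all_hero;

-- Sample roughly one fifth of the table by reading a single bucket.
select * from t_hero_bucket tablesample(bucket 1 out of 5 on id);
```

Because the sample is taken on the bucketing column, Hive can read one bucket file instead of scanning the whole table.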

insert into trans_student values(1,'chenyushao',29);

As we can see, Hive does not delete the original files; each transactional operation simply produces a new version.

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;

drop table trans_student;
create table trans_student(
    id int,
    name string,
    age int
)clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

insert into trans_student values(1,'chenyushao',29);  -- Insert into the transactional table. Really slow: 3 minutes!
update trans_student set age=20 where id=1;
delete from trans_student where id =1;
select * from trans_student;
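
Since every insert, update, and delete adds new delta files rather than rewriting the old ones, these versions are merged later by compaction. A minimal sketch of how to inspect and trigger it, assuming the trans_student table above and the compactor settings already enabled:

```sql
-- Lists compaction requests and their current state.
show compactions;

-- Requests a major compaction, which merges the base and delta files
-- of the table into a new base.
alter table trans_student compact 'major';
```

The request is queued and processed in the background, so show compactions may need to be run again to see it finish.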

Create a view based on a select statement that shows only the first 5 records.
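
A sketch of such a view. The view name v_hero_top5 is hypothetical, the base table is the t_all_hero table defined in these notes, and reading "first 5" as limit 5 is an assumption.

```sql
-- Hypothetical view over an existing table, restricted to 5 rows.
create view v_hero_top5 as
select * from t_all_hero limit 5;

-- The view is queried like a table.
select * from v_hero_top5;
```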

Note: a view only makes queries more convenient to write and read; it does not actually improve query performance.
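
One way to see this is to look at the query plan. The sketch below uses a hypothetical view name v_hero_names on top of the t_all_hero table from these notes.

```sql
-- Hypothetical view that exposes two columns of t_all_hero.
create view if not exists v_hero_names as
select id, name from t_all_hero;

-- The plan scans t_all_hero itself: the view stores only the query text, no data.
explain select * from v_hero_names;

-- Clean up the illustration.
drop view if exists v_hero_names;
```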

create table student(
    id int,
    name string,
    sex string,
    age string,
    studept string
)row format delimited fields terminated by ',';

drop table student_trans;
create table student_trans(
    sno int,
    sname string,
    sdept string
)clustered by (sno) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true'); -- Create a bucketed transactional table.
insert into table student_trans select id,name,studept from student; -- Insert from the ordinary table into the bucketed transactional table.

create materialized view student_trans_agg
    as select sdept, count(*) as dept_cnt from student_trans group by sdept; -- Create a materialized view (worth doing only because this query is run often).

select sdept,count(*) from student_trans group by sdept;  -- Fast: answered directly from the stored result of the materialized view.
explain select sdept,count(*) from student_trans group by sdept; -- The plan reads from the materialized view, not from student_trans.
alter materialized view student_trans_agg disable rewrite; -- Disable query rewriting for this materialized view.
select sdept,count(*) from student_trans group by sdept;  -- Immediately much slower.
alter materialized view student_trans_agg enable rewrite;
select sdept,count(*) from student_trans group by sdept;
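
A few follow-up statements for managing the materialized view, as a sketch that assumes Hive 3 or later and the student_trans_agg view created above:

```sql
-- Lists the materialized views in the current database.
show materialized views;

-- After new rows are written to student_trans, the stored result is stale
-- until the view is rebuilt.
alter materialized view student_trans_agg rebuild;

-- Remove the materialized view once it is no longer needed.
drop materialized view student_trans_agg;
```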