ClickHouse is a distributed, column-oriented OLAP database open-sourced by Yandex, the Russian search giant. Its feature list looks impressive on paper, but the single most attractive thing about ClickHouse is that it is fast. Still, the proof of the pudding is in the eating, so let's try it out.
The ClickHouse project provides an official Docker image; simply run
docker run -d --name clickhouse-server --ulimit nofile=262144:262144 -p 9000:9000 yandex/clickhouse-server:1.1
and it is up and running. But we want a bit more control over the configuration.
First, copy the default configuration out of the image:
mkdir etc
mkdir data
docker run -it --rm --entrypoint=/bin/bash -v $PWD:/work --privileged=true --user=root yandex/clickhouse-server:1.1
cp -r /etc/clickhouse-server/* /work/etc/
exit
Then run:
docker run -d --name clickhouse-server \
--ulimit nofile=262144:262144 \
-p 9000:9000 \
-v $PWD/etc:/etc/clickhouse-server \
-v $PWD/data:/var/lib/clickhouse \
--privileged=true --user=root \
yandex/clickhouse-server:1.1
Once ClickHouse is running, you can head over to the official tutorial and play around.
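Before that, a quick sanity check never hurts. A minimal sketch, assuming the container is named clickhouse-server as above and using the clickhouse-client binary bundled in the server image:
docker exec -it clickhouse-server clickhouse-client --query="SELECT version()"
If this prints a version number, the server is accepting connections on the native port.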
However good ClickHouse's single-node performance is, one machine always has a ceiling. Fortunately, ClickHouse can cope with data growth by sharding across a cluster.
With docker-compose we can easily spin up a ClickHouse cluster. Let's build a 3-shard ClickHouse cluster:
mkdir -p clickhouse-3shards/ch01
mkdir -p clickhouse-3shards/ch02
mkdir -p clickhouse-3shards/ch03
mkdir -p clickhouse-3shards/ch01/data
mkdir -p clickhouse-3shards/ch02/data
mkdir -p clickhouse-3shards/ch03/data
cp -r etc clickhouse-3shards/ch01/etc
cp -r etc clickhouse-3shards/ch02/etc
cp -r etc clickhouse-3shards/ch03/etc
vim docker-compose.yaml
Configuring the ClickHouse cluster
For easier management, we pull the cluster-related settings out into a separate metrika.xml. In each node's config.xml, add the following line to include metrika.xml:
<include_from>/etc/clickhouse-server/metrika.xml</include_from>
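For the include to take effect, config.xml must also reference the included sections by name through incl attributes. The stock config.xml shipped in the image already carries entries along these lines, but it is worth double-checking your copy:
<remote_servers incl="clickhouse_remote_servers" />
<zookeeper incl="zookeeper-servers" optional="true" />
<macros incl="macros" optional="true" />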
metrika.xml
<yandex>
    <clickhouse_remote_servers>
        <!-- cluster name -->
        <perftest_3shards_1replicas>
            <!-- one entry per shard -->
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse01</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse02</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse03</host>
                    <port>9000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>
    <!-- Macros are substituted when creating replicated/distributed tables; they must differ on every node -->
    <macros>
        <shard>01</shard>
        <replica>01</replica>
    </macros>
    <networks>
        <ip>::/0</ip>
    </networks>
</yandex>
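Since we copied the same etc directory to all three nodes, the macros section is currently identical everywhere, while it is supposed to be unique per node. Macros only come into play once Replicated* tables are used, but it costs nothing to set them properly now. A sketch of values matching the layout above (my own assignment, adjust to taste):
ch01: <macros><shard>01</shard><replica>01</replica></macros>
ch02: <macros><shard>02</shard><replica>01</replica></macros>
ch03: <macros><shard>03</shard><replica>01</replica></macros>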
docker-compose.yaml
version: '2'
services:
  clickhouse01:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    ports:
      - "9001:9000"
    volumes:
      - ./ch01/etc:/etc/clickhouse-server
      - ./ch01/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    privileged: true
  clickhouse02:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    ports:
      - "9002:9000"
    volumes:
      - ./ch02/etc:/etc/clickhouse-server
      - ./ch02/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    privileged: true
  clickhouse03:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    ports:
      - "9003:9000"
    volumes:
      - ./ch03/etc:/etc/clickhouse-server
      - ./ch03/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    privileged: true
Once everything is configured, start the cluster:
docker-compose up -d
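Before connecting, it is worth a quick check that all three containers actually came up:
docker-compose ps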
Without further ado, let's create a local table and a Distributed table on every clickhouse-server.
clickhouse-client --port=9001
# do the same on the other clickhouse-servers:
# clickhouse-client --port=9002
# clickhouse-client --port=9003
CREATE TABLE chtest_local (TDate Date,Value UInt16) ENGINE = MergeTree(TDate, (Value, TDate), 8192);
CREATE TABLE chtest_all AS chtest_local ENGINE = Distributed(perftest_3shards_1replicas, default, chtest_local, rand());
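The Distributed engine arguments are, in order: the cluster name defined in metrika.xml, the database, the underlying local table, and the sharding key (rand() here, so rows are spread randomly across the shards). To confirm that every node actually sees the cluster definition, you can query the system.clusters table from inside any client session, for example:
SELECT cluster, shard_num, replica_num, host_name FROM system.clusters;
Each of the three shards should show up with its host name; if nothing from perftest_3shards_1replicas appears, the metrika.xml include did not take effect.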
Insert some data on any one of the nodes:
clickhouse-client --port=9001
insert into chtest_all (TDate,Value) values ('2017-12-25', 111);
insert into chtest_all (TDate,Value) values ('2017-12-25', 222);
insert into chtest_all (TDate,Value) values ('2017-12-26', 333);
insert into chtest_local (TDate,Value) values ('2017-12-26', 444);
After that, querying chtest_all returns all of the data. Note that rows written directly into chtest_local can also be read back through chtest_all.
:) select * from chtest_all;
SELECT *
FROM chtest_all
┌──────TDate─┬─Value─┐
│ 2017-12-26 │ 444 │
└────────────┴───────┘
┌──────TDate─┬─Value─┐
│ 2017-12-25 │ 111 │
└────────────┴───────┘
┌──────TDate─┬─Value─┐
│ 2017-12-25 │ 222 │
└────────────┴───────┘
┌──────TDate─┬─Value─┐
│ 2017-12-26 │ 333 │
└────────────┴───────┘
4 rows in set. Elapsed: 0.008 sec.
:) select * from chtest_local;
SELECT *
FROM chtest_local
┌──────TDate─┬─Value─┐
│ 2017-12-26 │ 444 │
└────────────┴───────┘
┌──────TDate─┬─Value─┐
│ 2017-12-25 │ 111 │
└────────────┴───────┘
2 rows in set. Elapsed: 0.006 sec.
:) select * from chtest_local;
SELECT *
FROM chtest_local
┌──────TDate─┬─Value─┐
│ 2017-12-25 │ 222 │
└────────────┴───────┘
1 rows in set. Elapsed: 0.005 sec.
If any shard goes down, the Distributed table can no longer be read, and writes to the Distributed table fail whenever a row is routed to the dead shard.
docker stop cluster3shards_clickhouse03_1
clickhouse-client --port=9001
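Inside that client session, with the default settings, a read of the Distributed table should now fail because clickhouse03 is unreachable; a minimal check (a sketch of what to run, not captured output):
-- expected to raise an error while clickhouse03 is down
SELECT count() FROM chtest_all;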
In a distributed system, we need to store multiple replicas of the data to keep the service available.
ClickHouse relies on ZooKeeper to synchronize replicas, so we modify docker-compose.yaml, add a ZooKeeper service plus one more ClickHouse node (create ch04/etc and ch04/data the same way as before), and configure a highly available cluster with 2 shards and 2 replicas.
version: '2'
services:
  zookeeper:
    image: zookeeper:3.5
    ports:
      - "2181:2181"
      - "2182:2182"
  clickhouse01:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    privileged: true
    ports:
      - "9001:9000"
    volumes:
      - ./ch01/etc:/etc/clickhouse-server
      - ./ch01/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    depends_on:
      - "zookeeper"
  clickhouse02:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    privileged: true
    ports:
      - "9002:9000"
    volumes:
      - ./ch02/etc:/etc/clickhouse-server
      - ./ch02/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    depends_on:
      - "zookeeper"
  clickhouse03:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    privileged: true
    ports:
      - "9003:9000"
    volumes:
      - ./ch03/etc:/etc/clickhouse-server
      - ./ch03/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    depends_on:
      - "zookeeper"
  clickhouse04:
    image: yandex/clickhouse-server:1.1
    expose:
      - "9000"
    user: root
    privileged: true
    ports:
      - "9004:9000"
    volumes:
      - ./ch04/etc:/etc/clickhouse-server
      - ./ch04/data:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    depends_on:
      - "zookeeper"
Cluster configuration (metrika.xml):
<yandex>
    <clickhouse_remote_servers>
        <perftest_2shards_2replicas>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse01</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse03</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse02</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse04</host>
                    <port>9000</port>
                </replica>
            </shard>
        </perftest_2shards_2replicas>
    </clickhouse_remote_servers>
    <macros>
        <shard>01</shard>
        <replica>02</replica>
    </macros>
    <zookeeper-servers>
        <node index="1">
            <host>zookeeper</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>
    <networks>
        <ip>::/0</ip>
    </networks>
</yandex>
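As before, every node keeps the same remote_servers and zookeeper-servers sections; only macros differs per node. Assuming ch01/ch03 form shard 1 and ch02/ch04 form shard 2, as configured above, a reasonable assignment would be:
ch01: <macros><shard>01</shard><replica>01</replica></macros>
ch02: <macros><shard>02</shard><replica>01</replica></macros>
ch03: <macros><shard>01</shard><replica>02</replica></macros>
ch04: <macros><shard>02</shard><replica>02</replica></macros>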
Once the cluster is up, create the tables on every node. The {shard} and {replica} placeholders in the ReplicatedMergeTree path below are filled in from each node's macros section:
CREATE TABLE chtest_local (TDate Date,Value UInt16) ENGINE = MergeTree(TDate, (Value, TDate), 8192);
CREATE TABLE chtest_replica (TDate Date,Value UInt16)
ENGINE = ReplicatedMergeTree(
'/clickhouse_perftest/tables/{shard}/ontime',
'{replica}',
TDate,
(Value, TDate),
8192);
CREATE TABLE chtest_all AS chtest_replica ENGINE = Distributed(perftest_2shards_2replicas, default, chtest_replica, rand());
Afterwards, insert data into chtest_all from any node, and you will see that ch01 and ch03 hold identical data, as do ch02 and ch04.
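For example (ports as mapped in docker-compose.yaml above; a sketch rather than captured output):
clickhouse-client --port=9001
INSERT INTO chtest_all (TDate, Value) VALUES ('2017-12-27', 555);
SELECT * FROM chtest_replica;
-- then connect to the other replica of the same shard and compare
clickhouse-client --port=9003
SELECT * FROM chtest_replica;
Whichever shard rand() routes the row to, the two replicas of that shard end up with the same contents.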
Overall, the ClickHouse experience is quite pleasant: good performance, easy distribution, and it works out of the box. Cluster management, however, is still weak. There is no central control node, so, for example, adding or removing nodes means updating the configuration on every node, and you have to build your own tooling to manage that.
References
https://clickhouse.yandex/tutorial.html
http://www.cnblogs.com/gomysql/p/6708650.html
http://jackpgao.github.io/2017/12/13/ClickHouse-Cluster-Beginning-to-End/