Service: ClickHouse - EyevinnOSC/community GitHub Wiki

Getting Started

ClickHouse is the fastest and most resource efficient real-time data warehouse and open-source database. This tutorial walks you through how you quickly can get started with data warehousing in Eyevinn Open Source Cloud.

Prerequisites

Step 1: Setup secrets

Create a service secret to hold the password for the default user.

Step 2: Create a ClickHouse server instance

Click on the button "Create clickhouse-server" in the web user interface. Then enter the following values in the create dialog.

Press the button Create and wait for the instance to be in status Running.

Step 3: Insert and Query data

As an example we will create a table and insert a large dataset (two million rows of the New York taxi data). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN.

Click on the ClickHouse server instance card.

Create the following trips table by pasting below in to the input field above. Enter the user and password that you set when creating the instance.

CREATE TABLE trips
(
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;

Press button Run.

Now insert the dataset by copying and pasting this.

INSERT INTO trips
SELECT * FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
    'TabSeparatedWithNames', "
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
") SETTINGS input_format_try_infer_datetimes = 0

Verify it works with this command.

SELECT count() FROM trips

You should have about 2 million rows in your table. Now let us try with this query.

SELECT DISTINCT(pickup_ntaname) FROM trips

This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The pickup_ntaname column represents the name of the neighborhood in New York City where the taxi ride originated.

Step 4: Analyze the data

Let's run some queries to analyze the 2M rows of data...

We will start with some simple calculations, like computing the average tip amount:

SELECT round(avg(tip_amount), 2) FROM trips

This query computes the average cost based on the number of passengers:

SELECT
    passenger_count,
    ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count

Find more examples in the ClickHouse tutorials.

Using the ClickHouse client

You can also use the ClickHouse client to insert and query data from your server. Install the client.

% brew install clickhouse

Find the IP and Port on the ClickHouse server instance card.

The ClickHouse URL in this example is then clickhouse://myuser:[email protected]:10523 and what you specify when you run the client.

% clickhouse client clickhouse://myuser:[email protected]:10523
ClickHouse client version 25.1.3.23 (official build).
Connecting to 172.232.131.169:10523 as user myuser.
Connected to ClickHouse server version 24.12.3.

eyevinnlab-guide-7cc86668b-wmq57 :)

Now we can try with the same query as above.

eyevinnlab-guide-7cc86668b-wmq57 :) SELECT round(avg(tip_amount), 2) FROM trips

SELECT round(avg(tip_amount), 2)
FROM trips

Query id: 20428ab7-3f2d-420d-b05a-bab2a789ce57

   β”Œβ”€round(avg(tip_amount), 2)─┐
1. β”‚                      1.68 β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1 row in set. Elapsed: 0.017 sec. Processed 2.00 million rows, 8.00 MB (117.53 million rows/s., 470.13 MB/s.)
Peak memory usage: 71.18 KiB.