Using DSBulk with Astra - datastaxdevs/awesome-astra GitHub Wiki
Written by Cédrick Lunven and Artem Chebotko
DataStax Bulk Loader or DSBulk is an efficient, flexible, easy-to-use command line utility that excels at loading, unloading, and counting data stored in Cassandra-compatible storage engines, such as OSS Apache Cassandra®, DataStax Astra DB and DataStax Enterprise (DSE).
DSBulk is commonly used to:
- Load data from JSON or CSV files to the database;
- Unload data stored in the database to JSON or CSV files;
- Count the number of rows in a given table.
# Load data
dsbulk load <options>
# Unload data
dsbulk unload <options>
# Count rows
dsbulk count <options>
Currently, CSV and JSON formats are supported for both loading and unloading data.
For more information about the DSBulk capabilities, please see the reference documentation for DSBulk.
DataStax Bulk Loader or DSBulk can be used to load data into and unload data from your DataStax Astra DB database efficiently and reliably.
For more information about the DSBulk usage with Astra DB, please see the reference documentation for DSBulk and Astra DB.
- Create an Astra Account
- Create an Astra Database
- Create an Astra Token
- Download a Secure Connect Bundle
Get the latest distribution of DSBulk by going to https://downloads.datastax.com/#bulk-loader.
Alternatively, use the curl tool to download a specific version of DSBulk:
curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz
In this tutorial, we use DataStax Bulk Loader version 1.8.0.
Extract the archive:
tar -xvzf dsbulk-1.8.0.tar.gz
Find dsbulk inside the bin directory:
cd dsbulk-1.8.0/bin
./dsbulk help
Note that using DSBulk with Astra DB requires specifying your own client id and client secret, as well as your database secure connect bundle. The client id, client secret, and secure connect bundle used in the examples below are no longer valid.
Example dsbulk load command:
./dsbulk load \
-url /tmp/input.csv \
-header true \
-k my_keyspace \
-t my_table \
-u BBygiXTpFXPLeOAdQwRLBZBB \
-p xUr4qUCjsdexniP5.0PE,e09FeZ6W1,6-OuhXTwYeUcImKvBok_P3Kh8qS1djJlRE6t_tcgneMIKhgznI7Mf6iKEGq6gZOv+MPKURA7c30Ws4atjbCwdx+WcgduZ-o43 \
-b /tmp/secure-connect-my-database.zip
# where
# -url ........ CSV file
# -header ..... file header presence
# -k .......... keyspace name
# -t .......... table name
# -u .......... client id
# -p .......... client secret
# -b .......... secure connect bundle
Example table schema:
CREATE TABLE my_keyspace.my_table (
id UUID,
name TEXT,
PRIMARY KEY(id)
);
Example CSV file with a header:
id,name
d270543c-62f5-4108-9548-5bbc50cd94fe,Alice
74871405-e108-4bf7-b4bf-2c3477ef7d6d,Bob
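To try the load command as written, the sample file can be created directly from the shell. The following is a minimal sketch that writes the two rows shown above to /tmp/input.csv, the path passed to -url in the example; the sanity checks at the end are an addition for illustration and are not required by DSBulk.

```shell
# Write the two-row sample from this page to the path used by the load example.
cat > /tmp/input.csv <<'EOF'
id,name
d270543c-62f5-4108-9548-5bbc50cd94fe,Alice
74871405-e108-4bf7-b4bf-2c3477ef7d6d,Bob
EOF

# Sanity check: print the header line, then the total number of lines
# (one header line plus two data rows).
head -n 1 /tmp/input.csv
wc -l < /tmp/input.csv
```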
Example output:
total | failed | rows/s | p50ms | p99ms | p999ms | batches
2 | 0 | 8 | 77.59 | 84.93 | 84.93 | 1.00
Operation LOAD_20220319-044851-396835 completed successfully in less than one second.
Last processed positions can be found in positions.txt
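The load command above and the count and unload commands that follow all repeat the same -u, -p and -b options. One way to avoid pasting a client secret into every command is to keep the credentials in environment variables and wrap the call in a small shell function. The sketch below is only an illustration: the names astra_dsbulk, DSBULK, ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET and ASTRA_BUNDLE are hypothetical and are not defined by DSBulk or Astra DB.

```shell
# Hypothetical variable names; DSBulk itself does not read these.
: "${DSBULK:=./dsbulk}"                                    # path to the dsbulk launcher
: "${ASTRA_BUNDLE:=/tmp/secure-connect-my-database.zip}"   # secure connect bundle

# Usage: astra_dsbulk <load|unload|count> [other dsbulk options...]
astra_dsbulk() {
  op="$1"
  shift
  # ${VAR:?message} aborts with an error if the variable is unset or empty.
  # Double quotes keep any shell-special characters in the secret intact.
  "$DSBULK" "$op" \
    -u "${ASTRA_CLIENT_ID:?set ASTRA_CLIENT_ID}" \
    -p "${ASTRA_CLIENT_SECRET:?set ASTRA_CLIENT_SECRET}" \
    -b "$ASTRA_BUNDLE" \
    "$@"
}
```

With the two credential variables set in the current shell, the count example would become `astra_dsbulk count -k my_keyspace -t my_table`. The options themselves are still passed to dsbulk on its command line, so this only keeps the secret out of scripts and repeated commands.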
Example dsbulk count command:
./dsbulk count \
-k my_keyspace \
-t my_table \
-u BBygiXTpFXPLeOAdQwRLBZBB \
-p xUr4qUCjsdexniP5.0PE,e09FeZ6W1,6-OuhXTwYeUcImKvBok_P3Kh8qS1djJlRE6t_tcgneMIKhgznI7Mf6iKEGq6gZOv+MPKURA7c30Ws4atjbCwdx+WcgduZ-o43 \
-b /tmp/secure-connect-my-database.zip
# where
# -k .......... keyspace name
# -t .......... table name
# -u .......... client id
# -p .......... client secret
# -b .......... secure connect bundle
Example output for the table with 2 rows:
total | failed | rows/s | p50ms | p99ms | p999ms
2 | 0 | 2 | 66.13 | 73.92 | 73.92
Operation COUNT_20220319-041624-950820 completed successfully in less than one second.
2
Example dsbulk unload command:
./dsbulk unload \
-k my_keyspace \
-t my_table \
-u BBygiXTpFXPLeOAdQwRLBZBB \
-p xUr4qUCjsdexniP5.0PE,e09FeZ6W1,6-OuhXTwYeUcImKvBok_P3Kh8qS1djJlRE6t_tcgneMIKhgznI7Mf6iKEGq6gZOv+MPKURA7c30Ws4atjbCwdx+WcgduZ-o43 \
-b /tmp/secure-connect-my-database.zip \
> /tmp/output.csv
# where
# -k .......... keyspace name
# -t .......... table name
# -u .......... client id
# -p .......... client secret
# -b .......... secure connect bundle
Example output for the table with 2 rows:
total | failed | rows/s | p50ms | p99ms | p999ms
2 | 0 | 2 | 71.37 | 83.89 | 83.89
Operation UNLOAD_20220319-181929-004308 completed successfully in less than one second.
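As a quick cross-check, the number of data rows in the unloaded file can be compared with the result of dsbulk count. The helper below is a rough sketch under two assumptions: the file starts with a single header line (as the input file in the load example does), and no field contains an embedded newline. The function name csv_data_rows is hypothetical.

```shell
# Hypothetical helper: count the data rows in a CSV file, skipping one header line.
# Assumes one record per line (no quoted fields that span multiple lines).
csv_data_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# Report the row count for the file written by the unload example, if it exists.
if [ -f /tmp/output.csv ]; then
  csv_data_rows /tmp/output.csv
fi
```

For the two-row table used on this page, the result should match the total reported by the count example.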
For more examples, see