Ingesting local CSV file into Database - vmware/versatile-data-kit GitHub Wiki
In this example, we will use the Versatile Data Kit (VDK) to ingest a local .csv file into a database. In our case we will use local SQLite. Thus exploring the possibility to ingest large locally stored data set into a local or remote database.
Before you continue, make sure you are familiar with the Getting Started section of the wiki.
For the purpose of this example, we will be using a simple .csv file. The file can be found here, and you can either save it, or create your own .csv file with the same structure.
NOTE: The only important thing at this step is that the .csv file that you are going to use, has the same column names, as the ones in the example file: (FirstName, LastName, Position). This is because SQLite expects the table schema, that we are going to create later in this example, to match the file structure.
If you have not done so already, you can install Versatile Data Kit and the plugins required for this example by running the following commands from a terminal:
pip install quickstart-vdk vdk-csv
Note that Versatile Data Kit requires Python 3.7+. See the main guide for getting started with quickstart-vdk for more details.
Ingestion requires us to set environment variables for:
- the type of database in which we will be ingesting;
- the ingestion method;
- the ingestion target - the location of the target database - if this file is not present, ingestion will create it in the current directory. For this example, we will use vdk-sqlite.db file which will be created in the current directory;
- the file of the default SQLite database against which VDK runs (same value as ingestion target in this case);
export VDK_DB_DEFAULT_TYPE=SQLITE
export VDK_INGEST_METHOD_DEFAULT=sqlite
export VDK_INGEST_TARGET_DEFAULT=vdk-sqlite.db
export VDK_SQLITE_FILE=vdk-sqlite.db
Note: if you want to ingest data into another target (another database for example - Postgres, Trino), install the appropriate plugin using pip install vdk-plugin-name
and change VDK_INGEST_METHOD_DEFAULT. See a list of plugins here
To see all possible configuration options use command vdk config-help
Make sure to check the help! It has pretty good documentation and examples
vdk ingest-csv --help
If the .csv file to be ingested (csv_example.csv in our example), and the vdk-sqlite.db are present in the current directory, the only thing left is to run
vdk ingest-csv -f csv_example.csv --table-name my_csv_example_table
With this command, the CSV data will be ingested into the SQLite database. Upon successful ingestion, the logs should be similar to the ones below
Result logs
2021-09-15 17:24:01,078=1631715841[VDK] csv_ingest_job [INFO ] vdk.internal.builtin_plugins.i ingester_base.py:546 close_now [OpId:1631715838-e7edd7-188650]- Ingester statistics: Successful uploads:1 Failed uploads:0 ingesting plugin errors:defaultdict(, {})2021-09-15 17:24:01,079=1631715841[VDK] csv_ingest_job [INFO ] vdk.internal.builtin_plugins.r cli_run.py:66 run_job [OpId:1631715838-e7edd7-188650]- Data Job execution summary: { "data_job_name": "csv_ingest_job", "execution_id": "1631715838-e7edd7", "start_time": "2021-09-15T14:23:59.033471", "end_time": "2021-09-15T14:23:59.061216", "status": "success", "steps_list": [ { "name": "csv_ingest_step.py", "type": "python", "start_time": "2021-09-15T14:23:59.033566", "end_time": "2021-09-15T14:23:59.061136", "status": "success", "details": null, "exception": null } ], "exception": null }
To verify that the data is ingested as expected, run
vdk sqlite-query -q "SELECT * FROM my_csv_example_table"
The output should be similar to the one below
------ -------- ---------
John Doe Manager
Jane Dow Engineer
Gordon Brown Engineer
Kate McGround Manager
Maria Johnson Engineer
Jack Frown Engineer
Joe Britte Team Lead
------ -------- ---------
You can find a list of all Versatile Data Kit examples here.