Using Data Job Properties vs Secrets - vmware/versatile-data-kit GitHub Wiki
This article outlines when and how you should use Data Job Properties or Secrets.
While both mechanisms can be used somewhat interchangeably, there are certain things you should be aware of:
- Properties are used to store state. They are generally faster to access and modify than Secrets. If you need to overwrite a value often, possibly several times during a single data job execution, properties are the way to go. They are not encrypted at rest. In terms of data classification levels, they are likely appropriate for public, internal, or low-sensitivity private data.
- Secrets are used to store sensitive data. They are generally fast to read (somewhat slower than Properties) but slow to modify, because values are encrypted and decrypted during storage and retrieval. They are best suited for sensitive values such as passwords, credentials, tokens, and API keys. They are stored encrypted in a secure storage, for example a HashiCorp Vault instance, which makes them suitable for highly sensitive data.
Typical use cases for Properties:
- Last processed state: store the timestamp of the last successful run, the timestamp of the last ingested record, or the last row ID.
- Query parameters: properties are automatically expanded in SQL queries (for example, `select * from {db}.{table}`).
- Progress information: the progress of a long-running ETL task, such as the percentage of data processed.
- Configuration: the environment (staging or production) or other non-sensitive settings.
Typical use cases for Secrets:
- API keys or tokens: tasks that pull data from third-party APIs that require an API key to authenticate.
- Service account credentials: credentials for internal or third-party services, such as email server passwords.
- Cloud service credentials: access keys, client IDs, client secrets, and other sensitive credentials used when interacting with cloud services like AWS, Google Cloud, or Azure.
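To make the query-parameter expansion mentioned in the list above more concrete, here is a minimal sketch of the substitution idea in plain Python. It is not VDK's implementation: the helper name `expand_query` and the property values are hypothetical and exist only for illustration.

```python
def expand_query(sql: str, properties: dict) -> str:
    """Replace each {name} placeholder with the matching property value.

    Illustrative only: a hypothetical helper, not part of the VDK API.
    """
    for name, value in properties.items():
        sql = sql.replace("{" + name + "}", str(value))
    return sql


# Hypothetical property values for the example
props = {"db": "analytics", "table": "orders"}
expanded = expand_query("select * from {db}.{table}", props)
print(expanded)  # select * from analytics.orders
```

Because the values end up verbatim in the query text, this kind of expansion suits non-sensitive settings such as database or table names.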
You need to store the date of the last processed data entry to ensure the job begins processing new data from the correct point the next day.
In this case, the 'last processed date' can be stored as a property. It's not sensitive information but necessary for maintaining the job's state.

    import logging
    from datetime import date


    def run(job_input):
        # Get all properties of the data job
        properties = job_input.get_all_properties()
        current_date = str(date.today())
        if ('last_ingested_timestamp' not in properties) or current_date != properties['last_ingested_timestamp']:
            # some very complex processing goes here...
            # Update the property value and store it
            properties['last_ingested_timestamp'] = current_date
            job_input.set_all_properties(properties)
        else:
            logging.info("Skipped ingestion")

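The skip-if-already-processed pattern can be tried out locally without a deployed job. The sketch below is one possible way to do that, under stated assumptions: `InMemoryJobInput` is a hypothetical stand-in that only mimics the two property methods used above, and the `today` parameter and the return values are added purely to make the behaviour observable.

```python
from datetime import date


class InMemoryJobInput:
    """Hypothetical stand-in for job_input; keeps properties in a dict."""

    def __init__(self):
        self._props = {}

    def get_all_properties(self):
        return dict(self._props)

    def set_all_properties(self, properties):
        self._props = dict(properties)


def run(job_input, today=None):
    # 'today' is injectable only so the example is deterministic
    current_date = str(today or date.today())
    properties = job_input.get_all_properties()
    if properties.get("last_ingested_timestamp") == current_date:
        return "skipped"
    # ... the actual processing would go here ...
    properties["last_ingested_timestamp"] = current_date
    job_input.set_all_properties(properties)
    return "ingested"


job_input = InMemoryJobInput()
first = run(job_input, today=date(2024, 1, 1))   # nothing stored yet
second = run(job_input, today=date(2024, 1, 1))  # same day, already done
third = run(job_input, today=date(2024, 1, 2))   # next day, new data
print(first, second, third)  # ingested skipped ingested
```

The property is written only after processing completes, so a failed run leaves the old value in place and the next run retries the same day.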
You can also use the `vdk properties` command to store and retrieve properties from the command line.
To see all options and examples, run `vdk properties --help`.
Now, suppose you have to extract data from a third-party service that requires API authentication. The API key, being a sensitive piece of information, needs to be securely stored. In this scenario, you will store the API key as a secret.
You can use the `vdk secrets` command to store and retrieve secrets from the command line.
To see all options and examples, run `vdk secrets --help`.
If you are using the vdk CLI on a private, secure console, you can use the `--set-prompt` option. You are then prompted to enter the value, so it is not kept in your console's history.

    vdk secrets -n my-job -t my-team --set-prompt "api_key"

In a data job, you can access Job Secrets via the JobInput's secrets methods. In the following example we'll get the value of a single secret and use it to make an authenticated REST call:
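
    import requests
    from datetime import date, timedelta
    from vdk.api.job_input import IJobInput


    def run(job_input: IJobInput):
        # Get the API key from the Job Secrets
        api_key = job_input.get_secret('api_key')

        # Get the data
        url = ...
        # Assumption: this API accepts the key as a query parameter;
        # check the API's documentation for the actual mechanism
        params = {'api_key': api_key}
        response = requests.get(url, params=params)
        response.raise_for_status()
        data = response.json()
        # ...
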
Feature | Properties | Secrets |
---|---|---|
Recommended data type | State or non/low-sensitive data | Medium or highly sensitive data |
Use cases | state, configuration, status | passwords, API keys, tokens, credentials |
Size limit | ~10 KB per key/value | 512 bytes per key/value |
Read access | Fast | Slightly slower |
Update request rate | High (many times per job execution) | Low (usually through UI or CLI) |
Backend storage | OLTP Database | HashiCorp Vault |
Encryption at rest | No | Yes |
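The decision the table describes can be condensed into a small helper. This is only a sketch of the reasoning, not part of VDK: the function name `choose_store` is hypothetical, the limits are taken from the table (with "~10 KB" approximated as 10 * 1024 bytes), and it assumes the limit applies to the UTF-8 encoded value alone.

```python
# Limits taken from the comparison table; the property limit is approximate
PROPERTY_LIMIT_BYTES = 10 * 1024
SECRET_LIMIT_BYTES = 512


def choose_store(value: str, sensitive: bool) -> str:
    """Suggest 'properties' or 'secrets' for a value (hypothetical helper)."""
    size = len(value.encode("utf-8"))
    if sensitive:
        # Sensitive data belongs in Secrets, which are encrypted at rest
        if size > SECRET_LIMIT_BYTES:
            raise ValueError(f"value is {size} bytes; secrets allow {SECRET_LIMIT_BYTES}")
        return "secrets"
    if size > PROPERTY_LIMIT_BYTES:
        raise ValueError(f"value is {size} bytes; properties allow about {PROPERTY_LIMIT_BYTES}")
    return "properties"


print(choose_store("2024-01-01", sensitive=False))  # properties
print(choose_store("my-api-token", sensitive=True))  # secrets
```

Sensitivity is checked first on purpose: a sensitive value that is too large for Secrets should not silently fall back to the unencrypted Properties store.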
By understanding these differences, you can optimize your data jobs and maintain best practices for data security and efficiency.
Remember to consider the nature of your data before deciding whether to use properties or secrets.