Converting to Platform SDK Python

Guide to convert Data Access code to Platform SDK in Python

The data_access_sdk_python package will be deprecated soon; the new platform_sdk Python package is its replacement. This guide shows how to convert your code to the new SDK.

Build Authentication

Authentication is built from a client context. Here is how to build one:

#Python
from platform_sdk.client_context import ClientContext
client_context = ClientContext(api_key=<api key>,
              org_id=<ims org id>,
              user_token=<user token>,
              service_token=<service token>)
#R
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
client_context <- psdk$client_context$ClientContext(api_key=<api key>,
              org_id=<ims org id>,
              user_token=<user token>,
              service_token=<service token>)

If you are using a DSW Notebook, this is already set for you. However, if you wish to change the IMS org, you must build the client_context manually using the steps above.

#Python
client_context = PLATFORM_SDK_CLIENT_CONTEXT
#R
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")

py_run_file("../.ipython/profile_default/startup/platform_sdk_context.py")
client_context <- py$PLATFORM_SDK_CLIENT_CONTEXT

Basic Read

#Old Python
from data_access_sdk_python.reader import DataSetReader
reader = DataSetReader()
df = reader.load(data_set_id="<dataset id>", ims_org="<ims org id>")
df.head()
#New Python
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(client_context, "<dataset id>")
df = dataset_reader.limit(100).read()
df.head()
#Old R
library(reticulate)
use_python("/usr/local/bin/python3")
psdk <- import("data_access_sdk_python")
reader <- psdk$reader$DataSetReader()
df <- reader$load(data_set_id="<dataset id>", ims_org="<ims org id>")
df
#New R
DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(client_context, "<dataset id>") 
df <- dataset_reader$read() 
df

Note that the IMS org is now set when building the client_context.

Batch ID

In the new package, batch IDs are no longer referenced. Use offset() and limit() to scope the data read instead; a paging sketch follows the example below.

#Python
df = dataset_reader.limit(100).offset(1).read()
df.head()
#R
df <- dataset_reader$limit(10L)$offset(1L)$read() 
df
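
If you previously iterated over batches, one hedged alternative is to page through the dataset with offset() and limit(). This is a sketch only: it assumes offset() skips rows and that an empty dataframe marks the end of the data, neither of which is confirmed by the examples above.

#Python
# Hypothetical paging loop: read 100 rows at a time until a chunk
# comes back empty (assumes row-based offsets).
chunk_size = 100
offset = 0
while True:
    chunk = dataset_reader.limit(chunk_size).offset(offset).read()
    if chunk.empty:
        break
    # process the chunk here
    offset += chunk_size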

Filter by Date

For Experience Event datasets (e.g., datasets from Adobe Analytics), here is a comparison of the old and new methods of filtering:

#Old Python
from datetime import date
df = reader.load(data_set_id="<dataset id>", ims_org="<ims org id>",\
     batch_id="<batch id>",\
     date_after=date(<YEAR>,<MONTH>,<DAY>), date_before=date(<YEAR>,<MONTH>,<DAY>))
#New Python
df = dataset_reader.where(
    dataset_reader['timestamp'].gt('2019-04-10 15:00:00').
    And(dataset_reader['timestamp'].lt('2019-04-10 17:00:00'))
).read()
df.head()
#Old R
datetime <- import("datetime", convert = FALSE)
df <- reader$load(data_set_id="<dataset id>", ims_org="<ims org id>",
                  batch_id="<batch id>",
                  date_after=datetime$date(<YEAR>L,<MONTH>L,<DAY>L), date_before=datetime$date(<YEAR>L,<MONTH>L,<DAY>L))
#New R
df2 <- dataset_reader$where(
    dataset_reader['timestamp']$gt('2018-12-10 15:00:00')$
    And(dataset_reader['timestamp']$lt('2019-04-10 17:00:00'))
)$read()
df2

In the old version, the filter granularity was one day. In the new version, the granularity is defined by the timestamp.
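
To reproduce the old one-day granularity with the new API, you can bound the timestamp at both ends of a single day, composing the same where() pattern shown above. A minimal sketch; the date is illustrative:

#Python
# Filter one full day by bounding the timestamp at the start and end
# of the day (the date shown is illustrative).
df = dataset_reader.where(
    dataset_reader['timestamp'].ge('2019-04-10 00:00:00').
    And(dataset_reader['timestamp'].le('2019-04-10 23:59:59'))
).read()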

The SDK supports the following operators to help filter the dataset (a combined example follows this list):

  • eq() = '='
  • gt() = '>'
  • ge() = '>='
  • lt() = '<'
  • le() = '<='
  • And() = and operator
  • Or() = or operator
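
As a sketch of combining these operators, the same where() pattern from above can express an OR of two equality checks. The column name country below is purely illustrative:

#Python
# Keep rows where the hypothetical 'country' column is 'US' or 'CA',
# combining eq() with Or() (the column name is an assumption).
df = dataset_reader.where(
    dataset_reader['country'].eq('US').
    Or(dataset_reader['country'].eq('CA'))
).read()
df.head()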

Load Selected Columns

A further optimization is to limit the columns during reading:

#Python
df = dataset_reader.select(['column-a','column-b']).read()
#R
df <- dataset_reader$select(c('column-a','column-b'))$read() 

Get Sorted Results

A sort option is available to get the results ordered, similar to an ORDER BY clause. The results are sorted by the specified columns of the dataset, each in its given order (asc/desc).

#Python
df = dataset_reader.sort([('column-a', 'asc'), ('column-b', 'desc')]).read()

In the example above, the dataframe is sorted by column-a in ascending order first; rows with the same value for column-a are then sorted by column-b in descending order.

#R
df <- dataset_reader$sort(list(list('column-a', 'asc'), list('column-b', 'desc')))$read()

Caveats

  • There is a hard limit of a 10-minute run time for every query. If you are running into this limit, use date filtering and selected-column reads as described above (a combined example follows this list).

  • The maximum size of a single read is 32 GB.
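
For example, a timestamp filter, a column projection, and a row limit can be chained in a single read, assuming the builder methods compose as the individual examples above suggest. Timestamps and column names here are illustrative:

#Python
# Combine date filtering, column selection, and a row limit to keep
# the query within the run-time and size caps (values are illustrative).
df = dataset_reader.where(
    dataset_reader['timestamp'].gt('2019-04-10 15:00:00').
    And(dataset_reader['timestamp'].lt('2019-04-10 17:00:00'))
).select(['column-a', 'column-b']).limit(100).read()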

Basic Write

#Old Python
from data_access_sdk_python.writer import DataSetWriter

writer = DataSetWriter()
writer.write(data_set_id="<dataset id>", dataframe=<dataframe>, ims_org="<ims org>", file_format="json")
#New Python
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter

dataset = Dataset(client_context).get_by_id("<dataset id>")
dataset_writer = DatasetWriter(client_context, dataset)
write_tracker = dataset_writer.write(<pandas dataframe>, file_format='json')
#Old R
writer <- psdk$writer$DataSetWriter()
writer$write(data_set_id="<dataset id>", dataframe=<dataframe>, ims_org="<ims org>", file_format="json")
#New R
dataset <- psdk$models$Dataset(client_context)$get_by_id("<dataset id>")
dataset_writer <- psdk$dataset_writer$DatasetWriter(client_context, dataset)
write_tracker <- dataset_writer$write(<pandas dataframe>, file_format='json')
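
As a minimal end-to-end sketch, assuming the dataset_writer built above and using placeholder column names that must match the dataset's schema:

#Python
# Build a small pandas dataframe and write it as JSON using the
# dataset_writer from the previous block (column names are placeholders).
import pandas as pd

df = pd.DataFrame({'column-a': ['value-1'], 'column-b': ['value-2']})
write_tracker = dataset_writer.write(df, file_format='json')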