Data access in JupyterLab notebooks
Each supported kernel provides built-in functionality that allows you to read Platform data from a dataset within a notebook. Currently, JupyterLab in Adobe Experience Platform Data Science Workspace supports notebooks for Python, R, PySpark, and Scala. However, support for paginating data is limited to Python and R notebooks. This guide focuses on how to use JupyterLab notebooks to access your data.
Getting started
Before reading this guide, please review the JupyterLab user guide for a high-level introduction to JupyterLab and its role within Data Science Workspace.
Notebook data limits notebook-data-limits
The following information defines the maximum amount of data that can be read, the type of data that was used, and the estimated timeframe it takes to read the data.
For Python and R, the benchmarks used a notebook server configured with 40GB of RAM. For PySpark and Scala, the benchmarks outlined below used a Databricks cluster configured with 64GB of RAM, 8 cores, and 2 DBU, with a maximum of 4 workers.
The ExperienceEvent schema data used in the benchmarks varied in size, starting from one thousand (1K) rows and ranging up to one billion (1B) rows. Note that for the PySpark and Spark metrics, a date span of 10 days was used for the XDM data.
The ad-hoc schema data was pre-processed using Query Service Create Table as Select (CTAS). This data also varied in size, starting from one thousand (1K) rows and ranging up to one billion (1B) rows.
When to use batch mode vs interactive mode mode
When reading datasets with PySpark and Scala notebooks, you have the option of using interactive mode or batch mode. Interactive mode is intended for quick results on smaller datasets, whereas batch mode is intended for large datasets.
Python notebook data limits
XDM ExperienceEvent schema: You should be able to read a maximum of 2 million rows (~6.1 GB data on disk) of XDM data in less than 22 minutes. Reading more rows than this may result in errors.
ad-hoc schema: You should be able to read a maximum of 5 million rows (~5.6 GB data on disk) of non-XDM (ad-hoc) data in less than 14 minutes. Reading more rows than this may result in errors.
R notebook data limits
XDM ExperienceEvent schema: You should be able to read a maximum of 1 million rows of XDM data (3GB data on disk) in under 13 minutes.
ad-hoc schema: You should be able to read a maximum of 3 million rows of ad-hoc data (293MB data on disk) in around 10 minutes.
PySpark (Python kernel) notebook data limits pyspark-data-limits
XDM ExperienceEvent schema: In interactive mode, you should be able to read a maximum of 5 million rows (~13.42GB data on disk) of XDM data in around 20 minutes. Interactive mode only supports up to 5 million rows. If you wish to read larger datasets, it is suggested you switch to batch mode. In batch mode, you should be able to read a maximum of 500 million rows (~1.31TB data on disk) of XDM data in around 14 hours.
ad-hoc schema: In interactive mode, you should be able to read a maximum of 5 million rows (~5.36GB data on disk) of non-XDM data in less than 3 minutes. In batch mode, you should be able to read a maximum of 1 billion rows (~1.05TB data on disk) of non-XDM data in around 18 minutes.
Spark (Scala kernel) notebook data limits scala-data-limits
XDM ExperienceEvent schema: In interactive mode, you should be able to read a maximum of 5 million rows (~13.42GB data on disk) of XDM data in around 18 minutes. Interactive mode only supports up to 5 million rows. If you wish to read larger datasets, it is suggested you switch to batch mode. In batch mode, you should be able to read a maximum of 500 million rows (~1.31TB data on disk) of XDM data in around 14 hours.
ad-hoc schema: In interactive mode, you should be able to read a maximum of 5 million rows (~5.36GB data on disk) of non-XDM data in less than 3 minutes. In batch mode, you should be able to read a maximum of 1 billion rows (~1.05TB data on disk) of non-XDM data in around 16 minutes.
Python notebooks python-notebook
Python notebooks allow you to paginate data when accessing datasets. Sample code to read data with and without pagination is demonstrated below. For more information on the available starter Python notebooks, visit the JupyterLab Launcher section within the JupyterLab user guide.
The Python documentation below outlines the following concepts:
- Read from a dataset in Python
- Write to a dataset in Python
- Query data using Query Service in Python
- Filter ExperienceEvent data
Read from a dataset in Python python-read-dataset
Without pagination:
Executing the following code will read the entire dataset. If the execution is successful, then data will be saved as a Pandas dataframe referenced by the variable df.
# Python
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df = dataset_reader.read()
df.head()
With pagination:
Executing the following code will read data from the specified dataset. Pagination is achieved by limiting and offsetting data through the functions limit() and offset(), respectively. Limiting data refers to the maximum number of data points to be read, while offsetting refers to the number of data points to skip prior to reading data. If the read operation executes successfully, then data will be saved as a Pandas dataframe referenced by the variable df.
# Python
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df = dataset_reader.limit(100).offset(10).read()
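Pagination can also be combined with a loop to read a large dataset in smaller chunks. The following is a minimal sketch only (it is not part of the SDK documentation); it assumes limit() and offset() can be re-applied before each read() call and that a read past the end of the dataset returns an empty dataframe. The page size of 10,000 rows is an arbitrary example value.
# Python
import pandas as pd
from platform_sdk.dataset_reader import DatasetReader

dataset_reader = DatasetReader(get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")

page_size = 10000   # arbitrary example page size
offset = 0
pages = []

while True:
    # Read one page of rows, then advance the offset to the next page.
    page = dataset_reader.limit(page_size).offset(offset).read()
    if page.empty:
        break
    pages.append(page)
    offset += page_size

# Combine all pages into a single Pandas dataframe.
df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()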
Write to a dataset in Python write-python
In order to write to a dataset in your JupyterLab notebook, select the Data icon tab (highlighted below) in the left-navigation of JupyterLab. The Datasets and Schemas directories appear. Select Datasets, right-click the dataset you wish to use, then select the Write Data in Notebook option from the dropdown menu. An executable code entry appears at the bottom of your notebook.
- Use Write Data in Notebook to generate a write cell with your selected dataset.
- Use Explore Data in Notebook to generate a read cell with your selected dataset.
- Use Query Data in Notebook to generate a basic query cell with your selected dataset.
Alternatively, you can copy and paste the following code cell. Replace {DATASET_ID} with the ID of the target dataset and {PANDA_DATAFRAME} with the name of your pandas dataframe.
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter
dataset = Dataset(get_platform_sdk_client_context()).get_by_id(dataset_id="{DATASET_ID}")
dataset_writer = DatasetWriter(get_platform_sdk_client_context(), dataset)
write_tracker = dataset_writer.write({PANDA_DATAFRAME}, file_format='json')
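For example, {PANDA_DATAFRAME} can be replaced with any pandas dataframe whose columns match the target dataset's schema. The following sketch builds a small, purely hypothetical dataframe; the column names are placeholders and must be changed to match your schema.
# Python
import pandas as pd
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter

# Hypothetical example data -- replace the columns with ones that match your dataset's schema.
example_df = pd.DataFrame({
    "exampleId": ["u001", "u002"],
    "exampleValue": [42, 17]
})

dataset = Dataset(get_platform_sdk_client_context()).get_by_id(dataset_id="{DATASET_ID}")
dataset_writer = DatasetWriter(get_platform_sdk_client_context(), dataset)
write_tracker = dataset_writer.write(example_df, file_format='json')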
Query data using Query Service in Python query-data-python
JupyterLab on Platform allows you to use SQL in a Python notebook to access data through Adobe Experience Platform Query Service. Accessing data through Query Service can be useful for dealing with large datasets due to its superior running times. Be advised that querying data using Query Service has a processing time limit of ten minutes.
Before you use Query Service in JupyterLab, ensure you have a working understanding of the Query Service SQL syntax.
Querying data using Query Service requires you to provide the name of the target dataset. You can generate the necessary code cells by finding the desired dataset using the Data explorer. Right-click the dataset listing and click Query Data in Notebook to generate two code cells in your notebook. These two cells are outlined in more detail below.
In order to utilize Query Service in JupyterLab, you must first create a connection between your working Python notebook and Query Service. This can be achieved by executing the first generated cell.
qs_connect()
In the second generated cell, the first line must be defined before the SQL query. By default, the generated cell defines an optional variable (df0) which saves the query results as a Pandas dataframe.
The -c QS_CONNECTION argument is mandatory and tells the kernel to execute the SQL query against Query Service. See the appendix for a list of additional arguments.
%%read_sql df0 -c QS_CONNECTION
SELECT *
FROM name_of_the_dataset
LIMIT 10
/* Querying table "name_of_the_dataset" (datasetId: {DATASET_ID})*/
Python variables can be directly referenced within a SQL query by using string-formatted syntax and wrapping the variables in curly brackets ({}), as shown in the following example:
table_name = 'name_of_the_dataset'
table_columns = ','.join(['col_1','col_2','col_3'])
%%read_sql demo -c QS_CONNECTION
SELECT {table_columns}
FROM {table_name}
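Once the query completes, the results are available in the pandas dataframe named after %%read_sql (demo in the example above) and can be inspected with standard pandas calls in a subsequent cell, for example:
# Python
demo.head()       # preview the first rows returned by the query
demo.describe()   # summary statistics for the selected columns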
Filter ExperienceEvent data python-filter
In order to access and filter an ExperienceEvent dataset in a Python notebook, you must provide the ID of the dataset ({DATASET_ID}) along with the filter rules that define a specific time range using logical operators. When a time range is defined, any specified pagination is ignored and the entire dataset is considered.
The available filtering operators are described below:
- eq(): Equal to
- gt(): Greater than
- ge(): Greater than or equal to
- lt(): Less than
- le(): Less than or equal to
- And(): Logical AND operator
- Or(): Logical OR operator
The following cell filters an ExperienceEvent dataset to data existing exclusively between January 1, 2019 and the end of December 31, 2019.
# Python
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df = dataset_reader.\
where(dataset_reader["timestamp"].gt("2019-01-01 00:00:00").\
And(dataset_reader["timestamp"].lt("2019-12-31 23:59:59"))\
).read()
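The same pattern works with the other operators listed above. For example, the following minimal sketch (illustrative only, using the same DatasetReader API) applies ge() and le() so that both boundaries are inclusive:
# Python
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df = dataset_reader.\
    where(dataset_reader["timestamp"].ge("2019-01-01 00:00:00").\
    And(dataset_reader["timestamp"].le("2019-12-31 23:59:59"))\
    ).read()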
R notebooks r-notebooks
R notebooks allow you to paginate data when accessing datasets. Sample code to read data with and without pagination is demonstrated below. For more information on the available starter R notebooks, visit the JupyterLab Launcher section within the JupyterLab user guide.
The R documentation below outlines the following concepts:
- Read from a dataset in R
- Write to a dataset in R
- Filter ExperienceEvent data
Read from a dataset in R r-read-dataset
Without pagination:
Executing the following code will read the entire dataset. If the execution is successful, then data will be saved as a Pandas dataframe referenced by the variable df0.
# R
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
datetime <- import("datetime", convert = FALSE)
py_run_file("~/.ipython/profile_default/startup/platform_sdk_context.py")
DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(py$get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df0 <- dataset_reader$read()
head(df0)
With pagination:
Executing the following code will read data from the specified dataset. Pagination is achieved by limiting and offsetting data through the functions limit() and offset(), respectively. Limiting data refers to the maximum number of data points to be read, while offsetting refers to the number of data points to skip prior to reading data. If the read operation executes successfully, then data will be saved as a Pandas dataframe referenced by the variable df0.
# R
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
datetime <- import("datetime", convert = FALSE)
py_run_file("~/.ipython/profile_default/startup/platform_sdk_context.py")
DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(py$get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df0 <- dataset_reader$limit(100L)$offset(10L)$read()
Write to a dataset in R write-r
In order to write to a dataset in your JupyterLab notebook, select the Data icon tab (highlighted below) in the left-navigation of JupyterLab. The Datasets and Schemas directories appear. Select Datasets, right-click the dataset you wish to use, then select the Write Data in Notebook option from the dropdown menu. An executable code entry appears at the bottom of your notebook.
- Use Write Data in Notebook to generate a write cell with your selected dataset.
- Use Explore Data in Notebook to generate a read cell with your selected dataset.
Alternatively, you can copy and paste the following code cell:
psdk <- import("platform_sdk")
dataset <- psdk$models$Dataset(py$get_platform_sdk_client_context())$get_by_id(dataset_id="{DATASET_ID}")
dataset_writer <- psdk$dataset_writer$DatasetWriter(py$get_platform_sdk_client_context(), dataset)
write_tracker <- dataset_writer$write(df, file_format='json')
Filter ExperienceEvent data r-filter
In order to access and filter an ExperienceEvent dataset in an R notebook, you must provide the ID of the dataset ({DATASET_ID}) along with the filter rules that define a specific time range using logical operators. When a time range is defined, any specified pagination is ignored and the entire dataset is considered.
The available filtering operators are described below:
- eq(): Equal to
- gt(): Greater than
- ge(): Greater than or equal to
- lt(): Less than
- le(): Less than or equal to
- And(): Logical AND operator
- Or(): Logical OR operator
The following cell filters an ExperienceEvent dataset to data existing exclusively between January 1, 2019 and the end of December 31, 2019.
# R
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
datetime <- import("datetime", convert = FALSE)
py_run_file("~/.ipython/profile_default/startup/platform_sdk_context.py")
client_context <- py$PLATFORM_SDK_CLIENT_CONTEXT
DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(py$get_platform_sdk_client_context(), dataset_id="{DATASET_ID}")
df0 <- dataset_reader$
where(dataset_reader["timestamp"]$gt("2019-01-01 00:00:00")$
And(dataset_reader["timestamp"]$lt("2019-12-31 23:59:59"))
)$read()
PySpark 3 notebooks pyspark-notebook
The PySpark documentation below outlines the following concepts:
- Initializing SparkSession
- Using %dataset to read and write with a PySpark 3 notebook
- Create a local dataframe
- Filter ExperienceEvent data
Initializing SparkSession spark-initialize
All Spark 2.4 notebooks require that you initialize the session with the following boilerplate code.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Using %dataset to read and write with a PySpark 3 notebook magic
With the introduction of Spark 2.4, the %dataset custom magic is supplied for use in PySpark 3 (Spark 2.4) notebooks. For more details on magic commands available in the IPython kernel, visit the IPython magic documentation.
Usage

%dataset {action} --datasetId {id} --dataFrame {df} --mode batch

Description

A custom Data Science Workspace magic command for reading or writing a dataset from a PySpark notebook (Python 3 kernel).

- {action}: The action to perform on the dataset, either read or write.
- --datasetId {id}: The ID of the dataset to read or write.
- --dataFrame {df}: The pandas dataframe.
  - When the action is "read", {df} is the variable where results of the dataset read operation are available (such as a dataframe).
  - When the action is "write", this dataframe {df} is written to the dataset.
- --mode: Should be set to interactive or batch. It is recommended you use interactive mode for increased query performance on smaller datasets.

Examples
- Read example:
%dataset read --datasetId 5e68141134492718af974841 --dataFrame pd0 --mode batch
- Write example:
%dataset write --datasetId 5e68141134492718af974842 --dataFrame pd0 --mode batch
Calling df.cache() before writing data can greatly improve notebook performance. This can help if you are receiving any of the following errors:
- Job aborted due to stage failure … Can only zip RDDs with same number of elements in each partition.
- Remote RPC client disassociated and other memory errors.
- Poor performance when reading and writing datasets.
You can auto-generate the above examples in JupyterLab by using the following method:
Select the Data icon tab (highlighted below) in the left-navigation of JupyterLab. The Datasets and Schemas directories appear. Select Datasets, right-click the dataset you wish to use, then select the Write Data in Notebook option from the dropdown menu. An executable code entry appears at the bottom of your notebook.
- Use Explore Data in Notebook to generate a read cell.
- Use Write Data in Notebook to generate a write cell.
Create a local dataframe pyspark-create-dataframe
To create a local dataframe using PySpark 3, use SQL queries. For example:

date_aggregation.createOrReplaceTempView("temp_df")

local_df = spark.sql('''
  SELECT *
  FROM temp_df
''')
local_df

To limit the rows returned and down-sample the result, replace the limit and fraction placeholders below with your own values:

df = spark.sql('''
  SELECT *
  FROM temp_df
  LIMIT limit
''')
sample_df = df.sample(fraction)
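For example, a purely illustrative choice of values might keep 1,000 rows and sample roughly 10% of them:
# PySpark 3 (Spark 2.4)
df = spark.sql('''
  SELECT *
  FROM temp_df
  LIMIT 1000
''')
sample_df = df.sample(fraction=0.1)   # keep roughly 10% of the limited rows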
Filter ExperienceEvent data pyspark-filter-experienceevent
Accessing and filtering an ExperienceEvent dataset in a PySpark notebook requires you to provide the dataset identity ({DATASET_ID}), your organization's IMS identity, and the filter rules defining a specific time range. A filtering time range is defined by using the function spark.sql(), where the function parameter is a SQL query string.
The following cells filter an ExperienceEvent dataset to data existing exclusively between January 1, 2019 and the end of December 31, 2019.
# PySpark 3 (Spark 2.4)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
%dataset read --datasetId {DATASET_ID} --dataFrame df --mode batch
df.createOrReplaceTempView("event")
timepd = spark.sql("""
SELECT *
FROM event
WHERE timestamp > CAST('2019-01-01 00:00:00.0' AS TIMESTAMP)
AND timestamp < CAST('2019-12-31 23:59:59.9' AS TIMESTAMP)
""")
timepd.show()
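If you want to persist the filtered result, the same %dataset magic can be used with the write action. The following is a minimal sketch, assuming {FILTERED_DATASET_ID} is a placeholder for the ID of an existing dataset whose schema matches the filtered data:
# PySpark 3 (Spark 2.4)
timepd.cache()   # caching before a write can improve performance (see the tip above)
%dataset write --datasetId {FILTERED_DATASET_ID} --dataFrame timepd --mode batch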
Scala notebooks scala-notebook
The documentation below contains examples for the following concepts:
- Initializing SparkSession
- Read a dataset
- Write to a dataset
- Create a local dataframe
- Filter ExperienceEvent data
Initializing SparkSession scala-initialize
All Scala notebooks require that you initialize the session with the following boilerplate code:
import org.apache.spark.sql.{ SparkSession }
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
Read a dataset read-scala-dataset
In Scala, you can import clientContext to get and return Platform values; this eliminates the need to define variables such as var userToken. In the Scala example below, clientContext is used to get and return all the required values needed for reading a dataset.
Calling df.cache() before writing data can greatly improve notebook performance. This can help if you are receiving any of the following errors:
- Job aborted due to stage failure … Can only zip RDDs with same number of elements in each partition.
- Remote RPC client disassociated and other memory errors.
- Poor performance when reading and writing datasets.
import org.apache.spark.sql.{Dataset, SparkSession}
import com.adobe.platform.token.ClientContext
val spark = SparkSession.builder().master("local").config("spark.sql.warehouse.dir", "/").getOrCreate()
val clientContext = ClientContext.getClientContext()
val df1 = spark.read.format("com.adobe.platform.query")
.option("user-token", clientContext.getUserToken())
.option("ims-org", clientContext.getOrgId())
.option("api-key", clientContext.getApiKey())
.option("service-token", clientContext.getServiceToken())
.option("sandbox-name", clientContext.getSandboxName())
.option("mode", "batch")
.option("dataset-id", "5e68141134492718af974844")
.load()
df1.printSchema()
df1.show(10)
In this example, the required values are fetched automatically:
- user-token: fetched using clientContext.getUserToken().
- service-token: fetched using clientContext.getServiceToken().
- ims-org: fetched using clientContext.getOrgId().
- api-key: fetched using clientContext.getApiKey().
- mode: should be set to interactive or batch.

You can auto-generate the above example in JupyterLab by using the following method:
Select the Data icon tab (highlighted below) in the left-navigation of JupyterLab. The Datasets and Schemas directories appear. Select Datasets, right-click the dataset you wish to use, then select the Explore Data in Notebook option from the dropdown menu. An executable code entry appears at the bottom of your notebook.
- Use Explore Data in Notebook to generate a read cell.
- Use Write Data in Notebook to generate a write cell.
Write to a dataset scala-write-dataset
In Scala, you can import clientContext to get and return Platform values; this eliminates the need to define variables such as var userToken. In the Scala example below, clientContext is used to define and return all the required values needed for writing to a dataset.
Calling df.cache() before writing data can greatly improve notebook performance. This can help if you are receiving any of the following errors:
- Job aborted due to stage failure … Can only zip RDDs with same number of elements in each partition.
- Remote RPC client disassociated and other memory errors.
- Poor performance when reading and writing datasets.
import org.apache.spark.sql.{Dataset, SparkSession}
import com.adobe.platform.token.ClientContext
val spark = SparkSession.builder().master("local").config("spark.sql.warehouse.dir", "/").getOrCreate()
val clientContext = ClientContext.getClientContext()
df1.write.format("com.adobe.platform.query")
.option("user-token", clientContext.getUserToken())
.option("service-token", clientContext.getServiceToken())
.option("ims-org", clientContext.getOrgId())
.option("api-key", clientContext.getApiKey())
.option("sandbox-name", clientContext.getSandboxName())
.option("mode", "batch")
.option("dataset-id", "5e68141134492718af974844")
.save()
In this example, the required values are fetched automatically:
- user-token: fetched using clientContext.getUserToken().
- service-token: fetched using clientContext.getServiceToken().
- ims-org: fetched using clientContext.getOrgId().
- api-key: fetched using clientContext.getApiKey().
- mode: should be set to interactive or batch.

Create a local dataframe scala-create-dataframe
To create a local dataframe using Scala, SQL queries are required. For example:
sparkdf.createOrReplaceTempView("sparkdf")
val localdf = spark.sql("SELECT * FROM sparkdf LIMIT 1")
Filter ExperienceEvent data scala-experienceevent
Accessing and filtering an ExperienceEvent dataset in a Scala notebook requires you to provide the dataset identity ({DATASET_ID}), your organization's IMS identity, and the filter rules defining a specific time range. A filtering time range is defined by using the function spark.sql(), where the function parameter is a SQL query string.
The following cells filter an ExperienceEvent dataset to data existing exclusively between January 1, 2019 and the end of December 31, 2019.
// Spark (Spark 2.4)
// Turn off extra logging
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("com").setLevel(Level.OFF)
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = org.apache.spark.sql.SparkSession.builder().appName("Notebook")
.master("local")
.getOrCreate()
// Stage Exploratory
val dataSetId: String = "{DATASET_ID}"
val orgId: String = sys.env("IMS_ORG_ID")
val clientId: String = sys.env("PYDASDK_IMS_CLIENT_ID")
val userToken: String = sys.env("PYDASDK_IMS_USER_TOKEN")
val serviceToken: String = sys.env("PYDASDK_IMS_SERVICE_TOKEN")
val mode: String = "batch"
var df = spark.read.format("com.adobe.platform.query")
.option("user-token", userToken)
.option("ims-org", orgId)
.option("api-key", clientId)
.option("mode", mode)
.option("dataset-id", dataSetId)
.option("service-token", serviceToken)
.load()
df.createOrReplaceTempView("event")
val timedf = spark.sql("""
SELECT *
FROM event
WHERE timestamp > CAST('2019-01-01 00:00:00.0' AS TIMESTAMP)
AND timestamp < CAST('2019-12-31 23:59:59.9' AS TIMESTAMP)
""")
timedf.show()
Next steps
This document covered the general guidelines for accessing datasets using JupyterLab notebooks. For more in-depth examples of querying datasets, visit the Query Service in JupyterLab notebooks documentation. For more information on how to explore and visualize your datasets, visit the document on analyzing your data using notebooks.
Optional SQL flags for Query Service optional-sql-flags-for-query-service
The following optional SQL flags can be used with Query Service:
- -h, --help: Show the help message and exit.
- -n, --notify: Toggle option for notifying query results.
- -a, --async: Executes the query asynchronously and can free up the kernel while the query is executing. Be cautious when assigning query results to variables, as they may be undefined if the query has not completed.
- -d, --display: Prevents query results from being displayed.
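For example, a flag can be appended to the %%read_sql line alongside the mandatory -c QS_CONNECTION argument. The cell below is an illustrative sketch only and assumes the -a/--async flag behaves as described above; the dataset name is a placeholder.
%%read_sql df_async -c QS_CONNECTION -a
SELECT *
FROM name_of_the_dataset
LIMIT 10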