Loading data into your project

Data Science & AI Workbench uses projects to encapsulate all of the components necessary to use or run an application: the relevant packages, channels, scripts, notebooks and other related files, environment variables, services and commands, along with a configuration file named anaconda-project.yml. You can also access and load data in a variety of formats, stored in common sources including the following:

File systems
NFS shared drives
Databases
Hadoop and Spark clusters
Distributed version control repositories such as Git and Bitbucket (if configured by your Administrator).

The amount of data you read into your project will impact the resources required to successfully run the project, whether in a notebook session or deployment. See the following section on understanding resource profiles to learn more.

Understanding resource profiles

Resource profiles limit the number of CPU cores and the amount of RAM available to a project session or deployment.

Choosing a resource profile with more cores is not guaranteed to improve performance. Whether it helps also depends on factors such as whether the libraries used by the project can take advantage of multiple cores.

Memory limits are enforced by the Linux kernel, so when a project exceeds its memory limit the most recent process will crash. To avoid such errors, select a resource profile that offers enough runtime resources for your project. As a rule of thumb, choose a resource profile with roughly double the memory required by the data you need to read. To see the total memory in use, in MiB, open a terminal and run the following command:

cat /sys/fs/cgroup/memory/memory.usage_in_bytes | awk '{print $1/1024/1024}'
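
The same check can be made from a notebook. The following is a minimal sketch rather than an official Workbench API: it reads the same file as the shell command above and applies the "roughly double" rule of thumb. The helper names memory_in_use_mib and suggested_profile_mib are hypothetical, and the 500 MiB dataset size is an assumed example value.

```python
from pathlib import Path

# Same cgroup v1 file the shell command above reads; it may not exist on every host.
USAGE_FILE = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")
BYTES_PER_MIB = 1024 * 1024


def memory_in_use_mib(usage_file=USAGE_FILE):
    """Read a byte count from usage_file and return it converted to MiB."""
    return int(Path(usage_file).read_text().strip()) / BYTES_PER_MIB


def suggested_profile_mib(data_bytes, factor=2.0):
    """Apply the 'roughly double' rule of thumb to a data size given in bytes."""
    return data_bytes * factor / BYTES_PER_MIB


if __name__ == "__main__":
    if USAGE_FILE.exists():
        print(f"Memory in use: {memory_in_use_mib():.1f} MiB")
    # Assumed example: a dataset that needs about 500 MiB once loaded.
    print(f"Suggested profile memory: {suggested_profile_mib(500 * BYTES_PER_MIB):.0f} MiB")
```

Comparing the two numbers before loading a large file gives a rough idea of whether the current resource profile leaves enough headroom.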

Uploading files to a project

Open an editing session for the project, then choose the file you want to upload. The process of uploading files varies slightly, based on the editor used:

In Jupyter Notebook, click Upload and select the file to upload. Then click the blue Upload button displayed in the file’s row to add the file to the project.
In JupyterLab, click the Upload files icon and select the file. In the top right corner, click Commit Changes to add the file to your project.
In Zeppelin, use the Import note feature to select a JSON file or add data from a URL.

Once a file is in the project, you can use code to read it. For example, to load the iris dataset from a comma-separated values (CSV) file into a pandas DataFrame:

import pandas as pd
irisdf = pd.read_csv('iris.csv')

Accessing NFS shared drives

After your Administrator has configured Workbench to mount an NFS share, you can access it from within your notebooks. You only need to know the name of the volume. For example, if your Administrator named the configuration file section myvolume, the share is mounted at /data/myvolume. From a notebook, you can use code such as this to read data from the share:

import pandas as pd
irisdf = pd.read_csv('/data/myvolume/iris.csv')
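
Before reading from a share, it can help to confirm that the volume is actually mounted and to see which files it holds. The sketch below uses only the Python standard library. The helper names share_path and list_csv_files are hypothetical, and myvolume is the example volume name used above.

```python
from pathlib import Path


def share_path(volume, root="/data"):
    """Build the mount point for a named volume, for example /data/myvolume."""
    return Path(root) / volume


def list_csv_files(volume, root="/data"):
    """Return the CSV files at the top level of the share, sorted by path."""
    mount = share_path(volume, root)
    if not mount.is_dir():
        # The share is missing or not mounted in this session.
        raise FileNotFoundError(f"{mount} is not available in this session")
    return sorted(mount.glob("*.csv"))


if __name__ == "__main__":
    try:
        for csv_file in list_csv_files("myvolume"):
            print(csv_file)
    except FileNotFoundError as err:
        print(err)
```

Each returned path can be passed directly to pd.read_csv.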

Accessing data stored in databases

You can also connect to the following database engines to access data stored within them:

Cassandra
Cockroach
Cosmos
Couchbase
Db2
Elasticsearch
MariaDB
MLDB
MongoDB
MS SQL
MySQL
Neo4j
Oracle
PostgreSQL
Redis
S3
Snowflake
Vertica

See Secrets for information about adding credentials to the platform, to make them available in your projects. Any secrets you add will be available across all sessions and deployments associated with your user account.
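
As an illustration of keeping credentials out of notebook code, the sketch below reads a password from a secret and builds a PostgreSQL connection URL. It rests on assumptions that should be checked against the Secrets documentation: that each secret is exposed to the session as a file, and that those files live under /var/run/secrets/user_credentials. The secret name db_password, the helper names, and the host, user, and database values are all hypothetical.

```python
from pathlib import Path
from urllib.parse import quote

# ASSUMPTION: secrets appear as one file per secret in this directory.
SECRETS_DIR = Path("/var/run/secrets/user_credentials")


def read_secret(name, secrets_dir=SECRETS_DIR):
    """Return the contents of the named secret file without surrounding whitespace."""
    return (Path(secrets_dir) / name).read_text().strip()


def postgres_url(user, password, host, database, port=5432):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    secret_file = SECRETS_DIR / "db_password"  # hypothetical secret name
    if secret_file.exists():
        url = postgres_url("analyst", read_secret("db_password"), "db.example.com", "sales")
        # Avoid printing the URL itself, because it contains the password.
        print("Connection URL built for host db.example.com")
```

A URL in this form is accepted by common PostgreSQL client libraries. Percent-encoding matters when a password contains characters such as @ or /.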

Hadoop Distributed File System

Loading data from HDFS, Spark, Hive, and Impala is discussed in Hadoop / Spark.
