How to use Deployment - 3C-SCSU/Avatar GitHub Wiki

Using the deployment files will require having your server configured and either Podman or Kubernetes installed. For help with server configuration, Podman, or Kubernetes, see other sections of the Wiki.

If you just want to use the files, follow the steps below to run them with Podman. The Examples section at the bottom walks through running and accessing k8s.yaml.

Steps:

  1. Copy the deployment directory onto your server, then navigate inside it using cd <path to where deployment was copied>
  2. First run the pvc.yaml with the command
       podman play kube pvc.yaml

  3. Next deploy your pod with
       podman play kube k8s.yaml

  4. Find the Jupyter token using
       podman logs --tail 3 <container id for the Jupyter container>

  5. Access the VPS-hosted Jupyter session by entering
      <Your VPS IP>:10000/lab?token=<token from step 4>

What the Deployment files do

These files will allow you to launch a Jupyter Notebook instance with PySpark and TensorFlow on your VPS, accessible through a web browser. Multiple users can connect to the same instance, and notebooks created can be saved to persistent storage on the VPS which lasts between pods. A read-only data mount provides access to data from the VPS within the notebooks. The Jupyter container is preconfigured with a connector for accessing a Google Cloud Storage data bucket.

How Deployment works

Yaml files

pvc.yaml

pvc.yaml creates a 1Gi persistent volume claim named notebook. This is used to store Jupyter notebook files between sessions, and if the pod crashes it should allow the notebooks to be reloaded into a new instance of the pod. If you need to back up these notebooks, the default location is: /var/lib/containers/storage/volumes/notebook-pvc/_data
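
For reference, a minimal claim of this shape might look like the sketch below. The claim name notebook-pvc is assumed from the default volume path above; the pvc.yaml shipped in the deployment directory is authoritative.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: notebook-pvc        # assumed from the default volume path above
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi          # the 1Gi of persistent notebook storage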

k8s.yaml

The k8s.yaml file loads two volume mounts and launches three containers.

  • Volume Mount 1: ./data The first volume mount requires a data directory located in the same directory as k8s.yaml. If the deployment directory was copied from the repository, the placeholder_data file can be replaced with any data desired in the Jupyter instance. This mount is read-only and cannot be altered from within the notebook.

  • Volume Mount 2: notebook This is the volume created using pvc.yaml. A sketch of how both mounts might appear in k8s.yaml follows below.
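
As a rough illustration, the two mounts could be wired into k8s.yaml along the lines of the sketch below. The volume names and container mount paths shown here are assumptions; check the actual k8s.yaml for the exact values.

    spec:
      volumes:
        - name: data                          # hostPath volume for ./data (name assumed)
          hostPath:
            path: ./data                      # relative data directory next to k8s.yaml
            type: Directory
        - name: notebook                      # backed by the claim created by pvc.yaml
          persistentVolumeClaim:
            claimName: notebook-pvc
      containers:
        - name: jupyter
          image: docker.io/jdknuds/jupyter_pyten:latest
          volumeMounts:
            - name: data
              mountPath: /home/jovyan/data    # assumed path inside the notebook container
              readOnly: true                  # matches the read-only behavior described above
            - name: notebook
              mountPath: /home/jovyan/work    # assumed; notebooks saved here persist between pods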

Containers

Container 1

This is a custom container which is a fork of the Jupyter Project notebooks. See the Dockerfile section for more details on customizing it.

This container provides a Jupyter Notebook instance, TensorFlow, PySpark, and a Google Cloud Storage (GCS) bucket connector. The container is configured to be accessed on port 10000.
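
In k8s.yaml the port mapping for this container would look roughly like the sketch below; the internal port 8888 is Jupyter's default and is assumed here.

    - name: jupyter
      image: docker.io/jdknuds/jupyter_pyten:latest
      ports:
        - containerPort: 8888    # Jupyter's default internal port (assumed)
          hostPort: 10000        # exposed on the VPS as :10000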

Container 2

Container 2 runs Nginx on port 9000.
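
Its entry in k8s.yaml might map port 9000 roughly as sketched below, assuming the stock nginx image listening on its default port 80.

    - name: nginx
      image: docker.io/library/nginx:latest   # assumed image and tag
      ports:
        - containerPort: 80                   # Nginx default listen port (assumed)
          hostPort: 9000                      # exposed on the VPS as :9000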

Container 3

Container 3 contains Rust.

The Dockerfile

Documentation for Jupyter Project can be found here.

The container is stored on Docker Hub. The Dockerfile is available in the DevOps directory. The Dockerfile can be customized and the container replaced with a custom build by changing the value on this line of the k8s.yaml file to an alternative image: image: docker.io/jdknuds/jupyter_pyten:latest.
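
For example, pointing the pod at your own build only requires changing that image value; the repository and tag below are placeholders.

    containers:
      - name: jupyter
        image: docker.io/youruser/custom_jupyter:latest   # placeholder; replaces docker.io/jdknuds/jupyter_pyten:latest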

If you wish to replace the GCS connector with a different one, update the connector argument: ARG connector="gcs-connector-latest-hadoop3.jar", replacing the value with an alternative connector. Then, update the wget command's URL target: RUN wget -P "${SPARK_HOME}/jars/" "https://storage.googleapis.com/hadoop-lib/gcs/${connector}"

Delta Lake

Delta Lake can be added to the notebook and tested by entering these commands:

  1. Install Delta Lake

    !pip install delta-spark==2.3.0
    
  2. Import and configure Delta

       import pyspark
       from delta import *

       builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
       spark = configure_spark_with_delta_pip(builder).getOrCreate()
    
  3. Write to a delta table

      data = spark.range(0, 5)
      data.write.format("delta").save("/tmp/delta-table")
    
  4. Read from a delta table

    df = spark.read.format("delta").load("/tmp/delta-table")
    df.show()  
    

Examples

Run the k8s.yaml

(screenshot)

Obtain the token

(screenshot)

Click the link to access the browser:

(screenshot)

Update the IP to the VPS IP and the port from 8888 to 10000

(Note: the following screenshots were taken from localhost; make sure to update the IP address for use on a VPS.)

(screenshot)

Notebook session:

(screenshot)