
How to mount a PersistentVolume for Static Provisioning using MapR CSI in GKE


Goal:

This article explains the detailed steps on how to mount a PersistentVolume for Static Provisioning using the MapR Container Storage Interface (CSI) in Google Kubernetes Engine (GKE).

Env:

MapR 6.1 (secured)
MapR CSI 1.0.0
Kubernetes Cluster in GKE

Use Case:

We have a secured MapR cluster (v6.1) and want to expose its storage to applications running in a Kubernetes cluster (GKE in this example).
In this example, we plan to expose a MapR volume named "mapr.apps" (mounted as /apps) to a sample POD in the Kubernetes cluster.
Inside the POD, it will be mounted as /mapr instead.

Solution:

1. Create a Kubernetes cluster named "standard-cluster-1" in GKE

You can use GUI or gcloud commands.
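For example, a minimal gcloud sketch (the zone, node count, and machine type below are assumptions; adjust them to your environment):
# illustrative values only
gcloud container clusters create standard-cluster-1 --zone us-central1-a --num-nodes 3 --machine-type n1-standard-4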

2. Fetch the credentials for the Kubernetes cluster

gcloud container clusters get-credentials standard-cluster-1 --zone us-central1-a
After that, make sure "kubectl cluster-info" returns correct cluster information.
This step is to make kubectl work and connect to the correct Kubernetes cluster.

3. Bind cluster-admin role to Google Cloud user

kubectl create clusterrolebinding user-cluster-admin-binding --clusterrole=cluster-admin --user=xxx@yyy.com
Note: "xxx@yyy.com" is the your Google Cloud user.
Here we grant cluster admin role to the user to avoid any permission error in the next step when we create MapR CSI ClusterRole and ClusterRoleBinding. 

4. Download MapR CSI Driver custom resource definition

Please refer to the latest documentation: https://mapr.com/docs/home/CSIdriver/csi_downloads.html 
git clone https://github.com/mapr/mapr-csi
cd ./mapr-csi/deploy/kubernetes/
kubectl create -f csi-maprkdf-v1.0.0.yaml
The below Kubernetes objects are created:
namespace/mapr-csi created
serviceaccount/csi-nodeplugin-sa created
clusterrole.rbac.authorization.k8s.io/csi-nodeplugin-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-nodeplugin-crb created
serviceaccount/csi-controller-sa created
clusterrole.rbac.authorization.k8s.io/csi-attacher-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-attacher-crb created
clusterrole.rbac.authorization.k8s.io/csi-controller-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-controller-crb created
daemonset.apps/csi-nodeplugin-kdf created
statefulset.apps/csi-controller-kdf created

5. Verify the PODs/DaemonSet/StatefulSet are running under namespace "mapr-csi"

PODs:
$ kubectl get pods -n mapr-csi -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-controller-kdf-0 5/5 Running 0 5m58s xx.xx.xx.1 gke-standard-cluster-1-default-pool-aaaaaaaa-1111 <none> <none>
csi-nodeplugin-kdf-9gmqc 3/3 Running 0 5m58s xx.xx.xx.2 gke-standard-cluster-1-default-pool-aaaaaaaa-2222 <none> <none>
csi-nodeplugin-kdf-qhhbh 3/3 Running 0 5m58s xx.xx.xx.3 gke-standard-cluster-1-default-pool-aaaaaaaa-3333 <none> <none>
csi-nodeplugin-kdf-vrq4g 3/3 Running 0 5m58s xx.xx.xx.4 gke-standard-cluster-1-default-pool-aaaaaaaa-4444 <none> <none>
DaemonSet:
$ kubectl get DaemonSet -n mapr-csi
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
csi-nodeplugin-kdf 3 3 3 3 3 <none> 8m58s
StatefulSet:
$ kubectl get StatefulSet -n mapr-csi
NAME READY AGE
csi-controller-kdf 1/1 9m42s

6. Create a test namespace named "testns" for future test PODs

kubectl create namespace testns

7. Create a Secret for MapR ticket

7.a Log on to the MapR cluster, and locate the ticket file using "maprlogin print" or generate a new ticket file using "maprlogin password".
For example, here we are using the "mapr" user's ticket file located at /tmp/maprticket_5000.
7.b Convert the ticket into base64 representation and save the output.
cat /tmp/maprticket_5000 | base64
7.c Create a YAML file named "mapr-ticket-secret.yaml" for the Secret named "mapr-ticket-secret" in namespace "testns".
apiVersion: v1
kind: Secret
metadata:
  name: mapr-ticket-secret
  namespace: testns
type: Opaque
data:
  CONTAINER_TICKET: CHANGETHIS!
Note: "CHANGETHIS!" should be replaced by the output we saved in step 7.b. Make sure it is in a single line.
7.d Create this Secret.
kubectl create -f mapr-ticket-secret.yaml

8. Change the GKE default Storage Class

This is because the GKE default Storage Class, named "standard", is backed by the dynamic provisioner kubernetes.io/gce-pd.
If we do not change the default, it will automatically create a new PV and bind it to our PVC in the next steps.
8.a Confirm the default Storage Class is named "standard" in GKE.
$ kubectl get storageclass -o yaml
apiVersion: v1
items:
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2019-12-04T19:38:38Z"
    labels:
      addonmanager.kubernetes.io/mode: EnsureExists
      kubernetes.io/cluster-service: "true"
    name: standard
    resourceVersion: "285"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
    uid: ab77d472-16cd-11ea-abaf-42010a8000ad
  parameters:
    type: pd-standard
  provisioner: kubernetes.io/gce-pd
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
8.b Create a YAML file named "my_storage_class.yaml" for Storage Class named "mysc".
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
8.c Create the Storage Class.
kubectl create -f my_storage_class.yaml
8.d Verify both Storage Classes.
$ kubectl get storageclass
NAME PROVISIONER AGE
mysc kubernetes.io/no-provisioner 8s
standard (default) kubernetes.io/gce-pd 8h
8.e Change default Storage Class to "mysc".
kubectl patch storageclass mysc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
8.f Verify both Storage Classes again.
$ kubectl get storageclass
NAME PROVISIONER AGE
mysc (default) kubernetes.io/no-provisioner 2m3s
standard kubernetes.io/gce-pd 8h

9. Create a YAML file named "test-simplepv.yaml" for PersistentVolume (PV) named "test-simplepv"

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-simplepv
  namespace: testns
  labels:
    name: pv-simplepv-test
spec:
  storageClassName: mysc
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  capacity:
    storage: 1Gi
  csi:
    nodePublishSecretRef:
      name: "mapr-ticket-secret"
      namespace: "testns"
    driver: com.mapr.csi-kdf
    volumeHandle: mapr.apps
    volumeAttributes:
      volumePath: "/apps"
      cluster: "mycluster.cluster.com"
      cldbHosts: "mycldb.node.internal"
      securityType: "secure"
      platinum: "false"
Make sure the CLDB host can be reached from the Kubernetes cluster nodes (a quick reachability check is shown after the create command below).
Also note that the PV uses our own Storage Class "mysc".
Create the PV:
kubectl create -f test-simplepv.yaml
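A quick way to confirm that the CLDB host is reachable from the Kubernetes cluster is to run a throwaway busybox POD; this is only a sketch, assuming the CLDB hostname from the PV spec above and the default CLDB port 7222 (nc options can vary with the busybox build):
# hostname and port are from the PV example above; adjust for your cluster
kubectl run cldb-check --rm -ti --restart=Never --image=busybox -n testns -- nc -zv mycldb.node.internal 7222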

10. Create a YAML file named "test-simplepvc.yaml" for PersistentVolumeClaim (PVC) named "test-simplepvc"

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-simplepvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
Create the PVC:
kubectl create -f test-simplepvc.yaml
Right now, the PVC should be in "Pending" status which is fine.
$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
test-simplepv 1Gi RWO Delete Available mysc 11s

$ kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Pending mysc 11s

11. Create a YAML file named "testpod.yaml" for a POD named "testpod"

apiVersion: v1
kind: Pod
metadata:
  name: testpod
  namespace: testns
spec:
  securityContext:
    runAsUser: 5000
    fsGroup: 5000
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
    volumeMounts:
    - mountPath: /mapr
      name: maprcsi
  volumes:
  - name: maprcsi
    persistentVolumeClaim:
      claimName: test-simplepvc
Create the POD:
kubectl create -f testpod.yaml

After that, both PV and PVC should be "Bound":
$ kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Bound test-simplepv 1Gi RWO mysc 82s

$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
test-simplepv 1Gi RWO Delete Bound testns/test-simplepvc mysc 89s

12. Log on to the POD to verify

kubectl exec -ti testpod -n testns -- bin/sh
Then try to read and write:
/ $ mount -v |grep mapr
posix-client-basic on /mapr type fuse.posix-client-basic (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/ $ ls -altr /mapr
total 6
drwxrwxrwt 3 5000 5000 1 Nov 26 16:49 kafka-streams
drwxrwxrwt 3 5000 5000 1 Nov 26 16:49 ksql
drwxrwxrwx 3 5000 5000 15 Dec 4 17:10 spark
drwxr-xr-x 5 5000 5000 3 Dec 5 04:27 .
drwxr-xr-x 1 root root 4096 Dec 5 04:40 ..
/ $ touch /mapr/testfile
/ $ rm /mapr/testfile

13. Clean up

kubectl delete -f testpod.yaml
kubectl delete -f test-simplepvc.yaml
kubectl delete -f test-simplepv.yaml
kubectl delete -f my_storage_class.yaml
kubectl delete -f mapr-ticket-secret.yaml
kubectl delete -f csi-maprkdf-v1.0.0.yaml

Common issues:

1. In step 4, when creating the MapR CSI ClusterRoleBinding, it fails with the below error message:
user xxx@yyy.com (groups=["system:authenticated"]) is attempting to grant rbac permissions not currently held
This is because the Google Cloud user "xxx@yyy.com" does not have sufficient permissions.
One solution is step 3, which grants the cluster-admin role to this user.

2. After PV and PVC are created, PVC is bound to a new PV named "pvc-...." instead of our PV named "test-simplepv".
For example:
$  kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Bound pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad 1Gi RWO standard 16m

$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad 1Gi RWO Delete Bound mapr-csi/test-simplepvc standard 17m
test-simplepv 1Gi RWO Delete Available
This is because GKE has a default Storage Class "standard" whose provisioner can create a new PV and bind it to our PVC.
We can confirm this using the below command:
$  kubectl get pvc test-simplepvc -o=yaml -n testns
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
  creationTimestamp: "2019-12-05T00:33:52Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: test-simplepvc
  namespace: testns
  resourceVersion: "61729"
  selfLink: /api/v1/namespaces/testns/persistentvolumeclaims/test-simplepvc
  uid: e9a0f512-16f6-11ea-abaf-42010a8000ad
spec:
  accessModes:
  - ReadWriteOnce
  dataSource: null
  resources:
    requests:
      storage: 1G
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  phase: Bound
One solution is step 8, which changes the GKE default Storage Class.

Troubleshooting:

DaemonSet "csi-nodeplugin-kdf" has 3 kinds of containers:
[csi-node-driver-registrar liveness-probe mapr-kdfplugin]
StatefulSet "csi-controller-kdf" has 5 kinds of containers:
[csi-attacher csi-provisioner csi-snapshotter liveness-probe mapr-kdfprovisioner]

So we can view all of the container logs to see if there is any error.
For example:
kubectl logs csi-nodeplugin-kdf-vrq4g -c csi-node-driver-registrar -n mapr-csi
kubectl logs csi-nodeplugin-kdf-vrq4g -c liveness-probe -n mapr-csi
kubectl logs csi-nodeplugin-kdf-vrq4g -c mapr-kdfplugin -n mapr-csi

kubectl logs csi-controller-kdf-0 -c csi-provisioner -n mapr-csi
kubectl logs csi-controller-kdf-0 -c csi-attacher -n mapr-csi
kubectl logs csi-controller-kdf-0 -c csi-snapshotter -n mapr-csi
kubectl logs csi-controller-kdf-0 -c mapr-kdfprovisioner -n mapr-csi
kubectl logs csi-controller-kdf-0 -c liveness-probe -n mapr-csi
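If you are unsure of the exact container names inside a POD in your environment, one way to list them (using the POD names from the examples above) is:
kubectl get pod csi-controller-kdf-0 -n mapr-csi -o jsonpath='{.spec.containers[*].name}'
kubectl get pod csi-nodeplugin-kdf-vrq4g -n mapr-csi -o jsonpath='{.spec.containers[*].name}'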

Reference:

https://mapr.com/docs/home/CSIdriver/csi_overview.html
https://mapr.com/docs/home/CSIdriver/csi_installation.html
https://mapr.com/docs/home/CSIdriver/csi_example_static_provisioning.html


Spark Streaming sample scala code for different sources


Goal:

This article shares some sample Spark Streaming Scala code for different sources -- socket text, text files in a MapR-FS directory, a Kafka broker, and MapR Event Store for Apache Kafka (MapR Streams).
These are word-count examples which can be run directly from spark-shell.

Env:

MapR 6.1
mapr-spark-2.3.2.0
mapr-kafka-1.1.1
mapr-kafka-ksql-4.1.1

Solution:

1. socket text

Data source:
Open a socket on port 9999 and type some words as the data source.
nc -lk 9999
Sample Code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

2. text files in MapR-FS directory

Data source:
Create a directory on MapR-FS and put text files inside as the data source.
hadoop fs -mkdir /tmp/textfile
hadoop fs -put /opt/mapr/NOTICE.txt /tmp/textfile/
Sample Code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("/tmp/textfile")
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

3. kafka broker

Data source:
Assuming an existing kafka server is started:
./bin/kafka-server-start.sh ./config/server.properties
Create a new topic named "mytopic":
./bin/kafka-topics.sh --create --zookeeper localhost:5181 --replication-factor 1 --partitions 1 --topic mytopic
Start a kafka console producer and type some words as data source:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic
OR use below producer:
./kafka-producer-perf-test.sh  --topic mytopic --num-records 1000000 --record-size 1000 \
--throughput 10000 --producer-props bootstrap.servers=localhost:9092
Sample Code:
import org.apache.kafka.clients.consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{
ConsumerStrategies,
KafkaUtils,
LocationStrategies
}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.GROUP_ID_CONFIG -> "mysparkgroup",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
)

val topicsSet = Array("mytopic")
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
consumerStrategy)

val lines = messages.map(_.value())
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

4. MapR Event Store for Apache Kafka(MapR Streams)

Data source:
Create a sample MapR stream named /sample-stream:
maprcli stream create -path /sample-stream -produceperm p -consumeperm p -topicperm p
Use one of the KSQL tools mentioned in this blog to generate the data:
/opt/mapr/ksql/ksql-4.1.1/bin/ksql-datagen quickstart=pageviews format=delimited topic=/sample-stream:pageviews maxInterval=10000
OR use below producer:
./kafka-producer-perf-test.sh  --topic /sample-stream:pageviews --num-records 1000000 --record-size 10000 \
--throughput 10000 --producer-props bootstrap.servers=localhost:9092

Sample code:
import org.apache.kafka.clients.consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{
ConsumerStrategies,
KafkaUtils,
LocationStrategies
}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

val kafkaParams = Map[String, Object](
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.GROUP_ID_CONFIG -> "mysparkgroup",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
)

val topicsSet = Array("/sample-stream:pageviews")
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
consumerStrategy)

val lines = messages.map(_.value())
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()


How to create a MapR PACC using mapr-setup.sh to submit a Spark sample job


Goal:

This article shares the detailed steps on how to create a MapR Persistent Application Client Container (PACC) using mapr-setup.sh to submit a sample Spark job to a secured MapR cluster.

Env:

MapR 6.1 (secured) with FUSE based posix client running.
Docker is installed and running on Mac

Solution:

1. Generate a service ticket for user "mapr" on the secured MapR cluster

maprlogin generateticket -type service -cluster my61.cluster.com -duration 30:0:0 -out /tmp/mapr_ticket -user mapr

2. Copy the service ticket to Mac where docker is running

Say this location on my mac is /Users/hzu/pacc/mapr_ticket

3. Download mapr-setup.sh on Mac

curl -O https://package.mapr.com/releases/installer/mapr-setup.sh
chmod +x ./mapr-setup.sh

4. Create MapR PACC Image Using mapr-setup.sh

./mapr-setup.sh docker client
Follow doc: https://mapr.com/docs/61/AdvancedInstallation/CreatingPACCImage.html
Note: add at least the Spark client.

5. Edit ./docker_images/client/mapr-docker-client.sh

MAPR_CLUSTER=my61.cluster.com
MAPR_CLDB_HOSTS=v1.poc.com,v2.poc.com,v3.poc.com
MAPR_MOUNT_PATH=/maprfuse
MAPR_TICKET_FILE=/Users/hzu/pacc/mapr_ticket
MAPR_TICKETFILE_LOCATION="/tmp/$(basename $MAPR_TICKET_FILE)"
MAPR_CONTAINER_USER=mapr
MAPR_CONTAINER_UID=5000
MAPR_CONTAINER_GROUP=mapr
MAPR_CONTAINER_GID=5000
MAPR_MEMORY=0
MAPR_DOCKER_NETWORK=bridge

6. Run mapr-docker-client.sh to start the container

./docker_images/client/mapr-docker-client.sh

7. Verify the container has access to the POSIX mount point

[mapr@8955101793bf ~]$ ls -altr  /maprfuse
total 1
drwxr-xr-x 11 mapr mapr 12 Dec 6 12:46 my61.cluster.com

[mapr@8955101793bf ~]$ rpm -qa|grep -i mapr-
mapr-client-6.1.0.20180926230239.GA-1.x86_64
mapr-posix-client-container-6.1.0.20180926230239.GA-1.x86_64
mapr-hive-2.3.201809220807-1.noarch
mapr-librdkafka-0.11.3.201803231414-1.noarch
mapr-spark-2.3.1.201809221841-1.noarch
mapr-kafka-1.1.1.201809281337-1.noarch
mapr-pig-0.16.201707251429-1.noarch

8. Verify that submitting spark job works in the container

/opt/mapr/spark/spark-2.3.1/bin/run-example --master yarn --deploy-mode cluster SparkPi 10

Common Issues:

If mapr-setup.sh fails with the below error on the Mac, add the Mac's IP address and hostname to /etc/hosts in advance (see the example after the error message).
ERROR: Hostname (mymacbook.local) cannot be resolved. Correct the problem and retry mapr-setup.sh
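For example, a hedged sketch (the IP address and hostname below are placeholders for your Mac's actual values):
# replace with your Mac's real IP address and hostname
echo "192.168.1.10  mymacbook.local" | sudo tee -a /etc/hosts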

References:

https://mapr.com/products/persistent-application-client-container/
https://mapr.com/blog/persistent-storage-docker-containers-whiteboard-walkthrough/
https://mapr.com/docs/61/AdvancedInstallation/UsingtheMapRPACC.html
https://mapr.com/docs/61/AdvancedInstallation/CreatingPACCImage.html
https://mapr.com/docs/61/AdvancedInstallation/CustomizingaMapRPACC.html
https://mapr.com/blog/getting-started-mapr-client-container/
https://hub.docker.com/r/maprtech/pacc/tags/

How to use nodeSelector to constrain POD csi-controller-kdf-0 to only be able to run on particular Node(s)


Goal:

This article explains how to use nodeSelector to constrain POD csi-controller-kdf-0 to only be able to run on particular Node(s).

Env:

MapR 6.1 (secured)
MapR CSI 1.0.0
Kubernetes Cluster in GKE

Use case:

For MapR CSI, we want the POD from StatefulSet "csi-controller-kdf" to only run on specific node(s).

Solution:

1. List current nodes from Kubernetes cluster

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-standard-cluster-1-default-pool-f6e6e4c1-45ql Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-fbhp Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-r20n Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-xr3s Ready <none> 22m v1.13.11-gke.14

For example, we want the POD from StatefulSet "csi-controller-kdf" to only run on node "gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5".

2. Attach a label to this node

kubectl label nodes gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 for-csi-controller=true
Here the label key is "for-csi-controller" and the label value is "true".
Verify that the label is attached on that node:
$ kubectl get nodes -l for-csi-controller=true
NAME STATUS ROLES AGE VERSION
gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 Ready <none> 34m v1.13.11-gke.14

3. Modify csi-maprkdf-v1.0.0.yaml

cp csi-maprkdf-v1.0.0.yaml csi-maprkdf-v1.0.0_modified.yaml
vi csi-maprkdf-v1.0.0_modified.yaml
Add the below to the bottom of the definition for StatefulSet "csi-controller-kdf":
      nodeSelector:
        for-csi-controller: "true"
One full example for StatefulSet "csi-controller-kdf" is:
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: csi-controller-kdf
  namespace: mapr-csi
spec:
  serviceName: "kdf-provisioner-svc"
  replicas: 1
  template:
    metadata:
      labels:
        app: csi-controller-kdf
    spec:
      serviceAccount: csi-controller-sa
      containers:
      - name: csi-attacher
        image: quay.io/k8scsi/csi-attacher:v1.0.1
        args:
        - "--v=5"
        - "--csi-address=$(ADDRESS)"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        imagePullPolicy: "Always"
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: csi-provisioner
        image: quay.io/k8scsi/csi-provisioner:v1.0.1
        args:
        - "--provisioner=com.mapr.csi-kdf"
        - "--csi-address=$(ADDRESS)"
        - "--volume-name-prefix=mapr-pv"
        - "--v=5"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        imagePullPolicy: "Always"
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: csi-snapshotter
        image: quay.io/k8scsi/csi-snapshotter:v1.0.1
        imagePullPolicy: "Always"
        args:
        - "--snapshotter=com.mapr.csi-kdf"
        - "--csi-address=$(ADDRESS)"
        - "--snapshot-name-prefix=mapr-snapshot"
        - "--v=5"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: liveness-probe
        image: quay.io/k8scsi/livenessprobe:v1.0.1
        imagePullPolicy: "Always"
        args:
        - "--v=5"
        - "--csi-address=$(ADDRESS)"
        - "--connection-timeout=60s"
        - "--health-port=9809"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: mapr-kdfprovisioner
        image: maprtech/csi-kdfprovisioner:1.0.0
        imagePullPolicy: "Always"
        args:
        - "--nodeid=$(NODE_ID)"
        - "--endpoint=$(CSI_ENDPOINT)"
        - "-v=5"
        env:
        - name: NODE_ID
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: CSI_ENDPOINT
          value: unix://plugin/csi.sock
        ports:
        - containerPort: 9809
          name: healthz
          protocol: TCP
        livenessProbe:
          failureThreshold: 20
          httpGet:
            path: /healthz
            port: healthz
          initialDelaySeconds: 10
          timeoutSeconds: 3
          periodSeconds: 5
        volumeMounts:
        - name: socket-dir
          mountPath: /plugin
        - name: k8s-log-dir
          mountPath: /var/log/csi-maprkdf
        - name: timezone
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: socket-dir
        emptyDir: {}
      - name: k8s-log-dir
        hostPath:
          path: /var/log/csi-maprkdf
          type: DirectoryOrCreate
      - name: timezone
        hostPath:
          path: /etc/localtime
      nodeSelector:
        for-csi-controller: "true"

 4. Create StatefulSet "csi-controller-kdf" using the modified version when configuring MapR CSI

kubectl apply -f csi-maprkdf-v1.0.0_modified.yaml
The other steps to configure MapR CSI are the same as in this blog.

5. Verify that POD "csi-controller-kdf-0" is running on that specific node

$ kubectl get pods -n mapr-csi -o wide  |grep csi-controller-kdf-0
csi-controller-kdf-0 5/5 Running 0 56m xx.xx.xx.4 gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 <none> <none>

Disaster Recovery Test: 

1. Drain this specific node and evict all the PODs except those for DaemonSets.

$ kubectl drain gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 --ignore-daemonsets --delete-local-data
node/gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.2.0-hzrq7, kube-system/prometheus-to-sd-jxhrm, mapr-csi/csi-nodeplugin-kdf-ssbxp
evicting pod "csi-controller-kdf-0"
evicting pod "kube-dns-79868f54c5-rggws"
pod/csi-controller-kdf-0 evicted
pod/kube-dns-79868f54c5-rggws evicted
node/gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 evicted

2. Check if the POD "csi-controller-kdf-0" will be rescheduled on other nodes or not.

$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
mapr-csi csi-controller-kdf-0 0/5 Pending 0 16m <none> <none> <none> <none>
...
As we can see, the POD "csi-controller-kdf-0" stays pending and cannot be rescheduled on other nodes.
This proves that the nodeSelector is working.

3. Mark the specific node available again

kubectl uncordon gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5

4. Verify that POD "csi-controller-kdf-0" is running on the specific node again

$ kubectl get pods --all-namespaces -o wide |grep -i csi-controller-kdf-0
mapr-csi csi-controller-kdf-0 5/5 Running 0 17m xx.xx.xx.5 gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 <none> <none>

5.  Verify the mount point is working in the test POD

$ kubectl exec -ti testpod -n testns -- ls -altr /mapr
total 6
drwxrwxrwt 3 5000 5000 1 Nov 25 11:17 kafka-streams
drwxrwxrwt 3 5000 5000 1 Nov 25 11:18 ksql
drwxrwxrwx 3 5000 5000 2 Dec 6 12:38 spark
drwxr-xr-x 1 root root 4096 Dec 12 22:11 ..
drwxr-xr-x 5 5000 5000 3 Dec 12 23:45 .

References:

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

Hands-on MKE(MapR Kubernetes Ecosystem ) 1.0 release


Goal:

MKE (MapR Kubernetes Ecosystem) 1.0 has been released.
Basically, this release puts Spark and Drill into a Kubernetes environment.
The architecture is described in the documentation under Operators and Compute Spaces.
This article shares the step-by-step commands used to install and configure an MKE 1.0 environment.

Env:

MKE 1.0
MapR 6.1 secured
MacOS with kubectl installed as the client

Solution:

Currently we already have one secured MapR 6.1 cluster running in GCE (Google Compute Engine).
We just want to create a CSpace (Compute Space) in a Kubernetes cluster which can access the existing secured MapR 6.1 cluster.
So the high-level steps are:
    1. Create a Kubernetes Cluster in GKE(Google Kubernetes Engine).
    2. Bootstrap the Kubernetes Cluster
    3. Create and Deploy External Info for CSpace
    4. Create a CSpace
    5. Run a Drill Cluster in CSpace
    6. Run a Spark Application in CSpace

      1. Create a Kubernetes Cluster in GKE(Google Kubernetes Engine)

      1.1 Create a Kubernetes cluster named "hao-cluster" in GKE

      You can use GUI or gcloud commands.
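      For example, a minimal gcloud sketch similar to the earlier GKE article (the zone and node count are assumptions):
      # illustrative values only
      gcloud container clusters create hao-cluster --zone us-central1-a --num-nodes 3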

      1.2 Fetch the credentials for the Kubernetes cluster

      gcloud container clusters get-credentials hao-cluster --zone us-central1-a
      After that, make sure "kubectl cluster-info" returns correct cluster information.
      This step is to make kubectl work and connect to the correct Kubernetes cluster.

      1.3 Bind cluster-admin role to Google Cloud user

      kubectl create clusterrolebinding user-cluster-admin-binding --clusterrole=cluster-admin --user=xxx@yyy.com

      Note: "xxx@yyy.com" is the my Google Cloud user.
      Here we grant cluster admin role to the user to avoid any permission error in the next step when we create MapR CSI ClusterRole and ClusterRoleBinding. 

      2. Bootstrap the Kubernetes Cluster

      2.1 Download MKE github

      git clone https://github.com/mapr/mapr-operators
      cd ./mapr-operators
      git checkout mke-1.0.0.0

      2.2 Run the bootstrapinstall Utility

      ./bootstrap/bootstrapinstall.sh
      >>> Installing to an Openshift environment? (yes/no) [no]:
      >>> Install MapR CSI driver? (yes/no) [yes]:
      ...
      This Kubernetes environment has been successfully bootstrapped for MapR
      MapR components can now be created via the newly installed operators

      2.3 Verify the PODs/DaemonSet/StatefulSet are running under namespace "mapr-csi"/"mapr-system"/"spark-operator"/"drill-operator"

      kubectl get pods --all-namespaces
      Make sure all of the PODs are ready and in "Running" status.
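      One simple way to spot PODs that are not yet ready or Running (just a grep filter; adjust it if you also have Completed jobs) is:
      kubectl get pods --all-namespaces | grep -v Running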

      3. Create and Deploy External Info for CSpace

      Follow documentation: Automatically Generating and Deploying External Info for a CSpace

      3.1 Copy tools/gen-external-secrets.sh to one node of the MapR Cluster

      gcloud compute scp tools/gen-external-secrets.sh scott-mapr-core-pvp1:/tmp/
      chown mapr:mapr gen-external-secrets.sh

      3.2 As the admin user (typically mapr), generate a user ticket

      maprlogin password

      3.3 Run gen-external-secrets.sh as the admin user(typically mapr)

      /tmp/gen-external-secrets.sh 
      ...
      The external information generated for this cluster are available at: mapr-external-secrets-hao.yaml
      Please copy them to a machine where you can run the following command:
      kubectl apply -f mapr-external-secrets-hao.yaml

      3.4 Copy above generated mapr-external-secrets-hao.yaml to the kubectl client node

      gcloud compute scp scott-mapr-core-pvp1:/home/mapr/mapr-external-secrets-hao.yaml /tmp/

      3.5 Apply external secrets

      kubectl apply -f /tmp/mapr-external-secrets-hao.yaml

      4. Create a CSpace

      Follow documentation: Creating a Compute Space

      4.1 Copy the sample CSpace CR

      cp examples/cspaces/cr-cspace-full-gce.yaml /tmp/my_cr-cspace-full-gce.yaml

      4.2 Modify the sample CSpace CR

      At least, we need to modify the cluster name.
      vim /tmp/my_cr-cspace-full-gce.yaml

      4.3 Apply CSpace CR

      kubectl apply -f /tmp/my_cr-cspace-full-gce.yaml

      4.4 Verify the PODs are ready and running in namespace "mycspace"

      kubectl get pods -n mycspace -o wide
      Here are 3 PODs running:
      CSpace terminal, Hive Metastore and Spark HistoryServer.

      4.5 Log on to one of the PODs to verify CSI is working fine and MapRFS is accessible

      kubectl exec -ti hivemeta-f6d746f-n27h6 -n mycspace -- bash
      su - mapr
      maprlogin password
      hadoop fs -ls /

      5. Run a Drill Cluster in CSpace

      Follow documentation: Running Drillbits in Compute Spaces

      5.1 Copy the sample Drill CR

      cp examples/drill/drill-cr-full.yaml /tmp/my_drill-cr-full.yaml

      5.2 Modify the sample Drill CR

      At least, we need to modify the name of the CSpace.
      vim /tmp/my_drill-cr-full.yaml

      5.3 Apply Drill CR

      kubectl apply -f /tmp/my_drill-cr-full.yaml

      5.4 Verify the Drillbit PODs are ready and running inside CSpace

      kubectl get pods -n mycspace

      5.5 Log on to a drillbit POD to check the health of the Drill cluster

      kubectl exec -ti drillcluster1-drillbit-0 -n mycspace -- bash
      su - mapr
      maprlogin password

      /opt/mapr/drill/drill-1.16.0/bin/sqlline -u "jdbc:drill:zk=xxx:5181,yyy:5181,zzz:5181;auth=maprsasl"
      apache drill> select * from sys.drillbits;
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      | hostname | user_port | control_port | data_port | http_port | current | version | state |
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      | drillcluster1-drillbit-0.drillcluster1-svc.mycspace.svc.cluster.local | 21010 | 21011 | 21012 | 8047 | false | 1.16.0.10-mapr | ONLINE |
      | drillcluster1-drillbit-1.drillcluster1-svc.mycspace.svc.cluster.local | 21010 | 21011 | 21012 | 8047 | true | 1.16.0.10-mapr | ONLINE |
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      2 rows selected (2.228 seconds)

      5.6 Access Drillbit UI

      The first option is to do port forwarding:
      kubectl port-forward --namespace mycspace $(kubectl get pod --namespace mycspace --selector="controller-revision-hash=drillcluster1-drillbit-57876df7bf,drill-cluster=drillcluster1,statefulset.kubernetes.io/pod-name=drillcluster1-drillbit-1" --output jsonpath='{.items[0].metadata.name}') 8080:8047
      And then open UI:
      https://localhost:8080/

      The second option is to use the service which is already exposed as a LoadBalancer type:
      $ kubectl get service -n mycspace | grep drillcluster1-web-svc
      drillcluster1-web-svc LoadBalancer 10.0.0.111 xxx.xxx.xxx.123 8047:31642/TCP,21010:30945/TCP 25h
      And then open UI:
      https://xxx.xxx.xxx.123:8047

      6. Run a Spark Application in CSpace

      Follow documentation: Running Spark Applications in Compute Spaces

      6.1 Log on to the CSpace terminal POD

      kubectl port-forward -n mycspace cspaceterminal-bcdcf7bbb-p6227 7777:7777
      Note: the 2nd "7777" port is what you configured in CSpace CR file earlier, eg, /tmp/my_cr-cspace-full-gce.yaml:
      $ grep sshPort /tmp/my_cr-cspace-full-gce.yaml
      sshPort: 7777
      Then ssh to the cspace terminal POD:
      ssh mapr@localhost -p 7777

      6.2 Create the user ticket for the Spark Application submitter

      Follow documentation: Using the Ticketcreator Utility to Generate Secrets
      [mapr@cspaceterminal-bcdcf7bbb-p6227 ~]$ ticketcreator.sh
      Create a ticket for tenant user: [mapr]:
      Please provide 'mapr's password: [mapr]:
      uid=1002(mapr) gid=1003(mapr) groups=1003(mapr),0(root)
      Creating user ticket for mapr...
      MapR credentials of user 'mapr' for cluster 'gce1.cluster.com' are written to '/tmp/maprticket_1002'

      Please provide a name for your user secret: [mapr-user-secret-4030076998]:
      secret/mapr-user-secret-4030076998 created
      Please note secret name: mapr-user-secret-4030076998 for later use.

      Do you want to create a dynamic MapR Volume via CSI for storage of Spark secondary dependencies?
      This will create both a PVC and a PV. (y/n) [n]: y
      Provide the CSI PersistentVolumeClaim Name: [mapr-csi-pvc-2696334965]:
      persistentvolumeclaim/mapr-csi-pvc-2696334965 created
      Please note PVC name: mapr-csi-pvc-2696334965 for later use.

      Provide the CSI PersistentVolume Name: [mapr-csi-pv-2354307494]:
      persistentvolume/mapr-csi-pv-2354307494 created
      Please note PV name: mapr-csi-pv-2354307494 for later use.

      6.3 Copy the sample Spark pi job CR

      cp examples/spark/mapr-spark-pi.yaml /tmp/my_mapr-spark-pi.yaml

      6.4 Modify the sample Spark pi job CR

      vim /tmp/my_mapr-spark-pi.yaml
      At least modify the CSpace name, spark.mapr.user.secret and serviceAccount (a quick check is shown below).
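      A quick, purely illustrative way to double-check those fields before applying the CR is to grep for them:
      grep -nE "namespace|spark.mapr.user.secret|serviceAccount" /tmp/my_mapr-spark-pi.yaml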

      6.5 Submit the spark pi job

      kubectl apply -f /tmp/my_mapr-spark-pi.yaml

      6.6 Verify the spark pi job is running

      [mapr@cspaceterminal-bcdcf7bbb-p6227 ~]$ sparkctl list -n mycspace
      +----------+---------+----------------+-----------------+
      | NAME | STATE | SUBMISSION AGE | TERMINATION AGE |
      +----------+---------+----------------+-----------------+
      | spark-pi | RUNNING | 36s | N.A. |
      +----------+---------+----------------+-----------------+

      6.7 View the log for the spark pi job

      On the CSpace terminal POD using sparkctl:
      sparkctl log spark-pi  -n mycspace
      OR
      On the kubectl client node using kubectl:
      kubectl logs spark-pi-driver -n mycspace

      6.8 Access Spark HistoryServer UI

      Use the service which is already exposed as LoadBalancer type:
      $ kubectl get service -n mycspace | grep sparkhs-svc
      sparkhs-svc LoadBalancer 10.0.0.222 yyy.yyy.yyy.230 18480:31507/TCP 26h
      And then open UI:
      https://yyy.yyy.yyy.230:18480



      How to check if Spark job runs out of quota in CSpace


      Goal:

      How to check if Spark job runs out of quota in CSpace.

      Env:

      MKE 1.0

      Solution:

      The example CSpace configuration file in MKE 1.0 has the below 3 default PODs:
      • terminal
      • hivemetastore
      • sparkhs
      Each of them requests 2 CPUs + 8Gi of memory.
      This information is inside:
      git clone https://github.com/mapr/mapr-operators
      cd ./mapr-operators
      git checkout mke-1.0.0.0
      cat examples/cspaces/cr-cspace-full-gce.yaml

      cspaceservices:
        terminal:
          count: 1
          image: cspaceterminal-6.1.0:201912180140
          sshPort: 7777
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
        hivemetastore:
          count: 1
          image: hivemeta-2.3:201912180140
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
        sparkhs:
          count: 1
          image: spark-hs-2.4.4:201912180140
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
      So when we calculate how many resources are available for other ecosystem components like Spark and Drill, we need to take those resources into consideration.
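      One way to see how much of the CSpace quota is currently consumed (assuming the CSpace namespace is "mycspace", as in this article) is:
      kubectl describe resourcequota -n mycspace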

      How do we check if the Spark job is running out of quota in the CSpace?
      We need to get the Spark driver log using the below commands.
      Take the pi job for example:
      kubectl logs spark-pi-driver -n mycspace
      or
      sparkctl log spark-pi -n mycspace

      There are at least 3 scenarios:

      1. No nodes in Kubernetes cluster have sufficient resources

      For example, suppose the CSpace quota has 50 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 50-6=44 CPUs available for running one Spark job.
      If the Spark driver only needs 1 CPU, then we still have 43 CPUs available for Spark executors.
      For the below definition in the Spark job YAML file:
        executor:
          cores: 20
          instances: 2
          memory: "1024m"
          labels:
            version: 2.4.4
      I need to start 2 Spark executors with 20 CPUs each.

      Symptom:
      The requirement (40 CPUs) is below the available quota (43 CPUs); however, it may hit the below error in the Spark driver log:
      WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      Troubleshooting:
      2 Spark executor PODs are pending forever.
      $ kubectl get pods -n mycspace
      NAME READY STATUS RESTARTS AGE
      spark-pi-1581449230742-exec-1 0/1 Pending 0 17m
      spark-pi-1581449230742-exec-2 0/1 Pending 0 16m
      spark-pi-driver 1/1 Running 0 17m
      ...

      "kubectl describe executor-POD" tells the reason why they are pending:
      $ kubectl describe pod spark-pi-1581449230742-exec-1 -n mycspace
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning FailedScheduling 2m24s (x29 over 17m) default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
      Basically it means no node has sufficient resources.
      This can be confirmed by below commands:
      $ kubectl describe node
      ...
      Allocatable:
      attachable-volumes-csi-com.mapr.csi-kdf: 20
      attachable-volumes-gce-pd: 127
      cpu: 15890m
      ephemeral-storage: 47093746742
      hugepages-2Mi: 0
      memory: 56288592Ki
      pods: 110
      ...
      Root Cause:
      In this Kubernetes cluster, we have 3 nodes.
      The emptiest node can allocate at most 15.89 CPUs, which is less than the 20-CPU request.

      2. Spark executors run out of quota of CSpace

      For example, suppose the CSpace quota has 10 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 10-6=4 CPUs available for running one Spark job.
      If the Spark driver only needs 1 CPU, then we still have 3 CPUs available for Spark executors.
      For the below definition in the Spark job YAML file:
        driver:
          cores: 1
          coreLimit: "1000m"
          memory: "1024m"
          labels:
            version: 2.4.4
          serviceAccount: mapr-mycspace-cspace-sa
        executor:
          cores: 2
          instances: 2
          memory: "1024m"
          labels:
            version: 2.4.4
      I need to start 2 Spark executors with 2 CPUs each.

      Symptom:
      The requirement (4 CPUs) is above the available quota (3 CPUs); it may show the below error in the Spark driver log:
      ERROR util.Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-1581464839789-exec-3" is forbidden: exceeded quota: mycspacequota, requested: cpu=2, used: cpu=9, limited: cpu=10.
      However, the job can still complete because Spark will put both tasks into one executor.
      The Spark History Server's "Executors" tab reflects this.
      If we reduce the CPU requirement for each Spark executor from 2 to 1, the "Executors" tab shows the difference as a comparison.

      3. Spark driver runs out of quota of CSpace

      For example, suppose the CSpace quota has 10 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 10-6=4 CPUs available for running one Spark job.
      If the Spark driver needs 5 CPUs, that alone is already above the available quota.
      For the below definition in the Spark job YAML file:
        driver:
          cores: 5
          coreLimit: "5000m"
          memory: "1024m"
          labels:
            version: 2.4.4
          serviceAccount: mapr-mycspace-cspace-sa
      I need to start 1 Spark driver with 5 CPUs.

      Symptom:
      The Spark job will fail, as can be seen by checking its status using sparkctl:
      $ sparkctl list -n mycspace
      +----------+--------+----------------+-----------------+
      | NAME | STATE | SUBMISSION AGE | TERMINATION AGE |
      +----------+--------+----------------+-----------------+
      | spark-pi | FAILED | 1m | N.A. |
      +----------+--------+----------------+-----------------+
      Troubleshooting:
      No driver log is generated yet:
      $ kubectl logs spark-pi-driver -n mycspace -f |tee /tmp/sparkjob.txt
      Error from server (NotFound): pods "spark-pi-driver" not found
      This is because even the Spark driver POD has not started yet:
      $ kubectl get pods -n mycspace
      NAME READY STATUS RESTARTS AGE
      cspaceterminal-bcdcf7bbb-f68r9 1/1 Running 0 5h18m
      hivemeta-f6d746f-jq5rj 1/1 Running 0 5h18m
      sparkhs-667f46dcfd-24k86 1/1 Running 0 5h18m
      "kubectl describe sparkapplication" should show the reason:
      $ kubectl describe sparkapplication spark-pi -n mycspace
      ...
      Application State:
      Error Message: failed to run spark-submit for SparkApplication mycspace/spark-pi: 20/02/11 23:59:59 ERROR deploy.SparkSubmit$$anon$2: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
      ...
      Root Cause:
      The Spark driver POD could not be started because the CSpace is already out of quota.



      Hbase replication cheat sheet


       Goal:

      This article records the common commands and issues for hbase replication.


      Solution:

      1. Add the target as peer

      hbase shell> add_peer "us_east","hostname.of.zookeeper:5181:/path-to-hbase"

      2. Enable and Disable table replication

      hbase shell> enable_table_replication "t1"
      hbase shell> disable_table_replication "t1"

      3. Copy table from source to target

      hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hostname.of.zookeeper:5181:/path-to-hbase t1

      4. Remove target as peer

      hbase shell> remove_peer "us_east"

      5. List all peers

      hbase shell> list_peers

      6. Verify the rows between source and target table

      hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication peer1 table1

      Compare the GOODROWS and BADROWS.

      7.  Monitor Replication Status

      # Prints the status of each source and its sinks, sorted by hostname.
      hbase shell> status 'replication'

      # Prints the status for each replication source, sorted by hostname.
      hbase shell> status 'replication', 'source'

      # Prints the status for each replication sink, sorted by hostname.
      hbase shell> status 'replication', 'sink'

      8.  HBase Replication Metrics

      source.sizeOfLogQueue: Number of WALs to process (excludes the one which is being processed) at the replication source.

      source.shippedOps: Number of mutations shipped.

      source.logEditsRead: Number of mutations read from WALs at the replication source.

      source.ageOfLastShippedOp: Age of the last batch shipped by the replication source.

      9. Practice for replicating one existing table from cluster A to cluster B

      on cluster A:
      hbase shell> add_peer "B","hostname.of.zookeeper:5181:/path-to-hbase"
      hbase shell> enable_table_replication "t1"
      hbase shell> disable_peer 'B'

      Then use either CopyTable, Export/Import or ExportSnapshot to copy table "t1" from A to B.
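      For example, a hedged ExportSnapshot sketch (the snapshot name, target URI, and mapper count are illustrative; restoring the snapshot on B with clone_snapshot or restore_snapshot is a separate step):
      hbase shell> snapshot 't1', 't1_snapshot'
      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1_snapshot -copy-to maprfs:///path-to-hbase-on-B -mappers 4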

      hbase shell> enable_peer 'B'

      10. Hbase replication related parameters

      <property>
      <name>hbase.replication</name>
      <value>true</value>
      <description>Allow HBase tables to be replicated.</description>
      </property>

      <property>
      <name>replication.source.nb.capacity</name>
      <value>25000</value>
      <description>The data records synchronized to the sink side each time cannot be greater than the threshold, and the default is 25000</description>
      </property>

      <property>
      <name>replication.source.ratio</name>
      <value>0.1</value>
      <description>The RegionServer of this ratio is selected from the cluster to be backed up as potential ReplicationSink, and the default value is 0.1</description>
      </property>

      <property>
      <name>replication.source.size.capacity</name>
      <value>67108864</value>
      <description>The size of the data synchronized to the sink side each time cannot exceed this threshold, and the default is 64M</description>
      </property>

      <property>
      <name>replication.sleep.before.failover</name>
      <value>2000</value>
      <description>Before transferring the ReplicationQueue in the dead RegionServer to another RegionServer, take a nap for 2 seconds</description>
      </property>

      <property>
      <name>replication.executor.workers</name>
      <value>1</value>
      <description>The number of threads engaged in replication, the default is 1</description>
      </property>

      Known Issues

      1. HBASE-18111

      The cluster connection is aborted when the ZookeeperWatcher receives an AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() method gets stuck in a while loop.

      One symptom is that a jstack on the RegionServer shows:

      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
      at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
      at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
      at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)

      This is fixed in 1.3.3, 1.4.0, and 2.0.0.

      2.  HBASE-24359

      Replication gets stuck after we delete column families from both the source and the sink if the source still has outstanding edits that it can no longer get rid of. All replication then backs up behind these unreplicatable edits.

      The fix introduces a new config, hbase.replication.drop.on.deleted.columnfamily, whose default is false. When set to true, replication will drop the edits for column families that have been deleted from the replication source and target.

      This is fixed in 2.3.0 and 3.0.0.

      References

      https://blog.cloudera.com/what-are-hbase-znodes/

      https://blog.cloudera.com/apache-hbase-replication-overview/

       https://blog.cloudera.com/online-apache-hbase-backups-with-copytable/

      https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/fault-tolerance/content/manually_enable_hbase_replication.html

      https://blog.cloudera.com/introduction-to-apache-hbase-snapshots/

       

      Hbase master failed to start with error "java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned"


      Symptom:

      The HBase master fails to start with the error "java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned".

      It can happen when starting the master or when switching to a new active master.

      Sample error messages are:

      2000-01-01 01:01:01,999 FATAL [myhost:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
      java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
      at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
      at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
      at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
      at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
      at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
      at java.lang.Thread.run(Thread.java:745)

      Or

      2000-01-01 01:01:01,999 FATAL [myhost:16000.activeMasterManager] master.HMaster: Failed to become active master
      java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
      at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
      at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
      at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
      at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
      at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
      at java.lang.Thread.run(Thread.java:745)

      Env:

      hbase 1.1.8

      Root Cause:

      When the HBase master is starting, it assigns the meta table first and then assigns the other tables.

      So hbase:namespace is treated the same as any other table in this assignment phase.

      If there are too many tables or regions, the default 300000ms (5 minutes) may not be enough.

      Solution:

      Increase hbase.master.namespace.init.timeout in hbase-site.xml and restart the HBase Master.
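      For example, the property to add or update in hbase-site.xml (the 600000ms value is only an illustration; size it to the number of tables and regions):
      <!-- example value only: 10 minutes -->
      <property>
      <name>hbase.master.namespace.init.timeout</name>
      <value>600000</value>
      </property>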


       


      What does "enable_table_replication" do internally in Hbase replication?


      Goal:

      This article explains what the command "enable_table_replication" does internally in HBase replication by looking into the source code.

      It also explains the difference between the below 2 commands, which are shown in different articles.

      hbase shell> enable_table_replication "t1"

      vs.

      hbase shell> disable 't1'
      hbase shell> alter 't1', {NAME => 'column_family_name', REPLICATION_SCOPE => '1'}
      hbase shell> enable 't1'

      Env:

      Hbase 1.1.8

      Analysis:

      1. Hbase Source code analysis for "enable_table_replication"

      a. "enable_table_replication" is a ruby command in hbase shell

      Inside hbase-shell/src/main/ruby/shell/commands/enable_table_replication.rb,

      it is calling replication_admin.enable_tablerep(table_name).

      b. "enable_tablerep"

      Inside hbase-shell/src/main/ruby/hbase/replication_admin.rb,

      it is calling @replication_admin.enableTableRep(tableName).

      c. "enableTableRep"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      it is calling:

          checkAndSyncTableDescToPeers(tableName, splits);
      setTableRep(tableName, true);

      d. "checkAndSyncTableDescToPeers"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      it creates the same table on the peer when it does not exist, and throws an exception if the table exists on the peer cluster but the descriptors are not the same.

      e. "setTableRep"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      if (isTableRepEnabled(htd) ^ isRepEnabled) {
        boolean isOnlineSchemaUpdateEnabled =
            this.connection.getConfiguration()
                .getBoolean("hbase.online.schema.update.enable", true);
        if (!isOnlineSchemaUpdateEnabled) {
          admin.disableTable(tableName);
        }
        for (HColumnDescriptor hcd : htd.getFamilies()) {
          hcd.setScope(isRepEnabled ? HConstants.REPLICATION_SCOPE_GLOBAL
              : HConstants.REPLICATION_SCOPE_LOCAL);
        }
        admin.modifyTable(tableName, htd);
        if (!isOnlineSchemaUpdateEnabled) {
          admin.enableTable(tableName);
        }
      }

      Basically it checks the value of hbase.online.schema.update.enable (default=true).

      If hbase.online.schema.update.enable=true, it modifies the REPLICATION_SCOPE for ALL column families online (to GLOBAL).

      Otherwise, it will first disable the table, modify the REPLICATION_SCOPE for ALL column families, and then enable the table.

      2. Differences

      Based on the above analysis, "enable_table_replication" can help create the table on the target peer if it does not exist, and detect differences in the table descriptors if it does.

      It modifies the REPLICATION_SCOPE for ALL column families.

      It checks whether hbase.online.schema.update.enable=true, and then decides if disabling/enabling the table is needed.
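      To verify the effect on a table, one quick check (using table "t1" from the example commands above) is to describe it in hbase shell and confirm REPLICATION_SCOPE => '1' on every column family:
      hbase shell> describe 't1'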


      Spark Code -- How to replace Null values in DataFrame/Dataset


      Goal:

      This article shares some Scala example code to explain how to replace Null values in a DataFrame/Dataset.

      Solution:

      Note: As per the code and API for org.apache.spark.sql, DataFrame is basically Dataset[Row].

      So going forward, we always check the code or API for Dataset when researching DataFrame/Dataset.

      Dataset has an untyped transformation named "na" which returns DataFrameNaFunctions:

      def na: DataFrameNaFunctions 

      DataFrameNaFunctions has methods named "fill" with different signatures to replace NULL values for different datatype columns.

      Let's create a sample Dataframe firstly as the data source:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

      val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
      Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
      Row("Jerry ",null,"Brown",null,5000.67,null,false),
      Row("Henry ","B","Jones",66666,null,20015464564L,true)
      )

      val simpleSchema = StructType(Array(
      StructField("firstname",StringType,true),
      StructField("middlename",StringType,true),
      StructField("lastname",StringType,true),
      StructField("zipcode", IntegerType, true),
      StructField("salary", DoubleType, true),
      StructField("account", LongType, true),
      StructField("isAlive", BooleanType, true)
      ))

      val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

      Data source and its schema look as below:

      scala> df.printSchema()
      root
      |-- firstname: string (nullable = true)
      |-- middlename: string (nullable = true)
      |-- lastname: string (nullable = true)
      |-- zipcode: integer (nullable = true)
      |-- salary: double (nullable = true)
      |-- account: long (nullable = true)
      |-- isAlive: boolean (nullable = true)


      scala> df.show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      1. Replace Null in ALL numeric columns.

      Here it includes ALL IntegerType, DoubleType and LongType columns.

      scala> df.na.fill(0).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| 0|5000.67| 0| false|
      | Henry | B| Jones| 66666| 0.0|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

2. Replace Null in specified numeric columns.

      For example, include only the numeric column named "account".

      scala> df.na.fill(0,Array("account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| 0| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      3. Replace Null in ALL string columns.

      scala> df.na.fill("").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | | Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      4. Replace Null in ALL boolean columns.

      scala> df.na.fill(true).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| true|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+
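
In addition to the overloads above, "fill" also accepts a Map so that different columns can be replaced with different values in one call. Below is a minimal sketch against the same df; the replacement values are just illustrative choices.

df.na.fill(Map("middlename" -> "unknown", "zipcode" -> 0, "salary" -> 0.0)).show()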

       

      Note: Here is the Complete Sample Code.

      Spark Code -- How to drop Null values in DataFrame/Dataset


      Goal:

      This article shares some Scala example codes to explain how to drop Null values in DataFrame/Dataset.

      Solution:

      DataFrameNaFunctions has methods named "drop" with different signatures to drop NULL values under different scenarios.

      Let's create a sample Dataframe firstly as the data source:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

      val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
      Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
      Row("Jerry ",null,"Brown",null,5000.67,null,false),
      Row("Henry ","B","Jones",66666,null,20015464564L,true)
      )

      val simpleSchema = StructType(Array(
      StructField("firstname",StringType,true),
      StructField("middlename",StringType,true),
      StructField("lastname",StringType,true),
      StructField("zipcode", IntegerType, true),
      StructField("salary", DoubleType, true),
      StructField("account", LongType, true),
      StructField("isAlive", BooleanType, true)
      ))

      val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

      Data source and its schema look as below:

      scala> df.printSchema()
      root
      |-- firstname: string (nullable = true)
      |-- middlename: string (nullable = true)
      |-- lastname: string (nullable = true)
      |-- zipcode: integer (nullable = true)
      |-- salary: double (nullable = true)
      |-- account: long (nullable = true)
      |-- isAlive: boolean (nullable = true)


      scala> df.show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      1. Drop rows containing NULL in any columns.(version 1)

      Here only one row does not have NULL in any columns.

      scala> df.na.drop().show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      2. Drop rows containing NULL in any columns. (version 2)

      Same as above. This is just another version.

      scala> df.na.drop("any").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+

       3. Drop rows containing NULL in all columns.

      Here it shows all rows because there is no such all-NULL rows.

      scala> df.na.drop("all").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

       4. Drop rows containing NULL in any of specified column(s).

      scala> df.na.drop(Seq("salary","account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      +---------+----------+--------+-------+-------+-----------+-------+

      5. Drop rows containing NULL in all of specified column(s).

      scala> df.na.drop("all",Seq("salary","account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      6. Drop rows containing less than minNonNulls non-null values. 

      It means we keep the rows with at least minNonNulls non-null values. 

      Here I want to keep the rows with all 7 non-null values.

      scala> df.na.drop(7).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+
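
There is also a "drop" overload that combines minNonNulls with a list of columns: it keeps a row only if that row has at least minNonNulls non-null values among the specified columns. A minimal sketch against the same df:

df.na.drop(2, Seq("salary","account")).show()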

      Note: Here is the Complete Sample Code.

      Spark Code -- Use date_format() to convert timestamp to String


      Goal:

      This article shares some Scala example codes to explain how to use date_format() to convert timestamp to String.

      Solution:

date_format() is one function of org.apache.spark.sql.functions to convert a date/timestamp to String.

This is the doc for the datetime pattern.

      Here is a simple example to show this in spark-sql way.

      spark.sql("""
      SELECT current_timestamp() as ts,
      date_format(current_timestamp(),"yyyy-MM-dd") as `yyyy-MM-dd`,
      date_format(current_timestamp(),"MMM") as `MMM`,
      date_format(current_timestamp(),"MMMM") as `MMMM`,
      date_format(current_timestamp(),"d") as `d`,
      date_format(current_timestamp(),"E") as `E`,
      date_format(current_timestamp(),"EEEE") as `EEEE`,
      date_format(current_timestamp(),"HH:mm:ss.S") as `HH:mm:ss.S`,
      date_format(current_timestamp(),"Z") as `Z`,
      date_format(current_timestamp(),"z") as `z`
      """).show()

      The output is:

      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
      | ts|yyyy-MM-dd|MMM| MMMM| d| E| EEEE| HH:mm:ss.S| Z| z|
      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
      |2021-01-28 16:51:...|2021-01-28|Jan|January| 28|Thu|Thursday|16:51:48.802|-0800|PST|
      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
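
The same conversion can also be done through the DataFrame API with the date_format() function from org.apache.spark.sql.functions; a minimal sketch:

import org.apache.spark.sql.functions._

spark.range(1).select(
  current_timestamp().as("ts"),
  date_format(current_timestamp(), "yyyy-MM-dd").as("yyyy-MM-dd"),
  date_format(current_timestamp(), "HH:mm:ss.S").as("HH:mm:ss.S")
).show(false)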


      Spark Tuning -- Use Partition Discovery feature to do partition pruning


      Goal:

      This article explains how to use Partition Discovery feature to do partition pruning.

      Solution:

If the data directories are organized in the same way that Hive partitions are, Spark can discover the partition column(s) using the Partition Discovery feature.

After that, queries on top of the partitioned table can do partition pruning.

      Below is one example:

      1. Create a DataFrame based on sample data and add a new duplicate column.

      val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/*.csv")
      //Add a new column named "AnotherCountry" to have the same value as column "Country" so that we can compare the different query plan.
      val newdf = df.withColumn("AnotherCountry", expr("Country"))

      2.  Save the DataFrame as partitioned orc files.

      val targetdir = "/tmp/test_partition_pruning/newdf"
      newdf.write.mode("overwrite").format("orc").partitionBy("Country").save(targetdir)

      3. Let's take a look at the target directory.

      newdf.write.mode("overwrite").format("orc").partitionBy("Country").save(targetdir)
      val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(new Path("/tmp/test_partition_pruning/newdf")).filter(_.isDir).map(_.getPath).foreach(println)

      Output is:

      maprfs:///tmp/test_partition_pruning/newdf/Country=Australia
      maprfs:///tmp/test_partition_pruning/newdf/Country=Netherlands
      maprfs:///tmp/test_partition_pruning/newdf/Country=Canada
      maprfs:///tmp/test_partition_pruning/newdf/Country=Italy
      maprfs:///tmp/test_partition_pruning/newdf/Country=Denmark
      maprfs:///tmp/test_partition_pruning/newdf/Country=Iceland
      ...

      4. "good" query uses partition pruning

      val readdf = spark.read.format("orc").load(targetdir)
      readdf.createOrReplaceTempView("readdf")

      val goodsql = "SELECT * FROM readdf WHERE Country = 'Australia'"
      val goodresult = spark.sql(goodsql)
      goodresult.explain
      println(s"Result: ${goodresult.count()} ")

      Output:

      == Physical Plan ==
      *(1) FileScan orc [InvoiceNo#53,StockCode#54,Description#55,Quantity#56,InvoiceDate#57,UnitPrice#58,CustomerID#59,AnotherCountry#60,Country#61] Batched: true, Format: ORC, Location: InMemoryFileIndex[maprfs:///tmp/test_partition_pruning/newdf], PartitionCount: 1, PartitionFilters: [isnotnull(Country#61), (Country#61 = Australia)], PushedFilters: [], ReadSchema: struct<InvoiceNo:string,StockCode:string,Description:string,Quantity:int,InvoiceDate:timestamp,Un...

      Result: 1259

      5. "Bad" query can not use partition pruning

      val badsql  = "SELECT * FROM readdf WHERE AnotherCountry = 'Australia'"
      val badresult = spark.sql(badsql)
      badresult.explain
      println(s"Result: ${badresult.count()} ")

      Output:

      == Physical Plan ==
      *(1) Project [InvoiceNo#53, StockCode#54, Description#55, Quantity#56, InvoiceDate#57, UnitPrice#58, CustomerID#59, AnotherCountry#60, Country#61]
      +- *(1) Filter (isnotnull(AnotherCountry#60) && (AnotherCountry#60 = Australia))
      +- *(1) FileScan orc [InvoiceNo#53,StockCode#54,Description#55,Quantity#56,InvoiceDate#57,UnitPrice#58,CustomerID#59,AnotherCountry#60,Country#61] Batched: true, Format: ORC, Location: InMemoryFileIndex[maprfs:///tmp/test_partition_pruning/newdf], PartitionCount: 38, PartitionFilters: [], PushedFilters: [IsNotNull(AnotherCountry), EqualTo(AnotherCountry,Australia)], ReadSchema: struct<InvoiceNo:string,StockCode:string,Description:string,Quantity:int,InvoiceDate:timestamp,Un...

      Result: 1259

      Analysis:

      1. Explain plan

      From above explain plans, it is pretty obvious why the "good" query uses partition pruning while the "bad" query does not -- column "Country" is the partition key.

      The "good" query can actually push the "Filter" inside "FileScan" as "PartitionFilters".

      So it only needs to scan 1 partition(directory):

      PartitionCount: 1, PartitionFilters: [isnotnull(Country#61), (Country#61 = Australia)]

      However the "bad" query has to scan all the 38 partitions(direcotries) firstly and then apply Filter:

      PartitionCount: 38, PartitionFilters: []
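
As a side note, when only one partition is ever needed, that partition sub-directory can also be read directly so the other directories are never scanned; the partition column itself will then not appear in the schema because it is encoded in the path. A minimal sketch, assuming the same targetdir as above:

// Read only the Country=Australia partition directory
val onlyAustralia = spark.read.format("orc").load(targetdir + "/Country=Australia")
onlyAustralia.printSchema()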

      2. Event log/Web UI

      By only looking at the related Stage for the "good" query, the sum of input Size is only 80+KB while the sum of records = the final result = 1259.


       

       

       

       

      By only looking at the related Stage for the "bad" query, the sum of input Size is 2+MB while the sum of records = the final result = 1259.


       

       

       

       

      Of course, the execution time also has large differences. 

      Note: Here is the Complete Sample Code.


      Spark Tuning -- Column Projection for Parquet


      Goal:

      This article explains the column projection for parquet format(or other columnar format) in Spark.

      Solution:

      Spark can do column projection for columnar format data such as Parquet.

      The idea is to only read the needed columns instead of reading all of the columns.

      This can reduce lots of I/O needed to improve the performance.

      Below is one example.

Note: To show the performance difference for column projection, I disabled the Parquet filter pushdown feature by setting spark.sql.parquet.filterPushdown=false in my configuration.

I will discuss the Parquet filter pushdown feature in another article.

1. Save a sample DataFrame as parquet files and register it as a temp view.

val df = spark.read.json("/data/activity-data/")
val targetdir = "/tmp/test_column_projection/newdf"
df.write.mode("overwrite").format("parquet").save(targetdir)
// Register the parquet files as a temp view named "readdf", which the queries below use
val readdf = spark.read.format("parquet").load(targetdir)
readdf.createOrReplaceTempView("readdf")

      2. Select only 1 column

      val somecols  = "SELECT Device FROM readdf WHERE Model='something_not_exist'"
      val goodresult = spark.sql(somecols)
      goodresult.explain
      goodresult.collect

      Output:

      scala> goodresult.explain
      == Physical Plan ==
      *(1) Project [Device#48]
      +- *(1) Filter (isnotnull(Model#50) && (Model#50 = something_not_exist))
      +- *(1) FileScan parquet [Device#48,Model#50] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Model), EqualTo(Model,something_not_exist)], ReadSchema: struct<Device:string,Model:string>

      scala> goodresult.collect
      res5: Array[org.apache.spark.sql.Row] = Array()

      3. Select ALL columns

      val allcols = "SELECT * FROM readdf where Model='something_not_exist'"
      val badresult = spark.sql(allcols)
      badresult.explain
      badresult.collect

       Output:

      scala> badresult.explain
      == Physical Plan ==
      *(1) Project [Arrival_Time#46L, Creation_Time#47L, Device#48, Index#49L, Model#50, User#51, gt#52, x#53, y#54, z#55]
      +- *(1) Filter (isnotnull(Model#50) && (Model#50 = something_not_exist))
      +- *(1) FileScan parquet [Arrival_Time#46L,Creation_Time#47L,Device#48,Index#49L,Model#50,User#51,gt#52,x#53,y#54,z#55] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Model), EqualTo(Model,something_not_exist)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      scala> badresult.collect
      res7: Array[org.apache.spark.sql.Row] = Array()

      Analysis:

      1. Explain plan

During FileScan, we can see only the needed columns are read due to the column projection feature:

      FileScan parquet [Device#48,Model#50]

      vs

      FileScan parquet [Arrival_Time#46L,Creation_Time#47L,Device#48,Index#49L,Model#50,User#51,gt#52,x#53,y#54,z#55]
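
The same narrow FileScan can be reproduced with the DataFrame API by selecting only the needed columns; a minimal sketch, assuming the same readdf as above:

// Only "Device" and "Model" should appear in the FileScan read schema
readdf.filter("Model = 'something_not_exist'").select("Device").explain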

      2. Event log/Web UI

      The "SELECT only 1 column"'s stage shows sum of Input Size=868.3KB.

       


       

       

       

       

      The "SELECT ALL columns"'s stage shows sum of Input Size=142.3MB.   

       


       

       

       

       

       Note: Here is the Complete Sample Code.


      Spark Tuning -- Predicate Pushdown for Parquet


      Goal:

      This article explains the Predicate Pushdown for Parquet in Spark.

      Solution:

Spark can push the predicate down into the Parquet scanning phase so that it can reduce the amount of data to be read.

This is done by checking the metadata of the Parquet files to filter out unnecessary data.

      Note: Refer to this blog on How to use pyarrow to view the metadata information inside a Parquet file.

      This feature is controlled by a parameter named spark.sql.parquet.filterPushdown (default is true).
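
Since this is a SQL conf, it can also be toggled at runtime for the current session instead of only in the SparkSession builder; a minimal sketch in spark-shell:

// Check the current value, then disable/enable the feature for this session
spark.conf.get("spark.sql.parquet.filterPushdown")
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.conf.set("spark.sql.parquet.filterPushdown", "true")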

      Let's use the parquet files created in another blog for example.

      1. Create a DataFrame on parquet files

      val targetdir = "/tmp/test_column_projection/newdf"
      val readdf = spark.read.format("parquet").load(targetdir)
      readdf.createOrReplaceTempView("readdf")

      2. Let's look at the data distribution for column "Index".

      scala> spark.sql("SELECT min(Index), max(Index), count(distinct Index),count(*) FROM readdf").show
      +----------+----------+---------------------+--------+
      |min(Index)|max(Index)|count(DISTINCT Index)|count(1)|
      +----------+----------+---------------------+--------+
      | 0| 396342| 396343| 6240991|
      +----------+----------+---------------------+--------+

      As we know, the data range of this column "Index" is 0~396342.

Knowing this, we can design the tests below to show the different performance results for different filters.

      3. Query 1 and its explain plan

      val q1  = "SELECT * FROM readdf WHERE Index=20000"
      val result1 = spark.sql(q1)
      result1.explain
      result1.collect

      Query 1 will have to scan lots of data because the "Index=20000" data is in most of the parquet chunks.

      The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 20000))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,20000)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      4. Query 2 and its explain plan

      val q2  = "SELECT * FROM readdf where Index=9999999999"
      val result2 = spark.sql(q2)
      result2.explain
      result2.collect

Query 2 only needs to scan very little data because "Index=9999999999" is outside the value range of that column.

      The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 9999999999))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      5. Query 3 and its explain plan after disabling spark.sql.parquet.filterPushdown

Everything is the same as Query 2; the only difference is that we manually disabled this feature by setting the below in the config:

config("spark.sql.parquet.filterPushdown",false)

Because we disabled this feature, it has to scan all the parquet data.

The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 9999999999))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      Analysis:

      1. Explain plan

      As we can see, all of the explain plans look the same. 

      Even after we disabled spark.sql.parquet.filterPushdown, the explain plan did not show any difference between Query 2 and Query 3.

This means that, at least from the query plan, we cannot tell whether the predicate is actually pushed down or not.

      All of the explain plans show there is predicate push down:

      PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)]

Note: these tests were done in Spark 2.4.4; this behavior may change in future releases.

      2. Event log/Web UI

Query 1's stage shows sum of Input Size is 142.3MB and sum of Records is 6240991.

Query 2's stage shows sum of Input Size is 44.4KB and sum of Records is 0.

Query 3's stage shows sum of Input Size is 142.3MB and sum of Records is 6240991.

The above metrics clearly show the selectivity of this predicate pushdown feature, based on both the filter and the metadata of the Parquet files.

      The performance difference between Query 2 and Query 3 shows how powerful this feature is.

Note: If, based on the filter, the metadata of all Parquet files shows that most/all of the data matches, then this feature may not provide good selectivity. So data distribution also matters here.
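
One common way to make the row-group min/max statistics more selective on a given filter column is to sort the data on that column before writing, so that each file/row group covers a narrow value range. A minimal sketch (not part of the original tests; the output path is just a hypothetical example):

// Rewrite the data sorted by "Index" so each row group covers a narrow Index range
val sortedDir = "/tmp/test_column_projection/newdf_sorted_by_index"
readdf.sort("Index").write.mode("overwrite").format("parquet").save(sortedDir)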

      Note: Here is the Complete Sample Code



      How to use pyarrow to view the metadata information inside a Parquet file


      Goal:

      This article explains how to use PyArrow to view the metadata information inside a Parquet file.

      Env:

      CentOS 7

      Solution:

      1. Create a Python 3 virtual environment

This step is needed because the default Python version is 2.x on CentOS/RedHat 7, which is too old to install the latest PyArrow version.

Using Python 3 and its pip3 is the way to go.

However, if we just use "alternatives" to switch the system Python to python3, it may break other tools such as "yum" which depend on python2.

Using a virtual environment is the easiest way to keep both python2 and python3 on CentOS 7.

      python3 -m venv .venv
      . .venv/bin/activate

       2. Install PyArrow and its dependencies

      pip install --upgrade pip setuptools
      pip install Cython
      pip install pyarrow

      3.  Read the metadata inside a Parquet file

      >>> import pyarrow.parquet as pq
      >>> parquet_file = pq.ParquetFile('/.../part-00000-67861019-20bb-4396-96f8-146141351ff2-c000.snappy.parquet')

      >>> parquet_file.metadata
      <pyarrow._parquet.FileMetaData object at 0x7f8014250bf8>
      created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
      num_columns: 10
      num_rows: 546097
      num_row_groups: 1
      format_version: 1.0
      serialized_size: 1886

      >>> parquet_file.metadata.row_group(0)
      <pyarrow._parquet.RowGroupMetaData object at 0x7f808aaf4f98>
      num_columns: 10
      num_rows: 546097
      total_byte_size: 17515040

      >>> parquet_file.metadata.row_group(0).column(3)
      <pyarrow._parquet.ColumnChunkMetaData object at 0x7f801356cf48>
      file_offset: 6588315
      file_path:
      physical_type: INT64
      num_values: 546097
      path_in_schema: Index
      is_stats_set: True
      statistics:
      <pyarrow._parquet.Statistics object at 0x7f8013fd2ea8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
      compression: SNAPPY
      encodings: ('BIT_PACKED', 'RLE', 'PLAIN')
      has_dictionary_page: False
      dictionary_page_offset: None
      data_page_offset: 6588315
      total_compressed_size: 2277936
      total_uncompressed_size: 4369155

      >>> parquet_file.metadata.row_group(0).column(3).statistics
      <pyarrow._parquet.Statistics object at 0x7f801356cef8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE

      From above information, we can tell that:

• The Parquet file was written by parquet-mr version 1.10.1, and its format version is 1.0.
• It has only 1 row group inside.
• It has 10 columns and 546097 rows.
• The 4th column (.column(3)), named "Index", is an INT64 column with min=0 and max=396316.


      Spark Tuning -- How to use SparkMeasure to measure Spark job metrics


      Goal:

      This article explains how to use SparkMeasure to measure Spark job metrics.

      Env:

      Spark 2.4.4 with Scala 2.11.12 

      SparkMeasure 0.17

      Concept:

SparkMeasure is a very cool tool to collect aggregated stage-level or task-level metrics for Spark jobs or queries. Basically, it registers customized Spark listeners.

Note: Collecting at task level has additional performance overhead compared to collecting at stage level. Unless you want to study skew effects across tasks, I would suggest collecting at stage level.

      Regarding where those metrics come from, we can look into the Spark source code under "core/src/main/scala/org/apache/spark/executor" folder.

      You will find the metrics explanation inside TaskMetrics.scala, ShuffleReadMetrics.scala, ShuffleWriteMetrics.scala, etc.

      For example:

  /**
   * Time the executor spends actually running the task (including fetching shuffle data).
   */
  def executorRunTime: Long = _executorRunTime.sum

  /**
   * CPU Time the executor spends actually running the task
   * (including fetching shuffle data) in nanoseconds.
   */
  def executorCpuTime: Long = _executorCpuTime.sum

Or you can find the explanation of those metrics in the Doc.

      Installation:

In this post, we will use spark-shell or spark-submit to test, so we just need to follow this doc to build or download the jar file.

Note: Before downloading/building the jar, make sure the jar matches your Spark and Scala versions.

      a. Download the Jar from Maven Central

      For example, based on my spark and scala version, I will choose below version:

      wget https://repo1.maven.org/maven2/ch/cern/sparkmeasure/spark-measure_2.11/0.17/spark-measure_2.11-0.17.jar

      b. Build the Jar using sbt from source code

      git clone https://github.com/lucacanali/sparkmeasure
      cd sparkmeasure
      sbt +package
      ls -l target/scala-2.11/spark-measure*.jar # location of the compiled jar

      Solution:

      In this post, we will use the sample data and queries from another post "Predicate Pushdown for Parquet".

      1.  Interactive Mode using Spark-shell for single job/query

      Please refer to this doc for Interactive Mode for Spark-shell.
      spark-shell --jars spark-measure_2.11-0.17.jar --master yarn --deploy-mode client --executor-memory 1G --num-executors 4

      Stage metrics:

      val stageMetrics = new ch.cern.sparkmeasure.StageMetrics(spark)
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      stageMetrics.runAndMeasure(sql(q1).show)

      Output:

      21/02/04 15:08:55 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
      21/02/04 15:08:55 WARN StageMetrics: Stage metrics data refreshed into temp view PerfStageMetrics
      Scheduling mode = FIFO
      Spark Context default degree of parallelism = 4
      Aggregated Spark stage metrics:
      numStages => 2
      numTasks => 4
      elapsedTime => 3835 (4 s)
      stageDuration => 3827 (4 s)
      executorRunTime => 4757 (5 s)
      executorCpuTime => 3672 (4 s)
      executorDeserializeTime => 772 (0.8 s)
      executorDeserializeCpuTime => 510 (0.5 s)
      resultSerializationTime => 0 (0 ms)
      jvmGCTime => 239 (0.2 s)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      resultSize => 5441 (5.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149260233 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      Task Metrics:

      val taskMetrics = new ch.cern.sparkmeasure.TaskMetrics(spark)
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      taskMetrics.runAndMeasure(spark.sql(q1).show)

      Output:

      21/02/04 16:52:59 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
      21/02/04 16:52:59 WARN TaskMetrics: Stage metrics data refreshed into temp view PerfTaskMetrics

      Scheduling mode = FIFO
      Spark Contex default degree of parallelism = 4
      Aggregated Spark task metrics:
      numtasks => 4
      elapsedTime => 3896 (4 s)
      duration => 5268 (5 s)
      schedulerDelayTime => 94 (94 ms)
      executorRunTime => 4439 (4 s)
      executorCpuTime => 3561 (4 s)
      executorDeserializeTime => 734 (0.7 s)
      executorDeserializeCpuTime => 460 (0.5 s)
      resultSerializationTime => 1 (1 ms)
      jvmGCTime => 237 (0.2 s)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      gettingResultTime => 0 (0 ms)
      resultSize => 2183 (2.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149260233 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      2.  Interactive Mode using Spark-shell for multiple jobs/queries

      Take Stage Metrics for example:

      val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
      stageMetrics.begin()

      //Run multiple jobs/queries
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      val q2 = "SELECT * FROM readdf where Index=9999999999"
      spark.sql(q1).show()
      spark.sql(q2).show()

      stageMetrics.end()
      stageMetrics.printReport()

      Output:

      21/02/04 17:00:59 WARN StageMetrics: Stage metrics data refreshed into temp view PerfStageMetrics

      Scheduling mode = FIFO
      Spark Context default degree of parallelism = 4
      Aggregated Spark stage metrics:
      numStages => 4
      numTasks => 8
      elapsedTime => 3242 (3 s)
      stageDuration => 1094 (1 s)
      executorRunTime => 1779 (2 s)
      executorCpuTime => 942 (0.9 s)
      executorDeserializeTime => 96 (96 ms)
      executorDeserializeCpuTime => 37 (37 ms)
      resultSerializationTime => 1 (1 ms)
      jvmGCTime => 42 (42 ms)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      resultSize => 5441 (5.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149305675 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      Further more, below command can print additional accumulables metrics (including SQL metrics):

      scala> stageMetrics.printAccumulables()
      21/02/04 17:01:26 WARN StageMetrics: Accumulables metrics data refreshed into temp view AccumulablesStageMetrics

      Aggregated Spark accumulables of type internal.metric. Sum of values grouped by metric name
      Name => sum(value) [group by name]

      executorCpuTime => 943 (0.9 s)
      executorDeserializeCpuTime => 39 (39 ms)
      executorDeserializeTime => 96 (96 ms)
      executorRunTime => 1779 (2 s)
      input.bytesRead => 149305675 (142.0 MB)
      input.recordsRead => 6240991
      jvmGCTime => 42 (42 ms)
      resultSerializationTime => 1 (1 ms)
      resultSize => 12780 (12.0 KB)

      SQL Metrics and other non-internal metrics. Values grouped per accumulatorId and metric name.
      Accid, Name => max(value) [group by accId, name]

      146, duration total => 1422 (1 s)
      147, number of output rows => 18
      148, number of output rows => 6240991
      151, scan time total => 1359 (1 s)
      202, duration total => 200 (0.2 s)
      207, scan time total => 198 (0.2 s)

      3.  Flight Recorder Mode

      Please refer to this doc for Flight Recorder Mode.

This mode does not touch your code/program; you only need to add the jar file and a few configurations when submitting the job.

      Take Stage Metrics for example:

      spark-submit --conf spark.driver.extraClassPath=./spark-measure_2.11-0.17.jar  \
      --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
      --conf spark.sparkmeasure.outputFormat=json \
      --conf spark.sparkmeasure.outputFilename="/tmp/stageMetrics_flightRecorder" \
      --conf spark.sparkmeasure.printToStdout=false \
      --class "PredicatePushdownTest" \
      --master yarn \
      ~/sbt/SparkScalaExample/target/scala-2.11/sparkscalaexample_2.11-1.0.jar

      In the output, it will show:

      WARN FlightRecorderStageMetrics: Writing Stage Metrics data serialized as json to /tmp/stageMetrics_flightRecorder

      The json output file looks as:

      $ cat /tmp/stageMetrics_flightRecorder
      [ {
      "jobId" : 0,
      "jobGroup" : null,
      "stageId" : 0,
      "name" : "load at PredicatePushdownTest.scala:16",
      "submissionTime" : 1612488772250,
      "completionTime" : 1612488773352,
      "stageDuration" : 1102,
      "numTasks" : 1,
      "executorRunTime" : 352,
      "executorCpuTime" : 141,
      "executorDeserializeTime" : 589,
      "executorDeserializeCpuTime" : 397,
      "resultSerializationTime" : 3,
      "jvmGCTime" : 95,
      "resultSize" : 1969,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 0,
      "bytesRead" : 0,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      }, {
      "jobId" : 1,
      "jobGroup" : null,
      "stageId" : 1,
      "name" : "collect at PredicatePushdownTest.scala:25",
      "submissionTime" : 1612488774600,
      "completionTime" : 1612488776522,
      "stageDuration" : 1922,
      "numTasks" : 4,
      "executorRunTime" : 4962,
      "executorCpuTime" : 4446,
      "executorDeserializeTime" : 1679,
      "executorDeserializeCpuTime" : 1215,
      "resultSerializationTime" : 2,
      "jvmGCTime" : 309,
      "resultSize" : 7545,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 6240991,
      "bytesRead" : 149260233,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      }, {
      "jobId" : 2,
      "jobGroup" : null,
      "stageId" : 2,
      "name" : "collect at PredicatePushdownTest.scala:30",
      "submissionTime" : 1612488776656,
      "completionTime" : 1612488776833,
      "stageDuration" : 177,
      "numTasks" : 4,
      "executorRunTime" : 427,
      "executorCpuTime" : 261,
      "executorDeserializeTime" : 89,
      "executorDeserializeCpuTime" : 27,
      "resultSerializationTime" : 0,
      "jvmGCTime" : 0,
      "resultSize" : 5884,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 0,
      "bytesRead" : 45442,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      } ]
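
Since the flight-recorder output is a single JSON array (not JSON Lines), it can be loaded back into Spark for further analysis using the multiLine option; a minimal sketch, assuming the file sits on the local filesystem of the node running spark-shell:

val fr = spark.read.option("multiLine", "true").json("file:///tmp/stageMetrics_flightRecorder")
fr.select("stageId", "name", "stageDuration", "recordsRead", "bytesRead").show(false)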

      References:

      On Measuring Apache Spark Workload Metrics for Performance Troubleshooting

      Example analysis of Spark metrics collected with sparkMeasure


      How to generate TPC-DS data and run TPC-DS performance benchmark for Spark


      Goal:

      This article explains how to use databricks/spark-sql-perf and databricks/tpcds-kit to generate TPC-DS data for Spark and run TPC-DS performance benchmark.

      Env:

      Spark 2.4.4 with Scala 2.11.12

      MapR 6.1

      Solution:

      1. Download and build the databricks/tpcds-kit from github.

      sudo yum install gcc make flex bison byacc git
      cd /tmp/
      git clone https://github.com/databricks/tpcds-kit.git
      cd tpcds-kit/tools
      make OS=LINUX

• Note: This should be installed on all cluster nodes at the same location.

Here we downloaded it to "/tmp/tpcds-kit" on ALL cluster nodes.

      2. Download and build the databricks/spark-sql-perf from github.

      git clone https://github.com/databricks/spark-sql-perf
      cd spark-sql-perf

      • Note: Make sure your Spark version and Scala version match this version of spark-sql-perf.

      Here I am using Spark 2.4.4 with Scala 2.11.12. So I have to checkout an older branch:

      git checkout remotes/origin/newversion

      Now the build.sbt contains below entries which should be compatible with my env:

      scalaVersion := "2.11.8"
      sparkVersion := "2.3.0"

      Then build:

      sbt +package

• Note: If you check out a much older branch of spark-sql-perf, say "remotes/origin/branch-0.4" which is based on Spark 2.0.1, then you may hit the below error when running the TPC-DS benchmark in step 6. This is because starting from Spark 2.2 there is no such method getExecutorStorageStatus in class org.apache.spark.SparkContext.
      java.lang.NoSuchMethodError: org.apache.spark.SparkContext.getExecutorStorageStatus()[Lorg/apache/spark/storage/StorageStatus;
      at com.databricks.spark.sql.perf.Benchmarkable$class.afterBenchmark(Benchmarkable.scala:63)

      3. create gendata.scala

      import com.databricks.spark.sql.perf.tpcds.TPCDSTables

      // Note: Declare "sqlContext" for Spark 2.x version
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)

      // Set:
      // Note: Here my env is using MapRFS, so I changed it to "hdfs:///tpcds".
      // Note: If you are using HDFS, the format should be like "hdfs://namenode:9000/tpcds"
      val rootDir = "hdfs:///tpcds" // root directory of location to create data in.

      val databaseName = "tpcds" // name of database to create.
      val scaleFactor = "10" // scaleFactor defines the size of the dataset to generate (in GB).
      val format = "parquet" // valid spark format like parquet "parquet".
      // Run:
      val tables = new TPCDSTables(sqlContext,
      dsdgenDir = "/tmp/tpcds-kit/tools", // location of dsdgen
      scaleFactor = scaleFactor,
      useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
      useStringForDate = false) // true to replace DateType with StringType


      tables.genData(
      location = rootDir,
      format = format,
      overwrite = true, // overwrite the data that is already there
      partitionTables = true, // create the partitioned fact tables
      clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
      filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
      tableFilter = "", // "" means generate all tables
      numPartitions = 20) // how many dsdgen partitions to run - number of input tasks.

      // Create the specified database
      sql(s"create database $databaseName")
      // Create metastore tables in a specified database for your data.
      // Once tables are created, the current database will be switched to the specified database.
      tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
      // Or, if you want to create temporary tables
      // tables.createTemporaryTables(location, format)

      // For CBO only, gather statistics on all columns:
      tables.analyzeTables(databaseName, analyzeColumns = true)  

      4. Run the gendata.scala using spark-shell

      spark-shell --jars ~/hao/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar \
      --master yarn \
      --deploy-mode client \
      --executor-memory 4G \
      --num-executors 4 \
      --executor-cores 2 \
      -i ~/hao/gendata.scala
      • Note: Tune --executor-memory , --num-executors and --executor-cores to make sure no OOM happens.
• Note: If we just need to generate 10GB of data, keep "numPartitions" in the above gendata.scala small (say, 20) to reduce the overhead of too many tasks.

      5. Confirm the data files and Hive tables are created.

      It should create 24 tables.

      Check Data:

      # hadoop fs -du -s -h /tpcds/*
      11.8 K /tpcds/call_center
      695.0 K /tpcds/catalog_page
      133.8 M /tpcds/catalog_returns
      1.1 G /tpcds/catalog_sales
      25.4 M /tpcds/customer
      4.7 M /tpcds/customer_address
      7.5 M /tpcds/customer_demographics
      1.8 M /tpcds/date_dim
      30.1 K /tpcds/household_demographics
      1.1 K /tpcds/income_band
      467.9 M /tpcds/inventory
      9.4 M /tpcds/item
      30.7 K /tpcds/promotion
      1.8 K /tpcds/reason
      2.3 K /tpcds/ship_mode
      18.3 K /tpcds/store
      190.4 M /tpcds/store_returns
      1.4 G /tpcds/store_sales
      1.1 M /tpcds/time_dim
      4.3 K /tpcds/warehouse
      7.7 K /tpcds/web_page
      69.7 M /tpcds/web_returns
      516.3 M /tpcds/web_sales
      13.1 K /tpcds/web_site

      Check Hive tables in hive CLI(or spark-sql):

      hive> use tpcds;
      OK
      Time taken: 0.011 seconds

      hive> show tables;
      OK
      call_center
      catalog_page
      catalog_returns
      catalog_sales
      customer
      customer_address
      customer_demographics
      date_dim
      household_demographics
      income_band
      inventory
      item
      promotion
      reason
      ship_mode
      store
      store_returns
      store_sales
      time_dim
      warehouse
      web_page
      web_returns
      web_sales
      web_site
      Time taken: 0.012 seconds, Fetched: 24 row(s)

      6. Run TPC-DS benchmark

      After the tables are created, we can run the 99 TPC-DS queries which are located under folder "./src/main/resources/tpcds_2_4/".

      Create runtpcds.scala:

      import com.databricks.spark.sql.perf.tpcds.TPCDS

      // Note: Declare "sqlContext" for Spark 2.x version
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)

      val tpcds = new TPCDS (sqlContext = sqlContext)
      // Set:
      val databaseName = "tpcds" // name of database with TPCDS data.
      sql(s"use $databaseName")
      val resultLocation = "/tmp/tpcds_results" // place to write results
      val iterations = 1 // how many iterations of queries to run.
      val queries = tpcds.tpcds2_4Queries // queries to run.
      val timeout = 24*60*60 // timeout, in seconds.
      // Run:
      val experiment = tpcds.runExperiment(
      queries,
      iterations = iterations,
      resultLocation = resultLocation,
      forkThread = true)
      experiment.waitForFinish(timeout)

      Run runtpcds.scala using spark-shell:

      spark-shell --jars ~/hao/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar \
      --driver-class-path /home/mapr/.ivy2/cache/com.typesafe.scala-logging/scala-logging-slf4j_2.10/jars/scala-logging-slf4j_2.10-2.1.2.jar:/home/mapr/.ivy2/cache/com.typesafe.scala-logging/scala-logging-api_2.11/jars/scala-logging-api_2.11-2.1.2.jar \
      --master yarn \
      --deploy-mode client \
      --executor-memory 2G \
      --driver-memory 4G \
      --num-executors 4 \
      -i ~/hao/runtpcds.scala
• Note: We need to include the scala-logging-slf4j and scala-logging-api jars, otherwise java.lang.ClassNotFoundException will show up for the related classes. The good thing is that those jars can be found in the ivy cache directories after building spark-sql-perf with "sbt +package".
• Note: "sql(s"use $databaseName")" should be put before declaring "val queries = tpcds.tpcds2_4Queries". Otherwise you will not see the explain plan for each query because it cannot find the tables in the default database. So the doc on GitHub should be corrected.
• Note: We need to increase --driver-memory to be large enough because broadcast joins need a lot of driver memory. Otherwise you may hit the below error when running q10.sql (a one-line workaround sketch follows after the error message):
      failure in runBenchmark: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. 
      As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
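
For reference, a minimal sketch of the first workaround (disabling broadcast joins for the current session) in spark-shell:

// Disable broadcast joins so the driver does not need to build/broadcast large tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")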

      7. Run customized query benchmark

If we do not want to run the whole TPC-DS benchmark, and only want to test the benchmark result for certain customized queries (e.g. a subset of the TPC-DS queries, or some other ad-hoc queries), we can do so. But before that, we need to understand the source code of this project first.

      Code Analysis

      Previous example uses "tpcds.tpcds2_4Queries" which is Seq[com.databricks.spark.sql.perf.Query].
      "tpcds" is the object for class "TPCDS".
      Because class "TPCDS" extends trait "Tpcds_2_4_Queries", so "tpcds2_4Queries" is actually the member of trait "Tpcds_2_4_Queries":
  val tpcds2_4Queries = queryNames.map { queryName =>
    val queryContent: String = IOUtils.toString(
      getClass().getClassLoader().getResourceAsStream(s"tpcds_2_4/$queryName.sql"))
    Query(queryName + "-v2.4", queryContent, description = "TPCDS 2.4 Query",
      executionMode = CollectResults)
  }
Initially, I mistook this "Query" for "com.databricks.spark.sql.perf.Query", so I failed to define the Query object correctly many times.
Later on I found that trait "Tpcds_2_4_Queries" actually extends the abstract class "Benchmark", which contains the factory object for benchmark queries as below:
  /** Factory object for benchmark queries. */
  case object Query {
    def apply(
        name: String,
        sqlText: String,
        description: String,
        executionMode: ExecutionMode = ExecutionMode.ForeachResults): Query = {
      new Query(name, sqlContext.sql(sqlText), description, Some(sqlText), executionMode)
    }

    def apply(
        name: String,
        dataFrameBuilder: => DataFrame,
        description: String): Query = {
      new Query(name, dataFrameBuilder, description, None, ExecutionMode.CollectResults)
    }
  }
After understanding the logic, if we want to customize the query, we just need to create a class that extends the abstract class "Benchmark".

      Example of customized query

import com.databricks.spark.sql.perf.{Benchmark, ExecutionMode}

// Customized a query
class customized_query extends Benchmark {
  import ExecutionMode._
  private val sqlText = "select * from customer limit 10"
  val q1 = Seq(Query(name = "my customized query", sqlText = sqlText, description = "check some customer info", executionMode = CollectResults))
}
val queries = new customized_query().q1
Everything else is the same as the previous example in step 6.

      8. View Benchmark results

      a. If experiment is still running, use "experiment.getCurrentResults".

      experiment.getCurrentResults.createOrReplaceTempView("result") 
      spark.sql("select substring(name,1,100) as Name, bround((parsingTime+analysisTime+optimizationTime+planningTime+executionTime)/1000.0,1) as Runtime_sec from result").show()

      Sample Output:

      +---------+-----------+
      | Name|Runtime_sec|
      +---------+-----------+
      | q1-v2.4| 21.1|
      | q2-v2.4| 13.2|
      | q3-v2.4| 6.0|
      | q4-v2.4| 135.1|
      | q5-v2.4| 38.9|
      | q6-v2.4| 43.4|
      | q7-v2.4| 10.6|
      | q8-v2.4| 9.9|
      | q9-v2.4| 51.7|
      | q10-v2.4| 25.8|
      | q11-v2.4| 92.3|
      | q12-v2.4| 6.8|
      | q13-v2.4| 12.5|
      |q14a-v2.4| 130.7|
      |q14b-v2.4| 91.3|
      | q15-v2.4| 8.8|
      | q16-v2.4| 30.8|
      | q17-v2.4| 46.6|
      | q18-v2.4| 14.2|
      | q19-v2.4| 7.9|
      +---------+-----------+
      only showing top 20 rows

      b. If experiment has ended, read the result json file.

      • Note: since the json file contains nested columns, we need to flatten the data using "explode" function.
      import org.apache.spark.sql.functions._
      val result = spark.read.json(resultLocation).filter("timestamp = 1612560709933").select(explode($"results").as("r"))
      result.createOrReplaceTempView("result")
      spark.sql("select substring(r.name,1,100) as Name, bround((r.parsingTime+r.analysisTime+r.optimizationTime+r.planningTime+r.executionTime)/1000.0,1) as Runtime_sec from result").show()

      Sample Output:

      +---------+-----------+
      | Name|Runtime_sec|
      +---------+-----------+
      | q1-v2.4| 21.1|
      | q2-v2.4| 13.2|
      | q3-v2.4| 6.0|
      | q4-v2.4| 135.1|
      | q5-v2.4| 38.9|
      | q6-v2.4| 43.4|
      | q7-v2.4| 10.6|
      | q8-v2.4| 9.9|
      | q9-v2.4| 51.7|
      | q10-v2.4| 25.8|
      | q11-v2.4| 92.3|
      | q12-v2.4| 6.8|
      | q13-v2.4| 12.5|
      |q14a-v2.4| 130.7|
      |q14b-v2.4| 91.3|
      | q15-v2.4| 8.8|
      | q16-v2.4| 30.8|
      | q17-v2.4| 46.6|
      | q18-v2.4| 14.2|
      | q19-v2.4| 7.9|
      +---------+-----------+
      only showing top 20 rows


      Spark Tuning -- Understand Cost Based Optimizer in Spark


      Goal:

      This article explains Spark CBO(Cost Based Optimizer) with examples and shares how to check the table statistics.

      Env:

      Spark 2.4.4

      MapR 6.1

      MySQL as backend database for Hive Metastore

      Concept:

Like in any traditional RDBMS, the goal of CBO is to determine the best query execution plan based on table statistics.

CBO was introduced in Spark 2.2. Before that, only the RBO (Rule Based Optimizer) was used.

Before using CBO, we need to collect table/column-level statistics (including histograms) using the ANALYZE TABLE command.

• Note: As of Spark 2.4.4, CBO is disabled by default and is controlled by the parameter spark.sql.cbo.enabled.
• Note: As of Spark 2.4.4, histogram statistics collection is disabled by default and is controlled by the parameter spark.sql.statistics.histogram.enabled.
• Note: Spark uses an Equal-Height Histogram instead of an Equal-Width Histogram.
• Note: As of Spark 2.4.4, the default number of histogram buckets is 254, which is controlled by the parameter spark.sql.statistics.histogram.numBins. (A spark-shell sketch of these settings follows these notes.)
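
A minimal spark-shell sketch of the settings mentioned in the notes above, using the standard SparkSession conf API (254 is already the default and is shown only for completeness):

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
// 254 is already the default number of histogram buckets
spark.conf.set("spark.sql.statistics.histogram.numBins", "254")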

What is included in column-level statistics?

      • For Numeric/Date/Timestamp type: Distinct Count, Max, Min, Null Count, Average Length, Max Length.
      • For String/Binary type: Distinct Count, Null Count, Average Length, Max Length.

      CBO uses logical optimization rules to optimize the logical plan. 

So if we want to examine the statistics inside a query explain plan, we can find them in the "Optimized Logical Plan" section.

      Solution:

Here we will use some simple query examples based on a test table named "customer" (generated by the TPC-DS tool shared in this post) to demonstrate CBO and statistics in Spark.

All of the SQL statements below are executed in spark-sql unless otherwise noted.

      1. Collect Table/Column statistics

      1.1 Table level statistics including total number of rows and data size:

      ANALYZE TABLE customer COMPUTE STATISTICS;

      1.2 Table + Column statistics: 

      ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk;

      1.3 Table + Column statistics with histogram:

      set spark.sql.statistics.histogram.enabled=true;
      ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk;

      2. View Table/Column statistics

      2.1 Table level statistics:

      DESCRIBE EXTENDED customer;

      Output:

      Statistics	26670841 bytes, 500000 rows

      2.2 Column level statistics including histogram:

      DESCRIBE EXTENDED customer c_customer_sk;

      Output:

      col_name	c_customer_sk
      data_type int
      comment NULL
      min 1
      max 500000
      num_nulls 0
      distinct_count 500000
      avg_col_len 4
      max_col_len 4
      histogram height: 1968.5039370078741, num_of_bins: 254
      bin_0 lower_bound: 1.0, upper_bound: 1954.0, distinct_count: 1982
      bin_1 lower_bound: 1954.0, upper_bound: 3898.0, distinct_count: 1893
      ...
      bin_253 lower_bound: 497982.0, upper_bound: 500000.0, distinct_count: 2076
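
The "height" value printed above is simply the table row count divided by the number of buckets, since Spark builds an equal-height histogram; a quick check in spark-shell using the numbers from this table:

// Equal-height histogram: each of the 254 buckets covers roughly the same number of rows.
val numRows = 500000.0
val numBins = 254
val height = numRows / numBins   // 1968.5039370078741, matching the "height" shown above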

2.3 Check statistics in the backend database for the Hive Metastore (e.g. MySQL):

      select tp.PARAM_KEY, tp.PARAM_VALUE
      from DBS d,TBLS t, TABLE_PARAMS tp
      where t.DB_ID = d.DB_ID
      and tp.TBL_ID=t.TBL_ID
      and d.NAME='tpcds' and t.TBL_NAME='customer'
      and (
      tp.PARAM_KEY in (
      'spark.sql.statistics.numRows',
      'spark.sql.statistics.totalSize'
      )
      or
      tp.PARAM_KEY like 'spark.sql.statistics.colStats.c_customer_sk.%'
      )
      and tp.PARAM_KEY not like 'spark.sql.statistics.colStats.%.histogram'
      ;

      Output:

      +-----------------------------------------------------------+-------------+
      | PARAM_KEY | PARAM_VALUE |
      +-----------------------------------------------------------+-------------+
      | spark.sql.statistics.colStats.c_customer_sk.avgLen | 4 |
      | spark.sql.statistics.colStats.c_customer_sk.distinctCount | 500000 |
      | spark.sql.statistics.colStats.c_customer_sk.max | 500000 |
      | spark.sql.statistics.colStats.c_customer_sk.maxLen | 4 |
      | spark.sql.statistics.colStats.c_customer_sk.min | 1 |
      | spark.sql.statistics.colStats.c_customer_sk.nullCount | 0 |
      | spark.sql.statistics.colStats.c_customer_sk.version | 1 |
      | spark.sql.statistics.numRows | 500000 |
      | spark.sql.statistics.totalSize | 26670841 |
      +-----------------------------------------------------------+-------------+
      9 rows in set (0.00 sec)

      2.4  View statistics in spark-shell to understand which classes are used to store statistics

      val db = "tpcds"
      val tableName = "customer"
      val colName = "c_customer_sk"

      val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
      val stats = metadata.stats.get
      val colStats = stats.colStats
      val c_customer_sk_stats = colStats(colName)

      val props = c_customer_sk_stats.toMap(colName)
      println(props)

      Output:

      scala> println(props)
      Map(c_customer_sk.avgLen -> 4, c_customer_sk.nullCount -> 0, c_customer_sk.distinctCount -> 500000, c_customer_sk.histogram -> XXXYYYZZZ, c_customer_sk.min -> 1, c_customer_sk.max -> 500000, c_customer_sk.version -> 1, c_customer_sk.maxLen -> 4)

      Basically above "c_customer_sk_stats" is of class org.apache.spark.sql.catalyst.catalog.CatalogColumnStat which is defined inside ./sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala

      3. Check cardinality based on statistics

From the above statistics for column "c_customer_sk" in table "customer", we know that this column is unique and has 500000 distinct values ranging from 1 to 500000.

In the RBO world, no matter whether the filter is "where c_customer_sk < 500" or "where c_customer_sk < 500000", the Filter operator always shows "sizeInBytes=119.2 MB", which is the total table size, and no rowCount is shown.

      spark-sql> set spark.sql.cbo.enabled=false;
      spark.sql.cbo.enabled false
      Time taken: 0.013 seconds, Fetched 1 row(s)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500;
      == Optimized Logical Plan ==
      Project [c_customer_sk#724], Statistics(sizeInBytes=6.4 MB, hints=none)
      +- Filter (isnotnull(c_customer_sk#724) && (c_customer_sk#724 < 500)), Statistics(sizeInBytes=119.2 MB, hints=none)
      +- Relation[c_customer_sk#724,c_customer_id#725,c_current_cdemo_sk#726,c_current_hdemo_sk#727,c_current_addr_sk#728,c_first_shipto_date_sk#729,c_first_sales_date_sk#730,c_salutation#731,c_first_name#732,c_last_name#733,c_preferred_cust_flag#734,c_birth_day#735,c_birth_month#736,c_birth_year#737,c_birth_country#738,c_login#739,c_email_address#740,c_last_review_date#741] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500000;
      == Optimized Logical Plan ==
      Project [c_customer_sk#724], Statistics(sizeInBytes=6.4 MB, hints=none)
      +- Filter (isnotnull(c_customer_sk#724) && (c_customer_sk#724 < 500000)), Statistics(sizeInBytes=119.2 MB, hints=none)
      +- Relation[c_customer_sk#724,c_customer_id#725,c_current_cdemo_sk#726,c_current_hdemo_sk#727,c_current_addr_sk#728,c_first_shipto_date_sk#729,c_first_sales_date_sk#730,c_salutation#731,c_first_name#732,c_last_name#733,c_preferred_cust_flag#734,c_birth_day#735,c_birth_month#736,c_birth_year#737,c_birth_country#738,c_login#739,c_email_address#740,c_last_review_date#741] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

In the CBO world, the Filter operator shows an estimated data size and rowCount based on the column-level statistics: sizeInBytes=122.8 KB, rowCount=503 for "c_customer_sk < 500" versus sizeInBytes=119.2 MB, rowCount=5.00E+5 for "c_customer_sk < 500000".

      spark-sql> set spark.sql.cbo.enabled=true;
      spark.sql.cbo.enabled true
      Time taken: 0.02 seconds, Fetched 1 row(s)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500;
      == Optimized Logical Plan ==
      Project [c_customer_sk#1024], Statistics(sizeInBytes=5.9 KB, rowCount=503, hints=none)
      +- Filter (isnotnull(c_customer_sk#1024) && (c_customer_sk#1024 < 500)), Statistics(sizeInBytes=122.8 KB, rowCount=503, hints=none)
      +- Relation[c_customer_sk#1024,c_customer_id#1025,c_current_cdemo_sk#1026,c_current_hdemo_sk#1027,c_current_addr_sk#1028,c_first_shipto_date_sk#1029,c_first_sales_date_sk#1030,c_salutation#1031,c_first_name#1032,c_last_name#1033,c_preferred_cust_flag#1034,c_birth_day#1035,c_birth_month#1036,c_birth_year#1037,c_birth_country#1038,c_login#1039,c_email_address#1040,c_last_review_date#1041] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500000;
      == Optimized Logical Plan ==
      Project [c_customer_sk#1024], Statistics(sizeInBytes=5.7 MB, rowCount=5.00E+5, hints=none)
      +- Filter (isnotnull(c_customer_sk#1024) && (c_customer_sk#1024 < 500000)), Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Relation[c_customer_sk#1024,c_customer_id#1025,c_current_cdemo_sk#1026,c_current_hdemo_sk#1027,c_current_addr_sk#1028,c_first_shipto_date_sk#1029,c_first_sales_date_sk#1030,c_salutation#1031,c_first_name#1032,c_last_name#1033,c_preferred_cust_flag#1034,c_birth_day#1035,c_birth_month#1036,c_birth_year#1037,c_birth_country#1038,c_login#1039,c_email_address#1040,c_last_review_date#1041] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
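
To see roughly where an estimate like rowCount=503 comes from, here is a back-of-the-envelope calculation in spark-shell under a simple uniform-distribution assumption (the actual estimate also uses the equal-height histogram, so it does not match exactly):

// "c_customer_sk < 500" with min=1, max=500000 and 500000 rows
val colMin = 1.0
val colMax = 500000.0
val tableRows = 500000L
val literal = 500.0
val selectivity = (literal - colMin) / (colMax - colMin)   // ≈ 0.000998
val estimatedRows = (selectivity * tableRows).round        // ≈ 499, in the same ballpark as the 503 above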

Note: In spark-shell, we can fetch the same information as follows:

      spark.conf.set("spark.sql.cbo.enabled","true")
      sql(s"use tpcds")
      val stats = spark.sql("select c_customer_sk from customer where c_customer_sk < 500").queryExecution.stringWithStats
For more information on how the Spark CBO calculates cardinality for Filter, Join, and other operators, please refer to the slides and training session listed in the References below.

      4. Broadcast Join

As in any MPP query engine or SQL-on-Hadoop product (such as Hive, Impala, or Drill), broadcast join is not a new concept.

By default in Spark, a table/data size below 10MB (configured by spark.sql.autoBroadcastJoinThreshold) can be broadcast to all worker nodes.
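
Besides the automatic size-based decision, a broadcast can also be requested explicitly with the broadcast() hint from org.apache.spark.sql.functions. A minimal sketch, assuming the same tpcds.customer table used in this article:

import org.apache.spark.sql.functions.broadcast

// Explicitly mark the (small) filtered side as the broadcast side of the join,
// regardless of spark.sql.autoBroadcastJoinThreshold.
val big   = spark.table("tpcds.customer").filter("c_customer_sk < 500000")
val small = spark.table("tpcds.customer").filter("c_customer_sk < 500")
val joined = big.join(broadcast(small), Seq("c_first_name"))
joined.explain(true)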

Look at the example join query below:

      spark-sql> explain cost select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<500000 and b.c_customer_sk<500;
      == Optimized Logical Plan ==
      Join Inner, (c_first_name#18 = c_first_name#62), Statistics(sizeInBytes=22.5 MB, rowCount=4.80E+4, hints=none)
      :- Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18)), Statistics(sizeInBytes=115.1 MB, rowCount=4.83E+5, hints=none)
      : +- Relation[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Filter ((isnotnull(c_customer_sk#54) && (c_customer_sk#54 < 500)) && isnotnull(c_first_name#62)), Statistics(sizeInBytes=118.7 KB, rowCount=486, hints=none)
      +- Relation[c_customer_sk#54,c_customer_id#55,c_current_cdemo_sk#56,c_current_hdemo_sk#57,c_current_addr_sk#58,c_first_shipto_date_sk#59,c_first_sales_date_sk#60,c_salutation#61,c_first_name#62,c_last_name#63,c_preferred_cust_flag#64,c_birth_day#65,c_birth_month#66,c_birth_year#67,c_birth_country#68,c_login#69,c_email_address#70,c_last_review_date#71] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      == Physical Plan ==
      *(2) BroadcastHashJoin [c_first_name#18], [c_first_name#62], Inner, BuildRight
      :- *(2) Project [c_customer_sk#10, c_customer_id#11, c_current_cdemo_sk#12, c_current_hdemo_sk#13, c_current_addr_sk#14, c_first_shipto_date_sk#15, c_first_sales_date_sk#16, c_salutation#17, c_first_name#18, c_last_name#19, c_preferred_cust_flag#20, c_birth_day#21, c_birth_month#22, c_birth_year#23, c_birth_country#24, c_login#25, c_email_address#26, c_last_review_date#27]
      : +- *(2) Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18))
      : +- *(2) FileScan parquet tpcds.customer[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500000), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[8, string, true]))
      +- *(1) Project [c_customer_sk#54, c_customer_id#55, c_current_cdemo_sk#56, c_current_hdemo_sk#57, c_current_addr_sk#58, c_first_shipto_date_sk#59, c_first_sales_date_sk#60, c_salutation#61, c_first_name#62, c_last_name#63, c_preferred_cust_flag#64, c_birth_day#65, c_birth_month#66, c_birth_year#67, c_birth_country#68, c_login#69, c_email_address#70, c_last_review_date#71]
      +- *(1) Filter ((isnotnull(c_customer_sk#54) && (c_customer_sk#54 < 500)) && isnotnull(c_first_name#62))
      +- *(1) FileScan parquet tpcds.customer[c_customer_sk#54,c_customer_id#55,c_current_cdemo_sk#56,c_current_hdemo_sk#57,c_current_addr_sk#58,c_first_shipto_date_sk#59,c_first_sales_date_sk#60,c_salutation#61,c_first_name#62,c_last_name#63,c_preferred_cust_flag#64,c_birth_day#65,c_birth_month#66,c_birth_year#67,c_birth_country#68,c_login#69,c_email_address#70,c_last_review_date#71] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      Time taken: 0.098 seconds, Fetched 1 row(s)

      From the "Optimized Logical Plan", we know the estimated size of the smaller table is (sizeInBytes=118.7 KB, rowCount=486). So it can be broadcasted and that is why we see "BroadcastHashJoin" in "Physical Plan".

If we decrease spark.sql.autoBroadcastJoinThreshold to 118KB (118*1024=120832 bytes), the join is converted to a SortMergeJoin:

      spark-sql> set spark.sql.autoBroadcastJoinThreshold=120832;
      spark.sql.autoBroadcastJoinThreshold 120832
      Time taken: 0.016 seconds, Fetched 1 row(s)
      spark-sql> explain cost select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<500000 and b.c_customer_sk<500;
      == Optimized Logical Plan ==
      Join Inner, (c_first_name#18 = c_first_name#94), Statistics(sizeInBytes=22.5 MB, rowCount=4.80E+4, hints=none)
      :- Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18)), Statistics(sizeInBytes=115.1 MB, rowCount=4.83E+5, hints=none)
      : +- Relation[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Filter ((isnotnull(c_customer_sk#86) && (c_customer_sk#86 < 500)) && isnotnull(c_first_name#94)), Statistics(sizeInBytes=118.7 KB, rowCount=486, hints=none)
      +- Relation[c_customer_sk#86,c_customer_id#87,c_current_cdemo_sk#88,c_current_hdemo_sk#89,c_current_addr_sk#90,c_first_shipto_date_sk#91,c_first_sales_date_sk#92,c_salutation#93,c_first_name#94,c_last_name#95,c_preferred_cust_flag#96,c_birth_day#97,c_birth_month#98,c_birth_year#99,c_birth_country#100,c_login#101,c_email_address#102,c_last_review_date#103] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      == Physical Plan ==
      *(5) SortMergeJoin [c_first_name#18], [c_first_name#94], Inner
      :- *(2) Sort [c_first_name#18 ASC NULLS FIRST], false, 0
      : +- Exchange hashpartitioning(c_first_name#18, 200)
      : +- *(1) Project [c_customer_sk#10, c_customer_id#11, c_current_cdemo_sk#12, c_current_hdemo_sk#13, c_current_addr_sk#14, c_first_shipto_date_sk#15, c_first_sales_date_sk#16, c_salutation#17, c_first_name#18, c_last_name#19, c_preferred_cust_flag#20, c_birth_day#21, c_birth_month#22, c_birth_year#23, c_birth_country#24, c_login#25, c_email_address#26, c_last_review_date#27]
      : +- *(1) Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18))
      : +- *(1) FileScan parquet tpcds.customer[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500000), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      +- *(4) Sort [c_first_name#94 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(c_first_name#94, 200)
      +- *(3) Project [c_customer_sk#86, c_customer_id#87, c_current_cdemo_sk#88, c_current_hdemo_sk#89, c_current_addr_sk#90, c_first_shipto_date_sk#91, c_first_sales_date_sk#92, c_salutation#93, c_first_name#94, c_last_name#95, c_preferred_cust_flag#96, c_birth_day#97, c_birth_month#98, c_birth_year#99, c_birth_country#100, c_login#101, c_email_address#102, c_last_review_date#103]
      +- *(3) Filter ((isnotnull(c_customer_sk#86) && (c_customer_sk#86 < 500)) && isnotnull(c_first_name#94))
      +- *(3) FileScan parquet tpcds.customer[c_customer_sk#86,c_customer_id#87,c_current_cdemo_sk#88,c_current_hdemo_sk#89,c_current_addr_sk#90,c_first_shipto_date_sk#91,c_first_sales_date_sk#92,c_salutation#93,c_first_name#94,c_last_name#95,c_preferred_cust_flag#96,c_birth_day#97,c_birth_month#98,c_birth_year#99,c_birth_country#100,c_login#101,c_email_address#102,c_last_review_date#103] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      Time taken: 0.113 seconds, Fetched 1 row(s)
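
The arithmetic behind that flip is straightforward: the estimated build side (118.7 KB) is now larger than the threshold we just set, so it no longer qualifies for broadcast.

// Estimated small side vs. the new threshold
val estimatedSmallSideBytes = (118.7 * 1024).toLong   // ≈ 121548 bytes
val thresholdBytes = 118 * 1024                       // 120832 bytes, the value set above
estimatedSmallSideBytes <= thresholdBytes             // false, so Spark falls back to SortMergeJoin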

      If we run this query "select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<5 and b.c_customer_sk<500;" and look at the web UI, we can also find the estimated cardinality:

All in all, CBO is a huge topic in any database/query engine. I will discuss it more in future posts.

      References:

      Cost Based Optimizer in Apache Spark 2.2 

      Training: Cardinality Estimation through Histogram in Apache Spark 2.

       

      Spark Code -- Which Spark SQL data type isOrderable?


      Goal:

This article does some code analysis on which Spark SQL data types are order-able (i.e., sort-able).

We will look into the source code logic of the method "isOrderable" in the object org.apache.spark.sql.catalyst.expressions.RowOrdering.

The reason we are interested in "isOrderable" is that it is used by SparkStrategies.scala to choose join types, which we will dig into more deeply in another post.

      Env:

      Spark 2.4 source code

      Solution:

      The source code for method "isOrderable" is:

  /**
   * Returns true iff the data type can be ordered (i.e. can be sorted).
   */
  def isOrderable(dataType: DataType): Boolean = dataType match {
    case NullType => true
    case dt: AtomicType => true
    case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
    case array: ArrayType => isOrderable(array.elementType)
    case udt: UserDefinedType[_] => isOrderable(udt.sqlType)
    case _ => false
  }

So basically NullType and any AtomicType are order-able. For complex types, it depends on their element/field types.

Let's now take a look at all of the Spark SQL data types in org.apache.spark.sql.types.

      1. NullType

      import org.apache.spark.sql.catalyst.expressions.RowOrdering
      import org.apache.spark.sql.types._

      scala> RowOrdering.isOrderable(NullType)
      res0: Boolean = true

      2. class which extends AtomicType

      scala> RowOrdering.isOrderable(BinaryType)
      res29: Boolean = true

      scala> RowOrdering.isOrderable(BooleanType)
      res6: Boolean = true

      scala> RowOrdering.isOrderable(DateType)
      res31: Boolean = true

scala> RowOrdering.isOrderable(StringType)
res1: Boolean = true

      scala> RowOrdering.isOrderable(TimestampType)
      res19: Boolean = true

There is another abstract class, HiveStringType, which also extends AtomicType.

But as per the comment below, any instance of it should be replaced by a StringType before analysis, and it was even removed in Spark 3.1.

/**
 * A hive string type for compatibility. These datatypes should only used for parsing,
 * and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
sealed abstract class HiveStringType extends AtomicType

      3. class which extends IntegralType

      Basically "abstract class IntegralType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala

      So any class which extends IntegralType should also be order-able:

      scala> RowOrdering.isOrderable(ByteType)
      res11: Boolean = true

      scala> RowOrdering.isOrderable(IntegerType)
      res3: Boolean = true

      scala> RowOrdering.isOrderable(LongType)
      res13: Boolean = true

      scala> RowOrdering.isOrderable(ShortType)
      res14: Boolean = true

      4. class which extends FractionalType

      Basically "abstract class FractionalType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala.

      So any class which extends FractionalType should also be order-able:

      scala> RowOrdering.isOrderable(DecimalType(10,5))
      res17: Boolean = true

      scala> RowOrdering.isOrderable(DoubleType)
      res2: Boolean = true

      scala> RowOrdering.isOrderable(FloatType)
      res13: Boolean = true

      5. Spark SQL data types which are not order-able

      scala> RowOrdering.isOrderable(CalendarIntervalType)
      res26: Boolean = false

      scala> RowOrdering.isOrderable(DataTypes.createMapType(StringType,StringType))
      res9: Boolean = false

      scala> RowOrdering.isOrderable(ObjectType(classOf[java.lang.Integer]))
      res23: Boolean = false
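
A quick way to see this in practice: sorting by a non-order-able column type is rejected at analysis time. A small spark-shell sketch (the DataFrame here is made up purely for illustration):

import scala.util.Try

val df = Seq((1, Map("a" -> 1)), (2, Map("b" -> 2))).toDF("id", "m")
// MapType is not order-able, so sorting by the map column fails with an AnalysisException:
println(Try(df.orderBy("m").collect()).isFailure)   // true
// Sorting by the Int column works fine:
df.orderBy("id").show()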

      6. Complex Spark SQL data types

If the ArrayType's element type is order-able, then the ArrayType is order-able; otherwise it is not.

      scala> RowOrdering.isOrderable(ArrayType(IntegerType))
      res22: Boolean = true

      scala> RowOrdering.isOrderable(ArrayType(CalendarIntervalType))
      res27: Boolean = false

If all of the field types of a StructType are order-able, then the StructType is order-able; otherwise it is not.

      scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", StringType))
      res6: Boolean = true

      scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", CalendarIntervalType))
      res7: Boolean = false

      7. UserDefinedType

As per the comment below, UserDefinedType was made private in Spark 2.0 and should not be used directly.

       * Note: This was previously a developer API in Spark 1.x. We are making this private in Spark 2.0
      * because we will very likely create a new version of this that works better with Datasets.

