
How to mount a PersistentVolume for Static Provisioning using MapR CSI in GKE


Goal:

This article explains the detailed steps on how to mount a PersistentVolume for Static Provisioning using the MapR Container Storage Interface (CSI) in Google Kubernetes Engine (GKE).

Env:

MapR 6.1 (secured)
MapR CSI 1.0.0
Kubernetes Cluster in GKE

Use Case:

We have a secured MapR cluster (v6.1) and want to expose its storage to applications running in a Kubernetes cluster (GKE in this example).
In this example, we plan to expose a MapR volume named "mapr.apps" (mounted as /apps) to a sample POD in the Kubernetes cluster.
Inside the POD, it will be mounted as /mapr instead.

Solution:

1. Create a Kubernetes cluster named "standard-cluster-1" in GKE

You can use GUI or gcloud commands.
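For example, a minimal gcloud sketch (the zone, node count, and machine type below are assumptions; adjust them to your environment):
# illustrative values only
gcloud container clusters create standard-cluster-1 --zone us-central1-a --num-nodes 3 --machine-type n1-standard-4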

2. Fetch the credentials for the Kubernetes cluster

gcloud container clusters get-credentials standard-cluster-1 --zone us-central1-a
After that, make sure "kubectl cluster-info" returns correct cluster information.
This step is to make kubectl work and connect to the correct Kubernetes cluster.

3. Bind cluster-admin role to Google Cloud user

kubectl create clusterrolebinding user-cluster-admin-binding --clusterrole=cluster-admin --user=xxx@yyy.com
Note: "xxx@yyy.com" is the your Google Cloud user.
Here we grant cluster admin role to the user to avoid any permission error in the next step when we create MapR CSI ClusterRole and ClusterRoleBinding. 

4. Download MapR CSI Driver custom resource definition

Please refer to the latest documentation: https://mapr.com/docs/home/CSIdriver/csi_downloads.html 
git clone https://github.com/mapr/mapr-csi
cd ./mapr-csi/deploy/kubernetes/
kubectl create -f csi-maprkdf-v1.0.0.yaml
The below Kubernetes objects are created:
namespace/mapr-csi created
serviceaccount/csi-nodeplugin-sa created
clusterrole.rbac.authorization.k8s.io/csi-nodeplugin-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-nodeplugin-crb created
serviceaccount/csi-controller-sa created
clusterrole.rbac.authorization.k8s.io/csi-attacher-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-attacher-crb created
clusterrole.rbac.authorization.k8s.io/csi-controller-cr created
clusterrolebinding.rbac.authorization.k8s.io/csi-controller-crb created
daemonset.apps/csi-nodeplugin-kdf created
statefulset.apps/csi-controller-kdf created

5. Verify the PODs/DaemonSet/StatefulSet are running under namespace "mapr-csi"

PODs:
$ kubectl get pods -n mapr-csi -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-controller-kdf-0 5/5 Running 0 5m58s xx.xx.xx.1 gke-standard-cluster-1-default-pool-aaaaaaaa-1111 <none> <none>
csi-nodeplugin-kdf-9gmqc 3/3 Running 0 5m58s xx.xx.xx.2 gke-standard-cluster-1-default-pool-aaaaaaaa-2222 <none> <none>
csi-nodeplugin-kdf-qhhbh 3/3 Running 0 5m58s xx.xx.xx.3 gke-standard-cluster-1-default-pool-aaaaaaaa-3333 <none> <none>
csi-nodeplugin-kdf-vrq4g 3/3 Running 0 5m58s xx.xx.xx.4 gke-standard-cluster-1-default-pool-aaaaaaaa-4444 <none> <none>
DaemonSet:
$ kubectl get DaemonSet -n mapr-csi
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
csi-nodeplugin-kdf 3 3 3 3 3 <none> 8m58s
StatefulSet:
$ kubectl get StatefulSet -n mapr-csi
NAME READY AGE
csi-controller-kdf 1/1 9m42s

6. Create a test namespace named "testns" for future test PODs

kubectl create namespace testns

7. Create a Secret for MapR ticket

7.a Log on to the MapR cluster, and locate the ticket file using "maprlogin print" or generate a new ticket file using "maprlogin password".
For example, here we are using the "mapr" user's ticket file located at /tmp/maprticket_5000.
7.b Convert the ticket into base64 representation and save the output.
cat /tmp/maprticket_5000 | base64
7.c Create a YAML file named "mapr-ticket-secret.yaml" for the Secret named "mapr-ticket-secret" in namespace "testns".
apiVersion: v1
kind: Secret
metadata:
  name: mapr-ticket-secret
  namespace: testns
type: Opaque
data:
  CONTAINER_TICKET: CHANGETHIS!
Note: "CHANGETHIS!" should be replaced by the output we saved in step 7.b. Make sure it is in a single line.
7.d Create this Secret.
kubectl create -f mapr-ticket-secret.yaml

8. Change the GKE default Storage Class

This is because the GKE default Storage Class, named "standard", is backed by the dynamic provisioner kubernetes.io/gce-pd.
If we do not change the default, it will automatically create a new PV and bind it to our PVC in the next steps.
8.a Confirm the default Storage Class is named "standard" in GKE.
$ kubectl get storageclass -o yaml
apiVersion: v1
items:
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2019-12-04T19:38:38Z"
    labels:
      addonmanager.kubernetes.io/mode: EnsureExists
      kubernetes.io/cluster-service: "true"
    name: standard
    resourceVersion: "285"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
    uid: ab77d472-16cd-11ea-abaf-42010a8000ad
  parameters:
    type: pd-standard
  provisioner: kubernetes.io/gce-pd
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
8.b Create a YAML file named "my_storage_class.yaml" for Storage Class named "mysc".
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
8.c Create the Storage Class.
kubectl create -f my_storage_class.yaml
8.d Verify both Storage Classes.
$ kubectl get storageclass
NAME PROVISIONER AGE
mysc kubernetes.io/no-provisioner 8s
standard (default) kubernetes.io/gce-pd 8h
8.e Change default Storage Class to "mysc".
kubectl patch storageclass mysc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
8.f Verify both Storage Classes again.
$ kubectl get storageclass
NAME PROVISIONER AGE
mysc (default) kubernetes.io/no-provisioner 2m3s
standard kubernetes.io/gce-pd 8h

9. Create a YAML file named "test-simplepv.yaml" for PersistentVolume (PV) named "test-simplepv"

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-simplepv
  namespace: testns
  labels:
    name: pv-simplepv-test
spec:
  storageClassName: mysc
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  capacity:
    storage: 1Gi
  csi:
    nodePublishSecretRef:
      name: "mapr-ticket-secret"
      namespace: "testns"
    driver: com.mapr.csi-kdf
    volumeHandle: mapr.apps
    volumeAttributes:
      volumePath: "/apps"
      cluster: "mycluster.cluster.com"
      cldbHosts: "mycldb.node.internal"
      securityType: "secure"
      platinum: "false"
Make sure the CLDB host can be reached from the Kubernetes cluster nodes (a quick reachability check is shown after the create command below).
Also note that the PV uses our own Storage Class "mysc".
Create the PV:
kubectl create -f test-simplepv.yaml
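A quick way to confirm that the CLDB host is reachable from the Kubernetes cluster is to run a throwaway busybox POD; this is only a sketch, assuming the CLDB hostname from the PV spec above and the default CLDB port 7222 (nc options can vary with the busybox build):
# hostname and port are from the PV example above; adjust for your cluster
kubectl run cldb-check --rm -ti --restart=Never --image=busybox -n testns -- nc -zv mycldb.node.internal 7222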

10. Create a YAML file named "test-simplepvc.yaml" for PersistentVolumeClaim (PVC) named "test-simplepvc"

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-simplepvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
Create the PVC:
kubectl create -f test-simplepvc.yaml
Right now, the PVC should be in "Pending" status which is fine.
$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
test-simplepv 1Gi RWO Delete Available mysc 11s

$ kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Pending mysc 11s

11. Create a YAML file named "testpod.yaml" for a POD named "testpod"

apiVersion: v1
kind: Pod
metadata:
  name: testpod
  namespace: testns
spec:
  securityContext:
    runAsUser: 5000
    fsGroup: 5000
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
    volumeMounts:
    - mountPath: /mapr
      name: maprcsi
  volumes:
  - name: maprcsi
    persistentVolumeClaim:
      claimName: test-simplepvc
Create the POD:
kubectl create -f testpod.yaml

After that, both PV and PVC should be "Bound":
$ kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Bound test-simplepv 1Gi RWO mysc 82s

$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
test-simplepv 1Gi RWO Delete Bound testns/test-simplepvc mysc 89s

12. Log on to the POD to verify

kubectl exec -ti testpod -n testns -- bin/sh
Then try to read and write:
/ $ mount -v |grep mapr
posix-client-basic on /mapr type fuse.posix-client-basic (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
/ $ ls -altr /mapr
total 6
drwxrwxrwt 3 5000 5000 1 Nov 26 16:49 kafka-streams
drwxrwxrwt 3 5000 5000 1 Nov 26 16:49 ksql
drwxrwxrwx 3 5000 5000 15 Dec 4 17:10 spark
drwxr-xr-x 5 5000 5000 3 Dec 5 04:27 .
drwxr-xr-x 1 root root 4096 Dec 5 04:40 ..
/ $ touch /mapr/testfile
/ $ rm /mapr/testfile

13. Clean up

kubectl delete -f testpod.yaml
kubectl delete -f test-simplepvc.yaml
kubectl delete -f test-simplepv.yaml
kubectl delete -f my_storage_class.yaml
kubectl delete -f mapr-ticket-secret.yaml
kubectl delete -f csi-maprkdf-v1.0.0.yaml

Common issues:

1. In step 4, when creating the MapR CSI ClusterRoleBinding, it fails with the below error message:
user xxx@yyy.com (groups=["system:authenticated"]) is attempting to grant rbac permissions not currently held
This is because the Google Cloud user "xxx@yyy.com" does not have sufficient permissions.
One solution is step 3, which grants the cluster-admin role to this user.

2. After PV and PVC are created, PVC is bound to a new PV named "pvc-...." instead of our PV named "test-simplepv".
For example:
$  kubectl get pvc -n testns
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
test-simplepvc Bound pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad 1Gi RWO standard 16m

$ kubectl get pv -n testns
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad 1Gi RWO Delete Bound mapr-csi/test-simplepvc standard 17m
test-simplepv 1Gi RWO Delete Available
This is because GKE has a default Storage Class "standard" whose provisioner can create a new PV and bind it to our PVC.
We can confirm this using the below command:
$  kubectl get pvc test-simplepvc -o=yaml -n testns
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
  creationTimestamp: "2019-12-05T00:33:52Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: test-simplepvc
  namespace: testns
  resourceVersion: "61729"
  selfLink: /api/v1/namespaces/testns/persistentvolumeclaims/test-simplepvc
  uid: e9a0f512-16f6-11ea-abaf-42010a8000ad
spec:
  accessModes:
  - ReadWriteOnce
  dataSource: null
  resources:
    requests:
      storage: 1G
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-e9a0f512-16f6-11ea-abaf-42010a8000ad
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  phase: Bound
One solution is step 8, which changes the GKE default Storage Class.

Troubleshooting:

DaemonSet "csi-nodeplugin-kdf" has 3 kinds of containers:
[csi-node-driver-registrar liveness-probe mapr-kdfplugin]
StatefulSet "csi-controller-kdf" has 5 kinds of containers:
[csi-attacher csi-provisioner csi-snapshotter liveness-probe mapr-kdfprovisioner]

So we can view all of the container logs to see if there is any error.
For example:
kubectl logs csi-nodeplugin-kdf-vrq4g -c csi-node-driver-registrar -n mapr-csi
kubectl logs csi-nodeplugin-kdf-vrq4g -c liveness-probe -n mapr-csi
kubectl logs csi-nodeplugin-kdf-vrq4g -c mapr-kdfplugin -n mapr-csi

kubectl logs csi-controller-kdf-0 -c csi-provisioner -n mapr-csi
kubectl logs csi-controller-kdf-0 -c csi-attacher -n mapr-csi
kubectl logs csi-controller-kdf-0 -c csi-snapshotter -n mapr-csi
kubectl logs csi-controller-kdf-0 -c mapr-kdfprovisioner -n mapr-csi
kubectl logs csi-controller-kdf-0 -c liveness-probe -n mapr-csi
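If you are unsure of the exact container names inside a POD in your environment, one way to list them (using the POD names from the examples above) is:
kubectl get pod csi-controller-kdf-0 -n mapr-csi -o jsonpath='{.spec.containers[*].name}'
kubectl get pod csi-nodeplugin-kdf-vrq4g -n mapr-csi -o jsonpath='{.spec.containers[*].name}'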

Reference:

https://mapr.com/docs/home/CSIdriver/csi_overview.html
https://mapr.com/docs/home/CSIdriver/csi_installation.html
https://mapr.com/docs/home/CSIdriver/csi_example_static_provisioning.html


Spark Streaming sample scala code for different sources


Goal:

This article shares some sample Spark Streaming Scala code for different sources -- socket text, text files in a MapR-FS directory, a Kafka broker, and MapR Event Store for Apache Kafka (MapR Streams).
These are word-count examples which can be run directly from spark-shell.

Env:

MapR 6.1
mapr-spark-2.3.2.0
mapr-kafka-1.1.1
mapr-kafka-ksql-4.1.1

Solution:

1. socket text

Data source:
Open a socket on port 9999 and type some words as the data source.
nc -lk 9999
Sample Code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

2. text files in MapR-FS directory

Data source:
Create a directory on MapR-FS and put text files inside as the data source.
hadoop fs -mkdir /tmp/textfile
hadoop fs -put /opt/mapr/NOTICE.txt /tmp/textfile/
Sample Code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("/tmp/textfile")
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

3. kafka broker

Data source:
Assuming an existing kafka server is started:
./bin/kafka-server-start.sh ./config/server.properties
Create a new topic named "mytopic":
./bin/kafka-topics.sh --create --zookeeper localhost:5181 --replication-factor 1 --partitions 1 --topic mytopic
Start a kafka console producer and type some words as data source:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mytopic
OR use below producer:
./kafka-producer-perf-test.sh  --topic mytopic --num-records 1000000 --record-size 1000 \
--throughput 10000 --producer-props bootstrap.servers=localhost:9092
Sample Code:
import org.apache.kafka.clients.consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{
ConsumerStrategies,
KafkaUtils,
LocationStrategies
}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.GROUP_ID_CONFIG -> "mysparkgroup",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
)

val topicsSet = Array("mytopic")
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
consumerStrategy)

val lines = messages.map(_.value())
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()

4. MapR Event Store for Apache Kafka(MapR Streams)

Data source:
Create a sample MapR stream named /sample-stream:
maprcli stream create -path /sample-stream -produceperm p -consumeperm p -topicperm p
Use one of the KSQL tools mentioned in this blog to generate the data:
/opt/mapr/ksql/ksql-4.1.1/bin/ksql-datagen quickstart=pageviews format=delimited topic=/sample-stream:pageviews maxInterval=10000
OR use below producer:
./kafka-producer-perf-test.sh  --topic /sample-stream:pageviews --num-records 1000000 --record-size 10000 \
--throughput 10000 --producer-props bootstrap.servers=localhost:9092

Sample code:
import org.apache.kafka.clients.consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{
ConsumerStrategies,
KafkaUtils,
LocationStrategies
}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

val kafkaParams = Map[String, Object](
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.GROUP_ID_CONFIG -> "mysparkgroup",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest",
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
)

val topicsSet = Array("/sample-stream:pageviews")
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
consumerStrategy)

val lines = messages.map(_.value())
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()


How to create a MapR PACC using mapr-setup.sh to submit a Spark sample job


Goal:

This article shares the detailed steps on how to create a MapR Persistent Application Client Container (PACC) using mapr-setup.sh to submit a sample Spark job to a secured MapR cluster.

Env:

MapR 6.1 (secured) with FUSE based posix client running.
Docker is installed and running on Mac

Solution:

1. Generate a service ticket for user "mapr" on the secured MapR cluster

maprlogin generateticket -type service -cluster my61.cluster.com -duration 30:0:0 -out /tmp/mapr_ticket -user mapr

2. Copy the service ticket to Mac where docker is running

Say this location on my mac is /Users/hzu/pacc/mapr_ticket

3. Download mapr-setup.sh on Mac

curl -O https://package.mapr.com/releases/installer/mapr-setup.sh
chmod +x ./mapr-setup.sh

4. Create MapR PACC Image Using mapr-setup.sh

./mapr-setup.sh docker client
Follow doc: https://mapr.com/docs/61/AdvancedInstallation/CreatingPACCImage.html
Note: add at least the Spark client.

5. Edit ./docker_images/client/mapr-docker-client.sh

MAPR_CLUSTER=my61.cluster.com
MAPR_CLDB_HOSTS=v1.poc.com,v2.poc.com,v3.poc.com
MAPR_MOUNT_PATH=/maprfuse
MAPR_TICKET_FILE=/Users/hzu/pacc/mapr_ticket
MAPR_TICKETFILE_LOCATION="/tmp/$(basename $MAPR_TICKET_FILE)"
MAPR_CONTAINER_USER=mapr
MAPR_CONTAINER_UID=5000
MAPR_CONTAINER_GROUP=mapr
MAPR_CONTAINER_GID=5000
MAPR_MEMORY=0
MAPR_DOCKER_NETWORK=bridge

6. Run mapr-docker-client.sh to start the container

./docker_images/client/mapr-docker-client.sh

7. Verify the container has access to the POSIX mount point

[mapr@8955101793bf ~]$ ls -altr  /maprfuse
total 1
drwxr-xr-x 11 mapr mapr 12 Dec 6 12:46 my61.cluster.com

[mapr@8955101793bf ~]$ rpm -qa|grep -i mapr-
mapr-client-6.1.0.20180926230239.GA-1.x86_64
mapr-posix-client-container-6.1.0.20180926230239.GA-1.x86_64
mapr-hive-2.3.201809220807-1.noarch
mapr-librdkafka-0.11.3.201803231414-1.noarch
mapr-spark-2.3.1.201809221841-1.noarch
mapr-kafka-1.1.1.201809281337-1.noarch
mapr-pig-0.16.201707251429-1.noarch

8. Verify that submitting spark job works in the container

/opt/mapr/spark/spark-2.3.1/bin/run-example --master yarn --deploy-mode cluster SparkPi 10

Common Issues:

If mapr-setup.sh fails with the below error on the Mac, add the Mac's IP address and hostname to /etc/hosts in advance (see the example after the error message).
ERROR: Hostname (mymacbook.local) cannot be resolved. Correct the problem and retry mapr-setup.sh
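For example, a hedged sketch (the IP address and hostname below are placeholders for your Mac's actual values):
# replace with your Mac's real IP address and hostname
echo "192.168.1.10  mymacbook.local" | sudo tee -a /etc/hosts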

References:

https://mapr.com/products/persistent-application-client-container/
https://mapr.com/blog/persistent-storage-docker-containers-whiteboard-walkthrough/
https://mapr.com/docs/61/AdvancedInstallation/UsingtheMapRPACC.html
https://mapr.com/docs/61/AdvancedInstallation/CreatingPACCImage.html
https://mapr.com/docs/61/AdvancedInstallation/CustomizingaMapRPACC.html
https://mapr.com/blog/getting-started-mapr-client-container/
https://hub.docker.com/r/maprtech/pacc/tags/

How to use nodeSelector to constrain POD csi-controller-kdf-0 to only be able to run on particular Node(s)


Goal:

This article explains how to use nodeSelector to constrain POD csi-controller-kdf-0 to only be able to run on particular Node(s).

Env:

MapR 6.1 (secured)
MapR CSI 1.0.0
Kubernetes Cluster in GKE

Use case:

For MapR CSI, we want the POD from StatefulSet "csi-controller-kdf" to only run on specific node(s).

Solution:

1. List current nodes from Kubernetes cluster

$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-standard-cluster-1-default-pool-f6e6e4c1-45ql Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-fbhp Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-r20n Ready <none> 22m v1.13.11-gke.14
gke-standard-cluster-1-default-pool-f6e6e4c1-xr3s Ready <none> 22m v1.13.11-gke.14

For example, we want the POD from StatefulSet "csi-controller-kdf" to only run on node "gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5".

2. Attach a label to this node

kubectl label nodes gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 for-csi-controller=true
Here the label key is "for-csi-controller" and the label value is "true".
Verify that the label is attached on that node:
$ kubectl get nodes -l for-csi-controller=true
NAME STATUS ROLES AGE VERSION
gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 Ready <none> 34m v1.13.11-gke.14

3. Modify csi-maprkdf-v1.0.0.yaml

cp csi-maprkdf-v1.0.0.yaml csi-maprkdf-v1.0.0_modified.yaml
vi csi-maprkdf-v1.0.0_modified.yaml
Add the below to the bottom of the definition for StatefulSet "csi-controller-kdf":
      nodeSelector:
        for-csi-controller: "true"
One full example for StatefulSet "csi-controller-kdf" is:
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: csi-controller-kdf
  namespace: mapr-csi
spec:
  serviceName: "kdf-provisioner-svc"
  replicas: 1
  template:
    metadata:
      labels:
        app: csi-controller-kdf
    spec:
      serviceAccount: csi-controller-sa
      containers:
      - name: csi-attacher
        image: quay.io/k8scsi/csi-attacher:v1.0.1
        args:
        - "--v=5"
        - "--csi-address=$(ADDRESS)"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        imagePullPolicy: "Always"
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: csi-provisioner
        image: quay.io/k8scsi/csi-provisioner:v1.0.1
        args:
        - "--provisioner=com.mapr.csi-kdf"
        - "--csi-address=$(ADDRESS)"
        - "--volume-name-prefix=mapr-pv"
        - "--v=5"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        imagePullPolicy: "Always"
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: csi-snapshotter
        image: quay.io/k8scsi/csi-snapshotter:v1.0.1
        imagePullPolicy: "Always"
        args:
        - "--snapshotter=com.mapr.csi-kdf"
        - "--csi-address=$(ADDRESS)"
        - "--snapshot-name-prefix=mapr-snapshot"
        - "--v=5"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: liveness-probe
        image: quay.io/k8scsi/livenessprobe:v1.0.1
        imagePullPolicy: "Always"
        args:
        - "--v=5"
        - "--csi-address=$(ADDRESS)"
        - "--connection-timeout=60s"
        - "--health-port=9809"
        env:
        - name: ADDRESS
          value: /var/lib/csi/sockets/pluginproxy/csi.sock
        volumeMounts:
        - name: socket-dir
          mountPath: /var/lib/csi/sockets/pluginproxy/
      - name: mapr-kdfprovisioner
        image: maprtech/csi-kdfprovisioner:1.0.0
        imagePullPolicy: "Always"
        args:
        - "--nodeid=$(NODE_ID)"
        - "--endpoint=$(CSI_ENDPOINT)"
        - "-v=5"
        env:
        - name: NODE_ID
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: CSI_ENDPOINT
          value: unix://plugin/csi.sock
        ports:
        - containerPort: 9809
          name: healthz
          protocol: TCP
        livenessProbe:
          failureThreshold: 20
          httpGet:
            path: /healthz
            port: healthz
          initialDelaySeconds: 10
          timeoutSeconds: 3
          periodSeconds: 5
        volumeMounts:
        - name: socket-dir
          mountPath: /plugin
        - name: k8s-log-dir
          mountPath: /var/log/csi-maprkdf
        - name: timezone
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: socket-dir
        emptyDir: {}
      - name: k8s-log-dir
        hostPath:
          path: /var/log/csi-maprkdf
          type: DirectoryOrCreate
      - name: timezone
        hostPath:
          path: /etc/localtime
      nodeSelector:
        for-csi-controller: "true"

 4. Create StatefulSet "csi-controller-kdf" using the modified version when configuring MapR CSI

kubectl apply -f csi-maprkdf-v1.0.0_modified.yaml
The other steps to configure MapR CSI are the same as in this blog.

5. Verify that POD "csi-controller-kdf-0" is running on that specific node

$ kubectl get pods -n mapr-csi -o wide  |grep csi-controller-kdf-0
csi-controller-kdf-0 5/5 Running 0 56m xx.xx.xx.4 gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 <none> <none>

Disaster Recovery Test: 

1. Drain this specific node and evict all the PODs except those for DaemonSets.

$ kubectl drain gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 --ignore-daemonsets --delete-local-data
node/gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.2.0-hzrq7, kube-system/prometheus-to-sd-jxhrm, mapr-csi/csi-nodeplugin-kdf-ssbxp
evicting pod "csi-controller-kdf-0"
evicting pod "kube-dns-79868f54c5-rggws"
pod/csi-controller-kdf-0 evicted
pod/kube-dns-79868f54c5-rggws evicted
node/gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 evicted

2. Check if the POD "csi-controller-kdf-0" will be rescheduled on other nodes or not.

$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
mapr-csi csi-controller-kdf-0 0/5 Pending 0 16m <none> <none> <none> <none>
...
As we can see, the POD "csi-controller-kdf-0" stays pending and cannot be rescheduled on other nodes.
This proves that the nodeSelector is working.

3. Mark the specific node available again

kubectl uncordon gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5

4. Verify that POD "csi-controller-kdf-0" is running on the specific node again

$ kubectl get pods --all-namespaces -o wide |grep -i csi-controller-kdf-0
mapr-csi csi-controller-kdf-0 5/5 Running 0 17m xx.xx.xx.5 gke-standard-cluster-1-default-pool-f6e6e4c1-hzh5 <none> <none>

5.  Verify the mount point is working in the test POD

$ kubectl exec -ti testpod -n testns -- ls -altr /mapr
total 6
drwxrwxrwt 3 5000 5000 1 Nov 25 11:17 kafka-streams
drwxrwxrwt 3 5000 5000 1 Nov 25 11:18 ksql
drwxrwxrwx 3 5000 5000 2 Dec 6 12:38 spark
drwxr-xr-x 1 root root 4096 Dec 12 22:11 ..
drwxr-xr-x 5 5000 5000 3 Dec 12 23:45 .

References:

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

Hands-on MKE(MapR Kubernetes Ecosystem ) 1.0 release


Goal:

MKE (MapR Kubernetes Ecosystem) 1.0 has been released.
Basically, this release puts Spark and Drill into a Kubernetes environment.
The architecture is described in the documentation under Operators and Compute Spaces.
This article shares the step-by-step commands used to install and configure an MKE 1.0 environment.

Env:

MKE 1.0
MapR 6.1 secured
MacOS with kubectl installed as the client

Solution:

Currently we already have one secured MapR 6.1 cluster running in GCE (Google Compute Engine).
We just want to create a CSpace (Compute Space) in a Kubernetes cluster which can access the existing secured MapR 6.1 cluster.
So the high-level steps are:
    1. Create a Kubernetes Cluster in GKE(Google Kubernetes Engine).
    2. Bootstrap the Kubernetes Cluster
    3. Create and Deploy External Info for CSpace
    4. Create a CSpace
    5. Run a Drill Cluster in CSpace
    6. Run a Spark Application in CSpace

      1. Create a Kubernetes Cluster in GKE(Google Kubernetes Engine)

      1.1 Create a Kubernetes cluster named "hao-cluster" in GKE

      You can use GUI or gcloud commands.
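      For example, a minimal gcloud sketch similar to the earlier GKE article (the zone and node count are assumptions):
      # illustrative values only
      gcloud container clusters create hao-cluster --zone us-central1-a --num-nodes 3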

      1.2 Fetch the credentials for the Kubernetes cluster

      gcloud container clusters get-credentials hao-cluster --zone us-central1-a
      After that, make sure "kubectl cluster-info" returns correct cluster information.
      This step is to make kubectl work and connect to the correct Kubernetes cluster.

      1.3 Bind cluster-admin role to Google Cloud user

      kubectl create clusterrolebinding user-cluster-admin-binding --clusterrole=cluster-admin --user=xxx@yyy.com

      Note: "xxx@yyy.com" is the my Google Cloud user.
      Here we grant cluster admin role to the user to avoid any permission error in the next step when we create MapR CSI ClusterRole and ClusterRoleBinding. 

      2. Bootstrap the Kubernetes Cluster

      2.1 Download MKE github

      git clone https://github.com/mapr/mapr-operators
      cd ./mapr-operators
      git checkout mke-1.0.0.0

      2.2 Run the bootstrapinstall Utility

      ./bootstrap/bootstrapinstall.sh
      >>> Installing to an Openshift environment? (yes/no) [no]:
      >>> Install MapR CSI driver? (yes/no) [yes]:
      ...
      This Kubernetes environment has been successfully bootstrapped for MapR
      MapR components can now be created via the newly installed operators

      2.3 Verify the PODs/DaemonSet/StatefulSet are running under namespace "mapr-csi"/"mapr-system"/"spark-operator"/"drill-operator"

      kubectl get pods --all-namespaces
      Make sure all of the PODs are ready and in "Running" status.
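      One simple way to spot PODs that are not yet ready or Running (just a grep filter; adjust it if you also have Completed jobs) is:
      kubectl get pods --all-namespaces | grep -v Running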

      3. Create and Deploy External Info for CSpace

      Follow documentation: Automatically Generating and Deploying External Info for a CSpace

      3.1 Copy tools/gen-external-secrets.sh to one node of the MapR Cluster

      gcloud compute scp tools/gen-external-secrets.sh scott-mapr-core-pvp1:/tmp/
      chown mapr:mapr gen-external-secrets.sh

      3.2 As the admin user (typically mapr), generate a user ticket

      maprlogin password

      3.3 Run gen-external-secrets.sh as the admin user(typically mapr)

      /tmp/gen-external-secrets.sh 
      ...
      The external information generated for this cluster are available at: mapr-external-secrets-hao.yaml
      Please copy them to a machine where you can run the following command:
      kubectl apply -f mapr-external-secrets-hao.yaml

      3.4 Copy above generated mapr-external-secrets-hao.yaml to the kubectl client node

      gcloud compute scp scott-mapr-core-pvp1:/home/mapr/mapr-external-secrets-hao.yaml /tmp/

      3.5 Apply external secrets

      kubectl apply -f /tmp/mapr-external-secrets-hao.yaml

      4. Create a CSpace

      Follow documentation: Creating a Compute Space

      4.1 Copy the sample CSpace CR

      cp examples/cspaces/cr-cspace-full-gce.yaml /tmp/my_cr-cspace-full-gce.yaml

      4.2 Modify the sample CSpace CR

      At least, we need to modify the cluster name.
      vim /tmp/my_cr-cspace-full-gce.yaml

      4.3 Apply CSpace CR

      kubectl apply -f /tmp/my_cr-cspace-full-gce.yaml

      4.4 Verify the PODs are ready and running in namespace "mycspace"

      kubectl get pods -n mycspace -o wide
      Here are 3 PODs running:
      CSpace terminal, Hive Metastore and Spark HistoryServer.

      4.5 Log on to one of the PODs to verify CSI is working fine and MapRFS is accessible

      kubectl exec -ti hivemeta-f6d746f-n27h6 -n mycspace -- bash
      su - mapr
      maprlogin password
      hadoop fs -ls /

      5. Run a Drill Cluster in CSpace

      Follow documentation: Running Drillbits in Compute Spaces

      5.1 Copy the sample Drill CR

      cp examples/drill/drill-cr-full.yaml /tmp/my_drill-cr-full.yaml

      5.2 Modify the sample Drill CR

      At least, we need to modify the name of the CSpace.
      vim /tmp/my_drill-cr-full.yaml

      5.3 Apply Drill CR

      kubectl apply -f /tmp/my_drill-cr-full.yaml

      5.4 Verify the Drillbit PODs are ready and running inside CSpace

      kubectl get pods -n mycspace

      5.5 Log on to a drillbit POD to check the health of the Drill cluster

      kubectl exec -ti drillcluster1-drillbit-0 -n mycspace -- bash
      su - mapr
      maprlogin password

      /opt/mapr/drill/drill-1.16.0/bin/sqlline -u "jdbc:drill:zk=xxx:5181,yyy:5181,zzz:5181;auth=maprsasl"
      apache drill> select * from sys.drillbits;
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      | hostname | user_port | control_port | data_port | http_port | current | version | state |
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      | drillcluster1-drillbit-0.drillcluster1-svc.mycspace.svc.cluster.local | 21010 | 21011 | 21012 | 8047 | false | 1.16.0.10-mapr | ONLINE |
      | drillcluster1-drillbit-1.drillcluster1-svc.mycspace.svc.cluster.local | 21010 | 21011 | 21012 | 8047 | true | 1.16.0.10-mapr | ONLINE |
      +-----------------------------------------------------------------------+-----------+--------------+-----------+-----------+---------+----------------+--------+
      2 rows selected (2.228 seconds)

      5.6 Access Drillbit UI

      The first option is to do port forwarding:
      kubectl port-forward --namespace mycspace $(kubectl get pod --namespace mycspace --selector="controller-revision-hash=drillcluster1-drillbit-57876df7bf,drill-cluster=drillcluster1,statefulset.kubernetes.io/pod-name=drillcluster1-drillbit-1" --output jsonpath='{.items[0].metadata.name}') 8080:8047
      And then open UI:
      https://localhost:8080/

      The second option is to use the service which is already exposed as a LoadBalancer type:
      $ kubectl get service -n mycspace | grep drillcluster1-web-svc
      drillcluster1-web-svc LoadBalancer 10.0.0.111 xxx.xxx.xxx.123 8047:31642/TCP,21010:30945/TCP 25h
      And then open UI:
      https://xxx.xxx.xxx.123:8047

      6. Run a Spark Application in CSpace

      Follow documentation: Running Spark Applications in Compute Spaces

      6.1 Log on to the CSpace terminal POD

      kubectl port-forward -n mycspace cspaceterminal-bcdcf7bbb-p6227 7777:7777
      Note: the 2nd "7777" port is what you configured in CSpace CR file earlier, eg, /tmp/my_cr-cspace-full-gce.yaml:
      $ grep sshPort /tmp/my_cr-cspace-full-gce.yaml
      sshPort: 7777
      Then ssh to the cspace terminal POD:
      ssh mapr@localhost -p 7777

      6.2 Create the user ticket for the Spark Application submitter

      Follow documentation: Using the Ticketcreator Utility to Generate Secrets
      [mapr@cspaceterminal-bcdcf7bbb-p6227 ~]$ ticketcreator.sh
      Create a ticket for tenant user: [mapr]:
      Please provide 'mapr's password: [mapr]:
      uid=1002(mapr) gid=1003(mapr) groups=1003(mapr),0(root)
      Creating user ticket for mapr...
      MapR credentials of user 'mapr' for cluster 'gce1.cluster.com' are written to '/tmp/maprticket_1002'

      Please provide a name for your user secret: [mapr-user-secret-4030076998]:
      secret/mapr-user-secret-4030076998 created
      Please note secret name: mapr-user-secret-4030076998 for later use.

      Do you want to create a dynamic MapR Volume via CSI for storage of Spark secondary dependencies?
      This will create both a PVC and a PV. (y/n) [n]: y
      Provide the CSI PersistentVolumeClaim Name: [mapr-csi-pvc-2696334965]:
      persistentvolumeclaim/mapr-csi-pvc-2696334965 created
      Please note PVC name: mapr-csi-pvc-2696334965 for later use.

      Provide the CSI PersistentVolume Name: [mapr-csi-pv-2354307494]:
      persistentvolume/mapr-csi-pv-2354307494 created
      Please note PV name: mapr-csi-pv-2354307494 for later use.

      6.3 Copy the sample Spark pi job CR

      cp examples/spark/mapr-spark-pi.yaml /tmp/my_mapr-spark-pi.yaml

      6.4 Modify the sample Spark pi job CR

      vim /tmp/my_mapr-spark-pi.yaml
      At least modify the CSpace name, spark.mapr.user.secret and serviceAccount (a quick check is shown below).
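      A quick, purely illustrative way to double-check those fields before applying the CR is to grep for them:
      grep -nE "namespace|spark.mapr.user.secret|serviceAccount" /tmp/my_mapr-spark-pi.yaml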

      6.5 Submit the spark pi job

      kubectl apply -f /tmp/my_mapr-spark-pi.yaml

      6.6 Verify the spark pi job is running

      [mapr@cspaceterminal-bcdcf7bbb-p6227 ~]$ sparkctl list -n mycspace
      +----------+---------+----------------+-----------------+
      | NAME | STATE | SUBMISSION AGE | TERMINATION AGE |
      +----------+---------+----------------+-----------------+
      | spark-pi | RUNNING | 36s | N.A. |
      +----------+---------+----------------+-----------------+

      6.7 View the log for the spark pi job

      On the CSpace terminal POD using sparkctl:
      sparkctl log spark-pi  -n mycspace
      OR
      On the kubectl client node using kubectl:
      kubectl logs spark-pi-driver -n mycspace

      6.8 Access Spark HistoryServer UI

      Use the service which is already exposed as LoadBalancer type:
      $ kubectl get service -n mycspace | grep sparkhs-svc
      sparkhs-svc LoadBalancer 10.0.0.222 yyy.yyy.yyy.230 18480:31507/TCP 26h
      And then open UI:
      https://yyy.yyy.yyy.230:18480



      How to check if Spark job runs out of quota in CSpace


      Goal:

      How to check if Spark job runs out of quota in CSpace.

      Env:

      MKE 1.0

      Solution:

      The example CSpace configuration file in MKE 1.0 has the below 3 default PODs:
      • terminal
      • hivemetastore
      • sparkhs
      Each of them requests 2 CPUs + 8Gi of memory.
      This information is inside:
      git clone https://github.com/mapr/mapr-operators
      cd ./mapr-operators
      git checkout mke-1.0.0.0
      cat examples/cspaces/cr-cspace-full-gce.yaml

      cspaceservices:
        terminal:
          count: 1
          image: cspaceterminal-6.1.0:201912180140
          sshPort: 7777
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
        hivemetastore:
          count: 1
          image: hivemeta-2.3:201912180140
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
        sparkhs:
          count: 1
          image: spark-hs-2.4.4:201912180140
          requestcpu: "2000m"
          requestmemory: 8Gi
          logLevel: INFO
      So when we calculate how many resources are available for other ecosystem components like Spark and Drill, we need to take those resources into consideration.
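      One way to see how much of the CSpace quota is currently consumed (assuming the CSpace namespace is "mycspace", as in this article) is:
      kubectl describe resourcequota -n mycspace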

      How do we check if the Spark job is running out of quota in the CSpace?
      We need to get the Spark driver log using the below commands.
      Take the pi job for example:
      kubectl logs spark-pi-driver -n mycspace
      or
      sparkctl log spark-pi -n mycspace

      There are at least 3 scenarios:

      1. No nodes in Kubernetes cluster have sufficient resources

      For example, suppose the CSpace quota has 50 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 50-6=44 CPUs available for running one Spark job.
      If the Spark driver only needs 1 CPU, then we still have 43 CPUs available for Spark executors.
      For the below definition in the Spark job YAML file:
        executor:
          cores: 20
          instances: 2
          memory: "1024m"
          labels:
            version: 2.4.4
      I need to start 2 Spark executors with 20 CPUs each.

      Symptom:
      The requirement (40 CPUs) is below the available quota (43 CPUs); however, it may hit the below error in the Spark driver log:
      WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      Troubleshooting:
      2 Spark executor PODs are pending forever.
      $ kubectl get pods -n mycspace
      NAME READY STATUS RESTARTS AGE
      spark-pi-1581449230742-exec-1 0/1 Pending 0 17m
      spark-pi-1581449230742-exec-2 0/1 Pending 0 16m
      spark-pi-driver 1/1 Running 0 17m
      ...

      "kubectl describe executor-POD" tells the reason why they are pending:
      $ kubectl describe pod spark-pi-1581449230742-exec-1 -n mycspace
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning FailedScheduling 2m24s (x29 over 17m) default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
      Basically it means no node has sufficient resources.
      This can be confirmed by below commands:
      $ kubectl describe node
      ...
      Allocatable:
      attachable-volumes-csi-com.mapr.csi-kdf: 20
      attachable-volumes-gce-pd: 127
      cpu: 15890m
      ephemeral-storage: 47093746742
      hugepages-2Mi: 0
      memory: 56288592Ki
      pods: 110
      ...
      Root Cause:
      In this Kubernetes cluster, we have 3 nodes.
      The emptiest node can allocate at most 15.89 CPUs, which is less than the 20-CPU request.

      2. Spark executors run out of quota of CSpace

      For example, suppose the CSpace quota has 10 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 10-6=4 CPUs available for running one Spark job.
      If the Spark driver only needs 1 CPU, then we still have 3 CPUs available for Spark executors.
      For the below definition in the Spark job YAML file:
        driver:
          cores: 1
          coreLimit: "1000m"
          memory: "1024m"
          labels:
            version: 2.4.4
          serviceAccount: mapr-mycspace-cspace-sa
        executor:
          cores: 2
          instances: 2
          memory: "1024m"
          labels:
            version: 2.4.4
      I need to start 2 Spark executors with 2 CPUs each.

      Symptom:
      The requirement (4 CPUs) is above the available quota (3 CPUs); it may show the below error in the Spark driver log:
      ERROR util.Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-1581464839789-exec-3" is forbidden: exceeded quota: mycspacequota, requested: cpu=2, used: cpu=9, limited: cpu=10.
      However, the job can still complete because Spark will put both tasks into one executor.
      The Spark History Server's "Executors" tab reflects this.
      If we reduce the CPU requirement for each Spark executor from 2 to 1, the "Executors" tab shows the difference as a comparison.

      3. Spark driver runs out of quota of CSpace

      For example, suppose the CSpace quota has 10 CPUs and no other PODs are running besides the 3 default PODs.
      We still have 10-6=4 CPUs available for running one Spark job.
      If the Spark driver needs 5 CPUs, that alone is already above the available quota.
      For the below definition in the Spark job YAML file:
        driver:
          cores: 5
          coreLimit: "5000m"
          memory: "1024m"
          labels:
            version: 2.4.4
          serviceAccount: mapr-mycspace-cspace-sa
      I need to start 1 Spark driver with 5 CPUs.

      Symptom:
      The Spark job will fail, as can be seen by checking its status using sparkctl:
      $ sparkctl list -n mycspace
      +----------+--------+----------------+-----------------+
      | NAME | STATE | SUBMISSION AGE | TERMINATION AGE |
      +----------+--------+----------------+-----------------+
      | spark-pi | FAILED | 1m | N.A. |
      +----------+--------+----------------+-----------------+
      Troubleshooting:
      No driver log is generated yet:
      $ kubectl logs spark-pi-driver -n mycspace -f |tee /tmp/sparkjob.txt
      Error from server (NotFound): pods "spark-pi-driver" not found
      This is because even the Spark driver POD has not started yet:
      $ kubectl get pods -n mycspace
      NAME READY STATUS RESTARTS AGE
      cspaceterminal-bcdcf7bbb-f68r9 1/1 Running 0 5h18m
      hivemeta-f6d746f-jq5rj 1/1 Running 0 5h18m
      sparkhs-667f46dcfd-24k86 1/1 Running 0 5h18m
      "kubectl describe sparkapplication" should show the reason:
      $ kubectl describe sparkapplication spark-pi -n mycspace
      ...
      Application State:
      Error Message: failed to run spark-submit for SparkApplication mycspace/spark-pi: 20/02/11 23:59:59 ERROR deploy.SparkSubmit$$anon$2: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
      ...
      Root Cause:
      The Spark driver POD could not be started because the CSpace is already out of quota.



      Hbase replication cheat sheet


       Goal:

      This article records the common commands and issues for hbase replication.


      Solution:

      1. Add the target as peer

      hbase shell> add_peer "us_east","hostname.of.zookeeper:5181:/path-to-hbase"

      2. Enable and Disable table replication

      hbase shell> enable_table_replication "t1"
      hbase shell> disable_table_replication "t1"

      3. Copy table from source to target

      hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hostname.of.zookeeper:5181:/path-to-hbase t1

      4. Remove target as peer

      hbase shell> remove_peer "us_east"

      5. List all peers

      hbase shell> list_peers

      6. Verify the rows between source and target table

      hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication peer1 table1

      Compare the GOODROWS and BADROWS.

      7.  Monitor Replication Status

      # Prints the status of each source and its sinks, sorted by hostname.
      hbase shell> status 'replication'

      # Prints the status for each replication source, sorted by hostname.
      hbase shell> status 'replication', 'source'

      # Prints the status for each replication sink, sorted by hostname.
      hbase shell> status 'replication', 'sink'

      8.  HBase Replication Metrics

      source.sizeOfLogQueue: Number of WALs to process (excludes the one which is being processed) at the replication source.

      source.shippedOps: Number of mutations shipped.

      source.logEditsRead: Number of mutations read from WALs at the replication source.

      source.ageOfLastShippedOp: Age of the last batch shipped by the replication source.

      9. Practice for replicating one existing table from cluster A to cluster B

      on cluster A:
      hbase shell> add_peer "B","hostname.of.zookeeper:5181:/path-to-hbase"
      hbase shell> enable_table_replication "t1"
      hbase shell> disable_peer 'B'

      Then use either CopyTable, Export/Import or ExportSnapshot to copy table "t1" from A to B.
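      For example, a hedged ExportSnapshot sketch (the snapshot name, target URI, and mapper count are illustrative; restoring the snapshot on B with clone_snapshot or restore_snapshot is a separate step):
      hbase shell> snapshot 't1', 't1_snapshot'
      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1_snapshot -copy-to maprfs:///path-to-hbase-on-B -mappers 4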

      hbase shell> enable_peer 'B'

      10. Hbase replication related parameters

      <property>
      <name>hbase.replication</name>
      <value>true</value>
      <description>Allow HBase tables to be replicated.</description>
      </property>

      <property>
      <name>replication.source.nb.capacity</name>
      <value>25000</value>
      <description>The data records synchronized to the sink side each time cannot be greater than the threshold, and the default is 25000</description>
      </property>

      <property>
      <name>replication.source.ratio</name>
      <value>0.1</value>
      <description>The RegionServer of this ratio is selected from the cluster to be backed up as potential ReplicationSink, and the default value is 0.1</description>
      </property>

      <property>
      <name>replication.source.size.capacity</name>
      <value>67108864</value>
      <description>The size of the data synchronized to the sink side each time cannot exceed this threshold, and the default is 64M</description>
      </property>

      <property>
      <name>replication.sleep.before.failover</name>
      <value>2000</value>
      <description>Before transferring the ReplicationQueue in the dead RegionServer to another RegionServer, take a nap for 2 seconds</description>
      </property>

      <property>
      <name>replication.executor.workers</name>
      <value>1</value>
      <description>The number of threads engaged in replication, the default is 1</description>
      </property>

      Known Issues

      1. HBASE-18111

      The cluster connection is aborted when the ZookeeperWatcher receives an AuthFailed event. Then the HBaseInterClusterReplicationEndpoint's replicate() method gets stuck in a while loop.

      One symptom is that a jstack on the RegionServer shows:

      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
      at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
      at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
      at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)

      This is fixed in 1.3.3, 1.4.0, and 2.0.0.

      2.  HBASE-24359

      Replication gets stuck after we delete column families from both the source and the sink if the source still has outstanding edits that it can no longer get rid of. All replication then backs up behind these unreplicatable edits.

      The fix introduces a new config, hbase.replication.drop.on.deleted.columnfamily, whose default is false. When set to true, replication will drop the edits for column families that have been deleted from the replication source and target.

      This is fixed in 2.3.0 and 3.0.0.

      References

      https://blog.cloudera.com/what-are-hbase-znodes/

      https://blog.cloudera.com/apache-hbase-replication-overview/

       https://blog.cloudera.com/online-apache-hbase-backups-with-copytable/

      https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/fault-tolerance/content/manually_enable_hbase_replication.html

      https://blog.cloudera.com/introduction-to-apache-hbase-snapshots/

       

      Hbase master failed to start with error "java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned"


      Symptom:

      The HBase master fails to start with the error "java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned".

      It can happen when starting the master or when switching to a new active master.

      Sample error messages are:

      2000-01-01 01:01:01,999 FATAL [myhost:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
      java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
      at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
      at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
      at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
      at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
      at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
      at java.lang.Thread.run(Thread.java:745)

      Or

      2000-01-01 01:01:01,999 FATAL [myhost:16000.activeMasterManager] master.HMaster: Failed to become active master
      java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
      at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:104)
      at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1005)
      at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:799)
      at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:191)
      at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1783)
      at java.lang.Thread.run(Thread.java:745)

      Env:

      hbase 1.1.8

      Root Cause:

      When the HBase master is starting, it assigns the meta table first and then assigns the other tables.

      So hbase:namespace is treated the same as any other table in this assignment phase.

      If there are too many tables or regions, the default 300000ms (5 minutes) may not be enough.

      Solution:

      Increase hbase.master.namespace.init.timeout in hbase-site.xml and restart the HBase Master.
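      For example, the property to add or update in hbase-site.xml (the 600000ms value is only an illustration; size it to the number of tables and regions):
      <!-- example value only: 10 minutes -->
      <property>
      <name>hbase.master.namespace.init.timeout</name>
      <value>600000</value>
      </property>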


       


      What does "enable_table_replication" do internally in Hbase replication?


      Goal:

      This article explains what the command "enable_table_replication" does internally in HBase replication by looking into the source code.

      It also explains the difference between the below 2 commands, which are shown in different articles.

      hbase shell> enable_table_replication "t1"

      vs.

      hbase shell> disable 't1'
      hbase shell> alter 't1', {NAME => 'column_family_name', REPLICATION_SCOPE => '1'}
      hbase shell> enable 't1'

      Env:

      Hbase 1.1.8

      Analysis:

      1. Hbase Source code analysis for "enable_table_replication"

      a. "enable_table_replication" is a ruby command in hbase shell

      Inside hbase-shell/src/main/ruby/shell/commands/enable_table_replication.rb,

      it is calling replication_admin.enable_tablerep(table_name).

      b. "enable_tablerep"

      Inside hbase-shell/src/main/ruby/hbase/replication_admin.rb,

      it is calling @replication_admin.enableTableRep(tableName).

      c. "enableTableRep"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      it is calling:

          checkAndSyncTableDescToPeers(tableName, splits);
      setTableRep(tableName, true);

      d. "checkAndSyncTableDescToPeers"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      it creates the same table on the peer when it does not exist, and throws an exception if the table exists on the peer cluster but the descriptors are not the same.

      e. "setTableRep"

      Inside hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java,

      if (isTableRepEnabled(htd) ^ isRepEnabled) {
        boolean isOnlineSchemaUpdateEnabled =
            this.connection.getConfiguration()
                .getBoolean("hbase.online.schema.update.enable", true);
        if (!isOnlineSchemaUpdateEnabled) {
          admin.disableTable(tableName);
        }
        for (HColumnDescriptor hcd : htd.getFamilies()) {
          hcd.setScope(isRepEnabled ? HConstants.REPLICATION_SCOPE_GLOBAL
              : HConstants.REPLICATION_SCOPE_LOCAL);
        }
        admin.modifyTable(tableName, htd);
        if (!isOnlineSchemaUpdateEnabled) {
          admin.enableTable(tableName);
        }
      }

      Basically it checks the value of hbase.online.schema.update.enable (default=true).

      If hbase.online.schema.update.enable=true, it modifies the REPLICATION_SCOPE for ALL column families online (to GLOBAL).

      Otherwise, it will first disable the table, modify the REPLICATION_SCOPE for ALL column families, and then enable the table.

      2. Differences

      Based on the above analysis, "enable_table_replication" can help create the table on the target peer if it does not exist, and detect differences in the table descriptors if it does.

      It modifies the REPLICATION_SCOPE for ALL column families.

      It checks whether hbase.online.schema.update.enable=true, and then decides if disabling/enabling the table is needed.
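      To verify the effect on a table, one quick check (using table "t1" from the example commands above) is to describe it in hbase shell and confirm REPLICATION_SCOPE => '1' on every column family:
      hbase shell> describe 't1'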


      Spark Code -- How to replace Null values in DataFrame/Dataset


      Goal:

      This article shares some Scala example code to explain how to replace Null values in a DataFrame/Dataset.

      Solution:

      Note: As per the code and API for org.apache.spark.sql, DataFrame is basically Dataset[Row].

      So going forward, we always check the code or API for Dataset when researching DataFrame/Dataset.

      Dataset has an untyped transformation named "na" which returns DataFrameNaFunctions:

      def na: DataFrameNaFunctions 

      DataFrameNaFunctions has methods named "fill" with different signatures to replace NULL values for different datatype columns.

      Let's create a sample Dataframe firstly as the data source:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

      val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
      Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
      Row("Jerry ",null,"Brown",null,5000.67,null,false),
      Row("Henry ","B","Jones",66666,null,20015464564L,true)
      )

      val simpleSchema = StructType(Array(
      StructField("firstname",StringType,true),
      StructField("middlename",StringType,true),
      StructField("lastname",StringType,true),
      StructField("zipcode", IntegerType, true),
      StructField("salary", DoubleType, true),
      StructField("account", LongType, true),
      StructField("isAlive", BooleanType, true)
      ))

      val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

      Data source and its schema look as below:

      scala> df.printSchema()
      root
      |-- firstname: string (nullable = true)
      |-- middlename: string (nullable = true)
      |-- lastname: string (nullable = true)
      |-- zipcode: integer (nullable = true)
      |-- salary: double (nullable = true)
      |-- account: long (nullable = true)
      |-- isAlive: boolean (nullable = true)


      scala> df.show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      1. Replace Null in ALL numeric columns.

      Here it includes ALL IntegerType, DoubleType and LongType columns.

      scala> df.na.fill(0).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| 0|5000.67| 0| false|
      | Henry | B| Jones| 66666| 0.0|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

2. Replace Null in specified numeric columns.

      For example, include only the numeric column named "account".

      scala> df.na.fill(0,Array("account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| 0| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      3. Replace Null in ALL string columns.

      scala> df.na.fill("").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | | Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      4. Replace Null in ALL boolean columns.

      scala> df.na.fill(true).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| true|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+
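
In addition to the overloads above, "fill" also accepts a Map so that different columns can be replaced with different values in one call. Below is a minimal sketch against the same df; the replacement values are just illustrative choices.

df.na.fill(Map("middlename" -> "unknown", "zipcode" -> 0, "salary" -> 0.0)).show()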

       

      Note: Here is the Complete Sample Code.

      Spark Code -- How to drop Null values in DataFrame/Dataset


      Goal:

      This article shares some Scala example codes to explain how to drop Null values in DataFrame/Dataset.

      Solution:

      DataFrameNaFunctions has methods named "drop" with different signatures to drop NULL values under different scenarios.

      Let's create a sample Dataframe firstly as the data source:

      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}

      val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
      Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
      Row("Jerry ",null,"Brown",null,5000.67,null,false),
      Row("Henry ","B","Jones",66666,null,20015464564L,true)
      )

      val simpleSchema = StructType(Array(
      StructField("firstname",StringType,true),
      StructField("middlename",StringType,true),
      StructField("lastname",StringType,true),
      StructField("zipcode", IntegerType, true),
      StructField("salary", DoubleType, true),
      StructField("account", LongType, true),
      StructField("isAlive", BooleanType, true)
      ))

      val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

      Data source and its schema look as below:

      scala> df.printSchema()
      root
      |-- firstname: string (nullable = true)
      |-- middlename: string (nullable = true)
      |-- lastname: string (nullable = true)
      |-- zipcode: integer (nullable = true)
      |-- salary: double (nullable = true)
      |-- account: long (nullable = true)
      |-- isAlive: boolean (nullable = true)


      scala> df.show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      1. Drop rows containing NULL in any columns.(version 1)

      Here only one row does not have NULL in any columns.

      scala> df.na.drop().show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      2. Drop rows containing NULL in any columns. (version 2)

      Same as above. This is just another version.

      scala> df.na.drop("any").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+

       3. Drop rows containing NULL in all columns.

      Here it shows all rows because there is no such all-NULL rows.

      scala> df.na.drop("all").show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

       4. Drop rows containing NULL in any of specified column(s).

      scala> df.na.drop(Seq("salary","account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      +---------+----------+--------+-------+-------+-----------+-------+

      5. Drop rows containing NULL in all of specified column(s).

      scala> df.na.drop("all",Seq("salary","account")).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      | Tom| A| Smith| 44444|4000.45|19886546456| null|
      | Jerry | null| Brown| null|5000.67| null| false|
      | Henry | B| Jones| 66666| null|20015464564| true|
      +---------+----------+--------+-------+-------+-----------+-------+

      6. Drop rows containing less than minNonNulls non-null values. 

      It means we keep the rows with at least minNonNulls non-null values. 

      Here I want to keep the rows with all 7 non-null values.

      scala> df.na.drop(7).show()
      +---------+----------+--------+-------+-------+-----------+-------+
      |firstname|middlename|lastname|zipcode| salary| account|isAlive|
      +---------+----------+--------+-------+-------+-----------+-------+
      | Jim| | Green| 33333|3000.12|19605466456| true|
      +---------+----------+--------+-------+-------+-----------+-------+
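
There is also a "drop" overload that combines minNonNulls with a list of columns: it keeps a row only if that row has at least minNonNulls non-null values among the specified columns. A minimal sketch against the same df:

df.na.drop(2, Seq("salary","account")).show()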

      Note: Here is the Complete Sample Code.

      Spark Code -- Use date_format() to convert timestamp to String


      Goal:

      This article shares some Scala example codes to explain how to use date_format() to convert timestamp to String.

      Solution:

date_format() is one function of org.apache.spark.sql.functions to convert a date/timestamp to String.

This is the doc for the datetime pattern.

      Here is a simple example to show this in spark-sql way.

      spark.sql("""
      SELECT current_timestamp() as ts,
      date_format(current_timestamp(),"yyyy-MM-dd") as `yyyy-MM-dd`,
      date_format(current_timestamp(),"MMM") as `MMM`,
      date_format(current_timestamp(),"MMMM") as `MMMM`,
      date_format(current_timestamp(),"d") as `d`,
      date_format(current_timestamp(),"E") as `E`,
      date_format(current_timestamp(),"EEEE") as `EEEE`,
      date_format(current_timestamp(),"HH:mm:ss.S") as `HH:mm:ss.S`,
      date_format(current_timestamp(),"Z") as `Z`,
      date_format(current_timestamp(),"z") as `z`
      """).show()

      The output is:

      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
      | ts|yyyy-MM-dd|MMM| MMMM| d| E| EEEE| HH:mm:ss.S| Z| z|
      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
      |2021-01-28 16:51:...|2021-01-28|Jan|January| 28|Thu|Thursday|16:51:48.802|-0800|PST|
      +--------------------+----------+---+-------+---+---+--------+------------+-----+---+
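
The same conversion can also be done through the DataFrame API with the date_format() function from org.apache.spark.sql.functions; a minimal sketch:

import org.apache.spark.sql.functions._

spark.range(1).select(
  current_timestamp().as("ts"),
  date_format(current_timestamp(), "yyyy-MM-dd").as("yyyy-MM-dd"),
  date_format(current_timestamp(), "HH:mm:ss.S").as("HH:mm:ss.S")
).show(false)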


      Spark Tuning -- Use Partition Discovery feature to do partition pruning


      Goal:

      This article explains how to use Partition Discovery feature to do partition pruning.

      Solution:

If the data directories are organized in the same way that Hive partitions are, Spark can discover the partition column(s) using the Partition Discovery feature.

After that, queries on top of the partitioned table can do partition pruning.

      Below is one example:

      1. Create a DataFrame based on sample data and add a new duplicate column.

      val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/data/retail-data/by-day/*.csv")
      //Add a new column named "AnotherCountry" to have the same value as column "Country" so that we can compare the different query plan.
      val newdf = df.withColumn("AnotherCountry", expr("Country"))

      2.  Save the DataFrame as partitioned orc files.

      val targetdir = "/tmp/test_partition_pruning/newdf"
      newdf.write.mode("overwrite").format("orc").partitionBy("Country").save(targetdir)

      3. Let's take a look at the target directory.

      newdf.write.mode("overwrite").format("orc").partitionBy("Country").save(targetdir)
      val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(new Path("/tmp/test_partition_pruning/newdf")).filter(_.isDir).map(_.getPath).foreach(println)

      Output is:

      maprfs:///tmp/test_partition_pruning/newdf/Country=Australia
      maprfs:///tmp/test_partition_pruning/newdf/Country=Netherlands
      maprfs:///tmp/test_partition_pruning/newdf/Country=Canada
      maprfs:///tmp/test_partition_pruning/newdf/Country=Italy
      maprfs:///tmp/test_partition_pruning/newdf/Country=Denmark
      maprfs:///tmp/test_partition_pruning/newdf/Country=Iceland
      ...

      4. "good" query uses partition pruning

      val readdf = spark.read.format("orc").load(targetdir)
      readdf.createOrReplaceTempView("readdf")

      val goodsql = "SELECT * FROM readdf WHERE Country = 'Australia'"
      val goodresult = spark.sql(goodsql)
      goodresult.explain
      println(s"Result: ${goodresult.count()} ")

      Output:

      == Physical Plan ==
      *(1) FileScan orc [InvoiceNo#53,StockCode#54,Description#55,Quantity#56,InvoiceDate#57,UnitPrice#58,CustomerID#59,AnotherCountry#60,Country#61] Batched: true, Format: ORC, Location: InMemoryFileIndex[maprfs:///tmp/test_partition_pruning/newdf], PartitionCount: 1, PartitionFilters: [isnotnull(Country#61), (Country#61 = Australia)], PushedFilters: [], ReadSchema: struct<InvoiceNo:string,StockCode:string,Description:string,Quantity:int,InvoiceDate:timestamp,Un...

      Result: 1259

      5. "Bad" query can not use partition pruning

      val badsql  = "SELECT * FROM readdf WHERE AnotherCountry = 'Australia'"
      val badresult = spark.sql(badsql)
      badresult.explain
      println(s"Result: ${badresult.count()} ")

      Output:

      == Physical Plan ==
      *(1) Project [InvoiceNo#53, StockCode#54, Description#55, Quantity#56, InvoiceDate#57, UnitPrice#58, CustomerID#59, AnotherCountry#60, Country#61]
      +- *(1) Filter (isnotnull(AnotherCountry#60) && (AnotherCountry#60 = Australia))
      +- *(1) FileScan orc [InvoiceNo#53,StockCode#54,Description#55,Quantity#56,InvoiceDate#57,UnitPrice#58,CustomerID#59,AnotherCountry#60,Country#61] Batched: true, Format: ORC, Location: InMemoryFileIndex[maprfs:///tmp/test_partition_pruning/newdf], PartitionCount: 38, PartitionFilters: [], PushedFilters: [IsNotNull(AnotherCountry), EqualTo(AnotherCountry,Australia)], ReadSchema: struct<InvoiceNo:string,StockCode:string,Description:string,Quantity:int,InvoiceDate:timestamp,Un...

      Result: 1259

      Analysis:

      1. Explain plan

      From above explain plans, it is pretty obvious why the "good" query uses partition pruning while the "bad" query does not -- column "Country" is the partition key.

      The "good" query can actually push the "Filter" inside "FileScan" as "PartitionFilters".

      So it only needs to scan 1 partition(directory):

      PartitionCount: 1, PartitionFilters: [isnotnull(Country#61), (Country#61 = Australia)]

      However the "bad" query has to scan all the 38 partitions(direcotries) firstly and then apply Filter:

      PartitionCount: 38, PartitionFilters: []
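
As a side note, when only one partition is ever needed, that partition sub-directory can also be read directly so the other directories are never scanned; the partition column itself will then not appear in the schema because it is encoded in the path. A minimal sketch, assuming the same targetdir as above:

// Read only the Country=Australia partition directory
val onlyAustralia = spark.read.format("orc").load(targetdir + "/Country=Australia")
onlyAustralia.printSchema()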

      2. Event log/Web UI

      By only looking at the related Stage for the "good" query, the sum of input Size is only 80+KB while the sum of records = the final result = 1259.


       

       

       

       

      By only looking at the related Stage for the "bad" query, the sum of input Size is 2+MB while the sum of records = the final result = 1259.


       

       

       

       

      Of course, the execution time also has large differences. 

      Note: Here is the Complete Sample Code.


      Spark Tuning -- Column Projection for Parquet


      Goal:

      This article explains the column projection for parquet format(or other columnar format) in Spark.

      Solution:

      Spark can do column projection for columnar format data such as Parquet.

      The idea is to only read the needed columns instead of reading all of the columns.

      This can reduce lots of I/O needed to improve the performance.

      Below is one example.

Note: To show the performance difference for column projection, I disabled the Parquet filter pushdown feature by setting spark.sql.parquet.filterPushdown=false in my configuration.

I will discuss the Parquet filter pushdown feature in another article.

1. Save a sample DataFrame as parquet files and register it as a temp view.

val df = spark.read.json("/data/activity-data/")
val targetdir = "/tmp/test_column_projection/newdf"
df.write.mode("overwrite").format("parquet").save(targetdir)
// Register the parquet files as a temp view named "readdf", which the queries below use
val readdf = spark.read.format("parquet").load(targetdir)
readdf.createOrReplaceTempView("readdf")

      2. Select only 1 column

      val somecols  = "SELECT Device FROM readdf WHERE Model='something_not_exist'"
      val goodresult = spark.sql(somecols)
      goodresult.explain
      goodresult.collect

      Output:

      scala> goodresult.explain
      == Physical Plan ==
      *(1) Project [Device#48]
      +- *(1) Filter (isnotnull(Model#50) && (Model#50 = something_not_exist))
      +- *(1) FileScan parquet [Device#48,Model#50] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Model), EqualTo(Model,something_not_exist)], ReadSchema: struct<Device:string,Model:string>

      scala> goodresult.collect
      res5: Array[org.apache.spark.sql.Row] = Array()

      3. Select ALL columns

      val allcols = "SELECT * FROM readdf where Model='something_not_exist'"
      val badresult = spark.sql(allcols)
      badresult.explain
      badresult.collect

       Output:

      scala> badresult.explain
      == Physical Plan ==
      *(1) Project [Arrival_Time#46L, Creation_Time#47L, Device#48, Index#49L, Model#50, User#51, gt#52, x#53, y#54, z#55]
      +- *(1) Filter (isnotnull(Model#50) && (Model#50 = something_not_exist))
      +- *(1) FileScan parquet [Arrival_Time#46L,Creation_Time#47L,Device#48,Index#49L,Model#50,User#51,gt#52,x#53,y#54,z#55] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Model), EqualTo(Model,something_not_exist)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      scala> badresult.collect
      res7: Array[org.apache.spark.sql.Row] = Array()

      Analysis:

      1. Explain plan

During FileScan, we can see only the needed columns are read due to the column projection feature:

      FileScan parquet [Device#48,Model#50]

      vs

      FileScan parquet [Arrival_Time#46L,Creation_Time#47L,Device#48,Index#49L,Model#50,User#51,gt#52,x#53,y#54,z#55]
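
The same narrow FileScan can be reproduced with the DataFrame API by selecting only the needed columns; a minimal sketch, assuming the same readdf as above:

// Only "Device" and "Model" should appear in the FileScan read schema
readdf.filter("Model = 'something_not_exist'").select("Device").explain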

      2. Event log/Web UI

      The "SELECT only 1 column"'s stage shows sum of Input Size=868.3KB.

       


       

       

       

       

      The "SELECT ALL columns"'s stage shows sum of Input Size=142.3MB.   

       


       

       

       

       

       Note: Here is the Complete Sample Code.


      Spark Tuning -- Predicate Pushdown for Parquet


      Goal:

      This article explains the Predicate Pushdown for Parquet in Spark.

      Solution:

Spark can push the predicate down into the Parquet scanning phase so that it can reduce the amount of data to be read.

This is done by checking the metadata of the Parquet files to filter out unnecessary data.

      Note: Refer to this blog on How to use pyarrow to view the metadata information inside a Parquet file.

      This feature is controlled by a parameter named spark.sql.parquet.filterPushdown (default is true).
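
Since this is a SQL conf, it can also be toggled at runtime for the current session instead of only in the SparkSession builder; a minimal sketch in spark-shell:

// Check the current value, then disable/enable the feature for this session
spark.conf.get("spark.sql.parquet.filterPushdown")
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.conf.set("spark.sql.parquet.filterPushdown", "true")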

      Let's use the parquet files created in another blog for example.

      1. Create a DataFrame on parquet files

      val targetdir = "/tmp/test_column_projection/newdf"
      val readdf = spark.read.format("parquet").load(targetdir)
      readdf.createOrReplaceTempView("readdf")

      2. Let's look at the data distribution for column "Index".

      scala> spark.sql("SELECT min(Index), max(Index), count(distinct Index),count(*) FROM readdf").show
      +----------+----------+---------------------+--------+
      |min(Index)|max(Index)|count(DISTINCT Index)|count(1)|
      +----------+----------+---------------------+--------+
      | 0| 396342| 396343| 6240991|
      +----------+----------+---------------------+--------+

      As we know, the data range of this column "Index" is 0~396342.

Knowing this, we can design the tests below to show the different performance results for different filters.

      3. Query 1 and its explain plan

      val q1  = "SELECT * FROM readdf WHERE Index=20000"
      val result1 = spark.sql(q1)
      result1.explain
      result1.collect

      Query 1 will have to scan lots of data because the "Index=20000" data is in most of the parquet chunks.

      The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 20000))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,20000)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      4. Query 2 and its explain plan

      val q2  = "SELECT * FROM readdf where Index=9999999999"
      val result2 = spark.sql(q2)
      result2.explain
      result2.collect

Query 2 only needs to scan very little data because "Index=9999999999" is outside the value range of that column.

      The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 9999999999))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      5. Query 3 and its explain plan after disabling spark.sql.parquet.filterPushdown

Everything is the same as Query 2; the only difference is that we manually disabled this feature by setting the below in the config:

config("spark.sql.parquet.filterPushdown",false)

Because we disabled this feature, it has to scan all the parquet data.

The explain plan:

      == Physical Plan ==
      *(1) Project [Arrival_Time#26L, Creation_Time#27L, Device#28, Index#29L, Model#30, User#31, gt#32, x#33, y#34, z#35]
      +- *(1) Filter (isnotnull(Index#29L) && (Index#29L = 9999999999))
      +- *(1) FileScan parquet [Arrival_Time#26L,Creation_Time#27L,Device#28,Index#29L,Model#30,User#31,gt#32,x#33,y#34,z#35] Batched: true, Format: Parquet, Location: InMemoryFileIndex[maprfs:///tmp/test_column_projection/newdf], PartitionFilters: [], PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)], ReadSchema: struct<Arrival_Time:bigint,Creation_Time:bigint,Device:string,Index:bigint,Model:string,User:stri...

      Analysis:

      1. Explain plan

      As we can see, all of the explain plans look the same. 

      Even after we disabled spark.sql.parquet.filterPushdown, the explain plan did not show any difference between Query 2 and Query 3.

This means that, at least from the query plan, we cannot tell whether the predicate is actually pushed down or not.

      All of the explain plans show there is predicate push down:

      PushedFilters: [IsNotNull(Index), EqualTo(Index,9999999999)]

Note: these tests were done in Spark 2.4.4; this behavior may change in future releases.

      2. Event log/Web UI

Query 1's stage shows sum of Input Size is 142.3MB and sum of Records is 6240991.

Query 2's stage shows sum of Input Size is 44.4KB and sum of Records is 0.

Query 3's stage shows sum of Input Size is 142.3MB and sum of Records is 6240991.

The above metrics clearly show the selectivity of this predicate pushdown feature, based on both the filter and the metadata of the Parquet files.

      The performance difference between Query 2 and Query 3 shows how powerful this feature is.

Note: If, based on the filter, the metadata of all Parquet files shows that most/all of the data matches, then this feature may not provide good selectivity. So data distribution also matters here.
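
One common way to make the row-group min/max statistics more selective on a given filter column is to sort the data on that column before writing, so that each file/row group covers a narrow value range. A minimal sketch (not part of the original tests; the output path is just a hypothetical example):

// Rewrite the data sorted by "Index" so each row group covers a narrow Index range
val sortedDir = "/tmp/test_column_projection/newdf_sorted_by_index"
readdf.sort("Index").write.mode("overwrite").format("parquet").save(sortedDir)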

      Note: Here is the Complete Sample Code



      How to use pyarrow to view the metadata information inside a Parquet file


      Goal:

      This article explains how to use PyArrow to view the metadata information inside a Parquet file.

      Env:

      CentOS 7

      Solution:

      1. Create a Python 3 virtual environment

This step is needed because the default Python version is 2.x on CentOS/RedHat 7, which is too old to install the latest PyArrow version.

Using Python 3 and its pip3 is the way to go.

However, if we just use "alternatives" to switch the system Python to python3, it may break other tools such as "yum" which depend on python2.

Using a virtual environment is the easiest way to keep both python2 and python3 on CentOS 7.

      python3 -m venv .venv
      . .venv/bin/activate

       2. Install PyArrow and its dependencies

      pip install --upgrade pip setuptools
      pip install Cython
      pip install pyarrow

      3.  Read the metadata inside a Parquet file

      >>> import pyarrow.parquet as pq
      >>> parquet_file = pq.ParquetFile('/.../part-00000-67861019-20bb-4396-96f8-146141351ff2-c000.snappy.parquet')

      >>> parquet_file.metadata
      <pyarrow._parquet.FileMetaData object at 0x7f8014250bf8>
      created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
      num_columns: 10
      num_rows: 546097
      num_row_groups: 1
      format_version: 1.0
      serialized_size: 1886

      >>> parquet_file.metadata.row_group(0)
      <pyarrow._parquet.RowGroupMetaData object at 0x7f808aaf4f98>
      num_columns: 10
      num_rows: 546097
      total_byte_size: 17515040

      >>> parquet_file.metadata.row_group(0).column(3)
      <pyarrow._parquet.ColumnChunkMetaData object at 0x7f801356cf48>
      file_offset: 6588315
      file_path:
      physical_type: INT64
      num_values: 546097
      path_in_schema: Index
      is_stats_set: True
      statistics:
      <pyarrow._parquet.Statistics object at 0x7f8013fd2ea8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
      compression: SNAPPY
      encodings: ('BIT_PACKED', 'RLE', 'PLAIN')
      has_dictionary_page: False
      dictionary_page_offset: None
      data_page_offset: 6588315
      total_compressed_size: 2277936
      total_uncompressed_size: 4369155

      >>> parquet_file.metadata.row_group(0).column(3).statistics
      <pyarrow._parquet.Statistics object at 0x7f801356cef8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE

      From above information, we can tell that:

• The Parquet file was written by parquet-mr version 1.10.1, and its format version is 1.0.
• It has only 1 row group inside.
• It has 10 columns and 546097 rows.
• The 4th column (.column(3)), named "Index", is an INT64 column with min=0 and max=396316.


      Spark Tuning -- How to use SparkMeasure to measure Spark job metrics


      Goal:

      This article explains how to use SparkMeasure to measure Spark job metrics.

      Env:

      Spark 2.4.4 with Scala 2.11.12 

      SparkMeasure 0.17

      Concept:

SparkMeasure is a very cool tool to collect aggregated stage-level or task-level metrics for Spark jobs or queries. Basically, it registers customized Spark listeners.

Note: Collecting at task level has additional performance overhead compared to collecting at stage level. Unless you want to study skew effects across tasks, I would suggest collecting at stage level.

      Regarding where those metrics come from, we can look into the Spark source code under "core/src/main/scala/org/apache/spark/executor" folder.

      You will find the metrics explanation inside TaskMetrics.scala, ShuffleReadMetrics.scala, ShuffleWriteMetrics.scala, etc.

      For example:

  /**
   * Time the executor spends actually running the task (including fetching shuffle data).
   */
  def executorRunTime: Long = _executorRunTime.sum

  /**
   * CPU Time the executor spends actually running the task
   * (including fetching shuffle data) in nanoseconds.
   */
  def executorCpuTime: Long = _executorCpuTime.sum

Or you can find the explanation of those metrics in the Doc.

      Installation:

In this post, we will use spark-shell or spark-submit to test, so we just need to follow this doc to build or download the jar file.

Note: Before downloading/building the jar, make sure the jar matches your Spark and Scala versions.

      a. Download the Jar from Maven Central

      For example, based on my spark and scala version, I will choose below version:

      wget https://repo1.maven.org/maven2/ch/cern/sparkmeasure/spark-measure_2.11/0.17/spark-measure_2.11-0.17.jar

      b. Build the Jar using sbt from source code

      git clone https://github.com/lucacanali/sparkmeasure
      cd sparkmeasure
      sbt +package
      ls -l target/scala-2.11/spark-measure*.jar # location of the compiled jar

      Solution:

      In this post, we will use the sample data and queries from another post "Predicate Pushdown for Parquet".

      1.  Interactive Mode using Spark-shell for single job/query

      Please refer to this doc for Interactive Mode for Spark-shell.
      spark-shell --jars spark-measure_2.11-0.17.jar --master yarn --deploy-mode client --executor-memory 1G --num-executors 4

      Stage metrics:

      val stageMetrics = new ch.cern.sparkmeasure.StageMetrics(spark)
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      stageMetrics.runAndMeasure(sql(q1).show)

      Output:

      21/02/04 15:08:55 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
      21/02/04 15:08:55 WARN StageMetrics: Stage metrics data refreshed into temp view PerfStageMetrics
      Scheduling mode = FIFO
      Spark Context default degree of parallelism = 4
      Aggregated Spark stage metrics:
      numStages => 2
      numTasks => 4
      elapsedTime => 3835 (4 s)
      stageDuration => 3827 (4 s)
      executorRunTime => 4757 (5 s)
      executorCpuTime => 3672 (4 s)
      executorDeserializeTime => 772 (0.8 s)
      executorDeserializeCpuTime => 510 (0.5 s)
      resultSerializationTime => 0 (0 ms)
      jvmGCTime => 239 (0.2 s)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      resultSize => 5441 (5.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149260233 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      Task Metrics:

      val taskMetrics = new ch.cern.sparkmeasure.TaskMetrics(spark)
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      taskMetrics.runAndMeasure(spark.sql(q1).show)

      Output:

      21/02/04 16:52:59 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
      21/02/04 16:52:59 WARN TaskMetrics: Stage metrics data refreshed into temp view PerfTaskMetrics

      Scheduling mode = FIFO
      Spark Contex default degree of parallelism = 4
      Aggregated Spark task metrics:
      numtasks => 4
      elapsedTime => 3896 (4 s)
      duration => 5268 (5 s)
      schedulerDelayTime => 94 (94 ms)
      executorRunTime => 4439 (4 s)
      executorCpuTime => 3561 (4 s)
      executorDeserializeTime => 734 (0.7 s)
      executorDeserializeCpuTime => 460 (0.5 s)
      resultSerializationTime => 1 (1 ms)
      jvmGCTime => 237 (0.2 s)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      gettingResultTime => 0 (0 ms)
      resultSize => 2183 (2.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149260233 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      2.  Interactive Mode using Spark-shell for multiple jobs/queries

      Take Stage Metrics for example:

      val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
      stageMetrics.begin()

      //Run multiple jobs/queries
      val q1 = "SELECT * FROM readdf WHERE Index=20000"
      val q2 = "SELECT * FROM readdf where Index=9999999999"
      spark.sql(q1).show()
      spark.sql(q2).show()

      stageMetrics.end()
      stageMetrics.printReport()

      Output:

      21/02/04 17:00:59 WARN StageMetrics: Stage metrics data refreshed into temp view PerfStageMetrics

      Scheduling mode = FIFO
      Spark Context default degree of parallelism = 4
      Aggregated Spark stage metrics:
      numStages => 4
      numTasks => 8
      elapsedTime => 3242 (3 s)
      stageDuration => 1094 (1 s)
      executorRunTime => 1779 (2 s)
      executorCpuTime => 942 (0.9 s)
      executorDeserializeTime => 96 (96 ms)
      executorDeserializeCpuTime => 37 (37 ms)
      resultSerializationTime => 1 (1 ms)
      jvmGCTime => 42 (42 ms)
      shuffleFetchWaitTime => 0 (0 ms)
      shuffleWriteTime => 0 (0 ms)
      resultSize => 5441 (5.0 KB)
      diskBytesSpilled => 0 (0 Bytes)
      memoryBytesSpilled => 0 (0 Bytes)
      peakExecutionMemory => 0
      recordsRead => 6240991
      bytesRead => 149305675 (142.0 MB)
      recordsWritten => 0
      bytesWritten => 0 (0 Bytes)
      shuffleRecordsRead => 0
      shuffleTotalBlocksFetched => 0
      shuffleLocalBlocksFetched => 0
      shuffleRemoteBlocksFetched => 0
      shuffleTotalBytesRead => 0 (0 Bytes)
      shuffleLocalBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesRead => 0 (0 Bytes)
      shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
      shuffleBytesWritten => 0 (0 Bytes)
      shuffleRecordsWritten => 0

      Further more, below command can print additional accumulables metrics (including SQL metrics):

      scala> stageMetrics.printAccumulables()
      21/02/04 17:01:26 WARN StageMetrics: Accumulables metrics data refreshed into temp view AccumulablesStageMetrics

      Aggregated Spark accumulables of type internal.metric. Sum of values grouped by metric name
      Name => sum(value) [group by name]

      executorCpuTime => 943 (0.9 s)
      executorDeserializeCpuTime => 39 (39 ms)
      executorDeserializeTime => 96 (96 ms)
      executorRunTime => 1779 (2 s)
      input.bytesRead => 149305675 (142.0 MB)
      input.recordsRead => 6240991
      jvmGCTime => 42 (42 ms)
      resultSerializationTime => 1 (1 ms)
      resultSize => 12780 (12.0 KB)

      SQL Metrics and other non-internal metrics. Values grouped per accumulatorId and metric name.
      Accid, Name => max(value) [group by accId, name]

      146, duration total => 1422 (1 s)
      147, number of output rows => 18
      148, number of output rows => 6240991
      151, scan time total => 1359 (1 s)
      202, duration total => 200 (0.2 s)
      207, scan time total => 198 (0.2 s)

      3.  Flight Recorder Mode

      Please refer to this doc for Flight Recorder Mode.

This mode does not touch your code/program; you only need to add the jar file and a few configurations when submitting the job.

      Take Stage Metrics for example:

      spark-submit --conf spark.driver.extraClassPath=./spark-measure_2.11-0.17.jar  \
      --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
      --conf spark.sparkmeasure.outputFormat=json \
      --conf spark.sparkmeasure.outputFilename="/tmp/stageMetrics_flightRecorder" \
      --conf spark.sparkmeasure.printToStdout=false \
      --class "PredicatePushdownTest" \
      --master yarn \
      ~/sbt/SparkScalaExample/target/scala-2.11/sparkscalaexample_2.11-1.0.jar

      In the output, it will show:

      WARN FlightRecorderStageMetrics: Writing Stage Metrics data serialized as json to /tmp/stageMetrics_flightRecorder

      The json output file looks as:

      $ cat /tmp/stageMetrics_flightRecorder
      [ {
      "jobId" : 0,
      "jobGroup" : null,
      "stageId" : 0,
      "name" : "load at PredicatePushdownTest.scala:16",
      "submissionTime" : 1612488772250,
      "completionTime" : 1612488773352,
      "stageDuration" : 1102,
      "numTasks" : 1,
      "executorRunTime" : 352,
      "executorCpuTime" : 141,
      "executorDeserializeTime" : 589,
      "executorDeserializeCpuTime" : 397,
      "resultSerializationTime" : 3,
      "jvmGCTime" : 95,
      "resultSize" : 1969,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 0,
      "bytesRead" : 0,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      }, {
      "jobId" : 1,
      "jobGroup" : null,
      "stageId" : 1,
      "name" : "collect at PredicatePushdownTest.scala:25",
      "submissionTime" : 1612488774600,
      "completionTime" : 1612488776522,
      "stageDuration" : 1922,
      "numTasks" : 4,
      "executorRunTime" : 4962,
      "executorCpuTime" : 4446,
      "executorDeserializeTime" : 1679,
      "executorDeserializeCpuTime" : 1215,
      "resultSerializationTime" : 2,
      "jvmGCTime" : 309,
      "resultSize" : 7545,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 6240991,
      "bytesRead" : 149260233,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      }, {
      "jobId" : 2,
      "jobGroup" : null,
      "stageId" : 2,
      "name" : "collect at PredicatePushdownTest.scala:30",
      "submissionTime" : 1612488776656,
      "completionTime" : 1612488776833,
      "stageDuration" : 177,
      "numTasks" : 4,
      "executorRunTime" : 427,
      "executorCpuTime" : 261,
      "executorDeserializeTime" : 89,
      "executorDeserializeCpuTime" : 27,
      "resultSerializationTime" : 0,
      "jvmGCTime" : 0,
      "resultSize" : 5884,
      "diskBytesSpilled" : 0,
      "memoryBytesSpilled" : 0,
      "peakExecutionMemory" : 0,
      "recordsRead" : 0,
      "bytesRead" : 45442,
      "recordsWritten" : 0,
      "bytesWritten" : 0,
      "shuffleFetchWaitTime" : 0,
      "shuffleTotalBytesRead" : 0,
      "shuffleTotalBlocksFetched" : 0,
      "shuffleLocalBlocksFetched" : 0,
      "shuffleRemoteBlocksFetched" : 0,
      "shuffleLocalBytesRead" : 0,
      "shuffleRemoteBytesRead" : 0,
      "shuffleRemoteBytesReadToDisk" : 0,
      "shuffleRecordsRead" : 0,
      "shuffleWriteTime" : 0,
      "shuffleBytesWritten" : 0,
      "shuffleRecordsWritten" : 0
      } ]
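
Since the flight-recorder output is a single JSON array (not JSON Lines), it can be loaded back into Spark for further analysis using the multiLine option; a minimal sketch, assuming the file sits on the local filesystem of the node running spark-shell:

val fr = spark.read.option("multiLine", "true").json("file:///tmp/stageMetrics_flightRecorder")
fr.select("stageId", "name", "stageDuration", "recordsRead", "bytesRead").show(false)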

      References:

      On Measuring Apache Spark Workload Metrics for Performance Troubleshooting

      Example analysis of Spark metrics collected with sparkMeasure


      How to generate TPC-DS data and run TPC-DS performance benchmark for Spark


      Goal:

      This article explains how to use databricks/spark-sql-perf and databricks/tpcds-kit to generate TPC-DS data for Spark and run TPC-DS performance benchmark.

      Env:

      Spark 2.4.4 with Scala 2.11.12

      MapR 6.1

      Solution:

      1. Download and build the databricks/tpcds-kit from github.

      sudo yum install gcc make flex bison byacc git
      cd /tmp/
      git clone https://github.com/databricks/tpcds-kit.git
      cd tpcds-kit/tools
      make OS=LINUX

• Note: This should be installed on all cluster nodes at the same location.

Here we downloaded it to "/tmp/tpcds-kit" on ALL cluster nodes.

      2. Download and build the databricks/spark-sql-perf from github.

      git clone https://github.com/databricks/spark-sql-perf
      cd spark-sql-perf

      • Note: Make sure your Spark version and Scala version match this version of spark-sql-perf.

      Here I am using Spark 2.4.4 with Scala 2.11.12. So I have to checkout an older branch:

      git checkout remotes/origin/newversion

      Now the build.sbt contains below entries which should be compatible with my env:

      scalaVersion := "2.11.8"
      sparkVersion := "2.3.0"

      Then build:

      sbt +package

• Note: If you check out a much older branch of spark-sql-perf, say "remotes/origin/branch-0.4" which is based on Spark 2.0.1, then you may hit the below error when running the TPC-DS benchmark in step 6. This is because starting from Spark 2.2 there is no such method getExecutorStorageStatus in class org.apache.spark.SparkContext.
      java.lang.NoSuchMethodError: org.apache.spark.SparkContext.getExecutorStorageStatus()[Lorg/apache/spark/storage/StorageStatus;
      at com.databricks.spark.sql.perf.Benchmarkable$class.afterBenchmark(Benchmarkable.scala:63)

      3. create gendata.scala

      import com.databricks.spark.sql.perf.tpcds.TPCDSTables

      // Note: Declare "sqlContext" for Spark 2.x version
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)

      // Set:
      // Note: Here my env is using MapRFS, so I changed it to "hdfs:///tpcds".
      // Note: If you are using HDFS, the format should be like "hdfs://namenode:9000/tpcds"
      val rootDir = "hdfs:///tpcds" // root directory of location to create data in.

      val databaseName = "tpcds" // name of database to create.
      val scaleFactor = "10" // scaleFactor defines the size of the dataset to generate (in GB).
      val format = "parquet" // valid spark format like parquet "parquet".
      // Run:
      val tables = new TPCDSTables(sqlContext,
      dsdgenDir = "/tmp/tpcds-kit/tools", // location of dsdgen
      scaleFactor = scaleFactor,
      useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
      useStringForDate = false) // true to replace DateType with StringType


      tables.genData(
      location = rootDir,
      format = format,
      overwrite = true, // overwrite the data that is already there
      partitionTables = true, // create the partitioned fact tables
      clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
      filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
      tableFilter = "", // "" means generate all tables
      numPartitions = 20) // how many dsdgen partitions to run - number of input tasks.

      // Create the specified database
      sql(s"create database $databaseName")
      // Create metastore tables in a specified database for your data.
      // Once tables are created, the current database will be switched to the specified database.
      tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
      // Or, if you want to create temporary tables
      // tables.createTemporaryTables(location, format)

      // For CBO only, gather statistics on all columns:
      tables.analyzeTables(databaseName, analyzeColumns = true)  

      4. Run the gendata.scala using spark-shell

      spark-shell --jars ~/hao/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar \
      --master yarn \
      --deploy-mode client \
      --executor-memory 4G \
      --num-executors 4 \
      --executor-cores 2 \
      -i ~/hao/gendata.scala
      • Note: Tune --executor-memory , --num-executors and --executor-cores to make sure no OOM happens.
• Note: If we just need to generate 10GB of data, keep "numPartitions" in the above gendata.scala small (say, 20) to reduce the overhead of too many tasks.

      5. Confirm the data files and Hive tables are created.

      It should create 24 tables.

      Check Data:

      # hadoop fs -du -s -h /tpcds/*
      11.8 K /tpcds/call_center
      695.0 K /tpcds/catalog_page
      133.8 M /tpcds/catalog_returns
      1.1 G /tpcds/catalog_sales
      25.4 M /tpcds/customer
      4.7 M /tpcds/customer_address
      7.5 M /tpcds/customer_demographics
      1.8 M /tpcds/date_dim
      30.1 K /tpcds/household_demographics
      1.1 K /tpcds/income_band
      467.9 M /tpcds/inventory
      9.4 M /tpcds/item
      30.7 K /tpcds/promotion
      1.8 K /tpcds/reason
      2.3 K /tpcds/ship_mode
      18.3 K /tpcds/store
      190.4 M /tpcds/store_returns
      1.4 G /tpcds/store_sales
      1.1 M /tpcds/time_dim
      4.3 K /tpcds/warehouse
      7.7 K /tpcds/web_page
      69.7 M /tpcds/web_returns
      516.3 M /tpcds/web_sales
      13.1 K /tpcds/web_site

      Check Hive tables in hive CLI(or spark-sql):

      hive> use tpcds;
      OK
      Time taken: 0.011 seconds

      hive> show tables;
      OK
      call_center
      catalog_page
      catalog_returns
      catalog_sales
      customer
      customer_address
      customer_demographics
      date_dim
      household_demographics
      income_band
      inventory
      item
      promotion
      reason
      ship_mode
      store
      store_returns
      store_sales
      time_dim
      warehouse
      web_page
      web_returns
      web_sales
      web_site
      Time taken: 0.012 seconds, Fetched: 24 row(s)

      6. Run TPC-DS benchmark

      After the tables are created, we can run the 99 TPC-DS queries which are located under folder "./src/main/resources/tpcds_2_4/".

      Create runtpcds.scala:

      import com.databricks.spark.sql.perf.tpcds.TPCDS

      // Note: Declare "sqlContext" for Spark 2.x version
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)

      val tpcds = new TPCDS (sqlContext = sqlContext)
      // Set:
      val databaseName = "tpcds" // name of database with TPCDS data.
      sql(s"use $databaseName")
      val resultLocation = "/tmp/tpcds_results" // place to write results
      val iterations = 1 // how many iterations of queries to run.
      val queries = tpcds.tpcds2_4Queries // queries to run.
      val timeout = 24*60*60 // timeout, in seconds.
      // Run:
      val experiment = tpcds.runExperiment(
      queries,
      iterations = iterations,
      resultLocation = resultLocation,
      forkThread = true)
      experiment.waitForFinish(timeout)

      Run runtpcds.scala using spark-shell:

      spark-shell --jars ~/hao/spark-sql-perf/target/scala-2.11/spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar \
      --driver-class-path /home/mapr/.ivy2/cache/com.typesafe.scala-logging/scala-logging-slf4j_2.10/jars/scala-logging-slf4j_2.10-2.1.2.jar:/home/mapr/.ivy2/cache/com.typesafe.scala-logging/scala-logging-api_2.11/jars/scala-logging-api_2.11-2.1.2.jar \
      --master yarn \
      --deploy-mode client \
      --executor-memory 2G \
      --driver-memory 4G \
      --num-executors 4 \
      -i ~/hao/runtpcds.scala
• Note: We need to include the scala-logging-slf4j and scala-logging-api jars, otherwise java.lang.ClassNotFoundException will show up for the related classes. The good thing is that those jars can be found in the ivy cache directories after building spark-sql-perf with "sbt +package".
• Note: "sql(s"use $databaseName")" should be put before declaring "val queries = tpcds.tpcds2_4Queries". Otherwise you will not see the explain plan for each query because it cannot find the tables in the default database. So the doc on GitHub should be corrected.
• Note: We need to increase --driver-memory to be large enough because broadcast joins need a lot of driver memory. Otherwise you may hit the below error when running q10.sql (a one-line workaround sketch follows after the error message):
      failure in runBenchmark: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. 
      As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
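
For reference, a minimal sketch of the first workaround (disabling broadcast joins for the current session) in spark-shell:

// Disable broadcast joins so the driver does not need to build/broadcast large tables
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")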

      7. Run customized query benchmark

If we do not want to run the whole TPC-DS benchmark, and only want to test the benchmark result for certain customized queries (e.g. a subset of the TPC-DS queries, or some other ad-hoc queries), we can do so. But before that, we need to understand the source code of this project first.

      Code Analysis

      Previous example uses "tpcds.tpcds2_4Queries" which is Seq[com.databricks.spark.sql.perf.Query].
      "tpcds" is the object for class "TPCDS".
      Because class "TPCDS" extends trait "Tpcds_2_4_Queries", so "tpcds2_4Queries" is actually the member of trait "Tpcds_2_4_Queries":
  val tpcds2_4Queries = queryNames.map { queryName =>
    val queryContent: String = IOUtils.toString(
      getClass().getClassLoader().getResourceAsStream(s"tpcds_2_4/$queryName.sql"))
    Query(queryName + "-v2.4", queryContent, description = "TPCDS 2.4 Query",
      executionMode = CollectResults)
  }
Initially, I mistook this "Query" for "com.databricks.spark.sql.perf.Query", so I failed to define the Query object correctly many times.
Later on I found that trait "Tpcds_2_4_Queries" actually extends the abstract class "Benchmark", which contains the factory object for benchmark queries as below:
  /** Factory object for benchmark queries. */
  case object Query {
    def apply(
        name: String,
        sqlText: String,
        description: String,
        executionMode: ExecutionMode = ExecutionMode.ForeachResults): Query = {
      new Query(name, sqlContext.sql(sqlText), description, Some(sqlText), executionMode)
    }

    def apply(
        name: String,
        dataFrameBuilder: => DataFrame,
        description: String): Query = {
      new Query(name, dataFrameBuilder, description, None, ExecutionMode.CollectResults)
    }
  }
After understanding the logic, if we want to customize the query, we just need to create a class that extends the abstract class "Benchmark".

      Example of customized query

import com.databricks.spark.sql.perf.{Benchmark, ExecutionMode}

// Customized a query
class customized_query extends Benchmark {
  import ExecutionMode._
  private val sqlText = "select * from customer limit 10"
  val q1 = Seq(Query(name = "my customized query", sqlText = sqlText, description = "check some customer info", executionMode = CollectResults))
}
val queries = new customized_query().q1
Everything else is the same as the previous example in step 6.

      8. View Benchmark results

      a. If experiment is still running, use "experiment.getCurrentResults".

      experiment.getCurrentResults.createOrReplaceTempView("result") 
      spark.sql("select substring(name,1,100) as Name, bround((parsingTime+analysisTime+optimizationTime+planningTime+executionTime)/1000.0,1) as Runtime_sec from result").show()

      Sample Output:

      +---------+-----------+
      | Name|Runtime_sec|
      +---------+-----------+
      | q1-v2.4| 21.1|
      | q2-v2.4| 13.2|
      | q3-v2.4| 6.0|
      | q4-v2.4| 135.1|
      | q5-v2.4| 38.9|
      | q6-v2.4| 43.4|
      | q7-v2.4| 10.6|
      | q8-v2.4| 9.9|
      | q9-v2.4| 51.7|
      | q10-v2.4| 25.8|
      | q11-v2.4| 92.3|
      | q12-v2.4| 6.8|
      | q13-v2.4| 12.5|
      |q14a-v2.4| 130.7|
      |q14b-v2.4| 91.3|
      | q15-v2.4| 8.8|
      | q16-v2.4| 30.8|
      | q17-v2.4| 46.6|
      | q18-v2.4| 14.2|
      | q19-v2.4| 7.9|
      +---------+-----------+
      only showing top 20 rows

      b. If experiment has ended, read the result json file.

      • Note: since the json file contains nested columns, we need to flatten the data using "explode" function.
      import org.apache.spark.sql.functions._
      val result = spark.read.json(resultLocation).filter("timestamp = 1612560709933").select(explode($"results").as("r"))
      result.createOrReplaceTempView("result")
      spark.sql("select substring(r.name,1,100) as Name, bround((r.parsingTime+r.analysisTime+r.optimizationTime+r.planningTime+r.executionTime)/1000.0,1) as Runtime_sec from result").show()

      Sample Output:

      +---------+-----------+
      | Name|Runtime_sec|
      +---------+-----------+
      | q1-v2.4| 21.1|
      | q2-v2.4| 13.2|
      | q3-v2.4| 6.0|
      | q4-v2.4| 135.1|
      | q5-v2.4| 38.9|
      | q6-v2.4| 43.4|
      | q7-v2.4| 10.6|
      | q8-v2.4| 9.9|
      | q9-v2.4| 51.7|
      | q10-v2.4| 25.8|
      | q11-v2.4| 92.3|
      | q12-v2.4| 6.8|
      | q13-v2.4| 12.5|
      |q14a-v2.4| 130.7|
      |q14b-v2.4| 91.3|
      | q15-v2.4| 8.8|
      | q16-v2.4| 30.8|
      | q17-v2.4| 46.6|
      | q18-v2.4| 14.2|
      | q19-v2.4| 7.9|
      +---------+-----------+
      only showing top 20 rows


      Spark Tuning -- Understand Cost Based Optimizer in Spark


      Goal:

      This article explains Spark CBO(Cost Based Optimizer) with examples and shares how to check the table statistics.

      Env:

      Spark 2.4.4

      MapR 6.1

      MySQL as backend database for Hive Metastore

      Concept:

Like in any traditional RDBMS, the goal of CBO is to determine the best query execution plan based on table statistics.

CBO was introduced in Spark 2.2. Before that, only the RBO (Rule Based Optimizer) was used.

Before using CBO, we need to collect table/column-level statistics (including histograms) using the ANALYZE TABLE command.

• Note: As of Spark 2.4.4, CBO is disabled by default and is controlled by the parameter spark.sql.cbo.enabled.
• Note: As of Spark 2.4.4, histogram statistics collection is disabled by default and is controlled by the parameter spark.sql.statistics.histogram.enabled.
• Note: Spark uses an Equal-Height Histogram instead of an Equal-Width Histogram.
• Note: As of Spark 2.4.4, the default number of histogram buckets is 254, which is controlled by the parameter spark.sql.statistics.histogram.numBins. (A spark-shell sketch of these settings follows these notes.)
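
A minimal spark-shell sketch of the settings mentioned in the notes above, using the standard SparkSession conf API (254 is already the default and is shown only for completeness):

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
// 254 is already the default number of histogram buckets
spark.conf.set("spark.sql.statistics.histogram.numBins", "254")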

What is included in column-level statistics?

      • For Numeric/Date/Timestamp type: Distinct Count, Max, Min, Null Count, Average Length, Max Length.
      • For String/Binary type: Distinct Count, Null Count, Average Length, Max Length.

      CBO uses logical optimization rules to optimize the logical plan. 

So if we want to examine the statistics inside a query explain plan, we can find them in the "Optimized Logical Plan" section.

      Solution:

Here we will use some simple query examples based on a test table named "customer" (generated by the TPC-DS tool shared in this post) to demonstrate CBO and statistics in Spark.

All of the SQL statements below are executed in spark-sql unless otherwise noted.

      1. Collect Table/Column statistics

      1.1 Table level statistics including total number of rows and data size:

      ANALYZE TABLE customer COMPUTE STATISTICS;

      1.2 Table + Column statistics: 

      ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk;

      1.3 Table + Column statistics with histogram:

      set spark.sql.statistics.histogram.enabled=true;
      ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk;

      2. View Table/Column statistics

      2.1 Table level statistics:

      DESCRIBE EXTENDED customer;

      Output:

      Statistics	26670841 bytes, 500000 rows

      2.2 Column level statistics including histogram:

      DESCRIBE EXTENDED customer c_customer_sk;

      Output:

      col_name	c_customer_sk
      data_type int
      comment NULL
      min 1
      max 500000
      num_nulls 0
      distinct_count 500000
      avg_col_len 4
      max_col_len 4
      histogram height: 1968.5039370078741, num_of_bins: 254
      bin_0 lower_bound: 1.0, upper_bound: 1954.0, distinct_count: 1982
      bin_1 lower_bound: 1954.0, upper_bound: 3898.0, distinct_count: 1893
      ...
      bin_253 lower_bound: 497982.0, upper_bound: 500000.0, distinct_count: 2076
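
The "height" value printed above is simply the table row count divided by the number of buckets, since Spark builds an equal-height histogram; a quick check in spark-shell using the numbers from this table:

// Equal-height histogram: each of the 254 buckets covers roughly the same number of rows.
val numRows = 500000.0
val numBins = 254
val height = numRows / numBins   // 1968.5039370078741, matching the "height" shown above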

2.3 Check statistics in the backend database for the Hive Metastore (e.g. MySQL):

      select tp.PARAM_KEY, tp.PARAM_VALUE
      from DBS d,TBLS t, TABLE_PARAMS tp
      where t.DB_ID = d.DB_ID
      and tp.TBL_ID=t.TBL_ID
      and d.NAME='tpcds' and t.TBL_NAME='customer'
      and (
      tp.PARAM_KEY in (
      'spark.sql.statistics.numRows',
      'spark.sql.statistics.totalSize'
      )
      or
      tp.PARAM_KEY like 'spark.sql.statistics.colStats.c_customer_sk.%'
      )
      and tp.PARAM_KEY not like 'spark.sql.statistics.colStats.%.histogram'
      ;

      Output:

      +-----------------------------------------------------------+-------------+
      | PARAM_KEY | PARAM_VALUE |
      +-----------------------------------------------------------+-------------+
      | spark.sql.statistics.colStats.c_customer_sk.avgLen | 4 |
      | spark.sql.statistics.colStats.c_customer_sk.distinctCount | 500000 |
      | spark.sql.statistics.colStats.c_customer_sk.max | 500000 |
      | spark.sql.statistics.colStats.c_customer_sk.maxLen | 4 |
      | spark.sql.statistics.colStats.c_customer_sk.min | 1 |
      | spark.sql.statistics.colStats.c_customer_sk.nullCount | 0 |
      | spark.sql.statistics.colStats.c_customer_sk.version | 1 |
      | spark.sql.statistics.numRows | 500000 |
      | spark.sql.statistics.totalSize | 26670841 |
      +-----------------------------------------------------------+-------------+
      9 rows in set (0.00 sec)

      2.4  View statistics in spark-shell to understand which classes are used to store statistics

      val db = "tpcds"
      val tableName = "customer"
      val colName = "c_customer_sk"

      val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
      val stats = metadata.stats.get
      val colStats = stats.colStats
      val c_customer_sk_stats = colStats(colName)

      val props = c_customer_sk_stats.toMap(colName)
      println(props)

      Output:

      scala> println(props)
      Map(c_customer_sk.avgLen -> 4, c_customer_sk.nullCount -> 0, c_customer_sk.distinctCount -> 500000, c_customer_sk.histogram -> XXXYYYZZZ, c_customer_sk.min -> 1, c_customer_sk.max -> 500000, c_customer_sk.version -> 1, c_customer_sk.maxLen -> 4)

      Basically above "c_customer_sk_stats" is of class org.apache.spark.sql.catalyst.catalog.CatalogColumnStat which is defined inside ./sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala

      3. Check cardinality based on statistics

From the above statistics for column "c_customer_sk" in table "customer", we know that this column is unique and has 500000 distinct values ranging from 1 to 500000.

In the RBO world, no matter whether the filter is "where c_customer_sk < 500" or "where c_customer_sk < 500000", the Filter operator always shows "sizeInBytes=119.2 MB", which is the total table size, and no rowCount is shown.

      spark-sql> set spark.sql.cbo.enabled=false;
      spark.sql.cbo.enabled false
      Time taken: 0.013 seconds, Fetched 1 row(s)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500;
      == Optimized Logical Plan ==
      Project [c_customer_sk#724], Statistics(sizeInBytes=6.4 MB, hints=none)
      +- Filter (isnotnull(c_customer_sk#724) && (c_customer_sk#724 < 500)), Statistics(sizeInBytes=119.2 MB, hints=none)
      +- Relation[c_customer_sk#724,c_customer_id#725,c_current_cdemo_sk#726,c_current_hdemo_sk#727,c_current_addr_sk#728,c_first_shipto_date_sk#729,c_first_sales_date_sk#730,c_salutation#731,c_first_name#732,c_last_name#733,c_preferred_cust_flag#734,c_birth_day#735,c_birth_month#736,c_birth_year#737,c_birth_country#738,c_login#739,c_email_address#740,c_last_review_date#741] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500000;
      == Optimized Logical Plan ==
      Project [c_customer_sk#724], Statistics(sizeInBytes=6.4 MB, hints=none)
      +- Filter (isnotnull(c_customer_sk#724) && (c_customer_sk#724 < 500000)), Statistics(sizeInBytes=119.2 MB, hints=none)
      +- Relation[c_customer_sk#724,c_customer_id#725,c_current_cdemo_sk#726,c_current_hdemo_sk#727,c_current_addr_sk#728,c_first_shipto_date_sk#729,c_first_sales_date_sk#730,c_salutation#731,c_first_name#732,c_last_name#733,c_preferred_cust_flag#734,c_birth_day#735,c_birth_month#736,c_birth_year#737,c_birth_country#738,c_login#739,c_email_address#740,c_last_review_date#741] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

In the CBO world, the Filter operator shows an estimated data size and rowCount based on the column-level statistics: sizeInBytes=122.8 KB, rowCount=503 for "c_customer_sk < 500" versus sizeInBytes=119.2 MB, rowCount=5.00E+5 for "c_customer_sk < 500000".

      spark-sql> set spark.sql.cbo.enabled=true;
      spark.sql.cbo.enabled true
      Time taken: 0.02 seconds, Fetched 1 row(s)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500;
      == Optimized Logical Plan ==
      Project [c_customer_sk#1024], Statistics(sizeInBytes=5.9 KB, rowCount=503, hints=none)
      +- Filter (isnotnull(c_customer_sk#1024) && (c_customer_sk#1024 < 500)), Statistics(sizeInBytes=122.8 KB, rowCount=503, hints=none)
      +- Relation[c_customer_sk#1024,c_customer_id#1025,c_current_cdemo_sk#1026,c_current_hdemo_sk#1027,c_current_addr_sk#1028,c_first_shipto_date_sk#1029,c_first_sales_date_sk#1030,c_salutation#1031,c_first_name#1032,c_last_name#1033,c_preferred_cust_flag#1034,c_birth_day#1035,c_birth_month#1036,c_birth_year#1037,c_birth_country#1038,c_login#1039,c_email_address#1040,c_last_review_date#1041] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      spark-sql> explain cost select c_customer_sk from customer where c_customer_sk < 500000;
      == Optimized Logical Plan ==
      Project [c_customer_sk#1024], Statistics(sizeInBytes=5.7 MB, rowCount=5.00E+5, hints=none)
      +- Filter (isnotnull(c_customer_sk#1024) && (c_customer_sk#1024 < 500000)), Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Relation[c_customer_sk#1024,c_customer_id#1025,c_current_cdemo_sk#1026,c_current_hdemo_sk#1027,c_current_addr_sk#1028,c_first_shipto_date_sk#1029,c_first_sales_date_sk#1030,c_salutation#1031,c_first_name#1032,c_last_name#1033,c_preferred_cust_flag#1034,c_birth_day#1035,c_birth_month#1036,c_birth_year#1037,c_birth_country#1038,c_login#1039,c_email_address#1040,c_last_review_date#1041] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
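
To see roughly where an estimate like rowCount=503 comes from, here is a back-of-the-envelope calculation in spark-shell under a simple uniform-distribution assumption (the actual estimate also uses the equal-height histogram, so it does not match exactly):

// "c_customer_sk < 500" with min=1, max=500000 and 500000 rows
val colMin = 1.0
val colMax = 500000.0
val tableRows = 500000L
val literal = 500.0
val selectivity = (literal - colMin) / (colMax - colMin)   // ≈ 0.000998
val estimatedRows = (selectivity * tableRows).round        // ≈ 499, in the same ballpark as the 503 above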

Note: In spark-shell, we can fetch the same information as follows:

      spark.conf.set("spark.sql.cbo.enabled","true")
      sql(s"use tpcds")
      val stats = spark.sql("select c_customer_sk from customer where c_customer_sk < 500").queryExecution.stringWithStats
For more information on how the Spark CBO calculates cardinality for Filter, Join, and other operators, please refer to the slides and training session listed in the References below.

      4. Broadcast Join

As in any MPP query engine or SQL-on-Hadoop product (such as Hive, Impala, or Drill), broadcast join is not a new concept.

By default in Spark, a table/data size below 10MB (configured by spark.sql.autoBroadcastJoinThreshold) can be broadcast to all worker nodes.
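
Besides the automatic size-based decision, a broadcast can also be requested explicitly with the broadcast() hint from org.apache.spark.sql.functions. A minimal sketch, assuming the same tpcds.customer table used in this article:

import org.apache.spark.sql.functions.broadcast

// Explicitly mark the (small) filtered side as the broadcast side of the join,
// regardless of spark.sql.autoBroadcastJoinThreshold.
val big   = spark.table("tpcds.customer").filter("c_customer_sk < 500000")
val small = spark.table("tpcds.customer").filter("c_customer_sk < 500")
val joined = big.join(broadcast(small), Seq("c_first_name"))
joined.explain(true)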

Look at the example join query below:

      spark-sql> explain cost select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<500000 and b.c_customer_sk<500;
      == Optimized Logical Plan ==
      Join Inner, (c_first_name#18 = c_first_name#62), Statistics(sizeInBytes=22.5 MB, rowCount=4.80E+4, hints=none)
      :- Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18)), Statistics(sizeInBytes=115.1 MB, rowCount=4.83E+5, hints=none)
      : +- Relation[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Filter ((isnotnull(c_customer_sk#54) && (c_customer_sk#54 < 500)) && isnotnull(c_first_name#62)), Statistics(sizeInBytes=118.7 KB, rowCount=486, hints=none)
      +- Relation[c_customer_sk#54,c_customer_id#55,c_current_cdemo_sk#56,c_current_hdemo_sk#57,c_current_addr_sk#58,c_first_shipto_date_sk#59,c_first_sales_date_sk#60,c_salutation#61,c_first_name#62,c_last_name#63,c_preferred_cust_flag#64,c_birth_day#65,c_birth_month#66,c_birth_year#67,c_birth_country#68,c_login#69,c_email_address#70,c_last_review_date#71] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      == Physical Plan ==
      *(2) BroadcastHashJoin [c_first_name#18], [c_first_name#62], Inner, BuildRight
      :- *(2) Project [c_customer_sk#10, c_customer_id#11, c_current_cdemo_sk#12, c_current_hdemo_sk#13, c_current_addr_sk#14, c_first_shipto_date_sk#15, c_first_sales_date_sk#16, c_salutation#17, c_first_name#18, c_last_name#19, c_preferred_cust_flag#20, c_birth_day#21, c_birth_month#22, c_birth_year#23, c_birth_country#24, c_login#25, c_email_address#26, c_last_review_date#27]
      : +- *(2) Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18))
      : +- *(2) FileScan parquet tpcds.customer[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500000), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[8, string, true]))
      +- *(1) Project [c_customer_sk#54, c_customer_id#55, c_current_cdemo_sk#56, c_current_hdemo_sk#57, c_current_addr_sk#58, c_first_shipto_date_sk#59, c_first_sales_date_sk#60, c_salutation#61, c_first_name#62, c_last_name#63, c_preferred_cust_flag#64, c_birth_day#65, c_birth_month#66, c_birth_year#67, c_birth_country#68, c_login#69, c_email_address#70, c_last_review_date#71]
      +- *(1) Filter ((isnotnull(c_customer_sk#54) && (c_customer_sk#54 < 500)) && isnotnull(c_first_name#62))
      +- *(1) FileScan parquet tpcds.customer[c_customer_sk#54,c_customer_id#55,c_current_cdemo_sk#56,c_current_hdemo_sk#57,c_current_addr_sk#58,c_first_shipto_date_sk#59,c_first_sales_date_sk#60,c_salutation#61,c_first_name#62,c_last_name#63,c_preferred_cust_flag#64,c_birth_day#65,c_birth_month#66,c_birth_year#67,c_birth_country#68,c_login#69,c_email_address#70,c_last_review_date#71] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      Time taken: 0.098 seconds, Fetched 1 row(s)

      From the "Optimized Logical Plan", we know the estimated size of the smaller table is (sizeInBytes=118.7 KB, rowCount=486). So it can be broadcasted and that is why we see "BroadcastHashJoin" in "Physical Plan".

If we decrease spark.sql.autoBroadcastJoinThreshold to 118KB (118*1024=120832 bytes), the join is converted to a SortMergeJoin:

      spark-sql> set spark.sql.autoBroadcastJoinThreshold=120832;
      spark.sql.autoBroadcastJoinThreshold 120832
      Time taken: 0.016 seconds, Fetched 1 row(s)
      spark-sql> explain cost select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<500000 and b.c_customer_sk<500;
      == Optimized Logical Plan ==
      Join Inner, (c_first_name#18 = c_first_name#94), Statistics(sizeInBytes=22.5 MB, rowCount=4.80E+4, hints=none)
      :- Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18)), Statistics(sizeInBytes=115.1 MB, rowCount=4.83E+5, hints=none)
      : +- Relation[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)
      +- Filter ((isnotnull(c_customer_sk#86) && (c_customer_sk#86 < 500)) && isnotnull(c_first_name#94)), Statistics(sizeInBytes=118.7 KB, rowCount=486, hints=none)
      +- Relation[c_customer_sk#86,c_customer_id#87,c_current_cdemo_sk#88,c_current_hdemo_sk#89,c_current_addr_sk#90,c_first_shipto_date_sk#91,c_first_sales_date_sk#92,c_salutation#93,c_first_name#94,c_last_name#95,c_preferred_cust_flag#96,c_birth_day#97,c_birth_month#98,c_birth_year#99,c_birth_country#100,c_login#101,c_email_address#102,c_last_review_date#103] parquet, Statistics(sizeInBytes=119.2 MB, rowCount=5.00E+5, hints=none)

      == Physical Plan ==
      *(5) SortMergeJoin [c_first_name#18], [c_first_name#94], Inner
      :- *(2) Sort [c_first_name#18 ASC NULLS FIRST], false, 0
      : +- Exchange hashpartitioning(c_first_name#18, 200)
      : +- *(1) Project [c_customer_sk#10, c_customer_id#11, c_current_cdemo_sk#12, c_current_hdemo_sk#13, c_current_addr_sk#14, c_first_shipto_date_sk#15, c_first_sales_date_sk#16, c_salutation#17, c_first_name#18, c_last_name#19, c_preferred_cust_flag#20, c_birth_day#21, c_birth_month#22, c_birth_year#23, c_birth_country#24, c_login#25, c_email_address#26, c_last_review_date#27]
      : +- *(1) Filter ((isnotnull(c_customer_sk#10) && (c_customer_sk#10 < 500000)) && isnotnull(c_first_name#18))
      : +- *(1) FileScan parquet tpcds.customer[c_customer_sk#10,c_customer_id#11,c_current_cdemo_sk#12,c_current_hdemo_sk#13,c_current_addr_sk#14,c_first_shipto_date_sk#15,c_first_sales_date_sk#16,c_salutation#17,c_first_name#18,c_last_name#19,c_preferred_cust_flag#20,c_birth_day#21,c_birth_month#22,c_birth_year#23,c_birth_country#24,c_login#25,c_email_address#26,c_last_review_date#27] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500000), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      +- *(4) Sort [c_first_name#94 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(c_first_name#94, 200)
      +- *(3) Project [c_customer_sk#86, c_customer_id#87, c_current_cdemo_sk#88, c_current_hdemo_sk#89, c_current_addr_sk#90, c_first_shipto_date_sk#91, c_first_sales_date_sk#92, c_salutation#93, c_first_name#94, c_last_name#95, c_preferred_cust_flag#96, c_birth_day#97, c_birth_month#98, c_birth_year#99, c_birth_country#100, c_login#101, c_email_address#102, c_last_review_date#103]
      +- *(3) Filter ((isnotnull(c_customer_sk#86) && (c_customer_sk#86 < 500)) && isnotnull(c_first_name#94))
      +- *(3) FileScan parquet tpcds.customer[c_customer_sk#86,c_customer_id#87,c_current_cdemo_sk#88,c_current_hdemo_sk#89,c_current_addr_sk#90,c_first_shipto_date_sk#91,c_first_sales_date_sk#92,c_salutation#93,c_first_name#94,c_last_name#95,c_preferred_cust_flag#96,c_birth_day#97,c_birth_month#98,c_birth_year#99,c_birth_country#100,c_login#101,c_email_address#102,c_last_review_date#103] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs:///tpcds/customer], PartitionFilters: [], PushedFilters: [IsNotNull(c_customer_sk), LessThan(c_customer_sk,500), IsNotNull(c_first_name)], ReadSchema: struct<c_customer_sk:int,c_customer_id:string,c_current_cdemo_sk:int,c_current_hdemo_sk:int,c_cur...
      Time taken: 0.113 seconds, Fetched 1 row(s)
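
The arithmetic behind that flip is straightforward: the estimated build side (118.7 KB) is now larger than the threshold we just set, so it no longer qualifies for broadcast.

// Estimated small side vs. the new threshold
val estimatedSmallSideBytes = (118.7 * 1024).toLong   // ≈ 121548 bytes
val thresholdBytes = 118 * 1024                       // 120832 bytes, the value set above
estimatedSmallSideBytes <= thresholdBytes             // false, so Spark falls back to SortMergeJoin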

      If we run this query "select * from customer a, customer b where a.c_first_name=b.c_first_name and a.c_customer_sk<5 and b.c_customer_sk<500;" and look at the web UI, we can also find the estimated cardinality:

All in all, CBO is a huge topic in any database/query engine. I will discuss it more in future posts.

      References:

      Cost Based Optimizer in Apache Spark 2.2 

      Training: Cardinality Estimation through Histogram in Apache Spark 2.

       

      Spark Code -- Which Spark SQL data type isOrderable?


      Goal:

This article does some code analysis on which Spark SQL data types are order-able (i.e., sort-able).

We will look into the source code logic of the method "isOrderable" in the object org.apache.spark.sql.catalyst.expressions.RowOrdering.

The reason we are interested in "isOrderable" is that it is used by SparkStrategies.scala to choose join types, which we will dig into more deeply in another post.

      Env:

      Spark 2.4 source code

      Solution:

      The source code for method "isOrderable" is:

  /**
   * Returns true iff the data type can be ordered (i.e. can be sorted).
   */
  def isOrderable(dataType: DataType): Boolean = dataType match {
    case NullType => true
    case dt: AtomicType => true
    case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
    case array: ArrayType => isOrderable(array.elementType)
    case udt: UserDefinedType[_] => isOrderable(udt.sqlType)
    case _ => false
  }

So basically NullType and any AtomicType are order-able. For complex types, it depends on their element/field types.

Let's now take a look at all of the Spark SQL data types in org.apache.spark.sql.types.

      1. NullType

      import org.apache.spark.sql.catalyst.expressions.RowOrdering
      import org.apache.spark.sql.types._

      scala> RowOrdering.isOrderable(NullType)
      res0: Boolean = true

      2. class which extends AtomicType

      scala> RowOrdering.isOrderable(BinaryType)
      res29: Boolean = true

      scala> RowOrdering.isOrderable(BooleanType)
      res6: Boolean = true

      scala> RowOrdering.isOrderable(DateType)
      res31: Boolean = true

scala> RowOrdering.isOrderable(StringType)
res1: Boolean = true

      scala> RowOrdering.isOrderable(TimestampType)
      res19: Boolean = true

There is another abstract class, HiveStringType, which also extends AtomicType.

But as per the comment below, any instance of it should be replaced by a StringType before analysis, and it was even removed in Spark 3.1.

/**
 * A hive string type for compatibility. These datatypes should only used for parsing,
 * and should NOT be used anywhere else. Any instance of these data types should be
 * replaced by a [[StringType]] before analysis.
 */
sealed abstract class HiveStringType extends AtomicType

      3. class which extends IntegralType

      Basically "abstract class IntegralType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala

      So any class which extends IntegralType should also be order-able:

      scala> RowOrdering.isOrderable(ByteType)
      res11: Boolean = true

      scala> RowOrdering.isOrderable(IntegerType)
      res3: Boolean = true

      scala> RowOrdering.isOrderable(LongType)
      res13: Boolean = true

      scala> RowOrdering.isOrderable(ShortType)
      res14: Boolean = true

      4. class which extends FractionalType

      Basically "abstract class FractionalType extends NumericType" and "abstract class NumericType extends AtomicType" inside AbstractDataType.scala.

      So any class which extends FractionalType should also be order-able:

      scala> RowOrdering.isOrderable(DecimalType(10,5))
      res17: Boolean = true

      scala> RowOrdering.isOrderable(DoubleType)
      res2: Boolean = true

      scala> RowOrdering.isOrderable(FloatType)
      res13: Boolean = true

      5. Spark SQL data types which are not order-able

      scala> RowOrdering.isOrderable(CalendarIntervalType)
      res26: Boolean = false

      scala> RowOrdering.isOrderable(DataTypes.createMapType(StringType,StringType))
      res9: Boolean = false

      scala> RowOrdering.isOrderable(ObjectType(classOf[java.lang.Integer]))
      res23: Boolean = false
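
A quick way to see this in practice: sorting by a non-order-able column type is rejected at analysis time. A small spark-shell sketch (the DataFrame here is made up purely for illustration):

import scala.util.Try

val df = Seq((1, Map("a" -> 1)), (2, Map("b" -> 2))).toDF("id", "m")
// MapType is not order-able, so sorting by the map column fails with an AnalysisException:
println(Try(df.orderBy("m").collect()).isFailure)   // true
// Sorting by the Int column works fine:
df.orderBy("id").show()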

      6. Complex Spark SQL data types

If the ArrayType's element type is order-able, then the ArrayType is order-able; otherwise it is not.

      scala> RowOrdering.isOrderable(ArrayType(IntegerType))
      res22: Boolean = true

      scala> RowOrdering.isOrderable(ArrayType(CalendarIntervalType))
      res27: Boolean = false

If all of the field types of a StructType are order-able, then the StructType is order-able; otherwise it is not.

      scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", StringType))
      res6: Boolean = true

      scala> RowOrdering.isOrderable(new StructType().add("a", IntegerType).add("b", CalendarIntervalType))
      res7: Boolean = false

      7. UserDefinedType

As per the comment below, UserDefinedType was made private in Spark 2.0 and should not be used directly.

       * Note: This was previously a developer API in Spark 1.x. We are making this private in Spark 2.0
      * because we will very likely create a new version of this that works better with Datasets.

