Spark example job

Thu Apr 01 2021 17:59:16 GMT+0000 (Coordinated Universal Time)

Saved by @ankity09 #aws #emr #spark

"""
A simple example demonstrating basic Spark SQL features using fictional
data inspired by a paper on determining the optimum length of chopsticks.
https://www.ncbi.nlm.nih.gov/pubmed/15676839
Run with:
  ./bin/spark-submit OptimumChopstick.py
"""
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)
from pyspark.storagelevel import StorageLevel

# Get average food-pinching efficiency by chopstick length, highest first
def AvgEffeciencyByLength(df):
    meansDf = (df.groupby('ChopstickLength')
                 .mean('FoodPinchingEffeciency')
                 .orderBy('avg(FoodPinchingEffeciency)', ascending=False))
    return meansDf

# init
spark = SparkSession.builder.appName("Optimum Chopstick").getOrCreate()
sc = spark.sparkContext
input_loc = "s3://llubbe-gdelt-open-data/ChopstickEffeciency/"

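# Read input by line and split each line on commas
lines = sc.textFile(input_loc)
parts = lines.map(lambda l: l.split(","))
# Cache the split rows. PySpark always stores RDD data serialized, and
# MEMORY_ONLY_SER is not available in PySpark 3.x, so use MEMORY_ONLY.
parts.persist(StorageLevel.MEMORY_ONLY)
# Convert each row to a typed tuple matching the schema below.
chopstickItems = parts.map(lambda p: (str(p[0]), float(p[1]), int(p[2]), int(p[3].strip())))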

# Define a schema
fields = [StructField("TestID", StringType()),
          StructField("FoodPinchingEffeciency", DoubleType()), 
          StructField("Individual", IntegerType()), 
          StructField("ChopstickLength", IntegerType())]
schema = StructType(fields)

# Apply the schema to the RDD
chopsticksDF = spark.createDataFrame(chopstickItems, schema)

effeciencyByLength = AvgEffeciencyByLength(chopsticksDF)
effeciencyByLength.distinct().count()

moar_chopsticksDF = spark.read.load(input_loc, format="csv", schema=schema)
moar_effeciencyByLength = AvgEffeciencyByLength(moar_chopsticksDF)
moar_effeciencyByLength.distinct().count()

spark.stop()
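
For readers without a cluster at hand, the aggregation this job performs can be sketched in plain Python. This is only an illustration of the group, average, then sort-descending logic, not the Spark implementation. The sample rows and the helper name `avg_efficiency_by_length` are made up for the example, and the column order (TestID, efficiency, individual, length) is assumed from the schema in the job.

```python
from collections import defaultdict

# Hypothetical sample rows in the CSV layout assumed from the schema:
# TestID, FoodPinchingEffeciency, Individual, ChopstickLength
SAMPLE_LINES = [
    "1,20.0,1,180",
    "2,24.0,1,240",
    "3,22.0,2,180",
    "4,28.0,2,240",
    "5,18.0,3,330",
]


def avg_efficiency_by_length(lines):
    """Return (length, mean efficiency) pairs, highest mean first."""
    totals = defaultdict(lambda: [0.0, 0])  # length -> [sum, count]
    for line in lines:
        _test_id, efficiency, _individual, length = line.split(",")
        bucket = totals[int(length.strip())]
        bucket[0] += float(efficiency)
        bucket[1] += 1
    means = [(length, total / count)
             for length, (total, count) in totals.items()]
    # Mirrors orderBy('avg(...)', ascending=False) in the Spark job.
    return sorted(means, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for length, mean in avg_efficiency_by_length(SAMPLE_LINES):
        print(length, mean)
```

Running the same rows through both code paths in the job (the RDD plus `createDataFrame` path and the `spark.read.load` CSV path) should give the same per-length averages, which is a handy sanity check.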


More like this

#pyspark #spark #python #etl

Split Spark Dataframe string column into multiple columns

from pyspark.sql import functions as F

# Split the string column on '-' and pull out the first two pieces
split_col = F.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))

Get current AWS CLI user

aws sts get-caller-identity

Finding Object Information

kubectl get pods -o yaml | grep -C 5 labels:

Imperative Approach

kubectl run frontend --image=nginx --port=80

Create IAM user from CLI

aws iam create-user --user-name username

Create Access Key for user

aws iam create-access-key --user-name username | tee /tmp/create_output.json

Create Shell script to export Keys for username

cat << EoF > username_cred.sh
export AWS_SECRET_ACCESS_KEY=$(jq -r .AccessKey.SecretAccessKey /tmp/create_output.json)
export AWS_ACCESS_KEY_ID=$(jq -r .AccessKey.AccessKeyId /tmp/create_output.json)
EoF
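
Because the `EoF` delimiter is unquoted, the two `$(jq ...)` substitutions run when the heredoc is written, so `username_cred.sh` ends up holding the literal key values. The same extraction can be sketched in Python. The sample JSON below uses fake placeholder values in the shape that `aws iam create-access-key` prints, and `export_lines` is a hypothetical helper name.

```python
import json

# Placeholder response in the shape printed by `aws iam create-access-key`;
# the key values here are fake.
SAMPLE_OUTPUT = """
{
  "AccessKey": {
    "UserName": "username",
    "AccessKeyId": "AKIAEXAMPLEKEYID",
    "Status": "Active",
    "SecretAccessKey": "exampleSecretAccessKey"
  }
}
"""


def export_lines(raw_json):
    """Build the two export statements the shell script would contain."""
    access_key = json.loads(raw_json)["AccessKey"]
    return [
        "export AWS_SECRET_ACCESS_KEY={}".format(access_key["SecretAccessKey"]),
        "export AWS_ACCESS_KEY_ID={}".format(access_key["AccessKeyId"]),
    ]


if __name__ == "__main__":
    print("\n".join(export_lines(SAMPLE_OUTPUT)))
```

Either way the resulting file contains a secret in plain text, so it is worth restricting its permissions and deleting `/tmp/create_output.json` once the keys are stored somewhere safer.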

Describe configmap for EKS

kubectl describe configmap -n kube-system aws-auth

#aws #awscli #ec2 #describe-images

Get Image Tag Value

aws --profile default ec2 describe-images --owners self --query 'Images[*].[Tags[?Key==`ImageType`] | [0].Value]'
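
As a rough guide to what that `--query` expression selects, here is a plain-Python approximation of the JMESPath logic as I read it: for every image, filter its tags to those whose `Key` is `ImageType`, take the first match, and return its `Value` wrapped in a one-element list (or `None` when there is no match). The sample images and the function name `image_type_values` are invented for illustration.

```python
# Hypothetical, trimmed-down `describe-images` response for illustration.
SAMPLE_IMAGES = [
    {"ImageId": "ami-1", "Tags": [{"Key": "Name", "Value": "web"},
                                  {"Key": "ImageType", "Value": "base"}]},
    {"ImageId": "ami-2", "Tags": [{"Key": "Name", "Value": "db"}]},
    {"ImageId": "ami-3"},  # an image with no tags at all
    {"ImageId": "ami-4", "Tags": [{"Key": "ImageType", "Value": "gpu"}]},
]


def image_type_values(images, tag_key="ImageType"):
    """Approximate Images[*].[Tags[?Key==`ImageType`] | [0].Value]."""
    rows = []
    for image in images:
        tags = image.get("Tags") or []
        matches = [tag for tag in tags if tag.get("Key") == tag_key]
        # One single-element row per image, None when the tag is absent.
        rows.append([matches[0].get("Value") if matches else None])
    return rows


if __name__ == "__main__":
    print(image_type_values(SAMPLE_IMAGES))
```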

Get Hive Metastore (HMS) URI from an EMR node

ssh -i ~/Documents/nkityd.pem hadoop@ec2-54-236-239-19.compute-1.amazonaws.com cat /etc/hive/conf/hive-site.xml | grep "thrift://" | sed -e 's/<value>//g' -e 's/<\/value>//g' | awk '{print "HMS URI: " $1}'

Connect to beeline

beeline -u "jdbc:hive2://localhost:10000/default" -n hdfs

Connect to mysql database for HMS

mysql -h ip-172-31-39-192.ec2.internal -u hive -p

Generate TeraGen data on EMR

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000 /nkityd/teragendata

Run TeraSort on EMR

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /nkityd/teragendata /nkityd/teragensorteddata/
