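This article follows Professor Lin Ziyu's big data course, Spark编程基础(Python版) (Spark Programming Fundamentals, Python Edition).
1. Online course
2. PPT download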

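To make the code easier to write, I did not edit it with vim on Linux. Instead, I wrote and ran it on Windows 10 in the PyCharm IDE. Install pyspark beforehand, and use pip3 to install any other packages the code imports. This article uses Python 3.7.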


0. Concepts

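  • DataFrame: Spark ML uses the DataFrame from Spark SQL as its dataset type, and it can hold many kinds of data. Compared with an RDD, a DataFrame carries schema information and is closer to a two-dimensional table in a traditional database. ML Pipelines use it to store the source data. For example, the columns of a DataFrame can hold text, feature vectors, true labels, and predicted labels.
  • Transformer: an algorithm that turns one DataFrame into another DataFrame. A trained model, for example, is a Transformer: it takes a test DataFrame without predicted labels and turns it into another DataFrame that includes them. Technically, a Transformer implements the method transform(), which converts one DataFrame into another by appending one or more columns.
  • Estimator: the abstraction of a learning algorithm, or of any method that is trained on data. In a Pipeline it usually operates on a DataFrame and produces a Transformer. Technically, an Estimator implements the method fit(), which accepts a DataFrame and returns a Transformer. A random forest algorithm, for example, is an Estimator: calling fit() on the training features yields a random forest model.
  • Parameter: Parameters configure a Transformer or an Estimator. All transformers and estimators share a common API for specifying parameters. A ParamMap is a set of (parameter, value) pairs.
  • Pipeline: a Pipeline chains several workflow stages (transformers and estimators) together into one machine learning workflow and produces the final output.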
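The contract between these concepts can be sketched in plain Python. This is only a toy model under stated assumptions: a "DataFrame" is a list of dicts, and the class names (LowercaseTokenizer, KeywordEstimator, KeywordModel, ToyPipeline, ToyPipelineModel) are hypothetical and are not Spark classes. It shows that an estimator's fit() returns a transformer, and that fitting a pipeline returns a model that is itself a transformer.

```python
# Toy, plain-Python model of the Transformer / Estimator / Pipeline contract.
# These are NOT Spark classes; a "DataFrame" here is just a list of dicts.

class LowercaseTokenizer:
    """Transformer: appends a 'words' column derived from 'text'."""
    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class KeywordModel:
    """Transformer returned by KeywordEstimator.fit(): appends 'prediction'."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(row, prediction=1.0 if self.keywords & set(row["words"]) else 0.0)
                for row in rows]


class KeywordEstimator:
    """Estimator: fit() learns the words that occur only in positive rows."""
    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        return KeywordModel(positive - negative)


class ToyPipelineModel:
    """Transformer: runs every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Estimator: fit() fits each estimator stage and returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: replace it by the fitted model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # pass the transformed rows to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


training = [
    {"id": 0, "text": "a b c d e spark", "label": 1.0},
    {"id": 1, "text": "b d", "label": 0.0},
    {"id": 2, "text": "spark f g h", "label": 1.0},
    {"id": 3, "text": "hadoop mapreduce", "label": 0.0},
]
test = [
    {"id": 4, "text": "spark i j k"},
    {"id": 5, "text": "l m n"},
    {"id": 6, "text": "spark hadoop spark"},
    {"id": 7, "text": "apache hadoop"},
]

model = ToyPipeline([LowercaseTokenizer(), KeywordEstimator()]).fit(training)
predictions = [row["prediction"] for row in model.transform(test)]
print(predictions)
```

The keyword rule stands in for a real learning algorithm only to keep the sketch short; the point is the fit()/transform() flow, not the model.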

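1. A simple example: building a machine learning pipeline

Using logistic regression as the example, this section builds a typical machine learning workflow to show how a pipeline is applied.

Task: find every sentence that contains "spark". Sentences that contain "spark" get label 1, and sentences without it get label 0.

Code: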

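# 1 Import the required packages
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession instead of importing it from pyspark.shell
spark = SparkSession.builder.getOrCreate()

# Build the training set from a list of (id, text, label) tuples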
trainingData = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# 2 Define the PipelineStages of the Pipeline
# They consist of two transformers (Tokenizer, HashingTF) and one estimator (LogisticRegression)
# Tokenize the text (Raw text -> Words)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# Convert the words to feature vectors (Words -> Feature vectors)
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
# Logistic regression algorithm (Feature vectors -> Logistic Regression Model)
lr = LogisticRegression(maxIter=10, regParam=0.001)

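# 3 Arrange the PipelineStages in processing order and create a Pipeline (which is itself an estimator)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# fit() trains the pipeline and returns a PipelineModel, a transformer used later to predict the test set labels
model = pipeline.fit(trainingData)

# 4 Build the test set (no label column; the model predicts the labels)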
testData = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

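# 5 Call transform() on the PipelineModel trained above
# The test data passes through the fitted pipeline stages in order and comes out with predictions
prediction = model.transform(testData)
# Select four columns: probability holds the per-class probabilities, prediction the predicted label
# select() picks the output columns, collect() fetches all rows, and a for loop prints each row
preRows = prediction.select("id", "text", "probability", "prediction").collect()
for row in preRows:
    # Unpack into "pred" so the loop does not overwrite the prediction DataFrame above
    rid, text, prob, pred = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, prob, pred))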

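The Words -> Feature vectors step relies on the hashing trick: each word is hashed to an index in a fixed-size vector, and the count at that index is incremented. The sketch below illustrates the idea in plain Python under some assumptions: it uses zlib.crc32 as a stand-in hash and a tiny vector of 16 buckets, whereas Spark's HashingTF uses MurmurHash3 and defaults to 2^18 features, so the indices will not match Spark's. The helper names tokenize and hashing_tf are hypothetical.

```python
import zlib


def tokenize(text):
    # Like Spark's Tokenizer: lowercase the text and split on whitespace
    return text.lower().split()


def hashing_tf(words, num_features=16):
    # Map every word to a bucket index and count occurrences per bucket.
    # crc32 is only a stand-in for the hash function Spark actually uses.
    vector = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


features = hashing_tf(tokenize("spark hadoop spark"))
spark_bucket = zlib.crc32(b"spark") % 16
print(features, spark_bucket)
```

Because no vocabulary is stored, two different words can land in the same bucket; a larger num_features makes such collisions less likely.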

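2. Logistic regression classifier

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. Its dependent variable can be binary or multiclass.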
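As a rough illustration of the binary and multiclass cases, the sketch below computes class probabilities from linear margins: a sigmoid for two classes and a softmax for several. The weights and the sample are made up for the example, and the function names (sigmoid, softmax, predict_proba) are hypothetical; this is not how Spark trains the model, only how a fitted log-linear model turns features into probabilities.

```python
import math


def sigmoid(z):
    # Binary case: probability of the positive class for margin z
    return 1.0 / (1.0 + math.exp(-z))


def softmax(margins):
    # Multiclass case: subtract the max margin first for numerical stability
    top = max(margins)
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, coefficients, intercepts):
    # One margin per class: dot(coefficient row, features) + intercept
    margins = [sum(w * x for w, x in zip(row, features)) + b
               for row, b in zip(coefficients, intercepts)]
    return softmax(margins)


# Hypothetical weights for 3 classes and 2 features (illustration only)
coefficients = [[-1.0, -0.5], [0.2, 0.1], [0.8, 0.6]]
intercepts = [2.0, 0.0, -2.0]
probabilities = predict_proba([1.4, 0.2], coefficients, intercepts)
print(probabilities)
```

With two classes and one margin fixed at 0, the softmax reduces to the sigmoid, which is why the binary model is a special case of the multiclass one.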

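Task: analyze the iris dataset (iris download link). The data comes from measurements of iris flowers: it contains 150 samples in 3 classes of 50 samples each, and every sample has 4 attributes. It is a very common training and test set in data mining and classification. To keep things simple, the stated plan is to classify mainly on the last two attributes (petal length and petal width) and to start with the last two classes, using binomial logistic regression. The code below, however, uses all four attributes and all three classes, as the reported numClasses: 3 and numFeatures: 4 confirm, so what it actually fits is a multinomial logistic regression.

Randomly select 70% of the iris dataset as the training set and 30% as the test set, train a logistic regression model on the training set, predict the labels of the test set, compare them with the true labels, and compute the prediction accuracy.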
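The split-and-score procedure can be sketched without Spark. The helper names random_split and accuracy are hypothetical. The sketch assigns each row to the training set with probability 0.7, which, as far as I understand it, mirrors how a weighted random split samples row by row; the resulting sizes are therefore only approximately 70/30.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    # Decide independently for every row which side it goes to,
    # so the two parts are only approximately train_fraction / (1 - train_fraction)
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def accuracy(labels, predictions):
    # Fraction of rows whose predicted label equals the true label
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


rows = list(range(150))
train, test = random_split(rows)
print(len(train), len(test))
print(accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"]))
```

Fixing the seed makes the split reproducible, which helps when comparing two classifiers on the same test rows.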

Code:

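# 1 Import the required packages
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row, SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.classification import LogisticRegression

# Create (or reuse) a SparkSession instead of importing it from pyspark.shell
spark = SparkSession.builder.getOrCreate()

# 2 Define a helper that turns one parsed line into a dict with a dense feature vector and a label
def f(x):
    rel = {}
    # Each line has 5 fields: the first 4 are the iris features, the last is the iris class
    rel['features'] = Vectors.dense(float(x[0]), float(x[1]), float(x[2]), float(x[3]))
    rel['label'] = str(x[4])
    return rel

# Read iris.txt (adjust the path to your own file location)
# The first map splits each line on ","; the second map stores the features in a Vector and wraps the record in a Row
# The resulting RDD is converted to a DataFrame, and show() displays the first rows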
data = spark.sparkContext. \
    textFile("file:///C:/Users/LJW/Desktop/iris.txt"). \
    map(lambda line: line.split(',')). \
    map(lambda p: Row(**f(p))). \
    toDF()
data.show()

# 3 Index the label column and the feature column, writing the results to new column names
labelIndexer = StringIndexer(). \
    setInputCol("label"). \
    setOutputCol("indexedLabel"). \
    fit(data)  # estimator -> transformer
featureIndexer = VectorIndexer(). \
    setInputCol("features"). \
    setOutputCol("indexedFeatures"). \
    fit(data)

# 4 Set up an IndexToString transformer that maps the numeric prediction back to the string predictedLabel
labelConverter = IndexToString(). \
    setInputCol("prediction"). \
    setOutputCol("predictedLabel"). \
    setLabels(labelIndexer.labels)

# 5 Configure LogisticRegression: at most 100 iterations, regularization parameter 0.3, elastic net mixing parameter 0.8
# explainParams() lists every parameter that can be set
lr = LogisticRegression(). \
    setLabelCol("indexedLabel"). \
    setFeaturesCol("indexedFeatures"). \
    setMaxIter(100). \
    setRegParam(0.3). \
    setElasticNetParam(0.8)
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# 6 Build the machine learning Pipeline, call fit() on the training set to train it, and call transform() on the test set to predict
lrPipeline = Pipeline().setStages([labelIndexer, featureIndexer, lr, labelConverter])
# Randomly split the dataset into a training set (about 70%) and a test set (about 30%)
trainingData, testData = data.randomSplit([0.7, 0.3])
# A Pipeline is itself an estimator: calling fit() on it produces a PipelineModel, which is a transformer
lrPipelineModel = lrPipeline.fit(trainingData)
# Calling transform() on the PipelineModel returns a new DataFrame with the predictions for the test set
lrPredictions = lrPipelineModel.transform(testData)

# 7 Print the predictions: select() picks the output columns, collect() fetches all rows, and a for loop prints each row
preRows = lrPredictions.select("label", "features", "probability", "predictedLabel").collect()
for row in preRows:
    label, features, probability, predictedLabel = row
    print("%s,%s --> prob=%s,predictedLabel:%s" % (label, features, probability, predictedLabel))

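# 8 Evaluate the trained model with a MulticlassClassificationEvaluator
# Set the label and prediction column names; the default metric is f1, so request accuracy explicitly
evaluator = MulticlassClassificationEvaluator(). \
    setLabelCol("indexedLabel"). \
    setPredictionCol("prediction"). \
    setMetricName("accuracy")
lrAccuracy = evaluator.evaluate(lrPredictions)
print("\nlrAccuracy=%f" % lrAccuracy)

# 9 Retrieve the trained logistic regression model from the pipeline model
# lrPipelineModel is a PipelineModel, so the lr model is available through its stages attribute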
lrModel = lrPipelineModel.stages[2]
print("\nCoefficients: \n " + str(lrModel.coefficientMatrix) +
      "\nIntercept: " + str(lrModel.interceptVector) +
      "\n numClasses: " + str(lrModel.numClasses) +
      "\n numFeatures: " + str(lrModel.numFeatures))

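Output (partial). The run shown here used the default metric of MulticlassClassificationEvaluator, f1, so the value printed as lrAccuracy is a weighted F1 score rather than accuracy: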

Iris-setosa,[4.3,3.0,1.1,0.1] --> prob=[0.5243322260103365,0.2807261844423659,0.1949415895472976],predictedLabel:Iris-setosa
Iris-setosa,[4.4,2.9,1.4,0.2] --> prob=[0.49729174541655624,0.2912406744481094,0.2114675801353344],predictedLabel:Iris-setosa
Iris-setosa,[4.4,3.2,1.3,0.2] --> prob=[0.5033392716254922,0.28773708047332464,0.20892364790118315],predictedLabel:Iris-setosa
Iris-setosa,[4.6,3.2,1.4,0.2] --> prob=[0.49729174541655624,0.2912406744481094,0.2114675801353344],predictedLabel:Iris-setosa
Iris-setosa,[4.6,3.6,1.0,0.2] --> prob=[0.5214689199070267,0.27723378965826473,0.2012972904347087],predictedLabel:Iris-setosa
Iris-setosa,[4.7,3.2,1.3,0.2] --> prob=[0.5033392716254922,0.28773708047332464,0.20892364790118315],predictedLabel:Iris-setosa
Iris-setosa,[4.7,3.2,1.6,0.2] --> prob=[0.48520083817212994,0.2982454609190897,0.21655370090878034],predictedLabel:Iris-setosa
Iris-setosa,[4.8,3.0,1.4,0.3] --> prob=[0.48825575319574605,0.29089433834232115,0.22084990846193275],predictedLabel:Iris-setosa
Iris-setosa,[4.9,3.1,1.5,0.1] --> prob=[0.5001610520649271,0.2949913539708255,0.20484759396424754],predictedLabel:Iris-setosa
Iris-setosa,[4.9,3.1,1.5,0.1] --> prob=[0.5001610520649271,0.2949913539708255,0.20484759396424754],predictedLabel:Iris-setosa
Iris-setosa,[5.0,3.0,1.6,0.2] --> prob=[0.48520083817212994,0.2982454609190897,0.21655370090878034],predictedLabel:Iris-setosa
Iris-setosa,[5.0,3.2,1.2,0.2] --> prob=[0.5093858209426285,0.2842340524542886,0.206380126603083],predictedLabel:Iris-setosa
Iris-setosa,[5.0,3.3,1.4,0.2] --> prob=[0.49729174541655624,0.2912406744481094,0.2114675801353344],predictedLabel:Iris-setosa
Iris-setosa,[5.0,3.4,1.5,0.2] --> prob=[0.49124501150976607,0.29474380940793027,0.21401117908230358],predictedLabel:Iris-setosa
Iris-setosa,[5.1,3.5,1.4,0.2] --> prob=[0.49729174541655624,0.2912406744481094,0.2114675801353344],predictedLabel:Iris-setosa
Iris-setosa,[5.1,3.8,1.6,0.2] --> prob=[0.48520083817212994,0.2982454609190897,0.21655370090878034],predictedLabel:Iris-setosa
Iris-setosa,[5.2,3.5,1.5,0.2] --> prob=[0.49124501150976607,0.29474380940793027,0.21401117908230358],predictedLabel:Iris-setosa
Iris-setosa,[5.3,3.7,1.5,0.2] --> prob=[0.49124501150976607,0.29474380940793027,0.21401117908230358],predictedLabel:Iris-setosa
Iris-setosa,[5.4,3.7,1.5,0.2] --> prob=[0.49124501150976607,0.29474380940793027,0.21401117908230358],predictedLabel:Iris-setosa
Iris-setosa,[5.4,3.9,1.7,0.4] --> prob=[0.4610298666177363,0.30045633530810995,0.23851379807415374],predictedLabel:Iris-setosa
Iris-setosa,[5.5,3.5,1.3,0.2] --> prob=[0.5033392716254922,0.28773708047332464,0.20892364790118315],predictedLabel:Iris-setosa
Iris-versicolor,[5.7,2.8,4.5,1.3] --> prob=[0.23405628210659687,0.3503950951343531,0.4155486227590501],predictedLabel:Iris-virginica
Iris-setosa,[5.7,4.4,1.5,0.4] --> prob=[0.47307195215434117,0.29374330861943265,0.23318473922622612],predictedLabel:Iris-setosa
Iris-versicolor,[5.8,2.7,4.1,1.0] --> prob=[0.27548327231423464,0.3556055570492075,0.368911170636558],predictedLabel:Iris-virginica
Iris-setosa,[5.8,4.0,1.2,0.2] --> prob=[0.5093858209426285,0.2842340524542886,0.206380126603083],predictedLabel:Iris-setosa
Iris-versicolor,[5.9,3.0,4.2,1.5] --> prob=[0.23207148536211597,0.33437704647792515,0.4335514681599589],predictedLabel:Iris-virginica
Iris-versicolor,[5.9,3.2,4.8,1.8] --> prob=[0.18679593114220863,0.32761065662002586,0.4855934122377655],predictedLabel:Iris-virginica
Iris-versicolor,[6.0,2.2,4.0,1.0] --> prob=[0.2803376182533597,0.3532229586552324,0.3664394230914078],predictedLabel:Iris-virginica
Iris-versicolor,[6.1,2.8,4.0,1.3] --> prob=[0.25643231693957325,0.3401587648781376,0.4034089181822891],predictedLabel:Iris-virginica
Iris-versicolor,[6.1,2.9,4.7,1.4] --> prob=[0.21831381999091906,0.3489615595842377,0.4327246204248433],predictedLabel:Iris-virginica
Iris-versicolor,[6.5,2.8,4.6,1.5] --> prob=[0.21527571318131467,0.34169038435779375,0.44303390246089164],predictedLabel:Iris-virginica
Iris-versicolor,[5.5,2.4,3.8,1.1] --> prob=[0.28200999494950263,0.34440356990877696,0.37358643514172035],predictedLabel:Iris-virginica
Iris-versicolor,[5.5,2.5,4.0,1.3] --> prob=[0.25643231693957325,0.3401587648781376,0.4034089181822891],predictedLabel:Iris-virginica
Iris-versicolor,[5.7,2.6,3.5,1.0] --> prob=[0.30537417641293546,0.34093457541890737,0.3536912481681571],predictedLabel:Iris-virginica
Iris-versicolor,[5.7,2.9,4.2,1.3] --> prob=[0.24731651349696118,0.34432895746524445,0.4083545290377943],predictedLabel:Iris-virginica
Iris-versicolor,[5.7,3.0,4.2,1.2] --> prob=[0.2550461699890246,0.3490535961297235,0.3959002338812519],predictedLabel:Iris-virginica
Iris-versicolor,[5.8,2.6,4.0,1.2] --> prob=[0.2643468189724618,0.34469570864297666,0.3909574723845617],predictedLabel:Iris-virginica
Iris-virginica,[6.0,2.2,5.0,1.5] --> prob=[0.19937990466720557,0.34861185348025864,0.4520082418525358],predictedLabel:Iris-virginica
Iris-versicolor,[6.0,2.7,5.1,1.6] --> prob=[0.18893619942100204,0.3442935001259121,0.466770300453086],predictedLabel:Iris-virginica
Iris-versicolor,[6.1,3.0,4.6,1.4] --> prob=[0.22247010206632536,0.34710610567933564,0.430423792254339],predictedLabel:Iris-virginica
Iris-versicolor,[6.2,2.9,4.3,1.3] --> prob=[0.24284101804142783,0.34637635549107865,0.41078262646749364],predictedLabel:Iris-virginica
Iris-virginica,[6.4,3.1,5.5,1.8] --> prob=[0.16242448956109531,0.3374290334394874,0.5001464769994173],predictedLabel:Iris-virginica
Iris-virginica,[6.4,3.2,5.3,2.3] --> prob=[0.13982955580747441,0.3015462978145963,0.5586241463779293],predictedLabel:Iris-virginica
Iris-virginica,[6.5,3.2,5.1,2.0] --> prob=[0.1635497990405064,0.3191921118474714,0.5172580891120222],predictedLabel:Iris-virginica
Iris-versicolor,[6.7,3.1,4.7,1.5] --> prob=[0.21121732742751112,0.34345751634463606,0.44532515622785285],predictedLabel:Iris-virginica
Iris-virginica,[7.2,3.0,5.8,1.6] --> prob=[0.1643419477261269,0.3547336664765263,0.4809243857973467],predictedLabel:Iris-virginica
Iris-virginica,[7.2,3.2,6.0,1.8] --> prob=[0.1466338234492874,0.3437905245970798,0.5095756519536329],predictedLabel:Iris-virginica
Iris-virginica,[7.4,2.8,6.1,1.9] --> prob=[0.13830261194763882,0.33794265280847413,0.523754735243887],predictedLabel:Iris-virginica
Iris-virginica,[7.7,2.8,6.7,2.0] --> prob=[0.11721249765044979,0.3368745764712335,0.5459129258783166],predictedLabel:Iris-virginica
Iris-virginica,[7.9,3.8,6.4,2.0] --> prob=[0.12493278994581293,0.3339284878708483,0.5411387221833388],predictedLabel:Iris-virginica

lrAccuracy=0.550000

Coefficients: 
 3 X 4 CSRMatrix
(0,2) -0.2419
(0,3) -0.1715
(1,3) 0.446
Intercept: [0.7417523479805953,-0.16623552721353418,-0.575516820767061]
 numClasses: 3
 numFeatures: 4
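The difference between accuracy and the evaluator's default f1 metric can be checked by hand against the 50 prediction rows listed above. The sketch below tallies them into confusion counts (23 setosa rows predicted correctly, 18 versicolor rows predicted as virginica, 9 virginica rows predicted correctly) and computes both metrics. It assumes the listing is the complete test set, and the helper name f1_for is hypothetical.

```python
# (true label, predicted label) -> number of rows, tallied from the listing above
counts = {
    ("setosa", "setosa"): 23,
    ("versicolor", "virginica"): 18,
    ("virginica", "virginica"): 9,
}
labels = ["setosa", "versicolor", "virginica"]
total = sum(counts.values())


def support(label):
    # Number of rows whose true label is `label`
    return sum(n for (true, _), n in counts.items() if true == label)


def f1_for(label):
    # Per-class F1 from precision and recall; 0.0 when the class is never predicted correctly
    true_positive = counts.get((label, label), 0)
    if true_positive == 0:
        return 0.0
    predicted = sum(n for (_, pred), n in counts.items() if pred == label)
    precision = true_positive / predicted
    recall = true_positive / support(label)
    return 2 * precision * recall / (precision + recall)


accuracy = sum(n for (true, pred), n in counts.items() if true == pred) / total
weighted_f1 = sum(f1_for(label) * support(label) / total for label in labels)
print("accuracy=%f weighted_f1=%f" % (accuracy, weighted_f1))
```

Under that assumption the weighted F1 comes out at 0.55, matching the printed lrAccuracy, while the plain accuracy over the same rows is 0.64.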

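3. Decision tree classifier

A decision tree is a basic method for classification and regression; this section covers decision trees used for classification. A decision tree model has a tree structure: each internal node tests one attribute, each branch corresponds to one outcome of that test, and each leaf node represents a class. During learning, the tree is built from the training data by minimizing a loss function; during prediction, new data is classified with the resulting tree. Decision tree learning usually involves three steps: feature selection, tree generation, and tree pruning.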
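The feature-selection step can be illustrated with Gini impurity, which is the default impurity measure of Spark's DecisionTreeClassifier. The sketch below scores every candidate threshold on one feature and keeps the one with the lowest weighted impurity. It is a simplification under stated assumptions: the helper names gini and best_threshold are hypothetical, the data is a toy sample of petal lengths, and candidate thresholds are the midpoints between distinct values rather than the binned candidates a production implementation would use.

```python
from collections import Counter


def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions (0.0 for a pure node)
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    # Try the midpoint between each pair of adjacent distinct values and
    # return (threshold, weighted impurity) of the best split found
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy sample: petal lengths for a few setosa and versicolor flowers
petal_length = [1.1, 1.4, 1.5, 1.9, 3.5, 4.0, 4.5]
species = ["setosa"] * 4 + ["versicolor"] * 3
threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)
```

Tree generation repeats this search on each side of the chosen split until the nodes are pure or a depth limit is reached.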

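Task:
Analyze the iris dataset (iris download link).
Randomly select 70% of the iris dataset as the training set and 30% as the test set, train a decision tree model on the training set, predict the labels of the test set, compare them with the true labels, and compute the prediction accuracy.

Code: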

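# 1 Import the required packages
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row, SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier

# Create (or reuse) a SparkSession instead of importing it from pyspark.shell
spark = SparkSession.builder.getOrCreate()

# 2 Define a helper that turns one parsed line into a dict with a dense feature vector and a label
def f(x):
    rel = {}
    # Each line has 5 fields: the first 4 are the iris features, the last is the iris class
    rel['features'] = Vectors.dense(float(x[0]), float(x[1]), float(x[2]), float(x[3]))
    rel['label'] = str(x[4])
    return rel

# Read iris.txt (adjust the path to your own file location)
# The first map splits each line on ","; the second map stores the features in a Vector and wraps the record in a Row
# The resulting RDD is converted to a DataFrame, and show() displays the first rows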
data = spark.sparkContext. \
    textFile("file:///C:/Users/LJW/Desktop/iris.txt"). \
    map(lambda line: line.split(',')). \
    map(lambda p: Row(**f(p))). \
    toDF()
data.show()

# 3 Index the label column and the feature column, writing the results to new column names
labelIndexer = StringIndexer(). \
    setInputCol("label"). \
    setOutputCol("indexedLabel"). \
    fit(data)  # estimator -> transformer
featureIndexer = VectorIndexer(). \
    setInputCol("features"). \
    setOutputCol("indexedFeatures"). \
    setMaxCategories(4). \
    fit(data)

# 4 Set up an IndexToString transformer that maps the numeric prediction back to the string predictedLabel
labelConverter = IndexToString(). \
    setInputCol("prediction"). \
    setOutputCol("predictedLabel"). \
    setLabels(labelIndexer.labels)

# 5 Create the DecisionTreeClassifier and set its parameters
# Only the features column (FeaturesCol) and the label column (LabelCol) need to be set; explainParams() lists all parameters
dtClassifier = DecisionTreeClassifier(). \
    setLabelCol("indexedLabel"). \
    setFeaturesCol("indexedFeatures")
print("DecisionTreeClassifier parameters:\n" + dtClassifier.explainParams() + "\n")

# 6 Build the machine learning Pipeline, call fit() on the training set to train it, and call transform() on the test set to predict
dtPipeline = Pipeline().setStages([labelIndexer, featureIndexer, dtClassifier, labelConverter])
# Randomly split the dataset into a training set (about 70%) and a test set (about 30%)
trainingData, testData = data.randomSplit([0.7, 0.3])
# A Pipeline is itself an estimator: calling fit() on it produces a PipelineModel, which is a transformer
dtPipelineModel = dtPipeline.fit(trainingData)
# Calling transform() on the PipelineModel returns a new DataFrame with the predictions for the test set
dtPredictions = dtPipelineModel.transform(testData)

# 7 Print the predictions: select() picks the output columns, collect() fetches all rows, and a for loop prints each row
preRows = dtPredictions.select("label", "features", "probability", "predictedLabel").collect()
for row in preRows:
    label, features, probability, predictedLabel = row
    print("%s,%s --> prob=%s,predictedLabel:%s" % (label, features, probability, predictedLabel))

# 8 Evaluate the trained model with a MulticlassClassificationEvaluator
# Set the label and prediction column names; the default metric is f1, so request accuracy explicitly
evaluator = MulticlassClassificationEvaluator(). \
    setLabelCol("indexedLabel"). \
    setPredictionCol("prediction"). \
    setMetricName("accuracy")
dtAccuracy = evaluator.evaluate(dtPredictions)
print("\ndtAccuracy=%f" % dtAccuracy)

# 9 Inspect the structure of the trained tree through the toDebugString property of DecisionTreeClassificationModel
treeModelClassifier = dtPipelineModel.stages[2]
print("\nLearned classification tree model:\n" + str(treeModelClassifier.toDebugString))

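Output (partial). The run shown here used the default metric of MulticlassClassificationEvaluator, f1, so the value printed as dtAccuracy is a weighted F1 score rather than accuracy: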

Iris-setosa,[4.3,3.0,1.1,0.1] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.4,2.9,1.4,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.6,3.1,1.5,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.6,3.2,1.4,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.6,3.6,1.0,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.8,3.4,1.9,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[4.9,3.1,1.5,0.1] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-versicolor,[5.0,2.0,3.5,1.0] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-setosa,[5.1,3.4,1.5,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[5.1,3.8,1.9,0.4] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[5.2,3.5,1.5,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[5.2,4.1,1.5,0.1] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-setosa,[5.4,3.4,1.7,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-versicolor,[5.5,2.3,4.0,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-setosa,[5.5,4.2,1.4,0.2] --> prob=[1.0,0.0,0.0],predictedLabel:Iris-setosa
Iris-versicolor,[5.7,2.8,4.5,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[6.1,2.8,4.0,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[6.3,2.5,4.9,1.5] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[6.3,3.3,4.7,1.6] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.4,3.0,4.5,1.5] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.5,2.5,4.0,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.6,3.0,4.1,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.7,2.6,3.5,1.0] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.7,2.8,4.1,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[5.7,2.9,4.2,1.3] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-virginica,[5.8,2.8,5.1,2.4] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.0,2.2,5.0,1.5] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-versicolor,[6.0,3.4,4.5,1.6] --> prob=[0.0,1.0,0.0],predictedLabel:Iris-versicolor
Iris-virginica,[6.3,3.4,5.6,2.4] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.4,2.7,5.3,1.9] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.5,3.0,5.2,2.0] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.5,3.0,5.8,2.2] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.7,3.0,5.2,2.3] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[6.9,3.2,5.7,2.3] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[7.2,3.2,6.0,1.8] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[7.7,2.6,6.9,2.3] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica
Iris-virginica,[7.7,3.8,6.7,2.2] --> prob=[0.0,0.0,1.0],predictedLabel:Iris-virginica

dtAccuracy=0.972830

Learned classification tree model:
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_fafc74570484, depth=5, numNodes=17, numClasses=3, numFeatures=4
  If (feature 2 <= 2.35)
   Predict: 0.0
  Else (feature 2 > 2.35)
   If (feature 3 <= 1.75)
    If (feature 2 <= 5.05)
     If (feature 0 <= 4.95)
      If (feature 1 <= 2.45)
       Predict: 1.0
      Else (feature 1 > 2.45)
       Predict: 2.0
     Else (feature 0 > 4.95)
      Predict: 1.0
    Else (feature 2 > 5.05)
     If (feature 0 <= 6.05)
      Predict: 1.0
     Else (feature 0 > 6.05)
      Predict: 2.0
   Else (feature 3 > 1.75)
    If (feature 2 <= 4.85)
     If (feature 0 <= 5.95)
      Predict: 1.0
     Else (feature 0 > 5.95)
      Predict: 2.0
    Else (feature 2 > 4.85)
     Predict: 2.0
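To make the printed tree easier to read, it can be transcribed by hand into a plain Python function. This is only a transcription of the debug output above (the function name predict is hypothetical), with feature 0 to feature 3 taken as the four iris measurements in file order. The mapping of indices to class names (0.0 setosa, 1.0 versicolor, 2.0 virginica) is inferred from the probability vectors in the listing, not read from the model.

```python
def predict(features):
    # Hand transcription of the toDebugString output above; returns the label index
    f0, f1, f2, f3 = features
    if f2 <= 2.35:
        return 0.0
    if f3 <= 1.75:
        if f2 <= 5.05:
            if f0 <= 4.95:
                return 1.0 if f1 <= 2.45 else 2.0
            return 1.0
        return 1.0 if f0 <= 6.05 else 2.0
    if f2 <= 4.85:
        return 1.0 if f0 <= 5.95 else 2.0
    return 2.0


# Rows taken from the prediction listing above
print(predict([4.3, 3.0, 1.1, 0.1]))  # listed as Iris-setosa
print(predict([6.0, 2.2, 5.0, 1.5]))  # true label Iris-virginica, listed as predicted Iris-versicolor
print(predict([5.8, 2.8, 5.1, 2.4]))  # listed as Iris-virginica
```

Tracing the second row shows why it is misclassified: its petal width of 1.5 sends it down the feature 3 <= 1.75 branch, where a petal length of 5.0 and a sepal length of 6.0 lead to the leaf that predicts 1.0.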