Spark (1.1) MLlib source code analysis
Published: 2019-06-07


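Spark MLlib 1.1 added the stat package, which provides a number of statistics-related functions. This article looks at the chi-squared test in that package: the principle behind it and how it is implemented.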

 

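I. Basic principles

  The stat package implements Pearson's chi-squared test, which covers the following kinds of test:

    (1) Goodness of fit test: checks whether the observed frequency distribution of a set of observations differs from a theoretical distribution.

    (2) Independence test: checks whether paired observations drawn from two variables are independent of each other (for example, draw one person from country A and one from country B each time, and check whether their responses are independent of nationality).

  Formula: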

\chi^2 =   \sum_{i=1}^{r} \sum_{j=1}^{c} {(O_{i,j} - E_{i,j})^2 \over E_{i,j}}.

    where O denotes the observed value and E the expected value.
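    As a quick sanity check of the formula, here is a small worked example with made-up numbers (illustrative only, not taken from the Spark source): a single row of three categories with observed counts 10, 20, 30, tested against a uniform expectation of 20 per category.

```latex
\chi^2 = \frac{(10-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(30-20)^2}{20} = 5 + 0 + 5 = 10
```

    With three categories, this test has 3 - 1 = 2 degrees of freedom.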

  For the detailed theory, see:

 

II. Java API usage example

  

 

III. Source code analysis

  1. External API

    The Statistics class exposes four external interfaces:

// Goodness of Fit test
def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult = {
  ChiSqTest.chiSquared(observed, expected)
}

// Goodness of Fit test
def chiSqTest(observed: Vector): ChiSqTestResult = ChiSqTest.chiSquared(observed)

// independence test
def chiSqTest(observed: Matrix): ChiSqTestResult = ChiSqTest.chiSquaredMatrix(observed)

// independence test
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
  ChiSqTest.chiSquaredFeatures(data)
}

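  2. Goodness of fit test implementation

  This one is fairly simple. The key step is computing the chi-squared value as the sum of (observed - expected)^2 / expected over all categories.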

/*
 * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
 * Uniform distribution is assumed when `expected` is not passed in.
 */
def chiSquared(observed: Vector,
    expected: Vector = Vectors.dense(Array[Double]()),
    methodName: String = PEARSON.name): ChiSqTestResult = {

  // Validate input arguments
  val method = methodFromString(methodName)
  if (expected.size != 0 && observed.size != expected.size) {
    throw new IllegalArgumentException("observed and expected must be of the same size.")
  }
  val size = observed.size
  if (size > 1000) {
    logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
      + s" as a result of a large number of categories: $size.")
  }
  val obsArr = observed.toArray
  // If expected is not set, each entry defaults to 1.0 / size
  val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
  // Both observed and expected values must be non-negative
  if (!obsArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
  }
  if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
  }

  // Determine the scaling factor for expected
  val obsSum = obsArr.sum
  val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
  val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

  // compute chi-squared statistic
  val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
    if (exp == 0.0) {
      if (obs == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
          + " to 0.0 values in both observed and expected.")
      } else {
        return new ChiSqTestResult(0.0, size - 1, Double.PositiveInfinity, PEARSON.name,
          NullHypothesis.goodnessOfFit.toString)
      }
    }
    // Compute (observed - expected)^2 / expected
    if (scale == 1.0) {
      stat + method.chiSqFunc(obs, exp)
    } else {
      stat + method.chiSqFunc(obs, exp * scale)
    }
  }
  val df = size - 1
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
}
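  The core arithmetic of chiSquared can be tried out without Spark. The snippet below is a simplified standalone sketch, not the MLlib code: the names GoodnessOfFitSketch, pearsonTerm, statistic and uniform are invented here for illustration. It always rescales expected to the observed total (MLlib skips the rescale when the two sums already agree within 1e-7), and it assumes strictly positive expected values rather than handling the zero cases.

```scala
object GoodnessOfFitSketch {
  /** One Pearson term: (observed - expected)^2 / expected. */
  def pearsonTerm(observed: Double, expected: Double): Double = {
    val dev = observed - expected
    dev * dev / expected
  }

  /** Uniform relative frequencies, used when no expected vector is supplied. */
  def uniform(size: Int): Array[Double] = Array.fill(size)(1.0 / size)

  /**
   * Chi-squared statistic for a goodness of fit test.
   * `expected` may hold counts or relative frequencies; it is rescaled so
   * that its total matches the observed total before the terms are summed.
   * Assumption of this sketch: every expected entry is strictly positive.
   */
  def statistic(observed: Array[Double], expected: Array[Double]): Double = {
    require(observed.length == expected.length, "observed and expected must be of the same size")
    require(observed.forall(_ >= 0.0), "negative entries disallowed in observed")
    require(expected.forall(_ > 0.0), "this sketch requires positive expected entries")
    val scale = observed.sum / expected.sum
    observed.zip(expected).map { case (o, e) => pearsonTerm(o, e * scale) }.sum
  }
}
```

  For example, GoodnessOfFitSketch.statistic(Array(10.0, 20.0, 30.0), GoodnessOfFitSketch.uniform(3)) first turns the uniform frequencies of 1/3 into expected counts of 20 and then sums the three Pearson terms.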

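  3. Independence test implementation

    First compute the expected values with the formula below, where the matrix has r rows and c columns and N is the sum of all entries: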

     E_{i,j}=\frac{\left(\sum_{n_c=1}^c O_{i,n_c}\right) \cdot\left(\sum_{n_r=1}^r O_{n_r,j}\right)}{N}
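    To make the expected-value formula concrete, take an illustrative 2 x 2 contingency table (numbers invented for this example) with rows (10, 20) and (30, 40). The row sums are 30 and 70, the column sums are 40 and 60, and N = 100, so:

```latex
E_{1,1} = \frac{30 \cdot 40}{100} = 12, \quad
E_{1,2} = \frac{30 \cdot 60}{100} = 18, \quad
E_{2,1} = \frac{70 \cdot 40}{100} = 28, \quad
E_{2,2} = \frac{70 \cdot 60}{100} = 42
```

    The expected values keep the same row and column sums as the observed table.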

    Then compute the chi-squared value as the sum of (observed - expected)^2 / expected over all cells.

/*
 * Pearson's independence test on the input contingency matrix.
 * TODO: optimize for SparseMatrix when it becomes supported.
 */
def chiSquaredMatrix(counts: Matrix, methodName:String = PEARSON.name): ChiSqTestResult = {
  val method = methodFromString(methodName)
  val numRows = counts.numRows
  val numCols = counts.numCols

  // get row and column sums
  val colSums = new Array[Double](numCols)
  val rowSums = new Array[Double](numRows)
  val colMajorArr = counts.toArray
  var i = 0
  while (i < colMajorArr.size) {
    val elem = colMajorArr(i)
    if (elem < 0.0) {
      throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    }
    colSums(i / numRows) += elem
    rowSums(i % numRows) += elem
    i += 1
  }
  val total = colSums.sum

  // second pass to collect statistic
  var statistic = 0.0
  var j = 0
  while (j < colMajorArr.size) {
    val col = j / numRows
    val colSum = colSums(col)
    if (colSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in column [$col].")
    }
    val row = j % numRows
    val rowSum = rowSums(row)
    if (rowSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in row [$row].")
    }
    val expected = colSum * rowSum / total
    statistic += method.chiSqFunc(colMajorArr(j), expected)
    j += 1
  }
  val df = (numCols - 1) * (numRows - 1)
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
}
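  The two passes over the column-major array are the part that is easiest to misread, so here is a minimal standalone sketch of the same idea without Spark. IndependenceSketch and its statistic method are hypothetical names used only for illustration. The sketch hard-codes the Pearson term instead of going through method.chiSqFunc, and it assumes every row and column sum is positive rather than throwing a dedicated error.

```scala
object IndependenceSketch {
  /**
   * Chi-squared statistic for a contingency table stored in column-major
   * order: element i belongs to column i / numRows and row i % numRows.
   * Assumption of this sketch: all row and column sums are positive.
   */
  def statistic(numRows: Int, numCols: Int, colMajor: Array[Double]): Double = {
    require(colMajor.length == numRows * numCols, "array size must equal numRows * numCols")
    require(colMajor.forall(_ >= 0.0), "contingency table cannot contain negative entries")

    // First pass: accumulate row and column sums.
    val rowSums = new Array[Double](numRows)
    val colSums = new Array[Double](numCols)
    for (i <- colMajor.indices) {
      rowSums(i % numRows) += colMajor(i)
      colSums(i / numRows) += colMajor(i)
    }
    require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0), "zero row or column sum")
    val total = colSums.sum

    // Second pass: expected = rowSum * colSum / total, then sum the Pearson terms.
    colMajor.indices.map { i =>
      val expected = rowSums(i % numRows) * colSums(i / numRows) / total
      val dev = colMajor(i) - expected
      dev * dev / expected
    }.sum
  }
}
```

  Note the storage order: a table with rows (1, 2) and (2, 4) is passed as Array(1.0, 2.0, 2.0, 4.0). Its rows are proportional, so every expected value equals the observed one and the statistic is zero.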

Copyright notice: this is an original blog article and may not be reproduced without permission.

Reposted from: https://www.cnblogs.com/zfyouxi/p/4731120.html
