DataFrameStatFunctions

org.apache.spark.sql.DataFrameStatFunctions

Statistical functions for a Dataset, reached via df.stat. Mirrors PySpark's DataFrame.stat (DataFrameStatFunctions).

 df.stat.corr("x", "y")
 df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
 df.stat.crosstab("a", "b").show()

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]

Calculates the approximate quantiles of a numerical column.

Value parameters

col

the column to compute quantiles for.

probabilities

quantile probabilities, each in [0.0, 1.0] (e.g. 0.5 is the median).

relativeError

the relative target precision; 0.0 yields exact quantiles (at high cost).

Attributes

Returns

the approximate quantiles, one per probability.
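
As a usage sketch, the helper below requests the three quartiles of one numeric column in a single call. The object name `QuantileSketch`, the helper `quartiles` and its default `relativeError` are hypothetical and only illustrate the call; the import assumes that `DataFrame` resolves from `org.apache.spark.sql`.

```scala
import org.apache.spark.sql.DataFrame

object QuantileSketch {
  // Hypothetical helper: returns (Q1, median, Q3) for a numeric column.
  // A relativeError of 0.01 trades a little accuracy for a cheaper computation.
  def quartiles(
      df: DataFrame,
      column: String,
      relativeError: Double = 0.01): Option[(Double, Double, Double)] =
    df.stat.approxQuantile(column, Array(0.25, 0.5, 0.75), relativeError) match {
      // One result per requested probability, in the order requested.
      case Array(q1, median, q3) => Some((q1, median, q3))
      // Defensive fallback in case no quantiles come back (for example, no usable rows).
      case _ => None
    }
}
```

Matching on the array shape rather than indexing into it keeps the helper from throwing if fewer values are returned than were requested.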

def approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]

Calculates the approximate quantiles of numerical columns.

Value parameters

cols

the columns to compute quantiles for.

probabilities

quantile probabilities, each in [0.0, 1.0].

relativeError

the relative target precision; 0.0 yields exact quantiles (at high cost).

Attributes

Returns

an array of quantile arrays, one inner array per column.

def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter

Builds a Bloom filter over the given column, sized for expectedNumItems items with a target false-positive probability fpp.
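
A minimal sketch of building a filter and probing it follows. It assumes the returned type is `org.apache.spark.util.sketch.BloomFilter` and that the column holds integral ids; the object and helper names, the item count and the `fpp` value are illustrative only.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterSketch {
  // Hypothetical helper: a filter sized for about one million ids with a 1% target fpp.
  // The Double third argument selects the (colName, expectedNumItems, fpp) overload.
  def buildIdFilter(df: DataFrame, idColumn: String): BloomFilter =
    df.stat.bloomFilter(idColumn, 1000000L, 0.01)

  // false means the id was definitely not seen; true means it was possibly seen.
  def maybeSeen(filter: BloomFilter, id: Long): Boolean =
    filter.mightContainLong(id)
}
```

A Bloom filter can report false positives but not false negatives, so a `false` answer can be used to skip work safely while a `true` answer still needs confirmation against the data.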

def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter

Builds a Bloom filter over the given Column, sized for expectedNumItems items with a target false-positive probability fpp.

def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter

Builds a Bloom filter over the given column with an explicit number of bits.

def corr(col1: String, col2: String): Double

Calculates the Pearson correlation coefficient of two columns.

Attributes

Returns

the correlation of col1 and col2.
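
For reference, the Pearson coefficient is the sum of the products of the paired deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The plain-Scala sketch below applies that formula to two small in-memory sequences. It illustrates the definition only and is not the code path `corr` uses; `PearsonSketch` and `pearson` are hypothetical names.

```scala
object PearsonSketch {
  // r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and of equal length")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val dx = xs.map(_ - mx) // deviations of x from its mean
    val dy = ys.map(_ - my) // deviations of y from its mean
    val numerator = dx.zip(dy).map { case (a, b) => a * b }.sum
    val denominator = math.sqrt(dx.map(d => d * d).sum * dy.map(d => d * d).sum)
    numerator / denominator // NaN when either input has zero variance
  }
}
```

The result lies in [-1.0, 1.0]: perfectly linear increasing data gives 1.0 and perfectly linear decreasing data gives -1.0, up to floating-point rounding.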

def corr(col1: String, col2: String, method: String): Double

Calculates the correlation of two columns using the given method.

Value parameters

method

the correlation method; currently only "pearson" is supported.

Attributes

Returns

the correlation of col1 and col2.

def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch

Builds a Count-Min Sketch over the given column with the given relative error (eps), confidence and random seed. The sketch is computed by a server-side aggregate and deserialized on the client.
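
The following sketch shows one way the returned object might be queried. It assumes the return type is `org.apache.spark.util.sketch.CountMinSketch` and that the column holds strings; the object and helper names, and the `eps`, `confidence` and `seed` values, are illustrative.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.CountMinSketch

object CountMinSketchExample {
  // Hypothetical helper: sketch a column once, then estimate how often `item` occurs.
  def estimateFrequency(df: DataFrame, column: String, item: String): Long = {
    // eps = 0.001 (relative error), confidence = 0.99, seed = 42
    val sketch: CountMinSketch = df.stat.countMinSketch(column, 0.001, 0.99, 42)
    // Count-Min estimates can overcount because of hash collisions, but do not undercount.
    sketch.estimateCount(item)
  }
}
```

In practice the sketch would be built once and reused for many lookups, since building it scans the column while each `estimateCount` call runs locally.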

def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch

Builds a Count-Min Sketch over the given column.

def cov(col1: String, col2: String): Double

Calculates the sample covariance of two numerical columns.

Attributes

Returns

the sample covariance of col1 and col2.
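
"Sample" covariance means the sum of the products of the paired deviations is divided by n - 1 rather than n. The plain-Scala sketch below makes that denominator explicit for in-memory data. It illustrates the definition only and is not how `cov` is implemented; `CovarianceSketch` and `sampleCov` are hypothetical names.

```scala
object CovarianceSketch {
  // Sample covariance: sum((x - mx) * (y - my)) / (n - 1)
  def sampleCov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need at least two paired observations")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val sumOfProducts = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    sumOfProducts / (xs.size - 1) // n - 1, not n
  }
}
```

For x = (1, 2, 3) and y = (2, 4, 6) the sum of products is 4, so the sample covariance is 4 / 2 = 2.0, whereas dividing by n would give 4 / 3.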

def crosstab(col1: String, col2: String): DataFrame

Computes a pairwise frequency table (contingency table) of the given columns.

Attributes

Returns

a Dataset containing the contingency table.

def freqItems(cols: Seq[String]): DataFrame

Finds frequent items for the given columns, using the default support of 0.01.

Attributes

Returns

a Dataset of frequent items per column.

def freqItems(cols: Seq[String], support: Double): DataFrame

Finds frequent items for the given columns.

Value parameters

support

the minimum frequency for an item to be considered frequent, in (0.0, 1.0].

Attributes

Returns

a Dataset of frequent items per column.
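
A hedged usage sketch for a single column follows. It assumes the behaviour of Apache Spark's implementation, where the result is a single row holding one array of frequent values per input column, and it assumes `Row.getSeq` is available on the returned rows. The object and helper names and the `support` value are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object FreqItemsSketch {
  // Hypothetical helper: values of `column` reported as frequent at the given support.
  def frequentValues(
      df: DataFrame,
      column: String,
      support: Double = 0.05): scala.collection.Seq[Any] = {
    // Assumption: one result row, with one array-typed field per requested column.
    val row = df.stat.freqItems(Seq(column), support).head()
    row.getSeq[Any](0)
  }
}
```

In Apache Spark the underlying algorithm may return false positives, so the returned values are best treated as candidates to verify with an exact count when precision matters.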

def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame

Returns a stratified sample without replacement, keyed by the values in col.

Value parameters

col

the column defining the strata.

fractions

a stratum -> sampling fraction mapping; fractions are in [0.0, 1.0].

seed

the random seed.

Attributes

Returns

a Dataset containing the stratified sample.
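
As an illustration, the sketch below samples a DataFrame by a hypothetical `plan` column with a different fraction per stratum. The column name, the stratum values and the fractions are invented for the example.

```scala
import org.apache.spark.sql.DataFrame

object StratifiedSampleSketch {
  // Hypothetical helper: keep roughly 10% of "free" rows, 50% of "pro" rows
  // and every "enterprise" row.
  def sampleByPlan(df: DataFrame, seed: Long = 42L): DataFrame = {
    val fractions: Map[String, Double] =
      Map("free" -> 0.1, "pro" -> 0.5, "enterprise" -> 1.0)
    // Fixing the seed makes the sample reproducible for the same input.
    df.stat.sampleBy("plan", fractions, seed)
  }
}
```

The fractions are per-row sampling probabilities, so the sampled counts per stratum are approximate rather than exact. In Apache Spark, a stratum missing from `fractions` is treated as having fraction zero, so its rows are dropped from the sample.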