DataFrameStatFunctions
Statistical functions for a Dataset, reached via df.stat. Mirrors PySpark's DataFrame.stat (DataFrameStatFunctions).
df.stat.corr("x", "y")
df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
df.stat.crosstab("a", "b").show()
Attributes
- Supertypes: class Object, trait Matchable, class Any
Members list
Value members
Concrete methods
Calculates the approximate quantiles of a numerical column.
Value parameters
- col: the column to compute quantiles for.
- probabilities: quantile probabilities, each in [0.0, 1.0] (e.g. 0.5 is the median).
- relativeError: the relative target precision; 0.0 yields exact quantiles (at high cost).
Attributes
- Returns: the approximate quantiles, one per probability.
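To make the meaning of the probabilities concrete, here is a minimal sketch of the exact case (what relativeError = 0.0 asks for), computed on a local collection. The helper `exactQuantiles` is hypothetical and not part of this class; the rank-based definition it uses is an assumption about how a quantile is chosen, not a statement of the server-side algorithm.

```scala
// Hypothetical helper, not part of DataFrameStatFunctions: exact quantiles by rank.
def exactQuantiles(values: Seq[Double], probabilities: Seq[Double]): Seq[Double] = {
  require(values.nonEmpty, "values must not be empty")
  require(probabilities.forall(p => p >= 0.0 && p <= 1.0),
    "each probability must be in [0.0, 1.0]")
  val sorted = values.sortWith(_ < _)
  probabilities.map { p =>
    // 1-based rank of the element that covers fraction p of the data.
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }
}

val data = (1 to 100).map(_.toDouble)
val quartiles = exactQuantiles(data, Seq(0.25, 0.5, 0.75))
```

An approximate implementation trades this exactness for speed: a larger relativeError lets the returned value sit at a rank further from the requested one.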
Calculates the approximate quantiles of numerical columns.
Value parameters
- cols: the columns to compute quantiles for.
- probabilities: quantile probabilities, each in [0.0, 1.0].
- relativeError: the relative target precision; 0.0 yields exact quantiles (at high cost).
Attributes
- Returns: an array of quantile arrays, one inner array per column.
Builds a Bloom filter over the given column, sized for expectedNumItems items with a target false-positive probability fpp.
Builds a Bloom filter over the given column for expectedNumItems and target fpp.
Builds a Bloom filter over the given column with an explicit number of bits.
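The relationship between expectedNumItems, fpp and an explicit number of bits can be sketched with the standard Bloom filter sizing formulas. The helpers below are hypothetical and only illustrate the arithmetic; whether this library sizes its filters with exactly these formulas is an assumption.

```scala
// Hypothetical helpers illustrating standard Bloom filter sizing; not this library's API.
def optimalNumBits(expectedNumItems: Long, fpp: Double): Long = {
  require(expectedNumItems > 0, "expectedNumItems must be positive")
  require(fpp > 0.0 && fpp < 1.0, "fpp must be in (0.0, 1.0)")
  // m = -n * ln(p) / (ln 2)^2
  (-expectedNumItems * math.log(fpp) / (math.log(2.0) * math.log(2.0))).toLong
}

def optimalNumHashFunctions(expectedNumItems: Long, numBits: Long): Int = {
  // k = (m / n) * ln 2, with at least one hash function
  math.max(1, math.round(numBits.toDouble / expectedNumItems * math.log(2.0)).toInt)
}

val bits = optimalNumBits(1000L, 0.01)
val hashes = optimalNumHashFunctions(1000L, bits)
```

Under these formulas a 1% false-positive target costs a little under ten bits per expected item, which is a useful rule of thumb when choosing between the fpp and explicit-bits variants.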
Calculates the Pearson correlation coefficient of two columns.
Attributes
- Returns: the correlation of col1 and col2.
Calculates the correlation of two columns.
Value parameters
- method: the correlation method; currently only "pearson" is supported.
Attributes
- Returns: the correlation of col1 and col2.
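For reference, the Pearson coefficient can be sketched on two local sequences as follows. The function `pearson` is a hypothetical illustration of the textbook formula, not the code path behind df.stat.corr.

```scala
// Hypothetical helper: textbook Pearson correlation on local data.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1, "need two equal-length samples")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // Sum of co-deviations and the two sums of squared deviations.
  val coDev = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val devX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
  val devY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
  require(devX > 0.0 && devY > 0.0, "both samples must have non-zero variance")
  coDev / (devX * devY)
}

val r = pearson(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0))
```

The result lies in [-1.0, 1.0]; a perfectly linear increasing relationship, as in the usage line, gives a value of 1.0 up to floating-point rounding.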
Builds a Count-Min Sketch over the given column with the given relative error (eps), confidence and random seed. The sketch is computed by a server-side aggregate and deserialized on the client.
Builds a Count-Min Sketch over the given column.
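How eps, confidence and the seed shape a Count-Min Sketch can be illustrated with a toy in-memory version. The class `TinyCountMinSketch` is hypothetical; the width and depth formulas are the commonly used ones and are assumed here, not taken from this library, and the hashing scheme is chosen only for brevity.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical toy sketch; not the server-side aggregate or its serialized form.
final class TinyCountMinSketch(eps: Double, confidence: Double, seed: Int) {
  require(eps > 0.0 && confidence > 0.0 && confidence < 1.0)
  // Assumed sizing: width grows as eps shrinks, depth grows with confidence.
  val width: Int = math.ceil(2.0 / eps).toInt
  val depth: Int = math.ceil(-Math.log1p(-confidence) / math.log(2.0)).toInt
  // One hash seed per row, derived deterministically from the user seed.
  private val rowSeeds: Array[Int] = {
    val rng = new Random(seed)
    Array.fill(depth)(rng.nextInt())
  }
  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, rowSeeds(row)), width)

  def add(item: String): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += 1L

  // Collisions can only inflate a counter, so the minimum never undercounts.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

val sketch = new TinyCountMinSketch(eps = 0.01, confidence = 0.95, seed = 42)
Seq("a", "b", "a", "c", "a").foreach(sketch.add)
val estimateA = sketch.estimate("a")
```

A smaller eps widens the table and reduces overcounting; a higher confidence adds rows. Fixing the seed makes the hash functions, and therefore the sketch, reproducible.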
Calculates the sample covariance of two numerical columns.
Attributes
- Returns: the sample covariance of col1 and col2.
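"Sample" covariance conventionally means dividing by n - 1 rather than n. A minimal local sketch under that assumption, using a hypothetical helper `sampleCovariance`:

```scala
// Hypothetical helper: sample covariance with the n - 1 denominator.
def sampleCovariance(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1, "need two equal-length samples")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  val coDev = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  coDev / (n - 1)
}

val c = sampleCovariance(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0))
```

In the usage line the co-deviations sum to 10.0 over four points, so the value is 10.0 / 3.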
Computes a pair-wise frequency table (contingency table) of the given columns.
Attributes
- Returns: a Dataset containing the contingency table.
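The shape of a contingency table can be sketched on local pairs: one row per distinct value of the first column, one count per distinct value of the second. The helper `contingencyCounts` is hypothetical, and filling absent pairs with zero is an assumption modeled on Spark's crosstab.

```scala
// Hypothetical helper: count how often each (first, second) pair occurs.
def contingencyCounts[A, B](pairs: Seq[(A, B)]): Map[(A, B), Long] =
  pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size.toLong }

val observed = Seq(("red", "S"), ("red", "S"), ("red", "M"), ("blue", "M"))
val counts = contingencyCounts(observed)

// Lay the counts out as rows; pairs that never occur get a zero.
val rowKeys = observed.map(_._1).distinct
val colKeys = observed.map(_._2).distinct
val table: Seq[(String, Seq[Long])] =
  rowKeys.map(r => r -> colKeys.map(c => counts.getOrElse((r, c), 0L)))
```

Here `table` holds one entry per row key, with counts ordered by `colKeys`.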
Finds frequent items for the given columns, with the default support 0.01.
Attributes
- Returns: a Dataset of frequent items per column.
Finds frequent items for the given columns.
Value parameters
- support: the minimum frequency for an item to be considered frequent, in (0.0, 1.0].
Attributes
- Returns: a Dataset of frequent items per column.
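The role of support can be sketched with an exact filter on a local collection. The helper `exactFrequentItems` is hypothetical; it keeps items whose relative frequency is at least support, which is an assumed reading of "minimum frequency". The distributed method may use an approximate algorithm whose results differ from this exact filter.

```scala
// Hypothetical helper: exact frequent items on local data.
def exactFrequentItems[A](items: Seq[A], support: Double): Set[A] = {
  require(support > 0.0 && support <= 1.0, "support must be in (0.0, 1.0]")
  // An item qualifies when its count reaches support * total.
  val threshold = support * items.length
  items.groupBy(identity).collect {
    case (item, occurrences) if occurrences.size >= threshold => item
  }.toSet
}

val frequent = exactFrequentItems(Seq("a", "a", "a", "b"), 0.5)
```

Lowering support admits rarer items: with 0.25 in the usage line, "b" would qualify as well.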
Returns a stratified sample without replacement, keyed by the values in col.
Value parameters
- col: the column defining the strata.
- fractions: a stratum -> sampling fraction mapping; fractions are in [0.0, 1.0].
- seed: the random seed.
Attributes
- Returns: a Dataset containing the stratified sample.
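Stratified sampling without replacement can be sketched as an independent keep-or-drop decision per row, using the fraction of that row's stratum. The helper `stratifiedSample` is hypothetical; treating a stratum missing from the map as fraction 0.0 is an assumption modeled on Spark's sampleBy.

```scala
import scala.util.Random

// Hypothetical helper: per-row Bernoulli sampling keyed by stratum.
def stratifiedSample[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
  require(fractions.values.forall(f => f >= 0.0 && f <= 1.0),
    "fractions must be in [0.0, 1.0]")
  val rng = new Random(seed)
  rows.filter { case (stratum, _) =>
    // Keep the row with probability equal to its stratum's fraction.
    rng.nextDouble() < fractions.getOrElse(stratum, 0.0)
  }
}

val rows = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 4)
val sample = stratifiedSample(rows, Map("a" -> 1.0, "b" -> 0.0), seed = 42L)
```

Because each row is decided independently, the fractions are expected proportions rather than exact ones; a fixed seed makes the selection reproducible for the same input order.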