Class: SparkConnect::GroupedData

Inherits:
Object
  • Object
show all
Defined in:
lib/spark_connect/grouped_data.rb

Overview

The result of DataFrame#group_by / DataFrame#rollup / DataFrame#cube. Call an aggregate (#agg, #count, #sum, #avg, #max, #min, ...) to produce a new DataFrame, optionally after #pivot.

Examples:

df.group_by("dept").agg(F.avg("salary").alias("avg_salary"), F.count("*"))
df.group_by("dept").pivot("year").sum("revenue")

Constant Summary collapse

Proto =
SparkConnect::Proto

Instance Method Summary collapse

Constructor Details

#initialize(df, grouping, group_type, pivot_col: nil, pivot_values: nil) ⇒ GroupedData

Returns a new instance of GroupedData.

Parameters:

  • df (DataFrame)
  • grouping (Array<Column>)

    grouping columns.

  • group_type (Symbol)

    a GROUP_TYPE_* enum symbol.

  • pivot_col (Column, nil) (defaults to: nil)
  • pivot_values (Array, nil) (defaults to: nil)


19
20
21
22
23
24
25
# File 'lib/spark_connect/grouped_data.rb', line 19

def initialize(df, grouping, group_type, pivot_col: nil, pivot_values: nil)
  @df = df
  @grouping = grouping
  @group_type = group_type
  @pivot_col = pivot_col
  @pivot_values = pivot_values
end

Instance Method Details

#agg(*columns) ⇒ DataFrame #agg(hash) ⇒ DataFrame

Compute aggregate expressions.

Overloads:

  • #agg(*columns) ⇒ DataFrame

    Parameters:

    • columns (Array<Column>)

      aggregate columns, e.g. F.sum("x").

  • #agg(hash) ⇒ DataFrame

    Parameters:

    • hash (Hash{String=>String})

      column-to-function map, e.g. {"age" => "max", "salary" => "avg"}.

Returns:



35
36
37
38
39
40
41
42
43
# File 'lib/spark_connect/grouped_data.rb', line 35

def agg(*exprs)
  agg_exprs =
    if exprs.size == 1 && exprs.first.is_a?(Hash)
      exprs.first.map { |col, fn| Column.invoke(fn.to_s, Functions.col(col.to_s)).to_expr }
    else
      exprs.flatten.map { |c| Column.to_col(c).to_expr }
    end
  build(agg_exprs)
end

#avg(*cols) ⇒ DataFrame Also known as: mean

Mean of each numeric column.

Returns:



57
# File 'lib/spark_connect/grouped_data.rb', line 57

def avg(*cols) = numeric_agg("avg", cols)

#countDataFrame

Count rows per group.

Returns:



47
48
49
# File 'lib/spark_connect/grouped_data.rb', line 47

def count
  build([Column.invoke("count", Column.lit(1)).alias("count").to_expr])
end

#max(*cols) ⇒ DataFrame

Maximum of each column.

Returns:



62
# File 'lib/spark_connect/grouped_data.rb', line 62

def max(*cols) = numeric_agg("max", cols)

#min(*cols) ⇒ DataFrame

Minimum of each column.

Returns:



66
# File 'lib/spark_connect/grouped_data.rb', line 66

def min(*cols) = numeric_agg("min", cols)

#pivot(pivot_col, values = nil) ⇒ GroupedData

Pivot a column into multiple output columns.

Parameters:

  • pivot_col (String, Column)
  • values (Array, nil) (defaults to: nil)

    optional explicit pivot values (faster, deterministic).

Returns:



73
74
75
76
77
# File 'lib/spark_connect/grouped_data.rb', line 73

def pivot(pivot_col, values = nil)
  GroupedData.new(@df, @grouping, :GROUP_TYPE_PIVOT,
                  pivot_col: Column.to_col(pivot_col.is_a?(String) ? Functions.col(pivot_col) : pivot_col),
                  pivot_values: values)
end

#sum(*cols) ⇒ DataFrame

Sum of each numeric column (or all numeric columns when none given).

Returns:



53
# File 'lib/spark_connect/grouped_data.rb', line 53

def sum(*cols) = numeric_agg("sum", cols)