Types & Schemas

The SparkConnect::Types module models the Spark SQL type system. Every type is an instance of a SparkConnect::Types::DataType subclass.

Constructing types

Convenience constructors live directly on the module:

T = SparkConnect::Types

T.long; T.integer; T.short; T.byte
T.double; T.float; T.decimal(10, 2)
T.string; T.binary; T.boolean
T.date; T.timestamp; T.timestamp_ntz
T.array(T.string)
T.map(T.string, T.long)
T.struct(
  T.field("id", T.long, nullable: false),
  T.field("name", T.string),
  T.field("tags", T.array(T.string)),
)

Struct types and schemas

A DataFrame’s schema is a StructType. You can build one fluently:

schema = T.struct
          .add("id", T.long, nullable: false)
          .add("name", T.string)

schema.names                #=> ["id", "name"]
schema["name"].data_type    #=> #<SparkConnect::Types::StringType string>
schema.simple_string        #=> "struct<id:bigint,name:string>"

Rendering

Method Example output
simple_string array<string>, decimal(10,2), struct<a:int>
type_name integer, long, string
json Spark’s JSON schema representation
tree_string indented schema tree (used by print_schema)
df.print_schema
# root
#  |-- id: long (nullable = false)
#  |-- name: string (nullable = true)

df.schema.simple_string
df.dtypes   #=> [["id", "bigint"], ["name", "string"]]

DDL strings

Anywhere a schema is accepted you may pass a DDL string instead of a StructType:

spark.read.schema("id BIGINT, name STRING").csv("people.csv")
spark.create_data_frame(rows, "id BIGINT, name STRING")

Ruby <-> Spark value mapping

When building literals and local DataFrames, Ruby values map to Spark types as follows (and the inverse applies when decoding results):

Ruby Spark type
nil null
true / false boolean
Integer (32-bit range) int
Integer (larger) bigint (long)
Float double
BigDecimal decimal
String (UTF-8) string
String (ASCII-8BIT) binary
Symbol string
Time / DateTime timestamp
Date date
Array array
Hash map

Results decode back into SparkConnect::Row objects, with structs as Row/Hash, arrays as Array, maps as Hash, decimals as BigDecimal, timestamps as Time, and dates as Date.