Schema

The schema format, consumed in the TypingTransform and other stages, is an opinionated format for specifying common data typing actions.

It is designed to:

  • Allow precise definition of how to perform common data typing conversions found in business datasets.
  • Support limited Schema Evolution of source data in the form of allowed lists of accepted input formats.
  • Specification of metadata to attach to columns.

Common

Attributes

Attribute Type Required Description
id String false A optional unique identifier for this field. Ideally this is a GUID and should remain constant even when changing field attributes over time.
name String true The field name.
description String false A description of the field which will be embedded in the dataset metadata (and persisted in formats like Parquet/ORC).
type String true The data type to convert the field to. Supported values: boolean, date, decimal, double, integer, long, string, timestamp.
metadata Object false Metadata to attach to the column.

Basic JSON types string, double, long, object and array are supported.

With array types the values must be all of the same type (i.e. [true, false] works but [true, 0] will not).

Examples

{
  "name" : "first_name",
  "description" : "Customer First Name",
  "type" : "string",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Array

An array defines a schema for a repeated list of elements. It requires the specification of the schema of the nested elements which must all be of the same type.

Additional Attributes

Attribute Type Required Description
elementType schema true The schema of the nested list of elements.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.

Examples

{
  "name" : "customer_phone_numbers",
  "description" : "Customer Phone Numbers",
  "type" : "array",
  "metadata": {},
  "elementType": {
    "name" : "phone_number",
    "description" : "Phone Number",
    "type" : "string",
    "trim" : true,
    "nullable" : true,
    "nullableValues" : [ "", "null" ]
  }
}

Binary

Additional Attributes

Attribute Type Required Description
encoding String true The binary-to-text encoding format of the value. Valid values base64, hexadecimal.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "id",
  "description" : "GUID identifier",
  "type" : "binary",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "encoding" : "base64",
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Boolean

Additional Attributes

Attribute Type Required Description
trueValues Array[String] true A list of values which are considered as true. Try to order this list so the values are arranged from most frequent to least frequent.
falseValues Array[String] true A list of values which are considered as false. Try to order this list so the values are arranged from most frequent to least frequent.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "marketing_opt_in_flag",
  "description" : "Whether the customer has opted in to receive marketing communications.",
  "type" : "boolean",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "trueValues" : [ "true", "TRUE", "t", "1" ],
  "falseValues" : [ "false", "FALSE", "f", "0" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Date

Date vs Timestamp

This class does not store or represent a time or time-zone. Instead, it is a description of the date, as used for birthdays. It cannot represent an instant on the time-line without additional information such as an offset or time-zone.

This means that if users will be executing SQL statements which have conditional logic based on date comparisons (such as WHERE [date] < CURRENT_DATE()) then it is safer to use a Timestamp with a hard-coded time component for that source data so you get consistent results regardless of which time zone your users are located.

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "birth_date",
  "description" : "Customer Birth Date",
  "type" : "date",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters" : [ "uuuuMMdd", "uuuu-MM-dd" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Decimal

Additional Attributes

Attribute Type Required Description
precision Integer true The total number of digits. e.g. 1234.567 has a precision of 7.
scale Integer true The number of digits in the fraction part. e.g. 1234.567 has a scale of 3.
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0.###;-#,##0.###
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "account_balance",
  "description" : "The current account balance",
  "type" : "decimal",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "precision": 10,
  "scale": 2,
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Double

A Double is a double-precision 64-bit IEEE 754 floating point number.

Double vs Decimal

A Decimal should be used whenever precision is required or for numbers which must sum up correctly or balance, e.g. monetary transactions.

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0.###;-#,##0.###
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "last_meter_reading",
  "description" : "The last reading from the customer power meter.",
  "type" : "double",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Integer

Use Integer when dealing with values up to ±2 billion (-231 to +231-1)

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0;-#,##0
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "update_count",
  "description" : "Number of updates to this customer record.",
  "type" : "integer",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Long

Use a Long Integer when dealing with values greater than ±2 billion (-263 to +263-1)

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0;-#,##0
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "transaction_num",
  "description" : "Global transaction sequence number.",
  "type" : "long",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

String

Additional Attributes

Attribute Type Required Description
minLength Integer false The minimum length of the string value.
maxLength Integer false The maximum length of the string value.
regex String false A regular expression to validate the input value against e.g. [a-z]* would match a value made of only lowercase alphabet characters.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "first_name",
  "description" : "Customer First Name",
  "type" : "string",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Struct

A struct is a nested field similar to a map in most programming languages. It can be used to support complex data types.

Additional Attributes

Attribute Type Required Description
fields Array true A list of schema fields.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.

Examples

{
  "name" : "customer_name",
  "description" : "Customer Name",
  "type" : "struct",
  "nullable" : true,
  "metadata": {
    "primaryKey" : true,
    "position": 1
  },
  "fields": [
    {
      "name" : "first_name",
      "description" : "Customer First Name",
      "type" : "string",
      "trim" : true,
      "nullable" : true,
      "nullableValues" : [ "", "null" ],
      "metadata": {
        "primaryKey" : true,
        "position": 1
      }
    },
    {
      "name" : "last_name",
      "description" : "Customer Last Name",
      "type" : "string",
      "trim" : true,
      "nullable" : true,
      "nullableValues" : [ "", "null" ],
      "metadata": {
        "primaryKey" : true,
        "position": 1
      }
    }
  ]
}

Time

TimeType

Spark does not have an internal TimeType representation of time. This type can be used to ensure time values are able to be successfully parsed as LocalTime objects but they are always stored in Spark as string formatted in the standard HH:mm:ss format type meaning they can safely used in the to_utc_timestamp SQL function (but be very careful with timezone offsets). If they cannot be parsed then an error will be inserted into the _errors array like all other types.

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "transaction_time",
  "description" : "Time of the database transaction",
  "type" : "time",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters" : [ "HHmmss" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Timestamp

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Custom formats ssssssssss and sssssssssssss have been added to support epoch time (i.e. 1527727035) and epoch millis time (i.e. 1527727035456) respectively. Both require timezoneId of UTC.
timezoneId String true The timezone of the incoming timestamp. This uses the SimpleDateFormat supported timezones. All timestamps are internally stored in UTC to allow correctly sequenced events when dealing with events from multiple systems which may all run with different internal timezones.
time Map[String, Integer] false Use this capability if converting a Date label into a Timestamp for relative comparisons. Required fields are hour, minute, second and nano . These values can be agreed with source data suppliers to ensure intra-system data alignment. See below for example.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.

Examples

{
  "name" : "create_date",
  "description" : "Customer Creation Date",
  "type" : "timestamp",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters": [
    "dd/MM/uuuu HH:mm:ss",
    "dd/MM/uuuu H:mm:ss",
    "dd/MM/uuuu HH:mm",
    "dd/MM/uuuu H:mm",
    "d/MM/uuuu HH:mm:ss",
    "d/MM/uuuu H:mm:ss",
    "d/MM/uuuu HH:mm",
    "d/MM/uuuu H:mm"
  ],
  "timezoneId": "+1000",
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

For converting a Date label into a Timestamp supply the time key:

{
  "name" : "create_date",
  "description" : "Customer Creation Date",
  "type" : "timestamp",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters": [
    "dd/MM/uuuu",
  ],
  "timezoneId": "Australia/Sydney",
  "time": {
    "hour": 23,
    "minute": 59,
    "second": 59,
    "nano": 0,
  },
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}