Metadata
The metadata
format, consumed in the TypingTransform stage, is an opinionated format for specifying common data typing actions.
It is designed to:
- Support common data typing conversions found in business datasets.
- Support limited ‘schema evolution’ of source data in the form of allowed lists of accepted input formats.
- Collect errors into array columns so that a user can decide how to handle errors once all have been collected.
Common
Attributes
Attribute | Type | Required | Description |
---|---|---|---|
id | String | true | A unique identifier for this field. Ideally this is a GUID and should remain constant even when changing field attributes over time. |
name | String | true | The field name. |
description | String | false | A description of the field which will be embedded in the dataset metadata (and persisted in formats like Parquet/ORC). |
type | String | true | The data type to convert the field to. Supported values: boolean , date , decimal , double , integer , long , string , timestamp . |
trim | Boolean | true | Trim the field prior to data typing attempts. |
nullable | Boolean | true | Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met. |
nullableValues | Array[String] | false | Values which when found will be converted to null . This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check. |
nullReplacementValue | String | false | An optional value that a null value input value is replaced with before typing. |
metadata | Object | false | Metadata to attach to the column after TypingTransform .These values are limited to the basic JSON types string , double , long and array .With array types the values must be all of the same type (i.e. [true, false] works but [true, 0] will not) and can only be values of basic types (string , double , long ). |
Examples
{
"id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
"name" : "first_name",
"description" : "Customer First Name",
"type" : "string",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Binary
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
encoding | String | true | The binary-to-text encoding format of the value. Valid values base64 , hexadecimal . |
Examples
{
"id" : "982cbf60-7ba7-4e50-a09b-d8624a5c49e6",
"name" : "id",
"description" : "GUID identifier",
"type" : "binary",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"encoding" : "base64",
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Boolean
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
trueValues | Array[String] | true | A list of values which are considered as true. Try to order this list so the values are arranged from most frequent to least frequent. |
falseValues | Array[String] | true | A list of values which are considered as false. Try to order this list so the values are arranged from most frequent to least frequent. |
Examples
{
"id" : "982cbf60-7ba7-4e50-a09b-d8624a5c49e6",
"name" : "marketing_opt_in_flag",
"description" : "Whether the customer has opted in to receive marketing communications.",
"type" : "boolean",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"trueValues" : [ "true", "TRUE", "t", "1" ],
"falseValues" : [ "false", "FALSE", "f", "0" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Date
Date vs Timestamp
This class does not store or represent a time or time-zone. Instead, it is a description of the date, as used for birthdays. It cannot represent an instant on the time-line without additional information such as an offset or time-zone.
This means that if users will be executing SQL statements which have conditional logic based on date comparisons (such as WHERE [date] < CURRENT_DATE()
) then it is safer to use a Timestamp with a hard-coded time component for that source data so you get consistent results regardless of which time zone your users are located.
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | true | The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent. |
Examples
{
"id" : "0e8109ba-1000-4b7d-8a4c-b01bae07027f",
"name" : "birth_date",
"description" : "Customer Birth Date",
"type" : "date",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"formatters" : [ "uuuuMMdd", "uuuu-MM-dd" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Decimal
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
precision | Integer | true | The total number of digits. e.g. 1234.567 has a precision of 7. |
scale | Integer | true | The number of digits in the fraction part. e.g. 1234.567 has a scale of 3. |
formatters | Array[String] | false | The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent. Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.Default: #,##0.###;-#,##0.### |
Examples
{
"id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
"name" : "account_balance",
"description" : "The current account balance",
"type" : "decimal",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"precision": 10,
"scale": 2,
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Double
A Double
is a double-precision 64-bit IEEE 754 floating point number.
Double vs Decimal
A Decimal should be used whenever precision is required or for numbers which must sum up correctly or balance, e.g. monetary transactions.
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | false | The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent. Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.Default: #,##0.###;-#,##0.### |
Examples
{
"id" : "31541ea3-5b74-4753-857c-770bd601c35b",
"name" : "last_meter_reading",
"description" : "The last reading from the customer power meter.",
"type" : "double",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Integer
Use Integer when dealing with values up to ±2 billion (-231 to +231-1)
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | false | The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent. Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.Default: #,##0;-#,##0 |
Examples
{
"id" : "a66f3bbe-d1c6-44c7-b096-a4be59fdcd78",
"name" : "update_count",
"description" : "Number of updates to this customer record.",
"type" : "integer",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Long
Use a Long Integer when dealing with values greater than ±2 billion (-263 to +263-1)
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | false | The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent. Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.Default: #,##0;-#,##0 |
Examples
{
"id" : "1c0eec1d-17cd-45da-8744-7a9ef5b8b086",
"name" : "transaction_num",
"description" : "Global transaction sequence number.",
"type" : "long",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
String
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
minLength | Integer | false | The minimum length of the string value. |
maxLength | Integer | false | The maximum length of the string value. |
Examples
{
"id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
"name" : "first_name",
"description" : "Customer First Name",
"type" : "string",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"primaryKey" : false,
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Time
TimeType
Spark does not have an internal TimeType
representation of time. This type can be used to ensure time values are able to be successfully parsed as LocalTime objects but they are always stored in Spark as string
formatted in the standard HH:mm:ss
format type meaning they can safely used in the to_utc_timestamp SQL function (but be very careful with timezone offsets). If they cannot be parsed then an error will be inserted into the _errors
array like all other types.
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | true | The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent. |
Examples
{
"id" : "0f5162ce-64ca-409d-abd1-f0b5bb5830de",
"name" : "transaction_time",
"description" : "Time of the database transaction",
"type" : "time",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"formatters" : [ "HHmmss" ],
"metadata": {
"primaryKey" : true,
"position": 1
}
}
Timestamp
Additional Attributes
Attribute | Type | Required | Description |
---|---|---|---|
formatters | Array[String] | true | The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent. Custom formats ssssssssss and sssssssssssss have been added to support epoch time (i.e. 1527727035) and epoch millis time (i.e. 1527727035456) respectively. Both require timezoneId of UTC . |
timezoneId | String | true | The timezone of the incoming timestamp. This uses the SimpleDateFormat supported timezones. All timestamps are internally stored in UTC to allow correctly sequenced events when dealing with events from multiple systems which may all run with different internal timezones. |
time | Map[String, Integer] | false | Use this capability if converting a Date label into a Timestamp for relative comparisons. Required fields are hour , minute , second and nano . These values can be agreed with source data suppliers to ensure intra-system data alignment. See below for example. |
Examples
{
"id" : "8e42c8f0-22a8-40db-9798-6dd533c1de36",
"name" : "create_date",
"description" : "Customer Creation Date",
"type" : "timestamp",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"formatters": [
"dd/MM/uuuu HH:mm:ss",
"dd/MM/uuuu H:mm:ss",
"dd/MM/uuuu HH:mm",
"dd/MM/uuuu H:mm",
"d/MM/uuuu HH:mm:ss",
"d/MM/uuuu H:mm:ss",
"d/MM/uuuu HH:mm",
"d/MM/uuuu H:mm"
],
"timezoneId": "+1000",
"metadata": {
"primaryKey" : true,
"position": 1
}
}
For converting a Date
label into a Timestamp
supply the time
key:
{
"id" : "8e42c8f0-22a8-40db-9798-6dd533c1de36",
"name" : "create_date",
"description" : "Customer Creation Date",
"type" : "timestamp",
"trim" : true,
"nullable" : true,
"nullableValues" : [ "", "null" ],
"formatters": [
"dd/MM/uuuu",
],
"timezoneId": "Australia/Sydney",
"time": {
"hour": 23,
"minute": 59,
"second": 59,
"nano": 0,
},
"metadata": {
"primaryKey" : true,
"position": 1
}
}