Partials
Authentication
The Authentication map defines the authentication parameters for connecting to a remote service (e.g. HDFS or Blob Storage).
Parameters
Attribute | Type | Required | Description |
---|---|---|---|
method | String | true | The method used to authenticate with the remote service. One of AzureSharedKey, AzureSharedAccessSignature, AzureDataLakeStorageToken, AzureDataLakeStorageGen2AccountKey, AzureDataLakeStorageGen2OAuth, AmazonAccessKey, AmazonAnonymous, AmazonIAM, GoogleCloudStorageKeyFile. |
accountName | String | false* | Required for AzureSharedKey and AzureSharedAccessSignature. |
signature | String | false* | Required for AzureSharedKey. |
container | String | false* | Required for AzureSharedAccessSignature. |
token | String | false* | Required for AzureSharedAccessSignature. |
clientID | String | false* | Required for AzureDataLakeStorageToken. |
refreshToken | String | false* | Required for AzureDataLakeStorageToken. |
accountName | String | false* | Required for AzureDataLakeStorageGen2AccountKey. |
accessKey | String | false* | Required for AzureDataLakeStorageGen2AccountKey. |
clientID | String | false* | Required for AzureDataLakeStorageGen2OAuth. |
secret | String | false* | Required for AzureDataLakeStorageGen2OAuth. |
directoryID | String | false* | Required for AzureDataLakeStorageGen2OAuth. |
accessKeyID | String | false* | Required for AmazonAccessKey. |
secretAccessKey | String | false* | Required for AmazonAccessKey. |
accessKeyID | String | false* | Required for AmazonIAM. |
encryptionAlgorithm | String | false* | The bucket encryption algorithm: SSE-S3, SSE-KMS, SSE-C. Optional for AmazonIAM. |
kmsArn | String | false* | The Key Management Service Amazon Resource Name to use when encryptionAlgorithm is SSE-KMS, e.g. arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab. Optional for AmazonIAM. |
customKey | String | false* | The key to use with Customer-Provided Encryption Keys (SSE-C). Optional for AmazonIAM. |
endpoint | String | false | Sets the S3 endpoint for services such as Ceph Object Store or Minio. Optional for AmazonAccessKey. |
sslEnabled | Boolean | false | Whether to use SSL. Optional for AmazonAccessKey. |
projectID | String | false* | Required for GoogleCloudStorageKeyFile. |
keyFilePath | String | false* | Required for GoogleCloudStorageKeyFile. |
Examples
{
"type": "DelimitedExtract",
...
"authentication": {
"method": "AzureSharedKey",
"accountName": "myaccount",
"signature": "ctzMq410TV3wS7upTBcunJTDLEJwMAZuFPfr0mrrA08="
}
...
}
{
"type": "DelimitedExtract",
...
"authentication": {
"method": "AzureSharedAccessSignature",
"accountName": "myaccount",
"container": "mycontainer",
"token": "sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D"
}
...
}
{
"type": "DelimitedExtract",
...
"authentication": {
"method": "AmazonAccessKey",
"accessKeyID": "AKIAIOSFODNN7EXAMPLE",
"secretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"endpoint": "http://minio:9000"
}
...
}
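The per-method requirements in the Parameters table can be checked mechanically before a job is submitted. The following is a minimal Python sketch, not part of Arc: the helper name `missing_auth_attributes` is hypothetical, and the lookup covers only a subset of the methods in the table (AmazonAnonymous and AmazonIAM are left out).

```python
# Required attributes per method, transcribed from the Parameters table.
# Only a subset of methods is covered; AmazonAnonymous and AmazonIAM are omitted.
REQUIRED_ATTRIBUTES = {
    "AzureSharedKey": ["accountName", "signature"],
    "AzureSharedAccessSignature": ["accountName", "container", "token"],
    "AzureDataLakeStorageToken": ["clientID", "refreshToken"],
    "AzureDataLakeStorageGen2AccountKey": ["accountName", "accessKey"],
    "AzureDataLakeStorageGen2OAuth": ["clientID", "secret", "directoryID"],
    "AmazonAccessKey": ["accessKeyID", "secretAccessKey"],
    "GoogleCloudStorageKeyFile": ["projectID", "keyFilePath"],
}


def missing_auth_attributes(authentication):
    """Return the required attributes absent from an authentication map.

    Hypothetical helper for illustration; Arc performs its own validation.
    """
    method = authentication.get("method")
    if method not in REQUIRED_ATTRIBUTES:
        raise ValueError(f"method not covered by this sketch: {method!r}")
    return [name for name in REQUIRED_ATTRIBUTES[method] if name not in authentication]


# An AzureSharedAccessSignature map that lacks its token attribute.
example = {
    "method": "AzureSharedAccessSignature",
    "accountName": "myaccount",
    "container": "mycontainer",
}
print(missing_auth_attributes(example))
```

An empty list means every attribute the table marks as required for that method is present; optional attributes such as endpoint are not checked.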
Environments
The Environments list specifies the environments under which the stage will be executed. For the stage to be executed, the list must contain the value of the ETL_CONF_ENV environment variable or the etl.config.environment spark-submit argument.
Examples
If a stage is to be executed in both production and testing, and the ETL_CONF_ENV environment variable is set to production or test, then the DelimitedExtract stage defined here will be executed. If ETL_CONF_ENV is set to something else, such as user_acceptance_testing, then the stage will not be executed and a warning message will be logged.
{
"type": "DelimitedExtract",
...
"environments": ["production", "test"],
...
}
A practical use case is to run additional stages in testing that, if they fail, prevent the job from being automatically deployed to production via Continuous Delivery:
{
"type": "ParquetExtract",
"name": "load the manually verified known good set of data from testing",
"environments": ["test"],
"outputView": "known_correct_dataset",
...
},
{
"type": "EqualityValidate",
"name": "ensure the business logic produces the same result as the known good set of data from testing",
"environments": ["test"],
"leftView": "newly_calculated_dataset",
"rightView": "known_correct_dataset",
...
}
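The environment matching described in this section can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not Arc's implementation: the function name `should_execute` and the logger name are hypothetical, and treating a stage without an environments list as non-matching is an assumption.

```python
import logging

logger = logging.getLogger("arc_sketch")  # hypothetical logger name


def should_execute(stage, current_environment):
    """Return True when the stage's environments list contains the current environment."""
    # Assumption: a stage without an environments list is treated as non-matching.
    environments = stage.get("environments", [])
    if current_environment in environments:
        return True
    # Mirrors the documented behaviour: the stage is skipped and a warning is logged.
    logger.warning(
        "skipping stage %r: environment %r not in %r",
        stage.get("type"), current_environment, environments,
    )
    return False


stage = {"type": "DelimitedExtract", "environments": ["production", "test"]}
print(should_execute(stage, "production"))
print(should_execute(stage, "user_acceptance_testing"))
```

In a real job the second argument would come from ETL_CONF_ENV or the etl.config.environment argument; here it is passed explicitly to keep the sketch self-contained.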
User Defined Functions
To help with common data tasks, Arc provides several functions in addition to the inbuilt Spark SQL Functions.
get_json_double_array
Since: 1.0.9
Similar to get_json_object, but extracts a JSON double array from path.
SELECT get_json_double_array('[0.1, 1.1]', '$')
get_json_integer_array
Since: 1.0.9
Similar to get_json_object, but extracts a JSON integer array from path.
SELECT get_json_integer_array('[1, 2]', '$')
get_json_long_array
Since: 1.0.9
Similar to get_json_object, but extracts a JSON long array from path.
SELECT get_json_long_array('[2147483648, 2147483649]', '$')
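The integer and long variants differ in the range of values they can hold: Spark's integer type is a 32-bit signed value and its long type is a 64-bit signed value, which is why the long example above uses 2147483648, one more than the largest 32-bit integer. The Python sketch below only illustrates that range difference; the helper name `smallest_integer_type` is hypothetical, and how Arc itself treats out-of-range values is not covered here.

```python
import json

# Signed ranges of Spark's integer (32-bit) and long (64-bit) types.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def smallest_integer_type(json_array):
    """Return 'integer' or 'long' depending on the range the array's values need."""
    values = json.loads(json_array)
    if all(INT32_MIN <= v <= INT32_MAX for v in values):
        return "integer"
    if all(INT64_MIN <= v <= INT64_MAX for v in values):
        return "long"
    raise ValueError("value outside the 64-bit signed range")


print(smallest_integer_type("[1, 2]"))                    # integer
print(smallest_integer_type("[2147483648, 2147483649]"))  # long
```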