Change Log
3.13.2
- Java 8 build
3.13.1
- release fix
3.13.0
- update to Spark 3.3.2
3.12.0
- update to Spark 3.3.0
3.11.1
- change `ETL_POLICY_DROP_UNSUPPORTED` default to `true`.
3.11.0
- bump to Spark 3.1.2.
3.10.0
- add `escape` option to `DelimitedExtract`.
3.9.0
- add `multiLine` option to `DelimitedExtract`.
- bump to Spark 3.0.3.
3.8.2
3.8.1
- FIX changes to allow execution with both Spark 3.0.2 and Spark 3.1.1.
- FIX logging when executing `LazyEvaluator` now contains the `child` attribute with details of the nested plugin.
3.8.0
- bump to Spark 3.1.1.
3.7.0
- refactor `JupyterCompleter` to allow users to specify `etl.config.completion.environments` or `ETL_CONF_COMPLETION_ENVIRONMENTS` with settings specific to their deployment model.
- log stage details when running in `lintOnly` mode, allowing parsing with tools like Open Policy Agent in CI/CD.
- change the `org.apache.hadoop:hadoop-aws` library to be a `provided` dependency.
- add the `hllRelativeSD` tuning parameter to the `HyperLogLogPlusPlus` count distinct algorithm used in `StatisticsExtract` and revert to the Spark default of `0.05` from `0.01`.
3.6.2
- FIX defect where IPYNB `%configexecute` did not provide the optional `outputView` parameter.
3.6.1
- add `JupyterCompleter` to Lifecycle Plugins `ChaosMonkey` and `ControlFlow`.
3.6.0
- add `ControlFlowExecute` and `ControlFlow` plugins to support work avoidance.
3.5.3
- FIX remove limitation that required field `metadata` names to be different to the field `name`.
3.5.2
- FIX reorder AWS Identity and Access Management providers (`com.amazonaws.auth.WebIdentityTokenCredentialsProvider`) to allow use of Arc on Amazon Elastic Kubernetes Service with IAM.
3.5.1
- FIX issue with the `TensorFlowServingTransform` stage not parsing the `batchSize` argument correctly.
3.5.0
- add `StatisticsExtract` stage.
- add `etl.config.lintOnly` (`ETL_CONF_LINT_ONLY`) option to only validate the configuration and not run the job.
- add validation that stage `id` values are unique within the job.
- FIX minor defect relating to the order of `etl.config.uri` vs `etl.config.environments` error messages.
3.4.1
- set name on DataFrame when `persist=true` to help understand persisted datasets when using the Spark UI.
- add better error messages to job failure if `lazy` evaluation is set.
- add support for `%configexecute` from Arc Jupyter notebooks.
3.4.0
- add `resolution` tag for all stages to indicate `lazy` or `strict` resolution of stage variables. This can be used with `ConfigExecute` to generate runtime specific configuration variables.
- add `ConfigExecute` stage to allow dynamic creation of runtime variables.
- FIX revert `DiffTransform` and `EqualityValidate` to use `sha2(to_json(struct()))` rather than the inbuilt Spark `hash` function due to the high likelihood of collisions (see the sketch below).
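The reverted hashing approach can be illustrated with plain Spark SQL functions. The following is a minimal sketch, not Arc's internal implementation; the 512-bit digest width follows the `sha2(512)` choice noted in the 1.0.8 entry, and the example DataFrame is invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sha2, struct, to_json}

// Minimal sketch of the sha2(to_json(struct())) row-fingerprint pattern described above.
// Not Arc's internal code; the DataFrame contents are invented for illustration.
object RowHashSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("RowHashSketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Serialise every column of the row to JSON, then hash it. A 512-bit SHA-2 digest
    // is far less collision-prone than Spark's 32-bit `hash` function.
    val hashed = df.withColumn("_hash", sha2(to_json(struct(df.columns.map(col): _*)), 512))
    hashed.show(false)

    spark.stop()
  }
}
```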
3.3.3
- FIX standardise `persist` behavior logic in `DiffTransform` to match other stages.
3.3.2
- FIX incorrect logic in `DiffTransform` and bad test for changes made in `3.3.1`.
3.3.1
- FIX `MetadataExtract` to export the full Arc schema (a superset of the previous schema) so that it can be used with `schemaView`.
- FIX `DiffTransform` will output the `left` and `right` structs in the `intersectionView` when `inputLeftKeys` or `inputRightKeys` are supplied.
3.3.0
- bump to Spark 3.0.1.
- add `inputLeftKeys` and `inputRightKeys` to `DiffTransform` to support matching on a subset of keys.
3.2.0
- rewrote the code to calculate `_index` (vs `_monotonically_increasing_id`) to be more efficient with large datasets.
- add optional `id` attribute to all stages which can be used to easily identify individual stages when monitoring logs.
- add the `schema` attribute to `TypingTransform` and `MetadataTransform` to allow an inline schema. Can be disabled via `etl.policy.inline.schema` and `ETL_POLICY_INLINE_SCHEMA`. These are being trialed and, if beneficial, will be added to all stages that support schemas.
- rename `etl.policy.inlinesql` to `etl.policy.inline.sql` and `ETL_POLICY_INLINESQL` to `ETL_POLICY_INLINE_SQL`.
- remove forced use of `etl.config.fs.gs.project.id`/`ETL_CONF_GOOGLE_CLOUD_PROJECT_ID` and `etl.config.fs.google.cloud.auth.service.account.json.keyfile`/`ETL_CONF_GOOGLE_CLOUD_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE` to access Google Cloud Storage job files.
- remove previous optimisation when reading a large number of small `json` files in `JSONExtract`. This is to better align with `DataSourceV2`.
- added `sql` attribute to `MetadataFilterTransform` and `MetadataValidate` allowing inline SQL statements.
- added support for scientific notation to `Integer` and `Long` when performing `TypingTransform`.
- FIX a non-threadsafe HashMap was used in string validation functions, resulting in non-deterministic hanging in the `TypingTransform`. This would happen more frequently with datasets containing many string columns. This commit replaces the HashMap with the threadsafe ConcurrentHashMap.
- BREAKING disable automatic dropping of unsupported types when performing `*Load` stages (e.g. `ParquetLoad` cannot support `NullType`). The old behavior can be enabled by setting `etl.policy.drop.unsupported`/`ETL_POLICY_DROP_UNSUPPORTED` to `true`.
- BREAKING remove deprecated `etl.config.environment.id` and `ETL_CONF_ENV_ID` in favor of `etl.config.tags` or `ETL_CONF_TAGS`.
3.1.1
- remove `spark.authenticate.secret` from log output.
- support nested `struct` and `array` types in the `makeMetadataFromDataframe` helper function used by the Arc Jupyter `%printmetadata` magic.
- minor tweaks to readers and writers to begin `DataSourceV2` support.
3.1.0
- add the `JupyterCompleter` trait for auto-completion in Jupyter, allowing `snippet`, `language` and `documentationURI` to be defined.
- add the `ExtractPipelineStage`, `TransformPipelineStage` and `LoadPipelineStage` traits to allow easier pattern matching in `LifecyclePlugins` (see the sketch below).
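A self-contained sketch of the kind of pattern matching these marker traits enable is shown below. The traits and case classes here are stand-ins defined for illustration; they are not the actual Arc API.

```scala
// Stand-in traits and stages, defined here for illustration only (not the Arc API),
// showing how marker traits make it easy to match on stage categories in a plugin.
object StageMatchSketch {
  trait PipelineStage { def name: String }
  trait ExtractPipelineStage extends PipelineStage
  trait TransformPipelineStage extends PipelineStage
  trait LoadPipelineStage extends PipelineStage

  case class ParquetExtractStage(name: String) extends ExtractPipelineStage
  case class SQLTransformStage(name: String) extends TransformPipelineStage

  // a lifecycle-plugin-style hook can branch on the stage category
  def describe(stage: PipelineStage): String = stage match {
    case s: ExtractPipelineStage   => s"${s.name} reads data"
    case s: TransformPipelineStage => s"${s.name} transforms data"
    case s: LoadPipelineStage      => s"${s.name} writes data"
    case s                         => s"${s.name} is another kind of stage"
  }

  def main(args: Array[String]): Unit = {
    println(describe(ParquetExtractStage("extract")))
    println(describe(SQLTransformStage("transform")))
  }
}
```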
3.0.0
- bump to Spark 3.0.0.
- bump to Hadoop 3.2.0.
- FIX `MLTransform` dropping all calculated columns when applying models which do not produce a prediction column.
- BREAKING remove `Scala 2.11` support as Arc is now built against `Spark 3.0.0` which does not support `Scala 2.11`.
- BREAKING move `XMLExtract` and `XMLLoad` to arc-xml-plugin.
- BREAKING Spark ML models trained with Spark 2.x do not work with Spark 3.x and will need to be retrained (`MLTransform`).
- BREAKING remove `GraphTransform` and `CypherTransform` as the underlying library has been abandoned.
2.14.0
This is the last release supporting `Scala 2.11` given the release of `Spark 3.0` which only supports `Scala 2.12`.
- add support for a case-insensitive formatter (default `true`) to allow formatter `MMM` to accept `JUL` and `Jul`, where case-sensitive will only accept `Jul`. Applies to `Date` and `Timestamp` schema columns. The boolean property `caseSensitive` can be used to set case-sensitive behavior (see the sketch below).
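The behavior described above can be illustrated with `java.time` directly. This is a minimal sketch of case-insensitive month parsing, not Arc's actual implementation.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.util.Locale

// Minimal sketch of case-insensitive `MMM` parsing with java.time; illustrative only,
// not Arc's implementation of the caseSensitive option.
object CaseInsensitiveFormatterSketch {
  def main(args: Array[String]): Unit = {
    // case-sensitive formatter: only accepts "Jul"
    val strict: DateTimeFormatter = DateTimeFormatter.ofPattern("ddMMMyyyy", Locale.ENGLISH)

    // case-insensitive formatter: accepts "JUL", "Jul" or "jul"
    val relaxed: DateTimeFormatter = new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern("ddMMMyyyy")
      .toFormatter(Locale.ENGLISH)

    println(LocalDate.parse("14Jul2020", strict))  // 2020-07-14
    println(LocalDate.parse("14JUL2020", relaxed)) // 2020-07-14
    // LocalDate.parse("14JUL2020", strict) would throw a DateTimeParseException
  }
}
```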
2.13.0
- bump to Spark 2.4.6.
2.12.5
- FIX support for `timestamp` formatters that include an offset (e.g. `+01:00`); these will override `timezoneId`, which remains mandatory.
2.12.4
- FIX rare edge-case of `TextLoad` in `singleFile` mode throwing a non-serializable exception when non-serializable `userData` exists.
2.12.3
- add support for parsing `array` objects when returned in the `message` field from `LogExecute` and `SQLValidate`.
2.12.2
- FIX `PipelineExecute` long-standing issue where errors in a nested pipeline were not being exposed correctly.
- FIX restore the ability for users to define `fs.s3a.aws.credentials.provider` overrides via `--conf spark.hadoop.fs.s3a.aws.credentials.provider=`.
2.12.1
- FIX `PipelineExecute` so that it will correctly identify `.ipynb` files and parse them correctly.
- add `get_uri_filename_array` user defined function which returns the contents of a Glob/URI as an `Array[(Array[Byte], String)]` where the second return value is the `filename`.
- remove `delay` versions of `get_uri` as this can be handled by the target service.
2.12.0
- FIX prevent early validation failure of SQL statements which contain `${hiveconf:` or `${hivevar:`.
- add `get_uri_array` user defined function which returns the contents of a Glob/URI as an `Array[Array[Byte]]`.
- add `get_uri_array_delay` which is the same as `get_uri_array` but adds a delay in milliseconds to reduce DDOS likelihood.
- add `LogExecute` which allows logging to the Arc log similar to `SQLValidate` but without the success/fail decision.
2.11.0
- add ability to define the Arc schema with a `schema` key (`{"schema": [...]}`) so that common attributes can be defined using Human-Optimized Config Object Notation (HOCON) functionality (see the sketch after this list).
- add ability to define `regex` when parsing string columns to perform validation.
- remove mandatory requirement to supply `trim`, `nullReplacementValue` and `nullableValues` for schemas that don't logically use them. This will not break existing configurations.
- change `DiffTransform` and `EqualityValidate` to use the inbuilt Spark `hash` function rather than `sha2(to_json(struct()))`.
- add `get_uri_delay` which is the same as `get_uri` but adds a delay in milliseconds to reduce DDOS likelihood.
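The `schema` key makes standard HOCON features such as substitutions available when declaring fields. The sketch below uses the Typesafe Config library directly to show the idea; the field names and values are invented for illustration, and only the top-level `{"schema": [...]}` shape comes from this entry.

```scala
import com.typesafe.config.ConfigFactory

// Illustrative sketch only: shared values declared once and substituted into each
// schema field via HOCON. Field contents are invented; only the top-level `schema`
// key reflects the changelog entry.
object SchemaKeySketch {
  def main(args: Array[String]): Unit = {
    val hocon =
      """
        |nullValues: ["", "null"]
        |schema: [
        |  {name: "customer_id", type: "long", nullableValues: ${nullValues}},
        |  {name: "first_name", type: "string", nullableValues: ${nullValues}}
        |]
      """.stripMargin

    val config = ConfigFactory.parseString(hocon).resolve()
    config.getConfigList("schema").forEach(field => println(field.getString("name")))
  }
}
```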
2.10.2
- FIX `get_uri` ability to read compressed file formats `.gzip`, `.bzip2`, `.lz4`.
- add `get_uri` ability to read from `http` and `https` sources.
2.10.1
- add the `get_uri` user defined function which returns the contents of a URI as an `Array[Byte]` which can be used with `decode` to convert to text (see the sketch below).
- rename `frameworkVersion` to `arcVersion` in initialisation logs for clarity.
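The following self-contained sketch shows the `get_uri` plus `decode` combination from SQL. The registered `get_uri` here is a local-file stand-in for illustration only; Arc's UDF additionally handles globs, `http(s)` and compressed sources.

```scala
import java.net.URI
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Stand-in get_uri (local files only) registered purely for illustration;
// not Arc's implementation.
object GetUriSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("GetUriSketch").getOrCreate()

    spark.udf.register("get_uri", (uri: String) => Files.readAllBytes(Paths.get(new URI(uri))))

    val path = Files.createTempFile("note", ".txt")
    Files.write(path, "hello arc".getBytes("UTF-8"))

    // decode converts the binary payload returned by get_uri into text
    spark.sql(s"SELECT decode(get_uri('${path.toUri}'), 'UTF-8') AS content").show(false)

    spark.stop()
  }
}
```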
2.10.0
- make `id` an optional field when specifying an Arc Schema.
- make `TextLoad` `singleFile` mode load in parallel from the Spark `executor` processes rather than the `driver` as they will likely have more RAM available.
- add option to supply `index` in addition to [`value`, `filename`] to ensure correct output ordering when in `singleFile` mode.
- add `singleFile` mode to `XMLLoad`.
- add `to_xml` UDF.
- add support for `struct` and `array` types in the schema definition file.
- add support for `TextExtract` to be supplied an `inputView`.
- FIX any deprecations preventing upgrade to Spark 3.0.
- deprecate `get_json_double_array`, `get_json_integer_array`, `get_json_long_array` in favor of the inbuilt `get_json_object`.
- ability to pass in config file via `http` or `https`.
- BREAKING remove `DataFramePrinter` lifecycle plugin as it presents too much risk of data leakage.
- BREAKING Amazon Web Services `authentication` methods will now limit their scope to specific buckets rather than global.
- BREAKING remove ability to read `.zip` files.
2.9.0
- FIX defect with `JSONExtract` not using `basePath` correctly.
- add ability to export to multiple files by providing a second `filename` column to `TextLoad` when set to `singleFile` mode, without creating a directory like the standard Spark behavior.
- add ability to provide an XML Schema Definition (XSD) to `XMLExtract` to validate input file schemas.
- add `blas` and `lapack` implementation to `MLTransform` logs to help debug performance.
2.8.1
- FIX defect with `MetadataTransform` logic.
- add `contentLength` to `HTTPExtract` response and logs.
2.8.0
- FIX defect which reported job failure when running in YARN mode.
- FIX defect relating to the Amazon S3 protocol to use when running on Amazon EMR.
- added `sql` attribute to `SQLTransform` and `SQLValidate` allowing inline SQL statements.
- added capability to replace the returned dataset via a `LifecyclePlugin` `after` hook.
- removed `tutorials` directory from the main arc repo as it is available via arc-starter.
- bump to Spark 2.4.5.
2.7.0
- added `ContainerCredentials` provider to the resolver list allowing IAM roles to be accessed by Arc jobs running inside an Amazon ECS container (specified via `taskRoleArn` in ECS).
- added `AmazonAnonymous` mode to the default provider list meaning users do not have to specify it manually.
- enhanced file based `*Extract` to throw richer error messages when files are not found.
2.6.0
- provided ability for job configuration files to be retrieved via `AmazonIAM` (by default) in addition to the existing `AccessKey` and `AmazonAnonymous` methods.
2.5.0
- enhanced `PipelineExecute` to allow execution of nested Lifecycle Plugins.
- changed Stack Trace logging to be opt-in when errors occur (default `false`) via parameters `ETL_CONF_ENABLE_STACKTRACE` and `etl.config.enableStackTrace`.
2.4.0
- FIX defect in `DelimitedExtract` when running in streaming mode.
- added `MetadataExtract` stage which creates an Arc `metadata` dataframe from an input view.
- added `MetadataTransform` stage which attaches/overrides the metadata attached to an input view.
- added `MetadataValidate` stage which allows runtime rules to be applied against an input view's metadata.
- added ability to include `%configplugins` when defining arc-jupyter notebook files (`.ipynb`).
- rewrote tutorial to point to public datasets rather than requiring the user to download data first.
2.3.1
- will now throw exceptions if trying to use the Amazon `s3://` or `s3n://` protocols instead of `s3a://` as they have been deprecated by the Hadoop project, are no longer supported in Hadoop 3.0+ and do not behave predictably with the Arc Amazon Authentication methods.
- will now throw a clearer message when typing conversion fails on a non-nullable field (which means the job cannot logically proceed).
2.3.0
- added `AmazonAnonymous` (for public buckets) and `AmazonIAM` (allowing users to specify encryption method) Authentication methods.
- added initial support for `runStage` to Lifecycle Plugins to support early job exit with success when certain criteria are met.
2.2.0
- add ability to execute arc-jupyter notebook files (.ipynb) directly without conversion to arc ‘job’.
- add `watermark` to `DelimitedExtract`, `ImageExtract`, `JSONExtract`, `ORCExtract`, `ParquetExtract` and `TextExtract` for structured streaming.
- performance and usability improvements for `SimilarityJoinTransform`. Can now cope with null inputs and performs caching to prevent recalculation of input data.
- FIX issue where malformed job configuration files would not error and the job would exit with success.
2.1.0
- add `SimilarityJoinTransform`, a stage which performs a fuzzy match and can be used for dataset deduplication or approximate joins.
- add missing types `BooleanList`, `Double`, `DoubleList`, `LongList` to the config reader.
BREAKING
- change API for `LifecyclePlugin` to pass the stage index and the full job pipeline so that the current and other stages can be accessed in the plugin.
2.0.1
- update to Spark 2.4.4.
- update to Scala `2.12.9`.
2.0.0
Arc 2.0.0 is a major (breaking) change which has been made for multiple reasons:
- to support both `Scala 2.11` and `Scala 2.12` as they are not binary compatible and the Spark project is moving to `Scala 2.12`. Arc will be published for both `Scala 2.11` and `Scala 2.12`.
- to decouple stages/extensions reliant on third-party packages from the main repository so that Arc is not dependent on a library which does not yet support `Scala 2.12` (for example).
- to support first-class plugins by providing a better API to allow the same type-safety when reading the job configuration as the core Arc pipeline stages (in fact all the core stages have been rebuilt as included plugins). This extends to allowing version number specification in stage resolution.
BREAKING
REMOVED
- remove `AzureCosmosDBExtract` stage. This could be reimplemented as a Lifecycle Plugin.
- remove `AzureEventHubsLoad` stage. This could be reimplemented as a Lifecycle Plugin.
- remove `DatabricksDeltaExtract` and `DatabricksDeltaLoad` stages and replace with the open-source DeltaLake versions (`DeltaLakeExtract` and `DeltaLakeLoad`) implemented at https://github.com/tripl-ai/arc-deltalake-pipeline-plugin.
- remove `DatabricksSQLDWLoad`. This could be reimplemented as a Lifecycle Plugin.
- remove `bulkload` mode from `JDBCLoad`. Any target specific JDBC behaviours could be implemented by custom plugins if required.
- remove `user` and `password` from `JDBCExecute` for consistency. Move details to either `jdbcURL` or `params`.
- remove the `Dockerfile` and put it in a separate repo: https://github.com/tripl-ai/docker
CHANGES
- changed `inputURI` field for `TypingTransform` to `schemaURI` to allow addition of `schemaView`.
- add `CypherTransform` and `GraphTransform` stages to support the https://github.com/opencypher/morpheus project (https://github.com/tripl-ai/arc-graph-pipeline-plugin).
- add `MongoDBExtract` and `MongoDBLoad` stages (https://github.com/tripl-ai/arc-mongodb-pipeline-plugin).
- move `ElasticsearchExtract` and `ElasticsearchLoad` to their own repository https://github.com/tripl-ai/arc-elasticsearch-pipeline-plugin.
- move `KafkaExtract`, `KafkaLoad` and `KafkaCommitExecute` to their own repository https://github.com/tripl-ai/arc-kafka-pipeline-plugin.
1.15.0
- added `uriField` and `bodyField` to `HTTPExtract` allowing dynamic data to be generated and `POST`ed to endpoints when using an `inputView`.
1.14.0
- changed all dependencies to `intransitive()`. All tests pass, however this may cause issues. Please raise an issue if found.
- removed reliance on `/lib` libraries.
- added `endpoint` and `sslEnabled` variables to the `AmazonAccessKey` authentication options to help connect to `Minio` or `Ceph Object Store`.
1.13.3
- Arc now available on Maven.
- added configuration flag `ETL_CONF_DISABLE_DEPENDENCY_VALIDATION` and `etl.config.disableDependencyValidation` to disable config dependency graph validation in case of dependency resolution defects.
1.13.2
- FIX issue where using SQL Common Table Expressions (CTE - `WITH` statements) would break the config dependency graph validation.
1.13.1
- added ability to add custom key/value tags to all log messages via `ETL_CONF_TAGS` or `etl.config.tags`.
1.13.0
- BREAKING added `environments` key to Dynamic Configuration Plugins and Lifecycle Plugins so they can be enabled/disabled depending on the deployment environment.
- BREAKING Lifecycle Plugins now require explicit declaration like Dynamic Configuration Plugins by use of the `config.lifecycle` attribute.
- FIX applied the fix for https://issues.apache.org/jira/browse/SPARK-26995 to the Dockerfile.
- FIX error reading `Elasticsearch*` configuration parameters due to escaping by Typesafe Config.
- added `AzureCosmosDBExtract` stage.
- added ability to pass `params` to Lifecycle Plugins.
- rewrote tutorial to use arc-starter.
- added `failMode` to `BytesExtract` to allow the pipeline to continue if binary files are missing.
- added `DataFramePrinterLifecyclePlugin` to base image.
- `ARC.run()` now returns the final `Option[DataFrame]`, facilitating better integrations.
1.12.2
- FIX defect where `sqlParams` in the `SQLTransform` stage would throw an exception.
1.12.1
- FIX defect where job config files would not resolve internal substitution values.
1.12.0
- bump to Spark 2.4.3.
- bump to OpenJDK `8.212.04-r0` in `Dockerfile`.
1.11.1
- FIX error reading text file during config stage.
- added support for `AzureDataLakeStorageGen2AccountKey` and `AzureDataLakeStorageGen2OAuth` authentication methods.
1.11.0
- BREAKING `key` and `value` fields are extracted as `binary` type from `KafkaExtract` to be consistent with the Spark Streaming schema and be easier to generalise.
- added support for writing `binary` key/value to Kafka using `KafkaLoad`.
- added `inputView`, `inputField` and `avroSchemaURI` to allow parsing of `avro` binary data which does not have an embedded schema, for reading from sources like `KafkaExtract` with a Kafka Schema Registry.
1.10.0
- added `basePath` to relevant `*Extract` to aid with partition discovery.
- added check to ensure no parameters are remaining after `sqlParams` string replacement (i.e. missing `sqlParams`).
- added `failMode` to `HTTPTransform` with default `failfast` (unchanged behaviour).
- added streaming mode support to `HTTPLoad`.
- added `binary` `metadata` type to allow decoding `base64` and `hexadecimal` encodings.
- CHORE bumped some JAR versions up to latest.
1.9.0
- FIX command line arguments which contain an equals sign not being parsed correctly.
- added `DatabricksSQLDWLoad` stage for bulk loading Azure SQLDW when executing in the Databricks Runtime environment.
- added `ElasticsearchExtract` and `ElasticsearchLoad` stages for connecting to Elasticsearch clusters.
- added additional checks for table dependencies when validating the job config.
- added `TextLoad` which supports both `singleFile` and standard partitioned output formats.
1.8.0
- added ability to pass job substitution variables via the `spark-submit` command instead of only environment variables. See Typesafe Config Substitutions for additional information.
- added the `DatabricksDeltaExtract` and `DatabricksDeltaLoad` stages for when executing in the Databricks Runtime environment.
1.7.1
- added additional logging in `*Transform` to better track partition behaviour.
1.7.0
- added `partitionBy` and `numPartitions` to relevant `*Transform` stages to allow finer control of parallelism.
- changed to only perform the `HTTPExtract` split result when `batchSize` is greater than 1.
1.6.1
- FIX exit process when executing within a Databricks Runtime environment so the job reports success/failure correctly.
- FIX changed log level for `DynamicConfiguration` return values to `DEBUG` to prevent spilling secrets unless opted in.
1.6.0
- FIX defect in `*Extract` where Arc would recalculate metadata columns (`_filename`, `_index` or `_monotonically_increasing_id`) if both `_index` and `_monotonically_increasing_id` were missing, ignoring `_filename` presence.
- FIX `HTTPExtract`, `HTTPTransform` and `HTTPLoad` changed to fail fast and hit the HTTP endpoint only once.
- FIX `JDBCExecute` was not setting connection `params` if `user` or `password` were not provided.
- changed `DynamicConfiguration` plugins to be a HOCON `object` rather than a `string`, allowing parameters to be passed in.
- added a `logger` object to `DynamicConfiguration` plugins.
- added `customDelimiter` attribute to `DelimitedExtract` and `DelimitedLoad` to be used in conjunction with `delimiter` equal to `Custom`.
- added optional `description` attribute to all stages.
- added `inputField` to `DelimitedExtract` and `JSONExtract` to simplify loading from sources like `HTTPExtract`.
- added `batchSize` and `delimiter` to `HTTPTransform` to allow batching to reduce the cost of HTTP overhead.
- bump to Alpine 3.9 in `Dockerfile`.
1.5.0
- changed `delimiter` for `DelimitedExtract` and `DelimitedLoad` from `DefaultHive` to `Comma`.
- renamed `BytesExtract` attribute `pathView` to `inputView` for consistency.
- renamed `JDBCExecute` attribute `url` to `jdbcURL` for consistency.
- added `authentication` to `PipelineExecute` to allow reading external pipelines from different sources.
- major rework of error messages when reading job and metadata config files.
- bump to OpenJDK 8.191.12-r0 in `Dockerfile`.
1.4.1
- added `inputField` to both `TensorFlowServingTransform` and `HTTPTransform` to allow overriding the default field `value`.
- added `ImageExtract` to read image files for machine learning etc.
- added the `minLength` and `maxLength` properties to the `string` metadata type.
1.4.0
- bump to Spark 2.4.0.
- bump to OpenJDK 8.181.13-r0 in `Dockerfile`.
1.3.1
- added additional tutorial job at `/tutorial/starter`.
1.3.0
- added support for dynamic runtime configuration via `Dynamic Configuration Plugins`.
- added support for custom stages via `Pipeline Stage Plugins`.
- added support for Spark SQL extensions via custom `User Defined Functions` registration.
- added support for specifying custom `formatters` for `TypingTransform` of integer, long, double and decimal types.
- added `failMode` for `TypingTransform` to allow either `permissive` or `failfast` mode.
- added `inputView` capability to the `HTTPExtract` stage.
1.2.1
- bump to Spark 2.3.2.
1.2.0
- added support for Spark Structured Streaming.
- added `pathView` property for `BytesExtract` allowing a dataframe of file paths to be provided to dynamically load files.
- added `TextExtract` to read basic text files.
- added `RateExtract` which wraps the Spark Structured Streaming rate source for testing streaming jobs.
- added `ConsoleLoad` which wraps the Spark Structured Streaming console sink for testing streaming jobs.
- added ability for `JDBCLoad` to execute in Spark Structured Streaming mode.
1.1.0
- fixed a bug in the `build.sbt` mergeStrategy which incorrectly excluded the `BinaryContentDataSource` registration when running `assembly`.
- changed `TensorFlowServingTransform` to require a `responseType` argument.
1.0.9
- added the ability to add column level `metadata`.
- added the `MetadataFilterTransform` stage - a stage which allows filtering of columns by their metadata.
- added the `PipelineExecute` stage - a stage which allows embedding other pipelines.
- added the `HTTPTransform` stage - a stage which calls an external API and adds the result to the DataFrame.
- added support for Google Cloud Storage - prefix: `gs://`
- added support for Azure Data Lake Storage - prefix: `adl://`
- added `partitionColumn` to `JDBCExtract` to support parallel extract.
- added `predicates` to `JDBCExtract` to allow manual partition definitions.
- changed `XMLExtract` to be able to support reading `.zip` files.
- changed `*Extract` to allow input `glob` patterns not just a simple `URI`.
- changed `*Extract` to support the schema to be provided as `schemaView`.
- added `BytesExtract` to allow `Array[Bytes]` to be read into a dataframe for things like calling external Machine Learning models.
- refactored `TensorFlowServingTransform` to call the `REST` API (in batches) which is easier due to library conflicts when creating `protobuf`.
- create the `UDF` registration mechanism allowing a logical extension point for common functions missing from Spark core (see the sketch below).
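A minimal sketch of the Spark primitive such a registration mechanism builds on is shown below; it is not the Arc registration API, and the example function is invented.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: registering a plain Scala function so it is callable from SQL.
// This is the underlying Spark primitive, not Arc's UDF registration mechanism.
object UdfRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("UdfRegistrationSketch").getOrCreate()

    // invented example function: zero-pad an integer to a fixed width
    spark.udf.register("left_pad_zero", (value: Int, width: Int) => {
      val s = value.toString
      "0" * math.max(0, width - s.length) + s
    })

    spark.sql("SELECT left_pad_zero(42, 5) AS padded").show(false)

    spark.stop()
  }
}
```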
1.0.8
- added the `KafkaLoad` stage - a stage which will write a dataset to a designated `topic` in Apache Kafka.
- added the `EXPERIMENTAL` `KafkaExtract` stage - a stage which will read from a designated `topic` in Apache Kafka and produce a dataset.
- added the `KafkaCommitExecute` stage - a stage which allows the committing of the Kafka offsets to be deferred, allowing quasi-transactional behaviour.
- added integration tests for `KafkaLoad` and `KafkaExtract`. More to come.
- added `partitionBy` to `*Extract`.
- added `contiguousIndex` option to file based `*Extract` which allows users to opt out of the expensive `_index` resolution from `_monotonically_increasing_id`.
- added ability to send a `POST` request with `body` in `HTTPExtract`.
- changed hashing function for `DiffTransform` and `EqualityValidate` from the inbuilt `hash` to `sha2(512)`.
1.0.7
- added the `DiffTransform` stage - a stage which efficiently calculates the difference between two datasets and produces left, right and intersection output views.
- added logging of `records` and `batches` counts to the `AzureEventHubsLoad`.
- updated the `EqualityValidate` to use the `hash` diffing function as the inbuilt `except` function is very difficult to use in practice. API unchanged.
- updated the `HTTPExecute` stage to not automatically split the response body by newline (`\n`) - this is more in line with the expected use case of REST API endpoints.
1.0.6
- updated `AzureEventHubsLoad` to use a SNAPSHOT compiled JAR in `/lib` to get the latest changes. This will be changed back to Maven once the version is officially released. Also exposed the exponential retry options to the API.
- initial testing of a `.zip` reader/writer.
1.0.5
- FIX a longstanding defect in `TypingTransform` not correctly passing through values which are already of the correct type.
- change the `_index` field added to `*Extract` from `monotonically_increasing_id()` to a `row_number(monotonically_increasing_id())` so that the index aligns to underlying files and is more intuitive to use (see the sketch below).
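A minimal Spark sketch of the pattern described above, deriving a contiguous index by applying `row_number` over the ordering given by `monotonically_increasing_id`; this is illustrative, not Arc's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Illustrative sketch: monotonically_increasing_id is monotonic but not contiguous
// across partitions; row_number over that ordering yields a dense 1-based index.
object IndexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("IndexSketch").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("value")

    val window = Window.orderBy(monotonically_increasing_id())
    df.withColumn("_index", row_number().over(window)).show(false)

    spark.stop()
  }
}
```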
1.0.4
- allow passing of the same metadata schema to `JSONExtract` and `XMLExtract` to reduce the cost of schema inference for a large number of files.
1.0.3
- Expose `numPartitions` optional parameter for `*Extract`.
1.0.2
- add SQL validation step to `SQLValidate` configuration parsing and ensure parameters are injected first (including `SQLTransform`) so the statements with parameters can be parsed.
1.0.1
- bump to Spark 2.3.1.
1.0.0
- initial release.