
Releases July 2023

Latest Datahub release 1.6.0

Datahub 1.6.0 introduces error handlers for jobs and general error handling improvements.

HTTP retries

When datasets are synchronized over HTTP using jobs in the Datahub, the HttpDatasetSink can fail to deliver to the remote endpoint. Often this is caused by network partitions, service restarts and similar transient problems, where a retry after a short interval fixes the issue.

Datahub now automatically retries failed HTTP requests in the HttpDatasetSink.
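
For context, a minimal sketch of a job that pushes a local dataset to a remote endpoint over HTTP is shown below. The dataset name, URL and schedule are placeholders, and the exact property names should be taken from the Datahub Jobs documentation; the new retry behaviour applies to the HTTP requests made by such a sink without any extra configuration.

    {
      "id": "sync-people-to-remote",
      "triggers": [
        { "triggerType": "cron", "jobType": "incremental", "schedule": "@every 300s" }
      ],
      "source": { "Type": "DatasetSource", "Name": "people" },
      "sink": { "Type": "HttpDatasetSink", "Url": "https://example.org/datasets/people/entities" }
    }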

Job error handlers

reRun

When a job works with remote services that are unstable and take more than a couple of seconds to recover, its trigger can now be configured with a reRun error handler. This handler reruns the job after a specified interval, for a configured number of times or until it succeeds.

This is an alternative to short job schedule intervals, which are not always a valid solution for flaky systems due to factors such as rate limiting, quotas or cost implications. A sketch of such a trigger follows below.
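
As a rough sketch, a trigger with a reRun error handler could look like the following. The handler name comes from this release, but the surrounding property names and the retry count and delay values are illustrative assumptions; consult the Jobs documentation for the exact schema.

    {
      "triggers": [
        {
          "triggerType": "cron",
          "jobType": "incremental",
          "schedule": "@every 3600s",
          "onError": [
            { "errorHandler": "reRun", "maxRetries": 3, "retryDelay": 60 }
          ]
        }
      ]
    }

The intent in this sketch is that a failed run is retried up to three times, with 60 seconds between attempts, instead of tightening the cron schedule itself.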

log

In cases where a remote job sink rejects a transmission due to problems with the data, it is not always easy to see which specific entity in a large batch caused the issue.

It can sometimes also be desirable to skip failing entities and continue with the rest of the dataset, so that as much as possible is synced and the job is not completely blocked by errors.

The new log error handler can be added to a job trigger to address both needs. It can be configured to log only the first failing entity or entities and then stop, giving users a new tool for debugging job sync issues.

It can also be configured to log all failing entities and continue without stopping the pipeline, as sketched below.
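
A sketch of the two configurations described above might look like this. Apart from the handler name, the property names are illustrative assumptions; the authoritative list of options is in the Jobs documentation.

    {
      "onError": [
        { "errorHandler": "log", "maxItems": 1 }
      ]
    }

    {
      "onError": [
        { "errorHandler": "log" }
      ]
    }

In the first sketch a hypothetical maxItems limit of 1 would stop the job after the first failing entity has been logged; omitting the limit, as in the second sketch, would correspond to logging every failing entity and letting the job run on.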

Note that entities skipped by the log error handler are not processed again if the skipping advances the job to a new continuation token in the job source. In that case the job must be reset, or a fullsync must be run, to re-process the logged and skipped entities.

Also be aware that the sink must be idempotent, or at a minimum it must tolerate duplicates of the same entities being sent during a job execution. The mechanism that identifies single failing entities in a rejected batch splits the batch up multiple times to narrow the potentially failing entities down to one, so the same entities are sent multiple times during this process.

Read more about error handlers in the Datahub Jobs documentation.


Get Datahub version 1.6.0 on GitHub and on Docker Hub.