[data] Adding Kafka datasink. by justinrmiller · Pull Request #60307 · ray-project/ray

justinrmiller · 2026-01-20T00:59:12Z

Description

This PR introduces a Kafka Datasink for Ray Data, allowing users to write Ray Datasets directly to Apache Kafka topics. The implementation uses Confluent's Kafka Python library and supports distributed writes with configurable serialization, key extraction, and performance tuning (e.g., periodic flushing to manage memory).

Key Changes

`ray.data._internal.datasource.kafka_datasink.py`

- New KafkaDatasink Class: Implements the Datasink interface to handle parallel writes across Ray workers.
- Serialization Support: Provides built-in support for json, string, and bytes for both message keys and values.
- Smart Row Conversion: Includes a logic handler to convert various Ray row types (Dict, ArrowRow, PandasRow, NamedTuple) into serializable formats.
- Memory Management:
  - Implements a _FLUSH_INTERVAL (set to 10,000) to periodically flush the producer and wait for acknowledgments, preventing memory exhaustion from too many un-flushed futures.
  - Handles BufferError by flushing and retrying the send operation.
- Error Handling: Captures and reports the number of failed messages, ensuring that the first encountered exception is chained to the final RuntimeError for easier debugging.
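
The periodic-flush idea in the list above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `InMemoryProducer`, `write_rows`, and the small `flush_interval` used in the usage note are hypothetical names and values chosen for the example. Only `produce()` and `flush()` mirror method names that exist on `confluent_kafka.Producer`.

```python
import json
from typing import Any, Dict, Iterable, List


class InMemoryProducer:
    """Hypothetical in-memory stand-in for a Kafka producer (assumption)."""

    def __init__(self) -> None:
        self.pending: List[bytes] = []    # enqueued, not yet acknowledged
        self.delivered: List[bytes] = []  # acknowledged by the pretend broker
        self.flush_calls = 0

    def produce(self, topic: str, value: bytes) -> None:
        self.pending.append(value)

    def flush(self) -> int:
        self.flush_calls += 1
        self.delivered.extend(self.pending)
        self.pending.clear()
        return 0  # number of messages still waiting


def write_rows(
    producer: Any,
    topic: str,
    rows: Iterable[Dict[str, Any]],
    flush_interval: int = 10_000,
) -> int:
    """Produce every row, flushing after each `flush_interval` messages."""
    sent = 0
    for row in rows:
        producer.produce(topic, value=json.dumps(row).encode("utf-8"))
        sent += 1
        if sent % flush_interval == 0:
            producer.flush()  # bound the number of unacknowledged messages
    producer.flush()  # deliver the final partial batch
    return sent
```

With 25 rows and `flush_interval=10`, this sketch flushes after messages 10 and 20 and once more at the end, so at most 10 messages are ever pending.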

`ray.data.dataset.py`

- write_kafka() Method: Added as a top-level convenience method on the Dataset class.
- API Exposure: Decorated with @ConsumptionAPI, allowing users to call ds.write_kafka(...) with standard Ray remote arguments and concurrency controls.

`ray.data.tests.datasource.test_kafka.py`

- Unit Tests: Validates initialization, row-to-dict conversion, and serialization logic in isolation.
- Integration Tests: Comprehensive test suite (requires a Kafka broker) covering:
  - Basic writes and multi-block writes.
  - Key extraction and custom serializers.
  - Producer configurations and delivery callbacks.
  - Edge cases like empty datasets, null values, and invalid connection strings.

Basic Usage

import ray

ds = ray.data.range(100)
ds.write_kafka(
    topic="my-topic", 
    bootstrap_servers="localhost:9092",
    key_field="id",
    key_serializer="string",
    value_serializer="json"
)
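
To make the `key_serializer="string"` and `value_serializer="json"` options in the call above more concrete, here is a rough sketch of what the three named formats could produce for a single row. The `serialize` helper is hypothetical, and its handling of `None` and of non-bytes input is an assumption, not something confirmed by the PR.

```python
import json
from typing import Any, Optional


def serialize(value: Any, fmt: str) -> Optional[bytes]:
    """Encode a key or value as bytes using one of the three named formats."""
    if value is None:
        return None  # assumption: a missing key or value stays None
    if fmt == "json":
        return json.dumps(value).encode("utf-8")
    if fmt == "string":
        return str(value).encode("utf-8")
    if fmt == "bytes":
        if isinstance(value, (bytes, bytearray)):
            return bytes(value)
        raise TypeError(f"expected bytes, got {type(value).__name__}")
    raise ValueError(f"unknown serializer format: {fmt!r}")


# ray.data.range(100) yields rows with a single "id" column.
row = {"id": 5}
key = serialize(row["id"], "string")  # b"5"
value = serialize(row, "json")        # b'{"id": 5}'
```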

Related issues

Closes #58725

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a KafkaDatasink for Ray Data, which is a valuable addition. The implementation is generally well-structured. I've identified a few areas for improvement to enhance robustness and maintainability. My feedback includes suggestions to refactor duplicated code, correct potentially buggy logic in object-to-dictionary conversion, add parameter validation, and fix an incorrect docstring example. Addressing these points will strengthen the new datasink implementation.

owenowenisme

Thanks for the contribution!

justinrmiller · 2026-02-21T21:40:34Z

@owenowenisme Do you think we should make the flush interval a configurable parameter? I went with 10k as I thought that would be a reasonable amount (guidance is generally to keep message sizes under 1.5 KB, which gives a maximum of about 15 MB (10k * 1.5 KB) of unflushed data).
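
For reference, the sizing estimate in the comment above works out as below. The 1.5 KB per-message bound is the commenter's rule of thumb, and the variable names are only illustrative.

```python
FLUSH_INTERVAL = 10_000  # messages between flushes, from the comment above
MESSAGE_SIZE_KB = 1.5    # assumed upper bound on a single message

# Decimal units: 1 MB = 1000 KB.
max_unflushed_mb = FLUSH_INTERVAL * MESSAGE_SIZE_KB / 1000
```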

gemini-code-assist

Code Review

This pull request introduces a Kafka Datasink for Ray Data, which is a great addition. The implementation is well-structured, with good error handling for producer buffer errors and delivery failures. The periodic polling to manage memory is also a thoughtful touch. The accompanying test suite is comprehensive, covering unit tests, integration tests, and various edge cases. I have one suggestion to improve code maintainability by refactoring duplicated serialization logic.

Note: Security Review is unavailable for this PR.

iamjustinhsu

Nice work! I'm not too familiar with Kafka, so I had some questions, but overall it looks good.

iamjustinhsu · 2026-03-11T00:00:47Z

+        return self._serialize(key, self.key_serializer)
+
+    def _extract_key(self, row_dict: Any) -> Optional[bytes]:
+        """Extract and encode message key from row dict."""


can you add a comment/docstring on what None signifies, and if/how it affects message distribution?
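
As background for the question above, here is a simplified model of why a `None` key matters. As far as I know, librdkafka's default partitioner hashes non-null keys and assigns null-keyed messages to a random partition, so only keyed messages get stable per-key placement. The `extract_key` and `choose_partition` helpers below are hypothetical and only imitate that behavior; they are not the PR's code or the real partitioner.

```python
import random
import zlib
from typing import Any, Dict, Optional


def extract_key(row: Dict[str, Any], key_field: Optional[str]) -> Optional[bytes]:
    """Return the encoded key, or None when no key applies to this row."""
    if key_field is None or row.get(key_field) is None:
        return None
    return str(row[key_field]).encode("utf-8")


def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Simplified model of a hash partitioner that randomizes null keys."""
    if key is None:
        return random.randrange(num_partitions)  # no stable placement
    return zlib.crc32(key) % num_partitions      # same key, same partition
```

In this model, rows that share a key always land on the same partition, while rows with a `None` key may land anywhere.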

iamjustinhsu · 2026-03-11T01:18:32Z

+                    except BufferError:
+                        # Internal queue is full, poll to serve callbacks
+                        # and free space, then retry
+                        producer.poll(_BUFFER_FULL_POLL_TIMEOUT_S)


If there are a lot of backlogged messages that need to be delivered, will this line try to free up all the space, and if so, do you know how long that will take? I'm wondering if this should block for less time in the worst case, like 5 seconds, but I'm not sure.
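
One way to reason about the worst-case blocking time is to bound both the poll timeout and the number of retries. The sketch below is an assumption-laden illustration: `FullThenFreeProducer`, `produce_with_retry`, and the default values are hypothetical. It relies only on `produce()` raising `BufferError` when the local queue is full and on `poll(timeout)` blocking for at most `timeout` seconds, which is my understanding of `confluent_kafka.Producer`.

```python
from typing import Any, List


class FullThenFreeProducer:
    """Hypothetical stand-in: the queue is full for the first N produce calls."""

    def __init__(self, full_for: int) -> None:
        self.full_for = full_for
        self.sent: List[bytes] = []
        self.poll_timeouts: List[float] = []

    def produce(self, topic: str, value: bytes) -> None:
        if self.full_for > 0:
            self.full_for -= 1
            raise BufferError("local queue is full")
        self.sent.append(value)

    def poll(self, timeout: float) -> int:
        self.poll_timeouts.append(timeout)
        return 0


def produce_with_retry(
    producer: Any,
    topic: str,
    value: bytes,
    poll_timeout_s: float = 1.0,
    max_retries: int = 5,
) -> None:
    """Retry produce() on BufferError, polling between attempts."""
    for _ in range(max_retries):
        try:
            producer.produce(topic, value=value)
            return
        except BufferError:
            # Serve delivery callbacks for at most poll_timeout_s, then retry.
            producer.poll(poll_timeout_s)
    raise RuntimeError(
        f"queue still full after {max_retries} attempts "
        f"(worst-case wait {poll_timeout_s * max_retries}s)"
    )
```

Under these assumptions the total time spent waiting is capped at `poll_timeout_s * max_retries`, regardless of how large the backlog is.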

@staticmethod

- Add SerializerFormat enum and extract _serialize as standalone function
- Add _produce_with_retry method to encapsulate produce+retry-on-BufferError logic
- Document _extract_key None return (default partitioner behavior)
- Document constants rationale (_POLL_BATCH_SIZE sizing, configurability)
- Add librdkafka CONFIGURATION.md link to producer_config docstring
- Include topic name in KafkaException error message
- Change write completion log from info to debug (fires once per task)
- Make _row_to_dict a @staticmethod

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

iamjustinhsu

lgtm!

owenowenisme · 2026-03-12T04:01:17Z

Overall LGTM. One thing to note: we should document that the current implementation provides best-effort delivery only; partial writes can occur if a task fails midway, and duplicates are possible on system-error task retries (e.g., node failures).

For future improvements, we could explore:

- Per-task Kafka transactions: to guarantee atomicity within each write task (no partial messages from failed tasks, no duplicates on retry)
- Checkpoint-based delivery tracking: to enable resumable writes and stronger end-to-end guarantees
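
The per-task transaction idea could look roughly like the sketch below. It is not part of this PR. The transaction method names mirror those on `confluent_kafka.Producer` (`init_transactions`, `begin_transaction`, `commit_transaction`, `abort_transaction`), but `RecordingTxnProducer` and `write_block_in_transaction` are hypothetical, and a real producer would also need a `transactional.id` in its config and consumers reading with `isolation.level=read_committed`.

```python
import json
from typing import Any, Dict, Iterable, List


class RecordingTxnProducer:
    """Hypothetical stand-in that records which producer methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def init_transactions(self) -> None:
        self.calls.append("init")

    def begin_transaction(self) -> None:
        self.calls.append("begin")

    def produce(self, topic: str, value: bytes) -> None:
        self.calls.append("produce")

    def commit_transaction(self) -> None:
        self.calls.append("commit")

    def abort_transaction(self) -> None:
        self.calls.append("abort")


def write_block_in_transaction(
    producer: Any, topic: str, rows: Iterable[Dict[str, Any]]
) -> None:
    """Write one block atomically: commit everything or abort everything.

    Assumes the caller already ran producer.init_transactions() once.
    """
    producer.begin_transaction()
    try:
        for row in rows:
            producer.produce(topic, value=json.dumps(row).encode("utf-8"))
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()  # discard the partial block
        raise
```

If any row fails to serialize or produce, the whole block is aborted, which is the atomicity property the comment above describes.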

cursor · 2026-03-16T05:58:24Z

+            break
+        messages.append(msg)
+
+    return messages


Test never polls Kafka when expected count is zero

Low Severity

The consume_messages helper has a while len(messages) < expected_count loop condition. When expected_count=0, 0 < 0 is False, so the loop body never executes — the function returns an empty list immediately without ever polling Kafka. In test_write_kafka_empty_dataset, the assertion assert len(messages) == 0 is therefore trivially true regardless of whether messages were actually written, making the verification meaningless.

Additional Locations (1)

python/ray/data/tests/datasource/test_kafka.py#L1088-L1091
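
The issue described above can be reproduced in isolation. In the sketch below, `QueueConsumer`, `consume_expected`, and `consume_until_idle` are hypothetical names. The first helper has the loop shape described in the review, and the second is one possible alternative that always polls at least once, so it can actually detect unexpected messages.

```python
from typing import Any, List, Optional


class QueueConsumer:
    """Hypothetical stand-in for a Kafka consumer with queued messages."""

    def __init__(self, queued: List[str]) -> None:
        self.queued = list(queued)
        self.polls = 0

    def poll(self, timeout: float) -> Optional[str]:
        self.polls += 1
        return self.queued.pop(0) if self.queued else None


def consume_expected(consumer: Any, expected_count: int) -> List[str]:
    """Loop shape from the review: never polls when expected_count == 0."""
    messages: List[str] = []
    while len(messages) < expected_count:
        msg = consumer.poll(0.1)
        if msg is None:
            break
        messages.append(msg)
    return messages


def consume_until_idle(consumer: Any, max_messages: int = 1000) -> List[str]:
    """Alternative: always poll, and stop only when the consumer goes idle."""
    messages: List[str] = []
    while len(messages) < max_messages:
        msg = consumer.poll(0.1)
        if msg is None:
            break
        messages.append(msg)
    return messages
```

With one unexpected message queued, `consume_expected(consumer, 0)` returns an empty list without polling, while `consume_until_idle(consumer)` returns the message and would make an empty-dataset assertion fail as intended.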

cursor · 2026-03-16T05:58:24Z

+        except KafkaException as e:
+            raise RuntimeError(
+                f"Failed to write to Kafka topic '{self.topic}': {e}"
+            ) from e


Dead exception handler catches nothing in try block

Low Severity

The except KafkaException handler in write() is unreachable. Within the try block, producer.produce() is asynchronous and reports errors via the on_delivery callback (not by raising). producer.poll() and producer.flush() also don't raise KafkaException. All other operations are Ray/Python-internal and can't produce a KafkaException. This creates a false sense of error handling — a real KafkaException from the constructor on line 236 occurs before the try block and would propagate uncaught.
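
Since delivery failures arrive through the callback rather than as exceptions from `produce()`, the error-reporting path described in the PR (count failures, chain the first one) might be sketched as follows. `CallbackProducer` and `write_all` are hypothetical, and the fake passes plain `Exception` objects to the callback for simplicity. With the real library the callback receives a `KafkaError`, which would need wrapping (for example in `KafkaException`) before it could be used with `raise ... from`.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


class CallbackProducer:
    """Hypothetical stand-in: reports delivery results only during flush()."""

    def __init__(self, bad: Iterable[bytes]) -> None:
        self.bad = set(bad)
        self.queue: List[Tuple[bytes, Callable[[Optional[Exception], Any], None]]] = []

    def produce(self, topic: str, value: bytes, on_delivery: Callable) -> None:
        # Asynchronous: enqueue only, no delivery error is raised here.
        self.queue.append((value, on_delivery))

    def flush(self) -> int:
        for value, callback in self.queue:
            err = Exception(f"delivery failed for {value!r}") if value in self.bad else None
            callback(err, value)
        self.queue.clear()
        return 0


def write_all(producer: Any, topic: str, values: Iterable[bytes]) -> None:
    """Produce all values, then raise if any delivery callback saw an error."""
    failures: List[Exception] = []

    def on_delivery(err: Optional[Exception], msg: Any) -> None:
        if err is not None:
            failures.append(err)

    for value in values:
        producer.produce(topic, value=value, on_delivery=on_delivery)
    producer.flush()

    if failures:
        raise RuntimeError(
            f"{len(failures)} message(s) failed to write to topic '{topic}'"
        ) from failures[0]  # chain the first failure for easier debugging
```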

justinrmiller · 2026-03-17T08:21:55Z

@owenowenisme Are we good to go on this one? Thanks!

cursor · 2026-03-17T08:31:08Z

+        topic_partitions = [
+            TopicPartition(topic, p, 0) for p in topic_meta.partitions.keys()
+        ]
+        consumer.assign(topic_partitions)


Stale consumer assignment when topic metadata missing

Medium Severity

The consume_messages helper only calls consumer.assign() when topic metadata is found (if topic_meta and topic_meta.partitions). Since the kafka_consumer fixture is session-scoped and shared across all integration tests, if topic metadata is not found for a given test, assign() is never called, and the consumer retains the partition assignment from a previous test. This means it would poll from the wrong topic, returning stale messages and causing misleading test results instead of a clean failure.

owenowenisme

LGTM

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

cursor · 2026-03-18T03:20:06Z

+        except KafkaException as e:
+            raise RuntimeError(
+                f"Failed to write to Kafka topic '{self.topic}': {e}"
+            ) from e


Missing producer flush on non-KafkaException error paths

Medium Severity

The try/except KafkaException block in write() only catches KafkaException. If _produce_with_retry raises RuntimeError (persistent BufferError) or serialization raises TypeError (e.g., numpy types from Pandas blocks not being JSON-serializable), producer.flush() is never called. Messages already successfully enqueued in the producer's internal buffer before the error are silently dropped without any delivery attempt. A try/finally wrapping the producer.flush() call would ensure buffered messages get a chance to be delivered even on unexpected error paths.

Not a bug. Flushing on error paths would be worse — the task has failed and Ray Data may retry it from scratch. Flushing a partial batch means the retry produces duplicates for the already-delivered prefix. Skipping flush minimizes the partial write window, which is the right behavior. The class docstring already documents the at-most-once-per-attempt delivery semantics.

Adding Kafka datasink.

bc79a73

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

justinrmiller requested a review from a team as a code owner January 20, 2026 00:59

gemini-code-assist Bot reviewed Jan 20, 2026

Update python/ray/data/dataset.py

6f74cdb

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Justin Miller <justinrmiller@users.noreply.github.com>

cursor Bot reviewed Jan 20, 2026

ray-gardener Bot added the `data` (Ray Data-related issues) and `community-contribution` (Contributed by the community) labels Jan 20, 2026

owenowenisme reviewed Jan 30, 2026

iamjustinhsu assigned owenowenisme Feb 4, 2026

Justin Miller added 2 commits February 4, 2026 20:40

Addressing PR comments and adding test for Kafka datasink.

a6b042d

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

Addressing PR comments.

d482c96

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

justinrmiller requested a review from owenowenisme February 5, 2026 04:44

Merge branch 'master' into 58725-Kafka-Datasync

eeb7f13

cursor Bot reviewed Feb 5, 2026

Justin Miller added 2 commits February 4, 2026 20:57

Missed using the _serialize_key value, added a test to ensure that won't happen again.

75a572a

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

Merge branch '58725-Kafka-Datasync' of https://github.com/justinrmiller/ray into 58725-Kafka-Datasync

f4d9610

cursor Bot reviewed Feb 5, 2026

Merge branch 'master' into 58725-Kafka-Datasync

286feea

owenowenisme reviewed Feb 9, 2026

Merge branch 'master' into 58725-Kafka-Datasync

817cf5b

cursor Bot reviewed Feb 11, 2026

Merge remote-tracking branch 'upstream/master' into 58725-Kafka-Datasync

f787181

cursor Bot reviewed Feb 17, 2026

Merge branch 'master' into 58725-Kafka-Datasync

5292fe5

cursor Bot reviewed Feb 21, 2026

Justin Miller added 2 commits February 21, 2026 13:30

Address PR comments.

d17c447

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

Merge branch '58725-Kafka-Datasync' of https://github.com/justinrmiller/ray into 58725-Kafka-Datasync

ede6646

cursor Bot reviewed Feb 21, 2026

Surfacing first exception in Kafka datasink.

f9eeb01

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

cursor Bot reviewed Mar 8, 2026

gemini-code-assist Bot reviewed Mar 8, 2026

Merge branch 'master' into 58725-Kafka-Datasync

8b95dc3

cursor Bot reviewed Mar 8, 2026

fix doc test and simplify serialization

94f505c

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme added the `go` (add ONLY when ready to merge, run all tests) label Mar 9, 2026

cursor Bot reviewed Mar 9, 2026

iamjustinhsu reviewed Mar 11, 2026

justinrmiller requested a review from iamjustinhsu March 11, 2026 21:26

Merge branch 'master' into 58725-Kafka-Datasync

bf9ad94

iamjustinhsu approved these changes Mar 12, 2026

Merge branch 'master' into 58725-Kafka-Datasync

5cb9136

cursor Bot reviewed Mar 16, 2026

[data] Document best-effort delivery semantics for KafkaDatasink

2989585

Signed-off-by: Justin Miller <justinrmiller@gmail.com>

cursor Bot reviewed Mar 16, 2026

Merge branch 'master' into 58725-Kafka-Datasync

550ba47

justinrmiller requested a review from iamjustinhsu March 17, 2026 08:21

cursor Bot reviewed Mar 17, 2026

small tweak

3cd6442

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme approved these changes Mar 18, 2026

cursor Bot reviewed Mar 18, 2026

Merge branch 'master' into 58725-Kafka-Datasync

abfc914

richardliaw changed the title from "Adding Kafka datasink." to "[data] Adding Kafka datasink." Mar 18, 2026

richardliaw merged commit 0951139 into ray-project:master Mar 18, 2026
6 checks passed
