Techdots

March 22, 2025

Real-Time ETL Processing in Ruby: Streaming Data with Kafka and WebSockets

What if your data pipelines could process and deliver insights in real time, enabling instant decision-making?

Traditional Extract, Transform, Load (ETL) workflows, often batch-oriented, struggle to meet the demands of modern applications requiring real-time processing. Enter streaming ETL—a paradigm shift that processes data as it arrives, powered by event-driven architectures and tools like Kafka and WebSockets. 

In this blog, we’ll explore how to build real-time ETL pipelines in Ruby, leveraging Kafka for high-throughput message streaming and WebSockets for seamless data delivery. Whether you’re handling financial transactions or live analytics, this approach gives you low-latency, scalable, and fault-tolerant data pipelines, transforming how you handle asynchronous processing in your applications. Let’s dive in.

Streaming Data vs. Batch Processing

Traditional ETL operates in batches, where data is collected, processed, and loaded at scheduled intervals. While effective for many use cases, batch processing introduces latency that can be problematic for applications requiring real-time insights.

Streaming ETL, on the other hand, processes data as it arrives, allowing immediate transformations and delivery. This event-driven approach is crucial for applications like financial transactions, real-time monitoring, and live analytics dashboards.

Building Real-Time ETL with Ruby, Kafka, and WebSockets

To implement a real-time ETL pipeline in Ruby, we will leverage:

Kafka: A distributed message broker for handling high-throughput event streaming.

Ruby Kafka client (ruby-kafka): A library for producing and consuming messages in Kafka.

ActionCable: A WebSockets framework built into Rails for real-time communication.

Sidekiq (optional): For asynchronous background processing when transformations require additional computation.

Step 1: Setting Up Kafka with Ruby

First, install the ruby-kafka gem in your Rails project:

bundle add ruby-kafka

Configure a Kafka producer to send streaming data:

require 'kafka'

kafka = Kafka.new(["localhost:9092"], client_id: "real_time_etl")

# The async producer buffers messages and sends them from a background thread.
producer = kafka.async_producer

def stream_data(producer, topic, data)
  producer.produce(data.to_json, topic: topic)
  producer.deliver_messages # flush the buffer so the event is sent right away
end

stream_data(producer, "events", { user_id: 123, action: "login", timestamp: Time.now })
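Calling deliver_messages after every event works, but it gives up batching. If throughput matters more than per-message latency, the async producer can flush on its own schedule instead; a sketch using ruby-kafka's delivery options (the values here are illustrative):

# Flush automatically every 10 seconds, or once 100 messages have buffered.
producer = kafka.async_producer(delivery_interval: 10, delivery_threshold: 100)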

Step 2: Consuming Kafka Messages in Ruby

To process events in real-time, set up a Kafka consumer:

# Define the processing helpers before starting the blocking consumer loop;
# defining them afterwards would mean they never exist when messages arrive.
def transform_data(data)
  data.merge({ processed_at: Time.now })
end

def process_event(data)
  transformed_data = transform_data(data)
  send_to_websocket(transformed_data) # defined in Step 3
end

consumer = kafka.consumer(group_id: "etl_group")
consumer.subscribe("events")

# each_message blocks and yields every message published to the subscribed topics.
consumer.each_message do |message|
  event_data = JSON.parse(message.value)
  process_event(event_data)
end

Step 3: Integrating with WebSockets using ActionCable

To push transformed data to clients in real-time, use ActionCable:

class DataChannel < ApplicationCable::Channel
  def subscribed
    stream_from "real_time_data"
  end
end

From the Kafka consumer, broadcast transformed messages:

ActionCable.server.broadcast("real_time_data", transformed_data)
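To tie the pipeline together, the send_to_websocket helper referenced in Step 2 can simply wrap this broadcast. A minimal sketch (the helper name matches the one used earlier):

def send_to_websocket(transformed_data)
  # Push the transformed event to every client subscribed to DataChannel.
  ActionCable.server.broadcast("real_time_data", transformed_data)
end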

Ensuring Durability, Fault Tolerance, and Scalability

Message Durability

Kafka ensures message durability by persisting messages to disk and replicating them across multiple brokers. To configure durability:

producer = kafka.producer(required_acks: :all, max_retries: 5, retry_backoff: 1)

  • required_acks: :all confirms a message only once it has been written to all in-sync replicas.
  • max_retries lets the producer retry sending messages on failure.
  • retry_backoff adds a delay (in seconds) between retries to avoid overloading the broker.
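With a synchronous producer, deliver_messages raises once the configured retries are exhausted, so delivery failures can be handled explicitly. A minimal sketch using the producer configured above:

producer.produce({ user_id: 123, action: "login" }.to_json, topic: "events")

begin
  producer.deliver_messages # raises Kafka::DeliveryFailed after max_retries
rescue Kafka::DeliveryFailed => e
  Rails.logger.error("Kafka delivery failed: #{e.message}")
end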

Fault Tolerance

Consumers should be able to recover from failures gracefully:

begin
  consumer.each_message do |message|
    process_event(JSON.parse(message.value))
  end
rescue Kafka::ProcessingError => e
  Rails.logger.error("Kafka error: #{e.message}")
  retry # restart the consumer loop and keep consuming
end

  • Wrapping message consumption in a begin-rescue block ensures the consumer continues running even after errors.
  • Logs provide visibility into failures.
  • Consumers can be restarted automatically using process managers like systemd or supervisord.
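For stronger at-least-once guarantees, ruby-kafka also supports marking messages as processed manually, so a crash mid-processing causes the message to be redelivered rather than skipped. A minimal sketch, assuming the same consumer and process_event as above:

# Only commit progress after processing succeeds.
consumer.each_message(automatically_mark_as_processed: false) do |message|
  process_event(JSON.parse(message.value))
  consumer.mark_message_as_processed(message)
end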

Scalability

Kafka allows scaling consumers horizontally by using consumer groups:

consumer = kafka.consumer(group_id: "etl_group")
consumer.subscribe("events", start_from_beginning: false)

  • Multiple consumers within the same group process messages in parallel.
  • Partitioning distributes the workload across those consumers.
  • start_from_beginning: false makes a new consumer group begin at the latest offset instead of replaying the topic’s history, avoiding redundant work.
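Note that parallelism within a consumer group is capped by the topic’s partition count: with N partitions, at most N consumers in the group receive work. You can check the partition count with the client from earlier; a quick sketch:

partition_count = kafka.partitions_for("events")
puts "events topic has #{partition_count} partitions"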

Conclusion

Real-time ETL pipelines in Ruby can be efficiently built using Kafka for message streaming and WebSockets for real-time updates. By leveraging event-driven architectures, we can ensure low-latency data transformation and delivery for modern applications. Whether you're handling streaming ETL for financial data or real-time analytics, Ruby, Kafka, and WebSockets provide a robust foundation for building scalable, fault-tolerant, and asynchronous data pipelines.

If you're looking to implement real-time processing solutions or optimize your data pipelines, get in touch with Techdots. Our expertise in Ruby, Kafka, and event-driven systems ensures your projects are future-proof and performant. Let’s transform your data workflows together!
