Exciting Real-World Industrial Use Cases with Kafka
Kafka Boosts Developer Productivity!
These days, Kafka's popularity is on cloud nine. This article is a deep dive into the core concepts of Kafka and will help you get to grips with it thoroughly. Let's start!
Every modern business revolves around data. That’s precisely why a number of technologies, platforms, and frameworks have emerged to support advanced data management over the years. One such solution is Apache Kafka, a distributed streaming platform that’s designed for high-speed, real-time data processing.
To date, Kafka has seen wide adoption at thousands of companies worldwide. We took a closer look to understand what's so special about it and how different businesses can use it.
Broadly, Kafka accepts streams of events written by data producers. Kafka stores records chronologically in partitions across brokers (servers); multiple brokers make up a cluster. Each record contains information about an event and consists of a key-value pair; a timestamp and headers are optional additions. Kafka groups records into topics; data consumers get their data by subscribing to the topics they want.
Events
An event is a message with data describing the event. For example, when a new user registers with a website, the system creates a registration event, which may include the user’s name, email, password, location, and so on.
Consumers and Producers
A producer is anything that creates data. Producers constantly write events to Kafka. Examples of producers include web servers, other discrete applications (or application components), IoT devices, monitoring agents, and so on. For instance (a minimal producer sketch follows the list below):
- The website component responsible for user registrations produces a “new user is registered” event.
- A weather sensor (IoT device) produces hourly “weather” events with information about temperature, humidity, wind speed, and so on.
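To make the first example concrete, here is a minimal producer sketch in Java using the standard Kafka client. The broker address, topic name, key, and JSON payload are illustrative assumptions rather than values from a real deployment:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RegistrationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: user id; value: a small JSON payload describing the event (illustrative).
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "registration",
                    "user-42",
                    "{\"event\":\"new_user_registered\",\"email\":\"jane@example.com\"}");
            producer.send(record);  // asynchronous send; the client batches behind the scenes
            producer.flush();       // block until the record has actually been delivered
        }
    }
}
```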
Consumers are entities that use data written by producers. Sometimes an entity can be both a producer and a consumer; it depends on system architecture. For example, a data warehouse could consume data from Kafka, then process it and produce a prepared subset for rerouting via Kafka to an ML or AI application. Databases, data lakes, and data analytics applications generally act as data consumers, storing or analyzing the data they receive from Kafka.
Kafka acts as a middleman between producers and consumers. You can read our previous post to learn more about Kafka and event-driven architecture.
Brokers and Clusters
Kafka runs on clusters, although there is now a serverless version of Kafka in preview at AWS. Each cluster consists of multiple servers, generally called brokers (and sometimes called nodes).
That’s what makes Kafka a distributed system: data in the Kafka cluster is distributed amongst multiple brokers. And multiple copies (replicas) of the same data exist in a Kafka cluster. This mechanism makes Kafka more stable, fault-tolerant, and reliable; if an error or failure occurs with one broker, another broker steps in to perform the functions of the malfunctioning component, and the information is not lost.
Topics
A Kafka topic is an immutable, ordered log of events. Producers publish events to Kafka topics; consumers subscribe to topics to access their desired data. Each topic can serve data to many consumers. Continuing with our example, the registration component of the website publishes "new user" events (via Kafka) into the "registration" topic. Subscribers such as analytics apps, newsfeed apps, monitoring apps, databases, and so on in turn consume events from the "registration" topic and use them with other data as the foundation for delivering their own products or services.
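On the other side, a consumer subscribes to the topic and polls for new events. The sketch below assumes the same illustrative "registration" topic, a hypothetical "analytics-app" consumer group, and a local broker:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RegistrationConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "analytics-app");            // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("registration"));   // subscribe to the topic of interest
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d%n",
                            record.key(), record.value(), record.partition());
                }
            }
        }
    }
}
```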
Partitions
A partition is the smallest storage unit in Kafka. Partitions serve to split data across brokers to accelerate performance. Each Kafka topic is divided into partitions and each partition can be placed on a separate broker.
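The number of partitions (and the replication factor) is chosen when a topic is created. The following sketch uses Kafka's AdminClient to create the illustrative "registration" topic with three partitions, each replicated to two brokers; the broker address is an assumption:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateRegistrationTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions spread across the cluster, each replicated to two brokers.
            NewTopic registration = new NewTopic("registration", 3, (short) 2);
            admin.createTopics(List.of(registration)).all().get();  // wait until the topic exists
        }
    }
}
```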
SOME DEEP FACTS ABOUT KAFKA, A POWERFUL TOOL
Essentially, Kafka is an open source, distributed streaming platform that enables storing, reading, and analyzing data. It might not sound like much at first, but it’s actually a powerful tool capable of handling billions of events a day and still operating quickly, mostly due to its distributed nature.
Before it was passed to the community and the Apache Software Foundation in 2011, Kafka was originally created at LinkedIn to track the behavior of its users and build connections between them. Back then, it was meant to solve a real problem LinkedIn developers struggled with: low-latency ingestion of large volumes of event data. There was a clear need for real-time processing, but no existing solutions could really accommodate it.
That's how Kafka came to life. Since then, it has come a long way, evolving into a full-fledged distributed publish-subscribe messaging system that provides the backbone for building robust apps.
What is Apache Kafka used for?
According to Kafka’s core development team, there are a few key use cases that it’s designed for, including:
- Website activity tracking: With Kafka we can build a user activity tracking pipeline as a set of real-time publish-subscribe feeds. Website activity (page views, searches, or other actions users may take) is published to central topics and becomes available for real-time processing, dashboards, and offline analytics in data warehouses like Google BigQuery.
- Log aggregation: Kafka can be used across an organization to collect logs from multiple services and make them available in standard format to multiple consumers. It provides lower-latency processing and easier support for multiple data sources and distributed data consumption.
- Operational metrics: Kafka is often used in operational monitoring data pipelines, enabling alerting and reporting on operational metrics. It aggregates statistics from distributed applications and produces centralized feeds of operational data.
- Stream processing: While Kafka wasn't originally designed with event sourcing in mind, its design as a data streaming engine with replicated topics, partitioning, state stores, and streaming APIs is very flexible. Hence, it's possible to implement event sourcing on top of Kafka without much effort (see the Kafka Streams sketch after this list).
- Messaging: Kafka works well as a replacement for traditional message brokers; Kafka has better throughput, built-in partitioning, replication, and fault-tolerance, as well as better scaling attributes.
- Commit log: Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this respect, Kafka is similar to the Apache BookKeeper project.
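As an illustration of the stream processing use case, here is a small Kafka Streams sketch that counts registrations per country and writes the running totals to an output topic. The topic names, application id, and the assumption that each event value is a plain country code are all illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RegistrationsByCountry {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "registrations-by-country"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Assumption: the value of each "registration" event is a plain country code.
        KStream<String, String> registrations = builder.stream("registration");
        registrations
                .groupBy((userId, country) -> country)  // re-key the stream by country
                .count()                                // maintain a running count per country
                .toStream()
                .mapValues(count -> count.toString())
                .to("registrations-per-country");       // publish running totals to an output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```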
Whenever there’s a need for building real-time streaming apps that need to process or react to “chunks” of data, or reliably transfer data between systems or applications — Kafka comes to the rescue.
It’s one of the reasons why Kafka works well with banking and financial applications, where transactions have to be processed in a specific order. The same applies to transport & logistics, as well as retail — especially when there are IoT sensors involved. Within these industries, there’s often a need for constant monitoring, real-time & asynchronous applications (i.e., inventory checks), advanced analytics and system integration, just to name a few.
In fact, any business that wants to leverage data analytics and complex tool integration (for example between CRM, POS and e-commerce applications) can benefit from Kafka. It’s precisely where it fits well into the equation.
Event streaming is happening all over the world. This blog post explores real-life examples across industries for use cases and architectures leveraging Apache Kafka. Learn about architectures for real-world deployments from Audi, BMW, Disney, Generali, PayPal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track & trace, live betting, and much more.
Usage
So what is Kafka used for? Unsurprisingly, this data streaming platform has a huge number of possibilities — and once a company starts using it for one data need, it typically wants to use it for all its other data needs.
- Data Movement — data replication & ETL, across platforms or clouds
- Message queue — passing raw data between applications
- Event streams — tracking triggered events or data changes
- Service bus — messaging backbone between microservices
- Logging & Telemetry — pushing measurements out, asynchronously
- Transaction processing — allows for exactly-once semantics (see the transactional producer sketch after this list)
- Data archival — push data to external storage
- Event processing — feed data into real-time query and analytical systems
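As a sketch of the transaction processing item above, the Java producer below writes two related records atomically: either both become visible to read_committed consumers, or neither does. The broker address, transactional id, and "payments" topic are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalTransfer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumed broker address
        props.put("transactional.id", "payments-producer-1");  // hypothetical transactional id
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both records are committed together, or the transaction is rolled back.
                producer.send(new ProducerRecord<>("payments", "account-1", "debit:100"));
                producer.send(new ProducerRecord<>("payments", "account-2", "credit:100"));
                producer.commitTransaction();
            } catch (RuntimeException e) {
                producer.abortTransaction();  // roll back if anything failed mid-transaction
                throw e;
            }
        }
    }
}
```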
Apache Kafka is an event streaming platform. It provides messaging, persistence, data integration, and data processing capabilities. High scalability for millions of messages per second, high availability including backward-compatibility and rolling upgrades for mission-critical workloads, and cloud-native features are some of the capabilities.
Hence, the number of different use cases is almost endless. If you learn one thing from the examples in this blog post, remember that Kafka is not just a messaging system! While data ingestion into a Hadoop data lake was the first prominent use case, it represents less than 5% of actual Kafka deployments today. Business applications, streaming ETL middleware, real-time analytics, and edge/hybrid scenarios are some of the other examples.
Apache Kafka — the Next-Gen Event Streaming System!
Companies that leverage Apache Kafka
Kafka is used heavily in the big data space as a reliable way to ingest and move large amounts of data very quickly.
According to StackShare, there are 741 companies that use Kafka, among them Uber, Netflix, Activision, Spotify, Slack, Pinterest, Coursera, and of course LinkedIn. Want to find out more about some of the use cases?
It shouldn't come as a surprise that Kafka still forms a core part of LinkedIn's infrastructure. It's used mostly for activity tracking, message exchanges, and metric gathering, but the list of use cases doesn't end there. Most of the data communication between different services within the LinkedIn environment utilizes Kafka to some extent.
At the time of writing, LinkedIn maintains more than 100 Kafka clusters with over 4,000 brokers, serving 100,000 topics and millions of partitions. The total number of messages handled by LinkedIn's Kafka deployments has already surpassed 7 trillion per day.
Even though no other service uses Kafka at LinkedIn’s scale, plenty of other applications, companies, and projects take advantage of it. At Uber, for example, many processes are modelled with Kafka Streams — including customer & driver matching and ETA calculations. Netflix also embraced the multi-cluster Kafka architecture for stream processing and now seamlessly handles trillions of messages each day.
Real-time data processing
Many systems require data to be processed as soon as it becomes available. Kafka transmits data from producers to consumers with very low latency (5 milliseconds, for instance); a sample low-latency producer configuration follows the list below. This is useful for:
- Financial organizations, to gather and process payments and financial transactions in real-time, block fraudulent transactions the instant they’re detected, or update dashboards with up-to-the-second market prices.
- Predictive maintenance (IoT), in which models constantly analyze streams of metrics from equipment in the field and trigger alarms immediately after detecting deviations that could indicate imminent failure.
- Autonomous mobile devices, which require real time data processing to navigate a physical environment.
- Logistical and supply chain businesses, to monitor and update tracking applications, for example to keep constant tabs on cargo vessels for real-time cargo delivery estimates.
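As a rough illustration of latency tuning, the snippet below shows producer settings that favor low latency over throughput (no batching delay, leader-only acknowledgements, no compression). The exact values depend on the workload, and the broker address is an assumption:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class LowLatencyProducerProps {
    // Producer settings tuned for latency rather than throughput (illustrative values).
    public static Properties lowLatencyProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");            // do not wait to fill batches
        props.put(ProducerConfig.ACKS_CONFIG, "1");                 // wait only for the partition leader
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");  // skip compression to save CPU time
        return props;
    }
}
```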
How Data-Intensive Companies Use Upsolver to Analyze Kafka Streams in Data Lakes
ETL pipelines for Apache Kafka can be challenging. In addition to the basic task of transforming the data, you also must account for the unique characteristics of event stream data. Upsolver's declarative, self-orchestrating data pipeline platform sharply reduces the time to build and deploy pipelines for event data. Using declarative SQL commands enables you to build pipelines without knowledge of programming languages such as Scala or Python. It also frees you from having to manually address the "ugly plumbing" of orchestration (DAGs), table management and optimization, or state management, saving you substantial data engineering hours and heartache.
3 industries relying on Apache Kafka
The open-source messaging platform, built for website activity tracking, has migrated into new industries and new use cases.
Kafka lets enterprises from these verticals take everything happening in their company and turn it into real-time data streams that multiple business units can subscribe to and analyze. For these companies, Kafka acts as a replacement for traditional data stores that were siloed to single business units and as an easy way to unify data from all different systems.
Kafka has moved beyond IT operational data and is now also used for data related to consumer transactions, financial markets, and customer data. Here are three ways different industries are using Kafka.
Internet services
A leading internet service provider (ISP) is using Kafka for service activation. When new customers sign up for internet service by phone or online, the hardware they receive must be validated before it can be used. The validation process generates a series of messages, then a log collector gathers that log data and delivers it to Kafka, which sends the data into multiple applications to be processed.
Financial services
Global financial services firms need to analyze billions of daily transactions to look for market trends and stay on top of rapid and frequent changes in financial markets. One firm used to do that by collecting data from multiple business units after the close of the market, sending it to a vast data lake, then running analytics on the captured data.
Entertainment
An entertainment company with an industry-leading gaming platform must process in real time millions of transactions per day and ensure it has a very low drop rate for messages. In the past, it used Apache Spark, a powerful open-source processing engine, and Hadoop, but it recently switched to Kafka.
Netflix use cases
Netflix uses Kafka as the gateway for data collection for all applications, requiring hundreds of billions of messages to be processed daily. For example, it uses Kafka for unified event publishing, collection, and routing for batch and stream processing, as well as ad hoc messaging.
In the cloud and on-premises, Kafka has moved beyond its initial function of website activity tracking to become an industry standard, providing a reliable, streamlined messaging solution for companies in a wide range of industries.
Modernized Security Information and Event Management (SIEM)
SIEM and cybersecurity are getting more important across industries. Kafka is used as an open and scalable data integration and processing platform, often combined with other SIEM solutions, streaming machine learning, and stateful stream processing.
The following covers a few architectures and use cases. The presentation afterward goes into much more detail and examples from various companies about these and other use cases from various industries:
- Insurance
- Manufacturing
- Automotive
- Telecom
- Retail / Transportation / Logistics
- Gaming
- Healthcare / Pharma
- Financial Services
DEEP DIVE FACTS ABOUT KAFKA!
This kind of centralized data pipeline, as Confluent puts it so eloquently in their S-1, becomes a real-time central nervous system for modern digital enterprises. In fact, you'll hear a lot of fancy phrases in descriptions of Kafka (either from Confluent or the thousands of companies that use it), such as:
- “real-time central nervous system”
- “allows for data in motion”
- “Nexus of real-time data”
- “Much-needed glue between disparate systems.”
- “Intelligent connective tissue”
- “Centralized data plumbing”
- “Backbone of a data platform”
Bridge to Cloud, Hybrid, or Multi-Cloud Scenarios
Users can run more than one cluster and interconnect them across on-prem, hybrid, or cloud environments in order to replicate or share data between them. Kafka integrates with a wide variety of serverless, storage, and data services across the three major cloud vendors and SaaS providers, allowing connectivity regardless of environment and at massive scale.
Fraud Detection
Create analytical capabilities over real-time event streams for fraud detection and risk analysis. Kafka can integrate with Apache Spark Streaming or any number of open-source stream processing tools, or perform the analytics inline with ksqlDB stream processing. Gain real-time visibility into patterns of fraud, abuse, spam, or other bad behavior.
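The same idea can be sketched with Kafka Streams in place of the ksqlDB approach mentioned above: read a stream of transactions, flag the ones that exceed a threshold, and route them to a topic consumed by a fraud-analysis service. The topic names, the threshold, and the assumption that values are plain numeric amounts are all illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SuspiciousTransactionFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-filter");      // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Assumption: keys are account ids and values are plain numeric amounts.
        KStream<String, String> transactions = builder.stream("transactions");
        transactions
                .filter((accountId, amount) -> Double.parseDouble(amount) > 10_000)
                .to("suspicious-transactions"); // consumed by a downstream fraud-analysis service

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```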
Event-driven microservices
Serves as the backbone for event streaming over microservices. Integrates with serverless engines and storage across the 3 major cloud vendors. But besides input handling from an event stream, also helps with output handling, tying microservices into other data flows around log aggregation, ETL, SIEM, or real-time analytics.
Streaming ETL
Turn batch ETL processes into real-time streams. Can be used for data movement, including enrichment, cleaning, or processing while in transit. Integrates with a wide variety of inputs and outputs.
SIEM Optimization
Break down silos, via integration with a wide variety of systems and services. Aggregate, filter, or transform data in transit, reducing the overall size of the end data. Export the centralized stream of security data into a wide variety of SIEM platforms (Elasticsearch, Datadog, Splunk), and even SOAR systems that automate response. Turn batch processes into real-time anomaly and threat detection, using stream processing capabilities.
Internet of Things
Handles the enormous flow of incoming data from dispersed IoT devices (POS systems, cameras, sensors, smart buildings/cities, connected vehicles). Bi-directional data flows can push events back into operational technology (OT) control systems. Transform, filter or enrich the data in transit, reducing the overall size of the end data. Derive real-time insights via stream processing. Operate globally, across multiple cloud environments.
Nearly every high-tech company out there utilizes or supports Kafka, including many of the companies I follow on this blog. Some examples:
- LinkedIn wrote in 2019 about how it uses Kafka, processing 7 trillion messages daily across 100 clusters.
Kafka is used extensively throughout our software stack, powering use cases like activity tracking, message exchanges, metric gathering, and more. We maintain over 100 Kafka clusters with more than 4,000 brokers, which serve more than 100,000 topics and 7 million partitions. The total number of messages handled by LinkedIn’s Kafka deployments recently surpassed 7 trillion per day.
- Zscaler has mentioned Kafka and Kafka Streams as part of their stack, per a developer job listing.
- Datadog showed us in 2019 that they make heavy use of Kafka, using 40 clusters to handle trillions of data points daily across their globally distributed infrastructure. Also, they have plugins on their observability platform to help companies maintain their Kafka infrastructure.
- CrowdStrike wrote in 2020 about how they use Kafka as a vital part of their cybersecurity infrastructure, ingesting over 500B events a day — trillions per week — in real time.
At CrowdStrike, we use Kafka to manage more than 500 billion events a day that come from sensors all over the world. That equates to a processing time of just a few milliseconds per event. Our approach of using concurrency and error handling — which helps us avoid mass multiple failures — is incredibly important to the overall performance of our system and our customers’ security.
- Netflix wrote in 2020 about the technical details of how they utilize Kafka to drive their microservice-oriented stack.
- Cloudflare has blogged repeatedly about its extensive use of Kafka across its globally distributed stack. Their large Kafka cluster was handling 8.5M events per second. Around that time, they also detailed the technical specifics of improving performance of their Kafka clusters by experimenting with compression methods. By 2020, they showed that their raw logs were ingesting 360TB per day via Kafka.
- Snowflake developed a Kafka connector to directly import data streams into their data platform through their Snowpipe importer tool.