
Kafka Producer: Publish Message with Avro Schema

Overview

This article describes the key components and the setup required to publish an Avro-serialized message to a Kafka topic. Summary of what this article covers:

  • The need for serialization and why to use Avro
  • Setting up Kafka along with the other required components on a local macOS machine
  • Auto-generating Java classes from a predefined Avro schema using avro-tools
  • Writing a Kafka producer for publishing messages to the topic

Why Avro

Before we get into why Avro is required, we need to understand why we serialize data in the first place. Serialization is the translation of an object's values/state (e.g. a Java object) into bytes so that it can be sent over the network or saved to disk. The data needs to be sent in binary, i.e. bits-and-bytes, form because the network infrastructure and the hard disk do not understand language-level objects (e.g. Java objects). In the case of Kafka, messages are transferred over the network and stored in a file corresponding to the given topic partition, hence serialization of messages is required.

Having understood the need for serializing the data, the next question is why not use a textual format like JSON, XML, or CSV for the serialized message, given that these formats are language agnostic and can be read in any programming language. Key challenges with textual formats:

  • Ambiguity around serialization of numbers. For example, JSON does not distinguish between integers and floats
  • No support for binary strings (a binary string is a sequence of bytes; unlike a character string, which usually contains text, a binary string holds non-textual data such as pictures)
  • Serialized messages are bigger in size because these formats do not prescribe a schema, so every message carries its own field names

Avro, Thrift, MessagePack, etc. are binary serialization formats. They are generally used to exchange data/objects within an organization. Advantages of using a binary serialization format:

  • Message is compact
  • Schema evolution
  • Forward and Backward compatibility of schema

The choice of binary serialization format depends on the application use case.

For this article we have chosen Avro, as it is supported out of the box by Confluent Kafka and its built-in libraries.

Pre-requisites

For publishing data to a Kafka topic in Avro format we will require the components below:

  • Zookeeper
  • Kafka Brokers
  • Schema Registry
  • Avro Tool Jar
  • Docker Desktop
  • Java 8+
  • Gradle
  • Kafka CLI Toolsets

Setup

  • Copy the Docker Compose file into your project directory
  • Run docker-compose up -d --build

The above two steps will download the Docker images for Zookeeper, Kafka, and Schema Registry and spin up the Kafka cluster. You can access the cluster at port 29092.
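For reference, a minimal docker-compose.yml for such a setup might look like the sketch below. The image versions, service names, and environment variables here are assumptions based on the standard Confluent images; the actual file shipped with the project may differ:

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.2.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:6.2.0
    depends_on: [zookeeper]
    ports:
      - "29092:29092"   # host-facing listener, as used by the producer below
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,PLAINTEXT_HOST://0.0.0.0:29092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  schema-registry:
    image: confluentinc/cp-schema-registry:6.2.0
    depends_on: [kafka]
    ports:
      - "8081:8081"     # Schema Registry endpoint used by the Avro serializer
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:9092
```

The two listeners exist because clients on the host reach the broker via localhost:29092, while the Schema Registry container reaches it via the internal kafka:9092 address.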

Define the Avro Schema

Let's consider a simple object, Item, having three properties:

Property Name    Data Type
Name             string
Category         string
Price            float

The resulting Avro schema file for the Item object can be found here:

https://github.com/deadzg/Kafka-Integration/blob/master/kafka-explore/item.avsc
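For illustration, a schema for the three properties above might look like the sketch below. The namespace and the field casing are assumptions (the namespace is guessed from the project's package path), so refer to the linked item.avsc for the authoritative version:

```json
{
  "type": "record",
  "name": "Item",
  "namespace": "com.smalwe.kafka.explore",
  "fields": [
    {"name": "name",     "type": "string"},
    {"name": "category", "type": "string"},
    {"name": "price",    "type": "float"}
  ]
}
```

The namespace field matters because avro-tools uses it as the Java package of the generated class, as noted in the next section.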

Generate Java Class from AVRO Schema file

Once the Avro schema is defined, we need to generate the corresponding Java class.

For this we require the avro-tools jar; download it from the Maven repository (see Resources below).

Generate the Java class using the below command (the final argument, ., is the output directory):

java -jar avro-tools-1.10.2.jar compile schema item.avsc .

Copy the generated Java file to the specific package.

Note: The package into which you copy this file should match the one given in the namespace field of the .avsc file, e.g. https://github.com/deadzg/Kafka-Integration/blob/master/kafka-explore/item.avsc#L2

Kafka Producer with Avro Schema Messages

For producing a message to the Kafka topic, we need to set the producer properties first.

The minimum required properties are listed below:

Producer Property                                     Sample Value
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG               "localhost:29092"
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG            StringSerializer.class.getName()
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG          KafkaAvroSerializer.class.getName()
KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG  "http://localhost:8081"

Once the values for the above properties are identified, we can write the code to produce messages to the Kafka topic:

https://github.com/deadzg/Kafka-Integration/blob/master/kafka-explore/src/main/java/com/smalwe/kafka/explore/ItemAvroProducer.java
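The linked class is the reference implementation. As a rough sketch, the property setup might look like the code below; the topic name, helper method name, and the commented-out send section are illustrative assumptions, and the string keys are the constants behind ProducerConfig.* and KafkaAvroSerializerConfig.* so the sketch compiles without the Kafka jars on the classpath:

```java
import java.util.Properties;

public class ItemAvroProducerSketch {

    // Build the minimum producer properties from the table above.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:29092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps();
        System.out.println(props.getProperty("bootstrap.servers"));

        // With kafka-clients and kafka-avro-serializer on the classpath,
        // the actual send would look roughly like (Item is the class
        // generated from item.avsc; "item-topic" is a hypothetical name):
        //
        // KafkaProducer<String, Item> producer = new KafkaProducer<>(props);
        // Item item = Item.newBuilder()
        //         .setName("laptop")
        //         .setCategory("electronics")
        //         .setPrice(999.99f)
        //         .build();
        // producer.send(new ProducerRecord<>("item-topic",
        //         item.getName().toString(), item));
        // producer.close();
    }
}
```

On the first send, the KafkaAvroSerializer registers the Item schema with the Schema Registry at the configured URL and prefixes each message with the schema id, which is how consumers later look up the schema for deserialization.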

Conclusion

In this article we have seen why serialization is required and what options are available. We then covered the tools required to publish an Avro-serialized message to a Kafka topic, along with the reference code.

Resources

Kafka Client Maven Repository: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.8.0
Kafka Producer Configuration: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html
Avro Usage: https://avro.apache.org/docs/current/gettingstartedjava.html
Download Avro Tools Jar: https://mvnrepository.com/artifact/org.apache.avro/avro-tools/1.10.2
Why Serialization: https://stackoverflow.com/questions/2475448/what-is-the-need-of-serialization-in-java
Binary String: https://www.ibm.com/docs/en/i/7.3?topic=types-binary-strings
