Start Docker. Once it is running, open a terminal and run the following commands:

This command downloads the apachepulsar/pulsar:3.1.1 image and runs a broker in standalone mode:

    $ docker run -it -p 6650:6650 -p 8080:8080 --mount source=pulsardata,target=/pulsar/data --mount source=pulsarconf,target=/pulsar/conf apachepulsar/pulsar:3.1.1 bin/pulsar standalone
    Pull complete ...

Let's check that the container is up. We can see that a container named awesome_vaughan is running:

    $ docker ps -a
    CONTAINER ID   IMAGE                       COMMAND                  CREATED         STATUS         PORTS                                            NAMES
    16f876f10f07   apachepulsar/pulsar:3.1.1   "bin/pulsar standalo…"   6 minutes ago   Up 6 minutes   0.0.0.0:6650->6650/tcp, 0.0.0.0:8080->8080/tcp   awesome_vaughan

In a new terminal tab, we can attach to the container to follow its logs:

    $ docker attach awesome_vaughan
    ...
    2023-12-04T10:31:34,059+0000 [ForkJoinPool.commonPool-worker-2] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [04/Dec/2023:10:31:34 +0000] "GET /admin/v2/persistent/public/functions/coordinate/stats?getPreciseBacklog=false&subscriptionBacklogSize=false&getEarliestTimeInBacklog=false HTTP/1.1" 200 2597 "-" "Pulsar-Java-v3.1.1" 3
    2023-12-04T10:31:34,064+0000 [ForkJoinPool.commonPool-worker-2] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [04/Dec/2023:10:31:34 +0000] "GET /admin/v2/persistent/public/functions/coordinate/stats?getPreciseBacklog=false&subscriptionBacklogSize=false&getEarliestTimeInBacklog=false HTTP/1.1" 200 2597 "-" "Pulsar-Java-v3.1.1" 2
    ...

Let's open a bash session inside the container. The command bin/pulsar-admin namespaces list apache lists every namespace in the tenant named apache. I have not created that tenant, so the command answers "Tenant not found":

    $ docker exec -it awesome_vaughan bash
    I have no name!@16f876f10f07:/pulsar$ ls
    LICENSE NOTICE README bin conf data download examples instances lib logs packages-storage trino
    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin namespaces list apache
    Tenant not found

Listing all tenants shows that two already exist, public and pulsar:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin tenants list
    public
    pulsar

I will create my own tenant, named my-tenant-alonso. You can think of a tenant as your organization:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin tenants create my-tenant-alonso
    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin tenants list
    my-tenant-alonso
    public
    pulsar

Create a namespace named pulsar-alonso under the tenant my-tenant-alonso, then list the namespaces of that tenant:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin namespaces create my-tenant-alonso/pulsar-alonso
    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin namespaces list my-tenant-alonso
    my-tenant-alonso/pulsar-alonso

Create test-topic with 4 partitions in the namespace pulsar-alonso under the tenant my-tenant-alonso:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin topics create-partitioned-topic my-tenant-alonso/pulsar-alonso/test-topic -p 4

List all the partitioned topics in the namespace my-tenant-alonso/pulsar-alonso:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-admin topics list-partitioned-topics my-tenant-alonso/pulsar-alonso
    persistent://my-tenant-alonso/pulsar-alonso/test-topic

Produce 10 messages to the topic my-tenant-alonso/pulsar-alonso/test-topic:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-client produce my-tenant-alonso/pulsar-alonso/test-topic -m "----------hello Pulsar------" -n 10
    2023-12-04T11:19:05,588+0000 [pulsar-client-io-1-3] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xe1a4c05e, L:/127.0.0.1:41320 - R:localhost/127.0.0.1:6650]] Connected to server
    2023-12-04T11:19:05,771+0000 [pulsar-client-io-1-3] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Starting Pulsar producer perf with config: {"topicName":"my-tenant-alonso/pulsar-alonso/test-topic","producerName":null,"sendTimeoutMs":30000,"blockIfQueueFull":false,"maxPendingMessages":1000,"maxPendingMessagesAcrossPartitions":50000,"messageRoutingMode":"RoundRobinPartition","hashingScheme":"JavaStringHash","cryptoFailureAction":"FAIL","batchingMaxPublishDelayMicros":1000,"batchingPartitionSwitchFrequencyByPublishDelay":10,"batchingMaxMessages":1000,"batchingMaxBytes":131072,"batchingEnabled":true,"chunkingEnabled":false,"chunkMaxMessageSize":-1,"encryptionKeys":[],"compressionType":"NONE","initialSequenceId":null,"autoUpdatePartitions":true,"autoUpdatePartitionsIntervalSeconds":60,"multiSchema":true,"accessMode":"Shared","lazyStartPartitionedProducers":false,"properties":{},"initialSubscriptionName":null}
    ...

Consume the messages from that topic in another terminal, using a subscription named sub. The option -n 0 keeps the consumer running until it is interrupted. As far as I can tell, a new subscription starts at the latest message by default, so run the producer again once the consumer is connected:

    I have no name!@16f876f10f07:/pulsar$ bin/pulsar-client consume -s sub my-tenant-alonso/pulsar-alonso/test-topic -n 0
    2023-12-04T11:26:06,457+0000 [pulsar-client-io-1-3] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xf4d7e549, L:/127.0.0.1:44758 - R:localhost/127.0.0.1:6650]] Connected to server
    ...
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
    ----- got message -----
    key:[null], properties:[], content:----------hello Pulsar------
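
The walkthrough uses two spellings of the same topic: the short form my-tenant-alonso/pulsar-alonso/test-topic and the full form persistent://my-tenant-alonso/pulsar-alonso/test-topic. The sketch below, in plain Python with no Pulsar client involved, shows how I understand the short form to expand, and how a partitioned topic maps to internal per-partition topics. The names TopicName, parse_topic_name, and partition_names are my own helpers for illustration, not part of any Pulsar API.

```python
from typing import List, NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(name: str) -> TopicName:
    """Expand a short or full topic name into domain, tenant, namespace, topic."""
    domain = "persistent"                 # assumed default when no scheme is given
    if "://" in name:
        domain, name = name.split("://", 1)
    parts = name.split("/")
    if len(parts) == 1:                   # "topic" alone lives in public/default
        return TopicName(domain, "public", "default", parts[0])
    if len(parts) == 3:                   # "tenant/namespace/topic"
        return TopicName(domain, *parts)
    raise ValueError(f"unsupported topic name: {name!r}")


def partition_names(topic: TopicName, partitions: int) -> List[str]:
    """Internal topic names behind a partitioned topic (one per partition)."""
    base = f"{topic.domain}://{topic.tenant}/{topic.namespace}/{topic.topic}"
    return [f"{base}-partition-{i}" for i in range(partitions)]


if __name__ == "__main__":
    name = parse_topic_name("my-tenant-alonso/pulsar-alonso/test-topic")
    for internal in partition_names(name, 4):
        print(internal)
```

Running it prints the four internal names ending in -partition-0 to -partition-3, which is what the 4-partition test-topic created above is split into.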

You can also produce and consume messages from Java and Python. I will write about that in another post.

Similarities and differences between Apache Kafka and Apache Pulsar:

To sum up:

We can think of Apache Pulsar as a competitor to Apache Kafka, since both serve much the same purpose, but there are some differences. Apache Pulsar is first and foremost a distributed, scalable messaging technology that also covers streaming, while Apache Kafka is first and foremost a distributed, scalable data streaming technology.
Both preserve message order within a partition. If you also need queue-style delivery, where messages are acknowledged one by one and work is shared among consumers, Pulsar gives you that through its subscription types. If you mainly need to publish and consume streaming data and value the larger community, Kafka may be a better choice.

Feature              Apache Kafka    Apache Pulsar
Type                 Streaming       Messaging and streaming
Popularity           More popular    Less popular
Community            Larger          Smaller
Durability           Yes             Yes
High availability    Yes             Yes
Scale horizontally   Yes             Yes
Cost                 Free            Free

Here is a series of questions and answers that try to pin down the similarities and differences between Apache Kafka and Apache Pulsar:

QUESTION:

Is there really a difference between messaging and streaming data? In my eyes they are almost the same thing: messages are delivered to a mailbox, a topic, which may be partitioned and distributed among different brokers, and those messages may have a schema that describes their format so they can be written and read efficiently.

ANSWER:

Messaging and data streaming are two related concepts, but they are not exactly the same.

Messaging refers to the sending of messages between two or more parties. Messages can be of any size and format, and can contain any type of data. Messaging is often used for communication between applications, services or devices.

Data streaming refers to the transmission of data on a continuous basis. Data can be transmitted in real time or can be stored for further processing. Data streaming is often used for data collection from sensors, social networks or mobile applications.

The main difference between messaging and data streaming is the way data is delivered. In messaging, data is delivered in the form of individual messages. Each message has a header that contains information about the message, such as the sender, the receiver and the type of data. The body of the message contains the actual data.

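In data streaming, data is delivered as a continuous, ordered stream of records, which is usually retained so that consumers can read it again. On platforms such as Kafka and Pulsar each record in the stream is still a discrete message, with its own key, headers or properties, and payload.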

Another difference between messaging and data streaming is the use of data. In messaging, data is typically used for communication between applications or services. In data streaming, data is typically used for analysis or visualisation.

In your description, you are describing a messaging and streaming platform. This platform uses a messaging approach to deliver data. Messages are delivered to a topic, which is a shared message queue. Messages can have a schema to describe their format.

This approach is suitable for applications that require both messaging and streaming. For example, a data integration application could use this platform to send data from one source to another.

However, there are other data streaming platforms that use a different approach. These platforms deliver data in the form of a continuous stream. This approach can be more efficient for applications that require processing large amounts of data in real time.

Ultimately, the best platform for a given application will depend on the specific needs of the application.

QUESTION:

So, with Pulsar, the order in which messages are delivered is important, whereas in Kafka it is not?

ANSWER:

Not quite. Both platforms guarantee message order within a partition, and neither guarantees a total order across the partitions of a topic.

In Pulsar, whether consumers see that order depends on the subscription type. With an Exclusive or Failover subscription, a single consumer reads each partition and receives its messages in the order in which they were published. With a Shared subscription, messages are distributed among several consumers and order is not guaranteed. With a Key_Shared subscription, order is preserved among messages that have the same key.

In Kafka, each partition is read by at most one consumer within a consumer group, so the messages of a partition are received in the order in which they were written. By default, messages with the same key go to the same partition, which keeps them in order relative to each other.

The following table summarises the general rules for message delivery order in Pulsar and Kafka:

Platform   Message delivery order
Pulsar     Guaranteed per partition with Exclusive and Failover subscriptions, per key with Key_Shared, not guaranteed with Shared
Kafka      Guaranteed per partition, not across partitions

In practice, the importance of message delivery order depends on the specific needs of the application. If the application requires every message of a topic to be received in strict order, use a single partition, or one key per ordered sequence, on either platform. If it needs queue-style work distribution as well as ordered streams, Pulsar's subscription types are a good fit. If ordered partitions are enough, Kafka may be a better choice.

Here are some examples of how the order of message delivery can be important:

  • Data integration: If a data integration application needs to combine data from different sources, then it is important that the data is received in the same order in which it was sent.
  • Notifications: If a notification application needs to send notifications to users in a specific order, then it is important that the messages are received in the same order in which they were sent.
  • Data analysis: If a data analysis application needs to analyse data in real time, then it is important that the data is received in a specific order.
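
To make partition-level ordering concrete, here is a small simulation in plain Python. It does not use the Pulsar or Kafka client libraries. It assumes a simplified router: messages without a key are spread round-robin over the partitions, and messages with a key go to the partition chosen by a hash of the key. The hash is a port of Java's String.hashCode(), loosely modelled on the JavaStringHash scheme that appears in the producer log earlier; Router, publish, and java_string_hash are names I made up, and the real clients also take batching into account.

```python
from collections import defaultdict
from itertools import count
from typing import Dict, List, Optional, Tuple

NUM_PARTITIONS = 4  # same partition count as test-topic in the walkthrough


def java_string_hash(key: str) -> int:
    """Port of Java's String.hashCode() for ASCII keys, kept non-negative."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # wrap to 32 bits like a Java int
    return h & 0x7FFFFFFF                      # clear the sign bit


class Router:
    """Keyless messages go round-robin; keyed messages go to hash(key) % partitions."""

    def __init__(self, partitions: int) -> None:
        self.partitions = partitions
        self._next = count()

    def choose(self, key: Optional[str]) -> int:
        if key is None:
            return next(self._next) % self.partitions
        return java_string_hash(key) % self.partitions


def publish(messages: List[Tuple[Optional[str], str]],
            partitions: int = NUM_PARTITIONS) -> Dict[int, List[str]]:
    """Append each (key, payload) to its partition and return the partition logs."""
    router = Router(partitions)
    logs: Dict[int, List[str]] = defaultdict(list)
    for key, payload in messages:
        logs[router.choose(key)].append(payload)   # appending keeps per-partition order
    return dict(logs)


if __name__ == "__main__":
    print(publish([(None, f"m{i}") for i in range(8)]))
    print(publish([("user-a", "a1"), ("user-b", "b1"), ("user-a", "a2")]))
```

In the first call each partition keeps its own messages in publish order, but a consumer reading all four partitions has no guarantee about how they interleave. In the second call both user-a messages land in the same partition, so their relative order survives.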

Message ordering semantics, and in particular Pulsar's subscription types, are one of the features that differentiate Pulsar and Kafka.

Other characteristics that may be important for the choice of platform include:

  • Popularity: Kafka is more popular than Pulsar, which means it has a larger user base and a more active community.
  • Cost: Both are free, open-source Apache projects under the Apache License 2.0. Some tools in the wider Kafka ecosystem, such as Confluent's commercial features, carry other licences.
  • Scalability: Both Pulsar and Kafka can scale horizontally to meet the needs of high-traffic applications.
  • Durability: Both platforms store messages durably. Pulsar separates storage from serving by persisting messages in Apache BookKeeper, so brokers and storage can be scaled independently.

QUESTION:

As of today, Pulsar only covers the lowest level of the pyramid of functionality that Kafka offers, right? At the bottom are the producers and consumers, then comes Kafka Streams, above it ksqlDB, and at the top Kafka Connect. Apache Pulsar, for now, seems to focus on the base of the pyramid, producers and consumers. Is that right?

ANSWER:

Not entirely. Pulsar's core focus is message delivery, and the ecosystem around Kafka is broader and more mature, but Pulsar has counterparts for several of these layers. The Kafka side includes:

  • Kafka Streams: a client library for building stream processing applications that read from and write to Kafka.
  • ksqlDB: a SQL engine for streaming data, developed by Confluent on top of Kafka Streams.
  • Kafka Connect: a framework for data integration between Kafka and other systems.

Pulsar offers lightweight stream processing through Pulsar Functions, connectors through Pulsar IO, and SQL queries over topic data through Pulsar SQL, which is based on Trino (note the trino directory in the container listing earlier).

However, there are still some areas where Kafka offers more functionality than Pulsar. For example, Kafka Connect offers a broader range of connectors than Pulsar IO.

Ultimately, the best platform for a given application will depend on the specific needs of the application. If the application requires specific functionality that Pulsar does not offer, then Kafka may be a better option.

QUESTION:

That is, does Apache Pulsar have something equivalent to Kafka Streams and ksqlDB?

ANSWER:

Partly. Apache Pulsar has components that cover similar ground to Kafka Streams and ksqlDB, although they are not drop-in equivalents.

Kafka Streams is a client library for building stream processing applications. Pulsar offers stream processing through Pulsar Functions. Functions are small pieces of code that consume messages from one or more topics, apply your logic to each message, and can publish the result to another topic. They are run by Pulsar's function workers, which can run alongside the brokers, so you do not need to deploy a separate processing service.
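
As an illustration of how small a function can be, here is a minimal sketch in Python. It assumes the plain-function style of Pulsar Functions, in which, as I understand it, the runtime calls a function named process once per message and publishes whatever it returns to the output topic. Deployment (done with pulsar-admin functions create) and the input and output topics are left out, and the local check at the bottom runs without a broker.

```python
def process(input):
    """Receive one message payload and return the value to publish downstream."""
    # Append an exclamation mark to every incoming message.
    return "{}!".format(input)


if __name__ == "__main__":
    # Local check only: no broker or Pulsar library is involved here.
    print(process("----------hello Pulsar------"))
```

The parameter is called input to match the convention I have seen in the Pulsar examples, even though it shadows the Python builtin of the same name.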

ksqlDB is a SQL engine for streaming data. Pulsar's counterpart is Pulsar SQL, which uses Trino to run SQL operations such as selection, filtering, and aggregation over the messages stored in topics. Unlike ksqlDB, it queries stored data on demand and does not run continuous streaming queries.

Overall, Pulsar covers similar ground to Kafka Streams and ksqlDB, but there are still some differences. Kafka Streams offers a broader range of stream processing operations than Pulsar Functions, and ksqlDB offers streaming SQL features that Pulsar SQL does not.

Ultimately, the best platform for a given application will depend on the specific needs of the application. If the application requires specific functionality that Pulsar does not offer, then Kafka may be a better option.

QUESTION:

Is Apache Pulsar intended to be used in production? Is it in production by any company?

ANSWER:

Yes, Apache Pulsar is intended to be used in production. It was originally developed at Yahoo, where it ran in production before it was open-sourced in 2016, and it has been a top-level Apache Software Foundation project since 2018.

The company names and use cases given in this answer (Bloomberg, Spotify, Netflix, and Twitter) are unverified. Check them against the adopters listed on the Pulsar website before citing them.

Pulsar offers a number of features that make it suitable for use in production, including:

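  • Scalability: Pulsar can be scaled horizontally to meet the needs of high-traffic applications.
  • Durability: Pulsar persists messages in Apache BookKeeper and replicates them across storage nodes, so that data is not lost when a node fails.
  • Security: Pulsar supports TLS encryption, pluggable authentication (for example JWT tokens), and role-based authorization.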

Overall, Apache Pulsar is a solid streaming and messaging platform that is suitable for use in production.

QUESTION:

If ordered delivery is guaranteed in Pulsar but not in Kafka, could we argue that write and read performance on distributed, partitioned topics is better in Apache Kafka, even once we count the cost of writing the offsets to disk?

ANSWER:

Not necessarily. The premise needs a correction first: Kafka also guarantees ordered delivery within a partition, so ordering by itself does not explain a performance difference between the two.

In Kafka, producers write to and consumers read from the brokers, which store each partition as an append-only log on their local disks. Every message has an offset, which is its position in the partition, and consumers track their progress by committing the offset they have reached.

In Pulsar, topics are also made up of partitions, which are distributed among Pulsar brokers. The brokers do not store the data themselves; they persist messages in Apache BookKeeper. Pulsar guarantees ordered delivery of messages within each partition.

Pulsar brokers also keep a record, per subscription, of which messages have been acknowledged. Because messages can be acknowledged individually and not only as a position in the log, this bookkeeping is more detailed than a single committed offset and can affect the performance of writing and reading messages.

Which platform is faster therefore depends on the workload, the configuration, and the hardware, and not on the ordering guarantees.
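
To make the word "offset" concrete, here is a sketch in plain Python, with no Kafka or Pulsar client involved. It models one partition as an append-only list whose index is the offset, a consumer that tracks a single committed offset (the Kafka style), and a tracker that records acknowledgements message by message (one of the styles Pulsar supports). The class names PartitionLog, OffsetConsumer, and AckTracker are hypothetical and exist only for this example.

```python
from typing import List, Set


class PartitionLog:
    """Append-only list standing in for one partition; the offset is the index."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, payload: str) -> int:
        self.records.append(payload)
        return len(self.records) - 1          # offset assigned to the new record


class OffsetConsumer:
    """Offset-style progress: a single committed position per partition."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.committed = 0                    # next offset to read after a restart

    def poll(self, max_records: int) -> List[str]:
        batch = self.log.records[self.committed:self.committed + max_records]
        self.committed += len(batch)          # commit right after reading, for brevity
        return batch


class AckTracker:
    """Per-message progress: each record can be acknowledged on its own."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.acked: Set[int] = set()

    def ack(self, offset: int) -> None:
        self.acked.add(offset)

    def pending(self) -> List[int]:
        return [i for i in range(len(self.log.records)) if i not in self.acked]


if __name__ == "__main__":
    log = PartitionLog()
    for i in range(5):
        log.append(f"m{i}")
    consumer = OffsetConsumer(log)
    print(consumer.poll(2), consumer.committed)
    tracker = AckTracker(log)
    tracker.ack(0)
    tracker.ack(2)
    print(tracker.pending())
```

The offset consumer needs to remember only one number per partition, while the tracker has to remember which individual records are still pending. That difference in bookkeeping is the trade-off behind individual acknowledgements.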

In general, the best platform for a given application will depend on the specific needs of the application. If the application needs per-message acknowledgement and queue-style subscriptions, then Pulsar is a good option. If it needs ordered, replayable partitioned logs and a large ecosystem, then Kafka may be a better option.

Here are some factors that can affect message writing and reading performance in Kafka and Pulsar:

  • The number of partitions: more partitions allow more parallelism, but each partition adds overhead on brokers and clients, so very large partition counts can reduce performance.
  • The size of the messages: the larger the messages, the fewer messages per second can be written and read.
  • The number of consumers: each additional subscription or consumer group reads its own copy of the stream, which adds load on the brokers.

In general, it is important to test with your own workload to determine which platform offers the best performance for a given application. That is the scientific method applied.
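
As a starting point for that kind of testing, here is a minimal timing harness in plain Python. It is only a sketch: measure_throughput is a name I made up, and the send argument stands for whatever call publishes one message with the client library under test. For serious measurements both projects ship their own load tools (pulsar-perf and kafka-producer-perf-test), which are a better choice than a hand-written loop.

```python
import time
from typing import Callable, Dict


def measure_throughput(send: Callable[[bytes], None],
                       messages: int, size: int) -> Dict[str, float]:
    """Call send() `messages` times with a payload of `size` bytes and time it."""
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(messages):
        send(payload)
    elapsed = time.perf_counter() - start
    rate = messages / elapsed if elapsed > 0 else float("inf")
    return {"messages": float(messages), "seconds": elapsed, "msgs_per_sec": rate}


if __name__ == "__main__":
    sink = []                                  # stand-in for a real producer
    result = measure_throughput(sink.append, messages=10_000, size=128)
    print(result)
```

Swapping sink.append for the send call of a real producer, and repeating the run with different partition counts and message sizes, gives a first rough comparison of the factors listed above.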

Thank you for reading this far. I will continue testing this technology.

This is the official link.
