
[Improve][Producer] Refactor internalSend() and resource management #1071

Closed
wants to merge 11 commits

Conversation

gunli (Contributor) commented Jul 25, 2023:


Fixes #1043


Master Issue: #1043, #1055, #1059, #1060, #1067, #1068

Motivation

  1. As discussed in [fix] [issue 1051]: Fix inaccurate producer mem limit in chunking and schema #1055, we need to calculate how many pending items and how much memory are required before appending a sendRequest to the dataChan. Currently we do schema encoding/compression in internalSend(), which may lead to inaccurate memory-limit controlling and, as described in [Improve][Producer]Simplify the MaxPendingMessages controlling #1043, makes the code complicated and difficult to maintain; we need to simplify the send logic.
  2. In the Java client, schema encoding/compression are done in the application thread; it is better to make the Go client work the same way.
  3. As discussed in #1055 and described in #1043, the resource (memory and semaphore) acquiring/releasing logic is scattered across the whole file. We need to simplify the resource management logic by encapsulating it into functions that are called only where necessary, making it 'Low Coupling, High Cohesion'.
  4. As discussed in [Bug][Producer]Inaccurate transaction endSendOrAckOp for chunked message #1060, the transaction is not correctly ended for chunked messages; it is better to encapsulate the transaction-ending logic into one func that is called when a sendRequest is done.

Modifications

  1. Move schema encoding from internalSend() to internalSendAsync();
  2. Move compression from internalSend() to internalSendAsync();
  3. Calculate the total number of chunks before entering the dataChan;
  4. Reserve the required semaphore and memory before entering the dataChan;
  5. Store in each sendRequest the semaphore and memory it holds;
  6. Encapsulate related code blocks into individual funcs to make the skeleton of internalSendAsync() clearer;
  7. Add sendRequest.done() to release the resources a request holds;
  8. When a sendRequest is done, call sendRequest.done();
  9. In sendRequest.done(), run the callback, update metrics, end the transaction, run interceptor callbacks, and so on (see the sketch below).
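
A minimal sketch of the resource-carrying sendRequest described above; the semaphore and memory-limiter types here are simplified stand-ins, not the PR's exact code:

package producer

import "sync"

// Simplified stand-ins for the client's real types.
type MessageID interface{}
type ProducerMessage struct{ Payload []byte }

// A counting semaphore modeled as a buffered channel.
type semaphore chan struct{}

func (s semaphore) Acquire() { s <- struct{}{} }
func (s semaphore) Release() { <-s }

// sendRequest carries the resources reserved before it was put on dataChan,
// so that done() can release exactly what was acquired.
type sendRequest struct {
	msg         *ProducerMessage
	callback    func(MessageID, *ProducerMessage, error)
	sem         semaphore // pending-queue slot held by this request
	reservedMem int64     // bytes reserved against the producer memory limit
	once        sync.Once
}

// done releases the held resources and runs the completion logic; sync.Once
// keeps it safe even if both the success and failure paths call it.
func (sr *sendRequest) done(id MessageID, err error) {
	sr.once.Do(func() {
		sr.sem.Release()
		releaseMemory(sr.reservedMem) // placeholder for the real memory limiter
		if sr.callback != nil {
			// user callback, then metrics, transaction end, interceptors...
			sr.callback(id, sr.msg, err)
		}
	})
}

func releaseMemory(n int64) { /* return n bytes to the limiter */ }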

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable / docs / GoDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

gunli (Contributor, Author) commented Jul 25, 2023:

@Gleiphir2769 @RobertIndie @graysonzeng @shibd Would you please review this PR?

Gleiphir2769 (Contributor) left a comment:

Hi @gunli, I left some comments.


func (p *partitionProducer) internalSendAsync(ctx context.Context, msg *ProducerMessage,
	callback func(MessageID, *ProducerMessage, error), flushImmediately bool) {
	err := p.validateMsg(msg)
Contributor:

Could we make it inline?

gunli (Contributor, Author) commented Jul 27, 2023:

Inlining will make internalSendAsync a BIG func, about 200 lines, which is hard to read; splitting it into small funcs is clearer and more readable.

Contributor:

Sorry, maybe my description was not clear. Could we make L1191-L1192 a one-liner if...else statement?

gunli (Contributor, Author):

Hmm, I think that is OK. Inlining reduces the line count, while non-inlining is easier to debug.
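
For reference, a sketch of the two forms under discussion (the callback error path shown is illustrative):

// One-liner form: err is scoped to the if statement.
if err := p.validateMsg(msg); err != nil {
	callback(nil, msg, err)
	return
}

// Two-line form: err stays visible afterwards, which can be handier in a debugger.
err := p.validateMsg(msg)
if err != nil {
	callback(nil, msg, err)
	return
}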


p.dataChan <- sr
err = p.updateSchema(sr)
Contributor:

Could we make it inline?

gunli (Contributor, Author) commented Jul 27, 2023:

Inlining will make internalSendAsync a BIG func, about 200 lines, which is hard to read; splitting it into small funcs is clearer and more readable.

Contributor:

Same as above.


func (p *partitionProducer) updateUncompressPayload(sr *sendRequest) error {
	// read payload from message
	sr.uncompressedPayload = sr.msg.Payload
Contributor:

It seems msg.Payload is cloned into sr.uncompressedPayload, which would take up unnecessary memory. I think making the type of uncompressedPayload *[]byte may be better.

gunli (Contributor, Author):

In Go, a []byte assignment just copies the address (the slice header), not the underlying data.
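
A quick illustration of that point: assigning a []byte copies only the slice header (pointer, length, capacity), so no payload bytes are duplicated:

package main

import "fmt"

func main() {
	payload := []byte("hello")
	uncompressed := payload // copies the slice header only, not the data

	uncompressed[0] = 'H' // both slices share the same backing array
	fmt.Println(string(payload)) // prints "Hello"
}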

return nil
}

func (p *partitionProducer) updateUncompressPayload(sr *sendRequest) error {
Contributor:

Typo: updateUncompressPayload -> updateUncompressedPayload

gunli (Contributor, Author) commented Jul 27, 2023:

OK, I have renamed it.

checkSize = int64(sr.compressedSize)
}

sr.maxMessageSize = int32(int64(p._getConn().GetMaxMessageSize()))
Contributor:

p._getConn().GetMaxMessageSize() makes an RPC call to the broker. This breaks the semantics of async: when a user invokes producer.SendAsync, they must wait for an RPC to return.

What do you think?

gunli (Contributor, Author) commented Jul 27, 2023:

MaxMessageSize is cached in the conn when a conn becomes ready (see the code of connection and connectionPool), so this is OK.

Actually, p.getOrCreateSchema() will trigger a blocking RPC call. What makes SendAsync not a truly async func is the fixed-length pending queue; if the pending queue could be expanded at runtime, schema encoding, compression, and getOrCreateSchema could all be done in internalSend(), making it truly async.

I don't think memLimit and a fixed-length queue are necessary in a language with GC like Java or Go, because we have the semaphore and dataChan to control how many messages can be pending, as I mentioned in #1043.

Contributor:

> MaxMessageSize is cached in the conn when a conn becomes ready (see the code of connection and connectionPool), so this is OK.

You're right. This point sounds good to me.

> Actually, p.getOrCreateSchema() will trigger a blocking RPC call. What makes SendAsync not a truly async func is the fixed-length pending queue.

But the size of the pending queue is decided by the user, and if the size of dataChan is set to MaxPendingMessages, it will not be the async limit.

> I don't think memLimit and a fixed-length queue are necessary in a language with GC like Java or Go.

This is not only related to memory but also to the broker. Here is the interface description from the Java client:

> Set the max size of the queue holding the messages pending to receive an acknowledgment from the broker.

I think we should find a way to solve the p.getOrCreateSchema() problem. What do you think?

gunli (Contributor, Author):

> But the size of the pending queue is decided by the user, and if the size of dataChan is set to MaxPendingMessages, it will not be the async limit.

dataChan and the semaphore can work together to make it async: when there is enough semaphore, add to the queue; otherwise wait until a response comes back from the broker and a semaphore is released.

> I think we should find a way to solve the p.getOrCreateSchema() problem.

Since the schema can be set by the user at runtime, we can only solve it by moving the blocking logic to internalSend(); but we also have to reserve memory and pending-queue space before adding a message to dataChan, so if we want to solve it, we must drop memLimit and the fixed-length pending queue.

Gleiphir2769 (Contributor) commented Jul 31, 2023:

> when there is enough semaphore, add to the queue; otherwise wait until a response comes back from the broker and a semaphore is released.

I support the idea of reserving resources before internalSend. And if we reserve the semaphore first, the dataChan will not block if it is created with capacity MaxPendingMessages.

> we must drop memLimit and the fixed-length pending queue

Sorry, I don't get why we should drop memLimit; it's a useful feature for users who are short on resources. And I don't think the fixed-length pending queue is a problem for SendAsync. Why should we make it flexible?

gunli (Contributor, Author) commented Aug 1, 2023:

> I support the idea of reserving resources before internalSend. And if we reserve the semaphore first, the dataChan will not block if it is created with capacity MaxPendingMessages.

We do not block on dataChan, only on the semaphore; we treat dataChan purely as a channel between the main goroutine (the user's goroutine) and the partitionProducer's goroutine (the IO goroutine). The semaphore represents the available resources (pendingItems): when acquiring the semaphore succeeds, we add the request to dataChan; otherwise we block until a semaphore is released (one message has been done). A sketch of this pattern follows.
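
Continuing the earlier sketch, the handoff looks roughly like this (publishSemaphore, dataChan, and onReceipt are illustrative names, not the PR's exact code):

type partitionProducer struct {
	publishSemaphore semaphore         // capacity MaxPendingMessages
	dataChan         chan *sendRequest // capacity MaxPendingMessages
}

// Caller (user) goroutine: block only on the semaphore, never on dataChan.
func (p *partitionProducer) enqueue(sr *sendRequest) {
	p.publishSemaphore.Acquire() // blocks until a pending slot is free
	sr.sem = p.publishSemaphore  // the request remembers what it holds
	p.dataChan <- sr             // cannot block: a slot was just reserved
}

// IO goroutine: when the broker acks (or the send fails), completing the
// request releases the semaphore and wakes one blocked sender.
func (p *partitionProducer) onReceipt(sr *sendRequest, id MessageID, err error) {
	sr.done(id, err) // done() releases the semaphore and reserved memory
}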

> Sorry, I don't get why we should drop memLimit; it's a useful feature for users who are short on resources. And I don't think the fixed-length pending queue is a problem for SendAsync. Why should we make it flexible?

Because we limit memory and the pending queue, we have to reserve memory and pending-queue space before adding a message to dataChan, which forces us to do schema encoding and compression first; otherwise we would have no idea how much memory and how many pending items we need. These are blocking operations, and when they run in the user's goroutine they block the user's logic, which makes the method non-async.

gunli (Contributor, Author) commented Aug 7, 2023:

@RobertIndie Would you please take a look at this PR? It has been pending for a long time.

gunli (Contributor, Author) commented Aug 31, 2023:

@tisonkun @RobertIndie I think there is a bug in reserveSemaphore(); I will update it now.

gunli (Contributor, Author) commented Aug 31, 2023:

> @tisonkun @RobertIndie I think there is a bug in reserveSemaphore(); I will update it now.

Done.

RobertIndie (Member):

I noticed that you did a lot of refactoring work in this PR, including changes to the critical path for publishing messages. I’m concerned that these changes may impact publishing performance.

While there are many modifications, it’s difficult for the reviewer to see the relationship between these modifications and the bug fixes. Could you provide more detail in the PR description, particularly regarding why you’re making these changes to fix bugs?

I recommend separating the refactoring work from the bug fixes. Otherwise, it will be challenging for us to cherry-pick the bug fixes to other release branches.

gunli (Contributor, Author) commented Aug 31, 2023:

> I noticed that you did a lot of refactoring work in this PR, including changes to the critical path for publishing messages. I'm concerned that these changes may impact publishing performance.
>
> While there are many modifications, it's difficult for the reviewer to see the relationship between these modifications and the bug fixes. Could you provide more detail in the PR description, particularly regarding why you're making these changes to fix bugs?
>
> I recommend separating the refactoring work from the bug fixes. Otherwise, it will be challenging for us to cherry-pick the bug fixes to other release branches.

@RobertIndie The details were discussed in #1059, #1043, and #1060; the bug fix only fixes a bug in this PR itself, not in the existing code. As we discussed in #1055 and #1059, we have to move some logic from internalSend() to internalSendAsync(); these pieces of logic work together, so I can't separate them into multiple PRs. It is better to review this PR with a compare tool like Beyond Compare.

RobertIndie (Member):

@gunli Could you summarize them and add the detailed explanation to the PR description? They are very important context for this PR. A well-written PR description not only helps with review but also enables other developers to learn about the context of this PR.

gunli (Contributor, Author) commented Aug 31, 2023:

> @gunli Could you summarize them and add the detailed explanation to the PR description? They are very important context for this PR. A well-written PR description not only helps with review but also enables other developers to learn about the context of this PR.

@RobertIndie I have updated the motivation.

Gleiphir2769 (Contributor) commented Sep 4, 2023:

Hi @gunli. Looks like it does not pass the CI tests now. You can check the CI log and fix it. It seems related to chunking.

[screenshot of the CI failure]

gunli (Contributor, Author) commented Sep 6, 2023:

@Gleiphir2769 Thank you. I am busy these days; I will fix it ASAP.

gunli (Contributor, Author) commented Sep 6, 2023:

> @Gleiphir2769 Thank you. I am busy these days; I will fix it ASAP.

@Gleiphir2769 I have fixed the test case and it passes now. But since only the last chunk triggers the callback, I commented out the wait logic in sendSingleChunk(); I am not sure whether that is the right thing to do, please help review it.

Gleiphir2769 (Contributor) left a comment:

Hi @gunli. I left some comments about sendSingleChunk.

mm.ChunkId = proto.Int32(int32(chunkID))
producerImpl.updateMetadataSeqID(mm, msg)

doneCh := make(chan struct{})
Contributor:

I think we can remove this doneCh. The goal of this UT is to verify whether the consumer can discard the oldest chunked message; whether the callback of sendRequest is called has no impact.

gunli (Contributor, Author):

OK, I have deleted these lines in a new commit.

@@ -548,30 +556,57 @@ func createTestMessagePayload(size int) []byte {
}

//nolint:all
-func sendSingleChunk(p Producer, uuid string, chunkID int, totalChunks int) {
+func sendSingleChunk(p Producer, uuid string, chunkID int, totalChunks int, wholePayload string, callbackOnce *sync.Once, cr *chunkRecorder) {
Contributor:

Should we add the parameters wholePayload, callbackOnce, and chunkRecorder? I think it's more appropriate to create them inside sendSingleChunk.

gunli (Contributor, Author):

Hmm, in the current implementation, all the chunked sendRequests must share the whole message payload, callbackOnce, and chunkRecorder; see internalSend() and addRequestToBatch(). I added these params just to make the send procedure work correctly, and I think it is better to keep them, roughly as sketched below.
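
To illustrate why the shared state is passed in, the call site looks roughly like this (the loop and the chunkRecorder construction are assumptions, not the exact test code):

// Every chunk of one logical message shares the payload, the sync.Once
// guarding the user callback, and the chunkRecorder, so they are created
// once outside the loop and passed into sendSingleChunk.
var callbackOnce sync.Once
cr := &chunkRecorder{} // hypothetical construction of the shared recorder
for chunkID := 0; chunkID < totalChunks; chunkID++ {
	sendSingleChunk(p, uuid, chunkID, totalChunks, wholePayload, &callbackOnce, cr)
}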

gunli (Contributor, Author) commented Sep 7, 2023:

@RobertIndie @tisonkun Would you please review this PR again?

gunli (Contributor, Author) commented Oct 18, 2023:

Attachments:
message_chunking_test-ac9c1a6399336461d2d3ce1cdd31cac6debd5ed5.txt
message_chunking_test-PR.txt
producer_partition-ac9c1a6399336461d2d3ce1cdd31cac6debd5ed5.txt
producer_partition-PR.txt

@RobertIndie @tisonkun I have uploaded the changed files. Would you please review this PR with a comparing tool like Beyond Compare?

nodece (Member) commented Oct 23, 2023:

This PR looks complex; could you split it?

gunli (Contributor, Author) commented Oct 24, 2023:

> This PR looks complex; could you split it?

@nodece OK, I will do it today.

Development

Successfully merging this pull request may close these issues.

[Improve][Producer]Simplify the MaxPendingMessages controlling
4 participants