llama : fix non-quantization of expert gating tensors #5754
This reverts a single line from #5475 and adds a comment.
Using `LLM_TN` here to get the tensor name can't work, because the layer number is not known, so the string compared to the actual tensor name contains `%d` instead of a layer number. It therefore never matches, and expert gating tensors are quantized anyway. I've added a comment so that it won't happen again by accident.
I noticed this when refactoring some Mamba-related code (in #5328): I changed a check that prevents some tensors from being quantized to use `LLM_TN` (since not using hard-coded strings seemed cleaner), and it no longer worked, while checking for the suffix (as was done before for the expert gating tensors) does work.

I've tested this with the smallest MoE model I could find, and it works (i.e. the `ffn_gate_inp.weight` tensors are not quantized, while on `master` they are, even though they should not be).