Model "thinks" for too long
They acknowledged the overthinking issue and said they are working to fix it without losing performance. I hope it won't take long, because the overthinking sometimes makes the model unusable; on some riddles and problems it spent over 25 minutes thinking, which is overkill.
Hope they fix it soon. The performance is actually good for a 3B model and I'd really like to use it, but this issue needs to be resolved!
Let’s join first; winning is just a matter of time.
Did you try these parameters: --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.01?
From my observations these might be best, and they may help with thinking time.
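For anyone new to llama.cpp, these flags go straight to its `llama-cli` tool; a rough example invocation (the model path here is a placeholder, not something from this thread):

```shell
# Hypothetical invocation; substitute your own GGUF path.
./llama-cli -m ./model-Q6_K.gguf \
  --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.01
```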
I've tried the first two but not the last two. I'll give it a try. Thanks.
Setting top_k to 0 indeed shortens the thought chain.
OK, but what is more important, the CoT or the output quality? When I was testing top_k, anything other than 40 produced worse results, mostly in code (same with min_p 0.01).
However, 40 is just the default value in llama.cpp; there's nothing magical about it, nor any specific reason to stick with it. Top_p alone should be sufficient to control token selection, and setting min_p to 0.01 likely won't make a difference.
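To make the interaction between these samplers concrete, here is a minimal stand-alone sketch of how a top-k / top-p / min-p chain filters candidate tokens. The function is my own illustration, not llama.cpp's actual API, and I'm assuming llama.cpp's convention that top_k <= 0 disables the top-k filter (which matches the top_k = 0 observation above):

```python
import math

def filter_candidates(logits, top_k=40, top_p=0.95, min_p=0.01):
    """Return the token ids that survive top-k, top-p and min-p filtering.

    Simplified sketch of a sampler chain; names and ordering are
    illustrative, not llama.cpp's real implementation.
    """
    # Softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort candidate ids by probability, highest first.
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most likely tokens (<= 0 disables the filter).
    if top_k > 0:
        ids = ids[:top_k]

    # top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ids:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    ids = kept

    # min-p: drop tokens below min_p * P(most likely surviving token).
    threshold = min_p * probs[ids[0]]
    return [i for i in ids if probs[i] >= threshold]

# With a peaked distribution, only the dominant tokens survive.
logits = [8.0, 6.0, 2.0, 1.0, 0.5]
print(filter_candidates(logits))  # → [0, 1]
```

Note that on a peaked distribution like this one, top-p already cuts the tail before top-k or min-p ever matter, which is consistent with the point that top_p alone can be sufficient.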
Where are you running it? What is the program on your screen?
Jan (llama.cpp backend), Nanbeige4.1-3B-heretic Q6_K GGUF, Metal, no prompt, f16 K / q8_0 V cache.
Top-k is indeed an important value for LLMs. The K value is analogous to a light source that the LLM shines on its dark knowledge base: a higher K means more light channels are exposed to the model's dark knowledge, so there is a higher chance of retrieving information related to what the user asked for from its dark, interconnected knowledge base. A low K value means the LLM stays completely dark, while a very high K value exposes it to too much knowledge at once. Too much knowledge can cause knowledge paralysis in the model, so it is nice to have K = 20 or K = 40, so the model is only exposed to an adequate amount of its dark knowledge and gives us accurate, sensible answers.
I read a lot of AI research papers for fun; this is my understanding of the purpose of the K value.

