Pretty bad results compared to Devstral 1 (?)
From my non-representative early testing with Q6_K_XL, it's much worse than Devstral Small "1" with the same quant.
I've been getting lots of hallucinations using RooCode with llama.cpp (Vulkan): it creates useless files and useless tasks, doesn't fully follow instructions, etc.
I'm having much better results with the original Devstral Small (UD_Q6_K_XL).
Not sure if this is due to the Unsloth quantization, llama.cpp, the chat template, or whether it's also a problem with the original model?
I've been running it with the recommended params, and ctx/cache adapted to my hardware (same as Devstral 1):
--flash-attn 1
--jinja
--temp 0.15
--min-p 0.01
--ctx-size 100000
--cache-type-k q4_0 --cache-type-v q4_0
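For reference, the full command is roughly the following (the model filename and port are placeholders for my setup; RooCode then points at the OpenAI-compatible endpoint llama-server exposes):
llama-server \
  --model Devstral-Small-Q6_K_XL.gguf \
  --flash-attn 1 \
  --jinja \
  --temp 0.15 \
  --min-p 0.01 \
  --ctx-size 100000 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --host 127.0.0.1 --port 8080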
I'm curious to hear other Devstral Small 1 users' feedback, so please don't hesitate to share.
Can you also remove the cache type quantization?
Support for the model in llama.cpp is still experimental and being worked on, and the chat template still needs work.
Unfortunately, right now I'm testing it on real tasks that need at least 60k tokens.
So it won't fit in my 32GB of RAM with an F16 KV cache.
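Rough math, assuming the config I recall for Devstral Small (40 layers, 8 KV heads, head dim 128; worth double-checking against the GGUF metadata):
# F16 KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
echo $((2 * 40 * 8 * 128 * 2))           # 163840 bytes per token
echo $((2 * 40 * 8 * 128 * 2 * 60000))   # ~9.8 GB at 60k tokens, on top of roughly 19 GB of Q6_K_XL weights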
But I'd be curious if someone else wants to give it a try!
I hope there isn't some regression; with the previous Devstral there was little to no impact from using a Q4 KV cache in my testing.
EDIT: I just gave 36K of non-quantized context a shot (the maximum I can fit), but as I feared it's not enough for my current task (and I don't have any "small testing task" for a coding agent).
I see there's been a recent commit for converting to GGUF though; I don't know if Unsloth uses it? https://github.com/ggml-org/llama.cpp/pull/17889/files
So FYI, I've been able to test with a Q8_0 cache, and it seems as bad as Q4_0.
I won't be able to go to a native (F16) cache with my current tests, though.
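For reference, that test was just a matter of swapping the cache flags to:
--cache-type-k q8_0 --cache-type-v q8_0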
Tested with the IQ4_XS quant from Unsloth and Bartowski. I don't think it's an issue with the quants; I think it's the model. It's just bad. I couldn't make it past my first lightly modified Snake game benchmark.
Maybe watch out for the top-p and top-k parameters? They're not specified in the Unsloth guide, but they default to 0.9 and 40 respectively in llama.cpp.
Devstral version 1 suggested top-p 0.95 and top-k 64.
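In llama.cpp you'd set those explicitly with something like:
--top-p 0.95 --top-k 64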
I confirm it seems better with the updated versions!
As far as I can see from my quick testing, Q4_K_XL still isn't great as a coding agent, but Q6_K_XL seems pretty good! (as is usually the case for most models of this size)
There might still be some weird behavior that I didn't have with Devstral 1, but yeah, maybe using the top-p and top-k values recommended for Devstral 1 would help, and maybe even tweaking dry-multiplier / dry-penalty-last-n.
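If I end up tweaking the DRY sampler, it would be along these lines (values are just illustrative starting points, not tested recommendations):
--dry-multiplier 0.8 --dry-penalty-last-n 512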
Anyway I'll need to do more testing, but that's quite promising!
Amazing to hear, thanks for testing!
I am also getting poor performance with the Q6 version.
So it seems that even if it's better than before, I still often get errors with tool calling (e.g. broken diffs quite often).
At some point it even responded with broken formatting (e.g. messages containing </user_message> </read_file> <notice>).
It's still not as good as Devstral 1 (same quant / same agent framework), which probably means there are still some issues.
@danielhanchen
According to my non-representative testing, yeah.
Maybe it really is a problem with the quantization and/or cache quantization, though.
Also, there still seem to be issues related to tool use with llama.cpp: https://github.com/ggml-org/llama.cpp/issues/17928