gwerbin · 11d ago

Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

2 comments · 2 top-level

girvo · 11d ago

>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

That is quite literally what I have set up :)

I have a few codebases I've written over the years, and I run a suite of specific tasks against them: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall-clock time.
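
A minimal sketch of a harness along those lines, assuming each task can be wrapped in a Python callable that returns True on success. The names `TaskResult` and `run_suite` and the placeholder tasks are hypothetical, not the actual setup described above.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    name: str
    passed: bool
    seconds: float  # wall-clock time for the task


def run_suite(tasks: dict[str, Callable[[], bool]]) -> list[TaskResult]:
    """Run each task once, recording pass/fail and wall-clock time."""
    results = []
    for name, task in tasks.items():
        start = time.perf_counter()
        try:
            passed = bool(task())
        except Exception:
            # A task that raises is recorded as a failure rather than aborting the run.
            passed = False
        results.append(TaskResult(name, passed, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    # Placeholder tasks; real ones would drive the model against a codebase
    # and check the outcome (tests pass, bug located, and so on).
    suite = {
        "find-bug": lambda: True,
        "add-feature": lambda: False,
    }
    for result in run_suite(suite):
        status = "PASS" if result.passed else "FAIL"
        print(f"{result.name}: {status} in {result.seconds:.3f}s")
```

Running the same suite once per model or quantisation and comparing the rows side by side is one way to make the comparison concrete.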

>Do you think the choice of quantization matters that much for other models?

It matters hugely, a lot more than r/LocalLLaMA would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.

Step handles it surprisingly well (sparse MoE models generally seem to, when the layers to quantise are chosen carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except, annoyingly, Qwen's official FP8 has worse KLD/perplexity/accuracy numbers than it should. Red Hat's one was better in my testing, though not by a huge amount.
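
For context on the metrics mentioned above: KLD compares the quantised model's next-token distribution against the full-precision model's at each position, while perplexity is the exponentiated mean negative log-likelihood of a reference text. A minimal sketch, assuming the per-position probabilities and per-token log-probabilities have already been extracted from both models (the function names are illustrative, not from any particular tool):

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; p is the reference distribution, q the quantised one."""
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid dividing by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(reference: list[list[float]], quantised: list[list[float]]) -> float:
    """Average KL divergence over all token positions."""
    pairs = list(zip(reference, quantised))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability (natural log) assigned to the true tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Toy two-token vocabulary, two positions.
    ref = [[0.5, 0.5], [0.9, 0.1]]
    quant = [[0.6, 0.4], [0.8, 0.2]]
    print(f"mean KLD: {mean_kld(ref, quant):.4f} nats")
    print(f"perplexity: {perplexity([math.log(0.5)] * 4):.2f}")
```

A mean KLD near zero means the quantised model's predictions track the original closely; lower perplexity on the same text is better.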

rhdunn · 11d ago

I use promptfoo for evaluation. I'm experimenting with tests for my workflow/use cases.

I have a custom assert for loop/repeat detection that works well:

    from typing import Any


    def count_repeats(text: str, length: int) -> int:
        """Count how often the last `length` characters repeat back-to-back at the end of `text`."""
        n = len(text)
        pattern = text[n - length : n]
        count = 1  # Include the end of the string as matching the substring.

        text = text[:-length]
        while text.endswith(pattern):
            text = text[:-length]
            count += 1

        return count


    def repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        """Pass when the output ends with a block repeated `threshold` or more times."""
        threshold = (context.get('config') or {}).get('threshold', 3)
        count = 0
        length = 0

        # Try every block length that fits at least twice; keep the most-repeated one.
        for n in range(1, (len(output) // 2) + 1):
            n_count = count_repeats(output, n)
            if n_count > count:
                count = n_count
                length = n

        if count >= threshold:
            return {'pass': True, 'score': 1.0, 'reason': f'Output repeats {count} times with length {length}.'}
        else:
            return {'pass': False, 'score': 0.0, 'reason': f"Output doesn't repeat {threshold} or more times."}


    def no_repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        """Invert `repeats`: pass only when the output is not stuck in a loop."""
        result = repeats(output, context)
        result['pass'] = not result['pass']
        result['score'] = 1.0 - result['score']
        return result
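
To get a feel for what the tail-matching logic reports, here is a condensed, self-contained restatement of the same idea with a few inputs. The name `longest_tail_repeat` is hypothetical and only used for this illustration; it returns the `(count, length)` pair that the assert above compares against the threshold.

```python
def longest_tail_repeat(text: str) -> tuple[int, int]:
    """Return (count, length) for the most-repeated block at the end of `text`."""
    best_count, best_length = 0, 0
    # Only block lengths that fit at least twice are worth checking.
    for length in range(1, len(text) // 2 + 1):
        pattern = text[-length:]
        rest, count = text, 0
        # Strip the block from the end for as long as it keeps matching.
        while rest.endswith(pattern):
            rest = rest[:-length]
            count += 1
        if count > best_count:
            best_count, best_length = count, length
    return best_count, best_length


if __name__ == "__main__":
    print(longest_tail_repeat("The answer is 42."))      # (1, 1)
    print(longest_tail_repeat("Let me check. " * 3))     # (3, 14)
    print(longest_tail_repeat("Wait..."))                # (3, 1)
```

Note the last case: under this logic a trailing ellipsis counts as three repeats of a one-character block, so with a threshold of 3 such an output would be flagged as looping. Setting a higher `threshold` in the assertion's config is one way to avoid that.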

Just add it to your promptfooconfig.yaml:

    defaultTest:
      assert:
        - # ----- The output doesn't repeat/get stuck in a loop.
          type: python
          value: file://asserts/repeat.py:no_repeats
