
0 points · cs702 · 2y ago

> Why is this so shocking? Quantization has been widely explored, driving that to its extreme (and blowing up parameter count to make up for it) just seems like a natural extension of that.

I find it shocking that we don't even need lower floating-point precision. We don't need precision at all. We only need three symbols to represent every value.

> I feel like this follows naturally from having only ternary values, multiplication doesn't really bring much to the table here. It's a bit surprising that it's performing so well on existing hardware, usually multiplication hardware sees more optimization, especially for GPGPU hardware.

I find it shocking. Consider that associative addition over ternary digits, or trits, represented by three symbols (a,b,c) has only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter), and only three possible outputs, a, b, or c. Matrix multiplications could be executed via crazy-cheap tritwise operations in hardware. Maybe ternary hardware[a] will become a thing in AI?

---

[a] https://en.wikipedia.org/wiki/Ternary_computer
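To make the "no multiplication" point above concrete, here is a minimal sketch of a matrix-vector product where every weight is -1, 0, or 1. The function name `ternary_matvec` and the sample values are illustrative assumptions, not anything taken from the paper under discussion. Each weight either adds the input, subtracts it, or skips it, so the inner loop contains no multiply.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector.

    Only additions and subtractions are used; zero weights are skipped.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # w == 0: the input contributes nothing
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]
print(ternary_matvec(W, x))  # [1, 4]
```

The result matches an ordinary dot product per row; the only difference is that the work per weight is a branch plus at most one add or subtract.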


9 comments · 4 top-level

jerf · 2y ago · 3 in thread

An integer is just a concatenation of bits. Floating point appears more complicated but from an information theory perspective it is also just a concatenation of bits. If, for the sake of argument, one replaced a 64-bit int with 64 individual bits, that's really the same amount of information, and a structure could hypothetically then either choose to recreate the original 64-bit int, or use the 64 bits more efficiently by choosing from the much larger set of possible ways to use such resources.

Trits are helpful for neural nets, though, since they really love signs and they need a 0.

So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.
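As a rough illustration of "arranging bits into trits": a trit carries log2(3), about 1.585 bits, and since 3^5 = 243 fits in the 256 values of a byte, five trits can be stored in eight bits (1.6 bits per trit). The sketch below shows one possible base-3 packing. The helper names `pack_trits` and `unpack_trits` are hypothetical, and nothing here is claimed to be the storage format any particular model uses.

```python
import math


def pack_trits(trits):
    """Pack exactly five trits from {-1, 0, 1} into one integer in 0..242."""
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected five values from {-1, 0, 1}")
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # shift each trit to a base-3 digit 0..2
    return value


def unpack_trits(value):
    """Recover the five trits from an integer produced by pack_trits."""
    trits = []
    for _ in range(5):
        value, digit = divmod(value, 3)
        trits.append(digit - 1)       # map digit 0..2 back to -1..1
    return trits[::-1]                # digits came out least significant first


packed = pack_trits([1, -1, 0, 0, 1])
print(packed)                         # 176
print(unpack_trits(packed))           # [1, -1, 0, 0, 1]
print(round(math.log2(3), 3))         # 1.585
```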

This may seem an absolutely bizarre zigzag, but I am reminded of Busy Beavers, because of the way they take the very small primitives of a Turing Machine, break them down to the smallest pieces, then combine them in ways that almost immediately cease to be humanly comprehensible. Completely different selection mechanism for what appears, but it turns out Turing Machine states can do a lot "more" than you might think simply by looking at human-designed TMs. We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

cs702 (OP) · 2y ago

> So from the perspective that it's all just bits in the end the only thing that is interesting is how useful it is to arrange those bits into trits for this particular algorithm, and that the algorithm seems to be able to use things more effectively that way than with raw bits.

Thank you. I find many other things interesting here, including the potential implications for hardware, but otherwise, yes, I agree with you, that is interesting.

SkyBelow · 2y ago

This sort of breakdown also reminds me of the explanation of why busy beavers grow faster than anything humans can ever define. Anything a human can define is a finite number of steps that can be represented by some Turing machine of size M. A Turing machine of size N > M can then use M as a subset of it, growing faster than the Turing machine of size M. Either it is the busy beaver for size N, or it grows slower than the busy beaver for size N. Either way, the busy beaver for size N grows faster than whatever the human defined that was captured by the Turing machine of size M. This explanation was what helped me understand why busy beaver grows faster than any operator that can be formally defined (obviously you can define an operator that references busy beaver itself, but busy beaver can be considered to not be formally defined, and thus any operator defined using it isn't formally defined either).

The bit about floating point numbers just being a collection of bits interpreted in a certain way helps make sense of why a bigger model doesn't need floating point at all.

eru · 2y ago

> We humans have very stereotypical design methodologies and they have their advantages, but sometimes just letting algorithms rip can result in much better things than we could ever hope to design with the same resources.

Yes. Though here the interesting point is not so much that these structures exist, but that 'stupid' back-propagation is smart enough to find them.

You can't find busy beavers like that.

jxy · 2y ago · 2 in thread

The matrices (weights) are ternary.

The vectors are not.

cs702 (OP) · 2y ago

The activations are in (-1, 1), so they're also representable by (-1, 0, 1).

rfoo · 2y ago

This is wrong. The paper states that the activations are int8 during inference.

That being said, pre-LLM-era deep learning already had low-bit quantization down to 1w2f [0] working back in 2016 [1]. So it's certainly possible it would work for LLMs too.

[0] 1-bit weights, 2-bit activations; though practically people deployed 2w4f instead.

[1] https://arxiv.org/abs/1606.06160
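For a sense of why int8 activations are not the same thing as ternary ones, here is a small sketch. It assumes symmetric absmax scaling to the range [-127, 127]; the thread only says the activations are int8, so the scheme, the helper name `quantize_int8`, and the sample values are all assumptions made for illustration.

```python
def quantize_int8(activations):
    """Symmetric absmax quantization (an assumed scheme, see the note above).

    Returns (ints, scale) with every int in [-127, 127]; multiply the ints
    by scale to get approximate original values back.
    """
    peak = max(abs(a) for a in activations)
    if peak == 0:
        return [0] * len(activations), 1.0
    scale = peak / 127
    return [round(a / scale) for a in activations], scale


acts = [0.9, 0.3, -0.4, -1.0]

q, scale = quantize_int8(acts)
dequantized = [v * scale for v in q]

# Rounding the same values straight onto {-1, 0, 1} for comparison:
# 0.3 and -0.4 both collapse to 0, which int8 keeps distinct.
ternary = [round(a) for a in acts]

print(q)
print(ternary)
```

With int8, each reconstructed value stays within half a quantization step of the original, whereas the ternary rounding erases the difference between small positive and small negative activations.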

cs702 (OP) · 2y ago

EDIT: Embarrassingly, on the last paragraph I got the number of possible input pairs wrong:

> only three possible input pairs, (a,b), (a,c), or (b,c) (within each pair, order doesn't matter)

The correct number, ignoring order, is six pairs, because we have to include (a,a), (b,b), and (c,c).
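The corrected count is easy to check by enumeration. This short sketch uses Python's standard `itertools`; the symbols a, b, c are the ones from the comment, and the variable names are just for illustration.

```python
from itertools import combinations, combinations_with_replacement, product

symbols = "abc"

# Pairs of two different symbols, order ignored: the original (too small) count.
distinct_pairs = list(combinations(symbols, 2))

# Order ignored, but (a,a), (b,b), (c,c) allowed: the corrected count.
unordered_pairs = list(combinations_with_replacement(symbols, 2))

# For comparison, the count when order does matter.
ordered_pairs = list(product(symbols, repeat=2))

print(len(distinct_pairs), len(unordered_pairs), len(ordered_pairs))  # 3 6 9
```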

p1esk · 2y ago

If you find three symbols per weight shocking, this paper should completely blow your mind: https://arxiv.org/abs/1803.03764

I admit it did shock me when it came out.
