
zozbot234 · 3mo ago

But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.



stingraycharles · 3mo ago

OK, I am by no means an expert on this, and I immediately stand corrected. But as I understand it, to gauge the amount of active memory that's required, it's more accurate to go by the ~82B number, right?

zozbot234 (OP) · 3mo ago

The ~82B figure is an attempt to compare performance to an equivalent dense model. The number of active parameters is the ~17B figure.
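To make the distinction concrete, here is a rough sketch with made-up toy numbers (not the config of any real model). It assumes every layer has its own independent set of small experts plus some always-used shared weights, and that the router activates a fixed top-k of them per token. The geometric-mean "dense equivalent" is one commonly cited rule of thumb and is an assumption here; the thread doesn't say how the ~82B figure was actually derived.

```python
import math

# Toy numbers only, chosen for easy arithmetic; not any real model's config.
N_LAYERS = 4
EXPERTS_PER_LAYER = 8      # each layer has its own, unrelated set of experts
TOP_K = 2                  # experts the router activates per token, per layer
PARAMS_PER_EXPERT = 1_000  # an individual layer-expert is small
SHARED_PER_LAYER = 500     # attention, norms, router: used by every token


def total_params() -> int:
    """All weights that have to be stored somewhere."""
    return N_LAYERS * (SHARED_PER_LAYER + EXPERTS_PER_LAYER * PARAMS_PER_EXPERT)


def active_params() -> int:
    """Weights a single token touches in one forward pass."""
    return N_LAYERS * (SHARED_PER_LAYER + TOP_K * PARAMS_PER_EXPERT)


def dense_equivalent_estimate() -> float:
    """Geometric mean of total and active (a rule of thumb, assumed here)."""
    return math.sqrt(total_params() * active_params())


print(total_params(), active_params(), round(dense_equivalent_estimate()))
```

The active count is per token: different tokens can route to different layer-experts, so the set of weights touched changes from one token to the next.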
