My yardstick for all LLMs so far has been to ask for an offensive joke, a function to reverse a string, and directions for making lasagna. It seems stupid, but it's remarkably effective.
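For reference, the string-reversal test has an essentially one-line answer; a model that can't produce something like this minimal Python sketch is struggling with even trivial code:

```python
def reverse_string(s: str) -> str:
    """Reverse a string using Python's extended slice notation."""
    return s[::-1]

print(reverse_string("lasagna"))  # → "angasal"
```

The bar here is deliberately low, which is what makes failing it informative.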
With MLC being the first LLM-in-a-box to run on my M2 at faster than a token per minute, I'm impressed by the speed but disappointed in the quality of the experience. For those interested in the outcome: it failed all three tests, which is not unexpected for a model this small.
Voluntarily using or producing models with censorship built in demonstrates a willingness to hobble the technology for peripheral reasons that do not directly advance the field. For my own use, that is disqualifying, for three reasons. First, social sensibilities and standards of decency vary across cultural and regional lines. Second, if something as trivial as a crass joke is restricted, that bar is so low that matters of far graver concern will undoubtedly be tampered with or limited as well. Third, those restrictions will not always behave in the ways their authors intended.
As with most measures that try to correct injustices through data, this self-hindering behavior will not turn out to be the positive we think it will be.