You get to decide what is appropriate or not.
It works offline.
It can be used by applications without having to rely on an external service.
This can be important for a number of applications (I am thinking about open source games and the modding community right now, but that is just an example).
More to the point, I find it absolutely bizarre that one couldn't come up with quite a few reasons to have this be more private, whether personal or business.
Your personal or business stuff in other people's hands is generally not optimal or preferable, especially when more private options exist.
Because LLMs are expensive to host. It's more that no one wants to run these on their own machines, and it's cheapest if they end up running on client PCs instead of yours. Not all use cases of LLMs need a super powerful model that is always up to date.
1) for people who want to make money using AI but can’t afford to pay for an LLM or servers, they can push that cost to the end user.
2) for people who want to generate porn or spam (probably, also to make money)
The privacy thing is complete nonsense. If you want a private server, rent your own private server. If you’re worried AWS is spying on you, you’re paranoid.
This is about money, making money and being cheap, not about good will.
So, you’re right; from a consumer perspective it’s pretty meaningless.
"The marvel is not that the bear dances well, but that the bear dances at all."
Those two events are causally related. The OS has to throttle down the CPU or else it will overheat and malfunction.
It is one of the reasons why heavy number crunching is often performed on the cloud instead.
The model we are using is a quantized Vicuna-7b, which I believe is one of the best open-source models. Hallucination is a problem for all LLMs, but I believe research on the model side will gradually alleviate this problem :-)
I have 64 GB of RAM (not GPU, just normal). I’d like to see proofs of concept showing that the bigger models can be fine-tuned and produce far more acceptable results, or to know if we’re going in completely the wrong direction with this.
The way we make this happen is via compiling to native graphics APIs, particularly Vulkan/Metal/CUDA, making it possible to run with good performance.
I've scoured the web page for RAM requirements for the various models but I can't see anything. Will it be able to run, say, the 30B Open Assistant LLaMA or the 65B raw LLaMA model on a consumer GPU (say, a 3060 with 12 GB of VRAM) using this?
Not trying to take anything away, but I feel the README etc. is very lacking in actual technical details unless you read through the code or actually test it.
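For what it's worth, here is a rough back-of-the-envelope sketch of the question above, not anything taken from the project's docs. It only counts the memory needed to hold the quantized weights (parameters times bits per weight) and ignores the KV cache, activations, and runtime overhead, so real requirements will be higher. The function names and the 4-bit figure for the larger models are my own assumptions for illustration.

```python
# Weights-only memory estimate. Assumption: every parameter is stored at the
# same bit width; KV cache, activations and runtime overhead are NOT included.

def weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def weights_fit(num_params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone would fit in the given amount of VRAM."""
    return weight_memory_gb(num_params_billion, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    vram_gb = 12  # e.g. a 3060 with 12 GB of VRAM
    # (label, parameters in billions, bits per weight); bit widths are illustrative
    for label, params, bits in [("7B", 7, 3), ("30B", 30, 4), ("65B", 65, 4)]:
        size = weight_memory_gb(params, bits)
        print(f"{label} at {bits}-bit: ~{size:.1f} GB of weights, "
              f"fits in {vram_gb} GB: {weights_fit(params, bits, vram_gb)}")
```

By this estimate a 30B model at 4 bits per weight needs about 15 GB for the weights alone and a 65B model about 32.5 GB, so neither would fit entirely in 12 GB of VRAM without a lower bit width or offloading part of the model.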
It's like someone builds a CPU with a floating point unit specifically aimed at CAD software. Then someone else comes and builds a floating point unit for physics simulation. Then someone else ...
Can't we just get a generic compute model, and make that work everywhere? And don't we already have that, e.g. CUDA?
The history of GPGPU in a nutshell…
There are a few “generic compute models” but no incentive for the GPU manufacturers to support them over their proprietary model. Everyone could natively support CUDA and Vulkan and Metal and OpenCL and SPIR-V and…think I’m forgetting one but you get the point.
Our approach leverages TVM Unity, a machine learning compiler that supports compiling GPT/Llama models to a diverse set of targets, including Metal, Vulkan, CUDA, ROCm, and more. In particular, we've found Vulkan great because it's readily supported by a wide range of GPUs, including AMD's and Intel's.
BTW, an interesting data point from Reddit: it also works on the Steam Deck: https://www.reddit.com/r/LocalLLaMA/comments/132igcy/comment....
But for some reason it dramatically slows down after a few messages.
Edit:
Oh no, this one also gives lectures instead of answering questions.
https://i.imgur.com/eiuGzK4.jpg
I'm afraid that in the near future the only organic content on the internet will be the type of content that LLMs refuse to generate.
I've been tempted to try it myself, but then the thought of faster LLaMA / Alpaca / Vicuna 7B when I already have cheap gpt-3.5-turbo access (a better model in most ways) was never compelling enough to justify wading into weird semi-documented hardware.
It's on our plan, but we haven't looked to enable them by default in the first place, mainly because we wanted to demonstrate it running on all GPUs, including old models that don't come with Tensor Cores at all.
There are already efforts underway with GPTQ libraries, but I have found they incur a substantial performance penalty, with the benefit of consuming much less VRAM.
EDIT: I had a look at the repo; it appears the Vicuna model is using 3-bit quantization.
What are you using local LLMs for?
So far, I've only been able to come up with:
- Aid in coding (which always ends up in ChatGPT)
- Summarizing short articles
- whisper-ai + langchain + ffmpeg allows for some great video summarization (especially with non-English LoRAs for us non-natives)
- generating Stable Diffusion prompts
Also, you hint at those many ideas; could you elaborate on that a bit? I'll be playing with LLMs in the near future, so I might as well do something useful with them.
(on the other hand, I wouldn't be surprised if they didn't come with it either, due to the difficulties of it)
> ASSISTANT: Understood! I'll be here to answer any questions you may have in the shell terminal. Let's get started!
> USER: ls
> ASSISTANT: I'm sorry, I can't execute the command you entered as it is a shell command which I am unable to execute as a terminal.
I think it needs a bit more work.
Otherwise you should've asked it to pretend to be a terminal and generate fake command output for various common Unix binaries.
Is there any optimization for running LLMs on RTX cards (40XX, 30XX)? I found that llama.cpp is nice, but I also want to take advantage of my graphics cards, and I didn't find any documentation...
I've rented a server but it has no GPU. Does MLC work well with CPU-only inference?
I'd like to get it set up with langchain if it does work well.
Are these the only supported models as of now? https://github.com/mlc-ai/mlc-llm/blob/d3e7f16c54238b7da5e78...
https://github.com/mlc-ai/binary-mlc-llm-libs
Is the code from which these are built available somewhere? How does one go about building one for their own model?