undefined | Better HN

0 pointsoverfeed1y ago0 comments

> Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.

You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your jardware, or paying for hosted models.

Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large modles locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.

0 comments

3 comments · 2 top-level

rs1861y ago· 1 in thread

I have a 4070 super for gaming, and used it to play with LLM a few times. It is by no means a bad card, but I realize that unless I want to get 4090 or new Macs that I don't have any other use for, I can only use it to run smaller models. However, most smaller models aren't satisfactory and are still slower than hosted LLMs. I haven't found a model that I am happy with for my hardware.

Regarding agentic workflows -- sounds nice but I am too scared to try it out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine, anything more complicated? Has been a disaster for me.

taneq1y ago

As I understand it, these models are limited on GPU memory far more than GPU compute. You’d be better off with dual 4070s than with a single 4090 unless the 4090 has more RAM than the other two combined.

adastra221y ago

I have never found any agent able to put together sensible pull requests without constant hand holding. I shudder to think of what those repositories must look like.

j / k navigate · click thread line to collapse