So this poses the question. Orchestrating workflows using an agent framework seems like it goes against the gospel that Hacker News users (like myself) preach so often.
So what do you think about approaching problems with specialized agent flows? Is this something that will shine in many use cases, or become obsolete in the next few years of innovation?
That said, I can’t seem to do better than just “vibes”. Basically: oh, this model gave me a good response to this question, so it must be better.
Now, I have tried keeping track of a couple of benchmarks like the ones I mentioned above. But I generally can’t translate these benchmarks into utility outside of the small scope each benchmark tests for. Also, there are so many benchmarks to keep track of, and each takes some learning to understand.
So perhaps my scope isn’t defined well enough. But as a programmer, everything >GPT-4o feels pretty damn similar.
Would love to hear how others evaluate LLMs beyond “just vibes”, both for general programming use and when trying to create new AI projects.