Eh, that depends. A small voice+text model is probably more useful to most people than a scaled-up voice-only model. The large voice-only model has to compete on intelligence with e.g. Qwen and Llama, since it can't be used in conjunction with them, whereas a small voice+text model can serve as a cheap frontend hiding a larger, smarter, but more expensive text-only model behind it. This is an 8B model: running it is nearly free, and it fits on a 4090 with room to spare.
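For a rough sanity check on the 4090 claim, here's a back-of-the-envelope sketch. It assumes "8B" means exactly 8 billion parameters and counts only the weights (no KV cache, activations, or audio codec overhead), so treat the headroom as an upper bound rather than a measurement. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9       # assumption: "8B" taken as exactly 8 billion parameters
VRAM_4090_GB = 24    # an RTX 4090 has 24 GB of VRAM

# Common weight precisions and their size per parameter in bytes.
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    need = weight_memory_gb(N_PARAMS, bytes_per_param)
    print(f"{name}: ~{need:.0f} GB of weights, ~{VRAM_4090_GB - need:.0f} GB left over")
```

Even at 16-bit precision the weights come to about 16 GB, which leaves a few GB for everything else; quantized, there's a lot more slack.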
Sure, a small team focused purely on voice-to-voice could probably do a lot better at voice-to-voice than a small team splitting its effort across voice and text. But a small team focused on making the most useful model would probably do better at that goal by building voice+text rather than voice-only.