There were several links:
- Blog for details: https://homebrew.ltd/blog/llama-learns-to-talk
- Code: https://github.com/homebrewltd/ichigo
- Run locally: https://github.com/homebrewltd/ichigo-demo/tree/docker
- Demo on a single 3090: https://ichigo.homebrew.ltd/
A quick intro: we're a company building local AI tools and training open-source models.
Ichigo is our training method that enables LLMs to understand human speech and talk back with low latency, thanks to FishSpeech integration. It is open data and open weights, weight-initialized from Llama 3.1 to extend that model's reasoning ability to speech.
Plus, we are the creators and lead maintainers of https://jan.ai/ (Jan, a local AI assistant and alternative to ChatGPT) and https://cortex.so/ (Cortex, a local AI toolkit; soft launch coming soon).
Everything we build and train is done out in the open - we share our progress on:
- https://x.com/homebrewltd
- https://discord.gg/hTmEwgyrEg
You can check out all our products on our simple website: https://homebrew.ltd/
I think Matrix is not publicly indexable unless the channel is unencrypted and set to public.
If I remember correctly, "ichigo" means strawberry in Japanese. You are welcome.
Can you help me wrap my brain around this? Does it mean six? I'm struggling to understand how a word can mean two numbers and how this would actually be used in a conversation.
Thanks. I'm curious, but searching for this just returns anime results.
Ban-kai 卍解
I'm trying to use ChatGPT for AI translation, but the other big problem I run into is TTS and STT for non-top-40 languages (e.g. Lao). Facebook has a TTS library, but unfortunately it isn't open for commercial use.
Bringing AI into this space enhances user experience while respecting their autonomy over data. It feels like a promising step toward a future where we can leverage the power of AI without compromising on privacy or control. Really looking forward to seeing how this evolves!
To clarify, while you can enable transcription to see what Ichigo says, Ichigo's design skips directly from audio to speech representations without creating a text transcription of the user’s input. This makes interactions faster but does mean that the user's spoken input isn't transcribed to text.
The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the input transcription step, we're able to speed things up and focus on the verbal experience.
Hope this clears things up!
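For anyone who finds the flow easier to read as code, here's a minimal sketch of the pipeline shape described above. Every function here is a placeholder with an invented name (this is not Ichigo's actual API); the point is only to show that the user's audio goes straight from the encoder's discrete representations into the LLM, and text appears only on the reply side, just before TTS.

```python
def encode_speech(audio_samples):
    """Stand-in encoder: quantize raw audio into discrete speech tokens.
    (The real system uses a learned quantizer; this toy version just
    buckets sample magnitudes into a 512-entry codebook.)"""
    return [int(abs(s) * 100) % 512 for s in audio_samples]

def llm_generate(speech_tokens):
    """Stand-in LLM: consumes speech tokens directly -- note there is
    no transcript of the user's input anywhere -- and emits reply text."""
    return f"Reply to {len(speech_tokens)} speech tokens"

def tts(text):
    """Stand-in TTS: turn the reply text back into audio
    (here, a dummy silent waveform of matching length)."""
    return [0.0] * len(text)

def respond(audio_samples):
    """Speech -> Encoder -> Speech Representations -> LLM -> Text -> TTS.
    Only the model's *reply* passes through a text stage; that text is
    what you see if you enable transcription in the demo."""
    tokens = encode_speech(audio_samples)
    reply_text = llm_generate(tokens)
    return reply_text, tts(reply_text)
```

Under this sketch, `respond([...])` returns both the reply text (for optional display) and the synthesized audio, which matches the behavior described: the assistant's words can be shown, but the user's speech is never transcribed.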
The documentation isn't very detailed yet, but we're planning to improve it and add support for various hardware.