This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half.
You don’t need to take my word for it. Here are some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs
Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.
I held out hope for so long. I thought it was temporary. It ain’t temporary, Zak. And I vividly remember when it happened. Some smart person at Google proposed a new allocation algorithm back near the end of 2021, and poof, overnight our ability to make TPUs went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up.
I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve them, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data.
Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself (aka preempt) after no more than a couple of hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it.
There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes.
One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes.
Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions.