TRC is supposed to be the “something better”. This insider TPU stuff is for the birds. If TRC can only offer 4 hours with no preemptions, that’s fine, but they need to be up front about that. Saying that TPUs preempt every 24 hours and then killing them off after 45 minutes is… not very productive.
As for their comments, the third screenshot is the key; they’re agreeing that the situation is bad. They’re a friend, and they’re a little indirect with the way they phrase things. (If you’ve ever had a friend who really doesn’t want to be wrong, you know what I mean; they kind of say things in a circular way in order to agree without agreeing. After awhile it’s pretty cute and endearing though.)
I was particularly pessimistic in those DMs because it came a couple months after I thought I’d give TRC one last try, back in January, which was roughly a year after I’d started my “ok, I’m losing hope, but I’ll wait and see” journey. In the meantime I kept cheerleading TRC and driving people to their signup page. But after the TPUs all died in less than two hours yet again, that was that.
I have a really high tolerance for faulty equipment. This is free compute; me complaining is just ungrateful. But I saw what things were like in 2019. “Different” would be the understatement of the century. If my baby wasn’t being incubated in the NICU today, I’d show the charts where our usage went from thousands of cores down to almost zero, and not for lack of trying.
It also would’ve been fine to say “sorry, this is unsustainable, the new limits are one tpu per person per project” and then give me a rock solid tpu. We had those in 2021. One of our TPUv3s stayed online for so long that I started to host my blog on it just to show people that TPUs were good for more than AI; the uptime was measured in months. Then poof, now you can barely fire one up.