That's rather strange. The graphics part is lightweight (pre-rendering the background and then drawing a few shapes), but if you could shrink the browser window to very small dimensions and test again, we could rule it out.
The audio part is a bit more involved. The vocal tract is simulated in segments, each segment receiving, filtering and reflecting the sound wave energy. The algorithm is computationally heavy, but it ran well on my mediocre smartphone.
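
To give a rough idea of where the CPU time goes, here is a minimal sketch of one time step of a segmented tube model in the Kelly-Lochbaum style. It is not the project's actual code: the function name, the parameter names and the default reflection and damping values are all assumptions made for illustration. The point is that every output sample requires a pass over all segments, so the cost grows linearly with the segment count.

```typescript
// Hypothetical sketch, not the project's code: one sample of a segmented
// tube model. Each segment holds a right-going and a left-going wave.
function tractStep(
  right: Float64Array, // right-going wave per segment (toward the lips)
  left: Float64Array, // left-going wave per segment (toward the glottis)
  area: Float64Array, // cross-sectional area per segment
  glottalInput: number, // excitation sample entering the first segment
  glottalReflection: number = 0.75, // assumed value
  lipReflection: number = -0.85, // assumed value
  damping: number = 0.999 // assumed per-step energy loss
): number {
  const n = right.length;
  const junctionRight = new Float64Array(n + 1);
  const junctionLeft = new Float64Array(n + 1);

  // Tube ends: partial reflection at the glottis and at the lips.
  junctionRight[0] = left[0] * glottalReflection + glottalInput;
  junctionLeft[n] = right[n - 1] * lipReflection;

  // Interior junctions: a change in area reflects part of the wave.
  for (let i = 1; i < n; i++) {
    const sum = area[i - 1] + area[i];
    // Guard against two fully closed neighbours (assumed fallback value).
    const r = sum === 0 ? 0.999 : (area[i - 1] - area[i]) / sum;
    const w = r * (right[i - 1] + left[i]);
    junctionRight[i] = right[i - 1] - w;
    junctionLeft[i] = left[i] + w;
  }

  // Move the waves one segment along and apply damping.
  for (let i = 0; i < n; i++) {
    right[i] = junctionRight[i] * damping;
    left[i] = junctionLeft[i + 1] * damping;
  }

  return right[n - 1]; // sample leaving at the lips
}
```

With a uniform tube (all areas equal) nothing is reflected at the interior junctions, so an impulse fed in at the glottis simply travels one segment per step until it reaches the lips.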
Maybe if stuttering is detected, the app could lower the number of tract segments, which would also lower the quality. Increasing the buffer size would probably also help with glitches, but I don't think it would solve the high CPU utilization.
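
A minimal sketch of how that fallback might look, assuming the app can measure how long each audio block takes to render. The class name, the thresholds and the 25% reduction step are all hypothetical and untuned; nothing here exists in the project today.

```typescript
// Hypothetical controller, not part of the project: reduces the segment
// count when audio blocks keep using most of their real-time budget.
class AdaptiveQuality {
  segments: number;
  private overloadedBlocks = 0;
  private readonly minSegments: number;
  private readonly loadThreshold: number;
  private readonly patience: number;

  constructor(
    segments: number,
    minSegments: number = 16, // assumed quality floor
    loadThreshold: number = 0.8, // assumed fraction of the budget
    patience: number = 3 // consecutive overloaded blocks before acting
  ) {
    this.segments = segments;
    this.minSegments = minSegments;
    this.loadThreshold = loadThreshold;
    this.patience = patience;
  }

  // Call once per rendered block; returns the segment count to use next.
  report(renderMs: number, blockSize: number, sampleRate: number): number {
    // Real time covered by one block, in milliseconds.
    const budgetMs = (blockSize / sampleRate) * 1000;
    if (renderMs / budgetMs > this.loadThreshold) {
      this.overloadedBlocks++;
    } else {
      this.overloadedBlocks = 0;
    }
    if (
      this.overloadedBlocks >= this.patience &&
      this.segments > this.minSegments
    ) {
      // Drop a quarter of the segments, but never go below the floor.
      this.segments = Math.max(
        this.minSegments,
        Math.floor(this.segments * 0.75)
      );
      this.overloadedBlocks = 0;
    }
    return this.segments;
  }
}
```

For example, a 512-sample block at 44.1 kHz covers about 11.6 ms of audio. Doubling the block size doubles both that budget and the work per block, so the ratio between them (the average load) stays the same. A larger buffer only gives more slack to absorb occasional spikes.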