It seems to me that if you have even a somewhat popular app, you're gathering enough data that you can afford to use p < 0.001 and avoid a lot of the complexities of statistical analysis that come from p < 0.05. If you don't have enough data to reach p < 0.001, it's probably better to work on increasing traffic than on getting the piddling gains from A/B testing so early.
That said, if a test has been running for a while without a clear answer, it can run for a looong time before it finishes. In my A/B testing tutorial I explored that, starting at http://www.elem.com/~btilly/effective-ab-testing/#slide59 (just use the arrow keys to move forwards and backwards through the slides). I found that, depending on whether random fluctuations take you in the same direction as the underlying bias or the opposite one, there tends to be an order of magnitude difference in how long the test takes to run. Furthermore, whichever version is leading after many observations is usually really better, and in the worst case is overwhelmingly likely to not be much worse. Therefore there are times when it really is better to just declare an answer and move on.
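A rough stdlib-only sketch of that spread (the rates, batch size, and cap here are illustrative assumptions, not the numbers from the slides): run many sequential tests between a fair coin and a slightly biased one, peeking every 50 flips, and record when each run first crosses p < 0.05.

```python
import math
import random

def two_prop_p(x_a, x_b, n):
    """Two-sided pooled two-proportion z-test p-value, n flips per coin."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(x_a / n - x_b / n) / se
    return math.erfc(z / math.sqrt(2))

def time_to_significance(p_a, p_b, rng, batch=50, cap=5000):
    """Flip both coins in batches, stopping at the first peek with p < 0.05."""
    x_a = x_b = n = 0
    while n < cap:
        x_a += sum(rng.random() < p_a for _ in range(batch))
        x_b += sum(rng.random() < p_b for _ in range(batch))
        n += batch
        if two_prop_p(x_a, x_b, n) < 0.05:
            break
    return n

rng = random.Random(0)
stops = sorted(time_to_significance(0.50, 0.55, rng) for _ in range(200))
print("10th/50th/90th percentile stopping n:", stops[19], stops[99], stops[179])
```

Even with the same true difference in every run, the lucky runs stop after a handful of batches while the unlucky ones grind on many times longer, which is the order-of-magnitude spread described above.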
If you wish to formalize this, you could use the strategy used by some medical trials, where they decide in advance what confidence levels will cause them to cut off early after 100 trials, 1,000 trials, or 10,000 trials, or else to go to (say) 50,000 trials. They then arrange that the sum of the odds of making a mistake at any early cutoff is below some acceptable threshold.
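One simple (if conservative) way to arrange that sum is a Bonferroni split: divide the overall error budget evenly across the planned looks, so the union bound guarantees the total chance of an early mistake stays under the threshold. A sketch with made-up look sizes:

```python
from statistics import NormalDist

def bonferroni_looks(look_sizes, overall_alpha):
    """Pre-commit a stricter per-look p-value cutoff so that, by the union
    bound, the chance of ANY false positive across looks <= overall_alpha."""
    per_look_alpha = overall_alpha / len(look_sizes)
    # Two-sided z threshold corresponding to the per-look alpha.
    z_cut = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    return per_look_alpha, z_cut

# Looks at 100, 1,000, and 10,000 trials, final analysis at 50,000.
per_look_alpha, z_cut = bonferroni_looks([100, 1_000, 10_000, 50_000], 0.05)
print(per_look_alpha, round(z_cut, 3))  # each look needs roughly z > 2.5
```

Real trials typically use less conservative alpha-spending schemes (e.g. O'Brien-Fleming boundaries) that spend less of the budget at the early looks, but the bookkeeping idea is the same.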
For example, it works to flip two coins 50 times each, and then run a statistical-significance test. It does not work to flip two coins 50 times each, run a test; if no significance yet, continue to 100, then 150, etc. until you either find a significant difference or give up. That greatly increases the chance that you'll get a spurious significance, because your stopping is biased in favor of answering "yes": if you found a difference at 50, you don't go on to 100 (where maybe the difference would disappear again), but if you didn't find a difference at 50, you do go on to 100.
Put differently, it uses a separate p-value for "what is the chance I could've gotten this result in [50|100|150|...] trials with unweighted coins?" to reject the null hypothesis at each checkpoint, as if those tests were independent. But the relevant probability for the entire series is that of the union: "what is the chance I could've seen this result at any of the 50, 100, 150, 200, ... stopping points with unweighted coins?", and that probability is higher. Yet that's exactly how many A/B tests are done: you start collecting data, and let the trials run until you find "significant" differences or give up.
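A quick stdlib-only simulation of exactly this setup (two fair coins, peeks at 50/100/150/200 flips per coin, naive p < 0.05 at each peek) shows how far the "significant at any stopping point" probability exceeds the nominal 5%:

```python
import math
import random

def two_prop_p(x_a, x_b, n):
    """Two-sided pooled two-proportion z-test p-value, n flips per coin."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(x_a / n - x_b / n) / se
    return math.erfc(z / math.sqrt(2))

rng = random.Random(42)
sims = 5000
peeking_hits = single_hits = 0
for _ in range(sims):
    x_a = x_b = 0
    rejected = False
    for n in (50, 100, 150, 200):  # peek after every batch of 50 flips
        x_a += sum(rng.random() < 0.5 for _ in range(50))
        x_b += sum(rng.random() < 0.5 for _ in range(50))
        if not rejected and two_prop_p(x_a, x_b, n) < 0.05:
            rejected = True        # naive rule: "significant", declare a winner
    peeking_hits += rejected
    # Honest rule: one pre-planned test at n = 200 only.
    single_hits += two_prop_p(x_a, x_b, 200) < 0.05

peek_rate = peeking_hits / sims
single_rate = single_hits / sims
print(f"false-positive rate with peeking: {peek_rate:.3f}")
print(f"false-positive rate, single look: {single_rate:.3f}")
```

Both coins are fair, so every "significant" result here is spurious; the peeking rule produces them at well over twice the advertised 5% rate.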
(It's possible to set up a series of tests where you choose when to stop based on observed values, but you have to use different statistical machinery from the common significance tests.)
Furthermore, I note with interest that 2 of the 3 statistical techniques he named (Student's t-test and ANOVA) only apply when the observed variables are themselves normally distributed, which is not a good description of binary yes/no outcomes. As for the remaining technique, a chi-square test is appropriate, but statisticians tell us that the G-test is preferable.
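In practice the two tests usually agree closely on binary outcomes. Here's a stdlib-only sketch of both on a made-up 2×2 table (30/100 conversions vs 45/100); the numbers are illustrative, not from the thread:

```python
import math

def chi2_and_g(table):
    """Pearson chi-square and G-test (log-likelihood ratio) statistics for a
    2x2 contingency table, with p-values from the chi-square(1 df) tail."""
    (a, b), (c, d) = table
    n = a + b + c + d
    obs = [[a, b], [c, d]]
    # Expected counts under independence: (row total * column total) / n.
    exp = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
           [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))
    g = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
                for i in range(2) for j in range(2))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    tail = lambda x: math.erfc(math.sqrt(x / 2))
    return chi2, tail(chi2), g, tail(g)

chi2, p_chi2, g, p_g = chi2_and_g([[30, 70], [45, 55]])
print(f"chi-square: stat={chi2:.3f}, p={p_chi2:.4f}")
print(f"G-test:     stat={g:.3f}, p={p_g:.4f}")
```

If scipy is available, `scipy.stats.chi2_contingency(table, correction=False, lambda_="log-likelihood")` computes the G-test directly (`correction=False` matches the uncorrected statistics above).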
Many people think they will become millionaires if they follow the style of person X.
Person X is like a trial in which a coin was tossed 10,000 times and came up heads 6,000 times.
Since there is no information about the other people, the other trials, many fall into the illogical belief that they will succeed in the same way.
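For scale, that record is about 20 standard deviations above a fair coin's mean, which is essentially impossible by luck alone; a quick normal-approximation tail check (stdlib only):

```python
import math

n, k, p = 10_000, 6_000, 0.5
mean = n * p
sd = math.sqrt(n * p * (1 - p))   # 50 for a fair coin
z = (k - mean) / sd               # 20 standard deviations
# Upper tail of the normal approximation; erfc stays accurate this far out.
tail = math.erfc(z / math.sqrt(2)) / 2
print(f"z = {z:.0f}, P(>= {k} heads) ~ {tail:.2e}")
```

The record itself says nothing about how many other trials were discarded to find it, which is exactly the missing information the paragraph above points at.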