The draws are at best evidence towards equality, not against it. Allow the number of draws to vary and the probability of seeing a lead of at least 7 wins (13 to 6) in 64 games with 45 draws moves up to about 0.13, or 13%, which is even less significant. That is under the assumption that the two players are identical, which is the appropriate null hypothesis. So in about one tournament in 8 you would expect a lead this large, even if it were one algorithm playing itself. So from one tournament we can say it is likely that the one algorithm is in fact better, but the result doesn't rise to the standard of statistical significance.
<code>
# R code to empirically estimate the two-sided probability of
# seeing a lead of at least 7 games (13 wins vs. 6 losses)
# when 64 games are played, the assumed probability of a
# draw is 45/64, and under the null the win/loss odds are equal
simulate <- function(nplay,ndraw) {
  sample(c('w','d','l'),size=nplay,replace=TRUE,
         prob=c((nplay-ndraw)/2,ndraw,(nplay-ndraw)/2)/nplay)
}
wldiff <- function(v) { abs(sum(v=='w')-sum(v=='l')) }
set.seed(350920)
stats <- replicate(10000,wldiff(simulate(64,45)))
print(sum(stats>=13-6)/length(stats))
## [1] 0.1341
</code>
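As a cross-check on the simulation, the same null model can be summed exactly instead of sampled. This is only a sketch: the function name exact_p and its arguments are my own, not part of the original code. It assumes each game is independently a draw with probability ndraw/nplay, and that each decisive game is a coin flip. The result should land close to the simulated 0.13.
<code>
# Exact two-sided probability of a win/loss gap of at least `lead`
# under the same null model as the simulation above.
# exact_p is a hypothetical helper name used for illustration.
exact_p <- function(nplay, ndraw, lead) {
  p_draw <- ndraw / nplay
  total <- 0
  for (k in 0:nplay) {
    # probability that exactly k of the nplay games are decisive
    p_k <- dbinom(k, nplay, 1 - p_draw)
    # possible win counts among the k decisive games
    w <- 0:k
    # |wins - losses| for each win count
    gap <- abs(2 * w - k)
    total <- total + p_k * sum(dbinom(w, k, 0.5)[gap >= lead])
  }
  total
}
print(exact_p(64, 45, 13 - 6))
</code>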
(It is weird that somebody, not me, created a throw-away account to make the original comment. Likely they are involved in chess development, or know how quickly stats discussions go sideways.)