Recently we had to evaluate whether a chatbot we built for an Austrian telecommunications provider would perform better on other NLP engines than the cloud-based one we had in use. We took the training data and calculated common performance metrics, confusion matrices and accuracy scores for a bunch of the big-name providers (IBM Watson, Google Dialogflow, Amazon Lex, Microsoft LUIS, Rasa and some more).
We published the scripts in a GitHub repository, along with a blog article with instructions: https://medium.com/@floriantreml/tutorial-benchmark-your-cha...
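
For a rough idea of what such a comparison computes, here is a minimal sketch in plain Python. It is not taken from the published scripts; the intent names and the sample predictions are made up for illustration, and the helper names `confusion_matrix` and `accuracy` are hypothetical.

```python
from collections import Counter


def confusion_matrix(expected, predicted):
    """Count how often each (expected intent, predicted intent) pair occurs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def accuracy(expected, predicted):
    """Fraction of test utterances whose predicted intent matches the expected one."""
    if not expected:
        raise ValueError("need at least one test utterance")
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


# Hypothetical test set: the intent each utterance should resolve to,
# and the intent one NLP engine actually returned for it.
expected = ["billing", "billing", "roaming", "cancel", "roaming"]
predicted = ["billing", "roaming", "roaming", "cancel", "roaming"]

matrix = confusion_matrix(expected, predicted)
score = accuracy(expected, predicted)

for (exp, pred), count in sorted(matrix.items()):
    print(f"expected={exp:8} predicted={pred:8} count={count}")
print(f"accuracy: {score:.2f}")
```

Running the same test utterances through each engine and comparing the resulting matrices shows which intents a given engine tends to confuse, which a single accuracy number hides.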