Evaluate AI and ML Applications: Aspects and Ideas

AI and machine learning applications are developed at a rapid pace. Every now and then one can read about the latest achievements and improved performance in various domains. The applications are getting better and better, but:

What is good enough?

What is the threshold that makes an AI application viable? How should these AI applications be assessed and evaluated?

This post is the first step towards answering these questions.

It offers some aspects to consider when evaluating a given application. It would be interesting to build some kind of an “AI maturity model” around these points. Notice that the points overlap to some extent.

AI vs. Human performance

If a machine performs better (by some relevant measure) than a human, it is viable. For example, Google’s DeepMind performed better than a human professional in lip-reading. See the link.

Cost vs. benefit

The benefits from an AI application should be larger than the incurred costs. This kind of cost-benefit analysis might, for example, weigh reduced employee costs against implementation costs. Chatbots are an obvious example of this kind of evaluation.
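
As a rough illustration, a back-of-the-envelope payback calculation for a chatbot could look like the sketch below. All figures and variable names are made-up assumptions, not real data.

```python
# Minimal cost-benefit sketch for a hypothetical chatbot project.
# Every figure here is an illustrative assumption.

implementation_cost = 120_000       # one-off build and integration cost (EUR)
yearly_running_cost = 20_000        # hosting, licenses, maintenance (EUR/year)
handled_contacts_per_year = 50_000  # contacts the bot resolves without an agent
cost_per_human_contact = 4.0        # average agent cost per contact (EUR)

yearly_savings = handled_contacts_per_year * cost_per_human_contact
yearly_net_benefit = yearly_savings - yearly_running_cost
payback_years = implementation_cost / yearly_net_benefit

print(f"Yearly net benefit: {yearly_net_benefit:.0f} EUR")
print(f"Payback period:     {payback_years:.1f} years")
```

If the payback period is short compared with the expected lifetime of the application, the cost-benefit criterion is satisfied; in practice you would also account for quality effects that are harder to price.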

Turing-styled interpretation

This is a bit subtle. If the application cannot be distinguished from a human, it is good enough. This Turing-styled approach is closely tied to customer experience. Again, chatbots serve as a great example. The customer experience has to be fluent. If the experience breaks down, the application is not mature enough.

Natural limits

Natural limits are closely tied to sentiment analysis. To some extent sentiments are subjective, so even human annotators disagree with each other. It is estimated that the natural limit for accuracy in positive-negative message classification is around 90%. If an application approaches such a limit, it is good enough from a data science point of view. If you seem to be exceeding the limit, you should double-check your cross-validation methodology, since results above the human-agreement ceiling often point to data leakage or an overly optimistic evaluation setup.
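
As a sketch, one way to sanity-check a classifier against such a limit is to compare its cross-validated accuracy with the assumed ceiling. The 90% threshold, the synthetic data and the model choice below are all illustrative assumptions, not a recipe from the post.

```python
# Sketch: compare cross-validated accuracy against an assumed natural limit.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

NATURAL_LIMIT = 0.90  # assumed human-agreement ceiling for positive/negative labels

# Stand-in for real sentiment features and labels.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
mean_acc = scores.mean()

print(f"Cross-validated accuracy: {mean_acc:.3f}")
if mean_acc > NATURAL_LIMIT:
    print("Above the assumed natural limit: check for leakage or an optimistic CV setup.")
else:
    print("Below the assumed natural limit: there may still be headroom.")
```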

What do you think?

How would you approach AI maturity? What kind of criteria and aspects would you propose?