
By Ilan Segelman

VP Sales & Business Development, Power Communications, and Operation Manager, Sophos Israel

Machine learning has become a buzzword that many vendors use as a magic word when describing their information security solutions, in an attempt to boost sales. As a result, evaluating machine learning-based solutions has become one of the most difficult tasks.

In this article I will refrain from describing what machine learning is or why we need it (that’s what Wikipedia is for). I will only stress that the approach to machine learning must always be based on science, transparency, and validation. To make our lives a bit easier, here are five concrete criteria that will help you examine the quality of such solutions, no matter which type of algorithm is used:

Detection Rate vs. False Positive Results – A high detection rate does not necessarily mean success. You can easily reach a 100% detection rate by “convincing” the algorithm that every file it screens is malicious. This is why the truly important value is the false positive rate. A false positive means that a legitimate file is mistakenly identified as malicious and its use is blocked. The goal is, of course, the lowest possible false positive rate. In machine learning, this trade-off can be examined graphically through the ROC (receiver operating characteristic) curve, which plots the detection rate against the false positive rate. Ask vendors to show you their ROC curve, both current and historical. A vendor that cannot show you this data cannot credibly promise how much malware it will block from infiltrating the organization.
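
To make the trade-off concrete, here is a minimal sketch of how the points on a ROC curve can be computed from a model's scores. The function names, scores, and labels are made up for illustration and do not come from any vendor's product.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false_positive_rate, detection_rate) pairs, one per threshold.

    labels: 1 = malicious, 0 = legitimate.
    A file is flagged when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one legitimate sample")

    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def detection_rate_at_fpr(points: List[Tuple[float, float]], max_fpr: float) -> float:
    """Best detection rate achievable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Illustrative data only: model scores and ground-truth labels for 8 files.
scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
print(points[-1])                          # lowest threshold: every file is flagged
print(detection_rate_at_fpr(points, 0.0))  # detection rate with zero false positives
```

The last point on the curve is the “flag everything” case: a 100% detection rate paired with a 100% false positive rate. The number worth comparing between vendors is the detection rate at a false positive rate you can actually tolerate.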

Updates – Machine learning enables information security solutions to identify and block threats that have never been seen before. A good machine learning technology does not need frequent updates, because it continues to recognize new threats on its own over a long period of time. A quality solution will show good results on the ROC curve not only for days or weeks but for several months. A solution that requires frequent updates, and whose accuracy deteriorates between them, does not really deliver.
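
One way to check this claim, sketched under assumptions: freeze the model and its threshold, then compute the detection rate and the false positive rate separately for each month of newly seen files. The record layout `(month, score, label)`, the threshold of 0.5, and all numbers below are hypothetical.

```python
from collections import defaultdict


def monthly_rates(records, threshold):
    """Compute (detection_rate, false_positive_rate) per month for a frozen model.

    records: iterable of (month, score, label), where label 1 = malicious
    and 0 = legitimate. A file is flagged when score >= threshold.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for month, score, label in records:
        flagged = score >= threshold
        if label == 1:
            key = "tp" if flagged else "fn"
        else:
            key = "fp" if flagged else "tn"
        counts[month][key] += 1

    result = {}
    for month in sorted(counts):
        c = counts[month]
        malicious = c["tp"] + c["fn"]
        legitimate = c["fp"] + c["tn"]
        result[month] = (
            c["tp"] / malicious if malicious else None,
            c["fp"] / legitimate if legitimate else None,
        )
    return result


# Illustrative data only: the same model evaluated 1 and 4 months after release.
records = [
    (1, 0.9, 1), (1, 0.8, 1), (1, 0.1, 0), (1, 0.2, 0),
    (4, 0.9, 1), (4, 0.4, 1), (4, 0.1, 0), (4, 0.6, 0),
]
rates = monthly_rates(records, threshold=0.5)
for month, (detection, false_positive) in rates.items():
    print(month, detection, false_positive)
```

In this made-up data the model misses half the malware and flags half the legitimate files by month 4, which is the kind of deterioration between updates that a vendor's historical results should rule out.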

Real-Time Decisions – If screening a file for malware takes longer than the malware needs to do its dirty work, we have an excellent detection solution but not a preventive one. To prevent malware infiltration, the machine learning algorithm has to reach a decision within a millisecond, not in seconds or minutes. It is important to check whether the algorithm operates in real time and how long it takes to make a decision. In addition, find out what happens to its accuracy when the computer is offline. A machine learning solution with a data set that is not compatible with your endpoints will have to be connected to a cloud service, and will be slow and unreliable.
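
Decision time is easy to measure yourself during an evaluation. The sketch below times each decision individually and reports the median and the worst case. `classify` is a hypothetical stand-in for whatever local scanning call the product exposes, not a real vendor API, and the sample files are synthetic.

```python
import statistics
import time


def classify(file_bytes: bytes) -> bool:
    """Hypothetical stand-in for a vendor's model: flags files containing a marker."""
    return b"EVIL" in file_bytes


def measure_latency_ms(classifier, samples):
    """Return the wall-clock time of each single decision, in milliseconds."""
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        classifier(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Synthetic samples: 200 benign 1 KB files and one file carrying the marker.
samples = [bytes(1024) for _ in range(200)] + [b"EVIL" + bytes(1020)]

latencies = measure_latency_ms(classify, samples)
median = statistics.median(latencies)
worst = max(latencies)
print(f"median: {median:.4f} ms, worst case: {worst:.4f} ms")
```

The worst case matters more than the average here, because prevention fails on the one file that takes too long. Running the same measurement with the network disconnected shows how much the product depends on a cloud service.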

Learning in the Real World – The algorithm’s performance depends to a large extent on the data it was trained on. If that data is academic, outdated, or irrelevant, the algorithm will not perform reliably in the real world outside the laboratory. Check what data the algorithm was trained on, whether it is realistic, and how large it is.

Growth Potential – The learning process requires the ability to collect large amounts of new data, including both legitimate files and malware. However, collection alone is not enough. The data grows over time, so the solution must be able to learn from and evaluate the newly collected data while the database continues to grow significantly, and to do so at a consistently high speed.

In sum, machine learning is undoubtedly “the next thing” in information security and in computing in general. Many solutions on the market claim to have these capabilities. When you examine them, don’t forget to ask the right questions.