By Matt Wolff, Chief Data Scientist at Cylance
Technology moves swiftly, and nowhere is that more evident than in the current state of machine learning. One merely has to look at the ubiquitous technological experiences of daily life to find machine learning applications at their core. Take, for example, online shopping. Almost every large online storefront will recommend items you may want to purchase. These recommendations are based on a few data points: your previous shopping history, your recent searches, or even who your friends are. Machine learning makes it possible to consider this vast amount of data and arrive at a simple answer: what item might you like to purchase?
Shopping, of course, is not the only industry to leverage recent advances in machine learning; the list of companies and industries, along with the range of applications, grows by the day. Common applications of machine learning in today's technology include voice recognition, fraud detection, email spam filtering, text processing, search recommendations, and video analysis. These technologies are being improved daily, with the improvements fuelled by greater data analytics, the falling cost of computation, and advances in the state of the art of machine learning research.
So with all of these recent technologies embracing machine learning approaches, one may ask what exactly machine learning is, and how it is applied in these situations. In a broad sense, machine learning refers to a series of techniques for "training" a machine to solve a problem. As a simple example, say I want to train a machine to determine whether a photo shows an apple or an orange. To train the machine, I provide it with 100 photos of apples and 100 photos of oranges. Once the machine is trained, I can give it a new photo and it can tell me whether that photo shows an apple or an orange.
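To make the idea concrete, here is a toy sketch of that apples-and-oranges training loop. It assumes (purely for illustration) that each photo has already been reduced to a single numeric feature, its average hue, and it uses a simple nearest-centroid rule in place of a real learning algorithm; the data values are hypothetical.

```python
def train(examples):
    """'Train' on (hue, label) pairs by computing the average hue per label."""
    sums, counts = {}, {}
    for hue, label in examples:
        sums[label] = sums.get(label, 0.0) + hue
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, hue):
    """Assign the label whose average hue is closest to the new photo's hue."""
    return min(model, key=lambda label: abs(model[label] - hue))

# Hypothetical training set: low hue values are red (apples),
# higher hue values are orange (oranges).
training = [(2.0, "apple"), (5.0, "apple"), (28.0, "orange"), (32.0, "orange")]
model = train(training)

print(predict(model, 4.0))   # a red-ish photo -> apple
print(predict(model, 29.0))  # an orange-ish photo -> orange
```

Real systems learn from thousands of pixel-level features rather than one hand-picked number, but the shape of the process is the same: show the machine labeled examples, summarize them into a model, then use the model to label new inputs.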
However, not all machine learning solutions are created equal. One measure of the effectiveness of a machine learning model is its accuracy on future predictions. For example, suppose I ask the apples-and-oranges model to classify 10 photos of apples, and of those 10 it says 8 are apples and 2 are oranges. We can then say the model is 80% accurate. While this is reasonably accurate, one can easily improve upon this model. One way to improve a machine learning system is to provide more data; essentially, broader experience improves its capabilities. For example, instead of 100 photos, one might provide 1,000, or 1,000,000, photos to train the machine. Very often, this increase in volume yields huge improvements in the accuracy of such models.
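The accuracy measure described above is just the fraction of predictions that match the true labels, which can be stated in a few lines (the prediction values here simply replay the 8-of-10 scenario from the text):

```python
def accuracy(predictions, truths):
    """Fraction of predictions that agree with the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# 10 photos of apples; the model calls 8 of them apples and 2 oranges.
truths = ["apple"] * 10
predictions = ["apple"] * 8 + ["orange"] * 2

print(accuracy(predictions, truths))  # 0.8, i.e. 80% accurate
```

In practice, accuracy is computed on a held-out test set the model never saw during training, so that the score reflects performance on genuinely new data.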
The incredible rate of growth of data over the last few years has led to the coinage of a new term, "big data". As you can imagine, this means lots of data: enough that special consideration must be given to how it is stored, transported, managed, and analyzed. Big data has been one of the pillars of the rapid growth and improvement in machine learning in recent years. Another major force behind the machine learning movement has been the availability of cheap and plentiful computation.
Cloud computing advances have been essential in providing massive computing power in a cost-effective manner, helping to solve computation-intensive problems. A famous early example of this model is the "SETI@home" project, where volunteers donate their unused CPU cycles to help analyze radio telescope data. The ability to leverage thousands upon thousands of machines committed to solving a single problem lends itself well to the field of machine learning, particularly when dealing with extremely large data sets. As a practical example, we have one computation at Cylance that takes 1,000 machines approximately 30 days to complete. Even a few years ago, it was not practical to solve such a problem.
The ability to collect and handle big data, along with increased ability to perform previously impossible calculations, are significant achievements. Combined, they are helping to fuel an explosion of growth in machine learning areas.
When I look at the cyber security industry, I see two trends that lead me to the conclusion that machine learning approaches are a good fit for the industry. First, the collection and storage of large amounts of useful data points is already well underway in cyber security. It would be difficult to find a security analyst who is not currently overwhelmed by the vast amount of raw data collected every day in mature environments. A plethora of tools even exists to sort, slice, and mine this data in a somewhat automated fashion, helping analysts in their day-to-day activities.
The second trend is the lack of qualified, experienced individuals to successfully defend vital infrastructure and systems. The defensive game is complex and never-ending, and one slip-up by a security team can be enough to open the door to a security incident. In addition, the demand for excellent security professionals is projected to keep growing, compounding the current challenges around the dearth of talent.
Given these two points, machine learning techniques are a great fit to improve the security posture of an organization. In fact, machine learning approaches are probably already implemented at some level in your organization. But what we should see over the next couple of years is a vast improvement in the current state of the art of machine learning in cyber security, and an increase in the number of areas where machine learning techniques are prevalent.
As an example of the impact that improved machine learning will have on cyber security, let's consider the case of an analyst responsible for an incident response case. In this example, a network has been penetrated and malware has been placed on various machines in the network with the purpose of exfiltrating sensitive information. The analyst here is charged with multiple tasks: discovering what exactly has been stolen, how it was stolen, and how to repair the system to prevent the same or similar attacks in the future.
Without the help of any form of machine learning system, the analyst would have a difficult time resolving these issues in a short timeframe. For example, to determine what has been stolen, the analyst might review file access logs or network traffic, looking for access to sensitive files or large amounts of data flowing out of the network. To determine how the attacker gained a persistent foothold in the network, malware analysis of the disk may be needed to track down known malware samples using signatures developed by other human analysts. Or perhaps an analysis of the running system, looking for unusual processes or other anomalous behaviours, would be conducted as part of the incident response.
With a machine learning approach, many of these tasks can be automated, and even deployed in real time to catch these activities before any damage is done. For example, a well-trained machine learning model will be able to identify unusual traffic on the network and shut down these connections as they occur. A well-trained model would also be able to identify new samples of malware that evade human-generated signatures, and perhaps quarantine these samples before they can even execute. In addition, a machine learning model trained on the standard operating procedure of a given endpoint may be able to identify when the endpoint itself is engaging in odd behaviour, perhaps at the request of a malicious insider attempting to steal or destroy sensitive information.
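The "identify unusual traffic" idea above can be sketched in its simplest possible form: learn what normal outbound traffic volume looks like, then flag anything that deviates too far from it. This is a minimal illustration, not Cylance's method; the byte counts are hypothetical, the model of "normal" is just a mean and standard deviation, and the 3-sigma threshold is an arbitrary illustrative choice (production systems use far richer features and models).

```python
import statistics

def fit_baseline(byte_counts):
    """Learn 'normal' outbound traffic as a mean and standard deviation."""
    return statistics.mean(byte_counts), statistics.stdev(byte_counts)

def is_anomalous(baseline, observed, threshold=3.0):
    """Flag an observation more than `threshold` deviations from the mean."""
    mean, stdev = baseline
    return abs(observed - mean) > threshold * stdev

# Hypothetical history of per-minute outbound bytes during normal operation.
history = [1200, 1350, 1100, 1280, 1330, 1250]
baseline = fit_baseline(history)

print(is_anomalous(baseline, 1300))    # ordinary volume -> False
print(is_anomalous(baseline, 250000))  # exfiltration-sized burst -> True
```

A real-time deployment would run this check on live traffic counters and, as the text suggests, terminate the offending connections automatically rather than merely printing a flag.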
Currently, the large majority of machine learning approaches in cyber security are used as a type of "warning" system, often requiring a human in the loop to make the final decision. This requirement is usually the result of machine learning models that are not yet accurate enough to match a typical human analyst. As a result, the analyst retains the final decision because of his or her lower error rate.
But what we are starting to see, and project to become increasingly common, are machine learning systems that are in fact more accurate than their human counterparts. This is happening due not only to improvements in machine learning, but also to the difficulty of growing the pool of human cyber security talent. As an example, consider a SOC, where operations often run 24 hours a day. It may not be possible to have an exceptional security analyst on hand at all times to analyze potential malware threats. In some cases, a junior analyst will be tasked with making threat decisions and, being junior, can be expected to have a higher error rate in assessing threats. In this case, it might be better to trust a machine learning solution that is proven to be as effective as an exceptional analyst.
In the cyber security industry at the moment, the answer to whether one should trust machine learning over human analysis is often "no". To some extent, a shift in the way we think about technology and its capabilities needs to occur before we fully trust the next wave of machine learning systems. Perhaps this is ultimately a matter of trust. It's easy to cultivate a relationship based on respect and trust with your peers in the cyber security industry; developing the same trust in a black-box machine learning model will take time, and will only come after repeated successful results from these systems.
The next few years will be interesting in the cyber security landscape. The massive amounts of data that can be generated, along with the challenge of conducting large-scale analysis to find the proverbial needle in the haystack, are the perfect combination for extensive and successful machine learning architectures. ■
ABOUT THE AUTHOR
Matt Wolff is the Chief Data Scientist at Cylance, the first math-based advanced threat detection and prevention cybersecurity company. He is a security veteran of more than 10 years, and holds an M.S. in computer science with a focus on artificial intelligence from Georgia Tech.