Machine Learning for Security: Things Will Work Better with Humans in the Loop
Why Machine Learning for Security Is Difficult
Malicious online behavior is designed to evade detection, and bad actors are getting better all the time at hiding their digital footprints. The result is that enterprises find themselves battling increasingly sophisticated and frequent online attacks, adversarial scenarios that only grow more complex over time.
In recent years, advances in machine learning (ML) have raised hopes that we might be on the cusp of an era in which cyber safety defenders can rely on system smarts to automatically and rapidly sift through vast amounts of data, detect anomalous behavior, and then uncover the intent behind an alert.
But let’s not get too far ahead of ourselves.
Computers can be better than humans at certain things. For instance, they're great at handling very large datasets and complex statistics. As good as they are, however, ML-based classification of malicious behavior still has limits. Computers are very good at reasoning with statistics, but they're generally not very good at being creative. That matters because some forms of malicious behavior, such as malware, will try to disguise themselves as benign software to avoid being blocked, and countering that kind of deception takes a certain amount of creativity.
For instance, ML recognizes only the patterns it has been taught from the data it has been given. So if an activity stays below a certain threshold, or never acts in a way that would trigger a detection, the system may miss a potential threat. What's more, the whole process rests on statistics extracted from the available data, and statistics don't guarantee the system is going to get it right all the time.
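To make that concrete, here's a minimal sketch of threshold-based detection in Python; the scores and the cutoff are purely hypothetical:

```python
# A minimal sketch of threshold-based detection (illustrative numbers only).
DETECTION_THRESHOLD = 0.8  # hypothetical cutoff tuned on historical data

def classify(maliciousness_score: float) -> str:
    """Block a sample only if its score crosses the threshold."""
    return "block" if maliciousness_score >= DETECTION_THRESHOLD else "allow"

print(classify(0.95))  # block -- clearly malicious
print(classify(0.79))  # allow -- an evasive threat just under the cutoff slips through
```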
Let's say the system is seeking to distinguish benign from malicious behavior, with the ultimate goal of deciding whether or not to block a possible threat. In the course of its work, the system examines all kinds of security information. When it looks at a downloaded file, for instance, what if the browser used for the download didn't save information about where the data came from? Different versions of software may leave such blind spots. What's more, when collecting data from different sources, some users might be running security products that share certain information while others don't, and different versions of operating systems may have different rules for sharing that information. And some information just might be flat wrong.
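As a rough illustration of what that incompleteness looks like to a system, here's a sketch; the sources and field names are invented for the example:

```python
# Hypothetical telemetry from different sources; each records different fields,
# so the feature vectors the model sees are full of blanks.
events = [
    {"source": "browser_a", "file_hash": "ab12", "referrer_url": "http://example.com/dl"},
    {"source": "browser_b", "file_hash": "cd34"},                  # no referrer recorded
    {"source": "os_agent",  "file_hash": "ef56", "signed": True},  # different fields entirely
]

FEATURES = ["referrer_url", "signed"]

for event in events:
    # Missing fields come back as None: the model sees blanks, not facts.
    row = {feature: event.get(feature) for feature in FEATURES}
    print(event["source"], "->", row)
```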
All this reinforces the fact that there's still no substitute for human experts to provide the context needed for proper prioritization. Otherwise, we're basically flying blind and can't claim to understand the intent behind an alert's action.
We might assume that as more data comes online, these challenges will get resolved. But is that a realistic expectation? We still run into the challenge posed by false positives and false negatives with so-called "unknown unknowns" such as zero-day exploits, where we may even train the system incorrectly, telling it that malicious behavior we don't yet know about is benign. In those cases, the system can't be blamed for false negative results. This connects back to the quality of the information: if genuinely new behavior suddenly appears, a traditional ML approach will have a very hard time making sense of it because the system has seen no relevant examples.
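Here's a small sketch of that limitation, using scikit-learn on synthetic data: a classifier trained only on known behaviors must still assign every new input to one of the classes it has seen, however unfamiliar that input is:

```python
# A sketch of the "unknown unknowns" problem on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
benign = rng.normal(loc=0.0, scale=1.0, size=(100, 2))     # known-benign cluster
malicious = rng.normal(loc=4.0, scale=1.0, size=(100, 2))  # known-malicious cluster

X = np.vstack([benign, malicious])
y = np.array([0] * 100 + [1] * 100)  # 0 = benign, 1 = malicious

clf = LogisticRegression().fit(X, y)

# A "zero-day" living in a region the training data never covered still
# gets a confident-looking label, right or wrong.
novel = np.array([[-6.0, 9.0]])
print(clf.predict(novel), clf.predict_proba(novel))
```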
ML Needs Humans - and Vice Versa
While computers can't replace experts, they can be very useful working in complement with humans. Computers are great at recognizing variations in statistical distributions, but they can't really say what those variations mean. Humans, meanwhile, handle very little data, but they have much easier access to contextual information: they can search for material online and understand it, or they can find the right colleague to ask for help. They also have intuition, a very valuable trait that machines lack.
Case in point: we basically train systems using historical data, so the system only knows about the training data that we feed into it. When something completely new shows up, it has no way to understand what's happening. That's where humans can step up and provide the needed judgment; not only do they have more information than what's available to the system, but they're also able to reason creatively to come up with better results.
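One common way to put that division of labor into practice, sketched below with illustrative thresholds, is to automate only the confident calls and route the uncertain ones to an analyst:

```python
# A minimal human-in-the-loop sketch: act automatically on confident scores
# and queue the uncertain middle for human review. Thresholds are illustrative.
AUTO_BLOCK = 0.95
AUTO_ALLOW = 0.05

def triage(score: float) -> str:
    if score >= AUTO_BLOCK:
        return "auto-block"
    if score <= AUTO_ALLOW:
        return "auto-allow"
    return "send to analyst"  # human judgment fills the gap

for score in (0.99, 0.50, 0.02):
    print(f"{score:.2f} -> {triage(score)}")
```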
Of course, there are people who are optimistic that systems can eventually handle everything. At the other end of the spectrum are skeptics who argue that ML doesn't really solve our problems. I think the truth lies somewhere between those two poles.
I envision a future in which humans and ML work in tandem, accentuating the advantages each brings to the table, with humans making judgments after systems have examined big clusters of data.
And this wouldn't add to the load borne by humans. Just the opposite: they would have extra time to focus on what is most important and to better distinguish the threats turned up within those clusters of data.
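Here's a rough sketch of what that could look like, using scikit-learn's KMeans on synthetic alert features: the machine groups a flood of alerts into clusters, and a human reviews one representative per cluster instead of every alert:

```python
# A sketch of machine clustering plus human judgment on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# 1,500 synthetic alerts drawn from three underlying behavior patterns.
alerts = np.vstack([rng.normal(center, 0.3, size=(500, 3)) for center in (0.0, 2.0, 5.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(alerts)

# The analyst reviews one representative per cluster (the alert nearest
# each centroid) rather than all 1,500 alerts individually.
for i, center in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == i)[0]
    nearest = members[np.argmin(np.linalg.norm(alerts[members] - center, axis=1))]
    print(f"cluster {i}: {len(members)} alerts, review alert #{nearest}")
```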
Bottom line: when it comes to cybersecurity, ML is going to be more effective applied in conjunction with human experts in the loop than working on its own. That's where you can start to talk about a complete solution, and it needs humans to be fully effective.
LINKS:
- Lorenzo Cavallaro of King's College offers a great presentation on the challenge of evaluating ML in security. Other related research materials are available here. The project's homepage is https://bit.ly/31Nj3mE.
- Check out this research on building a clustering algorithm that avoids the pitfalls of numeric feature extraction, allowing experts to write arbitrary code to define similarity between data items while keeping the system scalable. The approach is also hierarchical, recognizing clusters within clusters, and supports incremental computation for data arriving in a streaming fashion.
- My Norton Labs colleagues Yufei Han and Mahmood Sharif have done important research on the topic of incomplete data. The accompanying slides are available here.
- For anyone who wants to learn more about adversarial ML, watch our colleague Principal Research Engineer Mahmood Sharif's appearance on the Hugh Thompson Show at RSA.
- A 2010 paper by Robin Sommer and Vern Paxson on the challenges of machine learning for intrusion detection: https://bit.ly/2wevtZk (slides: https://bit.ly/39u3NxC)
- Research on visualizing and interacting with machine learning results in a security operations center: https://vimeo.com/276405206 (video), https://bit.ly/3bB1zi6 (paper)