Active Learning as a powerful tool in the Cyber Security arsenal

When datasets are hard to label or highly skewed, Active Learning shows great potential to help both the algorithms and the analyst to make sense of data faster and more efficiently.

The promise of AI in cyber-security has long been that of helping humans to automate and simplify the daunting task of preventing data loss by detecting, tracking and blocking malicious software and intruders. AI is a tremendously powerful tool for such a task but, unlike what happens in other domains, gathering and labelling data to train any kind of engine/classifier is not only expensive but also hard.

Take Pinterest for a moment: you can zoom in on those shoes you like so much and the platform will show you a series of pictures in which the same shoes, or similar ones, appear. That’s useful and adds a lot of value to the experience. Clearly this is possible because there are a lot of pictures of shoes, bags, people, dogs and cats. Data is abundant and available, and that’s where Deep Learning shines.

Now take malicious software: the only truly abundant data is binary files. There are plenty of repositories from which it’s possible to download terabytes of executables, which is why many companies came out with Machine Learning-based detection of malicious binaries. Binaries are abundant and available, so it’s not too expensive to train an engine that works relatively well: this is low-hanging fruit worth investing time in. Generally speaking that’s not a bad idea, unless the binaries are encrypted or packed (VMProtect anyone?), in which case static detection fails completely, whether it’s Machine Learning-based or not.

And now, take a real attack: that’s where current Machine Learning shows its limits, not because the task is too hard but because attack data is scarce, hard to obtain, sparse and often complex to label. Fortunately, it’s hard to collect 100,000 different attacks, and by “attack” I mean the full collection of activities carried out by an attacker. Just the process of acquiring data after a breach is labour-intensive, and it’s often hard, even for a human analyst, to detect a common pattern.


The next experiment was the result of an interesting conversation I had a few weeks ago with several cyber-security experts from large and small organizations, after giving a talk on how to detect supply-chain attacks at CIFI APAC in Singapore. The discussion made it clear that Active Learning had to become part of our platform, ReaQta-Hive, to help analysts obtain faster and more accurate detections even in the absence of a large dataset of samples.

Just as a reminder: with Active Learning we train a Machine Learning algorithm in a semi-supervised fashion, allowing the engines to query the analyst for an answer when the outcome of a prediction is uncertain. In a way it’s like asking a teacher whether an action is correct when we are learning something new. In other words, it means that we can incorporate human knowledge into the learning process, allowing the engines to train accurately on a reduced dataset.

To show the power of Active Learning, we set up an experiment using real data. We took an internal database of recon & lateral-movement activities and limited it to just 1000 entries: half of them legitimate activities, the other half malicious. Of course such a database is not representative, as there are many different ways of initiating a lateral movement and even more ways of performing recon activities, so this setup effectively mirrors a real problem. We want to answer the following questions:

  • Are the classifiers capable of converging faster when using Active Learning?
  • Is the accuracy comparable to that reached by a fully trained classifier?
  • What’s the impact of (human) errors on the overall accuracy?

Let’s start with a baseline experiment and then move on from there.

Baseline Experiment

First we need to define a baseline. For this we created a simple Deep Learning classifier, trained on 75% of the dataset without overfitting; the other 25% is used for validation. This is the result.

[Figure: Baseline setup]
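As a point of reference, a baseline of this kind can be sketched in a few lines. The sketch below is only an illustration and not the actual ReaQta-Hive model: the feature extraction, the layer sizes and the use of Keras are all assumptions, and the reported numbers come from 5-fold cross-validation rather than the single split used here for brevity.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Placeholder data: feature vectors extracted from recon/lateral-movement events,
# with labels 0 = legitimate and 1 = malicious. The real features are not public.
X = np.random.rand(1000, 32).astype("float32")
y = np.random.randint(0, 2, size=1000)

# 75% for training, 25% for validation, as in the baseline experiment.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def build_model(n_features):
    """A small dense classifier with a softmax output, one probability per class."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(X.shape[1])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30, batch_size=32)
```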

The accuracy (K-fold=5) is 93.8%, which is not bad for such a small dataset, and we observe no overfitting. We then trained the same classifier on just 15% of the whole dataset.

[Figure: Training on a fraction of the dataset]

The accuracy (K-fold=5) is 86.9%, with a considerable amount of overfitting showing up already at the 14th epoch. This shows that the data is clearly not enough to properly train the classifier; still, the test accuracy is not bad at all, meaning that the chosen features are informative and that the boundary between malicious and normal activities is potentially not as blurry as we initially thought. And now let’s get to the exciting part: Active Learning.

Active Learning Experiment

In the next experiment we trained the classifier on 15% of the dataset (only 150 unique labels) for 12 epochs, stopping before overfitting kicks in, and then asked the classifier to make a prediction on a small batch of unlabelled data. The classifier uses softmax, so it returns a probability for each class (malicious and non-malicious), and we give it the option of querying for a label on the most uncertain predictions. In simple words: when the separation between the two classes isn’t clear, the classifier can ask the analyst to provide a label for the event. After a few labels were collected, the classifier was retrained on this new data for a limited number of runs and the cycle repeated. The results were unexpectedly positive:
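A minimal sketch of that query loop is shown below. It uses margin-based uncertainty sampling on the softmax output and assumes the Keras-style model from the baseline sketch; the batch size, the number of queries per round and the retraining epochs are illustrative rather than the exact values used in the experiment.

```python
import numpy as np

def most_uncertain(model, X_pool, n_queries=15):
    """Indices of the pool samples whose softmax output is closest to 50/50."""
    probs = model.predict(X_pool, verbose=0)       # shape (n_samples, 2)
    margin = np.abs(probs[:, 1] - probs[:, 0])     # small margin = uncertain
    return np.argsort(margin)[:n_queries]

def active_learning_round(model, X_lab, y_lab, X_pool, y_pool, ask_analyst,
                          n_queries=15, epochs=3):
    """One query-and-retrain cycle.

    `ask_analyst` stands in for the human: in a simulation it receives the hidden
    ground-truth label of the queried sample and returns the analyst's answer.
    """
    idx = most_uncertain(model, X_pool, n_queries)
    answers = np.array([ask_analyst(t) for t in y_pool[idx]])

    # Move the newly labelled samples from the pool to the labelled set.
    X_lab = np.concatenate([X_lab, X_pool[idx]])
    y_lab = np.concatenate([y_lab, answers])
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx, axis=0)

    # Retrain for a limited number of epochs, then the cycle repeats.
    model.fit(X_lab, y_lab, epochs=epochs, batch_size=32, verbose=0)
    return model, X_lab, y_lab, X_pool, y_pool
```

With a perfect analyst, `ask_analyst` simply returns the true label (`lambda t: t`); the error-rate experiments further down swap in a less reliable oracle.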

[Figure: Active Learning Approach vs Traditional Learning]

The Active Learning approach achieves better performance than the traditional one in just 6 runs. What’s more interesting is that the Active Learning approach with only 225 labels (the 150 initial ones plus 75 queried to the analyst) reaches the same accuracy as the normal training with 750 labels, so the Active Learning training converges 3x faster than the normal one. The overall accuracy also ends up higher than with the normal approach, reaching 98.6%; this is already a result worth exploring.

[Figure: Accuracy vs Number of Labels requested]

Dealing with analyst errors: 5% error rate

What about errors? The analyst can make mistakes when the system asks for the label of a sample: this is the same experiment run with an analyst error rate of 5%.
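A test like this can be reproduced by simulating the analyst as a noisy oracle that flips the true label with a fixed probability; the 0.05 below corresponds to the 5% case and can be raised to 0.20 for the experiment that follows. This is just one plausible way to model the mistakes, not necessarily how the original experiment implemented them.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_analyst(true_label, error_rate=0.05):
    """Return the correct label, except with probability `error_rate` flip it."""
    if rng.random() < error_rate:
        return 1 - true_label   # binary labels: 0 = legitimate, 1 = malicious
    return true_label

# Plugged into the query loop sketched earlier:
# model, X_lab, y_lab, X_pool, y_pool = active_learning_round(
#     model, X_lab, y_lab, X_pool, y_pool, ask_analyst=noisy_analyst)
```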

[Figure: Accuracy with 5% error rate from the analyst]

Accuracy is on par with the normal training and overall the system shows good resilience against misclassifications.

Dealing with analyst errors: 20% error rate

This is what happens with an error rate of 20%, so the analyst mislabels one sample out of five:

[Figure: Accuracy with 20% error rate from the analyst]

The classifier struggles a bit compared to the normal approach, but it eventually manages to reach a fairly high accuracy anyway.

Let’s answer our Active Learning questions!

We can now answer the questions raised at the beginning:

  • Are the classifiers capable of converging faster when using Active Learning?
  • Definitely yes: the human knowledge provided helps to simplify the overall task, so the classifier learns faster than with a traditional approach.
  • Is the accuracy comparable to that reached by a fully trained classifier?
  • Surprisingly, at least in this experiment, it turns out to be even better than the traditional approach. This is because hard samples are requested more often, so the decision boundary is nudged and adjusted precisely where the separation is most difficult.
  • What’s the impact of (human) errors on the overall accuracy?
  • Ultimately the classifiers react well to mistakes, to the point that even a large labelling error rate (20%) still leads to decent performance over a longer training period.

Conclusions

Active learning has a lot of potential in situations where data is skewed, scarce or expensive to label. It’s interesting to note that skewed datasets appear to benefit the most from active learning: the classifiers end up asking more often for scarce samples than for abundant ones, thus reducing the impact of unbalanced labels, provided that the training sets are resampled to avoid overfitting on those same labels.
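One straightforward way to do that resampling, assuming scikit-learn is available, is to oversample the minority class in the labelled set before each retraining round; this is only a sketch of the idea, not how ReaQta-Hive handles it internally.

```python
import numpy as np
from sklearn.utils import resample

def rebalance(X_lab, y_lab, random_state=0):
    """Oversample the minority class so both classes contribute equally to training."""
    minority = 0 if (y_lab == 0).sum() < (y_lab == 1).sum() else 1
    X_min, y_min = X_lab[y_lab == minority], y_lab[y_lab == minority]
    X_maj, y_maj = X_lab[y_lab != minority], y_lab[y_lab != minority]
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=random_state)
    return np.concatenate([X_maj, X_up]), np.concatenate([y_maj, y_up])
```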

Active learning adds value even in those contexts where data is abundant, in the form of accelerated learning, shorter convergence time and higher accuracy, meaning that the classifiers can become operational faster, while being less prone to false positives.

Many interesting questions remain open, and they will be part of future experiments. For instance, it would be interesting to understand whether labels found to be difficult by the classifiers are also difficult for human analysts. Profiling the analysts over time could also be used to model a downstream classifier that contributes to the learning process: for instance, higher weights can be given to labels provided by an analyst who turns out to be particularly good at detecting a certain behaviour, while lower weights are given to the answers of analysts who perform better on different identification tasks.
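To make the idea concrete, a very rough sketch of such a downstream weighting is shown below; the analyst identifiers, the reliability scores and the idea of passing them as per-sample training weights are all hypothetical and would need to be estimated from the analysts' track records.

```python
import numpy as np

# Hypothetical reliability estimates for a given behaviour category, e.g. derived
# from each analyst's past agreement with confirmed incident outcomes.
analyst_reliability = {"analyst_a": 0.95, "analyst_b": 0.70}

def label_weights(analyst_ids, default=0.5):
    """Map each queried label to a training weight based on who provided it."""
    return np.array([analyst_reliability.get(a, default) for a in analyst_ids])

# Keras-style models accept per-sample weights at training time, e.g.:
# model.fit(X_lab, y_lab, sample_weight=label_weights(label_sources), epochs=3)
```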

This is an exciting area of development that, properly tuned, can become a fundamental part of the interaction between humans and machines in a SOC or Security Team.
