Perspectives on Bayesian Statistics and Alert Fatigue

Most people interested in Security Informatics will have come across Bayesian (or Base Rate Fallacy) analysis of intrusion log data. The seminal work is the paper by Stefan Axelsson; his published work is available here. What follows is a very brief recap example. The actual values are not that critical: the aim is mainly to introduce and highlight concepts that may not be clear from examples around the web.

Assume:
A = an intrusion
B = an intrusion alert (e.g. an alert from your IPS)

Bayes' theorem tells us:

P(A|B) = P(B|A) · P(A) / P(B)

Translated to English, this means that the probability of an intrusion given an alert, P(A|B), depends on the prior probability of intrusions scaled by the accuracy of the detection tool. The accuracy of the tool is established empirically, by you or by the vendor: it covers how often the tool flags true positives (its ability to flag an intrusion record as such) and how often it flags a non-intrusion record as an intrusion (the false positive rate).

Examples of this calculation relating to disease detection are ubiquitous on the web. Usually there is some rare disease under consideration; the clinical test is, say, 99% accurate at finding the disease when it is actually present (the true positive rate), and it occasionally reports the disease when it is not present, say 0.5% of the time (the false positive rate).

Put simply, if we believe we know the background incidence of intrusion P(A) (perhaps from experimentation or previous experience), we can calculate the expectation of an intrusion given an alert. We must also know the accuracy of the detector, i.e. the true positive rate and the false positive rate. We probably have reason to believe the device accuracy is good: somewhere around 99% true positives and 1% false positives (otherwise, how do you explain to management why you bought and use it?).

A point worth keeping in mind: in P(A|B), ‘A’ is the hypothesis and ‘B’ is the fact at hand–we know we have an alert, but is it an intrusion? P(A|B) is the probability that it is.  The Bayesian approach is to leverage what you know to analyze something you don’t know.

A concrete example: if we log 10,000,000 records a day, and each record is examined by an intrusion detector, what can we project about how the rate of false positives impacts our intrusion hunting? We need the three items discussed: the rate of actual intrusion and the two facets of device accuracy. Accurately setting the incidence of intrusion for all contexts is likely far more problematic than most articles on this subject maintain; let's come back to that later. For now, let's say we know:

  1. background incidence of intrusion is .005% of discrete logged records (i.e. P(A) = .00005, or about 500 intrusion-related records out of the 10,000,000 logged per day)
  2. detector accuracy of true positives is 99%
  3. detector rate of false positives is .01 – that is, 1% of non-intrusion records will incorrectly trigger an alert

This gives us an incidence rate per record of P(A) = .00005 (5 in 100,000).

Switching to an alternative but equivalent form of Bayes' theorem that is convenient for calculation:

P(A|B) = P(B|A) · P(A) / [ P(B|A) · P(A) + P(B|~A) · P(~A) ]

Filling in the values from what we know:

  • P(A) = .00005
  • P(B|A) = .99 (99% of the time if there is a true intrusion record, it will be flagged)
  • P(B|~A) = .01 (the established rate of detector false positives; note this rate is independent of the 99% true positive rate – they don't need to add to 100%)
  • P(~A) = 1 – P(A) = .99995 (by definition)

This gives us:

P(A|B) = (.99 × .00005) / (.99 × .00005 + .01 × .99995) ≈ .0049
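
A minimal Python check of that arithmetic, using nothing but the assumptions listed above:

# posterior probability of an intrusion given an alert, via Bayes' theorem
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    # P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|~A)P(~A)]
    p_not_a = 1.0 - p_a
    return (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * p_not_a)

print(posterior(0.00005, 0.99, 0.01))   # ~0.0049: roughly 9,951 of every 10,000 alerts are false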

Base Rate Fallacy

This means that 9951 times out of 10,000, the alert you track will be a false positive. The revelation of Bayes is that you would not have guessed this from the accuracy assumptions we made. If a system is 99% accurate at detecting intrusions, we might have guessed that almost all of the alerts would be intrusions. This misplaced intuition is what is termed the Base Rate Fallacy. There is an eye-opening example on Wikipedia framed in terms of breathalyzers and drunk driving.

Why does this phenomenon occur? It occurs when an incidence is rare enough that almost any non-zero rate of false positive measurement overwhelms the true positives. Think of it this way: what if there were a single person in the entire world who had condition X? Suppose you have a test for condition X and you tested everyone in the world, and suppose further that your test was absolutely guaranteed to turn up positive if someone had condition X (a 100% true positive rate, with no missed positives). Now, what if your test had a .0001% false positive rate? That would be an incredibly accurate test. Yet the result would be that roughly 7,000 people test positive for condition X while only one person actually has it. The results are overwhelmed by false positives.

Now, back to IT: was this a useful exercise with respect to how we actually log, correlate, alert on and analyze logs? Not all logs are equal: some come from DMZ systems where there might be a very high incidence of attacks; others may come from our web servers and merely record who is visiting which links. If you leave a switch in the DMZ open to SSH, you are likely to log hundreds of brute-force password attempts against it. That activity is not likely to be lost in a sea of false positives, and it is not a context where a .00005 per-record incidence of intrusion holds true. It is probably not a needle in a haystack: there is likely some visual or tabular summary of events designed to make such an alert stand out.

Possible takeaways include:

  • Reinforce that reducing false positives to near zero is critical in environments where the incidence of actual intrusion is very low and the quantity of records is astronomically high.
  • Not all logs are equal – weighting can be critical.
  • Maybe it's not even possible to reduce false positives in some environments down to an alert-actionable status. Such instances require more anecdotal methods and creative visualizations to aid hunting teams.

Event Grouping

Raw alerts may not be that fruitful to sift through without a winnowing strategy, but what if one can arrange alerts in the context of other interesting log events? Hundreds of login attempts from a source IP might be interesting on their own – the system should have a countermeasure against rapid brute-force attacks. What would be even more interesting, though, is if a login attempt from that IP ever ended up working. Teaching a system to bubble up events as patterns of attack emerge is the real answer to the Base Rate Fallacy and alert fatigue.
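
As a rough illustration of that grouping idea, a sketch along these lines could bucket authentication events by source IP and surface only those sources where a run of failures was followed by a success. The event tuple format and the threshold are assumptions made for the example:

from collections import defaultdict

# hypothetical event tuples: (timestamp, source_ip, outcome), outcome is 'fail' or 'success'
def brute_force_successes(events, fail_threshold=100):
    # group failures by source IP and surface sources whose failures preceded a success
    failures = defaultdict(int)
    flagged = set()
    for ts, src, outcome in sorted(events):      # chronological order
        if outcome == 'fail':
            failures[src] += 1
        elif outcome == 'success' and failures[src] >= fail_threshold:
            flagged.add(src)                     # a brute-force run that appears to have worked
    return flagged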


Relationship Visualization

Network Science is, roughly speaking, the application of traditional Graph Theory to 'real' or empirical data, as opposed to mathematical abstractions. Modern Network Science courses, versus older Graph Theory courses, describe techniques for the analysis of large, complex 'real world' networks such as social networks. The topics tend to be mathematically challenging, including community detection, centrality, and Scale-Free versus Random modeling with the associated probability distributions underlying the degree structure. In short, if you have a lot of interconnected data or logs, Network Science can likely help you organize and understand it.

Simple Visualization

For starters, complex network graphs often lend themselves to abstract relationship visualization of qualities not otherwise apparent. We can think of two categories of visualization: explicit attributes, versus more subtle attributes inferred from algorithmic analysis, such as community detection run on what is perceived as a 'lower' or secondary quality. This second category could be based on machine learning, but there is no use in getting lost in marketing terms.

Most security professionals use explicit visualizations throughout the day and would likely be lost without them – for example, a chart of some event occurrence against time. If you use the Splunk timeline to pinpoint spikes in failed logins, you are using a data visualization to spot and explore potential attacks. Splunk is doing a large amount of work behind the scenes to present this, but it is still a simple representation against a time series, and the relationship was always readily apparent in conceptual terms: incredibly useful, but essentially an implicit use of a standard deviation to note trends.
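
That implicit calculation can be written out in a few lines. The sketch below flags time buckets whose failed-login count sits well above the series' own mean and standard deviation; the hourly counts are invented for illustration and nothing here comes from Splunk itself:

from statistics import mean, stdev

def spike_buckets(counts, threshold=2.0):
    # return indexes of time buckets whose count exceeds mean + threshold * stdev
    mu, sigma = mean(counts), stdev(counts)
    return [i for i, c in enumerate(counts) if sigma and c > mu + threshold * sigma]

# e.g. hourly failed-login counts for part of a day
print(spike_buckets([12, 9, 11, 14, 10, 240, 13, 8]))   # -> [5]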

Another example is the ubiquitous use of Geo-IP data. Many organizations like to display the appearance and disappearance of connections in geographic terms: this makes for a nice world map at the SOC. Everyone can collectively gasp in the unlikely event that North Korea pops up. In reality, North Korean hackers are likely off hijacking satellite internet connections to launch their attacks, since a source IP in Pyongyang is not all that discreet. Hence, discovering more subtle correlations may be warranted.

The deviation in this case consists of visualizing IP traffic from 'suspicious' sources not normally seen; this geographic profiling is a valid tool in a threat hunter's arsenal. The more interesting question, though, is what more subtle qualities we can layer beneath that geographic profiling to glean more useful results. Are there ways we could associate the traffic with a satellite-service IP, or with a pattern that leads us to look at other related domain registrations and cross-reference them against our traffic? If we find some type of association, can we find even more subtle attributes in a like community: for example, is there a pattern or idiosyncrasy of domain registration/registrars that an algorithm could uncover through community detection (use of free registrars, patterns in registration names, contact details, timing, etc.)?
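
To make that last question a bit more concrete, here is a small sketch using NetworkX's community detection. The registration records and the attribute-sharing rule are invented for illustration; the idea is simply that domains sharing a registrar or contact address end up grouped together:

import networkx as nx
from networkx.algorithms import community

# invented registration records for illustration: (domain, registrar, contact e-mail)
records = [
    ('a-cdn-files.top',      'freereg',      'ops@mailinator.test'),
    ('a-cdn-paths.top',      'freereg',      'ops@mailinator.test'),
    ('login-update.xyz',     'freereg',      'ops@mailinator.test'),
    ('corp-portal.example',  'bigregistrar', 'hostmaster@corp.example'),
    ('corp-mail.example',    'bigregistrar', 'hostmaster@corp.example'),
]

# link any two domains that share a registrar or a contact address
g = nx.Graph()
for i, (d1, r1, e1) in enumerate(records):
    g.add_node(d1)
    for d2, r2, e2 in records[i + 1:]:
        if r1 == r2 or e1 == e2:
            g.add_edge(d1, d2)

# community detection groups domains with suspiciously similar registration habits
for grp in community.greedy_modularity_communities(g):
    print(sorted(grp))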

This is a potentially rich area of research. Going forward it will be interesting to study schemes for enriching data (e.g. essentially tying graph nodes and edges to JSON documents with meta-information). For now, the exercise will be a simple demonstration of applying Threat Intelligence to graphing.

Threat Intelligence can lead to more useful Simple Visualizations

One way to glean useful insights involves comparing traffic and connection logs with malware feeds, new-domain lists, or other lists of sites with low reputation. The results can be processed as a text list for review, but a graph depicting where an internal 'victim' address has touched, along with internal connections to the victim, is more interesting and potentially more helpful to a hunting team. The novel aspect of a graph visualization is the potential to view paths in detail.

A starter kit for reputation-based path visualizations might include:

  1. a threats or malware feed
  2. a connections log
  3. Python code parsing the above and saving results to a Networkx module graph
  4. Gephi for visualization

Suppose we have a threats feed with malware addresses (for simplicity let's use IPs) and a simple log of connections: the threats feed can be a list of IPs and the connections log a sequence of [Sender-IP, S-port, Receiver-IP, R-port, Transport, Protocol] records. The objective is to leverage simple visualizations to gauge exposure. A very simple example of visualizing a compromised internal address, 10.3.2.2, is shown below.

[Figure: graph visualization of connections around the compromised internal address 10.3.2.2]

We start by developing some code using the Python Networkx module to evaluate traffic logs and threat feeds and come up with intersections. Connections with bad actors get colored ‘red’. Conversations with outside hosts in general might be ‘blue’, and internal hosts might be ‘green’.
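
Before the coloring step, the inputs might be read into a graph along these lines. This is a hypothetical sketch: the file layout and field order are assumed from the description above, and the actual program on GitHub may differ:

import csv
import networkx as nx

# assumed inputs: a threat feed with one bad IP per line, and a CSV connections log with
# rows of [sender_ip, s_port, receiver_ip, r_port, transport, protocol]
def build_graph(conn_path, threat_path):
    with open(threat_path) as f:
        t_lst = [line.strip() for line in f if line.strip()]

    g = nx.Graph()
    with open(conn_path) as f:
        for sender, s_port, receiver, r_port, transport, proto in csv.reader(f):
            g.add_edge(sender, receiver, transport=transport, protocol=proto)

    u_lst = list(g.nodes())          # unique endpoint IPs seen in the log
    return g, u_lst, t_lst

The returned values correspond roughly to the g, u_lst and t_lst arguments used by the coloring routine below, and the node list can also serve as the endpoint list it colors.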

import re

##################################################
# name: color_threats(g, u_lst, t_lst, e_lst)
# desc: add a color attribute to designate threat nodes
# input: nx.Graph g, list u_lst, list t_lst, list e_lst
# output: na
# notes: the unique-IP list, threats list and endpoint-IP list
#        are used to color connected nodes
##################################################
def color_threats(g, u_lst, t_lst, e_lst):
    # addresses starting 192.168. or 10. are treated as internal
    internal = re.compile(r'^192\.168\.|^10\.')
    for j in e_lst:
        if not internal.match(j):
            # external hosts are blue
            g.nodes[j]['viz'] = {'color': {'r': 0, 'g': 0, 'b': 255, 'a': 0}}
        else:
            # internal hosts are green
            g.nodes[j]['viz'] = {'color': {'r': 34, 'g': 139, 'b': 34, 'a': 0}}

    # color the malware nodes red: unique IPs that also appear in the threat feed
    risk_nodes = list(set(u_lst).intersection(t_lst))
    for i in risk_nodes:
        g.nodes[i]['viz'] = {'color': {'r': 255, 'g': 0, 'b': 0, 'a': 0}}
The whole program is available on GitHub here. Below is a depiction of a malware connection on a busy network. Depending on how you lay them out, the visualizations can be large, hence this is an excerpt.

Layout on the fly (in one step) remains an interesting issue, as the Python code and the Gephi visualizer are not entirely integrated. A finished code base would ideally provide daily automated visualizations that one could click into to bring up details such as protocols, data volume, etc.
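
One step that does already work smoothly is handing the colored graph to Gephi by writing GEXF, which preserves the 'viz' color attributes set above (the file name is a placeholder and g is the graph built earlier):

import networkx as nx

# write the colored graph out for Gephi; the GEXF writer keeps the 'viz' colors
nx.write_gexf(g, 'daily_threat_graph.gexf')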

Looking to Network Science

Simple graph visualizations might end up being very useful in some contexts, in place of trying to understand many lines of summarized logging. However, the strength of Network Science lies in using algorithmic analysis of very large data sets to bring to light things that are not quickly apparent. A typical large complex network simply looks like a blob when brought up graphically. Injecting algorithmic interpretations of centrality and community detection, based on attribute information embedded in the edges and nodes, can lead to visualizations that provide breakthrough insights.
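
As a rough sketch of what injecting those interpretations can mean in practice (NetworkX 2.x assumed; the attribute names are arbitrary), one might annotate each node with a centrality score and a community id and let the visualizer map them to size and color:

import networkx as nx
from networkx.algorithms import community

# annotate a graph so a visualizer can size nodes by importance and color by community
def annotate_for_viz(g):
    # betweenness centrality: nodes that sit on many shortest paths
    nx.set_node_attributes(g, nx.betweenness_centrality(g), 'centrality')
    # greedy modularity communities: coarse grouping of densely connected nodes
    for idx, members in enumerate(community.greedy_modularity_communities(g)):
        for n in members:
            g.nodes[n]['community'] = idx
    return g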