What’s the sound of one hand clapping? Using only ELK as your threat hunting strategy
So here goes...
I used to be a big advocate for Elasticsearch. But the more I use different tools for different jobs, the more I find that a tool can be brilliant at certain things and still, on its own, fall short of what I actually need.
Data Science is hard. The good news, though, is that you can use the same dataset for several different kinds of Data Science. For example: Marketing can use data from web servers to spot trends and promote or demote products based on those trends. It’s very useful. In Cyber Security, an analyst can use the exact same data in the opposite way: looking for anomalies and potentially finding threats and outliers.
That’s what Threat Hunting is about. A threat isn’t just something in a top 10 list, which is what we’re traditionally used to seeing from SIEM solutions. The standard SIEM model looks for noisy traffic like Brute Force attacks. That model is only good at detecting automated tools, or kids who don’t know what they’re doing and are probably just following a random YouTube video.
So, here’s the thing. Elasticsearch is great at searching huge swathes of data very quickly, but every blog post or article I’ve seen written about threat hunting to date follows one, or a combination, of these three approaches:
- Using manual search queries to find known unknowns
- Using pie charts and graphs to find the top xx or bottom xx events
- Using Elasticsearch’s native Machine Learning to find peaks and troughs
How is any of that good at finding “unknown unknowns” or low and slow attacks?
Kibana is a very powerful dashboard for running complex search queries over large data sets, but your query needs to be specific to what you’re looking for. If you don’t know what you’re looking for, how will you find it with Kibana? Of course, you could just search for “*”, but then you’d have thousands, millions, billions, or even trillions of events to review manually.
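To make that concrete, here’s a minimal sketch using the Python Elasticsearch client (the 8.x-style search API is assumed, and the index and field names are invented for illustration). A specific query answers a question you already know to ask; the match-all query just hands you the haystack:

```python
# Minimal sketch with the elasticsearch Python client (8.x-style API).
# The index name "auth-logs" and field names are invented for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A "known unknown" hunt: you already know what question to ask.
known = es.search(
    index="auth-logs",
    query={"bool": {"must": [
        {"match": {"event.action": "sudo"}},
        {"range": {"@timestamp": {"gte": "now-24h"}}},
    ]}},
)

# The alternative when you don't know what to ask: match everything.
# On a busy estate this hands you millions of events with no way to triage them.
everything = es.search(index="auth-logs", query={"match_all": {}})

print(known["hits"]["total"], everything["hits"]["total"])
```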
So, you can summarise all of that with graphs and pie charts, and you can have a fancy dashboard with maps. Let’s talk about the maps first. What do they even mean in a security context? Sure, world maps of users are great for understanding marketing demographics, but for Threat Hunting? Really? Is that highly skilled criminal really coming from China or Russia, or are they just using a compromised machine there? Are all connections from the USA generally safe because that’s where your office is? No. And by blocking China, for instance, you’re blocking almost a fifth of the world’s population. Does that always make good business sense for a global company?
Your top 10 info isn’t very insightful either. Let’s say you’re looking at the top 10 sudo commands: how do you spot something like a single fork bomb in there? So you look at the bottom 10 instead, and now you’re finding commands like “reboot”, “exit”, etc., and you begin to ignore them due to console burn-in.
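Here’s a toy illustration of why. The counts are invented, but the shape of the problem is real: a one-off fork bomb never surfaces in the top 10, and in the bottom 10 it sits buried among exactly the benign noise you’ve learned to tune out:

```python
# Toy example: top-N and bottom-N views both hide a one-off fork bomb.
# All counts here are invented for illustration.
from collections import Counter

commands = Counter({
    "ls": 9120, "cd": 8450, "cat": 6300, "vim": 5100, "grep": 4800,
    "systemctl restart nginx": 3900, "tail": 3500, "ps": 3200,
    "ssh": 2900, "rsync": 2700, "reboot": 4, "exit": 3,
    ":(){ :|:& };:": 1,  # the fork bomb: a single occurrence
})

top10 = commands.most_common(10)         # the fork bomb never appears here
bottom10 = commands.most_common()[-10:]  # it's here, lost among "reboot" and "exit"

print([cmd for cmd, _ in top10])
print([cmd for cmd, _ in bottom10])
```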
“Aha,” you say, “but I use Elasticsearch’s Machine Learning.” Good for you. It’s great for finding trends. You might find things like DoS attacks, Brute Force attacks, or even botnets. Welcome to 1998! Elasticsearch’s Machine Learning is a great tool for marketing and trending, and it’s even marketed as a security tool that can augment products like ArcSight. But remember: unlike in Marketing, in Threat Hunting you’re not just looking for trends, you’re looking for anomalies.
Elasticsearch’s Machine Learning runs on summary data, which makes it very efficient and cost-effective at what it does. It doesn’t need to re-read petabytes of data every time it runs a query, because it only looks at a summary that may be reduced to mere gigabytes of the actual raw data. That’s a performance advantage, but it comes at the expense of granularity.
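A rough pandas sketch of that trade-off, with invented timestamps and volumes: three logins from an unexpected account vanish into the hourly counts a summary-driven ML job would consume, and only the raw events still tell you who was behind them:

```python
# Sketch of the summarisation trade-off. Timestamps, account names, and
# volumes are all invented for illustration.
import pandas as pd

# Two hours of routine service-account activity, one event every 6 seconds.
routine = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 00:00", periods=1200, freq="6s"),
    "user": ["svc_backup"] * 1200,
})
# A low-and-slow anomaly: a different account, three events in the first hour.
anomaly = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:12", "2024-01-01 00:27",
                                 "2024-01-01 00:48"]),
    "user": ["jdoe"] * 3,
})
raw = pd.concat([routine, anomaly]).sort_values("timestamp")

# The hourly summary: 603 events vs. 600, a bump lost in normal jitter.
print(raw.set_index("timestamp").resample("1h")["user"].count())

# Only the raw events still show *who* caused the bump.
print(raw[raw["user"] == "jdoe"])
```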
How do we do it properly then?
Elasticsearch is great within its own context. At Knogin, we use it as one component among a bunch of other tools within a larger scope. The goal for Threat Hunters is to find anomalies in behaviours through the collection of logs. Those behaviours could belong to people, applications, machines, or subnets, for instance.
Here’s our basic high-level stack. There’s more to it than this, but these define the basics:
- NiFi
- Kafka
- Storm
- Hadoop and Spark
- Elasticsearch
What does our stack do?
We find that Logstash has limits on the number of events per second it can handle. We don’t just want to handle millions of EPS; we also want to send the same events back through the pipeline for processing multiple times. Logstash can’t do that, so we binned Logstash.
The first job for NiFi is to accept events from the agent deployed on the client’s side. It does multiple things with each event: one is to store the raw event in Hadoop; another is to parse and batch events for shipping to Kafka. When Kafka receives the batched events, it sends them on to our parsing cluster within Storm.
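For a feel of what that batch-and-ship step involves, here’s a rough Python sketch using the kafka-python client. In our stack NiFi does this with its own processors rather than code like this, and the broker address, topic name, and batch settings below are just placeholders:

```python
# Rough sketch of batching events into Kafka. The broker address, topic
# name, and batch settings are placeholders, not production values.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    batch_size=64 * 1024,  # accumulate events into 64 KB batches
    linger_ms=50,          # wait up to 50 ms to fill a batch before sending
)

def ship(raw_event: dict) -> None:
    # The raw copy would also be written to Hadoop at this point (not shown).
    producer.send("events.raw", raw_event)

ship({"host": "web-01", "message": "sshd: Accepted password for jdoe"})
producer.flush()
```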
Storm parses the data and then sends it back to Kafka. Kafka sees that these events are further along in the process than the first time it received them, so it sends them to our Enrichment cluster within Storm. This is where the data gets all of its extra telemetry added, such as Threat Intel, geo info, and any other info that we or the customer want to attach (e.g. data/asset classification, region, tags, etc.).
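Conceptually, that enrichment pass looks something like the sketch below. In reality it runs inside a Storm topology; the threat-intel list, geo table, and asset tags here are hypothetical stand-ins for real feeds:

```python
# Conceptual sketch of the enrichment step. The threat-intel set, geo
# table, and asset tags are hypothetical stand-ins for real feeds.
THREAT_INTEL = {"203.0.113.7", "198.51.100.42"}   # example known-bad IPs
GEO = {"203.0.113.7": "CN", "192.0.2.10": "US"}   # example IP -> country map
ASSET_TAGS = {"web-01": {"region": "eu-west", "classification": "public"}}

def enrich(event: dict) -> dict:
    src = event.get("src_ip", "")
    event["threat_intel_hit"] = src in THREAT_INTEL
    event["geo_country"] = GEO.get(src, "unknown")
    event.update(ASSET_TAGS.get(event.get("host", ""), {}))
    return event

enriched = enrich({"host": "web-01", "src_ip": "203.0.113.7",
                   "message": "GET /admin"})
print(enriched)  # the enriched copy heads back to Kafka for indexing
```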
Now the data is sent back to Kafka again. Kafka gets the bigger event, again knows which part of the process the event is in, and forwards it to the Storm Indexing cluster, where Storm prepares the data to be indexed in Elasticsearch and, once more, in Hadoop.
Storm passes the data one last time to Kafka, from which it’s finally indexed into Elasticsearch and Hadoop.
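As a hedged sketch of that last hop, here’s the same idea written as a plain Python consumer that streams enriched events into Elasticsearch in bulk. In our stack the Storm indexing cluster does the preparation; the topic, broker, and index names below are placeholders:

```python
# Sketch of the final hop: consume enriched events from Kafka and stream
# them into Elasticsearch in bulk. Topic, broker, and index names are
# placeholders. The raw copy in Hadoop stays untouched as evidence.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

consumer = KafkaConsumer(
    "events.enriched",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

def actions():
    for msg in consumer:  # blocks and runs continuously
        yield {"_index": "events-enriched", "_source": msg.value}

for ok, item in helpers.streaming_bulk(es, actions()):
    if not ok:
        print("indexing failed:", item)
```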
What’s the biggest problem with SIEM solutions today? The follow-up!
So, you’ve gotten to the end of your threat hunt and you’ve found the criminal: it’s an employee. Now you want to prosecute that person for selling your customer data to a third party, but Elasticsearch has automatically enriched the data by adding geo data. Bummer! Now your evidence is inadmissible because it’s been tampered with.
So, at Knogin, we also use Hadoop for two reasons:
- You can use it to store unmodified raw data very cost-effectively, so when it does come time for court you have a proper, unmodified, original evidence store (you have to guarantee C.I.A. too, but that’s for another post).
- You can let Spark access the read-only raw data.
Since we’ve already covered the first point, Chain of Custody, let’s now look at the second.
Elasticsearch uses summary data in its machine learning. This isn’t necessarily bad, because Data Science requires working on a clean, consistent dataset. So, in theory, using summary data should be fine for trending information, and it is. But we’re not just looking for trends; we want to find specific anomalies. For instance: let’s say your standard ML algorithm is looking for trends in your sudo commands, and you find that in one day someone has increased the number of commands typed across all of your servers combined. It’s a peak in the trend, and it could indicate lateral movement across the network.
But how do you know it’s lateral movement?
Maybe someone was performing a server migration at the time. It’s easy to see what the commands were in Elasticsearch, but what if, during the server migration, something was left open to the world and the person responsible missed it? All you know is that there was a peak of commands during a data migration. The bulk of them seem fine, but you’re missing a ton of other data within that pool that you don’t have at a granular level, the way you would with raw events.
This is where Spark with Hadoop can really help. By looking at the raw data, it can find the anomaly within the anomaly.
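Here’s a hedged PySpark sketch of that idea: pull the raw sudo events for the window the trend flagged out of HDFS, then surface the commands that are rare for each user instead of staring at the aggregate peak. The HDFS path, field names, time window, and rarity threshold are all assumptions for illustration:

```python
# PySpark sketch: find "the anomaly within the anomaly" in the raw events.
# The HDFS path, field names, window, and threshold are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("anomaly-within-anomaly").getOrCreate()

# The unmodified raw evidence store written earlier in the pipeline.
raw = spark.read.json("hdfs:///events/raw/")

# Narrow to the window the trend flagged (the "server migration" day).
window = raw.filter(
    (F.col("event_type") == "sudo")
    & (F.col("timestamp").between("2024-01-01T00:00:00", "2024-01-02T00:00:00"))
)

# Most of the peak is routine migration work; keep only the commands a
# given user ran once or twice in the whole window.
rare = (
    window.groupBy("user", "command")
    .count()
    .filter(F.col("count") <= 2)
    .orderBy("count")
)
rare.show(truncate=False)
```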
Elasticsearch is a good component for Threat Hunting, but it shouldn’t be relied upon as the only means of finding threats within your environment. Other types of automation can help point you to the unknown unknowns by performing behavioural analytics.
The real way to detect threats is by monitoring behaviours and profiling. Standard signature-based correlation and Big Data alone are just not good enough any more. Machine Learning really needs to be the standard, not the exception.