Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften
In the first post on edit distance, we looked at hunting for malicious executables with edit distance (i.e., how many character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how edit distance features can be combined with other domain features to pinpoint suspicious activity.
Background to this Case Study
What are bad actors doing with malicious domains? It may be as simple as using a close misspelling of a popular domain name to trick careless users into viewing ads or picking up adware. Legitimate sites are slowly catching on to this technique, sometimes called typo-squatting.
Other malicious domains are the product of domain generation algorithms, which can be used for all kinds of nefarious purposes, such as evading countermeasures that block known compromised sites, or overwhelming domain servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.
Edit distance can help with both use cases; here is how. First, we exclude common domain names, since these are generally safe. A list of common domain names also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domain names and avoid subdomains (e.g. ziften.com, not www.ziften.com).
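The cleaning step above might look something like the following Python sketch. This is not the code used for the results in this post: the function names are made up for illustration, and keeping only the last two labels is a simplifying assumption that breaks on multi-part suffixes such as co.uk (a real pipeline would consult the Public Suffix List).

```python
def registered_domain(hostname: str) -> str:
    """Naively reduce a hostname to its last two labels.

    Assumption: the suffix is a single label (.com, .org, ...).
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def clean_candidates(observed, popular):
    """Strip subdomains, drop known-popular domains, and de-duplicate.

    `observed` is an iterable of hostnames seen in the wild;
    `popular` is the baseline list of common domains.
    Order of first appearance is preserved.
    """
    popular_set = {registered_domain(d) for d in popular}
    seen = set()
    candidates = []
    for host in observed:
        domain = registered_domain(host)
        if domain in popular_set or domain in seen:
            continue
        seen.add(domain)
        candidates.append(domain)
    return candidates
```

Calling clean_candidates(["www.ziften.com", "google.com"], ["google.com"]) keeps only ziften.com, since google.com is on the baseline list.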
After data cleansing, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, but now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domain names that are one edit away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in the first post is useful here), we can also find anomalous domain names in the edit distance space.
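To make the nearest-neighbor step concrete, here is a minimal Python sketch. It uses plain Levenshtein distance and a brute-force scan of the popular list, which is fine for experimenting but slow on large lists. The function names and the returned fields are illustrative, and normalizing by the longer of the two names is an assumption; the first post may define the normalized distance differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def split_domain(domain: str):
    """Split 'ziften.com' into ('ziften', 'com')."""
    name, _, tld = domain.rpartition(".")
    return name, tld


def nearest_neighbor(candidate: str, popular: list):
    """Find the closest popular domain in the same top-level domain.

    `popular` is assumed to be ordered by popularity, so the list
    position doubles as the neighbor's rank. Ties go to the
    higher-ranked domain. Returns None if no popular domain shares
    the candidate's TLD.
    """
    name, tld = split_domain(candidate)
    best = None
    for rank, domain in enumerate(popular, start=1):
        pop_name, pop_tld = split_domain(domain)
        if pop_tld != tld:
            continue
        dist = edit_distance(name, pop_name)
        if best is None or dist < best[1]:
            best = (domain, dist, rank, pop_name)
    if best is None:
        return None
    domain, dist, rank, pop_name = best
    longest = max(len(name), len(pop_name), 1)
    return {
        "neighbor": domain,
        "rank": rank,
        "distance": dist,
        "normalized": dist / longest,       # assumed normalization
        "is_typo_candidate": dist == 1,     # one edit from a popular name
    }
```

For example, nearest_neighbor("gooogle.com", ["google.com", "wikipedia.org"]) reports google.com at distance 1 and flags it as a typo candidate, while a long random-looking name would come back with a normalized distance close to 1.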
What were the Results?
Let’s look at how these results appear in real life. Be careful when browsing to these domains, since they could contain malicious content!
Here are a few potential typos. Typo squatters target popular domains, since there is a better chance that someone will visit them. Several of these are suspect according to our threat feed partners, but there are some false positives as well, with cute names like “wikipedal”.
So now we have developed two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of nearest neighbor, distance from neighbor, and a flag for being edit distance 1 from the neighbor, indicating a risk of typo shenanigans. Other features that might play well with these are lexical features like word and n-gram distributions, entropy, and the length of the string, and network features like the number of failed DNS requests.
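As a rough illustration of how these features could be assembled for a model, here is a short Python sketch. The feature names are invented for this example, the neighbor rank and distance are assumed to have been computed already, and character-level Shannon entropy is just one common way to measure how random a string looks. Network features such as failed DNS requests would come from a separate data source and are left out.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    counts = Counter(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def char_ngrams(s: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (bigrams by default)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def domain_features(name: str, neighbor_rank: int, neighbor_distance: int) -> dict:
    """Combine the edit distance features with simple lexical ones.

    `name` is the domain without its top-level domain;
    `neighbor_rank` and `neighbor_distance` describe its nearest
    popular neighbor and are assumed to be precomputed.
    """
    return {
        "neighbor_rank": neighbor_rank,
        "neighbor_distance": neighbor_distance,
        "one_edit_from_neighbor": int(neighbor_distance == 1),
        "length": len(name),
        "entropy": shannon_entropy(name),
        "distinct_bigrams": len(char_ngrams(name, 2)),
    }
```

A row like domain_features("gooogle", 1, 1) can then be fed to whatever model you prefer, alongside the network features.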
Simplified Code that you can Experiment with
Here is a simplified version of the code to play with! It was developed on HP Vertica, but this SQL should work with most advanced databases. Note that the Vertica editDistance function may go by a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).