Book Chapter Review - Math for Security
Posted on Wed 10 July 2024 in Books
Introduction
note: this post was not generated by any AI.
I was at the library looking for some books when I saw an interesting title: Math for Security: From Graphs and Geometry to Spatial Analysis. I felt it was a good time to review and learn new concepts, try a few new things, and maybe update this blog.
Overall, I really enjoyed what I have read so far. Chapters 3 and 4 were a really good way to review some concepts. The reading was very different from Knuth's famous books Concrete Mathematics or The Art of Computer Programming, v.1. I always felt really dumb/blind when Knuth says "it's easy to see that..." and I was like wtheck?
I have to revisit chapter 5 and its Social Network Analysis (SNA). I liked the idea, but I think the example given could be different or have more details. This is not a demerit; I honestly think Daniel Reilly, the book's author, spent a good amount of time explaining the fundamental concepts and creating practical examples for his book.
However, I was hoping to see the theory being applied to actual problems or use cases that Security Professionals face daily. In my humble opinion, this was the issue with chapters 3 and 4: too much time explaining theory, or too vague to tackle the issues.
I used graph theory at some point during my Master's and PhD degrees to detect anomalous traffic. I even published a paper on how to use graphs at the TLD level to identify suspicious behavior [1]. I am not a graph researcher or an applied math researcher or anything like that. I just think graphs are an important subject and can be fun, especially as they can help a lot in daily security tasks.
This post brings more examples of how graph theory can be used to identify malicious behavior in SSH and DNS traffic, and also how to extract relevant information during a vulnerability assessment. I hope the additional examples can help Security Professionals trying to learn more about graphs.
Practical Examples
SSH Brute Force
Given the following SSH brute force attack logs [2], what can we learn from them?
ian 24 08:54:55 Linux-Server sshd[372745]: Invalid user skaret from 201.184.50.251 port 51582
ian 24 08:54:55 Linux-Server sshd[372745]: pam_unix(sshd:auth): check pass; user unknown
ian 24 08:54:55 Linux-Server sshd[372745]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=201.184.50.251
ian 24 08:54:57 Linux-Server sshd[372743]: Failed password for invalid user root from 218.92.0.29 port 23264 ssh2
ian 24 08:54:57 Linux-Server sshd[372745]: Failed password for invalid user skaret from 201.184.50.251 port 51582 ssh2
ian 24 08:54:59 Linux-Server sshd[372743]: Received disconnect from 218.92.0.29 port 23264:11: [preauth]
ian 24 08:54:59 Linux-Server sshd[372743]: Disconnected from invalid user root 218.92.0.29 port 23264 [preauth]
ian 24 08:54:59 Linux-Server sshd[372743]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.92.0.29 user=root
ian 24 08:54:59 Linux-Server sshd[372745]: Received disconnect from 201.184.50.251 port 51582:11: Bye Bye [preauth]
ian 24 08:54:59 Linux-Server sshd[372745]: Disconnected from invalid user skaret 201.184.50.251 port 51582 [preauth]
ian 24 08:55:13 Linux-Server sshd[372748]: User root from 180.101.88.221 not allowed because not listed in AllowUsers
ian 24 08:55:13 Linux-Server sshd[372748]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.221 user=root
ian 24 08:55:15 Linux-Server sshd[372748]: Failed password for invalid user root from 180.101.88.221 port 62046 ssh2
ian 24 08:55:18 Linux-Server sshd[372748]: Failed password for invalid user root from 180.101.88.221 port 62046 ssh2
ian 24 08:55:21 Linux-Server sshd[372748]: Failed password for invalid user root from 180.101.88.221 port 62046 ssh2
ian 24 08:55:23 Linux-Server sshd[372748]: Received disconnect from 180.101.88.221 port 62046:11: [preauth]
ian 24 08:55:23 Linux-Server sshd[372748]: Disconnected from invalid user root 180.101.88.221 port 62046 [preauth]
ian 24 08:55:23 Linux-Server sshd[372748]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.221 user=root
ian 24 08:56:04 Linux-Server sshd[372762]: Invalid user ubuntu from 201.184.50.251 port 43720
ian 24 08:56:04 Linux-Server sshd[372762]: pam_unix(sshd:auth): check pass; user unknown
ian 24 08:56:04 Linux-Server sshd[372762]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=201.184.50.251
ian 24 08:56:06 Linux-Server sshd[372762]: Failed password for invalid user ubuntu from 201.184.50.251 port 43720 ssh2
ian 24 08:56:08 Linux-Server sshd[372762]: Received disconnect from 201.184.50.251 port 43720:11: Bye Bye [preauth]
ian 24 08:56:08 Linux-Server sshd[372762]: Disconnected from invalid user ubuntu 201.184.50.251 port 43720 [preauth]
ian 24 08:56:48 Linux-Server sshd[372771]: Invalid user alberik from 118.163.63.23 port 38078
ian 24 08:56:48 Linux-Server sshd[372771]: pam_unix(sshd:auth): check pass; user unknown
In the example below, there is a directed graph (digraph) where the nodes represent IP addresses and usernames. The edges are connections going from an IP address to a username. For example, the username ubuntu was requested by the IP address 201.184.50.251.
ian 24 08:56:04 Linux-Server sshd[372762]: Invalid user ubuntu from 201.184.50.251 port 43720
A few points that can be inferred from the logs:
- The root username was requested by two different IPs: {180.101.88.221 and 218.92.0.29}
- The out-degree of an IP node reveals how many different users an IP has tried when attacking the SSH host
- Identify possible target accounts/usernames
- Most frequently requested username
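As a minimal sketch of how such a digraph could be built from the log lines, assuming the networkx library and a regular expression of my own (the names LOGIN_RE and build_graph are illustrative, not from the original analysis):

```python
import re
import networkx as nx

# Matches sshd lines such as:
#   "... Invalid user skaret from 201.184.50.251 port 51582"
#   "... Failed password for invalid user root from 218.92.0.29 port 23264 ssh2"
# The pattern is an assumption based on the sample log above.
LOGIN_RE = re.compile(
    r"[Ii]nvalid user (?P<user>\S+) from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)

def build_graph(lines):
    """Return a DiGraph with one edge per (source IP -> username) pair."""
    G = nx.DiGraph()
    for line in lines:
        match = LOGIN_RE.search(line)
        if match:
            # A DiGraph stores each (ip, user) edge once, so repeated
            # attempts with the same username do not inflate the degree.
            G.add_edge(match["ip"], match["user"])
    return G

sample = [
    "ian 24 08:54:55 Linux-Server sshd[372745]: Invalid user skaret from 201.184.50.251 port 51582",
    "ian 24 08:54:57 Linux-Server sshd[372743]: Failed password for invalid user root from 218.92.0.29 port 23264 ssh2",
    "ian 24 08:55:15 Linux-Server sshd[372748]: Failed password for invalid user root from 180.101.88.221 port 62046 ssh2",
    "ian 24 08:56:04 Linux-Server sshd[372762]: Invalid user ubuntu from 201.184.50.251 port 43720",
]
G = build_graph(sample)
print(sorted(G.predecessors("root")))   # IPs that asked for root
print(G.out_degree("201.184.50.251"))   # distinct usernames tried by this IP
```

Because edges are deduplicated, the out-degree counts distinct usernames per IP rather than raw attempts; a MultiDiGraph would be one option if every attempt should count.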
Taking another data sample available on the Internet [3], the behavior is a bit different, as the attacker tried at least 38 different usernames. If you find a username from your logs being attacked, you could restrict such requests using AllowUsers in sshd_config:
AllowUsers admin@10.10.1.1 foobar@192.168.1.*
Question
Given the above data, which strategy would you use to block the attacks: the number of tries, or the number of usernames? Ping me on @kaiux and let me know.
The data provided so far is quite small, so using graph theory here is just a fancy dictionary counter. However, for larger datasets, graph theory is very handy, as we can use it to filter out irrelevant data. For example, take the data available at https://yourdata.forsale/nicetry.txt, which has over 9k SSH entries.
$ wc nicetry.txt
9087 109005 890288 nicetry.txt
$ cat nicetry.txt
...
Jul 14 04:03:07 yourdata.forsale3 sshd[238216]: Invalid user admin from 135.125.133.180 port 60070
Jul 14 04:02:40 yourdata.forsale3 sshd[238209]: Invalid user ftptest from 95.85.47.10 port 49664
Jul 14 04:01:49 yourdata.forsale3 sshd[238192]: Invalid user ali from 101.32.128.77 port 51630
Jul 14 04:01:20 yourdata.forsale3 sshd[238183]: Invalid user oracle from 27.72.62.222 port 42770
...
Jul 13 21:12:31 yourdata.forsale3 sshd[231987]: Invalid user user5 from 210.212.47.83 port 42980
Jul 13 21:11:55 yourdata.forsale3 sshd[231982]: Invalid user ubuntu from 186.13.143.106 port 32796
Jul 13 21:11:45 yourdata.forsale3 sshd[231976]: Invalid user oracle from 188.166.160.119 port 45016
...
Jul 9 01:24:33 yourdata.forsale3 sshd[83516]: Invalid user user from 43.154.203.106 port 36390
Jul 9 01:24:27 yourdata.forsale3 sshd[83513]: Invalid user ubuntu from 170.79.37.82 port 33196
Jul 9 01:24:20 yourdata.forsale3 sshd[83509]: Invalid user test02 from 43.156.132.217 port 37736
Jul 9 01:24:02 yourdata.forsale3 sshd[83505]: Invalid user test from 43.134.168.209 port 39296
...
Trying to plot the data as is won't be useful, as there is too much data (noise) to investigate. Here are some possible ideas to better visualize it:
- Filtering by day
- Filtering by day and hour
- Filtering by date, hour and minute (etc)
- Filtering by out-degree (biggest scanners)
- Filtering by in-degree (most attacked names)
- Filtering by biggest scanner
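The time-based filters in the list could be sketched as follows, using only the standard library. The regular expression and the edges_by_hour helper are assumptions of mine based on the syslog lines shown earlier, not code from the original analysis:

```python
import re
from collections import defaultdict

# Syslog prefix ("Jul 14 04:03:07 ...") followed by the sshd message.
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s.*"
    r"Invalid user (?P<user>\S+) from (?P<ip>\S+)"
)

def edges_by_hour(lines):
    """Group (ip, user) edges under a (month, day, hour) key."""
    buckets = defaultdict(set)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            key = (m["month"], int(m["day"]), int(m["hour"]))
            buckets[key].add((m["ip"], m["user"]))
    return buckets

sample = [
    "Jul 14 04:03:07 yourdata.forsale3 sshd[238216]: Invalid user admin from 135.125.133.180 port 60070",
    "Jul 14 04:02:40 yourdata.forsale3 sshd[238209]: Invalid user ftptest from 95.85.47.10 port 49664",
    "Jul 13 21:12:31 yourdata.forsale3 sshd[231987]: Invalid user user5 from 210.212.47.83 port 42980",
    "Jul 9 01:24:33 yourdata.forsale3 sshd[83516]: Invalid user user from 43.154.203.106 port 36390",
]
buckets = edges_by_hour(sample)
for key in sorted(buckets):
    print(key, sorted(buckets[key]))
```

Each bucket is a set of edges, so it can be handed to `G.add_edges_from(...)` to build one smaller graph per hour. Dropping the hour from the key gives the per-day filter.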
Example using out_degree and in_degree to identify the biggest scanners and the top attacked names:
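from operator import itemgetter

# G is the digraph built from the log (edges go from IP to username)
out_degree = G.out_degree()
# sort ascending by degree and keep the last five entries
top5 = sorted(out_degree, key=itemgetter(1))[-5:]
print("** Top-5 Out-Degree")
print(top5)
in_degree = G.in_degree()
top5 = sorted(in_degree, key=itemgetter(1))[-5:]
print("** Top-5 In-Degree")
print(top5)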
Output:
** Top-5 Out-Degree
[('31.184.198.71', 79), ('94.156.71.74', 105), ('85.209.133.20', 201), ('185.68.22.235', 267), ('193.233.253.20', 419)]
** Top-5 In-Degree
[('oracle', 175), ('test', 262), ('user', 292), ('admin', 341), ('ubuntu', 418)]
From the output above, we can see that ubuntu is the most targeted username, and 193.233.253.20 is the IP address that attacked the most in the logs. Interestingly, if you check the reputation of 193.233.253.20, you can find it has been reported for SSH attacks, web scraping, etc., confirming the graph analysis.
Besides that, by using the out-degree property, it is possible to identify more details about the scan behavior. The code below was used to plot the frequency of the number of requests. The filterout lambda keeps only nodes with an out-degree of at least 1, which ignores the username nodes (they have no outgoing edges).
import matplotlib.pyplot as plt

out_degree = G.out_degree()
# username nodes have out-degree 0, so this keeps only the IP nodes
filterout = lambda x: x >= 1
out = [x[1] for x in out_degree if filterout(x[1])]
plt.hist(out, bins=100)
plt.yscale('log')
plt.xlabel("Num of requests")
plt.ylabel("Frequency in log")
plt.show()
Basically, most of the scans stay under 50 requests, and I would consider anything above that an outlier. Nevertheless, if you are running an SSH daemon that only supports public key authentication, there is no reason to keep your server getting overloaded with so many requests. You can configure SSHGuard or Fail2Ban to block an attacker after N attempts, depending on your risk tolerance.
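To make the threshold idea concrete, here is a small sketch that turns (node, out-degree) pairs into a candidate block list. The over_threshold helper is hypothetical, and the cut-off of 50 simply mirrors the rough outlier boundary mentioned above; tune it to your own risk tolerance:

```python
# (node, out-degree) pairs copied from the Top-5 Out-Degree output above
top5 = [('31.184.198.71', 79), ('94.156.71.74', 105), ('85.209.133.20', 201),
        ('185.68.22.235', 267), ('193.233.253.20', 419)]

def over_threshold(degrees, threshold=50):
    """Return nodes whose out-degree is strictly above the threshold, worst first."""
    flagged = [(node, deg) for node, deg in degrees if deg > threshold]
    ranked = sorted(flagged, key=lambda pair: pair[1], reverse=True)
    return [node for node, _ in ranked]

print(over_threshold(top5))                 # all five exceed 50
print(over_threshold(top5, threshold=200))  # only the three heaviest scanners
```

In practice the input would be `G.out_degree()` for the whole graph rather than the top five, and the resulting list could feed whatever blocking mechanism you already use.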
More examples
I explained how graph theory could be used against an SSH brute force attack; however, the same approach can be extended and applied to any protocol or use case, including web crawling/brute force authentication, DNS scanning, and visualizing vulnerabilities.
DNS Example
DNS traffic is rich and provides many components that can be used in a graph analysis. I spent a lot of time during my undergrad life researching DNS traffic to identify botnets and suspicious traffic. For example, take the dataset from CAIDA [4].
A graph can be used to identify source IPs issuing PTR (reverse) lookups against different CIDR blocks. In my research paper [1], I classified this behavior as PTR-scan attacks, where botnets use PTR-type queries to search for IP addresses that are live and active. It is a kind of silent scanning methodology.
With QNAME data (e.g., 1.82.110.216.in-addr.arpa.) it is possible to identify which networks are being scanned at the moment if one breaks it into CIDR blocks.
$ grep PTR dns-traffic.20150724.txt| head
64.132.94.250 1437696007.046802000 AA,RA ANS PTR 86400 1.82.110.216.in-addr.arpa. BHN-SNAN-7.bridgeheadnetworks.com.
68.87.76.228 1437696007.066666000 AA ANS PTR 3600 1.120.249.98.in-addr.arpa. c-98-249-120-1.hsd1.nm.comcast.net.
213.197.27.201 1437696007.129005000 AA ANS PTR 86400 1.38.127.128.in-addr.arpa. 807f2601.ftth.concepts.nl.
212.27.53.199 1437696007.268300000 AA ANS PTR 86400 1.28.254.78.in-addr.arpa. mrj31-1.dslg.proxad.net.
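The PTR names in the sample above can be folded into CIDR blocks with the standard ipaddress module. This is only a sketch: the ptr_to_network helper and the default /24 prefix are my assumptions, not something taken from the paper or the dataset:

```python
import ipaddress

def ptr_to_network(qname, prefix=24):
    """Map a full IPv4 in-addr.arpa QNAME to the network that contains it."""
    labels = qname.rstrip(".").split(".")
    if len(labels) != 6 or labels[-2:] != ["in-addr", "arpa"]:
        raise ValueError(f"not a full IPv4 PTR name: {qname}")
    # PTR names store the octets in reverse order
    address = ".".join(reversed(labels[:4]))
    # strict=False masks the host bits instead of raising an error
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)

print(ptr_to_network("1.82.110.216.in-addr.arpa."))  # 216.110.82.0/24
```

Adding an edge from the querying IP to the resulting network (instead of to the single address) makes a host that walks through a whole block stand out through its out-degree.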
In a corporate network, the DNS traffic sent by a host should be A-type requests most of the time. However, a graph with IP address and QTYPE nodes (e.g., A, SOA, MX, NS) would also identify suspicious hosts.
Code example using IP and QTYPE to identify the largest DNS requester and its connection type.
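import fileinput
import networkx as nx

G = nx.DiGraph()
for line in fileinput.input("dns-traffic.20150724.txt", encoding="utf-8"):
    tokens = line.strip().split()
    # keep only complete records (8 whitespace-separated fields)
    if len(tokens) != 8:
        continue
    # edge from the IP address (field 0) to the QTYPE (field 4)
    G.add_edge(tokens[0], tokens[4])
largest = max(G.out_degree, key=lambda t: t[1])
print(G.edges(largest[0]))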
Output:
[('64.132.94.250', 'PTR'), ('64.132.94.250', 'NS'), ('64.132.94.250', 'A'), ('64.132.94.250', 'AAAA'), ('64.132.94.250', 'SOA')]
I am not fully up to date with the state of the art in DNS anomaly detection; however, you can perform lexical analysis on the QNAME to identify different behaviors such as typosquatting domains, algorithmically generated domain names, and DNS data exfiltration.
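One common lexical feature is the Shannon entropy of a label, since random-looking names tend to score higher than dictionary-like ones. The sketch below is only an illustration of that single feature, not a detector and not the method from my paper. Both sample labels are made up:

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Shannon entropy, in bits per character, of a DNS label."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical labels: one word-like, one random-looking
for name in ("google", "xkq7vz2p9wq4"):
    print(name, round(shannon_entropy(name), 2))
```

Entropy alone produces false positives (CDN hostnames, hashes in legitimate names), so it would normally be combined with other features such as label length or n-gram frequency.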
Vulnerability management
Let's assume you are scanning container images and you need to know which packages have the largest number of vulnerabilities. For that, I used Trivy to scan and export the SBOM results in CycloneDX format. With the SBOM output, it is easy to collect more details about each vulnerability, including severity, CVSS score, related vulnerabilities, etc. You can also use other scanners for that, such as grype and Amazon Inspector Sbomgen.
For the analysis, I pulled an old Debian 10 (buster) image and scanned accordingly:
# pulling old debian image
$ docker pull debian:buster-20210902
# Scanning and generating SBOM with Trivy
$ /usr/bin/trivy image --format cyclonedx \
--output /tmp/result.json \
--scanners vuln --ignore-unfixed debian:buster-20210902
The idea here is to read the JSON file generated by the scanner and fetch the items from the vulnerabilities key. For each CVE id found, create a node with an edge to the node of each package name it affects. In the following example we can see CVE-2016-10228 associated with the packages libc6 and libc-bin.
"vulnerabilities": [
{
"id": "CVE-2016-10228",
...
"updated": "2023-11-07T02:29:33+00:00",
"affects": [
{
"ref": "pkg:deb/debian/libc-bin@2.28-10?arch=amd64&distro=debian-10.10",
...
{
"ref": "pkg:deb/debian/libc6@2.28-10?arch=amd64&distro=debian-10.10",
..
}
]
},
The following code parses the SBOM output and creates the nodes as mentioned above. The remove list is used to exclude every node with a degree of 3 or less, so only nodes with more than 3 connections remain in the plot.
import json
from packageurl import PackageURL
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
# path written by the Trivy command above
with open("/tmp/result.json") as fh:
    data = json.load(fh)
# default to an empty list when the SBOM has no "vulnerabilities" key
vulns = data.get("vulnerabilities", [])
for item in vulns:
    cve_id = item['id']
    for pkg in item['affects']:
        # the ref is a package URL (purl); keep only the package name
        _purl = PackageURL.from_string(pkg['ref'])
        pkg_name = _purl.to_dict()['name']
        G.add_edge(cve_id, pkg_name)
# drop every node (CVE or package) with a degree of 3 or less
remove = [node for node, degree in dict(G.degree()).items() if degree <= 3]
G.remove_nodes_from(remove)
pos = nx.layout.shell_layout(G, scale=100)
nx.draw(G, pos, with_labels=True)
plt.show()
In the output below we can see packages like libtinfo6 affected by multiple CVEs.
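If a plot is more than you need, the same graph can answer the original question (which packages have the most vulnerabilities) directly through the in-degree. A minimal sketch with a toy graph follows. Only the CVE-2016-10228 edges come from the SBOM excerpt above. The CVE-PLACEHOLDER ids and their edges are made up for illustration:

```python
import networkx as nx

G = nx.DiGraph()
# CVE-2016-10228 -> libc6 / libc-bin is taken from the SBOM excerpt;
# the CVE-PLACEHOLDER-* entries are invented for this example.
G.add_edges_from([
    ("CVE-2016-10228", "libc6"),
    ("CVE-2016-10228", "libc-bin"),
    ("CVE-PLACEHOLDER-1", "libc6"),
    ("CVE-PLACEHOLDER-2", "libc6"),
    ("CVE-PLACEHOLDER-2", "libtinfo6"),
])

# Edges go from CVE to package, so only package nodes have in-degree > 0,
# and that in-degree is the number of CVEs affecting the package.
packages = [(node, deg) for node, deg in G.in_degree() if deg > 0]
ranking = sorted(packages, key=lambda pair: pair[1], reverse=True)
print(ranking)
```

With the real graph, this ranking should be computed before any nodes are removed for plotting, otherwise the counts reflect the pruned graph.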
Final words
I hope I was able to bring additional examples of graph theory and show how helpful it can be to include it in your tasks as a Security Professional. With the data extracted, you could create a pipeline to ingest it into different services such as AWS Network Firewall, or use it as a threat list in Amazon GuardDuty or on a border router, and protect your organization from different actors.
References
[1] https://ieeexplore.ieee.org/abstract/document/7363066
[2] https://serverfault.com/questions/1152134/what-do-these-logs-mean-is-someone-attempting-to-hack-into-my-server-via-ssh
[3] https://security.stackexchange.com/questions/110706/am-i-experiencing-a-brute-force-attack
[4] https://publicdata.caida.org/datasets/topology/ark/ipv4/dns-traffic/daily/dns-traffic.20150724.txt.gz