t-SNE Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) provides a powerful way to visualize high-dimensional data in 2D space. It plays a similar role to PCA but is nonlinear. We essentially build two distributions over pairs of points. The first lives in the high-dimensional space and is Gaussian; a hyperparameter, “perplexity”, sets the effective number of neighbors each point considers and so strongly influences the clustering. The second lives in 2D or 3D and uses a Student's t-distribution with 1 degree of freedom. An optimizer such as gradient descent then moves the low-dimensional points to minimize the KL divergence between the two distributions.
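To make the two distributions concrete, here is a minimal pure-Python sketch of the objective on toy data. It is not the code used for the plots in this report. The function names, the toy points, and the fixed Gaussian bandwidth `sigma` are all assumptions made for illustration; real t-SNE tunes a separate bandwidth per point from the perplexity.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gaussian_affinities(points, sigma=1.0):
    """High-dimensional joint probabilities P from a Gaussian kernel.

    Simplification: one fixed sigma for every point, instead of a
    per-point bandwidth chosen to match a target perplexity.
    """
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if i == j else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])
    # Symmetrize the conditional probabilities so that P sums to 1.
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]

def student_t_affinities(embedding):
    """Low-dimensional joint probabilities Q from a Student's t kernel (1 d.o.f.)."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def kl_divergence(P, Q):
    """KL(P || Q), skipping zero entries of P (the diagonal)."""
    return sum(
        p * math.log(p / q)
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )

# Toy data: two tight pairs of points in 4D.
high_dim = [[0.0, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.0],
            [3.0, 3.0, 3.0, 3.0], [3.2, 3.0, 3.0, 3.0]]
good_2d = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # keeps the pairs together
bad_2d = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]   # splits the pairs

P = gaussian_affinities(high_dim)
kl_good = kl_divergence(P, student_t_affinities(good_2d))
kl_bad = kl_divergence(P, student_t_affinities(bad_2d))
print("KL for neighbor-preserving layout:", kl_good)
print("KL for neighbor-breaking layout:", kl_bad)
```

The layout that keeps high-dimensional neighbors together gets the lower KL divergence, which is the quantity the optimizer drives down by moving the 2D points.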

Note: t-SNE computations are disabled by default (eval: false) to speed up rendering. To regenerate t-SNE plots, set eval: true in the code chunk below. Computation takes 10-30 minutes for 2D embeddings. Cached results will be used if available.
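The "use the cache if available" behavior can be sketched as a small load-or-compute helper. This is an illustration only: the helper name `load_or_compute`, the file name `tsne_2d.pkl`, and the choice of Python with pickle are assumptions, not a description of how this report stores its cached embeddings.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached result if cache_path exists; otherwise compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

calls = []

def slow_embedding():
    # Stand-in for the expensive t-SNE run; returns made-up 2D coordinates.
    calls.append(1)
    return [(0.0, 0.0), (1.0, 1.0)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tsne_2d.pkl")  # hypothetical cache file name
    first = load_or_compute(path, slow_embedding)   # computes and writes the cache
    second = load_or_compute(path, slow_embedding)  # reads the cache, no recomputation
```

With this pattern the slow function runs only when no cache file is present, so deleting the file is what forces a fresh embedding.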

2D t-SNE Visualization

While points grouped together in a t-SNE plot are related, the size of the clusters and their distance from one another can change a lot with perplexity. Nevertheless, we can see some interesting properties of our dataset. Zero-days are one of the most spread-out groups. Intuitively, since zero-days introduce brand-new vulnerabilities, they are often completely different from other malware. This is also why they have no “families”. I’m curious whether the zero-day clusters could be worth investigating: perhaps they do actually have similarities and are the work of some as-yet-undiscovered hacker. Trojans (gold) form distinct clusters apart from those made by Riskware/Ransomware (green/cyan) and adware (black); these are the main clusters.
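The sensitivity to perplexity follows from how it is used: for each point, t-SNE searches for a Gaussian bandwidth whose neighbor distribution has the requested perplexity (2 raised to its Shannon entropy). The sketch below shows that search for a single point. The function names, the search bounds, and the toy distances are assumptions for illustration, not values from this dataset.

```python
import math

def perplexity_for_sigma(sq_dists, sigma):
    """Perplexity (2 ** entropy) of one point's Gaussian neighbor distribution."""
    d_min = min(sq_dists)
    # Subtracting d_min leaves the probabilities unchanged and avoids an all-zero underflow.
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log space for the sigma that gives the target perplexity."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if perplexity_for_sigma(sq_dists, mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Squared distances from one point to its 7 neighbors (toy values).
sq_dists = [0.1, 0.2, 0.5, 4.0, 9.0, 16.0, 25.0]

sigma_small = sigma_for_perplexity(sq_dists, target=2.0)
sigma_large = sigma_for_perplexity(sq_dists, target=5.0)
print("sigma for perplexity 2:", sigma_small)
print("sigma for perplexity 5:", sigma_large)
```

A higher perplexity forces a wider bandwidth, so each point treats more distant points as neighbors. That changes the target distribution the 2D layout has to match, which is why cluster sizes and gaps shift between runs with different perplexity.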