Android Malware Classification
Overview
The global cybersecurity industry is currently valued at roughly 285 billion USD, but in an era of deepfakes, increasingly capable agentic AI, and uncertain international relations, I think this is an underestimate. There will be a growing need for data science techniques to catch malware, identify malicious entities, and enhance tools for tasks like penetration testing. This project applies visualization tools and random forest models to understand Android malware samples, using features extracted from dynamic analysis after reboot.
Research Questions
Which permissions and API calls are most distinctive for each malware category? Do the same features also work for classifying families? How about original (zero-day) malware?
Dataset
We use the AndMal2020 dataset collected by UNB’s Canadian Institute for Cybersecurity.
The CCCS-CIC-AndMal-2020 dataset can be viewed as three main groups:
- A static benign group (e.g., Ben0, Ben1, etc.), where each row is a benign Android app represented by a very high-dimensional static feature vector of roughly 9.5k manifest/metadata features (permissions, activities, system features, and counts).
- A static malware group consisting of CSVs such as Adware.csv, Riskware.csv, and other malicious families, where each row is a malware app described by the same ~9.5k static features as the benign group.
- A dynamic malware group, which combines the before- and after-reboot executions of malware apps. Here each row is a malware sample run in an instrumented environment and summarized by 141 dynamic behavior features (API usage, memory, network, battery, logcat, and process statistics), along with malware family/category labels. The runs are repeated after rebooting the emulator because some malware activates certain functionality only after a reboot.
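The before- and after-reboot runs in the third group can be stacked into one table with a column marking the execution phase. A minimal Python sketch of that idea follows; the project's actual pipeline is in R/Quarto, and all column names and values here are toy stand-ins, not the dataset's real headers.

```python
import pandas as pd

def combine_reboot_runs(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Stack before- and after-reboot runs, tagging each row with its phase."""
    before = before.assign(phase="before_reboot")
    after = after.assign(phase="after_reboot")
    return pd.concat([before, after], ignore_index=True)

# Toy stand-ins for the real CSVs (the real rows carry ~141 behavior features).
before = pd.DataFrame({"api_calls": [120, 45], "family": ["Dowgin", "Ewind"]})
after = pd.DataFrame({"api_calls": [300, 50], "family": ["Dowgin", "Ewind"]})
dynamic = combine_reboot_runs(before, after)
```

Keeping the phase as an explicit column makes it easy to compare the same sample's behavior before and after reboot.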
Data Preprocessing
All data loading and feature engineering are performed in preprocessing.qmd, which:
- Loads raw CSV files from the AndMal2020 dataset
- Creates derived features using logarithms and ratios/counts of:
- Memory utilization metrics
- Network activity summaries
- Database access patterns
- Privacy-related API calls
- IPC/Binder behaviors
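The actual transforms live in preprocessing.qmd (Quarto/R); as an illustration of the log and ratio derivations, here is a hedged Python sketch with hypothetical column names that do not necessarily match the dataset's.

```python
import numpy as np
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative log and ratio transforms; column names are hypothetical."""
    out = df.copy()
    # log1p tames the heavy right tails typical of memory/network counters.
    out["log_memory_pss"] = np.log1p(out["memory_pss_total"])
    # Ratios normalize one activity by total volume, e.g. received vs. all traffic.
    total_traffic = out["network_rx_bytes"] + out["network_tx_bytes"]
    out["rx_ratio"] = out["network_rx_bytes"] / total_traffic.replace(0, np.nan)
    return out

sample = pd.DataFrame({
    "memory_pss_total": [10_000, 250_000],
    "network_rx_bytes": [500, 0],
    "network_tx_bytes": [1500, 0],
})
features = add_derived_features(sample)
```

Replacing a zero denominator with NaN avoids silent division-by-zero artifacts in the ratio features.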
Tools & Technologies
All preprocessing, model training, and plotting were done remotely over SSH on a Runpod cluster with 8 vCPUs, 64 GB RAM, 25 GB container storage, and 85 GB volume storage. I prioritized RAM, and to a lesser degree CPU, to handle the large dataset in memory while building decision trees. Since the computations are mostly sequential and involve no massive matrix operations, I did not spend any of my budget on GPUs. I also chose ranger for random forests because the library is built for parallel computing.
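The project's forests are trained with R's ranger, which parallelizes tree construction across CPU cores. As a rough analogue only, the same idea can be sketched in Python with scikit-learn's `n_jobs=-1`; the synthetic data below merely mimics the shape of the problem (141 features, multiple malware categories) and is not the AndMal2020 data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 141 features as in the dynamic set, 5 toy "categories".
X, y = make_classification(n_samples=2000, n_features=141, n_informative=30,
                           n_classes=5, n_clusters_per_class=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_jobs=-1 builds trees on all available cores -- CPU parallelism, no GPU,
# mirroring why a many-vCPU, high-RAM machine fits this workload.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)
```

Since each tree is grown independently, random forests scale near-linearly with core count, which is exactly the trade-off that favors vCPUs and RAM over GPUs here.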