Malware families often have different objectives and structures that bear a strong resemblance to their names. Our dataset contains the following categories:
Adware: Ad-supported apps that aggressively push unwanted ads, often in notifications or full-screen overlays. They may track user behavior and run persistently in the background.
Backdoor: Malware that opens a hidden remote-access channel into the device so an attacker can send commands, upload/download files, or install additional payloads, usually without the user noticing.
File Infector: Malware that attaches itself to Android APK files, so installing or updating the app also installs the malicious code. It can degrade performance, drain battery, and execute whenever the infected app runs.
No Category: Clearly malicious samples that don’t fit cleanly into any of the other 13 behavior-based categories or have inconsistent/ambiguous AV labels. In the dataset this is essentially a catch-all “miscellaneous malware” bucket.
PUA (Potentially Unwanted Apps): Legit-looking apps bundled with extra components (toolbars, ad injectors, trackers) the user probably didn’t explicitly want. They may harvest data, abuse permissions, or show intrusive ads while still posing as “utility” or “optimization” software.
Ransomware: Mobile malware that locks the screen or encrypts data, then demands a payment (ransom) to restore access. On Android it often overlays an uncloseable ransom screen rather than encrypting every file.
Riskware: Legitimate apps or tools (e.g., remote control, rooting utilities, SMS senders) that become dangerous when abused by attackers. They can be used to bypass security, exfiltrate data, or perform actions on the user’s behalf without clear consent.
Scareware: Apps or pop-ups that use fake warnings (“your phone is infected!”) to scare users into installing other malicious software or paying for bogus “security” products. The goal is to generate panic and force quick, bad decisions.
Trojan: Malware disguised as a normal or useful app; once installed it can steal data, spy on activity, modify files, or pull down more malware. The key trait is deception—users think they’re installing something benign.
Trojan-Banker: A Trojan focused on banking and financial apps, often overlaying fake login screens or reading screen content to steal credentials and 2FA codes, then drain accounts or crypto wallets.
Trojan-Dropper: A Trojan whose main job is to smuggle in and install other malware (like ransomware, RATs, or more Trojans), often using packing/obfuscation to evade detection. Think of it as an installer for additional payloads.
Trojan-SMS: Trojans that abuse the device’s SMS functionality—sending texts to premium-rate numbers, subscribing the user to services, or intercepting verification codes—usually without any visible UI.
Trojan-Spy: Spyware-like Trojans that focus on surveillance: logging keystrokes, recording SMS/call data, tracking location, and exfiltrating contacts or other sensitive information to a command-and-control server.
Zero-day: Malware that exploits previously unknown or unpatched vulnerabilities, so traditional signatures and patches don’t yet stop it. In this dataset, these are samples linked to zero-day exploits active at collection time.
These different goals lead to varying levels of API usage, hardware usage, and information transmission. This flow of information can be examined from many angles: How many bytes are sent or received through a given TCP port? Is the code mainly reading from or writing to the database? What kinds of SQL queries and file I/O does the code perform? Which APIs does the code call, and how frequently?
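Questions like these map onto numeric features. As a minimal sketch (the raw column names below are illustrative, not the dataset's real schema), heavy-tailed counts such as byte totals and call counts are typically put on a log10(x + 1) scale, which is the transform behind the log_DB_* features used in the plots that follow:

```r
# Hypothetical raw dynamic-analysis counts for three samples;
# column names are made up for illustration.
raw <- data.frame(
  tcp_bytes_sent = c(0, 12000, 530),
  db_read_calls  = c(0, 40, 3),
  db_write_calls = c(7, 0, 1)
)

# log10(x + 1) compresses the heavy tail while keeping
# zero-count samples at exactly 0.
features <- transform(
  raw,
  log_tcp_sent  = log10(tcp_bytes_sent + 1),
  log_DB_reads  = log10(db_read_calls + 1),
  log_DB_writes = log10(db_write_calls + 1)
)
```

A sample with no DB reads stays at exactly 0 after the transform, which matters later when zero-inflation itself becomes informative.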
Load Libraries and Preprocessed Data
# Load preprocessed data from preprocessing.qmd
andmal_after <- readRDS("data/processed/andmal_after.rds")
andmal_before <- readRDS("data/processed/andmal_before.rds")
feature_definitions <- readRDS("data/processed/feature_definitions.rds")

# Extract feature definitions for use in visualizations
ui_features <- feature_definitions$ui_features
dex_apis <- feature_definitions$dex_apis
webview_cols <- feature_definitions$webview_cols
fileio_apis <- feature_definitions$fileio_apis
db_apis <- feature_definitions$db_apis
db_read_apis <- feature_definitions$db_read_apis
db_write_apis <- feature_definitions$db_write_apis
filedb_apis <- feature_definitions$filedb_apis

# Global color scheme
andmal_theme <- theme_minimal(base_family = "sans") +
  theme(
    plot.title = element_text(face = "bold", hjust = 0),
    plot.subtitle = element_text(hjust = 0),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

category_col <- scale_colour_viridis_d(option = "H", end = 0.9)
category_fill <- scale_fill_viridis_d(option = "H", end = 0.9)

cat(sprintf("Loaded andmal_after: %d rows, %d columns\n",
            nrow(andmal_after), ncol(andmal_after)))
# Ensure required libraries are loaded (defensive in case cached session missed load step)
library(ggplot2)
library(dplyr)
library(scales)
library(viridis)
library(tidyr)
library(tibble)

# 1) Category-level means for all File/DB APIs ----
filedb_means <- get_combined_data() %>%
  select(Category, all_of(filedb_apis)) %>%
  group_by(Category) %>%
  summarise(
    across(everything(), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )

# Turn into matrix: rows = Categories, cols = APIs ----
filedb_mat <- filedb_means %>%
  column_to_rownames("Category") %>%
  as.matrix()

# Replace NaN/Inf (e.g., mean of all-NA column) with 0 ----
filedb_mat[!is.finite(filedb_mat)] <- 0

# 2) Drop APIs with zero variance (no information) ----
col_sds <- apply(filedb_mat, 2, sd, na.rm = TRUE)
keep_cols <- names(col_sds)[col_sds > 0 & !is.na(col_sds)]
filedb_mat2 <- filedb_mat[, keep_cols, drop = FALSE]

# 3) Z-score across categories per API ----
filedb_mat_scaled <- scale(filedb_mat2)  # center & scale each column

# 4) Hierarchical clustering on rows (Categories) and columns (APIs) ----
row_clust <- hclust(dist(filedb_mat_scaled))
col_clust <- hclust(dist(t(filedb_mat_scaled)))
row_order <- rownames(filedb_mat_scaled)[row_clust$order]
col_order <- colnames(filedb_mat_scaled)[col_clust$order]

# 5) Long format for ggplot, using clustered ordering ----
filedb_long <- as.data.frame(filedb_mat_scaled) %>%
  rownames_to_column("Category") %>%
  pivot_longer(
    cols = -Category,
    names_to = "api_call",
    values_to = "z_mean_calls"
  ) %>%
  mutate(
    Category = factor(Category, levels = row_order),
    api_call = factor(api_call, levels = col_order)
  )

# 6) Heatmap ----
p_filedb_heatmap <- ggplot(
  filedb_long,
  aes(x = Category, y = api_call, fill = z_mean_calls)
) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(
    option = "B",
    name = "Z-scored\nmean calls"
  ) +
  labs(
    title = "File I/O and database activity",
    subtitle = "Z-scored mean call counts per API",
    x = "Malware category",
    y = "File / DB API"
  ) +
  andmal_theme +
  theme(panel.grid = element_blank())

p_filedb_heatmap
By normalizing counts per API and computing per-category means, we hope to separate clusters of categories with heavy DB writes from light readers and from those with almost no DB usage. In particular, riskware, adware, and Trojans are far more database-intensive than file infectors, Trojan-SMS, and scareware; this makes sense, since the latter are lightweight and often carry one-shot payloads. Riskware is often large, mostly benign code that interfaces with a lot of the system but has been taken advantage of by another entity.
# All categories: DB reads vs writes (HDR density) ----
# Filter to categories with sufficient non-zero data points AND variance
# for density estimation
db_data <- get_combined_data() %>%
  group_by(Category) %>%
  summarise(
    non_zero_count = sum(log_DB_reads > 0 | log_DB_writes > 0),
    reads_var = var(log_DB_reads, na.rm = TRUE),
    writes_var = var(log_DB_writes, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(
    non_zero_count >= 50,  # Require at least 50 non-zero points
    reads_var > 0.01,      # Require minimum variance in reads
    writes_var > 0.01      # Require minimum variance in writes
  ) %>%
  left_join(get_combined_data(), by = "Category")

# Separate dataset for density estimation (non-zero points only);
# this prevents bandwidth estimation failures when there are too many zeros
db_data_density <- db_data %>%
  filter(log_DB_reads > 0 | log_DB_writes > 0)

p_db_read_write <- ggplot(
  db_data,
  aes(x = log_DB_reads, y = log_DB_writes)
) +
  # Use non-zero data for density estimation to avoid bandwidth issues
  ggdensity::geom_hdr(
    data = db_data_density,
    aes(fill = after_stat(probs)),
    probs = c(.99, .95, .80, .50),
    n = 300,
    alpha = 0.6  # Lower alpha for better visibility
  ) +
  facet_wrap(~ Category, scales = "free") +  # Free scales per category
  scale_fill_viridis_d(
    option = "inferno",
    name = "Density Level",
    labels = c("99%", "95%", "80%", "50%"),
    guide = guide_legend()
  ) +
  labs(
    title = "Database read vs write activity by category (KDE)",
    subtitle = "Non-zero points only; categories with at least 50 non-zero samples",
    x = "log10(total DB read calls + 1)",
    y = "log10(total DB write calls + 1)"
  ) +
  andmal_theme

p_db_read_write
I had high hopes for comparing the number of reads and writes directly. My thought process was that malware optimized to steal credentials or to threaten damage, like ransomware, would be more write-heavy; on the other hand, something like spyware might be more read-heavy. But the facet plots all show essentially the same shape, and there are no distinct patterns in their overall distributions.
# Memory Activities vs WebViews by category ----
# Filter to categories with sufficient data points where BOTH dimensions
# are non-zero; this is required for proper bandwidth estimation
memory_data_summary <- get_combined_data() %>%
  group_by(Category) %>%
  summarise(
    both_nonzero_count = sum(Memory_Activities > 0 & Memory_WebViews > 0),
    .groups = "drop"
  ) %>%
  filter(both_nonzero_count >= 50)  # At least 50 points with both > 0

# Variance and unique-value counts on the both-positive subset:
# kernel density estimation needs enough distinct values to work
memory_data_variance <- get_combined_data() %>%
  filter(Memory_Activities > 0 & Memory_WebViews > 0) %>%
  group_by(Category) %>%
  summarise(
    activities_var = var(Memory_Activities, na.rm = TRUE),
    webviews_var = var(Memory_WebViews, na.rm = TRUE),
    activities_unique = length(unique(Memory_Activities)),
    webviews_unique = length(unique(Memory_WebViews)),
    .groups = "drop"
  ) %>%
  filter(
    activities_var > 0.01,   # Minimum variance in activities
    webviews_var > 0.01,     # Minimum variance in webviews
    activities_unique >= 5,  # Minimum unique values for density estimation
    webviews_unique >= 5
  )

# Categories that pass both filters
valid_categories <- intersect(memory_data_summary$Category,
                              memory_data_variance$Category)

memory_data <- get_combined_data() %>%
  filter(Category %in% valid_categories)

# Density estimation uses points where both dimensions > 0 only;
# this prevents bandwidth failures when one dimension has many zeros
memory_data_density <- memory_data %>%
  filter(Memory_Activities > 0 & Memory_WebViews > 0)

p_memory_activities_webviews <- ggplot(
  memory_data,
  aes(x = Memory_Activities, y = Memory_WebViews)
) +
  ggdensity::geom_hdr(
    data = memory_data_density,
    aes(fill = after_stat(probs)),
    probs = c(.99, .95, .80, .50),
    n = 300,
    alpha = 0.6  # Lower alpha for better visibility
  ) +
  facet_wrap(~ Category, scales = "free") +  # Free scales per category
  scale_fill_viridis_d(
    option = "inferno",
    name = "Density Level",
    labels = c("99%", "95%", "80%", "50%"),
    guide = guide_legend()
  ) +
  labs(
    title = "Memory Activities vs WebViews by category",
    subtitle = "Non-zero points only; categories with at least 50 such samples",
    x = "Memory_Activities",
    y = "Memory_WebViews"
  ) +
  andmal_theme

p_memory_activities_webviews
Looking at the memory cost of WebViews vs Activities was a lot more fruitful. Adware and riskware have relatively high values in both, and their distributions are much more concentrated. Since the scales vary from facet to facet, we can hypothesize that these features, or weighted combinations of them, separate the data cleanly in some higher-dimensional space.
Notice that for both KDE plots a few categories are left out; for example, the read-vs-write plot is missing Scareware and Trojan-Banker. This is because, for one or both of these features, too many samples in those categories had zero values. Even though this ruins the KDE plot for those categories, it also exposes a valuable feature for classification.
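That zero-inflation can be turned into a feature directly. A minimal sketch, using a toy stand-in for get_combined_data() (the values are made up; column names follow the ones used above):

```r
library(dplyr)

# Toy stand-in for the real combined dataset
toy <- data.frame(
  Category      = c("Scareware", "Scareware", "Adware", "Adware"),
  log_DB_reads  = c(0, 0, 1.2, 0),
  log_DB_writes = c(0, 0, 0.7, 0.3)
)

# Proportion of samples per category with no DB activity at all:
# exactly the quantity that breaks the KDE, repurposed as a feature
zero_db_feature <- toy %>%
  group_by(Category) %>%
  summarise(
    prop_zero_db = mean(log_DB_reads == 0 & log_DB_writes == 0),
    .groups = "drop"
  )
```

On the toy data, Scareware gets prop_zero_db = 1 and Adware gets 0, so the categories that fail the KDE filter are precisely the ones this feature flags.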
IPC/Binder plots
We’ll group APIs into 4 behaviors:
Broadcasts - sendBroadcast / sendStickyBroadcast
Services - startService / stopService
Receivers - any *_registerReceiver / ActivityThread_handleReceiver
Activities - any *_startActivity
Then compute, for each Category, the proportion of samples that use each behavior at least once.
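The has_* flags used below are assumed to be precomputed during preprocessing. A hedged sketch of how such flags could be derived from raw per-API count columns (the API column names here are illustrative, not the dataset's real schema):

```r
library(dplyr)

# Toy per-sample API call counts; names are illustrative
toy <- data.frame(
  API_sendBroadcast       = c(3, 0),
  API_sendStickyBroadcast = c(0, 0),
  API_startService        = c(1, 0),
  API_stopService         = c(0, 2),
  API_registerReceiver    = c(0, 5),
  API_startActivity       = c(2, 0)
)

# Flag a behavior group as present if any matching API fired at least once
toy_flags <- toy %>%
  mutate(
    has_broadcast = if_any(matches("sendBroadcast|sendStickyBroadcast"), ~ .x > 0),
    has_service   = if_any(matches("startService|stopService"), ~ .x > 0),
    has_receiver  = if_any(matches("registerReceiver|handleReceiver"), ~ .x > 0),
    has_activity  = if_any(matches("startActivity"), ~ .x > 0)
  )
```

Using if_any() with matches() mirrors the "any *_registerReceiver" style of rule above, so the flags stay correct even if new API columns matching a pattern are added later.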
# Build per-sample flags for each IPC behavior group ----
ipc_group_summary <- get_combined_data() %>%
  select(Category, has_broadcast, has_service, has_receiver, has_activity) %>%
  pivot_longer(
    cols = c(has_broadcast, has_service, has_receiver, has_activity),
    names_to = "ipc_group",
    values_to = "present"
  ) %>%
  mutate(
    ipc_group = recode(
      ipc_group,
      has_broadcast = "Broadcasts",
      has_service = "Services",
      has_receiver = "Receivers",
      has_activity = "Activities"
    )
  ) %>%
  group_by(Category, ipc_group) %>%
  summarise(
    prop_present = mean(present, na.rm = TRUE),
    n_samples = n(),
    .groups = "drop"
  )

# Bar chart: proportion by Category & IPC group ----
p_ipc_groups <- ggplot(
  ipc_group_summary,
  aes(x = Category, y = prop_present, fill = ipc_group)
) +
  geom_col(position = "dodge") +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_viridis_d(
    option = "B",
    end = 0.9,
    name = "IPC/Binder behavior"
  ) +
  labs(
    title = "IPC, Binder & broadcast behavior by malware category",
    subtitle = "Proportion of samples that use each IPC group at least once",
    x = "Malware category",
    y = "% of samples"
  ) +
  andmal_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_ipc_groups
These APIs define how the app communicates inside Android (background services, broadcast receivers, and so on), which is classic backdoor/spy behavior. Apps that use them are system-integrated and work in the background constantly, as opposed to firing one-shot payloads. Trojans and ransomware use broadcasts much less than adware and riskware, since the former place a lot of importance on staying covert. Again we see evidence that lighter-weight malware such as scareware and file infectors use far fewer services.
# Long + summary for PII type proportions ----
pii_long <- get_combined_data() %>%
  select(Category, has_ids, has_accounts, has_location, has_mic) %>%
  pivot_longer(
    cols = c(has_ids, has_accounts, has_location, has_mic),
    names_to = "pii_type",
    values_to = "present"
  ) %>%
  mutate(
    pii_type = recode(
      pii_type,
      has_ids = "Identifiers (device + WiFi)",
      has_accounts = "Accounts & content",
      has_location = "Location",
      has_mic = "Microphone"
    )
  )

pii_summary <- pii_long %>%
  group_by(Category, pii_type) %>%
  summarise(
    prop_present = mean(present, na.rm = TRUE),
    n_samples = n(),
    .groups = "drop"
  )

# Stacked bar per Category: what PII is accessed? ----
p_pii_type <- ggplot(
  pii_summary,
  aes(x = Category, y = prop_present, fill = pii_type)
) +
  geom_col(position = "stack") +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_viridis_d(
    option = "B",
    end = 0.9,
    name = "PII type"
  ) +
  labs(
    title = "Types of PII accessed by malware category",
    subtitle = "Each bar shows the proportion of samples accessing different PII types",
    x = "Malware category",
    y = "% of samples"
  ) +
  andmal_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_pii_type
PII APIs are those that touch privacy-sensitive resources: device identifiers, accounts & content, location, and the microphone. The bars here can add up to over 100%, but that is fine; the total is still relevant as a metric of how aggressive the code is. My intuition was that spyware/PUA would spike here, but this did not turn out to be impactful. It also turns out that few samples actually use microphone or location access. Some malware only starts interacting with these after a reboot, but even then it is a microscopic share of samples:
# Long format for location/mic + reboot_state ----
loc_mic_long <- get_combined_data() %>%
  mutate(
    has_location = location_calls > 0,
    has_mic = mic_calls > 0
  ) %>%
  select(Category, reboot_state, has_location, has_mic) %>%
  pivot_longer(
    cols = c(has_location, has_mic),
    names_to = "sensor_type",
    values_to = "present"
  ) %>%
  mutate(
    sensor_type = recode(
      sensor_type,
      has_location = "Location",
      has_mic = "Microphone"
    )
  )

loc_mic_summary <- loc_mic_long %>%
  group_by(Category, reboot_state, sensor_type) %>%
  summarise(
    prop_present = mean(present, na.rm = TRUE),
    n_samples = n(),
    .groups = "drop"
  )

# Bar chart: before vs after reboot for location/mic ----
p_loc_mic_reboot <- ggplot(
  loc_mic_summary,
  aes(x = Category, y = prop_present, fill = reboot_state)
) +
  geom_col(position = "dodge") +
  facet_wrap(~ sensor_type) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_viridis_d(
    option = "inferno",
    name = "Reboot state"
  ) +
  labs(
    title = "Location and microphone access: before vs after reboot",
    subtitle = "Proportion of samples touching location/mic APIs, by category & reboot state",
    x = "Malware category",
    y = "% of samples"
  ) +
  andmal_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_loc_mic_reboot
Dynamic Code Loading Analysis
Dynamic code loading means running code that was not included in the initial APK. It has benefits like smaller app size, but it introduces many security risks and is not recommended by Google: https://developer.android.com/privacy-and-security/risks/dynamic-code-loading I don't have the domain expertise to say why, but dynamic code loading through DexClassLoader APIs varies significantly across malware categories:
library(ggplot2)
library(dplyr)
library(scales)

# Summarise per Category: proportion of samples with any Dex loading ----
dex_summary <- get_combined_data() %>%
  group_by(Category) %>%
  summarise(
    prop_dex_any = mean(dex_any, na.rm = TRUE),
    n_samples = n(),
    .groups = "drop"
  )

# Bar chart: % of apps using Dex loading, by Category ----
p_dex_bar <- ggplot(
  dex_summary,
  aes(x = Category, y = prop_dex_any, fill = Category)
) +
  geom_col() +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_manual(values = rep("red", length(unique(dex_summary$Category)))) +
  labs(
    title = "Dynamic Dex loading by malware category",
    subtitle = "Percentage of samples that use any DexClassLoader / DexFile dynamic loading API",
    x = "Malware category",
    y = "% of samples with dynamic Dex loading"
  ) +
  andmal_theme +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

p_dex_bar