
Uncovering Unknown Threats With Human-Readable Machine Learning

  • Posted on: April 12, 2018 at 5:00 am
  • Posted in: Malware
  • Author: Marco Balduzzi (Senior Threat Researcher)

Dr. Marco Balduzzi, Senior Researcher, Forward-Looking Threat Research Team

Aided by machine learning, we analyzed data on 3 million software downloads from hundreds of thousands of internet-connected machines. In the previous posts of this three-part series, we explored key aspects of software downloads in the wild. We looked into the major domains from which different malware categories were downloaded and discussed which client applications were most often targeted by malware infection. We also looked at code signing abuse and examined certification authorities whose certificates were used to sign malicious code. In this blog post, we discuss how we developed a human-readable machine learning system that can determine whether a downloaded file is benign or malicious.

The development of this actionable intelligent system stemmed from the question: How can we make our knowledge about global software download events actionable? More specifically, how can we use such information to do a better job at detecting the threats posed by the large amounts of new malicious software circulating on a daily basis?

In this last installment of the series, we answer these questions and summarize what we did with the information we obtained. Our research paper titled Exploring the Long Tail of (Malicious) Software Downloads provides a more comprehensive look into how we gathered and analyzed our software download data.

Exploration: Majority of downloaded files are still unknown

We begin with a simple observation: 83% of the downloads that we observed in the wild were unknown. This means that the downloaded files are undetected, i.e., labeled as neither benign nor malicious.

Keep in mind the following considerations:

  1. This figure is limited to the data set that we used for our research. The first post in this series contains the details.
  2. This figure reflects our best effort in labeling the download data. We made use of internal proprietary systems as well as publicly available services.

An important observation, due to the nature of our data set, is that most of these files have very low prevalence: taken individually, each file is downloaded by only a few machines. One might therefore think that these files are uninteresting and that it is understandable that they remain unknown.

However, if we consider the number of machines instead, we find that 69% of the entire machine population downloaded one or more unknown files. Had these files been malware, hundreds of thousands of machines would have been infected.
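The difference between the two views (per-file prevalence versus per-machine exposure) can be illustrated with a minimal Python sketch. The event tuples, machine IDs, and file names below are invented for illustration and are not taken from the research data set; the function names are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical download events: (machine_id, file_hash, label).
# A label of None marks an unknown file (neither benign nor malicious).
events = [
    ("m1", "f1", "benign"),
    ("m1", "f2", None),
    ("m2", "f3", None),
    ("m3", "f1", "benign"),
    ("m3", "f4", None),
    ("m4", "f1", "benign"),
]


def prevalence(events):
    """Map each file to the number of distinct machines that downloaded it."""
    machines_per_file = defaultdict(set)
    for machine, file_hash, _label in events:
        machines_per_file[file_hash].add(machine)
    return {f: len(ms) for f, ms in machines_per_file.items()}


def machines_exposed_to_unknown(events):
    """Fraction of machines that downloaded at least one unknown file."""
    all_machines = {machine for machine, _f, _label in events}
    exposed = {machine for machine, _f, label in events if label is None}
    return len(exposed) / len(all_machines)


file_prevalence = prevalence(events)
exposure = machines_exposed_to_unknown(events)
print(file_prevalence)
print(f"{exposure:.0%} of machines downloaded an unknown file")
```

In this toy data, every unknown file is downloaded by a single machine, yet three of the four machines downloaded at least one unknown file. Low per-file prevalence therefore does not imply low exposure across the machine population.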

This raises important concerns about the actual effectiveness of large-scale, real-world deployments of malware detection and classification systems, and about their ability to defend internet-connected machines against new threats, especially as it appears that many of these threats remain undetected.

Detection: From observation to automatic detection to reduce unknown files

The goal of our research was to reduce the number of unknown downloads, given their substantial volume.

We did that by condensing the observations drawn from our study into an actionable intelligent system. This system “ingests” these observations (for example, observations on malicious signers) and automatically produces detection rules for each one. These rules are immediately applicable and have very high detection rates, at least according to our experimental results. A rule is a combination of conditions on the available information and looks, for example, like the following:

IF (the file’s signer is “Apps Installer S.L.” AND its downloading process’s signer is “Microsoft Windows” AND the file’s certification authority (CA) is “thawte code signing CA – g2”) → MALICIOUS

The pieces of information consumed by the system, i.e., features, are as follows:

  • Signer, CA, and packer of the downloaded file
  • Signer, CA, and packer of the downloading process
  • Class of the downloading process (browser, Windows, Java, etc.)
  • Popularity of the download domain
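To make the shape of such a rule concrete, the example rule and the feature list above can be sketched in Python as follows. This is a minimal illustration and not the system described in this post: the class names, field names, and the `classify` helper are assumptions introduced here, and only the three values in the example rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Download:
    """One download event; field names are hypothetical stand-ins
    for the features listed above."""
    file_signer: Optional[str] = None
    file_ca: Optional[str] = None
    file_packer: Optional[str] = None
    process_signer: Optional[str] = None
    process_ca: Optional[str] = None
    process_packer: Optional[str] = None
    process_class: Optional[str] = None      # e.g. "browser", "Windows", "Java"
    domain_popularity: Optional[str] = None


@dataclass
class Rule:
    """A conjunction of feature == value conditions that implies a label."""
    conditions: Dict[str, str]
    label: str

    def matches(self, download: Download) -> bool:
        return all(getattr(download, name) == value
                   for name, value in self.conditions.items())


def classify(download: Download, rules: List[Rule]) -> str:
    """Return the label of the first matching rule, or "UNKNOWN"."""
    for rule in rules:
        if rule.matches(download):
            return rule.label
    return "UNKNOWN"


# The example rule quoted in the text.
EXAMPLE_RULE = Rule(
    conditions={
        "file_signer": "Apps Installer S.L.",
        "process_signer": "Microsoft Windows",
        "file_ca": "thawte code signing CA – g2",
    },
    label="MALICIOUS",
)

sample = Download(
    file_signer="Apps Installer S.L.",
    file_ca=EXAMPLE_RULE.conditions["file_ca"],
    process_signer="Microsoft Windows",
    process_class="Windows",
)
verdict = classify(sample, [EXAMPLE_RULE])
unmatched = classify(Download(file_signer="Apps Installer S.L."), [EXAMPLE_RULE])
print(verdict, unmatched)
```

Because each rule is a plain conjunction of readable conditions, an analyst can see exactly which signer, CA, or process attribute caused a verdict.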

This system generated 1,500[1] novel detection rules per month, which reduced the number of unknown downloads by 28%.[2] The machines that downloaded these newly classified files amounted to 31%[3] of the total population, so our system proved to be an essential tool in protecting almost a third of all machines from new malware infections.

System details: A human-readable system that keeps false positives at bay

Given the importance machine learning has gained in the security industry, we think it is worth sharing a few words on the internal workings of our system. We designed it with two main goals in mind:

  1. Generating detection rules that are human readable. For us, being able to explain why a certain piece of software is either benign or malicious is important. Customers and users in general are increasingly interested in knowing how they have been targeted, that is, in the context around the infection rather than the infection itself.
    1. Common machine learning algorithms, such as support vector machines (SVMs) and neural networks, suffer from “un-interpretability,” which makes their results difficult to analyze, observe, or understand. To overcome this limitation, we used the PART rule learning algorithm to derive a set of human-readable classification rules based on the features listed above (covering the downloaded software, the downloading process, and the download domain).
  2. Keeping the number of false positives (errors) as low as possible. This aspect is very important in cybersecurity operations, where thousands of unknown and new software downloads (and potential threats) are observed per day.
    1. To do that, we used only a subset of the rules generated by the PART algorithm, namely the rules whose training error rate does not exceed a configurable maximum error threshold τ. For example, for a one-month training window Ttr and by choosing only the rules that have no training error (τ = 0.0%), 1,148 rules out of 1,680 were selected.
    2. The following table reports statistics on the rules extracted during different training windows Ttr:
                                                         Rules composition
    Ttr   Overall no. of rules   τ      Selected rules   No. of benign   No. of malicious
    Feb   1,766                  0.0%   1,020            889             131
    Feb   1,766                  0.1%   1,031            894             137
    Mar   1,680                  0.0%   1,148            970             178
    Mar   1,680                  0.1%   1,162            976             186
    Apr   1,272                  0.0%   1,054            872             182
    Apr   1,272                  0.1%   1,070            875             195
    May   1,476                  0.0%   974              791             183
    May   1,476                  0.1%   986              793             193
    Jun   944                    0.0%   740              577             163
    Jun   944                    0.1%   753              585             168
    Jul   1,376                  0.0%   937              755             182
    Jul   1,376                  0.1%   953              763             190
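The threshold-based rule selection described above can be sketched as a simple filter. This is an illustrative sketch only: the `LearnedRule` class, its fields, and the three sample rules are hypothetical, and the rule learning itself (PART) is not reproduced here. The sketch assumes a rule is kept when its training error rate is at most τ, which is consistent with τ = 0.0% selecting the rules that have no training error.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LearnedRule:
    """A learned rule with its training statistics (hypothetical structure)."""
    description: str
    label: str          # "benign" or "malicious"
    covered: int        # training samples matched by the rule
    misclassified: int  # matched samples whose true label differs

    @property
    def error_rate(self) -> float:
        # A rule that covers nothing is treated as maximally unreliable.
        return self.misclassified / self.covered if self.covered else 1.0


def select_rules(rules: List[LearnedRule], tau: float) -> List[LearnedRule]:
    """Keep only the rules whose training error rate does not exceed tau."""
    return [rule for rule in rules if rule.error_rate <= tau]


# Invented sample rules, for illustration only.
rules = [
    LearnedRule("rule A", "malicious", covered=2000, misclassified=0),
    LearnedRule("rule B", "benign", covered=5000, misclassified=4),   # 0.08% error
    LearnedRule("rule C", "benign", covered=1000, misclassified=30),  # 3% error
]

strict = select_rules(rules, tau=0.0)     # tau = 0.0%
relaxed = select_rules(rules, tau=0.001)  # tau = 0.1%
print([r.description for r in strict])
print([r.description for r in relaxed])
```

Raising τ slightly admits a few more rules at the cost of tolerating some training error, which matches the pattern in the table: in the March window, 1,148 of 1,680 rules (about 68%) are selected at τ = 0.0% and 1,162 at τ = 0.1%.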

[1], [2], [3] Average number, based on seven months’ worth of data.

Through this blog series, we sought to elaborate on our work on software downloads and its potential application to cybersecurity solutions. We started by looking at how malware campaigns are operated – both technically and economically – and how they affect organizations. We also looked at the phenomenon of code signing abuse and how criminals misuse it in the underground. In this concluding piece, we saw how domain expertise can be made actionable as a way of protecting our customers from the threats posed by the large amount of new and undetected malicious software circulating in the wild. Through a system of classification that uses machine learning technology to analyze unknown files, we can determine whether they are benign or malicious in nature. This human-readable machine learning system, as well as other pertinent findings on large-scale global download events, is discussed in more detail in our research paper titled Exploring the Long Tail of (Malicious) Software Downloads.

Behind the scene of malware operators. Insights and countermeasures. CONFidence 2018, Kracow 05.06.2018 from Trend Micro

 

Tags: machine learning, Malware, software downloads, unknown files
