SentinelOne researcher Caleb Fenton was a guest speaker at Cyber Defenders, where he talked about how malware is identified in Android applications using machine learning, and about the behavioral heuristics that feed into that identification.
How is Malware Identified?
Originally, byte-signature heuristics were the only tool available, so malware was identified by its byte signatures. This worked well at first but became less effective over time, so the identification process evolved. First, boolean expressions over a malware sample's identifying traits were compared against normal files. Then behavioral heuristics were added, which analyze a sample to answer the question, “What does the malware do?”
Caleb summarized the identification process as follows:
1. Collect samples
- Need lots of good and bad samples
- Diversity of good and bad is important
2. Engineer features
- Features are classified as Good and Bad — 0 and 1
- Understanding the APK format: Android apps come as APK files, which are just ZIP files and are rich with variety
- Look for other resources (Icons, maps, sounds, etc)
- Offensive & Defensive Android Reverse Engineering
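Because an APK is a ZIP archive, a first pass at feature engineering can be done with nothing more than a ZIP reader. The following is a minimal sketch, not the method shown in the talk: the feature names (`n_dex`, `n_native_libs`, and so on) and the toy archive built by `make_toy_apk` are assumptions chosen for illustration.

```python
import io
import zipfile
from collections import Counter


def apk_features(apk_bytes: bytes) -> dict:
    """Derive a few simple numeric features from an APK's ZIP listing."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
    # Count entries by lowercased extension; names without a dot count as "".
    extensions = Counter(
        name.rsplit(".", 1)[-1].lower() if "." in name else "" for name in names
    )
    return {
        "n_entries": len(names),
        "n_dex": extensions["dex"],
        "n_native_libs": sum(
            1 for n in names if n.startswith("lib/") and n.endswith(".so")
        ),
        "n_images": extensions["png"] + extensions["jpg"] + extensions["webp"],
        "has_manifest": int("AndroidManifest.xml" in names),
    }


def make_toy_apk() -> bytes:
    """Build a tiny in-memory ZIP with APK-like entry names (placeholder content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in (
            "AndroidManifest.xml",
            "classes.dex",
            "res/drawable/icon.png",
            "lib/armeabi-v7a/libnative.so",
        ):
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()


features = apk_features(make_toy_apk())
print(features)
```

For a real file, reading it with `open(path, "rb").read()` and passing the bytes to `apk_features` would produce one row of a feature table.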
3. Explore the data
In order to explore the data we have, we need to ask the following questions:
- Which features don’t vary? — These are probably useless
- Which features correlate to target?
- What about linear combinations in the data?
- What’s the shape of the data?
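The first two questions can be checked mechanically. The sketch below uses made-up toy data and only the standard library; in practice a Pandas DataFrame would probably be the more convenient tool. The feature names and values are assumptions for illustration.

```python
from math import sqrt


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Made-up toy data: one column per feature, one entry per sample.
features = {
    "has_manifest": [1, 1, 1, 1, 1, 1],
    "requests_sms": [0, 0, 1, 1, 1, 0],
    "n_images": [3, 9, 4, 8, 5, 7],
}
labels = [0, 0, 1, 1, 1, 0]  # 0 = good, 1 = bad

# Features that never vary carry no information about the label.
constant = [name for name, col in features.items() if variance(col) == 0]
# Correlation of each feature with the target label.
correlations = {name: pearson(col, labels) for name, col in features.items()}

print("constant features:", constant)
for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
```

Here `has_manifest` is the same for every sample and can be dropped, `requests_sms` tracks the label exactly, and `n_images` is only weakly related to it.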
He went on to say that exploration is a process that never stops and it is essential to dig in and understand the roles. To visualize the data in this phase, it is necessary to consider:
- Each feature is a dimension
- >3 features is hard to visualize
- Need to reduce dimensions
One way to visualize the data is t-distributed Stochastic Neighbor Embedding (t-SNE), which reduces many dimensions down to two or three for plotting. It gives some pretty cool animation effects.
4. Train a Model
It is necessary to draw a decision boundary between the classes of data and different algorithms do this differently. Just remember the following:
- Try as many algorithms as possible — Grid searching
- Try many parameters
- Lots of tuning and adjusting will go into it
- Stack multiple models
- Evaluate the performance
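Grid searching simply means trying every combination of parameter values and keeping the best one. The sketch below shows the idea on a deliberately tiny rule-based classifier. The samples, the two features, and the parameter names `perm_limit` and `string_limit` are all made up for illustration. A real project would more likely use something like scikit-learn's `GridSearchCV`, and would score each candidate on held-out data rather than on the training samples as done here.

```python
from itertools import product

# Made-up samples: (dangerous_permissions, obfuscated_strings), label (1 = bad).
samples = [
    ((1, 0), 0),
    ((2, 1), 0),
    ((3, 0), 0),
    ((6, 1), 1),
    ((2, 5), 1),
    ((7, 6), 1),
]


def predict(x, perm_limit, string_limit):
    """Flag a sample when either feature reaches its limit."""
    return int(x[0] >= perm_limit or x[1] >= string_limit)


def accuracy(params):
    """Fraction of samples classified correctly for one parameter pair."""
    return sum(predict(x, *params) == y for x, y in samples) / len(samples)


# Every (perm_limit, string_limit) combination in the search grid.
grid = list(product(range(1, 9), range(1, 8)))
best_params = max(grid, key=accuracy)  # first pair with the highest accuracy
best_score = accuracy(best_params)

print("best parameters:", best_params, "accuracy:", best_score)
```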
How to Test a Model?
It is necessary that you don’t test and train with the same data, so we have to train on some data and test on the rest. K-fold cross-validation splits the data into k parts and, in each iteration, trains on all but one part and tests on the part that was held out. With the 75%/25% split mentioned here, that corresponds to four folds.
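To make the rotation concrete, here is a minimal sketch of how the train and test indices could be generated for k folds. The helper `k_fold_indices` is hypothetical and written for illustration. It does not shuffle, which real data usually needs, and scikit-learn's `KFold` offers the same idea ready-made.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# With k=4, each iteration trains on 75% of the samples and tests on 25%.
folds = list(k_fold_indices(n_samples=8, k=4))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test set exactly once, so each one contributes to the final performance estimate without ever being tested by a model that trained on it.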
He also introduced the concepts of Precision and Recall along with an introduction to the models — Naive Bayes and SVM.
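Precision asks "of everything flagged as malware, how much really was malware?", while recall asks "of all the real malware, how much was flagged?". A small sketch with made-up labels, assuming 1 means bad and 0 means good:

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth and predictions for eight samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the model flags three samples and two of them are truly bad (precision 2/3), but it catches only two of the four bad samples (recall 1/2).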
5. Repeat
Finally, we repeat the first four steps until we achieve accurate results for our model and are able to correctly identify malware.
What is the Modern Approach?
The modern approach to identifying malware is based on:
- Reputation (file hash)
- Static heuristics
- Static AI Model
- Behavioral heuristics
- Behavioral AI model
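The first layer, reputation, is the simplest to picture: hash the file and look the hash up in a list of known-bad files. The sketch below assumes a hypothetical in-memory set, `KNOWN_BAD_HASHES`, standing in for whatever reputation service a real product would query.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical reputation list, seeded with one placeholder "malicious" file.
known_bad_payload = b"pretend this is a known malicious APK"
KNOWN_BAD_HASHES = {sha256_of(known_bad_payload)}


def reputation_verdict(data: bytes) -> str:
    """Return 'known-bad' on an exact hash match, otherwise 'unknown'."""
    return "known-bad" if sha256_of(data) in KNOWN_BAD_HASHES else "unknown"


print(reputation_verdict(known_bad_payload))          # exact copy
print(reputation_verdict(known_bad_payload + b"!"))   # one byte appended
```

Appending a single byte changes the hash completely, so the modified file is no longer recognized. That weakness is why the other layers in the list (static and behavioral heuristics and models) sit behind the reputation check.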
Why Use Python?
Highlighting the importance of Python for cybersecurity, Caleb mentioned how simple the language is and how it comes with many libraries that make processing data easy. Some of the tools he highlighted were:
- Sklearn — widely used in machine learning (has great documentation)
- Jupyter notebooks — great for experimentation
- Pandas — used for managing data
- Xgboost — great Gradient Boost library (GBT)
How Did Caleb Become a Security Engineer?
Among other things, Caleb spoke about how he became a security researcher: “the geek route; loving to code, working on a PHP site and not going to my high school graduation”. He explained how important it is to build a feedback mechanism by learning skills and submitting content online for evaluation, which helps you find the areas where you need to improve.
How Are Anti-Viruses Keeping Up?
As a researcher himself, Caleb said that customers care more about malware detection. Vendors don’t even know what they want most of the time; what they mostly care about is having a slick website where they can manage their web-points. He thinks the way for anti-viruses to keep up is dynamic detection, since it is now a lot easier to describe how malware behaves. With these definitions, a sophisticated dynamic engine that watches every API call can easily tell if, for example, a PDF downloads an exe file. The engine keeps track of all these events and builds up the whole story. Once a process crosses a threshold, the engine can trace everything it did and flush it out of the system.
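The "story plus threshold" idea can be sketched in a few lines. Everything in this example is an assumption made for illustration: the action names, the weights in `RULE_WEIGHTS`, and the value of `THRESHOLD` are invented and do not describe any real product's engine.

```python
from dataclasses import dataclass, field

# Hypothetical weights: how suspicious an action is for a given process type.
RULE_WEIGHTS = {
    ("pdf_reader", "download_exe"): 5,
    ("pdf_reader", "spawn_process"): 4,
    ("any", "modify_autorun"): 3,
}
THRESHOLD = 8  # hypothetical score at which the process is treated as malicious


@dataclass
class ProcessStory:
    """Accumulates the observed actions and suspicion score of one process."""

    process: str
    score: int = 0
    events: list = field(default_factory=list)

    def observe(self, action: str) -> bool:
        """Record an action; return True once the score reaches the threshold."""
        self.events.append(action)
        self.score += RULE_WEIGHTS.get((self.process, action), 0)
        self.score += RULE_WEIGHTS.get(("any", action), 0)
        return self.score >= THRESHOLD


story = ProcessStory("pdf_reader")
verdicts = [story.observe(a) for a in ("open_file", "download_exe", "spawn_process")]
print(verdicts, story.score, story.events)
```

No single action is enough to convict the process here. It is the accumulated story that crosses the threshold, and the recorded `events` list is what would let an engine undo everything the process did.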
Caleb Developed Simplify — What is it?
It is a virtual machine (around 100k lines of code) that takes exe files, runs them inside its own VM, and keeps track of everything that happens. It takes both paths at every decision point and keeps track of every possible action. It also handles pattern-based obfuscation and offers the easiest way to de-obfuscate.
Understanding Machine Learning Algorithms — The Caleb Way!
Caleb says that his general approach is to master things serially. You pick one or two things, go very deep and learn the ins and outs of them, then move on to something else. He emphasizes, “Don’t start off with using 20 different things. Start with one simple thing and learn more about it and then get further with other applications”.
Solid Pieces of Advice!
Caleb answered all the questions the students had, especially those about the field of cybersecurity, how to showcase skills and accomplishments, and how to keep pace with a growing industry. A few of his answers are highlighted here:
“I definitely recommend writing a blog — you may have only 2 paragraphs worth of material to write, but as you get into it, you might develop more ideas that you haven’t even thought about previously.”
“Building your GitHub profile is very important — I wanna know if you can code. Interview process is also important — Getting to know you and if you’re eager to learn”
The most solid of all and something that everybody needs to work on from day one is:
“Always be engaged to work on something interesting. Don’t ever be discouraged by those around you that you feel know more than you”.
Learning Resources
Pointing out how many learning resources are already available on the internet, Caleb shared the following resources that can help us learn a lot more:
- Ember malware code + data
- Malware classification contest
- Kaggle blog
- HackerRank — AI challenges
- Build an Antivirus in 5 mins
Datasets to test and experiment
A few malware datasets that were highlighted for practicing what we learn are:
- Beta.virusbay.io
- Koodous.com
- Sandroid.xjty.edu.cn:8080
What is Cyber Defenders?
The Cyber Defenders student program is an 8 to 10-week paid summer internship. In 2017 it is expanding to the South Bay through a partnership with San Jose Evergreen Community College District’s Silicon Valley Engineering Technology Pathways grant, which includes nine community college partners. The program offers a compelling introduction to the field of cybersecurity and prepares students with practical skills to be workforce ready, providing a viable pipeline for your organization.
As a cohort group, during the internship program students receive practical experience in computer systems, network operations, computer security, information protection and cyber policy. The program consists of projects, classes, seminars, presentations, meetings, a poster session, a cyber policy debate and a capture the flag team challenge.