I love talking with people who created products and services I admire. Whether such encounters occur in person or online, I appreciate the opportunity to put a human face to the technology that represents their work. With this in mind, I reached out to Jan Miller, who has been working on an automated malware analysis tool called VxStream Sandbox, which is available for free at Hybrid-Analysis.com. Who is he and why did he embark upon such a challenging endeavor? What's special about the "hybrid" technology he created? I asked Jan a few questions to find out.
Update: This interview was conducted and published in February 2015, at the onset of Jan's journey that led to his company, Payload Security, being acquired by CrowdStrike in November 2017. You can read about this milestone in my follow-up interview with Jan. This original post was updated in November 2017 to mention the acquisition and also to fix a few broken links.
What's your background, Jan? How did you find yourself with the skills and desire to create a sophisticated malware analysis sandbox?
My dad is a software developer and made me write my first computer program in C on a Macintosh Performa 6300. That's when it all started and I kept on writing low-level code all throughout my youth, without wanting to go into the details of the dark ages. After high school I studied computer science at the University of Rostock (Germany) for five years and ended up writing a diploma thesis on static binary analysis, reverse engineering techniques and functional similarity of executable files in the context of malware.
I've always had a strong focus on binary analysis, especially taking a file apart and detecting interesting code sequences. I had a brief detour at a large software development company, where I learned high-level languages and database-related things. I ended up switching into the malware forensics area and became the leading software developer at a known sandbox firm, adding my knowledge of static analysis and reverse engineering. That's when I was able to combine my love for "automatic reverse engineering" with something useful, namely malware forensics. Writing a piece of software that can analyze another piece of software and make an intelligent decision for me is the holy grail of computer science, similar to the halting problem defined by Alan Turing.
Today, I live with my family near Frankfurt and try to find a good mix between work and taking walks in nature with my wife and little girl.
Why are you creating a new malware analysis sandbox, given that we already have several free and commercial options available?
I love what I do and believe combining static and dynamic analysis is inevitably going to be the next step in sandbox system evolution, if automatic systems are to tackle increasingly complex malware. Pure dynamic analysis (runtime behavior observation) is not enough these days, as malware is becoming more "environment aware" and detects virtual environments. Often, the real payload is never executed during the analysis, because it is triggered only through time bombs or other mechanisms.
That is why modern sandbox systems need to analyze the entire process memory using static analysis, as well. Combining static with dynamic analysis is what I call "Hybrid Analysis"—because a mix of both technologies is implemented. For example, the static analysis implementation uses runtime information to associate memory address values with symbols or perform a better data flow analysis. In the long run, I believe a technology like Hybrid Analysis is going to be a basic requirement for any sandbox system. So far I know of only one other sandbox system implementing a similar technology out of the box, so I am sure there is room for VxStream Sandbox in the industry. Of course, each automated malware analysis system will have its advantages and disadvantages, but it is always good to have an alternative.
What benefits do you seek to provide to malware analysts by building VxStream Sandbox?
Hybrid Analysis is a technology implemented by VxStream Sandbox. Due to its nature, it allows extraction of a lot more artifacts/indicators (such as API calls or strings that are concatenated on the stack) than systems implementing only runtime behavior analysis. This approach adds a lot to the possible behavior signatures, which benefits any automatic rule-based system.
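To make the "strings that are concatenated on the stack" indicator concrete, here is a minimal Python sketch. It is not how VxStream Sandbox works internally; it only assumes that an analysis pass has already collected single-byte writes to stack offsets (the hypothetical `writes` list) and shows how such writes could be reassembled into a string.

```python
def reconstruct_stack_string(writes):
    """Rebuild a string from (stack_offset, byte_value) writes.

    `writes` is a hypothetical input: one tuple per single-byte store
    to the stack, in the order the stores appear in the code.
    """
    buffer = {}
    for offset, value in writes:
        buffer[offset] = value  # a later write to the same offset wins

    if not buffer:
        return ""

    chars = []
    offset = min(buffer)
    # Walk contiguous offsets until a gap or a NUL terminator is reached.
    while offset in buffer and buffer[offset] != 0:
        chars.append(chr(buffer[offset]))
        offset += 1
    return "".join(chars)


# The stores may appear in any order in the code.
example_writes = [(2, ord("d")), (0, ord("c")), (1, ord("m")), (3, 0)]
recovered = reconstruct_stack_string(example_writes)
```

A plain string scan over the file would not find this string, because it never exists in one piece until the code has run or been analyzed.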
In addition, VxStream Sandbox makes memory dumps of the analyzed processes, from which it extracts annotated disassembly listings using hybrid analysis technology. It sorts them by relevance (e.g. based on the number of API calls, string references or matched signatures), giving an analyst a quick way to find potentially interesting entry points for a subsequent manual analysis. For example, an analyst could reconstruct PE files based on a memory dump frame and load the reconstructed file in IDA Pro, including the automatically detected API call/string annotations, using an IDA Pro plugin.
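As a rough illustration of sorting listings by relevance, the Python sketch below scores each listing from the three criteria Jan mentions. The `Listing` type, the weights and the scoring formula are assumptions made for this example; the interview does not describe the actual ranking used by VxStream Sandbox.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical summary of one disassembly listing."""
    name: str
    api_calls: int
    string_refs: int
    matched_signatures: int


# Assumed weights: a matched signature counts more than a single API call.
WEIGHTS = {"api_calls": 1.0, "string_refs": 0.5, "matched_signatures": 5.0}


def relevance(listing: Listing) -> float:
    """Weighted sum of the three indicators named in the interview."""
    return (
        WEIGHTS["api_calls"] * listing.api_calls
        + WEIGHTS["string_refs"] * listing.string_refs
        + WEIGHTS["matched_signatures"] * listing.matched_signatures
    )


def rank_listings(listings):
    """Return listings sorted from most to least relevant."""
    return sorted(listings, key=relevance, reverse=True)


listings = [
    Listing("frame_a", api_calls=10, string_refs=4, matched_signatures=0),
    Listing("frame_b", api_calls=2, string_refs=2, matched_signatures=3),
    Listing("frame_c", api_calls=0, string_refs=1, matched_signatures=0),
]
ranked_names = [listing.name for listing in rank_listings(listings)]
```

With these assumed weights, a listing with few API calls but several matched signatures ranks above one that merely calls many APIs.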
I plan on adding bridging capabilities to import results from the analysis into sophisticated debuggers in the future, as all the data is at hand already. In general, an analyst can both stop analysis at the report level and use the extracted data to generate a YARA or STIX signature—or use the report as a helping hand when diving into the depths of the malware as part of a manual analysis. You can read about the VxStream Sandbox architecture on our site.
What are some of the challenges you've encountered when creating VxStream Sandbox, technical and perhaps non-technical ones?
The biggest challenge was probably writing the static analysis engine. Automatically distinguishing between code and data has been very difficult, because traversing code recursively will never reveal all code locations. Plus, there are all kinds of tricks malware authors implement to hinder analysis, such as fake conditional jumps, trash instructions, self-modifying code, and so on. Also, integrating runtime symbol information and implementing a basic data flow analysis was a challenging task, because the sandbox system might need to process several hundred megabytes of memory dump files.
The biggest non-technical challenge is getting out there and establishing a name, especially when the market is full of companies making promises of having high-end APT detecting tools (which is often not the case, see this "Anti-VM" benchmark blogpost). That is why we tried to create a sandbox system that is as open as possible and allows users to write their own signatures or "action scripts" that are executed during the analysis to simulate human behavior.
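The limitation of recursive traversal can be shown on a toy example. The Python sketch below uses an invented instruction set (`mov`, `jz`, `jmp`, `call`, `jmp_reg`, `ret`) and a made-up program; it is not a real disassembler and not the VxStream engine. It follows only statically known control flow, so code that is reachable solely through an indirect jump stays undiscovered unless a target observed at runtime is supplied as an extra starting point.

```python
# Toy program: address -> (mnemonic, direct target or None).
PROGRAM = {
    0: ("mov", None),
    1: ("jz", 4),          # conditional jump: may go to 4 or fall through
    2: ("call", 6),        # direct call: continues at 3 after returning
    3: ("jmp_reg", None),  # indirect jump: target known only at runtime
    4: ("mov", None),
    5: ("ret", None),
    6: ("mov", None),
    7: ("ret", None),
    8: ("mov", None),      # reached only through the indirect jump at 3
    9: ("ret", None),
}


def recursive_traversal(program, entry, extra_targets=()):
    """Return the set of addresses reachable via known control flow."""
    seen = set()
    worklist = [entry, *extra_targets]
    while worklist:
        addr = worklist.pop()
        if addr in seen or addr not in program:
            continue
        seen.add(addr)
        mnemonic, target = program[addr]
        if mnemonic in ("ret", "jmp_reg"):
            continue  # no statically known successor
        if mnemonic == "jmp":
            worklist.append(target)  # unconditional: no fall-through
            continue
        if mnemonic in ("jz", "call"):
            worklist.append(target)
        worklist.append(addr + 1)  # fall through to the next instruction
    return seen


static_only = recursive_traversal(PROGRAM, entry=0)
# Assumption: a dynamic run observed the indirect jump landing at 8.
with_runtime_hint = recursive_traversal(PROGRAM, entry=0, extra_targets=[8])
missed = sorted(set(PROGRAM) - static_only)
```

In this toy case, purely static traversal misses addresses 8 and 9, while adding the single runtime-observed target covers the whole program.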
How should the toolkits that malware analysts use evolve to keep up with the evolution of malware that we discover on systems nowadays?
I think the trend is going in the direction of having automated solutions that are able to extract as many artifacts/indicators as possible. For example, one of our customers creates YARA rules that are based purely on runtime process memory analysis. Such an approach is necessary because the total number of malware samples is growing rapidly and pure static binary signature algorithms have failed completely. Encrypted payloads and packers killed the classic pattern scanners. Sophisticated environment checks killed emulation, so now we are using virtualized environments, but performance and scalability are becoming a concern.
Either way, being able to detect malware "in the wild" should be the ultimate challenge for the protective market. I believe automated systems will move away from being used purely in the forensics niche and will be used in backend systems of the protective market. For example, there is the latest trend towards automatic YARA rule generation, but we need intelligent algorithms and as many indicators as possible in order to generate useful rules. Overall, I think the days of simple algorithms are coming to an end and we need to step up the intelligence of our algorithms. Hybrid Analysis is one step in this direction.
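For readers unfamiliar with what automatic YARA rule generation produces, here is a deliberately naive Python sketch. It turns a list of extracted strings into the text of a YARA rule that fires when a minimum number of them match. The function names, the rule name and the sample indicators are made up for illustration, and this is exactly the kind of simple algorithm Jan argues is not sufficient on its own: it does no filtering of strings that also occur in benign software.

```python
def escape_yara_string(value: str) -> str:
    """Escape backslashes and double quotes for a YARA text string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_yara_rule(rule_name, indicators, min_matches):
    """Build YARA rule text that matches `min_matches` of the indicators."""
    unique = sorted(set(indicators))
    if not unique:
        raise ValueError("at least one indicator is required")
    # Keep the threshold within the number of available strings.
    min_matches = max(1, min(min_matches, len(unique)))

    lines = [f"rule {rule_name}", "{", "    strings:"]
    for index, value in enumerate(unique):
        escaped = escape_yara_string(value)
        lines.append(f'        $s{index} = "{escaped}" ascii wide')
    lines += ["    condition:", f"        {min_matches} of them", "}"]
    return "\n".join(lines)


# Hypothetical indicators extracted from a process memory dump.
rule_text = build_yara_rule(
    "sample_memory_strings",
    ["cmd.exe /c", "C:\\Temp\\x.dll", "cmd.exe /c"],
    min_matches=2,
)
```

Printing `rule_text` gives a rule with a `strings:` section holding the two unique indicators and the condition `2 of them`.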
Right now using VxStream Sandbox as part of Hybrid-Analysis.com is free. Will this be the case in the future? Do you have commercial plans for your product?
Our online malware analysis service will always be free to use. If you want to license the sandbox system to install your own instance, please get in touch with us. The full version not only includes the web application, but also a load balancing controller to scale analysis, report formats in XML/JSON, more complete single-file HTML reports, the ability to create/edit your own signatures and the ability to configure many other aspects of the system, such as interfacing with VirusTotal.