Learning Assembly Language Models for Security Applications
- Li, Xuezixiang
- Advisor(s): Yin, Heng
Abstract
Deep learning has proven its effectiveness in a wide range of binary analysis tasks, such as function boundary detection, binary code search, function prototype inference, and value set analysis. Deep learning-based assembly language models, in particular, have garnered significant attention and delivered promising results. This dissertation presents approaches and evaluations for training assembly language models tailored to diverse security applications.
First, we introduce DeepBinDiff, an unsupervised technique for program-wide code representation learning. This approach leverages both code semantics and program-wide control flow information to generate basic block embeddings for fine-grained binary diffing.
Second, to enhance instruction representation and provide additional support for approaches like DeepBinDiff, we introduce PalmTree, a language model designed to generate general-purpose, high-quality instruction embeddings that improve downstream deep learning models across various binary analysis applications.
Third, as more transformer-based assembly language models have been proposed for different downstream security applications, each featuring unique architectural modifications and novel pre-training tasks, we undertake a comprehensive evaluation of transformer-based models, including PalmTree, and their pre-training tasks in the context of four distinct security applications.
Lastly, based on the insights gained from our evaluations, we outline our forthcoming work, which focuses on a novel zero-shot learning approach for obfuscation detection. By reusing the pre-training task of the assembly language model, we expect it to deliver performance comparable to supervised learning approaches.