For some time now, the Portable Document Format (PDF) standard has posed a considerable risk to corporate as well as private information security. Some work has been done on classifying PDF documents as malicious or benign, but not as much on clustering the malicious documents by the techniques used. Such clustering would provide insight, in automated analysis, into how sophisticated an attack is and who staged it. A dataset of 100,000 unique PDF documents was supplied by the Shadowserver Foundation. Analysis of the experiment results showed that 97% of the documents contained JavaScript. This and other sources revealed that most exploits are delivered through such, or similar, object types. Based on that, JavaScript object labeling gets a thorough focus in this paper.

The scope of the paper is limited to extending the attribution research already done on PDF documents, so that a feature vector may be used to label a given PDF (or a batch of them) into a relevant cluster, as an attempt to recognize different techniques and threat agents.

> JavaScript is currently one of the most exploited PDF objects. How can the PDF feature vector be extended to include a JavaScript subvector correctly describing the technique/style, sophistication and similarity to previous malicious PDF documents? How does it relate to the term digital evidence?
>
> — Problem statement

The problem statement considers the coding styles and obfuscation techniques used, and the related sophistication of the coding style. Last, but most important, the statement involves how the current PDF document measures against others previously labeled. These are all essential problems when it comes to automated data mining and clustering.

### A. Related Work

Proposed solutions for malicious versus benign classification of PDF documents have been explicitly documented in several papers. Classification using support vector machines (SVM) was handled by Jarle Kittilsen in his recent Master's thesis1.

Further, the author of this paper, in his Bachelor's thesis2, investigated the possibility of detecting obfuscated malware by analyzing HTTP traffic known to contain malware. The findings were designed, implemented and tested in Snort. Some of the detection techniques will be used as a foundation for the labeling in this paper.

Even though much good work has been done in the area of analyzing malicious PDF documents, many of the resulting tools are based on manual analysis. Worth mentioning is Didier Stevens, who developed several practical tools such as pdf-parser and PDFiD. These are not only tools, but were also the beginning of a structured way of looking at suspicious objects in PDF documents. Credit is also due to Paul Baccas at Sophos, who did considerable work on characterizing malicious versus benign PDF documents3.

This paper researches the JavaScript feature subvector of malicious PDF documents. To be able to determine an effective vector (in this experimental phase), it is essential that the dataset is filtered, meaning that the files must be malicious. As Kittilsen has done for PDF documents, Al-Tharwa et al.2 have done interesting work on detecting malicious JavaScript in browsers.

## Background

### A.1. The Feature Vector in Support of Digital Evidence

Carrier and Spafford defined "digital evidence" as any digital data that contains reliable information that supports or refutes a hypothesis about the incident7. Formally, the investigation process consists of five phases and is specially crafted for maintaining evidence integrity, the order of volatility (OOV) and the chain of custody. This all leads up to the term forensic soundness.

![Fig. 1: The investigation process, consisting of five phases9. Note the identification and analysis phases](/images/2015/02/Theinvestigationprocess-e1380485641223.png)

In this paper, forensic soundness is a notion previously defined10 as meaning: no alteration of source data has occurred. Traditionally this means that every bit of data is copied and no data is added. The previous paper stated two elementary questions:

* Can one trust the host where the data is collected from?
* Does the information correlate to other data?

When it comes to malicious documents, they are typically collected at two points:

1. In the security monitoring logs, the pre-event phase
2. When an incident has occurred, as part of the reaction to the incident (the collection phase)

Now, the ten-thousand-dollar question: when a malicious document gets executed on a computer, how is it possible to get indications that alteration of evidence has occurred? The answer lies potentially at the first collection point, the pre-event logging.

In many cases, especially considering targeted attacks, it is not possible to label a PDF document as malicious in the pre-event phase. The reason for this is often the way the threat agent crafts his attack to evade the security mechanisms of the target using collected intelligence. Most systems, in accordance with local legislation, should then delete the content data. A proposition, though, is to store the feature vector.

The reasoning behind storing a feature vector is quite simple: when storing hashes, object counts and the JavaScript subvector (which we will return to later in the paper), it will be possible to indicate whether the document features have changed. On the other side, there is no identifiable data invading privacy.

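As an illustration, consider a minimal sketch of such a stored record, assuming a hash plus naive object counts; the field names and the set of counted keywords are illustrative, not a format defined in this paper:

```python
import hashlib

def pdf_feature_record(path):
    """Sketch: a privacy-preserving record of a PDF document --
    a hash plus object counts, but none of the document's content."""
    data = open(path, "rb").read()
    # Naive byte search; a real parser should resolve filters and
    # object streams before counting.
    keywords = [b"/JavaScript", b"/JS", b"/OpenAction", b"/RichMedia", b"/Encrypt"]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "object_counts": {k.decode(): data.count(k) for k in keywords},
    }
```
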
It is reasonable to argue that the measure of how similar one PDF document is to another is also a measure of how forensically sound the evidence collected in the post-event phase is. How likely it is that the document acquired in the collection phase is the same as the one from the pre-event phase is decided by the characteristics supplied by the feature vectors of both. Hence, the feature vector should be as rich and relevant as possible.

![Fig. 2: Correlation by using the feature vector of the PDF document. Illustration of a possible pre/post incident scenario](/images/2015/02/Preandpost.png)

### A.2. Identification as an Extension of Similarity

The notion of similarity largely relates to the feature vector: how is it possible, in large quantities of data, to tell whether a new PDF document carries characteristics similar to others in a larger dataset?

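One way to make that question concrete, as a minimal sketch: represent each document as a numeric feature vector and use a distance function, here Euclidean distance, as the measure; the vectors below are hypothetical:

```python
import math

def distance(u, v):
    """Euclidean distance between equal-length feature vectors;
    a smaller distance means more similar documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical [function, replace, substr, eval] count vectors:
print(distance([0, 1, 3663, 0], [0, 1, 3701, 0]))  # small: likely related
print(distance([0, 1, 3663, 0], [9, 0, 4, 1]))     # large: likely unrelated
```
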
In his work on semantic similarity and similarity-preserving hashing, M. Pittalis11 defined similarity from the Merriam-Webster dictionary:

> Similarity: The existence of comparable aspects between two elements
>
> – Merriam-Webster Dictionary

The measure of similarity is important for clustering or grouping the documents. When clustering datasets, the procedure usually follows six steps; finding the similarity measure is step 2:

1. Feature selection
2. Proximity/similarity measure
3. Clustering criterion
4. Clustering algorithm
5. Validation
6. Interpretation

In this paper the k-means unsupervised clustering algorithm was considered. This simple algorithm groups n observations into k clusters22, each observation belonging to the cluster with the nearest mean, as sketched below.

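A minimal sketch of the idea, assuming scikit-learn and a few hypothetical feature vectors (the paper's actual experiment, described later, used WEKA):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical [max_string, stringcount, function, replace, substr, eval] vectors:
X = np.array([
    [10849, 2, 0, 0, 1, 0],
    [67, 1, 0, 0, 3663, 0],
    [120, 14, 9, 1, 4, 1],
    [90, 12, 8, 1, 2, 1],
])

# Each observation is assigned to the cluster with the nearest mean.
# Note: features with large ranges (max_string) dominate Euclidean
# distance, so feature scaling is advisable in practice.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```
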
Now, as will be seen over the next two sections, work done on the subject mostly misses out on giving a valid similarity measure when it comes to classifying PDF documents as anything other than malicious or benign. So, to be able to cluster the PDF documents, the feature vector will need a revision.

As Pittalis introduced the concept of similarity, it is important to define one more term: identification. According to the American Heritage Dictionary, identification is:

> Proof or Evidence of Identity.
>
> — The American Heritage Dictionary

In our context this means being able to identify a PDF document and attribute it to, e.g., a certain type of botnet, or, perhaps more correctly, to a coding or obfuscation technique. In an ideal state this will give an indication of which threat agent is behind the attack. This is something that has not been researched extensively for PDF documents before.

### C. The Portable Document Format

When it comes to the feature vector of the Portable Document Format (PDF), it is reasonable to have a look at how PDF documents are structured. A PDF consists of objects, and each object is of a certain type. As much research has been done on the topic previously, the format itself will not be treated any further in this paper12.

![A simplified illustration of the portable document format](/images/2015/02/ObjectdescriptionPDF-2.png)

When considering malicious PDF documents, relevant statistics have shown the following distribution of resource objects:

**Known Malicious Dataset Objects.** A table showing a number of interesting, selected features in malicious versus clean PDF documents. Baccas used two datasets, where one indicated slightly different results.

| Dataset | Object Type | Clean (%) | Malicious (%) |
| --- | --- | --- | --- |
| The Shadowserver 100k PDF malicious dataset | /JavaScript | NA | 97% |
| Paul Baccas' Sophos 130k malicious/benign dataset3 | /JavaScript | 2% | 94% |
| | /RichMedia | 0% | 0.26% |
| | /FlateDecode | 89% | 77% |
| | /Encrypt | 0.91% | 10.81% |

What can be seen from the table above is that, when it comes to the distribution of objects in malicious files, most of them contain JavaScript. This makes it very hard to distinguish and find the similarity between the documents without considering a JavaScript subvector. The author would argue that this makes a JavaScript subvector a requirement for the PDF feature vector to be valid. In previous work, where the aim has been to distinguish between malicious and benign, this has not been an issue.

### D. Closing in on the Core: The PDF Javascript Feature Subvector

JavaScript is a client-side scripting language primarily offering greater interactivity with web pages. Specifically, JavaScript is not a compiled language; it is weakly typed4 and has first-class functions5. In terms of rapid development, these features give great advantages; from a security perspective they are problematic. The following is a Snort signature to detect a JavaScript "unescape" obfuscation technique2 (we will return to the concept of obfuscation later on):

```
alert tcp any any -> any any (msg:"Obfuscated unescape"; sid:1337003; content:"replace"; pcre:"/u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e'?.replace(/"; rev:4;)
```

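The same pattern can be applied outside Snort; as a sketch, in Python, against extracted script text (the test strings are contrived):

```python
import re

# "unescape" with up to two arbitrary characters allowed between the
# letters, so simple string splitting no longer hides the name.
UNESCAPE = re.compile(r"u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e")

print(bool(UNESCAPE.search("unescape('%u9090')")))  # True: plain call
print(bool(UNESCAPE.search("u;n;e;s;c;a;p;e(x)")))  # True: split-up name
print(bool(UNESCAPE.search("escape(unrelated)")))   # False: no match
```
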
Traditionally, JavaScript is integrated as part of a browser. Seen from a security perspective, this opens up for what is commonly known as client-side attacks. More formally: JavaScript enables programmatic access to computational objects within a host environment. This is complicated by the fact that JavaScript comes in different flavors, making general parsing and evaluation complex6, as may be seen from the above signature. The flavors are often specific to the application. Today, most browsers are becoming more aligned due to the requirements of interoperability. Some applications, such as the widely deployed Adobe Reader, have some extended functionality though, which we will be focusing on in this paper.

Even though JavaScript may pose challenges to security, it is important to realize that this is due to complexity. JavaScript (which is implemented through SpiderMonkey in Mozilla18 products, and in Adobe Reader as well) builds on a standard named ECMA-262. ECMA is a standardization organization for Information and Communication Technology (ICT) and Consumer Electronics (CE)17. Thus, JavaScript is built from the ECMAScript scripting language standard. To fully understand which functions are essential in regard to malicious JavaScript, this paper will rely on the ECMAScript Language Specification19 combined with expert knowledge.

### E. Introducing Obfuscation

Harawa et al.8 describe JavaScript obfuscation by six elements:

* Identifier reassignment or randomization
* Block randomization
* White space and comment randomization
* String encoding
* String splitting
* Integer obfuscation

Further, Kittilsen1 documented a JavaScript feature vector which states the following functions as potentially malicious: [function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]. Even though his confusion matrix shows good results, there are some problems when it comes to evaluating these as-is: such characters are usually obfuscated. The following is an example from sample ``SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``:

```
if((String+'').substr(1,4)==='unct'){e="".indexOf;}c='var _l1="4c206f5783eb9d;pnwAy()utio{.VsSg',h<+I}*/DkR%x-W[]mCj^?:LBKQYEUqFM';l='l';e=e()[((2+3)?'e'+'v':"")+"a"+l];s=[];a='pus'+'h';z=c's'+"ubstr" [1];sa [2];z=c's'+"ubstr" [3];sa [2];z=c['s'+"ubstr"] [...]e(s.join(""));}
```

The above example tells an interesting story about the attacker's awareness of complexity. With respect to Kittilsen's JavaScript feature vector, the above would yield the following result: [0,x,x,x,0,0,0,0] (considerable results on the second to fourth, plus one count if we shorten substring to substr); in other words, the features are to be found in the embedded, obfuscated JavaScript, but not in clear text. When it comes to eval_length, max_string and string_count, we will return to those later in the paper.

Deobfuscated, the script would look like:

```
var _l1="[...]";_l3=app;_l4=new Array();function _l5(){var _l6=_l3.viewerVersion.toString();_l6=_l6.replace('.','');while(_l6.length&4)_l6l='0';return parsetnt(_l6,10);function _l7(_l8,_l9){while(_l8.length+2&_l9)_l8l=_l8;return _l8.substring(0,_l9I2);function _t0(_t1){_t1=unescape(_t1);rote}a*=_t1.length+2;da*/ote=unescape('Du9090');spray=_l7(da*/ote,0k2000Rrote}a*);lok%hee=_t1lspray;lok%hee=_l7(lok%hee,524098);for(i=0; i & 400; ill)_l4xi-=lok%hee.substr(0,lok%hee.lengthR1)lda*/ote;;function _t2(_t1,len){while(_t1.length&len)_t1l=_t1;return _t1.substring(0,len);function _t3(_t1){ret='';for(i=0;i&_t1.length;il=2){b=_t1.substr(i,2);c=parsetnt(b,16);retl=String.froW[har[ode(c);;return ret;function _]i1(_t1,_t4){_t5='';for(_t6=0;_t6&_t1.length;_t6ll){_l9=_t4.length;_t7=_t1.char[odeAt(_t6);_t8=_t4.char[odeAt(_t6D_l9);_t5l=String.froW[har[ode(_t7m_t8);;return _t5;function _t9(_t6){_]0=_t6.toString(16);_]1=_]0.length;_t5=(_]1D2)C'0'l_]0j_]0;return _t5;function _]2(_t1){_t5='';for(_t6=0;_t6&_t1.length;_t6l=2){_t5l='Du';_t5l=_t9(_t1.char[odeAt(_t6l1));_t5l=_t9(_t1.char[odeAt(_t6));return _t5;function _]3(){_]4=_l5();if(_]4&9000){_]5='oluAS]ggg*pu^4?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAAfhaASiAgBA98Kt?:';_]6=_l1;_]7=_t3(_]6);else{_]5='*?lAS]iLhKp9fo?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAABk[ASiAgBAIfK4?:';_]6=_l2;_]7=_t3(_]6);_]8='SQ*YA}ggAA??';_]9=_t2('LQE?',10984);_ll0='LLcAAAK}AAKAAAAwtAAAALK}AAKAAAA?AAAAAwK}AAKAAAA?AAAA?gK}AAKAAAA?AAAAKLKKAAKAAAAtAAAAEwKKAAKAAAAwtAAAQAK}AUwAAA[StAAAAAAAAAAU}A]IIIII';_ll1=_]8l_]9l_ll0l_]5;_ll2=_]i1(_]7,'');if(_ll2.lengthD2)_ll2l=unescape('D00');_ll3=_]2(_ll2);with({*j_ll3;)_t0(*);Ywe123.rawValue=_ll1;_]3();
```

Running this through the simple Python JavaScript feature vector generator script (appendix 1) yields:

```
['function: 9', 'eval_length: x', 'max_string: x', 'stringcount: x', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

Harawa et al.'s six elements of JavaScript obfuscation are probably a better, or a necessary supplemental, approach to Kittilsen's work.

There is a notable difference between deobfuscation and detecting obfuscation techniques. The difference consists of the depth of insight one might gain: actually deobfuscating a JavaScript will reveal completely different code, while the obfuscation routines may be based on a generic obfuscator routine used by several threat agents. This is much like the issue of packers in regard to executables23.

This section has shown the difficulties of balancing deobfuscation, for a more detailed coding style analysis, against a less specific feature vector using abstract obfuscation detection.

## Extracting and Analysing a PDF Feature Vector

### A. Deobfuscation - Emerging Intentions

Usually the most pressing questions when an incident involving a PDF document occurs are: who did it, and what were his intentions? This is also a consideration when further evolving the PDF feature vector. The next figure shows a model describing three groups of threat agents, where one usually stands out. For instance, if a Stuxnet-scale attack24 involving a PDF document is perceived, it will be associated with a cluster containing "group 1" entities.

While Al-Tharwa et al.2 argue that there is no need for deobfuscation in regard to classification, deobfuscation is an important step in regard to finding a distinct feature vector. The issue is that in most situations it isn't good enough to tell whether the document is malicious; in addition, there are the questions of by whom, what, where and how it was created. In regard to being defined as valid digital evidence, a rich feature vector (in addition to the network on-the-fly hash sum) is part of the answer. The latter also makes itself relevant when it comes to large quantities of data, where an analyst is not capable of manually analyzing and identifying hundreds to tens of thousands of PDF documents each day.

![Fig. 4: The threat agent model. A model describing three groups of attackers. These are necessary to filter and detect in the collection phase](/images/2015/02/threat-agent-model.png)

### B. Technical Problems During Deobfuscation

Normally, most JavaScript engines, such as Mozilla's SpiderMonkey15, Google's V816 and others, tend to be JavaScript libraries for browsers and miss some basic functionality found in Adobe Reader, which is the most used PDF reader. These engines are most often used for dynamic analysis of JavaScript and are a prerequisite when it comes to being able to completely deobfuscate JavaScript.

To prove the concepts of this article, a static Python feature vector generator engine based on a rewritten version of the Jsunpack-n14 project is used. The application used in this paper provides a vector-based interpretation of the static script, meaning the script is not run dynamically.

Reliably detecting malicious PDF documents is a challenge due to the obfuscation routines often used. This makes it necessary to perform some kind of deobfuscation to reveal more functionality. Even if one manages to deobfuscate the script once, there may be several more rounds before it is in clear text. This was a challenge not solvable within the scope of this article.

Due to parsing errors, under half of the Shadowserver 100k dataset was processed by the custom Jsunpack-n module.

### C. Introducing Two Techniques: Feature Vector Inversion and Outer Loop Obfuscation Variable Computation

As has been well documented so far in this paper, it is more or less impossible to completely automate the deobfuscation process for the PDF format. Obfuscation leaves many distinct characteristics though, so the threat agent on the other hand must be careful not to trigger anomaly alarms. There is a balance. This part of the article introduces two novel techniques proposed to be applied to the JavaScript subvector to improve its reliability.

#### C.1. Outer Loop Obfuscation Variable Computation (OLOVC)

When the threat agent implements obfuscation, one of his weaknesses is being detected through that use of obfuscation. When it comes to PDF documents, using JavaScript alone is a trigger. Now, the threat agent is probably using every trick in the book, meaning the six elements of JavaScript obfuscation8. The job of an analyst in such a matter will be to predict new obfuscation attempts and implement anomaly alerts using the extended PDF feature vector.

Throughout this paper we will name this technique "Outer Loop Obfuscation Variable Computation". The term "outer loop" most often refers to round zero, or the first of the deobfuscation routines. Variable computation is, as its name states, a matter of computing the original JavaScript variable. As we have seen, this may be done either by deobfuscating the script as a whole, including its near-impossible-for-automation complexity, or by using the original obfuscated data. We will have a further look at the latter option.

Take for instance this excerpt from the "Introducing Obfuscation" section:

```
z=c['s'+"ubstr"](0,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](2,1);s[a](z);z=c['s'+"ubstr"](3,1);s[a](z);z=c['s'+"ubstr"](4,1);s[a](z);z=c['s'+"ubstr"](5,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](7,1);s[a](z);z=c['s'+"ubstr"](8,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](10,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](12,1);[...](20,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](18,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](19,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);
```

Harawa et al. defined the above obfuscation technique as "string splitting" (as seen in the section "Introducing Obfuscation"). The following two obfuscation-extraction regular expressions were previously stated in the author's Bachelor's thesis2:

```
e.{0,2}v.{0,2}a.{0,2}l.{0,1}
u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e
```

Keep the two above statements and the previous code excerpt in mind. Breaking down the above expressions, we introduce one more regular expression:

```
s.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r.{0,4}
```

While searching for "substr" in plain text will certainly fail, the above expression will match e.g.:

's'+"ubstr"
|
|||
|
|
|||
|
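As a sketch, the three expressions can be combined into a small scanner; the helper name and output format below are illustrative assumptions, not the paper's actual generator:

```python
import re

# Obfuscation-tolerant patterns for suspicious function names.
PATTERNS = {
    "eval": r"e.{0,2}v.{0,2}a.{0,2}l.{0,1}",
    "unescape": r"u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e",
    "substr": r"s.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r",
}

def olovc_counts(script):
    """Count occurrences of each suspicious name, split or not."""
    return {name: len(re.findall(pat, script)) for name, pat in PATTERNS.items()}

print(olovc_counts('z=c[\'s\'+"ubstr"](0,1);s[a](z);'))
# {'eval': 0, 'unescape': 0, 'substr': 1}
```
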
Recall Kittilsen's JavaScript feature vector: ``[function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]``. If extended by the above techniques, the results are somewhat different.

Without string splitting detection:

```
['function: 9', 'eval_length: x', 'max_string: 10849', 'stringcount: 1', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

With outer loop obfuscation variable computation:

```
['function: 0', 'eval_length: x', 'max_string: 67', 'stringcount: 2', 'replace: 0', 'substring: 0', 'substr: 3663', 'eval: 1', 'fromCharCode: 0']
```

Additionally, rewriting and extending Kittilsen's feature vector with several other typically suspicious functions should give preferable results: ``[max_string, stringcount, function, replace, substring, substr, eval, fromCharCode, indexof, push, unescape, split, join, sort, length, concat]``

This gives the following results for two random, but related, samples:

```
[SHA256:5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 10849, 'stringcount': 2, 'replace': 0, 'substring': 0, 'substr': 1, 'length': 1, 'split': 2, 'eval': 0, 'push': 0, 'join': 1, 'concat': 0, 'fromCharCode': 0}
[SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 67, 'stringcount': 1, 'replace': 0, 'substring': 0, 'substr': 3663, 'length': 0, 'split': 0, 'eval': 0, 'push': 1, 'join': 1, 'concat': 0, 'fromCharCode': 0}
```

It may perhaps not need a comment, but in the above results we see that there are two types of elements in the feature vector that stand out: max_string and two of the suspicious functions.

Summarized, "Outer Loop Obfuscation Variable Computation" may be used to, at least partially, defeat the malware author's obfuscation attempts. Running the somewhat complex regular expressions for known malicious obfuscation routines over the 100,000-document PDF dataset gave the results in the following table.

**Dataset generalization by "outer loop obfuscation variable computation".** _The dataset aggregated by counting JavaScript variables and functions, OLOVC applied (due to errors in Jsunpack-n, the total number of entities calculated is 42736)._

| Word | Count |
| --- | --- |
| function | 651 |
| sort | 7579 |
| unescape | 4 |
| toLowerCase | 1 |
| indexof | 8 |
| max_string | 42346 |
| stringcount | 41979 |
| replace | 70 |
| substring | 91 |
| substr | 38952 |
| length | 1512 |
| split | 9621 |
| eval | 77 |
| push | 260 |
| join | 91 |
| inverse_vector | 41423 |
| concat | 86 |
| fromCharCode | 45 |

The counts in the above table show that the selected feature vector has several very interesting features. On a side note: even though some features have a larger quantity than others, this is not necessarily a measure of how good that feature is; such is especially the case with the inverse vector, which we will become more familiar with in the next section. Also, as previously mentioned, it is interesting to see the composition of multiple features to determine the origin of the script (or the script style, if you'd like). The aggregation script is attached in appendix 2.

The "Outer Loop Obfuscation Variable Computation" will
|
|||
|
require a notable amount of computational resources in
|
|||
|
high-quantity networks due to the high workload. In a way
|
|||
|
this is unavoidable since the threat agents objective of
|
|||
|
running client-side scripts is to stress the resources of
|
|||
|
such systems.
|
|||
|
|
|||
|
![Fig. 5: Illustration of computational complexity. The illustration shows the computational load on a network sensor for different obfuscation techniques](/images/2015/02/Skjermbilde-2012-05-08-kl--20-43-04.png)

### C.2. Feature Vector Inversion

Threat agents go a long way in evading detection algorithms. The following thought is derived from a common misconception in database security:

> A group of ten persons whose names are not to be revealed is listed amongst a couple of thousands in an organization's LDAP directory. The group, let us name it X, is not to be revealed and is therefore not named in the department field.

While the public may not search and filter directly on the department name X, an indirect search would be successful in revealing the group, since the ten persons are the only ones not associated with a department.

The concept of searching indirectly may be applied to evaluating JavaScript in PDF documents as well. We might start off with some of the characters expected in benign JavaScript documents:

```
{'viewerVersion':1,'getPrintParams':1,'printd':1,'var':10,'getPageNthWord':1,'annot':2,'numPages':1,'new':3}
```

The above is found by expert knowledge to be the probable variables and functions used in a benign JavaScript or other object. Many of these functions are used in interactive PDF documents, e.g. providing print buttons.

A weight is added to each clear-text function/variable. After counting the words in the document, a summarized variable named the inverted_feature_vector gives an integer. The higher the integer, the higher the probability of the JavaScript being benign.

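A minimal sketch of that computation, assuming the values in the dictionary above serve as per-term weights (an interpretation for illustration):

```python
# Clear-text indicators of benign scripts, with assumed weights.
BENIGN_WEIGHTS = {
    "viewerVersion": 1, "getPrintParams": 1, "printd": 1, "var": 10,
    "getPageNthWord": 1, "annot": 2, "numPages": 1, "new": 3,
}

def inverted_feature_vector(script):
    """Sum of weight times occurrence count; higher suggests benign."""
    return sum(w * script.count(term) for term, w in BENIGN_WEIGHTS.items())

print(inverted_feature_vector("var p = this.getPrintParams(); this.printd(p);"))
```
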
The inverse feature vector may be used as a signature, and a whitelist indication database may be built from datasets. In the 100k malicious dataset, the statistics showed that out of 42475 documents, 41423 had more than one occurrence of a known benign variable. This might seem like a poor feature, but the quantity is not the issue here; it is the weight of each variable. So: one may say that the higher the inverse vector is, the more likely it is that the PDF or JavaScript is benign. To clarify, the next table shows the inverse vector separated by weight interval.

**Shadowserver 100k dataset.** _The table shows that most malicious PDF files in the 100k Shadowserver dataset have low-weighted scores when the inverted vector is taken as a measure of how benign the scripts are._

| Weight interval | Instances | Instance percentage |
| --- | --- | --- |
| <10 | 15232 | 35.6% |
| 10-19 | 26852 | 62.8% |
| 20-29 | 136 | ~0% |
| 30-39 | 148 | ~0% |
| 40-49 | 87 | ~0% |
| 50-59 | 28 | ~0% |
| >60 | 253 | ~0% |
| Total | 42736 | - |

The inversion vector may also be seen as a measure of the likelihood that the script is obfuscated. A quick look at the table shows that the characteristics of obfuscation are found in most PDF documents in the Shadowserver 100k dataset.

Even though this part of the vector should be seen as an indication, analysts should be aware that threat agents may adapt to the detection technique and insert clear-text variables, such as the ones listed above, alongside their malicious JavaScript. This would function as a primitive feature vector inversion jammer. In other words, the inverse vector should be seen in context with the other items of the JavaScript feature vector. Further, the concept should be evolved to avoid such evasion. One technique is to segment the code before analyzing it (giving each code segment a score and finally generating an overall probability score), making it more difficult for the threat agent to utilize noise in his obfuscation, as sketched below.

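A minimal sketch of that segmentation idea, with an arbitrary segment size and any per-segment scoring function (for instance the inverted_feature_vector sketch above):

```python
def segmented_score(script, score_fn, size=200):
    """Average a per-segment score so benign-looking 'noise' injected
    in one region cannot dominate the whole document's score."""
    chunks = [script[i:i + size] for i in range(0, len(script), size)] or [""]
    return sum(score_fn(c) for c in chunks) / len(chunks)
```
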
### D. Clustering

Experience shows that in practically oriented environments, security analysis is, at least partially, done manually. That is, detection is based on indicators or anomalies, and the analysis of the detection results is performed manually by an analyst. Though this may be the approach resulting in the fewest false positives, it is overwhelming when it comes to analyzing all potentially malicious PDF documents in a larger organization. The 100k PDF dataset used in this paper is evidence of this. So, how is it possible to automatically detect the interesting parts of the 100k PDF dataset? This question leads to the concept of data mining.

The definition of data mining is the transformation of data into "meaningful patterns and rules".

Michael Abernethy at IBM developerWorks20 covers data mining quite extensively.

#### D.1. A Narrow Experiment and Results

In this paper the goal is to achieve a view of the dataset through what is named "undirected" data mining: trying to find patterns or rules in existing data. This is achieved through the feature vector previously presented.

Up until now this paper has discussed how to generate a satisfactory feature vector and what makes up the measure of similarity. Let us do an experiment using WEKA (Waikato Environment for Knowledge Analysis) to analyze our feature vector.

Appendix 3 describes the ARFF format derived from our feature vector, using two of the previously presented feature vectors (SHA256: ``5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b``, ``d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``) and a random selection of 2587 parseable PDF documents from the dataset.

In this experiment the feature vectors were produced from 200 random samples from the 100k dataset. Interesting in that regard is that the subdataset they were loaded from originally contained 6214 samples, while our application only handled the decoding of under half. The feature vector was extracted in CSV format, converted by the following WEKA Java class and loaded into WEKA:

```
java -classpath /Applications/weka-3-6-6.app/Contents/Resources/Java/weka.jar weka.core.converters.CSVLoader dataset.csv
```

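For reference, a sketch of emitting the vectors as CSV for WEKA's CSVLoader; the column set is an assumption for illustration:

```python
import csv

# Assumed columns; CSVLoader infers attribute types from the values.
FIELDS = ["sha256", "max_string", "stringcount", "function", "replace",
          "substring", "substr", "eval", "fromCharCode", "inverse_vector"]

def write_dataset(rows, path="dataset.csv"):
    """rows: an iterable of dicts keyed by FIELDS."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```
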
In the WEKA preprocessing, the results may be visualized:

![Fig. 6: Results 1; PDF feature vector distribution. A model showing the PDF feature vector object distribution using the 2587 parsable PDF documents](/images/2015/02/Skjermbilde-2012-05-16-kl--13-17-20.png)

### D.2. The complete dataset

Next, loading the complete feature vector dataset, consisting of 42736 entities, showed interesting results when clustering.

![Fig. 7: Stringcount vs anomalies in the inverse_vector, using the k-means algorithm and k=5. Medium jitter to emphasize the clusters](/images/2015/02/Skjermbilde-2012-06-27-kl--11-40-19.png)

The cluster process above also enables the possibility of looking at the anomalies where the inverse_vector is high. For instance, for entity 9724 (the highest one on the Y-axis) the inverse_vector is 21510, which is a very clear anomaly compared to the rest of the clusters (the distance is far). This should encourage a closer look at the file, based on its hash.

The Shadowserver 100k ARFF dataset will be further evolved and may be found at the project GitHub page25.

### E. Logging and Interpreting Errors

Again and again while analyzing the 100k dataset, the interpreter ran into parsing errors. Bad code, one may say, but the fact is that threat agents are adapting their code to evade known tools and frameworks. An example of this is a recent bug21 in Stevens' PDF parser, where empty PDF objects in fact created an exception in the application.

So, what does this have to do with this paper? Creative threat agents, creating malicious code that avoids the detection routines, can never be avoided. This makes an important point: the application implemented should be using strict deobfuscation and interpretation routines. When an error occurs, which will happen sooner or later, the file should be traceable and manually analyzed. This in turn should lead to an adaptation of the application. Where the routines fail will also be a characteristic of the threat agent: what part of the detection routines does he try to evade? E.g., in the 100k dataset an error occurred in the ascii85 filter. The parsing error made the parser module not output a feature vector, and it was detected by error monitoring in log files.

## Discussion and Conclusions

In regard to being used standalone as evidence, the feature vector will have its limitations; especially since it is hard to connect it to an event, it should be considered circumstantial.

The PDF and ECMA standards are complex and difficult to interpret, especially when it comes to automation. As has been shown in this article, a really hard problem is dynamically and generically executing JavaScript for deobfuscation. This is shown even within Adobe Reader itself, where e.g. Adobe Reader X uses SpiderMonkey 1.8, while previous, more prevalent versions use version 1.7 of SpiderMonkey. This often resulted in parsing errors, and in turn it will potentially cause a larger error rate in the next generation of intrusion detection systems.

It has been shown that a static analysis through a Jsunpack-n modification recovers good enough round-zero data, from a little less than half of the Shadowserver 100k dataset, to generate a characteristic of each file. The results were somewhat disappointing in regard to the extensive parsing errors. Parsing optimization and error correction, making the script more robust and reliable, should be covered in a separate report. Despite the latter, a good foundation and enough data were given to give a clue of what to expect from the extended PDF feature vector. Also, the inverse vector with its weighting gives an individual score to each document, making it exceptionally promising for further research.

In regard to OLOVC, a certain enhancement would be to combine it with Franke and Petrovic's work "Improving the efficiency of digital forensic search by means of constrained edit distance". Their concept seems quite promising and might provide valuable input to OLOVC.

The dataset used in this article may contain certain flaws in its scientific foundation. No dataset flaws as such, but indications that some data originates from the same source, have been seen throughout this article. The reason is most probably that the dataset was collected over three continuous days. Linked to the behaviour of malware, it is known that certain attacks, such as drive-by attacks, have peaks in their spread as a function of time. It is therefore natural to assume that there are larger occurrences of PDF documents originating from the same threat agent. On the other side, in further research, this should be a measure of the effectiveness of the algorithms' ability to group the data.

The Shadowserver 100k dataset only contains distinct files. It would be interesting to recollect a similar dataset with non-distinct hash entries, and to cluster it by fuzzy hashing as well.

Even though clustering is mentioned in the last part of this article, further extensive research should be done to completely explore the potential of the current feature vector. In other words, the scope of the article permitted a manual selection of a feature vector and a more or less defined measure of similarity through the extended PDF feature vector.

The project has a maintained GitHub page, as introduced in the last section. This page should encourage further development of the extended PDF feature vector.

If you'd like, please have a look at the GuC Testimon Forensic Laboratory [1].

[1] GuC Testimon Forensic Laboratory: https://sites.google.com/site/testimonlab/
|