Tommy Skaug
For some time now, the Portable Document Format (PDF) standard has posed a considerable risk to corporate as well as private information security. Some work has been done on classifying PDF documents as malicious or benign, but less on clustering the malicious documents by the techniques they use. Such clustering would provide insight, in automated analysis, into how sophisticated an attack is and who staged it. A dataset of 100,000 unique PDF documents was supplied by the Shadowserver Foundation. Analysis of the experiment results showed that 97% of these documents contained javascript. This and other sources revealed that most exploits are delivered through such, or similar, object types. Based on that, javascript object labeling gets a thorough focus in this paper.

The scope of the paper is limited to extending the attribution research already done on PDF documents, so that a feature vector may be used to label a given PDF (or a batch of them) to a relevant cluster, in an attempt to recognize different techniques and threat agents.

> Javascript is currently one of the most exploited PDF objects. How can the PDF feature vector be extended to include a javascript subvector correctly describing the technique/style, sophistication, and similarity to previous malicious PDF documents? And how does it relate to the term digital evidence?
> — Problem statement

The problem statement considers the coding styles and obfuscation techniques used, and the sophistication reflected in the coding style. Last but most important, the statement involves how the current PDF document compares to others previously labeled. These are all essential problems when it comes to automated data mining and clustering.

### A. Related Work

Proposed solutions for classifying PDF documents as malicious versus benign have been documented explicitly in several papers. Classification using support vector machines (SVM) was handled by Jarle Kittilsen in his recent Master's thesis [1].

Further, the author of this paper, in his bachelor's thesis [2], investigated the possibility of detecting obfuscated malware by analyzing HTTP traffic known to contain malware. The findings were designed, implemented, and tested in Snort. Some of those detection techniques will be used as a foundation for labeling in this paper.

Even though much good work has been done in the area of analyzing malicious PDF documents, many of the resulting tools are based on manual analysis. Worth mentioning is Didier Stevens, who developed several practical tools, such as the PDF parser and PDFiD. These are not merely tools; they were also the beginning of a structured way of looking at suspicious objects in PDF documents. Credit is also due to Paul Baccas at Sophos, who did considerable work on characterizing malicious versus benign PDF documents [3].

The paper will be doing research into the javascript feature subvector of malicious PDF documents. To be able to determine an effective vector (in this experimental phase), it is essential that the dataset is filtered, meaning that the files must be malicious. As Kittilsen has done for PDF documents, Al-Tharwa et al. [2] have done interesting work on detecting malicious javascript in browsers.

## Background

### A.1. The Feature Vector in Support of Digital Evidence

Carrier and Spafford defined "digital evidence" as any digital data that contains reliable information that supports or refutes a hypothesis about the incident [7]. Formally, the investigation process consists of five phases and is specially crafted for maintaining evidence integrity, the order of volatility (OOV), and the chain of custody. This all leads up to the term forensic soundness.

The investigation process consists of five phases; note in particular the identification and analysis phases.

![Fig. 1: The investigation process. The investigation process consists of five phases [9]. Note the identification and analysis phase.](/images/2015/02/Theinvestigationprocess-e1380485641223.png)

In this paper, forensic soundness is a notion previously defined [10] as meaning: no alteration of source data has occurred. Traditionally this means that every bit of data is copied and no data added. The previous paper stated two elementary questions:

* Can one trust the host the data is collected from?
* Does the information correlate with other data?

When it comes to malicious documents, they are typically collected in two places:

1. In the security monitoring logging (the pre-event phase)
2. When an incident has occurred, as part of the reaction to the incident (the collection phase)

Now, the ten-thousand-dollar question: when a malicious document gets executed on a computer, how is it possible to get indications that alteration of evidence has occurred? The answer is potentially the first collection point, the pre-event logging.

In many cases, especially considering targeted attacks, it is not possible to mark a PDF document as malicious in the pre-event phase. The reason is often the way the threat agent crafts his attack, using collected intelligence to evade the security mechanisms of the target. Most systems, in accordance with local legislation, should then delete the content data. A proposition, though, is to store the feature vector.

The reasoning behind storing a feature vector is quite simple: when storing hashes, object counts, and the javascript subvector (which we will return to later in the paper), it will be possible to indicate whether the document features have changed. At the same time there is no identifiable data invading privacy.

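The idea of storing only a content-free feature record can be sketched as follows. The field names and the selection of counted object types are illustrative assumptions, not the paper's implementation:

```python
import hashlib
import re

def pdf_feature_record(raw: bytes) -> dict:
    """Build a content-free feature record: a hash plus object-type counts."""
    counts = {
        name: len(re.findall(re.escape(name.encode()), raw))
        for name in ("/JavaScript", "/RichMedia", "/FlateDecode", "/Encrypt")
    }
    return {"sha256": hashlib.sha256(raw).hexdigest(), "object_counts": counts}

# A toy stand-in for real PDF bytes:
record = pdf_feature_record(b"%PDF-1.4 ... /JavaScript ... /FlateDecode ...")
```

Such a record can be kept after the content itself is deleted, and later compared against the record of a document acquired in the collection phase.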
It is reasonable to argue that the measure of how similar one PDF document is to another is also the measure of how forensically sound the evidence collected in a post-event phase is. How likely it is that the document acquired in the collection phase is the same as the one from the pre-event phase is decided by the characteristics supplied by the feature vectors of both. The feature vector should therefore be as rich and relevant as possible.

![Fig. 2: Correlation by using the feature vector of the PDF document. Illustration of a possible pre/post incident scenario.](/images/2015/02/Preandpost.png)

### A.2. Identification as an Extension of Similarity

The notion of similarity largely relates to the feature vector: how is it possible, in large quantities of data, to tell whether a new PDF document carries characteristics similar to others in a larger dataset?

In his work on semantic similarity and similarity-preserving hashing, M. Pittalis [11] defined similarity from the Merriam-Webster dictionary:

> Similarity: The existence of comparable aspects between two elements.
> — Merriam-Webster Dictionary

The measure of similarity is important for clustering or grouping the documents. When clustering datasets, the procedure usually follows six steps; finding the similarity measure is step 2.

1. Feature selection
2. Proximity/similarity measure
3. Clustering criterion
4. Clustering algorithm
5. Validation
6. Interpretation

In this paper the k-means unsupervised learning clustering algorithm was considered. This simple algorithm groups n observations into k clusters [22]; each observation is assigned to the cluster with the nearest mean.

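As a sketch of the principle, here is a minimal k-means loop (not a production implementation) over made-up two-dimensional feature vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to the nearest mean, then recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the center with the smallest squared distance to p.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of 2-d "feature vectors" separate into two clusters:
centers, clusters = kmeans([(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)], k=2)
```

In practice the observations would be the numeric javascript subvectors discussed later, and k would be chosen by the clustering criterion in step 3 above.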
Now, as will be seen over the next two sections, work done on the subject mostly misses out on giving a valid similarity measure when it comes to classifying PDF documents as anything other than malicious or benign. So, to be able to cluster the PDF documents, the feature vector will need a revision.

As Pittalis introduced the concept of similarity, it is important to define one more term: identification. According to the American Heritage Dictionary, identification is:

> Proof or evidence of identity.
> — The American Heritage Dictionary

In our context this means being able to identify a PDF document and attribute it to, e.g., a certain type of botnet, or perhaps more correctly to a coding or obfuscation technique. In an ideal state this will give an indication of which threat agent is behind the attack. This is something that has not been researched extensively for PDF documents before.

### C. The Portable Document Format

When it comes to the feature vector of the Portable Document Format (PDF), it is reasonable to have a look at how PDF documents are structured. A PDF consists of objects, each object being of a certain type. As much research has been done on the topic previously, the format itself will not be treated any further in this paper [12].

![Fig. 3: A simplified illustration of the portable document format.](/images/2015/02/ObjectdescriptionPDF-2.png)

When considering malicious PDF documents, relevant statistics have shown the following distribution of resource objects:

**Known Malicious Datasets Objects.** A table showing a number of interesting, selected features in malicious versus clean PDF documents. Baccas used two datasets, one of which indicated slightly different results.

| Dataset | Object Type | Clean (%) | Malicious (%) |
| --- | --- | --- | --- |
| The Shadowserver 100k PDF malicious dataset | /JavaScript | NA | 97% |
| Paul Baccas' Sophos 130k malicious/benign dataset [3] | /JavaScript | 2% | 94% |
| | /RichMedia | 0% | 0.26% |
| | /FlateDecode | 89% | 77% |
| | /Encrypt | 0.91% | 10.81% |

What can be seen from the table above is that, when it comes to the distribution of objects in malicious files, most of them contain javascript. This makes it very hard to distinguish and find the similarity between the documents without considering a javascript subvector. The author would argue that this makes a javascript subvector a requirement for the PDF feature vector to be valid. In previous work, where the aim has been to distinguish between malicious and benign, this has not been an issue.

### D. Closing in on the Core: The PDF Javascript Feature Subvector

Javascript is a client-side scripting language primarily offering greater interactivity with webpages. Specifically, javascript is not a compiled language; it is weakly typed [4] and has first-class functions [5]. For rapid development these features give great advantages; from a security perspective they are problematic. The following is a Snort signature for detecting a javascript "unescape" obfuscation technique [2] (we will return to the concept of obfuscation later on):

```
alert tcp any any -> any any (msg:"Obfuscated unescape"; sid:1337003; content:"replace"; pcre:"/u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e'?\.replace\(/"; rev:4;)
```

Traditionally, javascript is integrated as part of a browser. Seen from a security perspective, this opens for what is commonly known as client-side attacks. More formally: javascript enables programmatic access to computational objects within a host environment. This is complicated by javascript coming in different flavors, making general parsing and evaluation complex [6], as may be seen from the above signature. The flavors are often specific to the application. Today, most browsers are becoming more aligned due to the requirements of interoperability. Some applications, such as the widely deployed Adobe Reader, have some extended functionality though, which we will be focusing on in this paper.

Even though javascript may pose challenges to security, it is important to realize that this is due to complexity. Javascript (implemented through SpiderMonkey in Mozilla [18] products, as well as in Adobe Reader) builds on a standard named ECMA-262. The ECMA is a standardization organ for Information and Communication Technology (ICT) and Consumer Electronics (CE) [17]. Thus, javascript is built from the ECMAScript scripting language standard. To fully understand which functions are essential in regard to malicious javascripts, this paper will rely on the ECMAScript Language Specification [19] combined with expert knowledge.

### E. Introducing Obfuscation

Harawa et al. [8] describe javascript obfuscation by six elements:

* Identifier reassignment or randomization
* Block randomization
* White space and comment randomization
* String encoding
* String splitting
* Integer obfuscation

Further, Kittilsen [1] documented a javascript feature vector which states the following functions as potentially malicious: ``[function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]``. Even though his confusion matrix shows good results, there are some problems when it comes to evaluating these as-is: such characters are usually obfuscated. The following is an example from sample ``SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``:

```
if((String+'').substr(1,4)==='unct'){e="".indexOf;}c='var _l1="4c206f5783eb9d;pnwAy()utio{.VsSg',h<+I}*/DkR%x-W[]mCj^?:LBKQYEUqFM';l='l';e=e()[((2+3)?'e'+'v':"")+"a"+l];s=[];a='pus'+'h';z=c's'+"ubstr" [1];sa [2];z=c's'+"ubstr" [3];sa [2];z=c['s'+"ubstr"] [...]e(s.join(""));}
```

The above example tells an interesting story about the attacker's awareness of complexity. With respect to Kittilsen's javascript feature vector, the above would yield the following result: ``[0,x,x,x,0,0,0,0]`` (considerable results on the second to fourth, plus one count if we shorten substring to substr). In other words, the features are to be found in the embedded, obfuscated javascript, but not in clear text. When it comes to eval_length, max_string and string_count, we will return to those later in the paper.

Deobfuscated, the script would look like:

```
var _l1="[...]";_l3=app;_l4=new Array();function _l5(){var _l6=_l3.viewerVersion.toString();_l6=_l6.replace('.','');while(_l6.length&4)_l6l='0';return parsetnt(_l6,10);function _l7(_l8,_l9){while(_l8.length+2&_l9)_l8l=_l8;return _l8.substring(0,_l9I2);function _t0(_t1){_t1=unescape(_t1);rote}a*=_t1.length+2;da*/ote=unescape('Du9090');spray=_l7(da*/ote,0k2000Rrote}a*);lok%hee=_t1lspray;lok%hee=_l7(lok%hee,524098);for(i=0; i & 400; ill)_l4xi-=lok%hee.substr(0,lok%hee.lengthR1)lda*/ote;;function _t2(_t1,len){while(_t1.length&len)_t1l=_t1;return _t1.substring(0,len);function _t3(_t1){ret='';for(i=0;i&_t1.length;il=2){b=_t1.substr(i,2);c=parsetnt(b,16);retl=String.froW[har[ode(c);;return ret;function _]i1(_t1,_t4){_t5='';for(_t6=0;_t6&_t1.length;_t6ll){_l9=_t4.length;_t7=_t1.char[odeAt(_t6);_t8=_t4.char[odeAt(_t6D_l9);_t5l=String.froW[har[ode(_t7m_t8);;return _t5;function _t9(_t6){_]0=_t6.toString(16);_]1=_]0.length;_t5=(_]1D2)C'0'l_]0j_]0;return _t5;function _]2(_t1){_t5='';for(_t6=0;_t6&_t1.length;_t6l=2){_t5l='Du';_t5l=_t9(_t1.char[odeAt(_t6l1));_t5l=_t9(_t1.char[odeAt(_t6));return _t5;function _]3(){_]4=_l5();if(_]4&9000){_]5='oluAS]ggg*pu^4?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAAfhaASiAgBA98Kt?:';_]6=_l1;_]7=_t3(_]6);else{_]5='*?lAS]iLhKp9fo?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAABk[ASiAgBAIfK4?:';_]6=_l2;_]7=_t3(_]6);_]8='SQ*YA}ggAA??';_]9=_t2('LQE?',10984);_ll0='LLcAAAK}AAKAAAAwtAAAALK}AAKAAAA?AAAAAwK}AAKAAAA?AAAA?gK}AAKAAAA?AAAAKLKKAAKAAAAtAAAAEwKKAAKAAAAwtAAAQAK}AUwAAA[StAAAAAAAAAAU}A]IIIII';_ll1=_]8l_]9l_ll0l_]5;_ll2=_]i1(_]7,'');if(_ll2.lengthD2)_ll2l=unescape('D00');_ll3=_]2(_ll2);with({*j_ll3;)_t0(*);Ywe123.rawValue=_ll1;_]3();
```

Which, through the simple Python javascript feature vector generator script (appendix 1), yields:

```
['function: 9', 'eval_length: x', 'max_string: x', 'stringcount: x', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

Harawa et al.'s six elements of javascript obfuscation are probably a better, or at least a necessary supplemental, approach to Kittilsen's work.

There is a notable difference between deobfuscation and detecting obfuscation techniques. The difference lies in the depth of insight one might gain: actually deobfuscating a javascript will reveal completely different code, while the obfuscation routines may be based on a generic obfuscator routine used by several threat agents. This is much like the issue of packers in regard to executables [23].

This section has shown the difficulties of balancing deobfuscation, for a more detailed coding style analysis, against a less specific feature vector using abstract obfuscation detection.

## Extracting and Analysing a PDF Feature Vector

### A. Deobfuscation - Emerging Intentions

Usually the most pressing questions when an incident involving a PDF document occurs are: who did it, and what are his intentions? This is also a consideration when further evolving the PDF feature vector. The next figure shows a model describing three groups of threat agents, where one usually stands out. For instance, if a Stuxnet-scale attack [24] involving a PDF document is perceived, it will be associated with a cluster containing "group 1" entities.

While Al-Tharwa et al. [2] argue that there is no need for deobfuscation for classification, deobfuscation is an important step in finding a distinct feature vector. The issue is that in most situations it is not good enough to tell whether a document is malicious; one also needs to know by whom, and what, where and how it was created. For being defined as valid digital evidence, a rich feature vector (in addition to the on-the-fly network hash sum) is part of the telling. The latter also becomes relevant with large quantities of data, where an analyst is not capable of manually analyzing and identifying hundreds to tens of thousands of PDF documents each day.

![Fig. 4: The threat agent model. A model describing three groups of attackers. These are necessary to filter and detect in the collection phase.](/images/2015/02/threat-agent-model.png)

### B. Technical Problems During Deobfuscation

Normally, most javascript engines, such as Mozilla's SpiderMonkey [15], Google's V8 [16] and others, tend to be javascript libraries for browsers and miss some basic functionality present in Adobe Reader, which is the most used PDF reader. These engines are most often used for dynamic analysis of javascripts and are a prerequisite for being able to completely deobfuscate javascripts.

To prove the concepts of this article, a static Python feature vector generator engine, based on a rewritten version of the Jsunpack-n [14] project, is used. The application used in this paper provides a vector-based interpretation of the static script, meaning the script is not run dynamically.

Reliably detecting malicious PDF documents is a challenge due to the obfuscation routines often used. This makes it necessary to perform some kind of deobfuscation to reveal more functionality. Even if one manages to deobfuscate the script once, there may be several more rounds before it is in clear text. This was a challenge not solvable within the scope of this article.

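The multi-round problem can be illustrated with a toy fixed-point loop, assuming for simplicity that each round is a plain percent-decoding; real samples typically require evaluating code between rounds, which is what makes automation hard:

```python
from urllib.parse import unquote

def deobfuscate_rounds(payload, max_rounds=10):
    """Repeatedly decode until a fixed point is reached (a toy stand-in for
    real deobfuscation, where a round may instead require executing code)."""
    rounds = 0
    while rounds < max_rounds:
        step = unquote(payload)
        if step == payload:  # nothing changed: fully decoded
            break
        payload, rounds = step, rounds + 1
    return payload, rounds

# Two nested rounds of %-encoding hiding the word "eval":
text, n = deobfuscate_rounds("%2565%2576%2561%256c")
```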
Due to parsing errors, under half of the Shadowserver 100k dataset was processed by the custom Jsunpack-n module.

### C. Introducing Two Techniques: Feature Vector Inversion and Outer Loop Obfuscation Variable Computation

As has been well documented so far in this paper, it is more or less impossible to completely automate a deobfuscation process for the PDF format. Obfuscation leaves many distinct characteristics though, so the threat agent on the other hand must be careful not to trigger anomaly alarms. There is a balance. This part of the article introduces two novel techniques, proposed applied to the javascript subvector to improve its reliability.

#### C.1. Outer Loop Obfuscation Variable Computation (OLOVC)

When the threat agent implements obfuscation, one of his weaknesses is being detected by that very use of obfuscation. When it comes to PDF documents, using javascript at all is a trigger. Now, the threat agent is probably using every trick in the book, meaning the six elements of javascript obfuscation [8]. The job of an analyst in such a matter will be to predict new obfuscation attempts and implement anomaly alerts using the extended PDF feature vector.

Throughout this paper we will name this technique "Outer Loop Obfuscation Variable Computation". The term "outer loop" most often refers to round zero, or the first of the deobfuscation routines. Variable computation is, as the name states, a matter of computing the original javascript variable. As we have seen, this may be done either by deobfuscating the script as a whole, with its near-impossible-for-automation complexity, or by using the original obfuscated data. We will have a further look at the latter option.

Take for instance this excerpt from the "Introducing Obfuscation" section:

```
z=c['s'+"ubstr"](0,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](2,1);s[a](z);z=c['s'+"ubstr"](3,1);s[a](z);z=c['s'+"ubstr"](4,1);s[a](z);z=c['s'+"ubstr"](5,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](7,1);s[a](z);z=c['s'+"ubstr"](8,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](10,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](12,1);[...](20,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](18,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](19,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);
```

Harawa et al. defined the above obfuscation technique as "string splitting" (as seen in the section "Introducing Obfuscation"). The following two obfuscation-extraction regular expressions were previously stated in the author's bachelor's thesis [2]:

```
e.{0,2}v.{0,2}a.{0,2}l.{0,1}
u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e
```

Keep the two above statements and the previous code excerpt in mind. Breaking down the above expressions, we introduce one more regular expression:

```
s.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r.{0,4}
```

While searching for "substr" in plain text will certainly fail, the above expression will match, e.g.:

```
's'+"ubstr"
```

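A minimal sketch of applying the expression: the pattern is the one above, while the surrounding test strings are illustrative.

```python
import re

# The OLOVC-style pattern from above: the letters of "substr" separated by
# up to four arbitrary characters (quotes, plus signs, brackets, ...).
SPLIT_SUBSTR = re.compile(r's.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r')

assert SPLIT_SUBSTR.search('substr')               # the plain call still matches
assert SPLIT_SUBSTR.search('\'s\'+"ubstr"')        # the string-split form above
assert SPLIT_SUBSTR.search('c["s"+"ubstr"](0,1)')  # a bracket-notation variant
```

The cost of this tolerance is a higher false-positive rate and heavier matching, which is the computational trade-off discussed at the end of this section.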
Recall Kittilsen's javascript feature vector: ``[function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]``. If extended by the above techniques, the results are somewhat different.

Without string splitting detection:

```
['function: 9', 'eval_length: x', 'max_string: 10849', 'stringcount: 1', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

With outer loop obfuscation variable computation:

```
['function: 0', 'eval_length: x', 'max_string: 67', 'stringcount: 2', 'replace: 0', 'substring: 0', 'substr: 3663', 'eval: 1', 'fromCharCode: 0']
```

Additionally, rewriting and extending Kittilsen's feature vector with several other typically suspicious functions should give preferable results: ``[max_string, stringcount, function, replace, substring, substr, eval, fromCharCode, indexof, push, unescape, split, join, sort, length, concat]``

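A naive counter over such an extended vector might be sketched as follows. This is an illustrative approximation, not the paper's appendix script, and deliberately simple (for instance, a count for substr also matches inside substring):

```python
import re

# Function-name features from the extended vector above.
FEATURES = ["function", "replace", "substring", "substr", "eval",
            "fromCharCode", "indexof", "push", "unescape", "split",
            "join", "sort", "length", "concat"]

def js_feature_vector(script: str) -> dict:
    """Count plain-text occurrences of each feature, plus string statistics."""
    vec = {name: len(re.findall(re.escape(name), script, re.IGNORECASE))
           for name in FEATURES}
    strings = re.findall(r"'[^']*'|\"[^\"]*\"", script)
    vec["stringcount"] = len(strings)
    vec["max_string"] = max((len(s) - 2 for s in strings), default=0)
    return vec

# A string-split snippet: the split "substr" is invisible to plain counting,
# which is exactly why the OLOVC patterns are needed on top of this.
vec = js_feature_vector("e=e()['e'+'v'+'a'+'l'];z=c['s'+\"ubstr\"](0,1);")
```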
This makes the following results in two random, but related, samples:

```
[SHA256:5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 10849, 'stringcount': 2, 'replace': 0, 'substring': 0, 'substr': 1, 'length': 1, 'split': 2, 'eval': 0, 'push': 0, 'join': 1, 'concat': 0, 'fromCharCode': 0}
```

```
[SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 67, 'stringcount': 1, 'replace': 0, 'substring': 0, 'substr': 3663, 'length': 0, 'split': 0, 'eval': 0, 'push': 1, 'join': 1, 'concat': 0, 'fromCharCode': 0}
```

It perhaps needs no comment, but in the above results we see that there are two types of elements in the feature vector that stand out: max_string and two of the suspicious functions.

Summarized, "Outer Loop Obfuscation Variable Computation" may be used to at least partially defeat the malware author's obfuscation attempts. Running the somewhat complex regular expressions for known malicious obfuscation routines against the 100,000-document PDF dataset gives the implementation results in the following table: the dataset generalized by OLOVC, aggregated by counting javascript variables and functions (due to errors in jsunpack-n, the total number of entities calculated is 42736).

| Word | Count |
| --- | --- |
| function | 651 |
| sort | 7579 |
| unescape | 4 |
| toLowerCase | 1 |
| indexof | 8 |
| max_string | 42346 |
| stringcount | 41979 |
| replace | 70 |
| substring | 91 |
| substr | 38952 |
| length | 1512 |
| split | 9621 |
| eval | 77 |
| push | 260 |
| join | 91 |
| inverse_vector | 41423 |
| concat | 86 |
| fromCharCode | 45 |

The counts in the above table show that the selected feature vector has several very interesting features. On a side note: even though some features have a larger quantity than others, this is not necessarily the measure of how good that feature is; such is especially the case with the inverse vector, which we will become more familiar with in the next section. Also, as previously mentioned, it is interesting to see the composition of multiple features to determine the origin of the script (or the script style, if you'd like). The aggregation script is attached in appendix 2.

The "Outer Loop Obfuscation Variable Computation" will require a notable amount of computational resources in high-quantity networks due to the high workload. In a way this is unavoidable, since one objective of the threat agent in running client-side scripts is to stress the resources of such systems.

![Fig. 5: Illustration of Computational Complexity. The illustration shows the computational load on a network sensor in regard to different obfuscation techniques](/images/2015/02/Skjermbilde-2012-05-08-kl--20-43-04.png)

### C.2. Feature Vector Inversion

Threat agents go a long way to evade detection algorithms. The following thought is derived from a common misconception in database security:

> A group of ten persons whose names are not to be revealed is listed amongst a couple of thousand in an organization's LDAP directory. The group, let us name it X, is not to be revealed and is therefore not named in the department field.

While the public may not search and filter directly on the department name X, an indirect search would succeed in revealing the group, since the ten persons are the only ones not associated with a department.

The concept of searching indirectly may be applied to evaluating javascripts in PDF documents as well. We might start off with some of the words expected in benign javascript documents:

```
{'viewerVersion':1,'getPrintParams':1,'printd':1,'var':10,'getPageNthWord':1,'annot':2,'numPages':1,'new':3}
```

The above is found by expert knowledge to be the probable variables and functions used in a benign javascript or other object. Many of these functions are used in interactive PDF documents, e.g. for providing print buttons.

A weight is added to each cleartext function/variable. After counting the words in the document, a summarized variable named the inverted_feature_vector gives an integer. The higher the integer, the higher the probability of the javascript being benign.

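A sketch of the scoring idea, using the weights listed above; the exact weighting scheme is an assumption for illustration:

```python
import re

# Benign-indicator weights from the listing above; the weight values are
# illustrative -- the paper only specifies weighted cleartext counts.
BENIGN_WEIGHTS = {'viewerVersion': 1, 'getPrintParams': 1, 'printd': 1,
                  'var': 10, 'getPageNthWord': 1, 'annot': 2,
                  'numPages': 1, 'new': 3}

def inverted_feature_vector(script: str) -> int:
    """Sum weight * occurrences of each benign cleartext indicator."""
    return sum(w * len(re.findall(r'\b%s\b' % re.escape(word), script))
               for word, w in BENIGN_WEIGHTS.items())

# A benign-looking interactive script scores high...
benign = "var pp = this.getPrintParams; var n = this.numPages;"
# ...while obfuscated code rarely contains these words in cleartext.
obfuscated = "e=e()['e'+'v'+'a'+'l'];"
```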
The inverse feature vector may be used as a signature, and a whitelist indication database may be built from datasets. In the 100k malicious dataset, the statistics showed that out of 42475 documents, 41423 had more than one occurrence of a known benign variable. This might seem like a poor feature, but the quantity is not the issue here; it is the weight of each variable. So one may say that the higher the inverse vector is, the more likely it is that the PDF or javascript is benign. To clarify, the next table shows the inverse vector separated by weight interval.

**Shadowserver 100k dataset.** _The table shows that most malicious PDF files in the 100k Shadowserver dataset contain low-weighted scores when it comes to the inverted vector as a measure of how benign the scripts are._

| Weight interval | Instances | Instance percentage |
| --- | --- | --- |
| <10 | 15232 | 35.6% |
| 20<>9 | 26852 | 62.8% |
| 30<>19 | 136 | ~0% |
| 40<>29 | 148 | ~0% |
| 50<>39 | 87 | ~0% |
| 60<>49 | 28 | ~0% |
| >60 | 253 | ~0% |
| Total | 42736 | - |

The inversion vector may also be seen as a measure of the likelihood that the script is obfuscated. A quick look at the table shows that the characteristics of obfuscation are found in most PDF documents in the Shadowserver 100k dataset.

Even though this part of the vector should be seen as an
indication, analysts should be aware that threat agents may
adapt to the detection technique and insert cleartext
variables, such as the ones listed above, in addition to
their malicious javascripts. This would function as a
primitive feature vector inversion jammer. The inverse
vector should therefore be seen in context with the other
items of the javascript feature vector. Further, the
concept should be evolved to avoid such evasion. One
technique is to segment the code before analyzing it,
giving each code segment a score and finally generating an
overall probability score, making it more difficult for the
threat agent to utilize noise in his obfuscation.

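The segmentation idea can be sketched as follows. The
scoring function (fraction of symbol characters per
segment) and the segment size are illustrative assumptions;
the point is that the overall score follows the worst
segment, so cleartext padding does not dilute it.

```python
def segment_scores(script: str, segment_size: int = 200) -> list:
    """Split the script into fixed-size segments and score each
    one by its fraction of non-alphanumeric, non-whitespace
    characters (a rough obfuscation indicator)."""
    segments = [script[i:i + segment_size]
                for i in range(0, len(script), segment_size)]
    return [sum(not c.isalnum() and not c.isspace() for c in seg) / len(seg)
            for seg in segments if seg]

def overall_score(script: str) -> float:
    """Drive the overall score by the worst segment so that
    benign noise around an obfuscated block does not mask it."""
    return max(segment_scores(script), default=0.0)
```
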
### D. Clustering

Experience shows that in practically oriented environments,
security analysis is, at least partially, done manually:
detection is based on indicators or anomalies, and the
analysis of the detection results is performed by an
analyst. Though this may be the approach resulting in the
fewest false positives, it is overwhelming when every
potentially malicious PDF document in a larger organization
must be analyzed. The 100k PDF dataset used in this paper
is evidence of that. So, how is it possible to
automatically detect the interesting parts of the 100k PDF
dataset? This question leads to the concept of data mining.

The definition of data mining is the transformation of data
into "meaningful patterns and rules".

Michael Abernethy at IBM developerWorks [20] covers data
mining quite extensively.

#### D.1. A Narrow Experiment and Results

In this paper the goal is to achieve a view of the dataset
through what is called "undirected" data mining: trying to
find patterns or rules in existing data. This is achieved
through the feature vector previously presented.

Up until now this paper has discussed how to generate a
satisfactory feature vector and what makes up the measure
of similarity. Let us do an experiment using WEKA (Waikato
Environment for Knowledge Analysis) to analyze our feature
vector.

Appendix 3 describes the ARFF format derived from our
feature vector, two of the previously presented feature
vectors (SHA256:
``5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b``,
``d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``)
and a random selection of 2587 parseable PDF documents from
the dataset.

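For orientation, an ARFF file of this kind could look
roughly like the following sketch. Apart from stringcount
and inverse_vector, which appear later in this section, the
attribute names and the data row are fabricated for
illustration:

```
@relation pdf_feature_vector

@attribute sha256         string
@attribute stringcount    numeric
@attribute inverse_vector numeric

@data
'example-hash',240,14
```
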
In this experiment the feature vector was produced from 200
random samples from the 100k dataset. Interesting in that
regard is that the subdataset they were loaded from
originally contained 6214 samples, while our application
only handled the decoding of under half. The feature vector
was extracted in CSV format, converted by the following
WEKA Java class and loaded into WEKA:

```shell
java -classpath /Applications/weka-3-6-6.app/Contents/Resources/Java/weka.jar \
  weka.core.converters.CSVLoader dataset.csv
```

In the WEKA preprocessing step, the results may be
visualized:

![Fig. 6: Results 1; PDF feature vector distribution. A
model showing the PDF feature vector object distribution
using the 2587 parsable PDF
documents](/images/2015/02/Skjermbilde-2012-05-16-kl--13-17-20.png)

#### D.2. The Complete Dataset

Next, loading the complete feature vector dataset
consisting of 42736 entities showed interesting results
when clustering.

![Fig. 7: Stringcount vs anomalies in the inverse_vector,
using the k-means algorithm with k=5. Medium jitter is
applied to emphasize the
clusters](/images/2015/02/Skjermbilde-2012-06-27-kl--11-40-19.png)

The cluster process above also makes it possible to look at
the anomalies where the inverse_vector is high. For
instance 9724 (the highest point on the Y-axis), the
inverse_vector is 21510, which is a very clear anomaly
compared to the rest of the clusters (the distance is far).
This should encourage a closer look at the file based on
its hash.

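The anomaly hunt described above can be sketched outside
WEKA as well: points far from every cluster centroid are
flagged for manual review. The sample data and threshold
below are fabricated for illustration; only the outlying
inverse_vector value 21510 is taken from the text.

```python
def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def flag_anomalies(points, centroids, threshold):
    """Return indices of points whose distance to the nearest
    centroid exceeds the threshold."""
    return [i for i, p in enumerate(points)
            if min(distance(p, c) for c in centroids) > threshold]

# (stringcount, inverse_vector) pairs; the last one mimics the
# clear outlier mentioned above.
points = [(120, 8), (90, 4), (300, 15), (110, 21510)]
centroids = [(100, 10), (300, 15)]

print(flag_anomalies(points, centroids, threshold=1000))  # [3]
```
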
The Shadowserver 100k ARFF dataset will be further evolved
and may be found at the project GitHub page [25].

### E. Logging and Interpreting Errors

Again and again while analyzing the 100k dataset, the
interpreter ran into parsing errors. Bad code, one may say,
but the fact is that threat agents are adapting their code
to evade known tools and frameworks. An example of this is
a recent bug [21] in Stevens' PDF parser, where empty PDF
objects in fact created an exception in the application.

So, what does this have to do with this paper? Creative
threat agents, crafting malicious code that evades the
detection routines, can never be ruled out. This makes an
important point: the application implemented should use
strict deobfuscation and interpretation routines. When an
error occurs, which will happen sooner or later, the file
should be traceable and manually analyzed. This in turn
should lead to an adaptation of the application. Where the
routines fail will also be a characteristic of the threat
agent: which part of the detection routines does he try to
evade? E.g., in the 100k dataset an error in the ascii85
filter occurred. The parsing error prevented the parser
module from outputting a feature vector, and it was
detected by error monitoring in the log files.

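The traceability requirement above can be sketched as
follows: when decoding fails, log the file's hash along
with the error so the sample can be pulled for manual
analysis. extract_features() is a hypothetical stand-in for
the parser module, here simulating an ascii85 filter error.

```python
import hashlib
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("pdf-parser")

def extract_features(data: bytes) -> dict:
    # Hypothetical parser module; simulates a filter failure.
    raise ValueError("ascii85: bad group")

def safe_extract(data: bytes):
    """Run the parser, but never lose track of a failing file."""
    try:
        return extract_features(data)
    except Exception as exc:
        # The hash makes the failing sample traceable in the dataset.
        log.error("parse failed sha256=%s error=%s",
                  hashlib.sha256(data).hexdigest(), exc)
        return None
```
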
## Discussion and Conclusions

In regard to being used standalone as evidence, the feature
vector will have its limitations; especially since it is
hard to connect it to an event, it should be considered
circumstantial.

The PDF and ECMA standards are complex and difficult to
interpret, especially when it comes to automation. As has
been shown in this article, a really hard problem is
dynamically and generically executing javascripts for
deobfuscation. This is evident even within Adobe Reader,
where e.g. Adobe Reader X uses Spidermonkey 1.8, while
previous, more prevalent versions use Spidermonkey 1.7.
This often resulted in parsing errors, and it will
potentially cause a larger error rate in the next
generation of intrusion detection systems.

It has been shown that a static analysis through a
Jsunpack-n modification recovers good enough round-zero
data, from a little less than half of the Shadowserver 100k
dataset, to generate a characteristic of each file. The
results were somewhat disappointing in regard to the
extensive parsing errors. Parsing optimization and error
correction, making the script more robust and reliable,
should be covered in a separate report. Despite the latter,
a good foundation and enough data were given to hint at
what to expect from the extended PDF feature vector. Also,
the inverse vector with its weighting gives an individual
score to each document, making it exceptionally promising
for further research.

In regard to OLOVC, a certain enhancement would be to
combine it with Franke and Petrović's work "Improving the
efficiency of digital forensic search by means of
constrained edit distance". Their concept seems quite
promising and might provide valuable input to OLOVC.

The dataset used in this article may contain certain flaws
in its scientific foundation. Not flaws in the dataset as
such, but indications that some data originates from the
same source have been seen throughout this article. The
reason is most probably that the dataset was collected over
three continuous days. Linked to the behaviour of malware,
it is known that certain malware, such as drive-by attacks,
has peaks in its spread as a function of time. It is
therefore natural to assume that there are larger
occurrences of PDF documents originating from the same
threat agent. On the other hand, in further research this
should serve as a measure of the algorithms' ability to
group the data.

The Shadowserver 100k dataset only contains distinct files.
It would be interesting to re-collect a similar dataset
with non-distinct hash entries, and to cluster it by fuzzy
hashing as well.

Even though clustering is mentioned in the last part of
this article, further extensive research should be done to
completely explore the potential of the current feature
vector. In other words, the scope of the article only
permitted a manual selection of a feature vector and a more
or less defined measure of similarity through the extended
PDF feature vector.

The project has a maintained GitHub page, as introduced in
the last section. This page should encourage further
development of the extended PDF feature vector.

If you'd like, please have a look at the GuC Testimon
Forensic Laboratory [1].

[1] GuC Testimon Forensic Laboratory: https://sites.google.com/site/testimonlab/