For some time now, the Portable Document Format (PDF) standard has posed a considerable risk to corporate as well as private information security. Some work has been done on classifying PDF documents as malicious or benign, but not as much on clustering the malicious documents by the techniques used. Such clustering would provide insight, in automated analysis, into how sophisticated an attack is and who staged it. A dataset of 100,000 unique PDF documents was supplied by the Shadowserver Foundation. Analysis of the experiment results showed that 97% of the documents contained JavaScript. This and other sources revealed that most exploits are delivered through such, or similar, object types. Based on that, JavaScript object labeling gets a thorough focus in this paper.

The scope of the paper is limited to extending the attribution research already done on PDF documents, so that a feature vector may be used to label a given PDF (or a batch of them) into a relevant cluster, as an attempt to recognize different techniques and threat agents.

> JavaScript is currently one of the most exploited PDF objects. How can the PDF feature vector be extended to include a JavaScript subvector correctly describing the technique/style, sophistication and similarity to previous malicious PDF documents? How does it relate to the term digital evidence?
>
> — Problem statement

The problem statement considers the coding styles and obfuscation techniques used, and the related sophistication of the coding style. Last, but most important, the statement involves how the current PDF document measures against others previously labeled. These are all essential problems when it comes to automated data mining and clustering.

### A. Related Work

Proposed solutions for malicious versus benign classification of PDF documents have been explicitly documented in several papers. Classification using support vector machines (SVM) was handled by Jarle Kittilsen in his recent Master's thesis1.

Further, the author of this paper, in his Bachelor's thesis2, investigated the possibility of detecting obfuscated malware by analyzing HTTP traffic known to contain malware. The findings were designed, implemented and tested in Snort. Some of the detection techniques will be used as a foundation for the labeling in this paper.

Even though much good work has been done in the area of analyzing malicious PDF documents, many of the resulting tools are based on manual analysis. Worth mentioning is Didier Stevens, who developed several practical tools such as pdf-parser and PDFiD. These are not only tools, but were also the beginning of a structured way of looking at suspicious objects in PDF documents. Credit is also due to Paul Baccas at Sophos, who did considerable work on characterizing malicious versus benign PDF documents3.

This paper researches the JavaScript feature subvector of malicious PDF documents. To be able to determine an effective vector (in this experimental phase), it is essential that the dataset is filtered, meaning that the files must be malicious. As Kittilsen has done for PDF documents, Al-Tharwa et al.2 have done interesting work on detecting malicious JavaScript in browsers.

## Background

### A.1. The Feature Vector in Support of Digital Evidence

Carrier and Spafford defined "digital evidence" as any digital data that contains reliable information that supports or refutes a hypothesis about the incident7. Formally, the investigation process consists of five phases and is specially crafted for maintaining evidence integrity, the order of volatility (OOV) and the chain of custody. This all leads up to the term forensic soundness.

![Fig. 1: The investigation process, consisting of five phases9. Note the identification and analysis phases](/images/2015/02/Theinvestigationprocess-e1380485641223.png)

In this paper, forensic soundness is a notion previously defined10 as meaning: no alteration of source data has occurred. Traditionally this means that every bit of data is copied and no data is added. The previous paper stated two elementary questions:

* Can one trust the host where the data is collected from?
* Does the information correlate to other data?

When it comes to malicious documents, they are typically collected at two points:

1. In the security monitoring logs, the pre-event phase
2. When an incident has occurred, as part of the reaction to the incident (the collection phase)

Now, the ten-thousand-dollar question: when a malicious document gets executed on a computer, how is it possible to get indications that alteration of evidence has occurred? The answer lies potentially at the first collection point, the pre-event logging.

In many cases, especially considering targeted attacks, it is not possible to label a PDF document as malicious in the pre-event phase. The reason for this is often the way the threat agent crafts his attack to evade the security mechanisms of the target using collected intelligence. Most systems, in accordance with local legislation, should then delete the content data. A proposition, though, is to store the feature vector.

The reasoning behind storing a feature vector is quite simple: when storing hashes, object counts and the JavaScript subvector (which we will return to later in the paper), it will be possible to indicate whether the document features have changed. On the other side, there is no identifiable data invading privacy.

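As an illustration, consider a minimal sketch of such a stored record, assuming a hash plus naive object counts; the field names and the set of counted keywords are illustrative, not a format defined in this paper:

```python
import hashlib

def pdf_feature_record(path):
    """Sketch: a privacy-preserving record of a PDF document --
    a hash plus object counts, but none of the document's content."""
    data = open(path, "rb").read()
    # Naive byte search; a real parser should resolve filters and
    # object streams before counting.
    keywords = [b"/JavaScript", b"/JS", b"/OpenAction", b"/RichMedia", b"/Encrypt"]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "object_counts": {k.decode(): data.count(k) for k in keywords},
    }
```
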
It is reasonable to argue that the measure of how similar one PDF document is to another is also a measure of how forensically sound the evidence collected in the post-event phase is. How likely it is that the document acquired in the collection phase is the same as the one from the pre-event phase is decided by the characteristics supplied by the feature vectors of both. Hence, the feature vector should be as rich and relevant as possible.

![Fig. 2: Correlation by using the feature vector of the PDF document. Illustration of a possible pre/post incident scenario](/images/2015/02/Preandpost.png)

### A.2. Identification as an Extension of Similarity

The notion of similarity largely relates to the feature vector: how is it possible, in large quantities of data, to tell whether a new PDF document carries characteristics similar to others in a larger dataset?

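One way to make that question concrete, as a minimal sketch: represent each document as a numeric feature vector and use a distance function, here Euclidean distance, as the measure; the vectors below are hypothetical:

```python
import math

def distance(u, v):
    """Euclidean distance between equal-length feature vectors;
    a smaller distance means more similar documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical [function, replace, substr, eval] count vectors:
print(distance([0, 1, 3663, 0], [0, 1, 3701, 0]))  # small: likely related
print(distance([0, 1, 3663, 0], [9, 0, 4, 1]))     # large: likely unrelated
```
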
In his work on semantic similarity and similarity-preserving hashing, M. Pittalis11 defined similarity from the Merriam-Webster dictionary:

> Similarity: The existence of comparable aspects between two elements
>
> – Merriam-Webster Dictionary

The measure of similarity is important for clustering or grouping the documents. When clustering datasets, the procedure usually follows six steps; finding the similarity measure is step 2:

1. Feature selection
2. Proximity/similarity measure
3. Clustering criterion
4. Clustering algorithm
5. Validation
6. Interpretation

In this paper the k-means unsupervised clustering algorithm was considered. This simple algorithm groups n observations into k clusters22, each observation belonging to the cluster with the nearest mean, as sketched below.

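A minimal sketch of the idea, assuming scikit-learn and a few hypothetical feature vectors (the paper's actual experiment, described later, used WEKA):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical [max_string, stringcount, function, replace, substr, eval] vectors:
X = np.array([
    [10849, 2, 0, 0, 1, 0],
    [67, 1, 0, 0, 3663, 0],
    [120, 14, 9, 1, 4, 1],
    [90, 12, 8, 1, 2, 1],
])

# Each observation is assigned to the cluster with the nearest mean.
# Note: features with large ranges (max_string) dominate Euclidean
# distance, so feature scaling is advisable in practice.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```
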
Now, as will be seen over the next two sections, work done on the subject mostly misses out on giving a valid similarity measure when it comes to classifying PDF documents as anything other than malicious or benign. So, to be able to cluster the PDF documents, the feature vector will need a revision.

As Pittalis introduced the concept of similarity, it is important to define one more term: identification. According to the American Heritage Dictionary, identification is:

> Proof or Evidence of Identity.
>
> — The American Heritage Dictionary

In our context this means being able to identify a PDF document and attribute it to, e.g., a certain type of botnet, or, perhaps more correctly, to a coding or obfuscation technique. In an ideal state this will give an indication of which threat agent is behind the attack. This is something that has not been researched extensively for PDF documents before.

### C. The Portable Document Format

When it comes to the feature vector of the Portable Document Format (PDF), it is reasonable to have a look at how PDF documents are structured. A PDF consists of objects, and each object is of a certain type. As much research has been done on the topic previously, the format itself will not be treated any further in this paper12.

![A simplified illustration of the portable document format](/images/2015/02/ObjectdescriptionPDF-2.png)

When considering malicious PDF documents, relevant statistics have shown the following distribution of resource objects:

**Known Malicious Dataset Objects.** A table showing a number of interesting, selected features in malicious versus clean PDF documents. Baccas used two datasets, where one indicated slightly different results.

| Dataset | Object Type | Clean (%) | Malicious (%) |
| --- | --- | --- | --- |
| The Shadowserver 100k PDF malicious dataset | /JavaScript | NA | 97% |
| Paul Baccas' Sophos 130k malicious/benign dataset3 | /JavaScript | 2% | 94% |
| | /RichMedia | 0% | 0.26% |
| | /FlateDecode | 89% | 77% |
| | /Encrypt | 0.91% | 10.81% |

What can be seen from the table above is that, when it comes to the distribution of objects in malicious files, most of them contain JavaScript. This makes it very hard to distinguish and find the similarity between the documents without considering a JavaScript subvector. The author would argue that this makes a JavaScript subvector a requirement for the PDF feature vector to be valid. In previous work, where the aim has been to distinguish between malicious and benign, this has not been an issue.

### D. Closing in on the Core: The PDF Javascript Feature Subvector

JavaScript is a client-side scripting language primarily offering greater interactivity with web pages. Specifically, JavaScript is not a compiled language; it is weakly typed4 and has first-class functions5. In terms of rapid development, these features give great advantages; from a security perspective they are problematic. The following is a Snort signature to detect a JavaScript "unescape" obfuscation technique2 (we will return to the concept of obfuscation later on):

```
alert tcp any any -> any any (msg:"Obfuscated unescape"; sid:1337003; content:"replace"; pcre:"/u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e'?.replace(/"; rev:4;)
```

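The same pattern can be applied outside Snort; as a sketch, in Python, against extracted script text (the test strings are contrived):

```python
import re

# "unescape" with up to two arbitrary characters allowed between the
# letters, so simple string splitting no longer hides the name.
UNESCAPE = re.compile(r"u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e")

print(bool(UNESCAPE.search("unescape('%u9090')")))  # True: plain call
print(bool(UNESCAPE.search("u;n;e;s;c;a;p;e(x)")))  # True: split-up name
print(bool(UNESCAPE.search("escape(unrelated)")))   # False: no match
```
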
Traditionally, JavaScript is integrated as part of a browser. Seen from a security perspective, this opens up for what is commonly known as client-side attacks. More formally: JavaScript enables programmatic access to computational objects within a host environment. This is complicated by the fact that JavaScript comes in different flavors, making general parsing and evaluation complex6, as may be seen from the above signature. The flavors are often specific to the application. Today, most browsers are becoming more aligned due to the requirements of interoperability. Some applications, such as the widely deployed Adobe Reader, have some extended functionality though, which we will be focusing on in this paper.

Even though JavaScript may pose challenges to security, it is important to realize that this is due to complexity. JavaScript (which is implemented through SpiderMonkey in Mozilla18 products, and in Adobe Reader as well) builds on a standard named ECMA-262. ECMA is a standardization organization for Information and Communication Technology (ICT) and Consumer Electronics (CE)17. Thus, JavaScript is built from the ECMAScript scripting language standard. To fully understand which functions are essential in regard to malicious JavaScript, this paper will rely on the ECMAScript Language Specification19 combined with expert knowledge.

### E. Introducing Obfuscation

Harawa et al.8 describe JavaScript obfuscation by six elements:

* Identifier reassignment or randomization
* Block randomization
* White space and comment randomization
* String encoding
* String splitting
* Integer obfuscation

Further, Kittilsen1 documented a JavaScript feature vector which states the following functions as potentially malicious: [function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]. Even though his confusion matrix shows good results, there are some problems when it comes to evaluating these as-is: such characters are usually obfuscated. The following is an example from sample ``SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``:

```
if((String+'').substr(1,4)==='unct'){e="".indexOf;}c='var _l1="4c206f5783eb9d;pnwAy()utio{.VsSg',h<+I}*/DkR%x-W[]mCj^?:LBKQYEUqFM';l='l';e=e()[((2+3)?'e'+'v':"")+"a"+l];s=[];a='pus'+'h';z=c's'+"ubstr" [1];sa [2];z=c's'+"ubstr" [3];sa [2];z=c['s'+"ubstr"] [...]e(s.join(""));}
```

The above example tells an interesting story about the attacker's awareness of complexity. With respect to Kittilsen's JavaScript feature vector, the above would yield the following result: [0,x,x,x,0,0,0,0] (considerable results on the second to fourth, plus one count if we shorten substring to substr); in other words, the features are to be found in the embedded, obfuscated JavaScript, but not in clear text. When it comes to eval_length, max_string and string_count, we will return to those later in the paper.

Deobfuscated, the script would look like:

```
var _l1="[...]";_l3=app;_l4=new Array();function _l5(){var _l6=_l3.viewerVersion.toString();_l6=_l6.replace('.','');while(_l6.length&4)_l6l='0';return parsetnt(_l6,10);function _l7(_l8,_l9){while(_l8.length+2&_l9)_l8l=_l8;return _l8.substring(0,_l9I2);function _t0(_t1){_t1=unescape(_t1);rote}a*=_t1.length+2;da*/ote=unescape('Du9090');spray=_l7(da*/ote,0k2000Rrote}a*);lok%hee=_t1lspray;lok%hee=_l7(lok%hee,524098);for(i=0; i & 400; ill)_l4xi-=lok%hee.substr(0,lok%hee.lengthR1)lda*/ote;;function _t2(_t1,len){while(_t1.length&len)_t1l=_t1;return _t1.substring(0,len);function _t3(_t1){ret='';for(i=0;i&_t1.length;il=2){b=_t1.substr(i,2);c=parsetnt(b,16);retl=String.froW[har[ode(c);;return ret;function _]i1(_t1,_t4){_t5='';for(_t6=0;_t6&_t1.length;_t6ll){_l9=_t4.length;_t7=_t1.char[odeAt(_t6);_t8=_t4.char[odeAt(_t6D_l9);_t5l=String.froW[har[ode(_t7m_t8);;return _t5;function _t9(_t6){_]0=_t6.toString(16);_]1=_]0.length;_t5=(_]1D2)C'0'l_]0j_]0;return _t5;function _]2(_t1){_t5='';for(_t6=0;_t6&_t1.length;_t6l=2){_t5l='Du';_t5l=_t9(_t1.char[odeAt(_t6l1));_t5l=_t9(_t1.char[odeAt(_t6));return _t5;function _]3(){_]4=_l5();if(_]4&9000){_]5='oluAS]ggg*pu^4?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAAfhaASiAgBA98Kt?:';_]6=_l1;_]7=_t3(_]6);else{_]5='*?lAS]iLhKp9fo?:IIIIIwAAAA?AAAAAAAAAAAALAAAAAAAABk[ASiAgBAIfK4?:';_]6=_l2;_]7=_t3(_]6);_]8='SQ*YA}ggAA??';_]9=_t2('LQE?',10984);_ll0='LLcAAAK}AAKAAAAwtAAAALK}AAKAAAA?AAAAAwK}AAKAAAA?AAAA?gK}AAKAAAA?AAAAKLKKAAKAAAAtAAAAEwKKAAKAAAAwtAAAQAK}AUwAAA[StAAAAAAAAAAU}A]IIIII';_ll1=_]8l_]9l_ll0l_]5;_ll2=_]i1(_]7,'');if(_ll2.lengthD2)_ll2l=unescape('D00');_ll3=_]2(_ll2);with({*j_ll3;)_t0(*);Ywe123.rawValue=_ll1;_]3();
```

Running this through the simple Python JavaScript feature vector generator script (appendix 1) yields:

```
['function: 9', 'eval_length: x', 'max_string: x', 'stringcount: x', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

Harawa et al.'s six elements of JavaScript obfuscation are probably a better, or a necessary supplemental, approach to Kittilsen's work.

There is a notable difference between deobfuscation and detecting obfuscation techniques. The difference consists of the depth of insight one might gain: actually deobfuscating a JavaScript will reveal completely different code, while the obfuscation routines may be based on a generic obfuscator routine used by several threat agents. This is much like the issue of packers in regard to executables23.

This section has shown the difficulties of balancing deobfuscation, for a more detailed coding style analysis, against a less specific feature vector using abstract obfuscation detection.

## Extracting and Analysing a PDF Feature Vector

### A. Deobfuscation - Emerging Intentions

Usually the most pressing questions when an incident involving a PDF document occurs are: who did it, and what were his intentions? This is also a consideration when further evolving the PDF feature vector. The next figure shows a model describing three groups of threat agents, where one usually stands out. For instance, if a Stuxnet-scale attack24 involving a PDF document is perceived, it will be associated with a cluster containing "group 1" entities.

While Al-Tharwa et al.2 argue that there is no need for deobfuscation in regard to classification, deobfuscation is an important step in regard to finding a distinct feature vector. The issue is that in most situations it isn't good enough to tell whether the document is malicious; in addition, there are the questions of by whom, what, where and how it was created. In regard to being defined as valid digital evidence, a rich feature vector (in addition to the network on-the-fly hash sum) is part of the answer. The latter also makes itself relevant when it comes to large quantities of data, where an analyst is not capable of manually analyzing and identifying hundreds to tens of thousands of PDF documents each day.

![Fig. 4: The threat agent model. A model describing three groups of attackers. These are necessary to filter and detect in the collection phase](/images/2015/02/threat-agent-model.png)

### B. Technical Problems During Deobfuscation

Normally, most JavaScript engines, such as Mozilla's SpiderMonkey15, Google's V816 and others, tend to be JavaScript libraries for browsers and miss some basic functionality found in Adobe Reader, which is the most used PDF reader. These engines are most often used for dynamic analysis of JavaScript and are a prerequisite when it comes to being able to completely deobfuscate JavaScript.

To prove the concepts of this article, a static Python feature vector generator engine based on a rewritten version of the Jsunpack-n14 project is used. The application used in this paper provides a vector-based interpretation of the static script, meaning the script is not run dynamically.

Reliably detecting malicious PDF documents is a challenge due to the obfuscation routines often used. This makes it necessary to perform some kind of deobfuscation to reveal more functionality. Even if one manages to deobfuscate the script once, there may be several more rounds before it is in clear text. This was a challenge not solvable within the scope of this article.

Due to parsing errors, under half of the Shadowserver 100k dataset was processed by the custom Jsunpack-n module.

### C. Introducing Two Techniques: Feature Vector Inversion and Outer Loop Obfuscation Variable Computation

As has been well documented so far in this paper, it is more or less impossible to completely automate the deobfuscation process for the PDF format. Obfuscation leaves many distinct characteristics though, so the threat agent on the other hand must be careful not to trigger anomaly alarms. There is a balance. This part of the article introduces two novel techniques proposed to be applied to the JavaScript subvector to improve its reliability.

#### C.1. Outer Loop Obfuscation Variable Computation (OLOVC)

When the threat agent implements obfuscation, one of his weaknesses is being detected through that use of obfuscation. When it comes to PDF documents, using JavaScript alone is a trigger. Now, the threat agent is probably using every trick in the book, meaning the six elements of JavaScript obfuscation8. The job of an analyst in such a matter will be to predict new obfuscation attempts and implement anomaly alerts using the extended PDF feature vector.

Throughout this paper we will name this technique "Outer Loop Obfuscation Variable Computation". The term "outer loop" most often refers to round zero, or the first of the deobfuscation routines. Variable computation is, as its name states, a matter of computing the original JavaScript variable. As we have seen, this may be done either by deobfuscating the script as a whole, including its near-impossible-for-automation complexity, or by using the original obfuscated data. We will have a further look at the latter option.

Take for instance this excerpt from the "Introducing Obfuscation" section:

```
z=c['s'+"ubstr"](0,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](2,1);s[a](z);z=c['s'+"ubstr"](3,1);s[a](z);z=c['s'+"ubstr"](4,1);s[a](z);z=c['s'+"ubstr"](5,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](7,1);s[a](z);z=c['s'+"ubstr"](8,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](10,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](12,1);[...](20,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](18,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](13,1);s[a](z);z=c['s'+"ubstr"](19,1);s[a](z);z=c['s'+"ubstr"](11,1);s[a](z);z=c['s'+"ubstr"](14,1);s[a](z);z=c['s'+"ubstr"](17,1);s[a](z);z=c['s'+"ubstr"](12,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](1,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);z=c['s'+"ubstr"](9,1);s[a](z);z=c['s'+"ubstr"](6,1);s[a](z);
```

Harawa et al. defined the above obfuscation technique as "string splitting" (as seen in the section "Introducing Obfuscation"). The following two obfuscation-extraction regular expressions were previously stated in the author's Bachelor's thesis2:

```
e.{0,2}v.{0,2}a.{0,2}l.{0,1}
u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e
```

Keep the two above statements and the previous code excerpt in mind. Breaking down the above expressions, we introduce one more regular expression:

```
s.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r.{0,4}
```

While searching for "substr" in plain text will certainly fail, the above expression will match e.g.:

's'+"ubstr"
|
|||
|
|
|||
|
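As a sketch, the three expressions can be combined into a small scanner; the helper name and output format below are illustrative assumptions, not the paper's actual generator:

```python
import re

# Obfuscation-tolerant patterns for suspicious function names.
PATTERNS = {
    "eval": r"e.{0,2}v.{0,2}a.{0,2}l.{0,1}",
    "unescape": r"u.{0,2}n.{0,2}e.{0,2}s.{0,2}c.{0,2}a.{0,2}p.{0,1}e",
    "substr": r"s.{0,4}u.{0,4}b.{0,4}s.{0,4}t.{0,4}r",
}

def olovc_counts(script):
    """Count occurrences of each suspicious name, split or not."""
    return {name: len(re.findall(pat, script)) for name, pat in PATTERNS.items()}

print(olovc_counts('z=c[\'s\'+"ubstr"](0,1);s[a](z);'))
# {'eval': 0, 'unescape': 0, 'substr': 1}
```
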
Recall Kittilsen's JavaScript feature vector: ``[function, eval_length, max_string, stringcount, replace, substring, eval, fromCharCode]``. If extended by the above techniques, the results are somewhat different.

Without string splitting detection:

```
['function: 9', 'eval_length: x', 'max_string: 10849', 'stringcount: 1', 'replace: 1', 'substring|substr: 4', 'eval: 0', 'fromCharCode: 0']
```

With outer loop obfuscation variable computation:

```
['function: 0', 'eval_length: x', 'max_string: 67', 'stringcount: 2', 'replace: 0', 'substring: 0', 'substr: 3663', 'eval: 1', 'fromCharCode: 0']
```

Additionally, rewriting and extending Kittilsen's feature vector with several other typically suspicious functions should give preferable results: ``[max_string, stringcount, function, replace, substring, substr, eval, fromCharCode, indexof, push, unescape, split, join, sort, length, concat]``

This gives the following results for two random, but related, samples:

```
[SHA256:5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 10849, 'stringcount': 2, 'replace': 0, 'substring': 0, 'substr': 1, 'length': 1, 'split': 2, 'eval': 0, 'push': 0, 'join': 1, 'concat': 0, 'fromCharCode': 0}
[SHA256:d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201]:{'function': 0, 'sort': 0, 'unescape': 0, 'indexof': 0, 'max_string': 67, 'stringcount': 1, 'replace': 0, 'substring': 0, 'substr': 3663, 'length': 0, 'split': 0, 'eval': 0, 'push': 1, 'join': 1, 'concat': 0, 'fromCharCode': 0}
```

It may perhaps not need a comment, but in the above results we see that there are two types of elements in the feature vector that stand out: max_string and two of the suspicious functions.

Summarized, "Outer Loop Obfuscation Variable Computation" may be used to, at least partially, defeat the malware author's obfuscation attempts. Running the somewhat complex regular expressions for known malicious obfuscation routines over the 100,000-document PDF dataset gave the results in the following table.

**Dataset generalization by "outer loop obfuscation variable computation".** _The dataset aggregated by counting JavaScript variables and functions, OLOVC applied (due to errors in Jsunpack-n, the total number of entities calculated is 42736)._

| Word | Count |
| --- | --- |
| function | 651 |
| sort | 7579 |
| unescape | 4 |
| toLowerCase | 1 |
| indexof | 8 |
| max_string | 42346 |
| stringcount | 41979 |
| replace | 70 |
| substring | 91 |
| substr | 38952 |
| length | 1512 |
| split | 9621 |
| eval | 77 |
| push | 260 |
| join | 91 |
| inverse_vector | 41423 |
| concat | 86 |
| fromCharCode | 45 |

The counts in the above table show that the selected feature vector has several very interesting features. On a side note: even though some features have a larger quantity than others, this is not necessarily a measure of how good that feature is; such is especially the case with the inverse vector, which we will become more familiar with in the next section. Also, as previously mentioned, it is interesting to see the composition of multiple features to determine the origin of the script (or the script style, if you'd like). The aggregation script is attached in appendix 2.

The "Outer Loop Obfuscation Variable Computation" will
|
|||
|
require a notable amount of computational resources in
|
|||
|
high-quantity networks due to the high workload. In a way
|
|||
|
this is unavoidable since the threat agents objective of
|
|||
|
running client-side scripts is to stress the resources of
|
|||
|
such systems.
|
|||
|
|
|||
|
![Fig. 5: Illustration of computational complexity. The illustration shows the computational load on a network sensor for different obfuscation techniques](/images/2015/02/Skjermbilde-2012-05-08-kl--20-43-04.png)

### C.2. Feature Vector Inversion

Threat agents go a long way in evading detection algorithms. The following thought is derived from a common misconception in database security:

> A group of ten persons whose names are not to be revealed is listed amongst a couple of thousands in an organization's LDAP directory. The group, let us name it X, is not to be revealed and is therefore not named in the department field.

While the public may not search and filter directly on the department name X, an indirect search would be successful in revealing the group, since the ten persons are the only ones not associated with a department.

The concept of searching indirectly may be applied to evaluating JavaScript in PDF documents as well. We might start off with some of the characters expected in benign JavaScript documents:

```
{'viewerVersion':1,'getPrintParams':1,'printd':1,'var':10,'getPageNthWord':1,'annot':2,'numPages':1,'new':3}
```

The above is found by expert knowledge to be the probable variables and functions used in a benign JavaScript or other object. Many of these functions are used in interactive PDF documents, e.g. providing print buttons.

A weight is added to each clear-text function/variable. After counting the words in the document, a summarized variable named the inverted_feature_vector gives an integer. The higher the integer, the higher the probability of the JavaScript being benign.

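A minimal sketch of that computation, assuming the values in the dictionary above serve as per-term weights (an interpretation for illustration):

```python
# Clear-text indicators of benign scripts, with assumed weights.
BENIGN_WEIGHTS = {
    "viewerVersion": 1, "getPrintParams": 1, "printd": 1, "var": 10,
    "getPageNthWord": 1, "annot": 2, "numPages": 1, "new": 3,
}

def inverted_feature_vector(script):
    """Sum of weight times occurrence count; higher suggests benign."""
    return sum(w * script.count(term) for term, w in BENIGN_WEIGHTS.items())

print(inverted_feature_vector("var p = this.getPrintParams(); this.printd(p);"))
```
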
The inverse feature vector may be used as a signature, and a whitelist indication database may be built from datasets. In the 100k malicious dataset, the statistics showed that out of 42475 documents, 41423 had more than one occurrence of a known benign variable. This might seem like a poor feature, but the quantity is not the issue here; it is the weight of each variable. So: one may say that the higher the inverse vector is, the more likely it is that the PDF or JavaScript is benign. To clarify, the next table shows the inverse vector separated by weight interval.

**Shadowserver 100k dataset.** _The table shows that most malicious PDF files in the 100k Shadowserver dataset have low-weighted scores when the inverted vector is taken as a measure of how benign the scripts are._

| Weight interval | Instances | Instance percentage |
| --- | --- | --- |
| <10 | 15232 | 35.6% |
| 10-19 | 26852 | 62.8% |
| 20-29 | 136 | ~0% |
| 30-39 | 148 | ~0% |
| 40-49 | 87 | ~0% |
| 50-59 | 28 | ~0% |
| >60 | 253 | ~0% |
| Total | 42736 | - |

The inversion vector may also be seen as a measure of the likelihood that the script is obfuscated. A quick look at the table shows that the characteristics of obfuscation are found in most PDF documents in the Shadowserver 100k dataset.

Even though this part of the vector should be seen as an indication, analysts should be aware that threat agents may adapt to the detection technique and insert clear-text variables, such as the ones listed above, alongside their malicious JavaScript. This would function as a primitive feature vector inversion jammer. In other words, the inverse vector should be seen in context with the other items of the JavaScript feature vector. Further, the concept should be evolved to avoid such evasion. One technique is to segment the code before analyzing it (giving each code segment a score and finally generating an overall probability score), making it more difficult for the threat agent to utilize noise in his obfuscation, as sketched below.

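A minimal sketch of that segmentation idea, with an arbitrary segment size and any per-segment scoring function (for instance the inverted_feature_vector sketch above):

```python
def segmented_score(script, score_fn, size=200):
    """Average a per-segment score so benign-looking 'noise' injected
    in one region cannot dominate the whole document's score."""
    chunks = [script[i:i + size] for i in range(0, len(script), size)] or [""]
    return sum(score_fn(c) for c in chunks) / len(chunks)
```
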
### D. Clustering

Experience shows that in practically oriented environments, security analysis is, at least partially, done manually. That is, detection is based on indicators or anomalies, and the analysis of the detection results is performed manually by an analyst. Though this may be the approach resulting in the fewest false positives, it is overwhelming when it comes to analyzing all potentially malicious PDF documents in a larger organization. The 100k PDF dataset used in this paper is evidence of this. So, how is it possible to automatically detect the interesting parts of the 100k PDF dataset? This question leads to the concept of data mining.

The definition of data mining is the transformation of data into "meaningful patterns and rules".

Michael Abernethy at IBM developerWorks20 covers data mining quite extensively.

#### D.1. A Narrow Experiment and Results

In this paper the goal is to achieve a view of the dataset through what is named "undirected" data mining: trying to find patterns or rules in existing data. This is achieved through the feature vector previously presented.

Up until now this paper has discussed how to generate a satisfactory feature vector and what makes up the measure of similarity. Let us do an experiment using WEKA (Waikato Environment for Knowledge Analysis) to analyze our feature vector.

Appendix 3 describes the ARFF format derived from our feature vector, using two of the previously presented feature vectors (SHA256: ``5a61a0d5b0edecfb58952572addc06f2de60fcb99a21988394926ced4bbc8d1b``, ``d3874cf113fa6b43e7f6e2c438bd500edea5cae7901e2bf921b9d0d2bf081201``) and a random selection of 2587 parseable PDF documents from the dataset.

In this experiment the feature vectors were produced from 200 random samples from the 100k dataset. Interesting in that regard is that the subdataset they were loaded from originally contained 6214 samples, while our application only handled the decoding of under half. The feature vector was extracted in CSV format, converted by the following WEKA Java class and loaded into WEKA:

```
java -classpath /Applications/weka-3-6-6.app/Contents/Resources/Java/weka.jar weka.core.converters.CSVLoader dataset.csv
```

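For reference, a sketch of emitting the vectors as CSV for WEKA's CSVLoader; the column set is an assumption for illustration:

```python
import csv

# Assumed columns; CSVLoader infers attribute types from the values.
FIELDS = ["sha256", "max_string", "stringcount", "function", "replace",
          "substring", "substr", "eval", "fromCharCode", "inverse_vector"]

def write_dataset(rows, path="dataset.csv"):
    """rows: an iterable of dicts keyed by FIELDS."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```
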
In the WEKA preprocessing, the results may be visualized:

![Fig. 6: Results 1; PDF feature vector distribution. A model showing the PDF feature vector object distribution using the 2587 parsable PDF documents](/images/2015/02/Skjermbilde-2012-05-16-kl--13-17-20.png)

### D.2. The complete dataset

Next, loading the complete feature vector dataset, consisting of 42736 entities, showed interesting results when clustering.

![Fig. 7: Stringcount vs anomalies in the inverse_vector, using the k-means algorithm and k=5. Medium jitter to emphasize the clusters](/images/2015/02/Skjermbilde-2012-06-27-kl--11-40-19.png)

The cluster process above also enables the possibility of looking at the anomalies where the inverse_vector is high. For instance, for entity 9724 (the highest one on the Y-axis) the inverse_vector is 21510, which is a very clear anomaly compared to the rest of the clusters (the distance is far). This should encourage a closer look at the file, based on its hash.

The Shadowserver 100k ARFF dataset will be further evolved and may be found at the project GitHub page25.

### E. Logging and Interpreting Errors

Again and again while analyzing the 100k dataset, the interpreter ran into parsing errors. Bad code, one may say, but the fact is that threat agents are adapting their code to evade known tools and frameworks. An example of this is a recent bug21 in Stevens' PDF parser, where empty PDF objects in fact created an exception in the application.

So, what does this have to do with this paper? Creative threat agents, creating malicious code that avoids the detection routines, can never be avoided. This makes an important point: the application implemented should be using strict deobfuscation and interpretation routines. When an error occurs, which will happen sooner or later, the file should be traceable and manually analyzed. This in turn should lead to an adaptation of the application. Where the routines fail will also be a characteristic of the threat agent: what part of the detection routines does he try to evade? E.g., in the 100k dataset an error occurred in the ascii85 filter. The parsing error made the parser module not output a feature vector, and it was detected by error monitoring in log files.

## Discussion and Conclusions

In regard to being used standalone as evidence, the feature vector will have its limitations; especially since it is hard to connect it to an event, it should be considered circumstantial.

The PDF and ECMA standards are complex and difficult to interpret, especially when it comes to automation. As has been shown in this article, a really hard problem is dynamically and generically executing JavaScript for deobfuscation. This is shown even within Adobe Reader itself, where e.g. Adobe Reader X uses SpiderMonkey 1.8, while previous, more prevalent versions use version 1.7 of SpiderMonkey. This often resulted in parsing errors, and in turn it will potentially cause a larger error rate in the next generation of intrusion detection systems.

It has been shown that a static analysis through a Jsunpack-n modification recovers good enough round-zero data, from a little less than half of the Shadowserver 100k dataset, to generate a characteristic of each file. The results were somewhat disappointing in regard to the extensive parsing errors. Parsing optimization and error correction, making the script more robust and reliable, should be covered in a separate report. Despite the latter, a good foundation and enough data were given to give a clue of what to expect from the extended PDF feature vector. Also, the inverse vector with its weighting gives an individual score to each document, making it exceptionally promising for further research.

In regard to OLOVC, a certain enhancement would be to combine it with Franke and Petrovic's work "Improving the efficiency of digital forensic search by means of constrained edit distance". Their concept seems quite promising and might provide valuable input to OLOVC.

The dataset used in this article may contain certain flaws in its scientific foundation. No dataset flaws as such, but indications that some data originates from the same source, have been seen throughout this article. The reason is most probably that the dataset was collected over three continuous days. Linked to the behaviour of malware, it is known that certain attacks, such as drive-by attacks, have peaks in their spread as a function of time. It is therefore natural to assume that there are larger occurrences of PDF documents originating from the same threat agent. On the other side, in further research, this should be a measure of the effectiveness of the algorithms' ability to group the data.

The Shadowserver 100k dataset only contains distinct files. It would be interesting to recollect a similar dataset with non-distinct hash entries, and to cluster it by fuzzy hashing as well.

Even though clustering is mentioned in the last part of this article, further extensive research should be done to completely explore the potential of the current feature vector. In other words, the scope of the article permitted a manual selection of a feature vector and a more or less defined measure of similarity through the extended PDF feature vector.

The project has a maintained GitHub page, as introduced in the last section. This page should encourage further development of the extended PDF feature vector.

If you'd like, please have a look at the GuC Testimon Forensic Laboratory [1].

[1] GuC Testimon Forensic Laboratory: https://sites.google.com/site/testimonlab/
|