thoughts/data/indicators.md

For some years now, cyber security
professionals have been working on optimising the sharing of
information and knowledge. Much of the effort has
recently been focused on intelligence- and data-driven
teams. Today many of these discussions end up revolving
around something related to the STIX format.

> Don't use a lot where a little will do
> – Unknown origin

This post offers a perspective on the potential of today's
standard-oriented approach to documenting indicator sets
related to cyber security threat actors and incidents. It
turns out we have a longer way to go than expected.

For the purpose of this article, an indicator is a
characteristic or evidence of something unwanted, or hostile
if you'd like. I like to refer to the military term
"Indicators & Warnings" in this regard. In other words, an
indicator isn't necessarily limited to the cyber domain
alone either. Physical security could be in an even worse
condition than cyber security when it comes to expressing
threat indicators. I'll leave the cross-domain discussion
for another time.

## Up Until Today

Multiple standards have evolved and disappeared, and one
that I have been in favor of previously is the OpenIOC 1.1
standard. However, times are changing, and so are the
terminology and breadth of how we are able to express the
intrusion sets.

Even though OpenIOC was a very good start, and still is as
far as I am concerned, it has been far surpassed by Cybox and
ultimately STIX [1] in popularity.

STIX is a container, a quite verbose XML format (which is
turning into JSON in 2.0). Cybox is the artefact format [2],
for malware you have MAEC [3], and so on. Basically it's a
set of collaborating projects.

This all sounds good, right? Not quite. Have a look at the
OpenIOC to STIX repository on Github [4] and you will find
that ``stuxnet.stix.xml`` is 202 lines of XML code for 18
atomic indicators. OpenIOC on the other hand, is 91 lines,
and that is a verbose format as well. In fact the overhead
ratio of the STIX file is about 10:1, while OpenIOC is about
5:1.
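
The ratios above are simply lines of markup divided by the number of atomic indicators. A minimal sketch of that arithmetic, using only the figures quoted in the text (the helper name ``overhead_ratio`` is made up for illustration):

```python
def overhead_ratio(total_lines: int, atomic_indicators: int) -> float:
    """Lines of markup spent per atomic indicator."""
    if atomic_indicators <= 0:
        raise ValueError("need at least one indicator")
    return total_lines / atomic_indicators


# Figures quoted in the text: 202 and 91 lines for the same 18 indicators.
stix_ratio = overhead_ratio(202, 18)
openioc_ratio = overhead_ratio(91, 18)

print(f"STIX    {stix_ratio:.1f}:1")
print(f"OpenIOC {openioc_ratio:.1f}:1")
```

Rounded generously, that gives the "about 10:1" and "about 5:1" figures mentioned above.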

To add to the mind-blowing inefficiency, I have yet to see
complex and nested expressions of an actor or a campaign in
the STIX format on a regular basis.

Before you continue, do a simple Google search for "STIX
editor" and "cybox editor". Do it now, and while you are at
it, google for "openioc editor" as well. Hello guys, these
standards have been around for many years. So, how
should we interpret that there aren't any user-friendly
approaches to using them? The closest I've come is through
MISP, and that, generally speaking, does not use these
standards for its internal workings either. This issue on
the MISP GitHub issue tracker says it all: STIX 2.x support
(MISP) [5].

I'm sure that some may disagree with the above statements,
calling out the infancy of these formats. However, they
can't be said to be new standards anymore. They are just too
complex. One example is the graph-oriented relations
built into the formats. Why not just let a graph
database take care of these instead?

This is not just a post to establish the current state. How
would a better approach look?

## What Is The Problem to Be Solved?

Back to where things have gone since the OpenIOC 1.1/atomic
indicator days. The most promising addition, in my opinion,
is the MITRE PRE-ATT&CK and ATT&CK frameworks. The two
frameworks build on a less structured approach than seen
for atomic indicators (Lockheed's Kill-Chain). The latter
can for instance be viewed in the form of the Intelligence
Pyramid.

The Intelligence Pyramid's abstraction levels can be mapped
against what they are supposed to support when it comes to
indicators, as follows:

    | Level of abstraction  |    | Supports
    |-----------------------|----|-------------
    | Behavior              |    | Knowledge
    |-----------------------|--->|-------------
    | Derived               |    | Information
    |-----------------------|--->|-------------
    | Atomic                |    | Data

The purpose of the abstraction layers is in this case to
support assessments and measures at the corresponding
contextual level. For instance, a technical report tailored
to an Incident Response Team (IRT) generally concerns
Derived and Atomic indicators, while an intelligence report
would usually be based on the Behavior level.

Having covered the abstraction layers, we can recognize that
OpenIOC (or Cybox and MAEC) covers the bottom layers of
abstraction, while MITRE (PRE-)ATT&CK in its current form is
mostly about the Behavior level.

For Derived indicators there are primarily two
well-established, seasoned and successful formats that have
become standards through their widespread usage. This is,
among other things, because the indicators and rules are
effective, rapid, easy and pleasing to write.

First we have Snort/Suricata rules and Lua scripts, which were
designed for network detection. For Snort/Suricata I'd say
that most of the metadata that is detected today is probably
expressible in OpenIOC (except for the magic that can be
done with Lua). Second there is the Yara format, which has
become known for its applicability against malicious
files. The simplicity of both formats is obviously due to
their power of expression. Thus, I'd say that the Yara and
Snort/Suricata formats are the ones to look to when it comes
to content and pattern detection.

> Indicators should be easy and pleasing to write.

To summarize the above, each of the formats can be mapped to
an abstraction level:

    | Level of abstraction  |    | Formats
    |-----------------------|----|-------------
    | Behavior              |    | MITRE (PRE-)ATT&CK
    |-----------------------|--->|-------------
    | Derived               |    | Suricata+Lua, Yara
    |-----------------------|--->|-------------
    | Atomic                |    | OpenIOC 1.1


Going through my notes on how I document my own indicators I
also found that I use the CVE database, datetimes,
confidence, analyst comments for context and classification
as well (the latter being irrelevant for detection).

One of the major problems is: everything that is currently
out there breaks the analyst workflow. You either need to
log in to some fancy web interface, edit XML files (god
forbid) or you would just jot down everything in a text
file. The text file seems to be the natural fallback in
almost any instance. I have even attempted to use the very
good initiative by Yahoo, PyIOCe, and Mandiant's
long-forgotten IOC Editor. These projects have both lost
traction, as has almost every other initiative in this space. So
that is right folks, the text editor is still the preferred
tool in 2018, and let's face it: indicators should be
pleasing to design and create - like putting your signature
to an incident or a job well done.

> an indicator set should be for humans and machines by
  humans

After all, the human is the one that is going to have to
deal with the indicator sets at some point, and we are the
slowest link. So let us not slow ourselves down more than
necessary. At this point I would like to propose the golden
rule of creating golden rules: an indicator set should be
for humans and machines by humans.

You may also have noticed that when all these standards
are suddenly combined into one standard, they become less
user-friendly. In other words, let us rather find our way back to
our common \*NIX roots where each tool had a limited set of
tasks.

Graphs are essential when writing indicators. Almost
everything in the world around us can be modelled as a
network, and infiltration and persistence in cyberspace are
no exception. Thus, an indicator format needs to be
representable in a graph, and guess what? Almost everything
is, as long as it maintains some kind of structure.

For graphs there are two ways of going about the problem:

1) Implement the graph in the format

2) Make sure that you have a good graph backend and an
automatable and traversable format available

For option 1, the graph in the format will increase the
complexity significantly. Option 2 results in the opposite,
but that does not mean that it can't be converted to a
graph. To make an elaborate discussion short, this is what
we have graph databases for, such as Janusgraph [6].


## A Conceptual View

Summarizing the above, I'd like to propose the following
requirements for indicator formats:

1) Indicator sets should be easy and inviting to create

2) You should be able to start writing at any time, when you
need it

3) Unnecessary complexity should be avoided

4) The format should be human readable and editable

5) A machine should be able to interpret the format

6) Indicator sets should be graph compatible

With a basis in this article, I believe that the best
approach is to provide a basic plain text format
specification that inherits from the OpenIOC 1.1 and MITRE
frameworks and references other formats where necessary.

Let us imagine that we found an IP address in one
situation. The IP address was connected to a domain that we
found using passive DNS. Further, it was found that a
specific file was associated with that domain through a
Twitter comment. Representing the given information in its
purest (readable) form looks like the following:

    // a test file
    class                  tlp:white
    date                   2018/02/18
    ipv4          low      188.226.130.166
      domain      med      secdiary.com
      technique            PRE-T1146
        filename  med      some_filename.docx
        comment            found in open sources

To recap some of the previous points: the above format is
simple, and it can be written at any time based on knowledge of
well-known standards. Best of all, if you are
heavily invested in specific formats, it can be converted to
any of them using a simple interpreter traversing the format.
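
To give an idea of how small such an interpreter can be, here is a minimal Python sketch that reads the example above into a tree. It assumes two spaces per nesting level, ``//`` for comment lines, and that the confidence column only ever holds ``low``, ``med`` or ``high``. None of this is a finished specification, and the names ``Node`` and ``parse_indicators`` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CONFIDENCE_WORDS = {"low", "med", "high"}
INDENT_WIDTH = 2  # assumption: two spaces per nesting level


@dataclass
class Node:
    kind: str
    value: str
    confidence: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def parse_indicators(text: str) -> Node:
    """Parse the indented plain text format into a tree under a synthetic root."""
    root = Node(kind="root", value="")
    path = [root]  # path[d] is the parent of any node at depth d
    for raw in text.splitlines():
        line = raw.rstrip()
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("//"):
            continue  # skip blank lines and comment lines
        depth = (len(line) - len(stripped)) // INDENT_WIDTH
        depth = min(depth, len(path) - 1)  # tolerate over-indented lines
        tokens = stripped.split()
        kind, rest = tokens[0], tokens[1:]
        confidence = None
        # The confidence column is optional; treat it as present only when
        # a known confidence word is followed by an actual value.
        if len(rest) > 1 and rest[0] in CONFIDENCE_WORDS:
            confidence, rest = rest[0], rest[1:]
        node = Node(kind, " ".join(rest), confidence)
        path[depth].children.append(node)
        path[depth + 1:] = [node]  # this node is now the parent for depth + 1
    return root


SAMPLE = """\
// a test file
class                  tlp:white
date                   2018/02/18
ipv4          low      188.226.130.166
  domain      med      secdiary.com
  technique            PRE-T1146
    filename  med      some_filename.docx
    comment            found in open sources
"""

root = parse_indicators(SAMPLE)
for top in root.children:
    print(top.kind, top.confidence, top.value, len(top.children))
```

Once the tree exists, a converter to any other format is a matter of walking it and emitting the target syntax per node type.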

Further, such a format is easily converted into a tree and
can be loaded into a graph for traversing and automated
assessments. Each confidence value can be quantified
(``low=0.33``, ``med=0.66``, ``high=1.0``). That said,
simplicity in this case equals actionable indicators.

    | v: 188.226.130.166 (0.33)    | match    |
    | e                            |          |
    | v: secdiary.com (0.66)       | no match | (0.33+0.66)/2≈0.5
    | e                            |          |
    | v: some_filename.docx (0.66) | match    |
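
One way to read the table above is that the assessment is the mean confidence of the vertices that matched. That reading is my assumption for this sketch, and the function name ``score`` is made up for illustration:

```python
CONFIDENCE = {"low": 0.33, "med": 0.66, "high": 1.0}


def score(chain, observed):
    """Mean confidence of the indicators in `chain` whose value is in `observed`."""
    matched = [CONFIDENCE[level] for value, level in chain if value in observed]
    return sum(matched) / len(matched) if matched else 0.0


# The chain from the example; the domain is the one that did not match.
chain = [
    ("188.226.130.166", "low"),
    ("secdiary.com", "med"),
    ("some_filename.docx", "med"),
]
observed = {"188.226.130.166", "some_filename.docx"}

result = score(chain, observed)
print(round(result, 3))
```

Other aggregations (maximum, weighted by depth, and so on) would slot into the same place. The point is only that the quantified confidence values make the indicator set directly computable.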

On networks versus hierarchies: a drawback of the latter, as
mentioned in the previous section, is that e.g. a single
domain cannot be connected to several different other
vertices. A practical solution goes as follows:

    ipv4      low    188.226.130.166
      domain  med    secdiary.com
    domain    low    secdiary.com
      ipv4    low    128.199.56.232

The graph receiving the above indicator file should identify
the domain as being a unique entity and link the two IP
addresses to the same domain:

    | v: 188.226.130.166 (0.33)
    | e: 0.5
    | v: secdiary.com (0.5)
    | e: 0.33
    | v: 128.199.56.232 (0.33)
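
A minimal sketch of that merge step, with plain dictionaries standing in for a graph backend. The averaging rules are my assumption, chosen because they reproduce the numbers above: a vertex gets the mean of every confidence stated for it, and an edge gets the mean of the confidences stated on its two rows. ``build_graph`` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

CONFIDENCE = {"low": 0.33, "med": 0.66, "high": 1.0}

# (depth, type, confidence, value) rows in file order, from the example above.
ROWS = [
    (0, "ipv4", "low", "188.226.130.166"),
    (1, "domain", "med", "secdiary.com"),
    (0, "domain", "low", "secdiary.com"),
    (1, "ipv4", "low", "128.199.56.232"),
]


def build_graph(rows):
    """Merge rows sharing (type, value) into one vertex and link parent to child."""
    seen = defaultdict(list)  # vertex key -> every confidence stated for it
    edges = {}                # frozenset({parent, child}) -> edge weight
    path = []                 # path[d] = (key, confidence) of the latest row at depth d
    for depth, kind, level, value in rows:
        key, conf = (kind, value), CONFIDENCE[level]
        seen[key].append(conf)
        del path[depth:]      # drop anything at this depth or deeper
        if path:
            parent_key, parent_conf = path[-1]
            edges[frozenset((parent_key, key))] = mean((parent_conf, conf))
        path.append((key, conf))
    vertices = {key: mean(confs) for key, confs in seen.items()}
    return vertices, edges


vertices, edges = build_graph(ROWS)
for key, conf in vertices.items():
    print("v:", key[1], round(conf, 2))
for pair, weight in edges.items():
    print("e:", sorted(v[1] for v in pair), round(weight, 2))
```

Because the domain is keyed on its type and value, it ends up as a single vertex with two edges, which is what lets a flat, hierarchical file describe a network.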

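As for structuring the indicator format for machines in the
practical aspect, consider the following Python sketch:

    # (depth, type, confidence, value), in file order
    indicators = [(0, "ipv4", "low", "188.226.130.166"),
                  (1, "domain", "med", "secdiary.com")]
    root = {"indicator": None, "children": []}
    path = [root]  # path[d] is the parent of a node at depth d
    for indicator in indicators:
        depth = indicator[0]
        node = {"indicator": indicator, "children": []}
        path[depth]["children"].append(node)
        path[depth + 1:] = [node]

Now that we have the tree represented in code, it is
trivially traversable when loading it into some graph:

    def load_indicators(node, graph):
        # one edge per parent/child pair, then recurse into the child
        for child in node["children"]:
            if node["indicator"] is not None:  # root is a placeholder
                graph.append((node["indicator"], "linked", child["indicator"]))
            load_indicators(child, graph)

    graph = []  # stand-in for a graph backend: (parent, label, child)
    load_indicators(root, graph)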

## Summary

Hopefully I did not kill too many kittens with this
post. You may or may not agree, but I do believe that most
analysts share at least parts of my purist views on the
matter.

We are currently too focused on supporting standards and
having everyone use as few of them as possible. I believe
that energy is better spent on getting more consistent in the
way we document, and on actually exchanging more developed
indicator sets than the MD5 hash and domain lists that are
typically shared today ("not looking at these kinds of files
at all" - even though it's not the worst I've seen:
``MAR-10135536-F_WHITE_stix.xml`` [7]).

In the conceptual part of this article I propose a simple
yet effective way of representing indicators in a
practical manner. Frankly, it is even too simple to be
novel. It is just consistent and intuitive.

PS! For the STIX example above, have a look at the following
to get a feel for the actual content of the file (I used one
of the mentioned specimens to make the point):

    class             tlp:white
    date              2018/02/05

    sha1          high    4efb9c09d7bffb2f64fc6fe2519ea85378756195
      comment             NCCIC:Observable-724f9bfe-1392-456e-8d9b-c143af15f8d4
      comment             did not convert all attributes
      compiler            Microsoft Visual C++ 6.0
      md5         high    3dae0dc356c2b217a452b477c4b1db06
      date                2016-01-29T09:21:46Z
      entropy     med     6.65226708818
      #sections   low     5
      intname     med     ProxyDll.dll
      detection   med     symantec:Heur.AdvML.B

The original document states those same indicators in no fewer than 119 lines,
with an overhead ratio of about 1:5 (it looks completely insane):

    <stix:Observables cybox_major_version="2" cybox_minor_version="1" cybox_update_version="0">
        <cybox:Observable id="NCCIC:Observable-724f9bfe-1392-456e-8d9b-c143af15f8d4">
            <cybox:Object id="NCCIC:WinExecutableFile-bb9e38d1-d91c-4727-ab6a-514ecc0c02a2">
                <cybox:Properties xsi:type="WinExecutableFileObj:WindowsExecutableFileObjectType">
                    <FileObj:File_Name>3DAE0DC356C2B217A452B477C4B1DB06</FileObj:File_Name>
                    <FileObj:Size_In_Bytes>336073</FileObj:Size_In_Bytes>
                    <FileObj:File_Format>PE32 executable (DLL) (console) Intel 80386, for MS Windows</FileObj:File_Format>
                    <FileObj:Hashes>
                        <cyboxCommon:Hash>
                            <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                            <cyboxCommon:Simple_Hash_Value>3dae0dc356c2b217a452b477c4b1db06</cyboxCommon:Simple_Hash_Value>
                        </cyboxCommon:Hash>
                        <cyboxCommon:Hash>
                            <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA1</cyboxCommon:Type>
                            <cyboxCommon:Simple_Hash_Value>4efb9c09d7bffb2f64fc6fe2519ea85378756195</cyboxCommon:Simple_Hash_Value>
                        </cyboxCommon:Hash>
                        <cyboxCommon:Hash>
                            <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA256</cyboxCommon:Type>
                            <cyboxCommon:Simple_Hash_Value>8acfe8ba294ebb81402f37aa094cca8f914792b9171bc62e758a3bbefafb6e02</cyboxCommon:Simple_Hash_Value>
                        </cyboxCommon:Hash>
                        <cyboxCommon:Hash>
                            <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SHA512</cyboxCommon:Type>
                            <cyboxCommon:Simple_Hash_Value>e52b8878bd8c3bdd28d696470cba8a18dcc5a6d234169e26a2fbd9862b10ec1d40196fac981bc3c5a67e661cd60c10036321388e5e5c1f60a7e9937dd71fadb1</cyboxCommon:Simple_Hash_Value>
                        </cyboxCommon:Hash>
                        <cyboxCommon:Hash>
                            <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">SSDEEP</cyboxCommon:Type>
                            <cyboxCommon:Simple_Hash_Value>3072:jUdidTaC07zIQt9xSx1pYxHvQY06emquSYttxlxep0xnC:jyi1XCzcbpYdvQ2e9g3kp01C</cyboxCommon:Simple_Hash_Value>
                        </cyboxCommon:Hash>
                    </FileObj:Hashes>
                    <FileObj:Packer_List>
                        <FileObj:Packer>
                            <FileObj:Name>Microsoft Visual C++ 6.0</FileObj:Name>
                        </FileObj:Packer>
                        <FileObj:Packer>
                            <FileObj:Name>Microsoft Visual C++ 6.0 DLL (Debug)</FileObj:Name>
                        </FileObj:Packer>
                    </FileObj:Packer_List>
                    <FileObj:Peak_Entropy>6.65226708818</FileObj:Peak_Entropy>
                    <WinExecutableFileObj:Headers>
                        <WinExecutableFileObj:File_Header>
                            <WinExecutableFileObj:Number_Of_Sections>5</WinExecutableFileObj:Number_Of_Sections>
                            <WinExecutableFileObj:Time_Date_Stamp>2016-01-29T09:21:46Z</WinExecutableFileObj:Time_Date_Stamp>
                            <WinExecutableFileObj:Size_Of_Optional_Header>4096</WinExecutableFileObj:Size_Of_Optional_Header>
                            <WinExecutableFileObj:Hashes>
                                <cyboxCommon:Hash>
                                    <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                                    <cyboxCommon:Simple_Hash_Value>e14dca360e273ca75c52a4446cd39897</cyboxCommon:Simple_Hash_Value>
                                </cyboxCommon:Hash>
                            </WinExecutableFileObj:Hashes>
                        </WinExecutableFileObj:File_Header>
                        <WinExecutableFileObj:Entropy>
                            <WinExecutableFileObj:Value>0.672591739631</WinExecutableFileObj:Value>
                        </WinExecutableFileObj:Entropy>
                    </WinExecutableFileObj:Headers>
                    <WinExecutableFileObj:Sections>
                        <WinExecutableFileObj:Section>
                            <WinExecutableFileObj:Section_Header>
                                <WinExecutableFileObj:Name>.text</WinExecutableFileObj:Name>
                                <WinExecutableFileObj:Size_Of_Raw_Data>49152</WinExecutableFileObj:Size_Of_Raw_Data>
                            </WinExecutableFileObj:Section_Header>
                            <WinExecutableFileObj:Entropy>
                                <WinExecutableFileObj:Value>6.41338619924</WinExecutableFileObj:Value>
                            </WinExecutableFileObj:Entropy>
                            <WinExecutableFileObj:Header_Hashes>
                                <cyboxCommon:Hash>
                                    <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                                    <cyboxCommon:Simple_Hash_Value>076cdf2a2c0b721f0259de10578505a1</cyboxCommon:Simple_Hash_Value>
                                </cyboxCommon:Hash>
                            </WinExecutableFileObj:Header_Hashes>
                        </WinExecutableFileObj:Section>
                        <WinExecutableFileObj:Section>
                            <WinExecutableFileObj:Section_Header>
                                <WinExecutableFileObj:Name>.rdata</WinExecutableFileObj:Name>
                                <WinExecutableFileObj:Size_Of_Raw_Data>8192</WinExecutableFileObj:Size_Of_Raw_Data>
                            </WinExecutableFileObj:Section_Header>
                            <WinExecutableFileObj:Entropy>
                                <WinExecutableFileObj:Value>3.293891672</WinExecutableFileObj:Value>
                            </WinExecutableFileObj:Entropy>
                            <WinExecutableFileObj:Header_Hashes>
                                <cyboxCommon:Hash>
                                    <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                                    <cyboxCommon:Simple_Hash_Value>4a6af2b49d08dd42374deda5564c24ef</cyboxCommon:Simple_Hash_Value>
                                </cyboxCommon:Hash>
                            </WinExecutableFileObj:Header_Hashes>
                        </WinExecutableFileObj:Section>
                        <WinExecutableFileObj:Section>
                            <WinExecutableFileObj:Section_Header>
                                <WinExecutableFileObj:Name>.data</WinExecutableFileObj:Name>
                                <WinExecutableFileObj:Size_Of_Raw_Data>110592</WinExecutableFileObj:Size_Of_Raw_Data>
                            </WinExecutableFileObj:Section_Header>
                            <WinExecutableFileObj:Entropy>
                                <WinExecutableFileObj:Value>6.78785911234</WinExecutableFileObj:Value>
                            </WinExecutableFileObj:Entropy>
                            <WinExecutableFileObj:Header_Hashes>
                                <cyboxCommon:Hash>
                                    <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                                    <cyboxCommon:Simple_Hash_Value>c797dda9277ee1d5469683527955d77a</cyboxCommon:Simple_Hash_Value>
                                </cyboxCommon:Hash>
                            </WinExecutableFileObj:Header_Hashes>
                        </WinExecutableFileObj:Section>
                        <WinExecutableFileObj:Section>
                            <WinExecutableFileObj:Section_Header>
                                <WinExecutableFileObj:Name>.reloc</WinExecutableFileObj:Name>
                                <WinExecutableFileObj:Size_Of_Raw_Data>8192</WinExecutableFileObj:Size_Of_Raw_Data>
                            </WinExecutableFileObj:Section_Header>
                            <WinExecutableFileObj:Entropy>
                                <WinExecutableFileObj:Value>3.46819043887</WinExecutableFileObj:Value>
                            </WinExecutableFileObj:Entropy>
                            <WinExecutableFileObj:Header_Hashes>
                                <cyboxCommon:Hash>
                                    <cyboxCommon:Type xsi:type="cyboxVocabs:HashNameVocab-1.0">MD5</cyboxCommon:Type>
                                    <cyboxCommon:Simple_Hash_Value>fbefbe53b3d0ca62b2134f249d249774</cyboxCommon:Simple_Hash_Value>
                                </cyboxCommon:Hash>
                            </WinExecutableFileObj:Header_Hashes>
                        </WinExecutableFileObj:Section>
                    </WinExecutableFileObj:Sections>
                </cybox:Properties>
            </cybox:Object>
        </cybox:Observable>


[1] STIX: https://oasis-open.github.io/cti-documentation/
[2] Cybox example: https://github.com/CybOXProject/schemas/blob/master/samples/CybOX_IPv4Address_Instance.xml
[3] MAEC: https://maec.mitre.org/
[4] OpenIOC to STIX repository on Github: https://github.com/STIXProject/openioc-to-stix
[5] STIX 2.x support (MISP): https://github.com/MISP/MISP/issues/2046
[6] Janusgraph: http://janusgraph.org/
[7] MAR-10135536-F_WHITE_stix.xml: https://www.us-cert.gov/sites/default/files/publications/MAR-10135536-F_WHITE_stix.xml