For some years now, cyber security professionals have been working on optimising the sharing of information and knowledge. Much of the recent effort has centred on intelligence- and data-driven teams. Today, many of these discussions end up revolving around something related to the STIX format.

> Don't use a lot where a little will do
>
> – Unknown origin

This post offers a perspective on the potential of today's standards-oriented approach to documenting indicator sets related to cyber security threat actors and incidents. It turns out we have a longer way to go than expected.

For the purpose of this article, an indicator is a characteristic or evidence of something unwanted, or hostile if you'd like. I like to refer to the military term "Indicators & Warnings" in this regard. In other words, an indicator isn't necessarily limited to the cyber domain either. Physical security could be in an even worse condition than cyber security when it comes to expressing threat indicators. I'll leave the cross-domain discussion for another time.

## Up Until Today

Multiple standards have evolved and disappeared, and one that I have previously favored is the OpenIOC 1.1 standard. However, times are changing, and so are the terminology and the breadth of what we are able to express about intrusion sets. Even though OpenIOC was a very good start, and still is as far as I am concerned, it has been far surpassed in popularity by Cybox and ultimately STIX [1].

STIX is a container, a quite verbose XML format (which is turning into JSON in 2.0). Cybox is the artefact format [2], for malware you have MAEC [3], and so on. Basically, it is a set of collaborating projects.

This all sounds good, right? Not quite. Have a look at the OpenIOC to STIX repository on GitHub [4] and you will find that ``stuxnet.stix.xml`` is 202 lines of XML code for 18 atomic indicators. OpenIOC, on the other hand, is 91 lines, and that is a verbose format as well.
In fact, the overhead ratio of the STIX file is about 10:1 (202 lines for 18 indicators, roughly 11 lines each), while OpenIOC is about 5:1 (91 lines for the same 18). To add to the mind-blowing inefficiency, I have yet to see complex and nested expressions of an actor or a campaign in the STIX format on a regular basis.

Before you continue, do a simple Google search for "STIX editor" and "cybox editor". Do it now, and while you are at it, search for "openioc editor" as well. Hello guys, these standards have been around for many years. So how should we interpret the fact that there are no user-friendly ways of using them? The closest I've come is through MISP, and generally speaking MISP does not use these standards for its internal workings either. This one on the MISP GitHub issue tracker says it all: STIX 2.x support (MISP) [5].

I'm sure that some may disagree with the above statements, pointing to the infancy of these formats. However, they can't be said to be new standards anymore. They are just too complex. One example is the graph-oriented relations built into the formats. Why not just let a graph database take care of these instead?

This post is not just meant to establish the current state. What would a better approach look like?

## What Is The Problem to Be Solved?

Back to where things have gone since the OpenIOC 1.1/atomic indicator days. The most promising addition, in my opinion, is the MITRE PRE-ATT&CK and ATT&CK frameworks. The two frameworks build on a less structured approach than seen for atomic indicators (Lockheed's Kill Chain). The latter can, for instance, be viewed in the form of the Intelligence Pyramid.

The Intelligence Pyramid's abstraction levels can be mapped to what each level is supposed to support when it comes to indicators:

    | Level of abstraction  |    | Supports
    |-----------------------|----|-------------
    | Behavior              |    | Knowledge
    |-----------------------|--->|-------------
    | Derived               |    | Information
    |-----------------------|--->|-------------
    | Atomic                |    | Data

The purpose of the abstraction layer is in this case to support assessments and measures at the corresponding contextual level. For instance, a technical report tailored to an Incident Response Team (IRT) generally concerns Derived and Atomic indicators, while an intelligence report would usually be based on the Behavior level.

Having covered the abstraction layers, we can recognize that OpenIOC (or Cybox and MAEC) covers the bottom layers of abstraction, while MITRE (PRE-)ATT&CK in its current form is mostly about the Behavior level. For Derived indicators there are primarily two well-established, seasoned and successful formats that have become standards through their widespread usage. This is, among other things, because the indicators and rules are effective, quick, easy and pleasing to write.

First we have Snort/Suricata rules and Lua scripts, which were designed for network detection. For Snort/Suricata I'd say that most of the metadata detected today is probably expressible in OpenIOC (except for the magic that can be done with Lua). Second, there is the Yara format, which has become known for its applicability against malicious files. The simplicity of both formats clearly goes hand in hand with their power of expression. Thus, I'd say that the Yara and Snort/Suricata formats are the ones to look to when it comes to content and pattern detection.

> Indicators should be easy and pleasing to write.


To summarize the above, each of the formats can be mapped to an abstraction level:

    | Level of abstraction  |    | Formats
    |-----------------------|----|-------------
    | Behavior              |    | MITRE (PRE-)ATT&CK
    |-----------------------|--->|-------------
    | Derived               |    | Suricata+Lua, Yara
    |-----------------------|--->|-------------
    | Atomic                |    | OpenIOC 1.1

Going through my notes on how I document my own indicators, I also found that I use the CVE database, datetimes, confidence, analyst comments for context, and classification as well (the latter being irrelevant for detection).

One of the major problems is that everything currently out there breaks the analyst workflow. You either need to log in to some fancy web interface, edit XML files (god forbid), or just jot everything down in a text file. The text file seems to be the natural fallback in almost any instance. I have even attempted to use the very good initiative by Yahoo, PyIOCe, and Mandiant's long-forgotten IOC Editor. Both projects have lost traction, like almost every other initiative in this space. So that's right, folks: the text editor is still the preferred tool in 2018. And let's face it: indicators should be pleasing to design and create, like putting your signature to an incident or a job well done.

> an indicator set should be for humans and machines by humans

After all, the human is the one who is going to have to deal with the indicator sets at some point, and we are the slowest link. So let us not slow ourselves down more than necessary. At this point I would like to propose the golden rule of creating golden rules: an indicator set should be for humans and machines by humans.

You may also have noticed that when all these standards are suddenly combined into one standard, they become less user-friendly. In other words, let us rather find our way back to our common \*NIX roots, where each tool had a limited set of tasks.

Graphs are essential when writing indicators.
Almost everything in the world around us can be modelled as a network, and infiltration and persistence in cyberspace are no exception. Thus, an indicator format needs to be representable in a graph, and guess what? Almost everything is, as long as it maintains some kind of structure.

For graphs there are two ways of going about the problem:

1) Implement the graph in the format
2) Make sure that you have a good graph backend and an automatable and traversable format available

For option 1, having the graph in the format will increase the complexity significantly. Option 2 results in the opposite, but that does not mean that the format can't be converted to a graph. To make a long discussion short, this is what we have graph databases for, such as JanusGraph [6].

## A Conceptual View

Summarizing the above, I'd like to propose the following requirements for indicator formats:

1) Indicator sets should be easy and inviting to create
2) You should be able to start writing at any time, when you need it
3) Unnecessary complexity should be avoided
4) The format should be human readable and editable
5) A machine should be able to interpret the format
6) Indicator sets should be graph compatible

On the basis of this article, I believe that the best approach is to provide a basic plain text format specification that inherits from the OpenIOC 1.1 and MITRE frameworks and references other formats where necessary.

Let us imagine that we found an IP address in one situation. The IP address was connected to a domain that we found using passive DNS. Further, it was found that a specific file was associated with that domain through a Twitter comment.
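
Seen from the machine's side, the scenario is just three vertices and two edges. The Python below is a minimal sketch of that view and not part of any existing tool: the `Vertex` tuple, the `observations` list, the edge labels and the `walk` helper are hypothetical names chosen for illustration, and a plain dictionary stands in for a real graph backend such as JanusGraph. The indicator values are the ones used in the example in this article.

```python
from collections import defaultdict
from typing import NamedTuple


class Vertex(NamedTuple):
    """One observed indicator: its type and its value (hypothetical structure)."""
    kind: str
    value: str


# The three observations from the scenario.
ip = Vertex("ipv4", "188.226.130.166")
domain = Vertex("domain", "secdiary.com")
doc = Vertex("filename", "some_filename.docx")

# Each record is (parent, how the link was found, child).
observations = [
    (ip, "passive DNS", domain),
    (domain, "Twitter comment", doc),
]

# A dictionary of adjacency lists stands in for the graph backend (assumption).
graph = defaultdict(list)
for parent, source, child in observations:
    graph[parent].append((source, child))


def walk(start, depth=0):
    """Return (depth, vertex) pairs reachable from start, depth-first.

    Assumes the observations contain no cycles.
    """
    found = [(depth, start)]
    for _source, child in graph.get(start, []):
        found.extend(walk(child, depth + 1))
    return found


for depth, vertex in walk(ip):
    print("  " * depth + f"{vertex.kind} {vertex.value}")
```

Running it prints the three indicators, each indented by its depth from the IP address. The relations live entirely in the backend, so the records themselves stay flat and easy to write by hand.
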

Representing the given information in its purest (readable) form looks like the following:

    // a test file
    class tlp:white
    date 2018/02/18
    ipv4 low 188.226.130.166
        domain med secdiary.com
            technique PRE-T1146
            filename med some_filename.docx
                comment found in open sources

To recap some of the previous points: the above format is simple, and it can be written at any time based on knowledge of well-known standards. Best of all, if you are heavily invested in specific formats, it can be converted to all of them using a simple interpreter traversing the format. Further, such a format is easily converted into a tree and can be loaded into a graph for traversal and automated assessments. Each confidence value can be quantified (``low=0.33``, ``med=0.66``, ``high=1.0``). That said, simplicity in this case equals actionable indicators.

    | v: 188.226.130.166 (0.33)    | match    |
    | e                            |          |
    | v: secdiary.com (0.66)       | no match | (0.33+0.66)/2 ≈ 0.5
    | e                            |          |
    | v: some_filename.docx (0.66) | match    |

On networks versus hierarchies: a drawback of the latter, as mentioned in the previous section, is that a hierarchy cannot directly express e.g. multiple domains being connected to different other vertices. A practical solution goes as follows:

    ipv4 low 188.226.130.166
        domain med secdiary.com
    domain low secdiary.com
        ipv4 low 128.199.56.232

The graph receiving the above indicator file should identify the domain as a unique entity and link the two IP addresses to the same domain:

    | v: 188.226.130.166 (0.33)
    | e: 0.5
    | v: secdiary.com (0.5)
    | e: 0.33
    | v: 128.199.56.232 (0.33)

As for structuring the indicator format for machines in practice, consider the following pseudocode:

    // each indicator is (depth, type, confidence, value)
    indicators = [(0,'ipv4','low','188.226.130.166'),...]
    _tree = tree(root_node)
    for indicator in indicators
        depth = indicator[0]
        _tree.insert(indicator,depth)

Now that we have the tree represented in code, it is trivially traversable when loading it into some graph:

    method load_indicators(node,depth):
        // link each node to its parent, then recurse into its children
        graph.insert(node.parent,edge_label,node)
        for child in node.children
            load_indicators(child,depth+1)

    load_indicators(_tree,0)

## Summary

Hopefully I did not kill too many kittens with this post. You may or may not agree, but I do believe that most analysts share at least parts of my purist views on the matter.

We are currently too focused on supporting standards and on having everyone use as few of them as possible. I believe that energy is better spent on becoming more consistent in the way we document, and on actually exchanging more developed indicator sets than the md5 hash and domain lists that are typically shared today ("not looking at these kinds of files at all", even though it's not the worst I've seen: ``MAR-10135536-F_WHITE_stix.xml`` [7]).

In the conceptual part of this article I propose a simple yet effective way of representing indicators in a practical manner. Frankly, it is even too simple to be novel. It is just consistent and intuitive.

PS!
For the STIX example above, have a look at the following to get a feel for the actual content of the file (I used one of the mentioned specimens to make the point):

    class tlp:white
    date 2018/02/05
    sha1 high 4efb9c09d7bffb2f64fc6fe2519ea85378756195
        comment NCCIC:Observable-724f9bfe-1392-456e-8d9b-c143af15f8d4
        comment did not convert all attributes
        compiler Microsoft Visual C++ 6.0
        md5 high 3dae0dc356c2b217a452b477c4b1db06
        date 2016-01-29T09:21:46Z
        entropy med 6.65226708818
        #sections low 5
        intname med ProxyDll.dll
        detection med symantec:Heur.AdvML.B

The original document expresses those same indicators in no fewer than 119 lines of XML, with an overhead ratio of about 5:1 (it looks completely insane). The element values alone read:

    3DAE0DC356C2B217A452B477C4B1DB06
    336073
    PE32 executable (DLL) (console) Intel 80386, for MS Windows
    MD5 3dae0dc356c2b217a452b477c4b1db06
    SHA1 4efb9c09d7bffb2f64fc6fe2519ea85378756195
    SHA256 8acfe8ba294ebb81402f37aa094cca8f914792b9171bc62e758a3bbefafb6e02
    SHA512 e52b8878bd8c3bdd28d696470cba8a18dcc5a6d234169e26a2fbd9862b10ec1d40196fac981bc3c5a67e661cd60c10036321388e5e5c1f60a7e9937dd71fadb1
    SSDEEP 3072:jUdidTaC07zIQt9xSx1pYxHvQY06emquSYttxlxep0xnC:jyi1XCzcbpYdvQ2e9g3kp01C
    Microsoft Visual C++ 6.0
    Microsoft Visual C++ 6.0 DLL (Debug)
    6.65226708818
    5
    2016-01-29T09:21:46Z
    4096 MD5 e14dca360e273ca75c52a4446cd39897 0.672591739631
    .text 49152 6.41338619924 MD5 076cdf2a2c0b721f0259de10578505a1
    .rdata 8192 3.293891672 MD5 4a6af2b49d08dd42374deda5564c24ef
    .data 110592 6.78785911234 MD5 c797dda9277ee1d5469683527955d77a
    .reloc 8192 3.46819043887 MD5 fbefbe53b3d0ca62b2134f249d249774

[1] STIX: https://oasis-open.github.io/cti-documentation/

[2] Cybox example: https://github.com/CybOXProject/schemas/blob/master/samples/CybOX_IPv4Address_Instance.xml

[3] MAEC: https://maec.mitre.org/

[4] OpenIOC to STIX repository on GitHub: https://github.com/STIXProject/openioc-to-stix

[5] STIX 2.x support (MISP): https://github.com/MISP/MISP/issues/2046

[6] JanusGraph: http://janusgraph.org/

[7] MAR-10135536-F_WHITE_stix.xml: https://www.us-cert.gov/sites/default/files/publications/MAR-10135536-F_WHITE_stix.xml