## Key Takeaways

* It is possible to index and tag a large number of RSS, OTX and
  Twitter articles in seconds on limited computational power
* Building logic around timestamps is complex
* Structuring the resulting data in a graph is meaningful

## Introduction

Today I am sharing some details about one of the multi-year
projects I am running. The project motivation is:

> To stay up to date on cyber security developments within days.

I didn't want a realtime alerting service, but an analysis tool to
gather important fragments of data over time. These fragments
make up the basis of my open source research. The curated
information usually ends up on a channel like an NNTP feed,
sometimes with added comments.

My solution was to create a common interface to ingest and search
content from third-party sources. Achieving this is difficult and
requires some work, but I found it feasible.

Going through some basic research, I found that much of what
happens on the web eventually ends up (for example as a mention)
in one of the following three places:

1. OTX
2. Twitter
3. RSS

After some work I found that two things were important to me in
the first iteration:

1. Being able to recognize the characteristics of the content
2. Knowing the publish time of the data

The primary problem was thus to build a program that scales to a
large number of feeds.

Going from there I built a prototype in Python, which I've now
matured into a more performant Golang version. What follows from
here is my experience from that work.

The tested components of the program I am currently running are:

* Gofeed [1]
* Badger [2]
* JanusGraph [3,4]
* Apache Cassandra [5]
* Go-Twitter [6]
* AlienVault OTX API [7]
* Araddon Dateparse [8]

[1] https://github.com/mmcdole/gofeed
[2] https://github.com/dgraph-io/badger
[3] https://janusgraph.org
[4] https://docs.janusgraph.org/basics/gremlin/
[5] https://cassandra.apache.org
[6] https://github.com/dghubble/go-twitter/twitter
[7] https://github.com/AlienVault-OTX/OTX-Go-SDK/src/otxapi
[8] https://github.com/araddon/dateparse

## The Lesson of Guestimation: Not All Feeds Are Created Equal

Timestamps are perhaps among the more challenging things to
interpret in a crawler and search engine. RSS is a loose standard,
at least when it comes to implementation. This means that
timestamps may vary: localized, invalid per the RFC standards,
ambiguous, missing and so on. Much like the rest of the web, but
luckily without JavaScript.

The goal is simply to recognize which timestamp is the most
correct one. A feed may contain one form of timestamp, while a
website may indicate another one. To solve this I use and compare
two levels of timestamping:

* The feed's published and updated timestamps, and the individual
  timestamps of all its items
* The item and website last-modified timestamps

Looking back, solving the first level of timestamping was
straightforward. These timestamps are present in the feed, and for
RSS the logic to build a list of timestamps would look like this:

    /* First we check the timestamp of all
     * feed items (including the primary).
     * We then estimate what is the newest
     * one */
    var feedElectedTime time.Time
    var ts = make(map[string]string)
    // Feed-level timestamps (the "primary" ones).
    ts["published"] = feed.Published
    ts["updated"] = feed.Updated
    // Two numbered entries per item: published, then updated.
    i := 0
    for _, item := range feed.Items {
        ts[strconv.Itoa(i)] = item.Published
        i++
        ts[strconv.Itoa(i)] = item.Updated
        i++
    }
    // tsGuestimate is my own helper; err is declared earlier.
    feedElectedTime, _, err = tsGuestimate(ts, link, false)

The elected time can be compared with a previous feed checkpoint
to avoid downloading all items again. Using the above logic I was
also able to dramatically increase the success rate of the
program, since it requires a valid timestamp. The `tsGuestimate`
logic is something for a future post.

Further, the item and website timestamps require a similar method,
but in addition I found it an advantage to do an HTTP HEAD request
to the destination URL and combine the result with the timestamps
available from the feed. The central and important aspect here is
to abort retrieval if an item already exists in the database; this
dramatically speeds up the processing in each run.

False timestamps are a problem. I noticed that some websites
publish feeds with dynamic timestamps, which means that when you
retrieve the feed it adds the current time as its timestamp. This
obviously creates resource-intensive operations, since the whole
feed is then at risk of being re-indexed on each run.

## Noise Reduction: Recognizing Content Characteristics

Retrieving content is possible in several ways. For recognizing
the content I opted for regular expressions, which have given me
good coverage. This is also one of the good things about curating
articles: a question such as "why did I miss this article?"
evolves into a new iteration of the program input.

For instance, to stay on top of targeted cyber operations, I found
that frequently used phrases in articles were "targeted attack"
and "spear phishing". So based on that I deployed the following
keyword search (regular expression), which applies to every new
item ingested:

    "targeted":"(?i)targeted\\satt|spear\\sp",

So a new article containing "targeted attack" in the body or title
is tagged with the hotword "targeted". Another hotword could be
"breach".

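The tagging step can be sketched in Go with the standard `regexp` package. The "targeted" pattern is the one quoted above (written as a raw string, so the backslashes are not doubled); the "breach" pattern and the function name `tagItem` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// hotwords maps a tag to the pattern that triggers it. "targeted" is the
// expression from the text; the "breach" pattern is an assumption.
var hotwords = map[string]*regexp.Regexp{
	"targeted": regexp.MustCompile(`(?i)targeted\satt|spear\sp`),
	"breach":   regexp.MustCompile(`(?i)\bbreach`),
}

// tagItem returns the sorted hotwords whose pattern matches the title or
// the body of an item.
func tagItem(title, body string) []string {
	text := title + "\n" + body
	var tags []string
	for tag, re := range hotwords {
		if re.MatchString(text) {
			tags = append(tags, tag)
		}
	}
	sort.Strings(tags)
	return tags
}

func main() {
	fmt.Println(tagItem("Zero-day used in TARGETED ATTACK", "Details of the campaign."))
	fmt.Println(tagItem("Weekly newsletter", "Nothing of note."))
}
```

Compiling the patterns once at start-up keeps the per-item cost down to the matching itself, which matters when every new item is checked against every hotword.
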
Perhaps not surprisingly, this data can be modelled in a graph as
follows.

    Tweet ─> URL in tweet ┌─> Targeted
                          └─> Breach

## A Practical Example

Traversing a news graph, we can go from the hotword "targeted" to
all items and articles linked to the hotword for the past days.

I use Gremlin for querying. An example is shown below (some
details omitted):

    keyw = "targeted"
    _date = "2021-02-10"
    g.V().hasLabel('hotword').has('title', keyw).as('origin_hw').
      in().in().hasLabel('article:m').has('timestamp', gte(_date)).
      order().by('timestamp', asc).as('article').
      select('origin_hw', 'article').by(values('title', 'timestamp'))

The procedure above summarized:

1. Find the node with the keyword "targeted"
2. Find all articles (for instance a tweet) that are two steps out
   from the keyword (since these may be linked via a content node)
3. Get the title and timestamp from the hotword and the article

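To make the three steps concrete without a running graph database, here is a minimal in-memory sketch in Go that mirrors them. The `node` and `graph` types, the simplified label "article" and the sample data are assumptions for illustration; timestamps are compared as strings, as in the query above.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified vertex. The fields mirror the properties used in
// the query; everything else about the model is assumed.
type node struct {
	label     string
	title     string
	timestamp string // "YYYY-MM-DD hh:mm:ss" sorts correctly as a string
}

// graph stores incoming edges per node: in[x] lists the nodes pointing at x.
type graph struct {
	in map[*node][]*node
}

// addEdge adds a directed edge from one node to another.
func (g *graph) addEdge(from, to *node) {
	g.in[to] = append(g.in[to], from)
}

// articlesFor walks two incoming edges from the hotword, keeps nodes with
// the article label that are not older than since, and returns them
// ordered by ascending timestamp.
func (g *graph) articlesFor(hotword *node, since string) []*node {
	var out []*node
	for _, mid := range g.in[hotword] {
		for _, n := range g.in[mid] {
			if n.label == "article" && n.timestamp >= since {
				out = append(out, n)
			}
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].timestamp < out[j].timestamp })
	return out
}

func main() {
	g := &graph{in: map[*node][]*node{}}
	hw := &node{label: "hotword", title: "targeted"}
	url := &node{label: "content", title: "URL in tweet"}
	recent := &node{label: "article", title: "Recent tweet", timestamp: "2021-02-10 20:31:06"}
	old := &node{label: "article", title: "Old tweet", timestamp: "2021-01-05 08:00:00"}

	// article -> content -> hotword, matching the diagram's direction.
	g.addEdge(url, hw)
	g.addEdge(recent, url)
	g.addEdge(old, url)

	for _, a := range g.articlesFor(hw, "2021-02-10") {
		fmt.Println(hw.title, a.title, a.timestamp)
	}
}
```

A graph database does the same walk over indexed edges, so the query stays short even when the number of items grows.
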
Using a match, which incidentally was not a tweet but an article
from an RSS feed, we find the following:

    ==>{origin_hw=targeted, article=WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK}

Retrieving the article with Gremlin, we can determine the source:

    gremlin> g.V().has('title','WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK').valueMap()

    ==>{link=[https://www.reddit.com/r/netsec/.rss],
        title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK],
        src=[Reddit - NetSec],
        src_type=[rss],
        sha256=[8a285ce1b6d157f83d9469c06b6accaa514c794042ae7243056292d4ea245daf],
        added=[2021-02-12 10:42:16.640587 +0100 CET],
        timestamp=[2021-02-10 20:31:06 +0000 +0000],
        version=[1]}

    ==>{link=[http://www.reddit.com/r/Malware/.rss],
        title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK],
        src=[Reddit - Malware],
        src_type=[rss],
        sha256=[69737b754a7d9605d11aecff730ca3fc244c319f35174a7b37dd0d1846a823b7],
        added=[2021-02-12 10:41:48.510538 +0100 CET],
        timestamp=[2021-02-10 20:35:11 +0000 +0000],
        version=[1]}

In this instance the sources were two Reddit posts about a
targeted incident in China, which triggered the keyword in
question and others. Additionally, they triggered a "zero day"
hotword.

## Summary

Through this post I have shown some key parts of how to build a
feed aggregator that can scale to thousands of feeds on a single
computer, with update times in seconds.

I have also given a brief view of how JanusGraph and similar
systems can be used to model such data in a way that makes it
possible to search, find and eventually stay up to date on
information relevant to cyber security.

Once in place, such a system may save hours per day, since the
data is normalised and searchable in one place.