## Key Takeaways

* It is possible to index and tag a high number of RSS, OTX and Twitter articles on limited computational power in seconds
* Building logic around timestamps is complex
* Structuring the resulting data in a graph is meaningful

## Introduction

Today I am sharing some details about one of the multi-year projects I am running. The project motivation is:

> To stay up to date on cyber security developments within days.

I didn't want a realtime alerting service, but an analysis tool to gather important fragments of data over time. These fragments make up the basis of my open source research. The curated information usually ends up on a channel like an NNTP feed, sometimes with added comments.

My solution was to create a common interface to ingest and search content from third party sources. Achieving this is difficult and requires some work, but I found it feasible.
Going through some basic research I found that much of what happens on the web eventually ends up, for instance as a mention, in one of the following three places:

1. OTX
2. Twitter
3. RSS
After some work I found that two things were important to me in the first iteration:

1. Being able to recognize the characteristics of the content
2. Knowing the publish time of the data

The primary problem was thus to build a program that scales to a large number of feeds. Going from there I built a prototype in Python, which I've now matured into a more performant Golang version. What follows is my experience from that work.
The tested components of the program I am currently running are listed below; a minimal example of the feed-parsing part follows the references:
* Gofeed [1]
* Badger [2]
* Apache Janusgraph [3,4]
* Apache Cassandra [5]
* Go-Twitter [6]
* Alienvault OTX API [7]
* Araddon Dateparse [8]

[1] https://github.com/mmcdole/gofeed
[2] https://github.com/dgraph-io/badger
[3] https://janusgraph.org
[4] https://docs.janusgraph.org/basics/gremlin/
[5] https://cassandra.apache.org
[6] https://github.com/dghubble/go-twitter/twitter
[7] https://github.com/AlienVault-OTX/OTX-Go-SDK/src/otxapi
[8] https://github.com/araddon/dateparse
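
To give a sense of where Gofeed fits in, here is a minimal sketch of parsing a feed and printing the raw timestamp strings that the later logic has to interpret. The feed URL is a placeholder, and the snippet is illustrative rather than the program's actual ingestion code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/mmcdole/gofeed"
)

func main() {
	// Parse a feed (placeholder URL) and print the raw timestamp
	// strings that the timestamp logic later has to interpret.
	fp := gofeed.NewParser()
	feed, err := fp.ParseURL("https://example.com/feed.xml")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("feed published:", feed.Published, "updated:", feed.Updated)
	for _, item := range feed.Items {
		fmt.Println(item.Title, "|", item.Published, "|", item.Updated)
	}
}
```

Gofeed also exposes parsed `*time.Time` fields, but as the next section shows, the raw values are often what needs interpreting.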
## The Lesson of Guestimation: Not All Feeds Are Created Equal

Timestamps are perhaps among the more challenging things to interpret in a crawler and search engine. RSS is a loose standard, at least when it comes to implementation. This means that timestamps may vary: localized, invalid per the RFC standards, ambiguous, missing and so on. Much like the web otherwise, though luckily without JavaScript.
The goal is simply to recognize which timestamp is the most correct one. A feed may contain one form of timestamp, while a website may indicate another. To solve this I use and compare two levels of timestamping:

* The feed's published and updated timestamps, plus each item's individual timestamps
* The item and website last-modified timestamps

Looking back, solving the first level of timestamping was straightforward. These timestamps are present in the feed, and for RSS the logic to build a list of timestamps looks like this:
```go
/* First we check the timestamps of all feed items
 * (including the primary feed timestamps), then we
 * estimate which one is the newest. */
var feedElectedTime time.Time
ts := make(map[string]string)
ts["published"] = feed.Published
ts["updated"] = feed.Updated

i := 0
for _, item := range feed.Items {
	ts[strconv.Itoa(i)] = item.Published
	i++
	ts[strconv.Itoa(i)] = item.Updated
	i++
}

// feed, link and err come from the surrounding fetch logic.
feedElectedTime, _, err = tsGuestimate(ts, link, false)
```
The elected time can be used to compare with a previous feed
checkpoint to avoid downloading all items again. Using the above
logic I was also able to dramatically increase the success rate of
the program, since it requires a valid timestamp. The
`tsGuestimate` logic is something for a future post.
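
As a minimal sketch of that comparison (the names `skipFeed` and `checkpoint` are mine, not from the actual program), the gate can be as simple as:

```go
package feed

import "time"

// skipFeed reports whether a feed can be skipped because its elected
// timestamp is not newer than the previously stored checkpoint.
// Illustrative only; the real program keeps its own checkpoint state
// (Badger is the embedded store in the component list).
func skipFeed(feedElectedTime, checkpoint time.Time) bool {
	if feedElectedTime.IsZero() {
		// No valid timestamp could be elected; process the feed to be safe.
		return false
	}
	return !feedElectedTime.After(checkpoint)
}
```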
Further, the item/website timestamps require a similar method, but in addition I found it an advantage to do an HTTP HEAD request to the destination URL and combine the result with the timestamps available from the feed. The central and important aspect here is to abort retrieval if an item already exists in the database; this dramatically speeds up each run.
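
A sketch of both ideas, assuming the item key is its SHA-256 and that items are stored in Badger; the function names are illustrative, not the program's actual API:

```go
package feed

import (
	"net/http"
	"time"

	badger "github.com/dgraph-io/badger"
)

// lastModified issues an HTTP HEAD request and returns the
// Last-Modified timestamp, if the server provides one, so it can be
// compared with the timestamps found in the feed.
func lastModified(url string) (time.Time, bool) {
	resp, err := http.Head(url)
	if err != nil {
		return time.Time{}, false
	}
	defer resp.Body.Close()

	lm := resp.Header.Get("Last-Modified")
	if lm == "" {
		return time.Time{}, false
	}
	t, err := http.ParseTime(lm)
	if err != nil {
		return time.Time{}, false
	}
	return t, true
}

// alreadyIndexed checks whether an item, keyed here by its SHA-256,
// is already stored in Badger so that retrieval can be aborted early.
func alreadyIndexed(db *badger.DB, key []byte) bool {
	err := db.View(func(txn *badger.Txn) error {
		_, err := txn.Get(key)
		return err
	})
	return err == nil
}
```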
False timestamps are a problem. I noticed that some websites publish feeds with dynamic timestamps, meaning that every time you retrieve the feed it is stamped with the current time. This obviously creates resource-intensive operations, since the whole feed is then at risk of being re-indexed on each run.
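
One possible guard, sketched here as my own illustration rather than how the program actually handles it, is to treat an elected time that sits suspiciously close to fetch time as unreliable:

```go
package feed

import "time"

// suspiciouslyFresh flags an elected timestamp that falls within a
// small window of the fetch time, a hint that the feed stamps items
// with "now" on every retrieval. The threshold is illustrative.
func suspiciouslyFresh(elected, fetchedAt time.Time) bool {
	const window = 2 * time.Minute
	return elected.After(fetchedAt.Add(-window))
}
```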
## Noise Reduction: Recognizing Content Characteristics

Retrieving content is possible in several ways. For recognizing the content I opted for regular expressions, which have given me good coverage. This is also one of the benefits of curating articles: experience with questions such as "why did I miss this article?" evolves into a new iteration of the program's input.

For instance, to stay on top of targeted cyber operations, I found that frequently used phrases in articles were "targeted attack" and "spear phishing". Based on that I deployed the following keyword search (regular expression), which is applied to every new item ingested:
"targeted":"(?i)targeted\\satt|spear\\sp",
A new article containing "targeted attack" in the body or title is then tagged with the hotword "targeted". Another hotword could be "breach".
Perhaps not surprisingly, this data can be modelled in a graph as follows.
```
Tweet ─> URL in tweet ┌─> Targeted
                      └─> Breach
```
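
For illustration only (this is not the program's actual schema), the vertices and edges in that sketch can be thought of as:

```go
package model

import "time"

// Vertex and Edge give an illustrative shape of the graph data:
// articles, tweets and hotwords are vertices; linking and tagging
// are edges. The vertex labels mirror those used in the queries
// below; the edge labels are examples.
type Vertex struct {
	Label string // e.g. "article:m", "hotword"
	Title string
	Time  time.Time
}

type Edge struct {
	From, To *Vertex
	Label    string
}
```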
## A Practical Example
Traversing a news graph, we can go from the hotword "targeted" to all items and articles from the past few days that are linked to it. I use Gremlin for querying. An example is shown below (some details omitted):

```
keyw="targeted"
_date="2021-02-10"

g.V().hasLabel('hotword').has('title',keyw).as("origin_hw").
  in().in().hasLabel('article:m').has('timestamp',gte(_date)).
  order().by('timestamp',asc).as('article').
  select("origin_hw","article").by(values('title','timestamp'))
```
The procedure above, summarized:

1. Find the node with the keyword "targeted"
2. Find all articles (for instance a tweet) that are two steps out from the keyword (since these may be linked via a content node)
3. Get the title and timestamp from the hotword and the article
Using a match, which was incidentally not a tweet but an article from an RSS feed, we find the following:

```
==>{origin_hw=targeted, article=WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK}
```
Retrieving the article with Gremlin, we can determine the source:

```
gremlin> g.V().has('title','WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK').valueMap()
==>{link=[https://www.reddit.com/r/netsec/.rss],
    title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK],
    src=[Reddit - NetSec],
    src_type=[rss],
    sha256=[8a285ce1b6d157f83d9469c06b6accaa514c794042ae7243056292d4ea245daf],
    added=[2021-02-12 10:42:16.640587 +0100 CET],
    timestamp=[2021-02-10 20:31:06 +0000 +0000],
    version=[1]}
==>{link=[http://www.reddit.com/r/Malware/.rss],
    title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK],
    src=[Reddit - Malware],
    src_type=[rss],
    sha256=[69737b754a7d9605d11aecff730ca3fc244c319f35174a7b37dd0d1846a823b7],
    added=[2021-02-12 10:41:48.510538 +0100 CET],
    timestamp=[2021-02-10 20:35:11 +0000 +0000],
    version=[1]}
```
In this instance the source was two Reddit posts about a targeted incident in China, which triggered the keyword in question as well as others. The same article additionally triggered a zero-day hotword.
## Summary

Through this post I have shown some key parts of how to build a feed aggregator that can scale to thousands of feeds on a single computer, with update times in seconds.

I have also given a brief view of how JanusGraph and similar systems can be used to model such data in a way that makes it possible to search, find and eventually stay up to date on information relevant to cyber security.

When in place, such a system may save hours per day, since the data is normalised and searchable in one place.