
Key Takeaways

  • It is possible to index and tag a large number of RSS, OTX and Twitter articles on limited computational power in seconds
  • Building logic around timestamps is complex
  • Structuring the resulting data in a graph is meaningful

Introduction

Today I am sharing some details about one of the multi-year projects I am running. The project motivation is:

To stay up to date on cyber security developments within days.

I didn't want a realtime alerting service, but an analysis tool to gather important fragments of data over time. These fragments make up the basis of my open source research. The curated information usually ends up on a channel such as an NNTP feed, sometimes with added comments.

My solution was to create a common interface to ingest and search content from third-party sources. Achieving this is difficult and requires some work, but I found it feasible.

Going through some basic research I found that much of what happens on the web eventually ends up (e.g. as a mention) in one of the following three places:

  1. OTX
  2. Twitter
  3. RSS

After some work I found that there were two things important to me in the first iteration:

  1. Being able to recognize the characteristics of the content
  2. Knowing the publish time of the data

The primary problem was thus to build a program that scales with a large number of feeds.

Going from there I built a prototype in Python, which I have since matured into a more performant Golang version. What follows is my experience from that work.

The tested components of the program I am currently running are:

  • Gofeed [1]
  • Badger [2]
  • Apache Janusgraph [3,4]
  • Apache Cassandra [5]
  • Go-Twitter [6]
  • Alienvault OTX API [7]
  • Araddon Dateparse [8]

[1] https://github.com/mmcdole/gofeed
[2] https://github.com/dgraph-io/badger
[3] https://janusgraph.org
[4] https://docs.janusgraph.org/basics/gremlin/
[5] https://cassandra.apache.org
[6] https://github.com/dghubble/go-twitter/twitter
[7] https://github.com/AlienVault-OTX/OTX-Go-SDK/src/otxapi
[8] https://github.com/araddon/dateparse

The Lesson of Guestimation: Not All Feeds Are Created Equal

Timestamps are perhaps among the more challenging things to interpret in a crawler and search engine. RSS is a loose standard, at least when it comes to implementation. This means that timestamps may vary: localized, invalid per the RFC standards, ambiguous, missing and so on. Much like the web otherwise. Luckily without JavaScript.

The goal is simply to recognize which timestamp is the most correct one. A feed may contain one form of timestamp, while a website may indicate another. To solve this I use and compare two levels of timestamping:

  • The feed's published and updated timestamps, plus each individual item's timestamps
  • The item's and the website's last-modified timestamps

Looking back, solving the first level of timestamping was straightforward. These timestamps are present in the feed, and for RSS the logic to build a list of timestamps looks like this:

// Collect the feed-level timestamps (published and
// updated) together with every item's timestamps,
// then elect the one most likely to be the newest.
var feedElectedTime time.Time
ts := make(map[string]string)
ts["published"] = feed.Published
ts["updated"] = feed.Updated

i := 0
for _, item := range feed.Items {
    ts[strconv.Itoa(i)] = item.Published
    i++
    ts[strconv.Itoa(i)] = item.Updated
    i++
}

feedElectedTime, _, err = tsGuestimate(ts, link, false)

The elected time can be compared with a previous feed checkpoint to avoid downloading all items again, as sketched below. Using the above logic I was also able to dramatically increase the success rate of the program, since it requires a valid timestamp. The tsGuestimate logic is something for a future post.
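
As an illustration, here is a minimal sketch of such a checkpoint comparison, assuming a Badger-backed [2] key/value store. The key layout, the function names and the v3 module path are my own assumptions, not necessarily what the program does:

package feed

import (
    "time"

    badger "github.com/dgraph-io/badger/v3" // module version assumed, see [2]
)

// shouldProcess reports whether a feed has anything newer than the stored
// checkpoint for it. The "checkpoint:<feed URL>" key layout is illustrative.
func shouldProcess(db *badger.DB, feedURL string, electedTime time.Time) (bool, error) {
    key := []byte("checkpoint:" + feedURL)
    process := true
    err := db.View(func(txn *badger.Txn) error {
        item, err := txn.Get(key)
        if err == badger.ErrKeyNotFound {
            return nil // no checkpoint yet: process the feed
        }
        if err != nil {
            return err
        }
        val, err := item.ValueCopy(nil)
        if err != nil {
            return err
        }
        prev, perr := time.Parse(time.RFC3339, string(val))
        if perr == nil && !electedTime.After(prev) {
            process = false // nothing newer since the last run
        }
        return nil
    })
    return process, err
}

// storeCheckpoint persists the elected time after a successful run.
func storeCheckpoint(db *badger.DB, feedURL string, electedTime time.Time) error {
    return db.Update(func(txn *badger.Txn) error {
        return txn.Set([]byte("checkpoint:"+feedURL),
            []byte(electedTime.Format(time.RFC3339)))
    })
}

If shouldProcess returns false the feed is skipped entirely, which is what keeps run times down to seconds even with thousands of feeds.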

Further, the item and website timestamps require a similar method, but in addition I found it an advantage to do an HTTP HEAD request to the destination URL and combine the result with the timestamps available from the feed. The central and important aspect here is to abort retrieval if an item already exists in the database; this dramatically speeds up each run.
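
A rough sketch of the website side of this, assuming the Last-Modified header is what we are after and using Araddon Dateparse [8] to cope with the varied formats (the function name and timeout are mine):

package feed

import (
    "net/http"
    "time"

    "github.com/araddon/dateparse" // [8]
)

// headLastModified issues a HEAD request to the item URL and returns the
// Last-Modified time if the server provides one that can be parsed.
func headLastModified(url string) (time.Time, bool) {
    client := &http.Client{Timeout: 10 * time.Second}
    req, err := http.NewRequest(http.MethodHead, url, nil)
    if err != nil {
        return time.Time{}, false
    }
    resp, err := client.Do(req)
    if err != nil {
        return time.Time{}, false
    }
    defer resp.Body.Close()

    lm := resp.Header.Get("Last-Modified")
    if lm == "" {
        return time.Time{}, false
    }
    t, err := dateparse.ParseAny(lm) // tolerant parsing of messy timestamps
    if err != nil {
        return time.Time{}, false
    }
    return t, true
}

The result is then weighed against the feed's own item timestamps in the same guestimation step as above.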

False timestamps are a problem. I noticed that some websites publish feeds with dynamic timestamps, meaning that every time you retrieve the feed it stamps the items with the time of the request. This obviously creates resource-intensive operations, since the whole feed is then at risk of being re-indexed on each run.
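
One way to defend against that is to not trust timestamps alone for deduplication. A sketch, assuming items are keyed by a hash of their stable fields (the example output further down does carry a sha256 per item, but the exact key scheme here is my assumption):

package feed

import (
    "crypto/sha256"
    "encoding/hex"

    badger "github.com/dgraph-io/badger/v3" // module version assumed, see [2]
)

// itemKey hashes the fields of an item that should not change between
// fetches, so a feed that rewrites its timestamps on every request still
// deduplicates correctly.
func itemKey(link, title, body string) string {
    h := sha256.Sum256([]byte(link + "\x00" + title + "\x00" + body))
    return hex.EncodeToString(h[:])
}

// alreadyIndexed checks the store before doing any further work on an item.
func alreadyIndexed(db *badger.DB, key string) (bool, error) {
    exists := false
    err := db.View(func(txn *badger.Txn) error {
        _, err := txn.Get([]byte(key))
        if err == nil {
            exists = true
            return nil
        }
        if err == badger.ErrKeyNotFound {
            return nil
        }
        return err
    })
    return exists, err
}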

Noise Reduction: Recognizing Content Characteristics

Retrieving content is possible in several ways. For recognizing the content I opted for regular expressions, and they have given me good coverage. This is also one of the benefits of curating articles: experience with questions such as "why did I miss this article?" evolves into a new iteration of the program input.

For instance, to stay on top of targeted cyber operations, I found that frequently used phrases in articles were "targeted attack" and "spear phishing". Based on that I deployed the following keyword search (a regular expression), which is applied to every new item ingested:

"targeted":"(?i)targeted\\satt|spear\\sp",

So a new article containing "targeted attack" in the body or title is tagged with the hotword "targeted". Another hotword could be "breach". A sketch of how such a table is applied follows below.
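
The map layout mirrors the entry above; the "breach" pattern and the function names are illustrative, not taken from the program:

package feed

import "regexp"

// hotwords maps a tag name to the regular expression that triggers it.
// The "targeted" entry is the one shown above; "breach" is a made-up example.
var hotwords = map[string]*regexp.Regexp{
    "targeted": regexp.MustCompile(`(?i)targeted\satt|spear\sp`),
    "breach":   regexp.MustCompile(`(?i)breach`),
}

// tagItem returns the hotwords triggered by an item's title or body.
func tagItem(title, body string) []string {
    var tags []string
    for name, re := range hotwords {
        if re.MatchString(title) || re.MatchString(body) {
            tags = append(tags, name)
        }
    }
    return tags
}

Each returned tag then corresponds to a hotword node that the item is linked to in the graph.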

Perhaps not surprisingly, this data can be modelled in a graph as follows.

Tweet ─> URL in tweet ┌─> Targeted
                      └─> Breach
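
For completeness, a sketch of how that structure could be written to Janusgraph from Go with the Apache TinkerPop gremlin-go driver. The driver is not in the component list above, and the server address, labels and edge names are illustrative only:

package main

import (
    gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

func main() {
    // Connect to a Gremlin Server fronting Janusgraph (address assumed).
    drc, err := gremlingo.NewDriverRemoteConnection("ws://localhost:8182/gremlin")
    if err != nil {
        panic(err)
    }
    defer drc.Close()
    g := gremlingo.Traversal_().WithRemote(drc)

    // Tweet -> URL in tweet -> hotword, as in the diagram above.
    err = <-g.AddV("article:m").Property("title", "example tweet").As("t").
        AddV("content").Property("link", "https://example.org/post").As("u").
        AddV("hotword").Property("title", "targeted").As("h").
        AddE("references").From("t").To("u").
        AddE("tagged").From("u").To("h").
        Iterate()
    if err != nil {
        panic(err)
    }
}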

A Practical Example

Traversing a news graph, we can go from the hotword "targeted" to all items and articles from the past days linked to that hotword.

I use Gremlin for querying. An example is shown below (some details omitted):

keyw="targeted"
_date="2021-02-10"
g.V().hasLabel('hotword').has('title',keyw).as("origin_hw").
  in().in().hasLabel('article:m').has('timestamp',gte(_date)).order().by('timestamp',asc).as('article').
  select("origin_hw","article").by(values('title','timestamp'))

The procedure above summarized:

  1. Find the node with the keyword "targeted"
  2. Find all articles (for instance a tweet) that are two steps out from the keyword (since these may be linked via a content node)
  3. Get title and timestamp from hotword and tweet

Using a match, which was incidentally not a tweet but an article from an RSS feed, we find the following:

==>{origin_hw=targeted, article=WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK}

Retrieving the article with Gremlin, we can decide the source:

gremlin > g.V().has('title','WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK').valueMap()


==>{link=[https://www.reddit.com/r/netsec/.rss], 
title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK], 
src=[Reddit - NetSec], 
src_type=[rss],
sha256=[8a285ce1b6d157f83d9469c06b6accaa514c794042ae7243056292d4ea245daf],
added=[2021-02-12 10:42:16.640587 +0100 CET],
timestamp=[2021-02-10 20:31:06 +0000 +0000], 
version=[1]}

==>{link=[http://www.reddit.com/r/Malware/.rss], 
title=[WINDOWS KERNEL ZERO-DAY EXPLOIT (CVE-2021-1732) IS USED BY BITTER APT IN TARGETED ATTACK], 
src=[Reddit - Malware], 
src_type=[rss],
sha256=[69737b754a7d9605d11aecff730ca3fc244c319f35174a7b37dd0d1846a823b7],
added=[2021-02-12 10:41:48.510538 +0100 CET],
timestamp=[2021-02-10 20:35:11 +0000 +0000],
version=[1]}

In this instance the source was two Reddit posts about a targeted incident in China, which triggered the keyword in question along with others. The posts additionally triggered a zero-day hotword.

Summary

Through this post I have shown some key parts of how to build a feed aggregator that can scale to thousands of feeds on a single computer, with update times in seconds.

I have also given a brief view of how Janusgraph and similar systems can be used to model such data in a way that makes it possible to search, find and eventually stay up to date on information relevant to cyber security.

Once in place, such a system may save hours per day, since the data is normalised and searchable in one place.