
Commit e20e541

feed docs: start howto, add TSV example
1 parent 231f815 commit e20e541


3 files changed

+149
-7
lines changed


docs/dev/adding-feeds.md

Lines changed: 120 additions & 7 deletions
@@ -1,5 +1,5 @@
11
<!-- comment
2-
SPDX-FileCopyrightText: 2015-2023 Sebastian Wagner, Filip Pokorný
2+
SPDX-FileCopyrightText: 2015-2021 nic.at GmbH, 2023 Filip Pokorný, 2025 Institute for Common Good Technology
33
SPDX-License-Identifier: AGPL-3.0-or-later
44
-->
55

@@ -30,12 +30,123 @@ Adding a feed doesn't necessarily require any programming experience. There are
3030
3131
If the data source uses an unusual way of distribution or a custom data format, it might be necessary to develop specialized bot(s) for this particular data source. Always try to use existing bots before you start developing your own. Please also consider extending an existing bot if your use case is close enough to its features. If you are unsure which way to take, open an [issue](https://github.com/certtools/intelmq/issues) and you will receive guidance.
3232
33+
## Howto
34+
35+
### Choosing the collector
36+
37+
### Choosing the parser
38+
39+
### Classification
40+
41+
### Other static fields
42+
43+
* Feed accuracy
44+
* TLP
45+
* Event Description
46+
* Target
47+
* Text
48+
* URL
49+
* Protocol
50+
* Application Protocol
51+
* Transport Protocol
52+
53+
## Example Feeds
54+
55+
### Simple List
56+
57+
As an example, let's add the - very simple - feed *Toxic IP Addresses (CIDR)* by StopForumSpam to the documentation. The data URL is https://www.stopforumspam.com/downloads/toxic_ip_cidr.txt and contains a list of IP Network Ranges in CIDR notation, separated by newlines.
58+
59+
As the resource is available via HTTP, we will use the [HTTP Collector](../user/bots.md#intelmq.bots.collectors.http.collector_http) for the data retrieval and [Generic CSV Parser](../user/bots.md#intelmq.bots.parsers.generic.parser_csv) for parsing.
60+
For the collector, we only specify the module to use (the HTTP collector, as seen in the bots documentation), an estimate of the feed accuracy (as it is a blacklist, not 100%, but still reasonably high), the resource URL to download, and the rate limit. The configuration below sets the rate limit to 86400 seconds (one day).
61+
62+
For the parser, we again specify the module name and the required parameter `columns` to map the input data field to the IntelMQ field `source.network`. Furthermore, we add some static field values which are the same for all data lines.
63+
64+
```
Stop Forum Spam:
  Toxic IP Addresses:
    description: IP Networks that are believed will only ever be used for abuse
    documentation: https://www.stopforumspam.com/downloads
    revision: 2025-09-21
    public: true
    bots:
      collector:
        module: intelmq.bots.collectors.http.collector_http
        parameters:
          accuracy: 80
          http_url: https://www.stopforumspam.com/downloads/toxic_ip_cidr.txt
          rate_limit: 86400
      parser:
        module: intelmq.bots.parsers.generic.parser_csv
        parameters:
          columns: source.network
          default_fields:
            classification.type: blacklist
            protocol.application: http
            protocol.transport: tcp
            event_description.target: web forums
            event_description.text: web forum spam
            event_description.url: https://www.stopforumspam.com/
            tlp: white
```
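
To make the effect of this configuration more concrete, here is a minimal sketch in plain Python of what the collector/parser pair conceptually produces: one event per input line, with the single column mapped to `source.network` and the static `default_fields` merged in. It does not use IntelMQ's own classes. The helper name `parse_cidr_list` and the sample networks are hypothetical, and skipping blank lines is an assumption about the input.

```python
import ipaddress

# Static values copied from the feed configuration above.
DEFAULT_FIELDS = {
    "classification.type": "blacklist",
    "protocol.application": "http",
    "protocol.transport": "tcp",
    "event_description.target": "web forums",
    "event_description.text": "web forum spam",
    "event_description.url": "https://www.stopforumspam.com/",
    "tlp": "white",
}


def parse_cidr_list(raw: str) -> list[dict]:
    """Turn a newline-separated CIDR list into event-like dicts (hypothetical helper)."""
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:  # assumption: blank lines carry no data
            continue
        # Validate the value; raises ValueError for malformed networks.
        network = ipaddress.ip_network(line, strict=False)
        events.append({"source.network": str(network), **DEFAULT_FIELDS})
    return events


# Documentation-range sample networks, not real feed data.
sample = "192.0.2.0/24\n198.51.100.0/25\n"
events = parse_cidr_list(sample)
```

Every resulting dictionary differs only in `source.network`, which is why the remaining values can live in `default_fields` instead of the data itself.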
91+
92+
### TSV document
93+
94+
As a next example, let's add a feed for https://hole.cert.pl/domains/v2/domains.csv (7 MB).
95+
Contrary to its file name ending, the separator is not a comma, but a tab character.
96+
The file contains four columns:
97+
```
PozycjaRejestru AdresDomeny DataWpisu DataWykreslenia
285107 0-1-x.06215785.xyz 2025-04-02T09:02:19+00:00
332655 d15k2d11r6t6rl.cloudfront.net 2025-06-12T17:06:08+00:00 2025-06-13T13:54:55+00:00
[...]
```
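
Because the separator is a tab, any CSV reader has to be told so explicitly. The following is a minimal sketch using Python's standard `csv` module on the sample rows above. It assumes that the columns are separated by single tab characters, that the first line is a header, and that a missing deletion date appears as an empty trailing field.

```python
import csv
import io

# Sample taken from the excerpt above; tab separation is an assumption.
SAMPLE = (
    "PozycjaRejestru\tAdresDomeny\tDataWpisu\tDataWykreslenia\n"
    "285107\t0-1-x.06215785.xyz\t2025-04-02T09:02:19+00:00\t\n"
    "332655\td15k2d11r6t6rl.cloudfront.net\t2025-06-12T17:06:08+00:00"
    "\t2025-06-13T13:54:55+00:00\n"
)

reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t")
header = next(reader)  # first line holds the Polish column names
rows = list(reader)

for row in rows:
    print(row)
```

The Generic CSV Parser offers comparable options for the delimiter and for skipping a header line; check the bots documentation for the exact parameter names before relying on them.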
103+
104+
The feed's description is at https://cert.pl/en/warning-list/ and it says the list of blocked domains is updated about every 5 minutes. In IntelMQ we usually don't need such high refresh rates, but setting the rate limit to half an hour is reasonable for most use cases.
105+
The list is compiled automatically and is intended for warnings, so we assume a lower accuracy.
106+
As the description says the listed domains are websites, we can again assume the protocol is HTTP/TCP. Although the list is about phishing websites, its use case is a warning list/blacklist, and therefore the classification is `blacklist`. In the event description we explain the kind of blacklist.
107+
The most crucial part is the mapping of the columns to IntelMQ fields. In this case, the column names are given in Polish.
108+
- `PozycjaRejestru`: Position in the register. IntelMQ does not need this value, so we only keep it as `extra.certpl_register`.
109+
- `AdresDomeny`: The domain address, which lands in `source.fqdn`. This is the information we care about.
110+
- `DataWpisu`: The date of entry, and
111+
- `DataWykreslenia`: The date of deletion
112+
- This is a tricky situation, as we have no clear indication of the time at which the information is current. Based on the feed description, if the deletion date is not present, the time of fetching the data (`time.observation`) is closest to the meaning of `time.source`.
113+
- Mapping this correctly would therefore require a custom parser or a downstream expert instead of the Generic CSV Parser.
114+
- For simplicity, we map these columns to `extra.first_seen` and `extra.expiration_date`. Both fields are already in use by other bots and feeds.
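
The column mapping and the open question about `time.source` can be sketched as follows. This is a hypothetical illustration in plain Python, not an existing IntelMQ bot: the helper `map_row` is invented here, and leaving `time.source` unset for removed entries is an assumption, since the text above only covers the case of a missing deletion date.

```python
from datetime import datetime, timezone
from typing import Optional

# Column order as used in the feed configuration.
COLUMNS = [
    "extra.certpl_register",
    "source.fqdn",
    "extra.first_seen",
    "extra.expiration_date",
]


def map_row(row: list[str], observation: Optional[datetime] = None) -> dict:
    """Map one feed row to IntelMQ-style field names (hypothetical helper)."""
    observation = observation or datetime.now(timezone.utc)
    # Drop empty values, for example a missing deletion date.
    event = {field: value for field, value in zip(COLUMNS, row) if value}
    event["time.observation"] = observation.isoformat()
    if "extra.expiration_date" not in event:
        # Entry is still listed: the fetch time is the closest match for time.source.
        event["time.source"] = event["time.observation"]
    # Assumption: for removed entries no rule is given, so time.source stays unset.
    return event


fetched = datetime(2025, 9, 23, 12, 0, tzinfo=timezone.utc)
active = map_row(
    ["285107", "0-1-x.06215785.xyz", "2025-04-02T09:02:19+00:00", ""], fetched
)
removed = map_row(
    [
        "332655",
        "d15k2d11r6t6rl.cloudfront.net",
        "2025-06-12T17:06:08+00:00",
        "2025-06-13T13:54:55+00:00",
    ],
    fetched,
)
```

With the Generic CSV Parser alone, only the static mapping in `COLUMNS` is applied; the conditional part is what a custom parser or downstream expert would have to add.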
115+
116+
```yaml
CERT.PL:
  Hole Domains v2:
    description: Dangerous websites Warning List
    documentation: https://cert.pl/en/warning-list/
    revision: 2025-09-23
    public: true
    bots:
      collector:
        module: intelmq.bots.collectors.http.collector_http
        parameters:
          accuracy: 50
          rate_limit: 1800
          http_url: https://hole.cert.pl/domains/v2/domains.csv
      parser:
        module: intelmq.bots.parsers.generic.parser_csv
        parameters:
          columns: extra.certpl_register,source.fqdn,extra.first_seen,extra.expiration_date
          default_fields:
            classification.type: blacklist
            protocol.application: http
            protocol.transport: tcp
            event_description.target: users
            event_description.text: phishing
            event_description.url: https://cert.pl/en/warning-list/
            tlp: white
```
143+
33144
## Feeds Wishlist
34145

35146
This is a list with potentially interesting data sources, which are either currently not supported or the usage is not clearly documented in IntelMQ. If you want to **contribute** new feeds to IntelMQ, this is a great place to start!
36147

37148
!!! note
38-
Some of the following data sources might better serve as an expert bot for enriching processed events.
149+
Some of the following data sources might also serve as an expert bot for enriching processed events.
39150

40151
- Lists of feeds:
41152
- [threatfeeds.io](https://threatfeeds.io)
@@ -48,6 +159,7 @@ This is a list with potentially interesting data sources, which are either curre
48159
- Some third party intelmq bots: [NRDCS IntelMQ fork](https://github.com/NRDCS/intelmq/tree/certlt/intelmq/bots)
49160
- List of potentially interesting data sources:
50161
- [Abuse.ch SSL Blacklists](https://sslbl.abuse.ch/blacklist/)
162+
- [aa419 Fake Banks List](https://db.aa419.org/fakebankslist.php)
51163
- [AbuseIPDB](https://www.abuseipdb.com/pricing)
52164
- [Adblock Plus](https://adblockplus.org/en/subscriptions)
53165
- [apivoid IP Reputation API](https://www.apivoid.com/api/ip-reputation/)
@@ -75,17 +187,18 @@ This is a list with potentially interesting data sources, which are either curre
75187
- [Google Webmaster Alerts](https://www.google.com/webmasters/)
76188
- [GPF Comics DNS Blacklist](https://www.gpf-comics.com/dnsbl/export.php)
77189
- [Greensnow](https://blocklist.greensnow.co/greensnow.txt)
78-
- [Greynoise](https://developer.greynoise.io/reference/community-api)
190+
- [Greynoise](https://docs.greynoise.io/docs/using-the-greynoise-community-api)
79191
- [HP Feeds](https://github.com/rep/hpfeeds)
80192
- [IBM X-Force Exchange](https://exchange.xforce.ibmcloud.com/)
81193
- [ImproWare AntiSpam](https://antispam.imp.ch/)
82194
- [ISightPartners](http://www.isightpartners.com/)
83195
- [James Brine](https://jamesbrine.com.au/)
84196
- [Joewein](http://www.joewein.net)
85197
- Maltrail:
86-
- [Malware](https://github.com/stamparm/maltrail/tree/master/trails/static/images/malware)
87-
- [Suspicious](https://github.com/stamparm/maltrail/tree/master/trails/static/images/suspicious)
88-
- [Mass Scanners](https://github.com/stamparm/maltrail/blob/master/trails/static/images/mass_scanner.txt)
198+
- [Malware](https://github.com/stamparm/maltrail/tree/master/trails/static/malware)
199+
- [Suspicious](https://github.com/stamparm/maltrail/tree/master/trails/static/suspicious)
200+
- [Malicious](https://github.com/stamparm/maltrail/tree/master/trails/static/malicious)
201+
- [Mass Scanners](https://github.com/stamparm/maltrail/blob/master/trails/static/mass_scanner.txt)
89202
(for whitelisting)
90203
- [Malshare](https://malshare.com/)
91204
- [MalSilo Malware URLs](https://malsilo.gitlab.io/feeds/dumps/url_list.txt)
@@ -109,7 +222,7 @@ This is a list with potentially interesting data sources, which are either curre
109222
- [SANS ISC](https://isc.sans.edu/api/)
110223
- [ShadowServer Sandbox API](http://www.shadowserver.org/wiki/pmwiki.php/Services/Sandboxapi)
111224
- [Shodan search API](https://shodan.readthedocs.io/en/latest/tutorial.html#searching-shodan)
112-
- [Snort](http://labs.snort.org/feeds/ip-filter.blf)
225+
- [Snort](https://www.snort.org/downloads/ip-block-list)
113226
- [stopforumspam Toxic IP addresses and domains](https://www.stopforumspam.com/downloads)
114227
- [Spamhaus Botnet Controller List](https://www.spamhaus.org/bcl/)
115228
- [StevenBlack Hosts File](https://github.com/StevenBlack/hosts)

docs/dev/bot-development.md

Lines changed: 4 additions & 0 deletions
@@ -404,6 +404,10 @@ IntelMQ uses Python's type hints/type annotations where possible.
404404
For doc strings, we are using the
405405
[sphinx-napoleon-google-type-annotation](http://www.sphinx-doc.org/en/stable/ext/napoleon.html#type-annotations) where applicable.
406406

407+
#### Bot documentation
408+
409+
#### Feed documentation
410+
407411
## Getting the code upstream
408412

409413
Entry to the change log and news files

intelmq/etc/feeds.yaml

Lines changed: 25 additions & 0 deletions
@@ -1721,3 +1721,28 @@ providers:
17211721
event_description.text: web forum spam
17221722
event_description.url: https://www.stopforumspam.com/
17231723
tlp: white
1724+
CERT.PL:
1725+
Hole Domains v2:
1726+
description: Dangerous websites Warning List
1727+
documentation: https://cert.pl/en/warning-list/
1728+
revision: 2025-09-23
1729+
public: true
1730+
bots:
1731+
collector:
1732+
module: intelmq.bots.collectors.http.collector_http
1733+
parameters:
1734+
accuracy: 50
1735+
rate_limit: 1800
1736+
http_url: https://hole.cert.pl/domains/v2/domains.csv
1737+
parser:
1738+
module: intelmq.bots.parsers.generic.parser_csv
1739+
parameters:
1740+
columns: extra.certpl_register,source.fqdn,extra.first_seen,extra.expiration_date
1741+
default_fields:
1742+
classification.type: blacklist
1743+
protocol.application: http
1744+
protocol.transport: tcp
1745+
event_description.target: users
1746+
event_description.text: phishing
1747+
event_description.url: https://cert.pl/en/warning-list/
1748+
tlp: white
