Hi,
I would like to use the inbloom library to create a Bloom filter for the Alexa top 1 million domain list. When I dump the Bloom filter to a file and then load it back, I always get the following error:
$ python test_inbloom.py
Traceback (most recent call last):
  File "test_inbloom.py", line 34, in <module>
    bf = inbloom.load(base64.b64decode(data))
inbloom.error: invalid data length
It seems like I'm running into this error clause: https://github.com/EverythingMe/inbloom/blob/master/py/inbloom/inbloom.c#L221
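One possible cause worth checking, stated here as an unverified assumption rather than a confirmed diagnosis: if the serialized header stores the inverse of the error rate (1/error) in an unsigned 16-bit field, then very small error rates would not fit, and the length computed on load would no longer match the dumped data. The sketch below only does the arithmetic for the two error rates mentioned in the test script (0.0001 and 0.00001). The helper name `inverse_error_fits_uint16` is hypothetical and not part of inbloom.

```python
# Assumption (not confirmed against the inbloom source): the dump header
# keeps round(1 / error) in an unsigned 16-bit field.
UINT16_MAX = 0xFFFF


def inverse_error_fits_uint16(error):
    """Return (inverse, fits), where inverse = round(1 / error)."""
    inverse = round(1.0 / error)
    return inverse, inverse <= UINT16_MAX


for error in (0.0001, 0.00001):
    inverse, fits = inverse_error_fits_uint16(error)
    print(error, inverse, "fits" if fits else "does not fit")
```

If the assumption holds, an error rate of 0.0001 would survive a round trip while 0.00001 would not, which could be tested by rerunning the script with the larger value.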
My test script looks like this:
import requests
import sys
import csv
import base64
import zipfile
import inbloom
from io import BytesIO, TextIOWrapper

ALEXA_URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
FP_RATIO = 0.00001  # 0.0001 -> 2.3MB bloom filter file, 0.00001 -> 2.9MB bloom filter file

if __name__ == "__main__":
    alexa_inbloom = None
    response = requests.get(ALEXA_URL)
    if not response or response.status_code != 200:
        sys.exit(-1)
    archive = zipfile.ZipFile(BytesIO(response.content))
    file = archive.open("top-1m.csv")
    with TextIOWrapper(file, encoding="utf-8") as text_file:
        reader = csv.reader(text_file)
        alexa_inbloom = inbloom.Filter(entries=1000000, error=FP_RATIO)
        for row in reader:
            alexa_inbloom.add(row[1].lower())
    assert alexa_inbloom.contains("youtube.com")
    with open("alexa.inbloom", "wb") as f:
        data = base64.b64encode(inbloom.dump(alexa_inbloom))
        f.write(data)
    with open("alexa.inbloom", "rb") as f:
        data = f.read()
    bf = inbloom.load(base64.b64decode(data))
    assert bf.contains("youtube.com")
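To rule out the base64 and file handling in the script as the source of the length mismatch, the same write/read/decode steps can be run on arbitrary bytes without inbloom. This is a minimal sketch: the random payload is only a stand-in for the output of `inbloom.dump`, and `roundtrip_through_file` is a hypothetical helper name.

```python
import base64
import os
import tempfile


def roundtrip_through_file(payload):
    """Write base64(payload) to a temp file, read it back, and decode it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.b64")
        with open(path, "wb") as f:
            f.write(base64.b64encode(payload))
        with open(path, "rb") as f:
            return base64.b64decode(f.read())


payload = os.urandom(1024)  # stand-in for the bytes returned by inbloom.dump
restored = roundtrip_through_file(payload)
print(len(payload) == len(restored), payload == restored)
```

If the restored bytes match the payload, the encoding and file steps preserve the data, which would point at the dump or load call itself.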
May I ask you to have a look please?
Thanks,
Konstantin