(Trying to) Crawl HackerNews

January 29, 2020

As part of our Information Retrieval project, we are tasked with building a search engine for a specific domain. After some discussion, we decided to build a search engine for HackerNews comments, which I don't think exists yet. The current search (Algolia) only works for post titles and URLs.

Earlier today I was trying to do the first step, which is to crawl the data from HackerNews. There are more than 22 million items (stories and comments) on the site. I'll just play around with it first.

Thankfully, HackerNews provides an API for its data, so we don't need to scrape it manually. I just wrote a simple Python script using requests. But then I realized that if I want to fetch, say, 10,000 items sequentially, I have to pay the round-trip latency 10,000 times, and each request takes roughly 1 second.
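
Just to illustrate, the sequential version is something like this (a rough sketch, not my exact script; the item ID range is arbitrary):

import requests

BASE = 'https://hacker-news.firebaseio.com/v0'

items = []
# one request at a time: each one waits ~1s for the previous to finish,
# so 10,000 items means hours of waiting
for item_id in range(1, 10001):
    response = requests.get(f'{BASE}/item/{item_id}.json')
    if response.status_code == 200:
        items.append(response.json())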

So I looked for a way to parallelize the requests. I didn't feel like using the multiprocessing or threading library for this because it seemed like overkill, so I found grequests. But somehow I got a recursion error; I guess the redirection from http to https was not handled properly. Then it turned out that the grequests README itself suggests using another library.
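
For the record, my grequests attempt looked roughly like this (a sketch from memory, not the exact failing code):

import grequests

BASE = 'https://hacker-news.firebaseio.com/v0'

# build the requests lazily, then fire them all concurrently via gevent
urls = (f'{BASE}/item/{item_id}.json' for item_id in range(1, 5001))
responses = grequests.map(grequests.get(url) for url in urls)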

The requests-threads API seems awkward, while https://github.com/ross/requests-futures is just perfect. It is so simple to use:

from requests_futures.sessions import FuturesSession

session = FuturesSession()
# first request is started in background
future_one = session.get('http://httpbin.org/get')
# second request is started immediately
future_two = session.get('http://httpbin.org/get?foo=bar')
# wait for the first request to complete, if it hasn't already
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
# wait for the second request to complete, if it hasn't already
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)

It does pretty well: 5000 requests take around 2 minutes. Then I found out that we can set the number of workers.

from requests_futures.sessions import FuturesSession
session = FuturesSession(max_workers=10)

They say the default is 8. So I tried arbitrarily bumping it to 1000, and got an error saying Failed to establish a new connection.
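
By the way, if you push the worker count and some requests fail, the exception only shows up when you call .result() on the future, so it can be caught per request instead of crashing the whole run. A small sketch of that (the item range and worker count here are arbitrary):

import logging

import requests
from requests_futures.sessions import FuturesSession

BASE = 'https://hacker-news.firebaseio.com/v0'
logging.basicConfig(level=logging.INFO)

session = FuturesSession(max_workers=200)
futures = [session.get(f'{BASE}/item/{item_id}.json') for item_id in range(1, 501)]

for future in futures:
    try:
        # result() re-raises whatever exception the request hit in its worker
        response = future.result()
    except requests.exceptions.ConnectionError as exc:
        logging.error(f'connection failed: {exc}')
        continue
    logging.info(response.status_code)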

I don't know where the exact constraint is. I guess it is at least affected by latency (more responses are left waiting when latency is higher). After some trial and error, 100 max_workers seems to work reliably on my machine. And now the 5000 requests are down to 19 seconds. Yay!
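
The trial and error was basically timing the same batch of URLs with different worker counts, something like this (the worker values and item IDs are just for illustration):

import time

from requests_futures.sessions import FuturesSession

BASE = 'https://hacker-news.firebaseio.com/v0'
urls = [f'{BASE}/item/{item_id}.json' for item_id in range(1, 1001)]

for max_workers in (8, 50, 100, 200):
    session = FuturesSession(max_workers=max_workers)
    start = time.perf_counter()
    futures = [session.get(url) for url in urls]
    for future in futures:
        future.result()  # block until this request finishes
    elapsed = time.perf_counter() - start
    print(f'{max_workers} workers: {elapsed:.1f}s for {len(urls)} requests')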

This is the final MVP script.

import json
import logging
import random

import requests
import tqdm
from requests_futures.sessions import FuturesSession

BASE = 'https://hacker-news.firebaseio.com/v0'
GOOD_STORY_MINIMUM_VOTE = 100

logging.basicConfig(level=logging.INFO)


if __name__ == "__main__":
    # the largest item id currently on HackerNews
    max_id = int(requests.get(f'{BASE}/maxitem.json').text)

    random_good_items = []

    # sample 5000 random item ids and build their API URLs
    urls = (f"{BASE}/item/{item_id}.json" for item_id in random.sample(range(1, max_id+1), 5000))

    session = FuturesSession(max_workers=100)
    # fire off all requests at once; the workers handle them concurrently
    waiting_responses = [session.get(url) for url in urls]

    for waiting_response in tqdm.tqdm(waiting_responses):
        response = waiting_response.result()
        if response.status_code != 200:
            logging.error(response)
            continue

        content = json.loads(response.text)
        # only stories have a "score"; keep the well-voted ones
        if "score" in content and content["score"] >= GOOD_STORY_MINIMUM_VOTE:
            random_good_items.append(content)

    for item in random_good_items:
        logging.info(item)
    logging.info(f"{len(random_good_items)} out of 5000 items")

You'll need to install requests, tqdm, and requests-futures in your Python environment.

(I didn't know at first that just collecting data from an API could be this complicated!)