Tokenizing HN Text


February 2, 2020

Today, as a continuation of my previous blogs (on tokenization and crawling), I am carrying on with my information retrieval school project. This time I am focusing on the tokenization part.

Problem with Formatting

The first apparent problem is the formatting of HN (HackerNews) text. It turns out they allow some rich text formatting (or at least that's how they store it). For example: 1968: Volume One of Knuth&#x27;s TAOCP: <a href="http:&#x2F;&#x2F;www-cs-faculty.stanford.edu&#x2F;~knuth&#x2F;brochure.pdf" rel="nofollow">http:&#x2F;&#x2F;www-cs-faculty.stanford.edu&#x2F;~knuth&#x2F;brochure.pdf</a><p>Planned to run to 7 volumes at that time, and volume 5 is currently &quot;in preparation&quot;.

If we tokenize it directly using the out-of-the-box nltk.word_tokenize, we will get: 1968 : Volume One of Knuth & # x27 ; s TAOCP : < a href= '' http : & # x2F ; & # x2F ; www-cs-faculty.stanford.edu & # x2F ; ~knuth & # x2F ; brochure.pdf '' rel= '' nofollow '' > http : & # x2F ; & # x2F ; www-cs-faculty.stanford.edu & # x2F ; ~knuth & # x2F ; brochure.pdf < /a > < p > Planned to run to 7 volumes at that time , and volume 5 is currently & quot ; in preparation & quot ; .

It seems that HN stores the HTML representation of the formatted text. After a quick check, it turns out they don't allow much fancy formatting, so I guess the rich text is just the HTML representation. So I need to convert it to a plain text string.


Using BeautifulSoup seems like a good option here. The text above is converted to the python string 1968: Volume One of Knuth's TAOCP: http://www-cs-faculty.stanford.edu/~knuth/brochure.pdfPlanned to run to 7 volumes at that time, and volume 5 is currently "in preparation". (note how the link text and the following <p> paragraph get glued into "brochure.pdfPlanned"). And if we use the nltk word tokenizer again:

1968 : Volume One of Knuth 's TAOCP : http : //www-cs-faculty.stanford.edu/~knuth/brochure.pdfPlanned to run to 7 volumes at that time , and volume 5 is currently " in preparation " .
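For reference, here is a minimal sketch of that conversion step. The input is a shortened version of the comment above, with a made-up example.com URL standing in for the real link:

```python
from bs4 import BeautifulSoup

raw = ("1968: Volume One of Knuth&#x27;s TAOCP: "
       '<a href="http:&#x2F;&#x2F;example.com&#x2F;brochure.pdf" rel="nofollow">'
       "http:&#x2F;&#x2F;example.com&#x2F;brochure.pdf</a>"
       "<p>Planned to run to 7 volumes at that time, and volume 5 is "
       "currently &quot;in preparation&quot;.")

soup = BeautifulSoup(raw, "html.parser")
text = soup.get_text()

# entities are decoded and tags are dropped; note how the link text and
# the <p> content end up glued together ("...brochure.pdfPlanned...")
print(text)
```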

So yeah, the formatting issue can be considered solved.

Problem with word_tokenize

word_tokenize cannot be perfect for every use case at once. What if I want the complete url, with the http:// included, as a single token? What if I want Knuth's as a single token? Looking into how word_tokenize works in the documentation, it turns out that it first splits the text into sentences using Punkt and then runs the Treebank word tokenizer, which intelligently tokenizes you're to you 're, they've to they 've, and other tricks.

Simple wordpunct

Contrast word_tokenize, which builds on Punkt, with the simple wordpunct_tokenize, which just uses the regex pattern \w+|[^\w\s]+ to recognize a token (a run of word characters, i.e. alphanumerics and underscore, or a run of non-word, non-space characters). On the same text it gives: 1968 : Volume One of Knuth ' s TAOCP : http :// www - cs - faculty . stanford . edu /~ knuth / brochure . pdfPlanned to run to 7 volumes at that time , and volume 5 is currently " in preparation ".
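The pattern itself is easy to replicate with the standard re module, which makes the behavior obvious. This is just an illustration of the pattern on a made-up example.com URL, not nltk's internals:

```python
import re

# the pattern wordpunct_tokenize is built on: runs of word characters
# (alphanumerics and underscore), or runs of anything that is neither
# a word character nor whitespace
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

print(WORDPUNCT.findall("Knuth's TAOCP: http://example.com/~knuth"))
# ['Knuth', "'", 's', 'TAOCP', ':', 'http', '://', 'example', '.', 'com', '/~', 'knuth']
```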


I still don't know what Punkt means or stands for (it's apparently German for "point"). As for how exactly it works, nltk says:

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. Furthermore: PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.

I'm still not really sure what they mean by this. They point to a 42-page article:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.

which can be accessed online. But I don't think I have time to read it.

So my current best explanation is that it just tries to improve the tokenization based on the context of each language domain; how exactly it does that is something to follow up on later.

Regexp Tokenizer

I forgot to mention why I need to bother with the internals of Punkt. I need it because we might want to alter some behavior of the default tokenizer. For example:

>>> nltk.word_tokenize('--enable-optimization')
['--', 'enable-optimization']

>>> nltk.word_tokenize('f(x)')
['f', '(', 'x', ')']

>>> nltk.word_tokenize('lambda x -> x**2')
['lambda', 'x', '-', '>', 'x**2']

which probably is not the desired behavior.

So instead of tinkering with Punkt, I think an easier solution is to use an additional tokenizer based on a regex (nltk's RegexpTokenizer). For example, adding the pattern -+> (matching ->, -->, --->, etc.) as a valid token, so that lambda x -> x**2 will be tokenized as lambda x - > x**2 ->. This means that we will have more tokens, with some duplication.
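A quick sketch of the extra tokenizer with just the arrow pattern; its output would then be appended to the word_tokenize tokens:

```python
import nltk

# a second tokenizer that only recognizes arrow tokens: ->, -->, --->, ...
arrow_tokenizer = nltk.RegexpTokenizer(r"-+>")

extra_tokens = arrow_tokenizer.tokenize("lambda x -> x**2 --> y")
print(extra_tokens)  # ['->', '-->']
```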

Alternative: using mwe (multi word expression)

If we really need to remove the duplicated tokens caused by the extra regex tokenizer, there's actually an alternative: the mwe (multi word expression) tokenizer. It works on an existing list of tokens and tries to merge them whenever a recognized mwe appears. For example, just a little bit more is tokenized to just a little bit more. If we add a little bit as an mwe, we can pass the tokens through it and it will merge the mwe tokens into just a_little_bit more. But this does not come without caveats: it seems we can't use a regex pattern as an mwe, so a pattern like -+> can't be expressed as one.
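A small sketch of how MWETokenizer behaves on an already-tokenized list:

```python
from nltk.tokenize import MWETokenizer

# register "a little bit" as a multi-word expression; matched spans are
# merged into one token joined with "_" (the default separator)
mwe_tokenizer = MWETokenizer([("a", "little", "bit")])

print(mwe_tokenizer.tokenize(["just", "a", "little", "bit", "more"]))
# ['just', 'a_little_bit', 'more']
```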

Final prototype script

import json
import random

import nltk
import requests
import tqdm
from requests_futures.sessions import FuturesSession
from bs4 import BeautifulSoup

BASE = 'https://hacker-news.firebaseio.com/v0'


def fetch_random_items(item_count=100):
    max_id = int(requests.get(f'{BASE}/maxitem.json').text)
    items = []

    urls = [f"{BASE}/item/{item_id}.json" for item_id in random.sample(range(1, max_id+1), item_count)]

    session = FuturesSession(max_workers=30)
    waiting_responses = [session.get(url) for url in urls]

    for waiting_response in tqdm.tqdm(waiting_responses):
        response = waiting_response.result()
        if response.status_code != 200:
            continue
        content = json.loads(response.text)
        if content is None:
            raise Exception(response.text)
        items.append(content)

    return items

if __name__ == "__main__":
    # extra patterns to recognize as tokens, e.g. arrows: ->, -->, --->
    other_token_patterns = [
        r'-+>',
    ]
    combined_pattern = "|".join(other_token_patterns)
    regex_tokenizer = nltk.RegexpTokenizer(combined_pattern)
    items = fetch_random_items(100)
    for item in items:
        if 'text' in item:
            corpus = item['text']
        elif 'title' in item:
            corpus = item['title']
        else:
            continue

        soup = BeautifulSoup(corpus, 'html.parser')
        tokens = nltk.word_tokenize(soup.text)
        tokens += regex_tokenizer.tokenize(soup.text)
        print(tokens)


(to be continued.)