Today, continuing from the previous blog posts (tokenization and crawling), I am carrying on with my information retrieval school project. This time I am focusing on the tokenization part.
Problem with Formatting
The first apparent problem is the formatting of HN (Hacker News) text. It turns out they allow some rich text formatting (or at least that's how they store it). For example:
1968: Volume One of Knuth's TAOCP: <a href="http://www-cs-faculty.stanford.edu/~knuth/brochure.pdf" rel="nofollow">http://www-cs-faculty.stanford.edu/~knuth/brochure.pdf</a><p>Planned to run to 7 volumes at that time, and volume 5 is currently "in preparation".
The original link is here.
If we directly tokenize it with the out-of-the-box nltk.word_tokenize (the exact call is shown after the output), we get:
1968
:
Volume
One
of
Knuth
&
#
x27
;
s
TAOCP
:
<
a
href=
''
http
:
&
#
x2F
;
&
#
x2F
;
www-cs-faculty.stanford.edu
&
#
x2F
;
~knuth
&
#
x2F
;
brochure.pdf
''
rel=
''
nofollow
''
>
http
:
&
#
x2F
;
&
#
x2F
;
www-cs-faculty.stanford.edu
&
#
x2F
;
~knuth
&
#
x2F
;
brochure.pdf
<
/a
>
<
p
>
Planned
to
run
to
7
volumes
at
that
time
,
and
volume
5
is
currently
&
quot
;
in
preparation
&
quot
;
.
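(For reference, that output comes from a call along the lines of the snippet below. raw_hn_text is just my name for the HN string shown above, and depending on the nltk version you need to download the punkt tokenizer data once before word_tokenize works.)

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models word_tokenize relies on

tokens = nltk.word_tokenize(raw_hn_text)  # raw_hn_text: the HTML-ish HN string shown above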
It seems that HN stores the HTML representation of the formatted text. A quick check on https://news.ycombinator.com/formatdoc shows that they don't allow much fancy formatting, so I guess the rich text is just the HTML representation. So I need to convert it to a plain text string.
bs4
Using BeautifulSoup seems like a good option. The text above is converted to the Python string 1968: Volume One of Knuth's TAOCP: http://www-cs-faculty.stanford.edu/~knuth/brochure.pdfPlanned to run to 7 volumes at that time, and volume 5 is currently "in preparation".
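A minimal sketch of that conversion (the string and variable names are just for illustration, with example.com as a stand-in URL). Note how get_text() also decodes the HTML entities, and how the text of the <p> tag gets glued onto the URL, which is why brochure.pdfPlanned shows up as one word:

from bs4 import BeautifulSoup

html_text = 'Knuth&#x27;s TAOCP: <a href="http://example.com" rel="nofollow">http://example.com</a><p>Planned to run to 7 volumes'
plain_text = BeautifulSoup(html_text, 'html.parser').get_text()
print(plain_text)  # Knuth's TAOCP: http://example.comPlanned to run to 7 volumes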
And if we use the nltk word tokenizer again:
1968
:
Volume
One
of
Knuth
's
TAOCP
:
http
:
//www-cs-faculty.stanford.edu/~knuth/brochure.pdfPlanned
to
run
to
7
volumes
at
that
time
,
and
volume
5
is
currently
"
in
preparation
"
.
So yeah, the formatting issue can be considered solved.
Problem with word_tokenize
word_tokenize cannot be perfect for every use case at once. What if I want the complete URL, with the http:// included, as a single token? What if I want Knuth's as a single token? Looking more into how word_tokenize works in the documentation, it turns out that word_tokenize uses Punkt as the tokenizer, so it intelligently splits you're into you and 're, they've into they and 've, and applies other such tricks.
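A quick check in the interpreter shows that behavior (a small example of my own, not from the docs):

>>> import nltk
>>> nltk.word_tokenize("you're right, they've left")
['you', "'re", 'right', ',', 'they', "'ve", 'left']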
Simple wordpunct
Contrast word_tokenize, which uses Punkt, with the simpler wordpunct_tokenize, which just uses the regex pattern \w+|[^\w\s]+ to recognize tokens (either a run of word characters, i.e. alphanumerics and underscores, or a run of its counterpart, non-word non-space characters). On the same text it gives the following (a quick example call follows the output):
1968
:
Volume
One
of
Knuth
'
s
TAOCP
:
http
://
www
-
cs
-
faculty
.
stanford
.
edu
/~
knuth
/
brochure
.
pdfPlanned
to
run
to
7
volumes
at
that
time
,
and
volume
5
is
currently
"
in
preparation
".
Punkt
I still don't know what Punkt means or stands for. Regarding how exactly it works, nltk says:
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. Furthermore: PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
I'm still not really sure what they mean by all this. They point to a 42-page article:
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
Boundary Detection. Computational Linguistics 32: 485-525.
which can be accessed from https://dl.acm.org/doi/10.1162/coli.2006.32.4.485. But I don’t think I have time to read it.
So my current best explanation is that it tries to improve the tokenization based on the context of each language or domain, but how exactly that is done is something I need to follow up on later.
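From the docs, the training API looks roughly like the sketch below. This is only under my current (limited) understanding; some_large_plaintext_corpus.txt is a hypothetical corpus file, and I haven't actually tried tuning any of this:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

with open('some_large_plaintext_corpus.txt') as f:  # hypothetical training corpus
    training_text = f.read()

trainer = PunktTrainer()
trainer.train(training_text)  # learns abbreviations, collocations, sentence starters without labels

sent_tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(sent_tokenizer.tokenize('Vol. 5 is in preparation. It is not out yet.'))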
Regexp Tokenizer
I forgot to mention why I need to bother with knowing the internals of Punkt. I need it because we might want to alter some behavior of the Punkt tokenizer. For example:
>>> nltk.word_tokenize('--enable-optimization')
['--', 'enable-optimization']
>>> nltk.word_tokenize('f(x)')
['f', '(', 'x', ')']
>>> nltk.word_tokenize('lambda x -> x**2')
['lambda', 'x', '-', '>', 'x**2']
which probably is not the desired behavior.
So instead of tinkering with Punkt, I think an easier solution is to add an extra tokenizer pass using a regex (https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.regexp). For example, by adding the pattern -+> (->, -->, --->, etc.) as a valid token, lambda x -> x**2 ends up tokenized as lambda, x, -, >, x**2, plus the extra token -> from the regex pass. This means that we will have more tokens overall.
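A rough sketch of what I mean (just the arrow pattern for now; the real pattern list is still up for experimentation):

import nltk

arrow_tokenizer = nltk.RegexpTokenizer(r'-+>')  # matches ->, -->, --->, ...
text = 'lambda x -> x**2'

tokens = nltk.word_tokenize(text)              # ['lambda', 'x', '-', '>', 'x**2']
tokens.extend(arrow_tokenizer.tokenize(text))  # adds '->'
print(tokens)  # ['lambda', 'x', '-', '>', 'x**2', '->']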
Alternative: using mwe (multi-word expression)
If we really need to remove the duplication in tokens caused by the extra regex pass, there's actually an alternative: the mwe (multi-word expression) tokenizer (https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.mwe). It works on an existing list of tokens and merges them whenever a recognized mwe appears. For example, just a little bit more is tokenized to just, a, little, bit, more. If we add a little bit as an mwe, we can pass those tokens through it and it will merge the mwe tokens, giving just, a_little_bit, more. But this does not come without caveats: it seems we can't use a regex pattern as an mwe, so a pattern like -+> can't be handled this way.
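A small sketch with nltk's MWETokenizer (the separator defaults to an underscore):

from nltk.tokenize import MWETokenizer

mwe_tokenizer = MWETokenizer([('a', 'little', 'bit')])   # register 'a little bit' as one expression
tokens = ['just', 'a', 'little', 'bit', 'more']           # output of an earlier tokenizer
print(mwe_tokenizer.tokenize(tokens))  # ['just', 'a_little_bit', 'more']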
Final prototype script
import json
import logging
import random

import nltk
import requests
import tqdm
from requests_futures.sessions import FuturesSession
from bs4 import BeautifulSoup

BASE = 'https://hacker-news.firebaseio.com/v0'
GOOD_STORY_MINIMUM_VOTE = 100  # not used yet

logging.basicConfig(level=logging.INFO)


def fetch_random_items(item_count=100):
    # Pick random item ids up to the current max id and fetch them concurrently.
    max_id = int(requests.get(f'{BASE}/maxitem.json').text)
    items = []
    urls = [f"{BASE}/item/{item_id}.json"
            for item_id in random.sample(range(1, max_id + 1), item_count)]
    session = FuturesSession(max_workers=30)
    waiting_responses = [session.get(url) for url in urls]
    for waiting_response in tqdm.tqdm(waiting_responses):
        response = waiting_response.result()
        if response.status_code != 200:
            logging.error(response)
            continue
        content = json.loads(response.text)
        if content is None:
            raise Exception(response.text)
        items.append(content)
    return items


if __name__ == "__main__":
    # Extra token patterns recognized on top of word_tokenize.
    other_token_patterns = [
        r"--\w+-\w+",
        r"\w\(\w\)",
        r"->",
    ]
    combined_pattern = "|".join(other_token_patterns)
    regex_tokenizer = nltk.RegexpTokenizer(combined_pattern)

    items = fetch_random_items(100)
    for item in items:
        if 'text' in item:
            corpus = item['text']
        elif 'title' in item:
            corpus = item['title']
        else:
            continue
        # Strip the HTML formatting, then tokenize with both tokenizers.
        soup = BeautifulSoup(corpus, 'html.parser')
        tokens = nltk.word_tokenize(soup.text)
        tokens.extend(regex_tokenizer.tokenize(soup.text))
        print(tokens)
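To actually run this, you need requests, requests-futures, beautifulsoup4, tqdm, and nltk installed, plus (if I remember correctly) the punkt data downloaded once via nltk.download('punkt').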
(to be continued.)