Trying Random Forest


February 18, 2020

I am currently taking the machine learning course from NTU, and it has a course project. Basically the course project requires us to join one of a few past Kaggle competitions and apply what we have learned from the class to the competition. It doesn’t specify much more and the grading is based on vague criteria such as score/performance, novelty, and report quality.

Our team decided to choose a credit card fraud detection competition. We chose it because it seemed the most “familiar” compared to the scenarios in the other competitions. We also wanted to avoid image recognition, considering that it might require much heavier computational power.

Initial look

We were quite surprised when we looked at the dataset given by the competition, because this is where things started to get really unfamiliar. It has hundreds of columns with obscure names and undescriptive values, like id1, id2, …, id30, then V1, V2, V3, etc. The general idea is that the idX columns represent customer properties and the VX columns describe only the transaction. It also has the more expected columns such as TransactionID, TransactionAmount, TransactionTime, DeviceType, etc., though the actual values may have been obscured by an offset or some other function.

Only then did I realize that this requires much less domain knowledge than I expected.

Still confused about what to do, we decided to look at some of the existing notebooks. There are some good notebooks that explore the data, such as this one. This gave us the gist of what we need to do. And it seems to be a lot.

Interesting exploration

Looking at those existing explorations, some are quite interesting. In particular, one guesses that the value of TransactionTime is in seconds, by working out what time range the data would span if the unit were years, months, or seconds. They also graph it to show how the cycle of a day shows up in the distribution, with the expected peak and quiet hours. It might not be 100% certain that the unit is actually seconds, but that matters less than knowing that each up-and-down cycle represents a day.

If the cycle really does represent a day, using the raw value of TransactionTime may be less useful than taking it modulo the length of a day (fraudsters probably tend to operate at certain hours).
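As a rough sketch of the modulo idea, assuming TransactionTime really is an offset in seconds (which the notebook only guesses), the time of day could be derived as below. The helper name add_time_of_day, the new column names, and the sample values are all made up for illustration.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60  # assumption: TransactionTime is measured in seconds


def add_time_of_day(df, time_column="TransactionTime"):
    """Return a copy of df with seconds-into-day and hour-of-day columns."""
    out = df.copy()
    out["seconds_into_day"] = out[time_column] % SECONDS_PER_DAY
    out["hour_of_day"] = out["seconds_into_day"] // 3600
    return out


# Made-up values: 01:00 on day 0, 01:00 on day 1, 13:30 on day 2.
sample = pd.DataFrame({"TransactionTime": [3600, 86400 + 3600, 2 * 86400 + 48600]})
with_hours = add_time_of_day(sample)
print(with_hours[["TransactionTime", "hour_of_day"]])
```

The first two rows land on the same hour even though they are a day apart, which is the kind of pattern a tree can split on.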

So what to do now?

The project description from our course does not give us any step-by-step instructions. It just describes the end goal, so we were confused about where to start. The notebooks submitted on Kaggle use complicated techniques that we can’t understand yet. Even the data exploration is difficult.

So I decided to just give it a try and start with whatever technique I could. Since we have just learned about decision trees, and I heard that RandomForest is built on top of decision trees by aggregating the results from multiple trees to get the final result, I guessed I could try RandomForest. The fact that the data has a lot of features (more than 300 columns) will probably suit RandomForest, because I heard that it’s quite good at ignoring unimportant features.

Let’s do it!

So I did the usual things (I worked a bit with data analysis before, as an intern at Shopee). I opened Jupyter (again, cough!), loaded the CSV, listed the columns, counted unique values, checked data types, etc.

The data comes in two CSV files, transaction data and customer data. From the notebooks I read, it should be safe to just merge the two. Pandas has both join and merge operations. I didn’t manage to make join work, and then found the linked article. It worked with merge, though until now I still don’t fully understand the difference. I am still confused by the concept of an index and how it works in Pandas. It is more complex than my mental model, and I won’t bother with it for now.

Then I decided to explore the data more. I tried to plot the data but found it unhelpful and too time-consuming. I guess just reading the existing notebooks will help much more.

Heck, I just tried RandomForest directly. I don’t really know much about the mechanics, just the general idea. I probably should’ve read more before proceeding, but I just went ahead without it.
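For what it's worth, here is a minimal sketch of the difference as I understand it, on two tiny made-up tables (the values are invented; only the column names come from the dataset). merge matches rows on a named column, while join matches on the index, so the key column has to be moved into the index first. A plain join on two frames that share a column name raises an error about overlapping columns, which may be what went wrong in my attempt.

```python
import pandas as pd

# Two tiny made-up tables that share a TransactionID column.
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmount": [10.0, 25.5, 7.0]}
)
customers = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# merge: match rows on a named column; how="left" keeps every transaction.
merged = transactions.merge(customers, on="TransactionID", how="left")

# join: match rows on the index, so the key is moved into the index first
# and turned back into a regular column afterwards.
joined = (
    transactions.set_index("TransactionID")
    .join(customers.set_index("TransactionID"), how="left")
    .reset_index()
)

print(merged)
print(joined)
```

Both results have one row per transaction, with a missing DeviceType for the transaction that has no customer record.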

Dealing with categorical values in RandomForest

Then the first problem arose. RandomForest does not work with categorical values directly. There are two ways to make it work: either OneHotEncoding, or hashing the categorical value to convert it into something numeric. I think there’s a caveat with hashing, because two unrelated categorical values may end up numerically close to each other, causing some unintended association. So I guessed I would just try OneHotEncoding.

The problem with OneHotEncoding is that it can grow really big if I don’t use a sparse matrix. But using a sparse matrix means that I will lose control of the column names, and it will be much more complicated to deal with. Add to that the fact that the training data and the test data may have different sets of values, and I guess I’ll just try hashing instead.

This is how I did it with the hash:
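For reference, this is not what I ended up doing, but the mismatch between training and test values could probably be handled with dense one-hot columns along these lines. It is only a sketch on made-up data: the column is encoded separately in each set, then the test set is forced onto the training columns.

```python
import pandas as pd

# Made-up data: "tablet" appears only in test, "mobile" only in train.
train = pd.DataFrame({"DeviceType": ["mobile", "desktop", "mobile"]})
test = pd.DataFrame({"DeviceType": ["desktop", "tablet"]})

train_encoded = pd.get_dummies(train, columns=["DeviceType"])
test_encoded = pd.get_dummies(test, columns=["DeviceType"])

# Align test to the training columns: categories unseen in training are
# dropped, and categories missing from test are added as all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_aligned)
```

This keeps the column names readable, but it does nothing about the size problem when a column has many distinct values.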

import hashlib
from tqdm import tqdm  # progress bar over the columns

def hash_md5(obj):
    # MD5 of the value's string form, read as a 128-bit integer
    return int(hashlib.md5(str(obj).encode('UTF-8')).hexdigest(), 16)

def feature_hash_obj(df):
    # Replace every object (string) column with a numeric hash, in place
    object_columns = df.select_dtypes(include='object').columns
    for column in tqdm(object_columns):
        df[column] = df[column].apply(hash_md5).apply(float)
        df[column] = df[column].apply(lambda x: x * pow(2, -10))  # to fit in float32
    return df

And it seems to work fine.

Isn’t it too high?

The data seems to be ready to use (though I haven’t considered normalization or a proper way of dealing with missing values). I just filled all the NaN with -1 (most of the data contains values from 0 to 1, so it should be okay).

Then, using cross-validation, I unexpectedly got an accuracy score of around 0.96. This is not possible, I thought. Then I remembered that the data is mostly not fraud. Even the default strategy of labelling everything as not fraud still scores >0.96. So yeah, I need some balancing in the fit. I just chose balanced (meaning that the classifier treats the two classes as if they were equally frequent).

But I don’t seem to be able to change the scoring method to be balanced as well (instead, the scoring follows the distribution of the whole population, which means it still tests mostly on not fraud). So I just split the test data manually, and the accuracy is now 0.83, which makes more sense. This means that, given a transaction with equal probability of being fraud or not fraud, the classifier guesses correctly 83% of the time.
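To see why plain accuracy is misleading here, a small sketch of the "always say not fraud" baseline on made-up labels. The 96/4 split is only an illustration, not the real ratio in the dataset.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = max(labels.count(value) for value in set(labels))
    return most_common_count / len(labels)


# Made-up labels: 96 legitimate transactions (0) and 4 fraudulent ones (1).
labels = [0] * 96 + [1] * 4
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.96
```

Any classifier has to beat this number before its accuracy says anything about catching fraud.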
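A different way to get the same kind of number, which is not what I did here, is to average the per-class recall so that each class counts equally regardless of its size. The hand-written function below is just a sketch of that idea on made-up labels; scikit-learn ships the same metric as balanced_accuracy_score, which I only learned about afterwards.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recall, so every class weighs the same."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Made-up labels: 96 legitimate (0) and 4 fraudulent (1) transactions.
y_true = [0] * 96 + [1] * 4
always_legit = [0] * 100  # the "everything is not fraud" strategy

print(balanced_accuracy(y_true, always_legit))  # 0.5
```

With this metric, the strategy that scored 0.96 on plain accuracy drops to 0.5, no better than a coin flip.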

Submit to Kaggle

It was near the end of the day, so I guess I can stop here for now and try another approach next week. When I submitted to Kaggle, I got a score of 0.75. I don’t know exactly how they measure the score, which I guess is intended because they don’t want people to reverse engineer their scoring (?). It wasn’t great, around the 85th percentile. But of course more effort can be put in, and there is a lot of room for improvement.