Twelfth competition following Youhan Lee's curriculum. A natural language processing competition.
First Kernel: Mercari Interactive EDA + Topic Modelling
- EDA kernel with matplotlib.
Insight / Summary:
1. Log transformation on target var
- Our response or target variable is the price that we suggest to Mercari's marketplace sellers.
- The median price of all the items in the training set is only about $17, but given the existence of some extreme values of over $100 and a maximum at $2,009, the distribution of the variable is heavily skewed to the right.
- So let's apply a log transformation to the price (we add +1 to the value before the transformation to avoid zero and negative values).
plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)
plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()
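- Since the later kernels train on log(price+1) and convert predictions back to dollars, np.log1p / np.expm1 are the convenient (and numerically stable) pair for this round trip; a minimal sketch:
import numpy as np
prices = np.array([0.0, 10.0, 100.0, 2009.0])
log_prices = np.log1p(prices)                      # identical to np.log(prices + 1)
assert np.allclose(np.expm1(log_prices), prices)   # expm1 undoes the transform for final predictions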
2. Dealing with Item Description feature
- This feature will be more challenging to parse since it's unstructured data.
- Does a more detailed and lengthy description result in a higher price?
- We will strip out all punctuation, remove some English stop words (i.e. redundant words such as "a", "the", etc.) and any other words with a length of 3 characters or fewer:
import re
import string
from sklearn.feature_extraction import stop_words  # provides ENGLISH_STOP_WORDS in older scikit-learn
def wordCount(text):
try:
# convert to lower case and strip regex
text = text.lower()
regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
txt = regex.sub(" ", text)
# tokenize
# words = nltk.word_tokenize(clean_txt)
# remove words in stop words
words = [w for w in txt.split(" ") \
if not w in stop_words.ENGLISH_STOP_WORDS and len(w)>3]
return len(words)
except:
return 0
# add a column of word counts to both the training and test set
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))
- We also need to check if there are any missing values in the item description (4 observations don't have a description) and remove those observations from our training set.
train.item_description.isnull().sum()
# result: 4
# remove missing values in item description
train = train[pd.notnull(train['item_description'])]
3. Pre-processing: tokenization
- Most of the time, the first step of an NLP project is to "tokenize" your documents, whose main purpose is to normalize our texts.
- The three fundamental stages will usually include:
- break the descriptions into sentences and then break the sentences into tokens
- remove punctuation and stop words
- lowercase the tokens
- herein, I will also only consider words that have length equal to or greater than 3 characters
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
stop = set(stopwords.words('english'))
def tokenize(text):
"""
sent_tokenize(): segment text into sentences
word_tokenize(): break sentences into words
"""
try:
regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
text = regex.sub(" ", text) # remove punctuation
tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
tokens = []
for token_by_sent in tokens_:
tokens += token_by_sent
tokens = list(filter(lambda t: t.lower() not in stop, tokens))
filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
filtered_tokens = [w.lower() for w in filtered_tokens if len(w)>=3]
return filtered_tokens
except TypeError as e: print(text,e)
# apply the tokenizer to the item description column
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
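- A quick sanity check of the tokenizer on a made-up description (the example string is mine, not from the data):
print(tokenize("Brand new! Size M, worn once. 100% cotton blend"))
# should print something like: ['brand', 'new', 'size', 'worn', 'cotton', 'blend']
# (digits, punctuation, stop words such as "once", and tokens shorter than 3 characters are dropped)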
4. WordCloud package
- We could also use the WordCloud package to easily visualize which words have the highest frequencies within each category:
# build dictionary with key=category and values as all the descriptions related.
from collections import Counter
from wordcloud import WordCloud
cat_desc = dict()
for cat in general_cats:
text = " ".join(train.loc[train['general_cat']==cat, 'item_description'].values)
cat_desc[cat] = tokenize(text)
# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)
def generate_wordcloud(tup):
wordcloud = WordCloud(background_color='white',
max_words=50, max_font_size=40,
random_state=42
).generate(str(tup))
return wordcloud
fig,axes = plt.subplots(2, 2, figsize=(30, 15))
ax = axes[0, 0]
ax.imshow(generate_wordcloud(women100), interpolation="bilinear")
ax.axis('off')
ax.set_title("Women Top 100", fontsize=30)
ax = axes[0, 1]
ax.imshow(generate_wordcloud(beauty100))
ax.axis('off')
ax.set_title("Beauty Top 100", fontsize=30)
ax = axes[1, 0]
ax.imshow(generate_wordcloud(kids100))
ax.axis('off')
ax.set_title("Kids Top 100", fontsize=30)
ax = axes[1, 1]
ax.imshow(generate_wordcloud(electronics100))
ax.axis('off')
ax.set_title("Electronic Top 100", fontsize=30)
5. Pre-processing: tf-idf
- tf-idf is the acronym for Term Frequency-Inverse Document Frequency.
- It quantifies the importance of a particular word relative to the vocabulary of a collection of documents, or corpus.
- The metric depends on two factors:
- Term Frequency: the occurrences of a word in a given document (i.e. bag of words)
- Inverse Document Frequency: the inverse of the fraction of documents in the corpus that contain the word
- Think about it this way: if a word is used extensively in all documents, its presence within a specific document cannot tell us much about the document itself.
- So the second term can be seen as a penalty that down-weights common words such as "a", "the", "and", etc.; tf-idf can therefore be seen as a weighting scheme for how relevant a word is to a specific document.
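- A toy illustration (made-up documents, not from the kernel): a word that appears in every document gets a lower tf-idf weight than a word unique to one document.
from sklearn.feature_extraction.text import TfidfVectorizer
toy_docs = ["new with tags", "new in box", "new never used"]
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_docs)
print(toy_vec.get_feature_names())      # get_feature_names_out() in newer scikit-learn
print(toy_tfidf.toarray().round(2))     # "new" (in all docs) carries less weight than e.g. "tags"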
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,
max_features=180000,
tokenizer=tokenize,
ngram_range=(1, 2))
all_desc = np.append(train['item_description'].values, test['item_description'].values)
vz = vectorizer.fit_transform(list(all_desc))
- vz is a tfidf matrix where:
- the number of rows is the total number of descriptions
- the number of columns is the total number of unique tokens across the descriptions
- Given the high dimension of our tfidf matrix, we need to reduce its dimension using the Singular Value Decomposition (SVD) technique.
- And to visualize our vocabulary, we could next use t-SNE to reduce the SVD output down to 2 dimensions.
- t-SNE is more suitable for dimensionality reduction to 2 or 3.
SVD (Singular Value Decomposition):
- Linear dimensionality reduction
- Preserves major patterns in data
- Relatively fast computation
- Suitable for reduction to larger dimensions (e.g., 50 dimensions)
t-SNE (t-Distributed Stochastic Neighbor Embedding):
- Non-linear dimensionality reduction
- Preserves similarity relationships between data points
- Optimized for visualization (2-3 dimensions)
- Effectively reveals cluster structures
6. t-Distributed Stochastic Neighbor Embedding (t-SNE)
- t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
- The goal is to take a set of points in a high-dimensional space and find a representation of those points in a lower-dimensional space, typically the 2D plane.
- It is based on probability distributions with random walk on neighborhood graphs to find the structure within the data.
- But since t-SNE complexity is significantly high, usually we'd use other high-dimension reduction techniques before applying t-SNE.
- First, let's take a sample from both the training and test item descriptions, since t-SNE can take a very long time to execute.
- We can then reduce the dimension of each vector to n_components using SVD (30 in the code below).
trn = train.copy()
tst = test.copy()
trn['is_train'] = 1
tst['is_train'] = 0
sample_sz = 15000
combined_df = pd.concat([trn, tst])
combined_sample = combined_df.sample(n=sample_sz)
vz_sample = vectorizer.fit_transform(list(combined_sample['item_description']))
from sklearn.decomposition import TruncatedSVD
n_comp=30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)
# reduce the SVD output to 2 dimensions with t-SNE
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
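- A quick check (using the svd object fitted above) of how much of the tf-idf variance the SVD components retain:
print(svd.explained_variance_ratio_.sum())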
7. tf-idf clustering of the item description
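- The snippet below assumes a Bokeh figure named plot_tfidf and a DataFrame tfidf_df built from the t-SNE output, neither of which is shown in this excerpt; a minimal sketch of that setup (mirroring the K-Means plot further down) could look like:
import pandas as pd
import bokeh.plotting as bp
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.io import output_notebook, show
output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600,
                       title="tf-idf clustering of the item description",
                       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                       x_axis_type=None, y_axis_type=None, min_border=1)
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
tfidf_df['description'] = combined_sample['item_description']
tfidf_df['tokens'] = combined_sample['tokens']
tfidf_df['category'] = combined_sample['general_cat']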
plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "tokens": "@tokens", "category":"@category"}
show(plot_tfidf)
8. K-Means Clustering
- The K-means clustering objective is to minimize the average squared Euclidean distance of each document/description from its cluster centroid.
from sklearn.cluster import MiniBatchKMeans
num_clusters = 30 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
init='k-means++',
n_init=1,
init_size=1000, batch_size=1000, verbose=0, max_iter=1000)
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(num_clusters):
print("Cluster %d:" % i)
aux = ''
for j in sorted_centroids[i, :10]:
aux += terms[j] + ' | '
print(aux)
print()
In order to plot these clusters, first we will need to reduce the dimension of the distances to 2 using t-SNE:
# repeat the same steps for the sample
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)
#combined_sample.reset_index(drop=True, inplace=True)
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['description'] = combined_sample['item_description']
kmeans_df['category'] = combined_sample['general_cat']
#kmeans_df['cluster']=kmeans_df.cluster.astype(str).astype('category')
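# NOTE: `colormap` used in the ColumnDataSource below is not defined in this excerpt;
# a plausible stand-in (an assumption, not the kernel's exact palette) is one hex color per cluster id:
import matplotlib.cm as cm
from matplotlib.colors import to_hex
colormap = np.array([to_hex(cm.tab20(i % 20)) for i in range(num_clusters)])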
plot_kmeans = bp.figure(plot_width=700, plot_height=600,
title="KMeans clustering of the description",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
color=colormap[kmeans_clusters],
description=kmeans_df['description'],
category=kmeans_df['category'],
cluster=kmeans_df['cluster']))
plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "category": "@category", "cluster":"@cluster" }
show(plot_kmeans)
9. Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus.
- LDA starts from a fixed number of topics.
- Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics.
- Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.
- Its input is a bag of words: each document is represented as a row, with each column containing the count of a vocabulary word in that document.
from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer(min_df=4,
max_features=180000,
tokenizer=tokenize,
ngram_range=(1,2))
cvz = cvectorizer.fit_transform(combined_sample['item_description'])
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components=20,
learning_method='online',
max_iter=20,
random_state=42)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 10
topic_summaries = []
topic_word = lda_model.components_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
topic_summaries.append(' '.join(topic_words))
print('Topic {}: {}'.format(i, ' | '.join(topic_words)))
# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(X_topics)
unnormalized = np.matrix(X_topics)
doc_topic = unnormalized/unnormalized.sum(axis=1)
lda_keys = []
for i, tweet in enumerate(combined_sample['item_description']):
lda_keys += [doc_topic[i].argmax()]
lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['description'] = combined_sample['item_description']
lda_df['category'] = combined_sample['general_cat']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)
plot_lda = bp.figure(plot_width=700,
plot_height=600,
title="LDA topic visualization",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
color=colormap[lda_keys],
description=lda_df['description'],
topic=lda_df['topic'],
category=lda_df['category']))
plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"description":"@description",
"topic":"@topic", "category":"@category"}
- pyLDAvis is a powerful tool that gives us an interactive visualization for LDA.
- It's a shame that embedding the pyLDAvis HTML distorts the layout of the kernel, so I won't include it here.
- But if you follow the below code, there should be an HTML file generated with very interesting interactive bubble chart that visualizes the space of your topic clusters and the term components within each topic.
def prepareLDAData():
data = {
'vocab': vocab,
'doc_topic_dists': doc_topic,
'doc_lengths': list(lda_df['len_docs']),
'term_frequency':cvectorizer.vocabulary_,
'topic_term_dists': lda_model.components_
}
return data
import pyLDAvis
lda_df['len_docs'] = combined_sample['tokens'].map(len)
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)
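- To actually write out the standalone HTML file mentioned above, a call along these lines should work (the file name here is arbitrary):
pyLDAvis.save_html(prepared_data, 'mercari_lda.html')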
Second Kernel: A simple nn solution with Keras (~0.48611 PL)
- Kernel using neural network for modeling.
Insight / Summary:
1. Metric
import math
def rmsle(y, y_pred):
assert len(y) == len(y_pred)
to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function
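- A small sanity check: RMSLE over raw prices is exactly RMSE over log1p-transformed prices, which is why the kernel trains on the log-transformed target.
import numpy as np
y_true = np.array([10.0, 50.0, 200.0])
y_hat = np.array([12.0, 45.0, 180.0])
assert np.isclose(rmsle(y_true, y_hat),
                  np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2)))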
2. Missing value
#HANDLE MISSING VALUES
print("Handling missing values...")
def handle_missing(dataset):
dataset.category_name.fillna(value="missing", inplace=True)
dataset.brand_name.fillna(value="missing", inplace=True)
dataset.item_description.fillna(value="missing", inplace=True)
return (dataset)
train = handle_missing(train)
test = handle_missing(test)
3. Categorical data - label encoding
#PROCESS CATEGORICAL DATA
print("Handling categorical variables...")
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)
le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le
train.head(3)
4. raw text - tokenization
#PROCESS TEXT: RAW
print("Text to seq process...")
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])
print(" Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print(" Transforming text to seq...")
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())
train.head(3)
5. Scaling target variable
#SCALE target variable
train["target"] = np.log(train.price+1)
from sklearn.preprocessing import MinMaxScaler
target_scaler = MinMaxScaler(feature_range=(-1, 1))
train["target"] = target_scaler.fit_transform(train.target.reshape(-1,1))
pd.DataFrame(train.target).hist()
6. Modeling GRU NN
1) Finding max values for NN
#EMBEDDINGS MAX VALUE
#Based on the histograms, we select the following lengths
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
MAX_TEXT = np.max([np.max(train.seq_name.max())
, np.max(test.seq_name.max())
, np.max(train.seq_item_description.max())
, np.max(test.seq_item_description.max())])+2
MAX_CATEGORY = np.max([train.category_name.max(), test.category_name.max()])+1
MAX_BRAND = np.max([train.brand_name.max(), test.brand_name.max()])+1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()])+1
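- The model definition in the next block expects a dict X_train of padded sequences and arrays; the padding step isn't shown in this excerpt, so here is a minimal sketch of it (the helper name and the choice of shipping as the only numeric input are my assumptions):
from keras.preprocessing.sequence import pad_sequences
def get_keras_data(dataset):
    return {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand_name': np.array(dataset.brand_name),
        'category_name': np.array(dataset.category_name),
        'item_condition': np.array(dataset.item_condition_id),
        'num_vars': np.array(dataset[['shipping']]),
    }
X_train = get_keras_data(train)
X_test = get_keras_data(test)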
2) Actual modeling
#KERAS MODEL DEFINITION
from keras.layers import Input, Dropout, Dense, BatchNormalization, Activation, concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras import backend as K
def get_callbacks(filepath, patience=2):
es = EarlyStopping('val_loss', patience=patience, mode="min")
msave = ModelCheckpoint(filepath, save_best_only=True)
return [es, msave]
def rmsle_cust(y_true, y_pred):
first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
return K.sqrt(K.mean(K.square(first_log - second_log), axis=-1))
def get_model():
#params
dr_r = 0.1
#Inputs
name = Input(shape=[X_train["name"].shape[1]], name="name")
item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
brand_name = Input(shape=[1], name="brand_name")
category_name = Input(shape=[1], name="category_name")
item_condition = Input(shape=[1], name="item_condition")
num_vars = Input(shape=[X_train["num_vars"].shape[1]], name="num_vars")
#Embeddings layers
emb_name = Embedding(MAX_TEXT, 50)(name)
emb_item_desc = Embedding(MAX_TEXT, 50)(item_desc)
emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
emb_category_name = Embedding(MAX_CATEGORY, 10)(category_name)
emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
#rnn layer
rnn_layer1 = GRU(16) (emb_item_desc)
rnn_layer2 = GRU(8) (emb_name)
#main layer
main_l = concatenate([
Flatten() (emb_brand_name)
, Flatten() (emb_category_name)
, Flatten() (emb_item_condition)
, rnn_layer1
, rnn_layer2
, num_vars
])
main_l = Dropout(dr_r) (Dense(128) (main_l))
main_l = Dropout(dr_r) (Dense(64) (main_l))
#output
output = Dense(1, activation="linear") (main_l)
#model
model = Model([name, item_desc, brand_name
, category_name, item_condition, num_vars], output)
model.compile(loss="mse", optimizer="adam", metrics=["mae", rmsle_cust])
return model
model = get_model()
model.summary()
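- Fitting and predicting are not shown in this excerpt; a hedged sketch (epochs and batch size are illustrative, not the kernel's exact values), reusing X_train/X_test from the data-prep sketch above:
BATCH_SIZE = 20000
model.fit(X_train, train.target, epochs=5, batch_size=BATCH_SIZE,
          validation_split=0.01, verbose=1)
preds = model.predict(X_test, batch_size=BATCH_SIZE)
# predictions come back in the scaled log space: undo the MinMax scaling, then the log(price+1)
preds = target_scaler.inverse_transform(preds)
preds = np.expm1(preds)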
Third Kernel: Ridge (LB 0.41943)
- A kernel mainly using a Ridge regression model.
Insight / Summary:
1. Overall Summary
- This code implements a machine learning pipeline for product price prediction.
- Here are the main steps:
- Data Preprocessing:
- Remove data with zero prices
- Clean text data including categories, brand names, product names, and descriptions
- Fill missing brand names using the SymSpell algorithm based on similarity
- Split categories into major/medium/minor classifications
- Combine text data to create rich features
- Feature Engineering:
- Vectorize text data using HashingVectorizer and CountVectorizer
- Encode categorical variables using OneHotEncoder
- Apply TF-IDF transformation to reflect text feature importance
- Select only features common to both training and test sets
- Modeling:
- Use Ridge regression for price prediction
- Use log-transformed prices as targets
- Generate final prices through exponential transformation of predictions
- The main characteristics of this implementation are:
- Focus on Text Data Processing:
- Use HashingVectorizer for memory-efficient handling of large vocabularies
- Extract context information through n-gram based features
- Reflect word importance through TF-IDF transformation
- Efficient Memory Management:
- Immediately release unnecessary data from memory
- Optimize memory usage with HashingVectorizer
- Maintain only features common to training/test sets
- Robust Preprocessing:
- Intelligent filling of missing brand names
- Hierarchical use of category information
- Text data normalization and combination
- Scalable Structure:
- Modularization using Pipeline and FeatureUnion
- Flexibility through custom transformer classes
- Support for multiprocessing
2. Code Details
# Import essential libraries
import multiprocessing as mp # Library for parallel processing
import pandas as pd # pandas for data processing
from time import time # For measuring execution time
from scipy.sparse import csr_matrix # For sparse matrix operations
import os # For OS related functionality
from sklearn.linear_model import Ridge # Ridge regression model
from sklearn.pipeline import FeatureUnion, Pipeline # For feature processing pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer # Text processing
from sklearn.metrics import mean_squared_log_error # Evaluation metric
from sklearn.preprocessing import OneHotEncoder # For encoding categorical variables
import numpy as np # For numerical operations
import gc # Garbage collection
from sklearn.base import BaseEstimator, TransformerMixin # For creating custom transformers
import re # Regular expressions
from pandas.api.types import is_numeric_dtype, is_categorical_dtype # For checking data types
# Multithreading configuration
os.environ['MKL_NUM_THREADS'] = '4' # Limit Intel Math Kernel Library threads
os.environ['OMP_NUM_THREADS'] = '4' # Limit OpenMP threads
os.environ['JOBLIB_START_METHOD'] = 'forkserver' # Set joblib parallel processing method
# Set input data path
INPUT_PATH = r'../input'
# Function to calculate Damerau-Levenshtein distance
def dameraulevenshtein(seq1, seq2):
"""Calculate the Damerau-Levenshtein distance between sequences.
This method has not been modified from the original.
Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
This distance is the number of additions, deletions, substitutions,
and transpositions needed to transform the first sequence into the
second. Although generally used with strings, any sequences of
comparable objects will work.
Transpositions are exchanges of *consecutive* characters; all other
operations are self-explanatory.
This implementation is O(N*M) time and O(M) space, for N and M the
lengths of the two sequences.
>>> dameraulevenshtein('ba', 'abc')
2
>>> dameraulevenshtein('fee', 'deed')
2
It works with arbitrary sequences too:
>>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
2
"""
# codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
# Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
# However, only the current and two previous rows are needed at once,
# so we only store those.
# Implementation maintained as original using dynamic programming
# Stores only current row and previous two rows for memory efficiency
oneago = None
thisrow = list(range(1, len(seq2) + 1)) + [0]
for x in range(len(seq1)):
twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
for y in range(len(seq2)):
delcost = oneago[y] + 1 # Deletion cost
addcost = thisrow[y - 1] + 1 # Addition cost
subcost = oneago[y - 1] + (seq1[x] != seq2[y]) # Substitution cost
thisrow[y] = min(delcost, addcost, subcost)
# Handle transpositions
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[len(seq2) - 1]
class SymSpell:
"""
A class implementing the SymSpell algorithm.
This algorithm provides an efficient method for spell correction.
"""
def __init__(self, max_edit_distance=3, verbose=0):
"""
Parameters:
max_edit_distance: Maximum edit distance (how many edits to allow)
verbose: Verbosity level (0: top suggestion only, 1: all suggestions with minimal distance, 2: all possible suggestions)
"""
self.max_edit_distance = max_edit_distance
self.verbose = verbose
self.dictionary = {} # Word dictionary
self.longest_word_length = 0 # Length of longest word
def get_deletes_list(self, w):
"""
Generates all possible combinations of the word with characters deleted up to max_edit_distance.
Example: "word" -> ["ord", "wrd", "wod", "wor"]
"""
deletes = []
queue = [w]
for d in range(self.max_edit_distance):
temp_queue = []
for word in queue:
if len(word) > 1:
for c in range(len(word)):
word_minus_c = word[:c] + word[c + 1:]
if word_minus_c not in deletes:
deletes.append(word_minus_c)
if word_minus_c not in temp_queue:
temp_queue.append(word_minus_c)
queue = temp_queue
return deletes
def create_dictionary_entry(self, w):
"""
Adds a word and its derived deletion variants to the dictionary.
Returns:
bool: Whether a new real word was added
"""
new_real_word_added = False
if w in self.dictionary:
# If word exists, increase frequency
self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
else:
# Add new word
self.dictionary[w] = ([], 1)
self.longest_word_length = max(self.longest_word_length, len(w))
if self.dictionary[w][1] == 1:
# If this is the first occurrence of the word in the corpus
new_real_word_added = True
deletes = self.get_deletes_list(w)
for item in deletes:
if item in self.dictionary:
# Add original word to the deletion's entry
self.dictionary[item][0].append(w)
else:
# Add new deletion form
self.dictionary[item] = ([w], 0)
return new_real_word_added
def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
"""
Creates a word dictionary from an array.
Parameters:
arr: Array containing words
token_pattern: Regular expression pattern for extracting words
Returns:
dictionary: Generated word dictionary
"""
total_word_count = 0 # Total words processed
unique_word_count = 0 # Number of unique words
for line in arr:
# Split words by non-alphabetic characters
words = re.findall(token_pattern, line.lower())
for word in words:
total_word_count += 1
if self.create_dictionary_entry(word):
unique_word_count += 1
# Print processing results
print("total words processed: %i" % total_word_count)
print("total unique words in corpus: %i" % unique_word_count)
print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
print(" edit distance for deletions: %i" % self.max_edit_distance)
print(" length of longest word in corpus: %i" % self.longest_word_length)
return self.dictionary
def create_dictionary(self, fname):
"""
Creates a word dictionary from a file.
Parameters:
fname: Path to the file to read
Returns:
dictionary: Generated word dictionary
How it works:
1. Reads the file line by line.
2. Extracts words containing only alphabetic characters from each line.
3. Converts each word to lowercase and adds it to the dictionary.
4. Prints processing results.
"""
total_word_count = 0 # Total number of words processed
unique_word_count = 0 # Number of unique words
with open(fname) as file: # Open file with context manager
for line in file:
# Split words by non-alphabetic characters
# [a-z]+ pattern finds one or more consecutive lowercase letters
words = re.findall('[a-z]+', line.lower())
for word in words:
total_word_count += 1 # Increase total word count
# If create_dictionary_entry returns True (new word added)
# Increase unique word count
if self.create_dictionary_entry(word):
unique_word_count += 1
# Print processing results
print("total words processed: %i" % total_word_count) # Total words processed
print("total unique words in corpus: %i" % unique_word_count) # Unique words
print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary)) # Dictionary size
print(" edit distance for deletions: %i" % self.max_edit_distance) # Maximum edit distance
print(" length of longest word in corpus: %i" % self.longest_word_length) # Length of longest word
return self.dictionary # Return generated dictionary
def get_suggestions(self, string, silent=False):
"""return list of suggested corrections for potentially incorrectly
spelled word"""
if (len(string) - self.longest_word_length) > self.max_edit_distance:
if not silent:
print("no items in dictionary within maximum edit distance")
return []
suggest_dict = {}
min_suggest_len = float('inf')
queue = [string]
q_dictionary = {} # items other than string that we've checked
while len(queue) > 0:
q_item = queue[0] # pop
queue = queue[1:]
# early exit
if ((self.verbose < 2) and (len(suggest_dict) > 0) and
((len(string) - len(q_item)) > min_suggest_len)):
break
# process queue item
if (q_item in self.dictionary) and (q_item not in suggest_dict):
if self.dictionary[q_item][1] > 0:
# word is in dictionary, and is a word from the corpus, and
# not already in suggestion list so add to suggestion
# dictionary, indexed by the word with value (frequency in
# corpus, edit distance)
# note q_items that are not the input string are shorter
# than input string since only deletes are added (unless
# manual dictionary corrections are added)
assert len(string) >= len(q_item)
suggest_dict[q_item] = (self.dictionary[q_item][1],
len(string) - len(q_item))
# early exit
if (self.verbose < 2) and (len(string) == len(q_item)):
break
elif (len(string) - len(q_item)) < min_suggest_len:
min_suggest_len = len(string) - len(q_item)
# the suggested corrections for q_item as stored in
# dictionary (whether or not q_item itself is a valid word
# or merely a delete) can be valid corrections
for sc_item in self.dictionary[q_item][0]:
if sc_item not in suggest_dict:
# compute edit distance
# suggested items should always be longer
# (unless manual corrections are added)
assert len(sc_item) > len(q_item)
# q_items that are not input should be shorter
# than original string
# (unless manual corrections added)
assert len(q_item) <= len(string)
if len(q_item) == len(string):
assert q_item == string
item_dist = len(sc_item) - len(q_item)
# item in suggestions list should not be the same as
# the string itself
assert sc_item != string
# calculate edit distance using, for example,
# Damerau-Levenshtein distance
item_dist = dameraulevenshtein(sc_item, string)
# do not add words with greater edit distance if
# verbose setting not on
if (self.verbose < 2) and (item_dist > min_suggest_len):
pass
elif item_dist <= self.max_edit_distance:
assert sc_item in self.dictionary # should already be in dictionary if in suggestion list
suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
if item_dist < min_suggest_len:
min_suggest_len = item_dist
# depending on order words are processed, some words
# with different edit distances may be entered into
# suggestions; trim suggestion dictionary if verbose
# setting not on
if self.verbose < 2:
suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] <= min_suggest_len}
# now generate deletes (e.g. a substring of string or of a delete)
# from the queue item
# as additional items to check -- add to end of queue
assert len(string) >= len(q_item)
# do not add words with greater edit distance if verbose setting
# is not on
if (self.verbose < 2) and ((len(string) - len(q_item)) > min_suggest_len):
pass
elif (len(string) - len(q_item)) < self.max_edit_distance and len(q_item) > 1:
for c in range(len(q_item)): # character index
word_minus_c = q_item[:c] + q_item[c + 1:]
if word_minus_c not in q_dictionary:
queue.append(word_minus_c)
q_dictionary[word_minus_c] = None # arbitrary value, just to identify we checked this
# queue is now empty: convert suggestions in dictionary to
# list for output
if not silent and self.verbose != 0:
print("number of possible corrections: %i" % len(suggest_dict))
print(" edit distance for deletions: %i" % self.max_edit_distance)
# output option 1
# sort results by ascending order of edit distance and descending
# order of frequency
# and return list of suggested word corrections only:
# return sorted(suggest_dict, key = lambda x:
# (suggest_dict[x][1], -suggest_dict[x][0]))
# output option 2
# return list of suggestions with (correction,
# (frequency in corpus, edit distance)):
as_list = suggest_dict.items()
# outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))
if self.verbose == 0:
return outlist[0]
else:
return outlist
'''
Option 1:
['file', 'five', 'fire', 'fine', ...]
Option 2:
[('file', (5, 0)),
('five', (67, 1)),
('fire', (54, 1)),
('fine', (17, 1))...]
'''
def best_word(self, s, silent=False):
"""
Returns the best correction for a given word.
Parameters:
s: Word to check
silent: If True, don't print progress
Returns:
tuple or None: (corrected word, (frequency, edit distance)) or None if failed
"""
try:
return self.get_suggestions(s, silent)[0]
except:
return None
class ItemSelector(BaseEstimator, TransformerMixin):
"""
A transformer for selecting specific fields from a pandas DataFrame and converting them to appropriate format.
This is a custom transformer for use in scikit-learn Pipelines.
"""
def __init__(self, field, start_time=time()):
self.field = field # Column name to select from DataFrame
self.start_time = start_time # Start time for processing time measurement
def fit(self, x, y=None):
return self
def transform(self, dataframe):
"""
Selects and transforms specific fields from the DataFrame.
- Categorical data is converted to codes
- Numeric data is kept as is
- Other data is treated as text
"""
print(f'[{time()-self.start_time}] select {self.field}')
dt = dataframe[self.field].dtype
if is_categorical_dtype(dt):
return dataframe[self.field].cat.codes[:, None]
elif is_numeric_dtype(dt):
return dataframe[self.field][:, None]
else:
return dataframe[self.field]
class DropColumnsByDf(BaseEstimator, TransformerMixin):
"""
A transformer that filters features (columns) based on document frequency
"""
def __init__(self, min_df=1, max_df=1.0):
"""
Parameters:
min_df: Minimum document frequency (features below this are removed)
max_df: Maximum document frequency ratio (features above this are removed)
"""
self.min_df = min_df
self.max_df = max_df
def fit(self, X, y=None):
"""
Calculates document frequency for given data and determines which columns to filter.
"""
# Convert to CSC (Compressed Sparse Column) format
m = X.tocsc()
# Process minimum document frequency (min_df) condition
# (m != 0).sum(axis=0): Calculate number of non-zero values in each column
# >= self.min_df: Check if it's greater than minimum document frequency
# .A1: Flatten array to 1 dimension
self.nnz_cols = ((m != 0).sum(axis=0) >= self.min_df).A1
# Process maximum document frequency (max_df) condition
if self.max_df < 1.0:
# Calculate maximum allowed number of documents
max_df = m.shape[0] * self.max_df
# AND operation with maximum document frequency condition
self.nnz_cols = self.nnz_cols & ((m != 0).sum(axis=0) <= max_df).A1
return self
def transform(self, X, y=None):
"""
Selects features according to the determined filtering criteria.
"""
m = X.tocsc()
# Select columns according to conditions determined in fit (self.nnz_cols)
return m[:, self.nnz_cols]
def get_rmsle(y_true, y_pred):
return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))
def split_cat(text):
try:
cats = text.split("/")
return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
except:
print("no category")
return 'other', 'other', 'other', 'other/other'
# Function to fill in missing brand names
# Uses the SymSpell algorithm to find and fill brand names from product names and descriptions
# Processes single-word and multi-word brand names separately
def brands_filling(dataset):
vc = dataset['brand_name'].value_counts()
brands = vc[vc > 0].index
brand_word = r"[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+"
many_w_brands = brands[brands.str.contains(' ')]
one_w_brands = brands[~brands.str.contains(' ')]
ss2 = SymSpell(max_edit_distance=0)
ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')
ss1 = SymSpell(max_edit_distance=0)
ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')
two_words_re = re.compile(r"(?=(\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+))")
def find_in_str_ss2(row):
for doc_word in two_words_re.finditer(row):
print(doc_word)
suggestion = ss2.best_word(doc_word.group(1), silent=True)
if suggestion is not None:
return doc_word.group(1)
return ''
def find_in_list_ss1(list):
for doc_word in list:
suggestion = ss1.best_word(doc_word, silent=True)
if suggestion is not None:
return doc_word
return ''
def find_in_list_ss2(list):
for doc_word in list:
suggestion = ss2.best_word(doc_word, silent=True)
if suggestion is not None:
return doc_word
return ''
print(f"Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")
n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]
n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]
n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]
desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]
print(f"After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")
del ss1, ss2
gc.collect()
def preprocess_regex(dataset, start_time=time()):
karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
karats_repl = r'\1k\4'
unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
unit_repl = r'\1\2\3'
dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
print(f'[{time() - start_time}] Karats normalized.')
dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
print(f'[{time() - start_time}] Units glued.')
def preprocess_pandas(train, test, start_time=time()):
train = train[train.price > 0.0].reset_index(drop=True)
print('Train shape without zero price: ', train.shape)
nrow_train = train.shape[0]
y_train = np.log1p(train["price"])
merge: pd.DataFrame = pd.concat([train, test])
del train
del test
gc.collect()
merge['has_category'] = (merge['category_name'].notnull()).astype('category')
print(f'[{time() - start_time}] Has_category filled.')
merge['category_name'] = merge['category_name'] \
.fillna('other/other/other') \
.str.lower() \
.astype(str)
merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
zip(*merge['category_name'].apply(lambda x: split_cat(x)))
print(f'[{time() - start_time}] Split categories completed.')
merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
print(f'[{time() - start_time}] Has_brand filled.')
merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
print(f'[{time() - start_time}] Categories and item_condition_id concatenated.')
merge['name'] = merge['name'] \
.fillna('') \
.str.lower() \
.astype(str)
merge['brand_name'] = merge['brand_name'] \
.fillna('') \
.str.lower() \
.astype(str)
merge['item_description'] = merge['item_description'] \
.fillna('') \
.str.lower() \
.replace(to_replace='No description yet', value='')
print(f'[{time() - start_time}] Missing filled.')
preprocess_regex(merge, start_time)
brands_filling(merge)
print(f'[{time() - start_time}] Brand name filled.')
merge['name'] = merge['name'] + ' ' + merge['brand_name']
print(f'[{time() - start_time}] Name concatenated.')
merge['item_description'] = merge['item_description'] \
+ ' ' + merge['name'] \
+ ' ' + merge['subcat_1'] \
+ ' ' + merge['subcat_2'] \
+ ' ' + merge['general_cat'] \
+ ' ' + merge['brand_name']
print(f'[{time() - start_time}] Item description concatenated.')
merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)
return merge, y_train, nrow_train
def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
t = train.tocsc()
v = valid.tocsc()
nnz_train = ((t != 0).sum(axis=0) >= min_df).A1
nnz_valid = ((v != 0).sum(axis=0) >= min_df).A1
nnz_cols = nnz_train & nnz_valid
res = t[:, nnz_cols], v[:, nnz_cols]
return res
if __name__ == '__main__':
mp.set_start_method('forkserver', True)
start_time = time()
train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
engine='c',
dtype={'item_condition_id': 'category',
'shipping': 'category'}
)
test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
engine='c',
dtype={'item_condition_id': 'category',
'shipping': 'category'}
)
print(f'[{time() - start_time}] Finished to load data')
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)
submission: pd.DataFrame = test[['test_id']]
merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)
meta_params = {'name_ngram': (1, 2),
'name_max_f': 75000,
'name_min_df': 10,
'category_ngram': (2, 3),
'category_token': '.+',
'category_min_df': 10,
'brand_min_df': 10,
'desc_ngram': (1, 3),
'desc_max_f': 150000,
'desc_max_df': 0.5,
'desc_min_df': 10}
stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
# 'i', 'so', 'its', 'am', 'are'])
vectorizer = FeatureUnion([
('name', Pipeline([
('select', ItemSelector('name', start_time=start_time)),
('transform', HashingVectorizer(
ngram_range=(1, 2),
n_features=2 ** 27,
norm='l2',
lowercase=False,
stop_words=stopwords
)),
('drop_cols', DropColumnsByDf(min_df=2))
])),
('category_name', Pipeline([
('select', ItemSelector('category_name', start_time=start_time)),
('transform', HashingVectorizer(
ngram_range=(1, 1),
token_pattern='.+',
tokenizer=split_cat,
n_features=2 ** 27,
norm='l2',
lowercase=False
)),
('drop_cols', DropColumnsByDf(min_df=2))
])),
('brand_name', Pipeline([
('select', ItemSelector('brand_name', start_time=start_time)),
('transform', CountVectorizer(
token_pattern='.+',
min_df=2,
lowercase=False
)),
])),
('gencat_cond', Pipeline([
('select', ItemSelector('gencat_cond', start_time=start_time)),
('transform', CountVectorizer(
token_pattern='.+',
min_df=2,
lowercase=False
)),
])),
('subcat_1_cond', Pipeline([
('select', ItemSelector('subcat_1_cond', start_time=start_time)),
('transform', CountVectorizer(
token_pattern='.+',
min_df=2,
lowercase=False
)),
])),
('subcat_2_cond', Pipeline([
('select', ItemSelector('subcat_2_cond', start_time=start_time)),
('transform', CountVectorizer(
token_pattern='.+',
min_df=2,
lowercase=False
)),
])),
('has_brand', Pipeline([
('select', ItemSelector('has_brand', start_time=start_time)),
('ohe', OneHotEncoder())
])),
('shipping', Pipeline([
('select', ItemSelector('shipping', start_time=start_time)),
('ohe', OneHotEncoder())
])),
('item_condition_id', Pipeline([
('select', ItemSelector('item_condition_id', start_time=start_time)),
('ohe', OneHotEncoder())
])),
('item_description', Pipeline([
('select', ItemSelector('item_description', start_time=start_time)),
('hash', HashingVectorizer(
ngram_range=(1, 3),
n_features=2 ** 27,
dtype=np.float32,
norm='l2',
lowercase=False,
stop_words=stopwords
)),
('drop_cols', DropColumnsByDf(min_df=2)),
]))
], n_jobs=1)
sparse_merge = vectorizer.fit_transform(merge)
print(f'[{time() - start_time}] Merge vectorized')
print(sparse_merge.shape)
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(sparse_merge)
print(f'[{time() - start_time}] TF/IDF completed')
X_train = X[:nrow_train]
print(X_train.shape)
X_test = X[nrow_train:]
del merge
del sparse_merge
del vectorizer
del tfidf_transformer
gc.collect()
X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
gc.collect()
ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=200, normalize=False, tol=0.01)
ridge.fit(X_train, y_train)
print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')
predsR = ridge.predict(X_test)
print(f'[{time() - start_time}] Predict Ridge completed.')
submission.loc[:, 'price'] = np.expm1(predsR)
submission.loc[submission['price'] < 0.0, 'price'] = 0.0
submission.to_csv("submission_ridge.csv", index=False)
3. Damerau-Levenshtein distance
- Basic Concept:
- Calculates the minimum number of edits needed to transform one string into another
- It's an extension of the Levenshtein distance that adds the operation of "transposing adjacent characters"
- Allowed Edit Operations:
- Character insertion
- Example: "cat" → "cart" (insert r)
- Character deletion
- Example: "cart" → "cat" (delete r)
- Character substitution
- Example: "cat" → "cut" (substitute a with u)
- Adjacent character transposition
- Example: "cloud" → "could" (swap u and l)
- Use Cases:
- Spell checking
- Similar word search
- Measuring similarity between brand names or product names
- Text matching in natural language processing
- Example:
# Converting "kitten" to "sitting":
# 1. k → s (substitution)
# 2. e → i (substitution)
# 3. n → ng (insertion)
# Total distance: 3
- In this code, it's used to find missing brand names by calculating similarity with existing brand names.
- For example, it can help correct a misspelled brand name like "Nkie" to "Nike".
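- A quick check using the dameraulevenshtein() function defined in the code above:
print(dameraulevenshtein('kitten', 'sitting'))  # 3 (two substitutions + one insertion)
print(dameraulevenshtein('nkie', 'nike'))       # 1 (a single adjacent transposition)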
4. Getting suggestion code detail
- get_suggestions(self, string, silent=False) function:
- Purpose: Generates a list of correction suggestions for potentially misspelled words
- Main features:
- Finds similar words to the input word in the dictionary
- Calculates edit distance (character deletion/addition/change)
- Finds and sorts all possible correction words
- Provides (frequency, edit distance) information for each suggested word
- Return values:
- When verbose=0: Returns only the best suggestion
- When verbose>0: Returns all possible suggestions in the form (word, (frequency, edit distance))
- best_word(self, s, silent=False) function:
- Purpose: Finds the optimal correction word for a given word
- Main features:
- Calls get_suggestions function to get suggestion list
- Selects the most appropriate word (smallest edit distance and highest frequency) from the suggestion list
- Return values:
- On success: (corrected word, (frequency, edit distance))
- On failure: None
5. More about Ridge Regression
- Basic Concept:
- It's an advanced form of linear regression
- Created to prevent overfitting
- A regression model that uses L2 regularization
- How it works:
- Basic linear regression equation: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
- Ridge regression adds a penalty term
- Penalty term: α(w₁² + w₂² + ... + wₙ²)
- α is a hyperparameter that controls regularization strength
- Applies penalty to the sum of squared weights
- Advantages:
- Can solve multicollinearity problems
- Example: Handles strongly correlated features like 'height' and 'weight' well
- Prevents overfitting and improves model generalization
- Maintains all features while adjusting their influence
- Difference from Regular Linear Regression:
# Cost function for regular linear regression
Cost = Σ(y - ŷ)²
# Cost function for Ridge regression
Cost = Σ(y - ŷ)² + α(w₁² + w₂² + ... + wₙ²)
- When to use:
- Data with many features
- When features have strong correlations
- When overfitting is a concern
- For example, in the given code, Ridge regression is used because text vectorization creates many features that might be correlated:
ridge = Ridge(
solver='auto', # Automatically choose best solver
fit_intercept=True, # Use intercept term
alpha=0.4, # Regularization strength (α value)
max_iter=200, # Maximum iterations
normalize=False, # Whether to normalize
tol=0.01 # Convergence tolerance
)
- This configured Ridge regression model helps predict prices while appropriately adjusting the influence of many features.
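- A toy illustration (made-up data, not the competition data) of how the alpha penalty shrinks the weights of correlated features:
from sklearn.linear_model import Ridge
import numpy as np
X_toy = np.array([[1.0, 1.01], [2.0, 1.99], [3.0, 3.02], [4.0, 3.98]])  # two nearly collinear columns
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
for alpha in (0.01, 0.4, 10.0):
    print(alpha, Ridge(alpha=alpha).fit(X_toy, y_toy).coef_)  # larger alpha -> smaller, more balanced weights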
Fourth Kernel: LGB and FM [18th Place - 0.40604]
- 18th place solution using lightgbm and FM_FTRL.
- 0.33 * FM_FTRL + 0.67 * LGB
Insight / Summary:
1. Overall Summary
- WordBatch is a Python package designed for fast processing of large-scale text data.
- FM_FTRL is a model that combines Factorization Machines with the Follow-The-Regularized-Leader algorithm.
- Looking at each component:
- FM (Factorization Machines)
- A method for modeling interactions between features in high-dimensional sparse data
- Particularly effective in tasks like recommendation systems and click-through rate (CTR) prediction
- Represents potential interactions between features as low-dimensional vectors
- FTRL (Follow-The-Regularized-Leader)
- A type of online learning algorithm
- Effectively handles L1 and L2 regularization
- Particularly effective in learning sparse models
- A sparse model refers to a model where many parameters (weights) are zero
- Memory efficient and capable of incremental learning
- Main advantages of this combination:
- Suitable for large-scale sparse data processing
- Data is very sparse due to numerous text and categorical variables
- FM effectively learns feature interactions in sparse data
- Automatically learns feature interactions
- FM automatically learns second-order feature interactions
- Example: Captures the impact of brand and category combinations on price
- Enables online learning for real-time updates
- FTRL is an online learning algorithm that is memory efficient
- Can effectively train on large-scale datasets
- High prediction performance
- Commonly used in tasks such as advertising click-through rate prediction, recommendation systems, and tasks requiring real-time prediction.
- Main steps of the solution:
- Feature Engineering
- Extraction of various statistical features from text data MANUALLY
- Text vectorization using WordBatch and TF-IDF
- Label encoding for categorical variables
- Model Ensemble
- FM_FTRL: Factorization Machine effective for sparse data
- Ridge Regression: Basic regression for text features
- LightGBM: High-performance gradient boosting model
- Why a Ridge model?
- The Ridge model is used to perform basic regression analysis on text data (product names and descriptions)
- Basic linear regression model with L2 regularization applied.
- Memory Optimization
- Use of sparse matrices
- Periodic garbage collection
- Removal of unnecessary variables
- Final Prediction
- Weighted average of FM_FTRL and LightGBM predictions
- Save results after reversing log transformation
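- The blend itself is a one-liner (a sketch; predsFM and predsL stand for the FM_FTRL and LightGBM predictions in log space, and these variable names are illustrative):
import numpy as np
preds_final = 0.33 * predsFM + 0.67 * predsL
submission['price'] = np.expm1(preds_final)
submission.to_csv('submission_ensemble.csv', index=False)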
2. Code Details
# Record start time for execution time measurement
import time
start_time = time.time()
# Set submission mode (True: train on full data, False: use validation split)
SUBMIT_MODE = True
# Import required libraries
import pandas as pd
import numpy as np
import time
import gc # For garbage collection
import string
import re
# Use NLTK stopwords
from nltk.corpus import stopwords
# Import scipy for sparse matrix handling
from scipy.sparse import csr_matrix, hstack
# Import sklearn for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Import sklearn for feature selection
from sklearn.feature_selection.univariate_selection import SelectKBest, f_regression
# Import sklearn for label encoding
from sklearn.preprocessing import LabelBinarizer
# Import WordBatch related modules (for text processing)
import wordbatch
from wordbatch.extractors import WordBag
from wordbatch.models import FM_FTRL
# Import sklearn model related modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb
# Define RMSE calculation function
def rmse(predicted, actual):
"""
Calculate Root Mean Squared Error between predicted and actual values
Args:
predicted: array of predicted values
actual: array of actual values
Returns:
RMSE value
"""
return np.sqrt(((predicted - actual) ** 2).mean())
# Category splitting function
def split_cat(text):
"""
Split category string into subcategories
Args:
text: Category string with '/' delimiter
Returns:
Tuple of 3 subcategories (returns 'No Label' if missing)
"""
try:
return text.split("/")
except:
return ("No Label", "No Label", "No Label")
# Define Target Encoding class
class TargetEncoder:
"""
Class for performing target encoding on categorical variables
Numerically encodes categories based on mean target values
"""
def __repr__(self):
return 'TargetEncoder'
def __init__(self, cols, smoothing=1, min_samples_leaf=1, noise_level=0, keep_original=False):
"""
Args:
cols: List of columns to encode
smoothing: Smoothing parameter
min_samples_leaf: Minimum number of samples
noise_level: Level of noise to add
keep_original: Whether to keep original columns
"""
self.cols = cols
self.smoothing = smoothing
self.min_samples_leaf = min_samples_leaf
self.noise_level = noise_level
self.keep_original = keep_original
@staticmethod
def add_noise(series, noise_level):
"""
Add noise to prevent overfitting
"""
return series * (1 + noise_level * np.random.randn(len(series)))
def encode(self, train, test, target):
"""
Perform target encoding on categorical columns in given dataframe
"""
for col in self.cols:
if self.keep_original:
train[col + '_te'], test[col + '_te'] = self.encode_column(train[col], test[col], target)
else:
train[col], test[col] = self.encode_column(train[col], test[col], target)
return train, test
def encode_column(self, trn_series, tst_series, target):
"""
Perform target encoding on a single column
"""
temp = pd.concat([trn_series, target], axis=1)
# Calculate target means
averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
# Calculate smoothing
smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
# Calculate overall mean
prior = target.mean()
# Calculate smoothed means
averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
averages.drop(['mean', 'count'], axis=1, inplace=True)
# Apply encoding to train/test data
ft_trn_series = pd.merge(
trn_series.to_frame(trn_series.name),
averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
on=trn_series.name,
how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
ft_trn_series.index = trn_series.index
ft_tst_series = pd.merge(
tst_series.to_frame(tst_series.name),
averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
on=tst_series.name,
how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
ft_tst_series.index = tst_series.index
return self.add_noise(ft_trn_series, self.noise_level), self.add_noise(ft_tst_series, self.noise_level)
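# The excerpt doesn't show TargetEncoder being called; a hypothetical usage
# (column and variable names here are illustrative, not the kernel's) would be:
# te = TargetEncoder(cols=['brand_name', 'gencat_name'], smoothing=1, min_samples_leaf=3, noise_level=0.01)
# train_df, test_df = te.encode(train_df, test_df, target=np.log1p(train_df['price']))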
# Numeric value processing functions
def to_number(x):
"""
Convert string to number (limit to 100 if greater than 100)
"""
try:
if not x.isdigit():
return 0
x = int(x)
if x > 100:
return 100
else:
return x
except:
return 0
def sum_numbers(desc):
"""
Calculate sum of numbers in description text
"""
if not isinstance(desc, str):
return 0
try:
return sum([to_number(s) for s in desc.split()])
except:
return 0
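# Quick examples of the helpers above (tokens with attached punctuation fail isdigit() and count as 0):
# print(sum_numbers('lot of 3 shirts size 10'))   -> 13
# print(sum_numbers('bundle of 300 stickers'))    -> 100 (single values are capped at 100)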
# Set regex and stopwords for text preprocessing
stopwords = {x: 1 for x in stopwords.words('english')}
non_alphanums = re.compile(u'[^A-Za-z0-9]+')
non_alphanumpunct = re.compile(u'[^A-Za-z0-9\.?!,; \(\)\[\]\'\"\$]+')
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])
def normalize_text(text):
"""
Text normalization:
- Convert to lowercase
- Remove special characters
- Remove stopwords
- Remove short words
"""
return u" ".join(
[x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(" ")] \
if len(x) > 1 and x not in stopwords])
def clean_name(x):
"""
Extract first word from name
"""
if len(x):
x = non_alphanums.sub(' ', x).split()
if len(x):
return x[0].lower()
return ''
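# And for the text helpers (example strings are mine, not from the data):
# print(normalize_text('Brand NEW!! Nike running shoes - size 9'))  -> 'brand new nike running shoes size'
# print(clean_name('Nike Air Max 90'))                              -> 'nike'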
# Load data
print('[{}] Finished defining stuff'.format(time.time() - start_time))
# Load training data
train = pd.read_table('../input/train.tsv', engine='c',
dtype={'item_condition_id': 'category',
'shipping': 'category',
},
converters={'category_name': split_cat})
# Load test data
test = pd.read_table('../input/test.tsv', engine='c',
dtype={'item_condition_id': 'category',
'shipping': 'category',
},
converters={'category_name': split_cat})
print('[{}] Finished load data'.format(time.time() - start_time))
# Add flag for train/test data distinction
train['is_train'] = 1
test['is_train'] = 0
print('[{}] Compiled train / test'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)
# Remove data with price 0
train = train[train.price != 0].reset_index(drop=True)
print('[{}] Removed zero price'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)
# Log transform target variable (price)
y = np.log1p(train['price'])
nrow_train = train.shape[0]
# Merge train/test data
merge = pd.concat([train, test])
submission = test[['test_id']]
print('[{}] Compiled merge'.format(time.time() - start_time))
print('Merge shape: ', merge.shape)
# Remove unnecessary columns and clear memory
del train
del test
merge.drop(['train_id', 'test_id', 'price'], axis=1, inplace=True)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))
# Split and process categories
merge['gencat_name'] = merge['category_name'].str.get(0).replace('', 'missing').astype('category')
merge['subcat1_name'] = merge['category_name'].str.get(1).fillna('missing').astype('category')
merge['subcat2_name'] = merge['category_name'].str.get(2).fillna('missing').astype('category')
merge.drop('category_name', axis=1, inplace=True)
print('[{}] Split categories completed.'.format(time.time() - start_time))
# Handle missing values
merge['item_condition_id'] = merge['item_condition_id'].cat.add_categories(['missing']).fillna('missing')
merge['shipping'] = merge['shipping'].cat.add_categories(['missing']).fillna('missing')
merge['item_description'].fillna('missing', inplace=True)
merge['brand_name'] = merge['brand_name'].fillna('missing').astype('category')
print('[{}] Handle missing completed.'.format(time.time() - start_time))
# Start feature engineering
# Name-related features
merge['name_first'] = merge['name'].apply(clean_name)
print('[{}] FE 1/37'.format(time.time() - start_time))
merge['name_first_count'] = merge.groupby('name_first')['name_first'].transform('count')
print('[{}] FE 2/37'.format(time.time() - start_time))
# Category-related features
merge['gencat_name_count'] = merge.groupby('gencat_name')['gencat_name'].transform('count')
print('[{}] FE 3/37'.format(time.time() - start_time))
merge['subcat1_name_count'] = merge.groupby('subcat1_name')['subcat1_name'].transform('count')
print('[{}] FE 4/37'.format(time.time() - start_time))
merge['subcat2_name_count'] = merge.groupby('subcat2_name')['subcat2_name'].transform('count')
print('[{}] FE 5/37'.format(time.time() - start_time))
merge['brand_name_count'] = merge.groupby('brand_name')['brand_name'].transform('count')
print('[{}] FE 6/37'.format(time.time() - start_time))
# Text-related features
merge['NameLower'] = merge.name.str.count('[a-z]')
print('[{}] FE 7/37'.format(time.time() - start_time))
merge['DescriptionLower'] = merge.item_description.str.count('[a-z]')
print('[{}] FE 8/37'.format(time.time() - start_time))
merge['NameUpper'] = merge.name.str.count('[A-Z]')
print('[{}] FE 9/37'.format(time.time() - start_time))
merge['DescriptionUpper'] = merge.item_description.str.count('[A-Z]')
print('[{}] FE 10/37'.format(time.time() - start_time))
# Length-related features
merge['name_len'] = merge['name'].apply(lambda x: len(x))
print('[{}] FE 11/37'.format(time.time() - start_time))
merge['des_len'] = merge['item_description'].apply(lambda x: len(x))
print('[{}] FE 12/37'.format(time.time() - start_time))
merge['name_desc_len_ratio'] = merge['name_len']/merge['des_len']
print('[{}] FE 13/37'.format(time.time() - start_time))
# Word count related features
merge['desc_word_count'] = merge['item_description'].apply(lambda x: len(x.split()))
print('[{}] FE 14/37'.format(time.time() - start_time))
merge['mean_des'] = merge['item_description'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 15/37'.format(time.time() - start_time))
merge['name_word_count'] = merge['name'].apply(lambda x: len(x.split()))
print('[{}] FE 16/37'.format(time.time() - start_time))
merge['mean_name'] = merge['name'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 17/37'.format(time.time() - start_time))
# Characters per word features
merge['desc_letters_per_word'] = merge['des_len'] / merge['desc_word_count']
print('[{}] FE 18/37'.format(time.time() - start_time))
merge['name_letters_per_word'] = merge['name_len'] / merge['name_word_count']
print('[{}] FE 19/37'.format(time.time() - start_time))
# Upper/lowercase ratio features
merge['NameLowerRatio'] = merge['NameLower'] / merge['name_len']
print('[{}] FE 20/37'.format(time.time() - start_time))
merge['DescriptionLowerRatio'] = merge['DescriptionLower'] / merge['des_len']
print('[{}] FE 21/37'.format(time.time() - start_time))
merge['NameUpperRatio'] = merge['NameUpper'] / merge['name_len']
print('[{}] FE 22/37'.format(time.time() - start_time))
merge['DescriptionUpperRatio'] = merge['DescriptionUpper'] / merge['des_len']
print('[{}] FE 23/37'.format(time.time() - start_time))
# Punctuation related features
merge['NamePunctCount'] = merge.name.str.count(RE_PUNCTUATION)
print('[{}] FE 24/37'.format(time.time() - start_time))
merge['DescriptionPunctCount'] = merge.item_description.str.count(RE_PUNCTUATION)
print('[{}] FE 25/37'.format(time.time() - start_time))
merge['NamePunctCountRatio'] = merge['NamePunctCount'] / merge['name_word_count']
print('[{}] FE 26/37'.format(time.time() - start_time))
merge['DescriptionPunctCountRatio'] = merge['DescriptionPunctCount'] / merge['desc_word_count']
print('[{}] FE 27/37'.format(time.time() - start_time))
# Number related features
merge['NameDigitCount'] = merge.name.str.count('[0-9]')
print('[{}] FE 28/37'.format(time.time() - start_time))
merge['DescriptionDigitCount'] = merge.item_description.str.count('[0-9]')
print('[{}] FE 29/37'.format(time.time() - start_time))
merge['NameDigitCountRatio'] = merge['NameDigitCount'] / merge['name_word_count']
print('[{}] FE 30/37'.format(time.time() - start_time))
merge['DescriptionDigitCountRatio'] = merge['DescriptionDigitCount']/merge['desc_word_count']
print('[{}] FE 31/37'.format(time.time() - start_time))
# Stopword and special character related features
merge['stopword_ratio_desc'] = merge['item_description'].apply(lambda x: len([w for w in x.split() if w in stopwords])) / merge['desc_word_count']
print('[{}] FE 32/37'.format(time.time() - start_time))
merge['num_sum'] = merge['item_description'].apply(sum_numbers) # Sum of numbers in description
print('[{}] FE 33/37'.format(time.time() - start_time))
merge['weird_characters_desc'] = merge['item_description'].str.count(non_alphanumpunct) # Count of special characters
print('[{}] FE 34/37'.format(time.time() - start_time))
merge['weird_characters_name'] = merge['name'].str.count(non_alphanumpunct)
print('[{}] FE 35/37'.format(time.time() - start_time))
# Price related keyword features: in the Mercari data, prices written inside the text are masked with the literal token "[rm]"
merge['prices_count'] = merge['item_description'].str.count(r'\[rm\]')  # number of masked prices mentioned in the description
print('[{}] FE 36/37'.format(time.time() - start_time))
merge['price_in_name'] = merge['item_description'].str.contains('[rm]', regex=False).astype('int')  # whether the description mentions a masked price at all
print('[{}] FE 37/37'.format(time.time() - start_time))
# Feature normalization
cols = set(merge.columns.values)
basic_cols = {'name', 'item_condition_id', 'brand_name',
'shipping', 'item_description', 'gencat_name',
'subcat1_name', 'subcat2_name', 'name_first', 'is_train'}
# Separate columns to normalize and keep
cols_to_normalize = cols - basic_cols - {'price_in_name'}
other_cols = basic_cols | {'price_in_name'}
# Scale the numeric features: center on the mean and divide by the (max - min) range
merge_to_normalize = merge[list(cols_to_normalize)]
merge_to_normalize = (merge_to_normalize - merge_to_normalize.mean()) / (merge_to_normalize.max() - merge_to_normalize.min())
print('[{}] FE Normalized'.format(time.time() - start_time))
# Merge normalized features and basic features
merge = merge[list(other_cols)]
merge = pd.concat([merge, merge_to_normalize], axis=1)
print('[{}] FE Merged'.format(time.time() - start_time))
# Memory cleanup
del(merge_to_normalize)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))
# Split train/test data
df_test = merge.loc[merge['is_train'] == 0]
df_train = merge.loc[merge['is_train'] == 1]
del merge
gc.collect()
df_test = df_test.drop(['is_train'], axis=1)
df_train = df_train.drop(['is_train'], axis=1)
# Split validation data (if not in submit mode)
if SUBMIT_MODE:
y_train = y
del y
gc.collect()
else:
df_train, df_test, y_train, y_test = train_test_split(df_train, y, test_size=0.2, random_state=144)
print('[{}] Splitting completed.'.format(time.time() - start_time))
# Process name text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
"hash_ngrams": 2, # Use up to 2-grams
"hash_ngrams_weights": [1.5, 1.0], # Weights for unigram and bigram
"hash_size": 2 ** 29, # Hash size
"norm": None, # No normalization
"tf": 'binary', # Use binary TF
"idf": None, # Don't use IDF
}), procs=8)
wb.dictionary_freeze = True
X_name_train = wb.fit_transform(df_train['name'])
X_name_test = wb.transform(df_test['name'])
del(wb)
# Remove low frequency features
mask = np.where(X_name_train.getnnz(axis=0) > 3)[0]
X_name_train = X_name_train[:, mask]
X_name_test = X_name_test[:, mask]
print('[{}] Vectorize `name` completed.'.format(time.time() - start_time))
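# (note) getnnz(axis=0) counts, for each hashed feature column, how many rows have a
# non-zero entry; keeping only columns with more than 3 non-zeros drops hashed n-grams
# that occur in very few names, shrinking the matrix and removing mostly-noise features.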
# Process item description text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
"hash_ngrams": 2,
"hash_ngrams_weights": [1.0, 1.0],
"hash_size": 2 ** 28,
"norm": "l2", # Use L2 normalization
"tf": 1.0, # Use actual frequency
"idf": None
}), procs=8)
wb.dictionary_freeze = True
X_description_train = wb.fit_transform(df_train['item_description'])
X_description_test = wb.transform(df_test['item_description'])
del(wb)
# Remove low frequency features
mask = np.where(X_description_train.getnnz(axis=0) > 3)[0]
X_description_train = X_description_train[:, mask]
X_description_test = X_description_test[:, mask]
print('[{}] Vectorize `item_description` completed.'.format(time.time() - start_time))
# Ridge regression modeling for description text
# Split data in half for cross validation
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_description_train, y_train,
test_size=0.5,
shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))
# First Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train desc ridge (1)'.format(time.time() - start_time))
desc_ridge_preds1 = model.predict(X_train_2)
desc_ridge_preds1f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (1)'.format(time.time() - start_time))
# Second Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train desc ridge (2)'.format(time.time() - start_time))
desc_ridge_preds2 = model.predict(X_train_1)
desc_ridge_preds2f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (2)'.format(time.time() - start_time))
# Combine Ridge predictions
desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(desc_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
print('RMSLE TEST: {}'.format(rmse(desc_ridge_preds_test, y_test)))
# Ridge regression modeling for name text (same process as above)
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_name_train, y_train,
test_size=0.5,
shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train name ridge (1)'.format(time.time() - start_time))
name_ridge_preds1 = model.predict(X_train_2)
name_ridge_preds1f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (1)'.format(time.time() - start_time))
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train name ridge (2)'.format(time.time() - start_time))
name_ridge_preds2 = model.predict(X_train_1)
name_ridge_preds2f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (2)'.format(time.time() - start_time))
name_ridge_preds_oof = np.concatenate((name_ridge_preds2, name_ridge_preds1), axis=0)
name_ridge_preds_test = (name_ridge_preds1f + name_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(name_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
print('RMSLE TEST: {}'.format(rmse(name_ridge_preds_test, y_test)))
# Memory cleanup
del X_train_1
del X_train_2
del y_train_1
del y_train_2
del name_ridge_preds1
del name_ridge_preds1f
del name_ridge_preds2
del name_ridge_preds2f
del desc_ridge_preds1
del desc_ridge_preds1f
del desc_ridge_preds2
del desc_ridge_preds2f
gc.collect()
print('[{}] Finished garbage collection'.format(time.time() - start_time))
# Process categorical variables
# Label encode brand names
lb = LabelBinarizer(sparse_output=True)
X_brand_train = lb.fit_transform(df_train['brand_name'])
X_brand_test = lb.transform(df_test['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))
# Label encode categories
X_cat_train = lb.fit_transform(df_train['gencat_name'])
X_cat_test = lb.transform(df_test['gencat_name'])
X_cat1_train = lb.fit_transform(df_train['subcat1_name'])
X_cat1_test = lb.transform(df_test['subcat1_name'])
X_cat2_train = lb.fit_transform(df_train['subcat2_name'])
X_cat2_test = lb.transform(df_test['subcat2_name'])
print('[{}] Finished label binarize categories'.format(time.time() - start_time))
# One-hot encode item_condition_id / shipping and stack them with the engineered numeric features as a sparse matrix
X_dummies_train = csr_matrix(
pd.get_dummies(df_train[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
sparse=True).values)
print('[{}] Create dummies completed - train'.format(time.time() - start_time))
X_dummies_test = csr_matrix(
pd.get_dummies(df_test[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
sparse=True).values)
print('[{}] Create dummies completed - test'.format(time.time() - start_time))
# Combine all feature matrices
sparse_merge_train = hstack((X_dummies_train, X_description_train, X_brand_train, X_cat_train,
X_cat1_train, X_cat2_train, X_name_train)).tocsr()
del X_description_train, lb, X_name_train, X_dummies_train
gc.collect()
print('[{}] Create sparse merge train completed'.format(time.time() - start_time))
sparse_merge_test = hstack((X_dummies_test, X_description_test, X_brand_test, X_cat_test,
X_cat1_test, X_cat2_test, X_name_test)).tocsr()
del X_description_test, X_name_test, X_dummies_test
gc.collect()
print('[{}] Create sparse merge test completed'.format(time.time() - start_time))
# Set number of iterations for FM_FTRL model training
if SUBMIT_MODE:
iters = 3
else:
iters = 1
rounds = 3
# Define FM_FTRL model
model = FM_FTRL(alpha=0.035, beta=0.001, L1=0.00001, L2=0.15, D=sparse_merge_train.shape[1],
alpha_fm=0.05, L2_fm=0.0, init_fm=0.01,
D_fm=100, e_noise=0, iters=iters, inv_link="identity", threads=4)
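# (rough parameter guide for wordbatch's FM_FTRL; approximate notes, not from the kernel author)
# alpha, beta, L1, L2      : FTRL learning-rate / regularization terms for the linear part
# alpha_fm, L2_fm, init_fm : learning rate, L2 penalty and init scale for the FM factors
# D                        : input dimensionality (number of columns of the sparse matrix)
# D_fm                     : size of the FM latent factors (here 100)
# iters                    : passes over the data; inv_link="identity" gives a plain regression output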
# Train and predict with FM_FTRL model
if SUBMIT_MODE:
model.fit(sparse_merge_train, y_train)
print('[{}] Train FM completed'.format(time.time() - start_time))
predsFM = model.predict(sparse_merge_test)
print('[{}] Predict FM completed'.format(time.time() - start_time))
else:
# In validation mode, repeat multiple times to check performance
for i in range(rounds):
model.fit(sparse_merge_train, y_train)
predsFM = model.predict(sparse_merge_test)
print('[{}] Iteration {}/{} -- RMSLE: {}'.format(time.time() - start_time, i + 1, rounds, rmse(predsFM, y_test)))
del model
gc.collect()
if not SUBMIT_MODE:
print("FM_FTRL dev RMSLE:", rmse(predsFM, y_test))
# Feature selection (SelectKBest)
fselect = SelectKBest(f_regression, k=48000)
train_features = fselect.fit_transform(sparse_merge_train, y_train)
test_features = fselect.transform(sparse_merge_test)
print('[{}] Select best completed'.format(time.time() - start_time))
# Memory cleanup
del sparse_merge_train
del sparse_merge_test
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))
# TF-IDF vectorization (name)
tv = TfidfVectorizer(max_features=250000,
ngram_range=(1, 3),
stop_words=None)
X_name_train = tv.fit_transform(df_train['name'])
print('[{}] Finished TFIDF vectorize `name` (1/2)'.format(time.time() - start_time))
X_name_test = tv.transform(df_test['name'])
print('[{}] Finished TFIDF vectorize `name` (2/2)'.format(time.time() - start_time))
# TF-IDF vectorization (item description)
tv = TfidfVectorizer(max_features=500000,
ngram_range=(1, 3),
stop_words=None)
X_description_train = tv.fit_transform(df_train['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (1/2)'.format(time.time() - start_time))
X_description_test = tv.transform(df_test['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (2/2)'.format(time.time() - start_time))
# Prepare dataset for LightGBM model
d_train = lgb.Dataset(train_features, label=y_train)
del train_features; gc.collect()
if SUBMIT_MODE:
watchlist = [d_train]
else:
d_valid = lgb.Dataset(test_features, label=y_test)
watchlist = [d_train, d_valid]
# Set LightGBM parameters
params = {
'learning_rate': 0.15,
'application': 'regression',
'max_depth': 13,
'num_leaves': 400,
'verbosity': -1, # Suppress LightGBM info/warning output
'metric': 'RMSE',
'data_random_seed': 1,
'bagging_fraction': 0.8, # Data sampling ratio for bagging
'feature_fraction': 0.6, # Feature ratio to use in each tree
'nthread': 4, # Number of CPU threads to use
'lambda_l1': 10, # L1 regularization
'lambda_l2': 10 # L2 regularization
}
print('[{}] Finished compiling LGB'.format(time.time() - start_time))
# Train LightGBM model
modelL = lgb.train(params,
train_set=d_train,
num_boost_round=1350, # Number of boosting iterations
valid_sets=watchlist,
verbose_eval=50) # Print evaluation results every 50 iterations
# LightGBM prediction
predsL = modelL.predict(test_features)
predsL[predsL < 0] = 0 # Adjust negative predictions to 0
if not SUBMIT_MODE:
print("LGB RMSLE:", rmse(predsL, y_test))
# Memory cleanup
del d_train
del modelL
if not SUBMIT_MODE:
del d_valid
gc.collect()
# Combine FM_FTRL and LightGBM predictions (weighted average)
preds_final = predsFM * 0.33 + predsL * 0.67
if not SUBMIT_MODE:
print('Final RMSE: ', rmse(preds_final, y_test))
# Save final prediction results
if SUBMIT_MODE:
preds_final = np.expm1(preds_final) # Reverse log transformation
submission['price'] = preds_final
submission.to_csv('lgb_and_fm_separate_train_test.csv', index=False)
print('[{}] Writing submission done'.format(time.time() - start_time))
3. Combining Ridge predictions
desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
- OOF predictions:
- Each training row is predicted by a model that never saw it during training
- This yields predictions for the whole training set without target leakage
- Such predictions can then be used as features for next-level (stacked) models
- Test prediction averaging:
- Averaging the two models' test predictions gives a more stable estimate
- Errors of the individual models partially cancel out
- This is the basic principle of ensemble learning
- Note, however, that in this kernel the Ridge OOF predictions are only used to print intermediate RMSLE checks; they are NOT fed into the final FM_FTRL + LightGBM ensemble. A minimal sketch of the 2-fold OOF pattern is shown below.
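- To make the pattern concrete, here is a minimal, self-contained 2-fold OOF sketch using scikit-learn's Ridge on synthetic data (the array names and data below are illustrative, not taken from the kernel):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
# Synthetic stand-ins for the real feature matrix, target and test matrix
rng = np.random.RandomState(42)
X = rng.rand(1000, 20)
y = X @ rng.rand(20) + 0.1 * rng.randn(1000)
X_test = rng.rand(200, 20)
# Split the training data in half, keeping the original row order (shuffle=False)
X_1, X_2, y_1, y_2 = train_test_split(X, y, test_size=0.5, shuffle=False)
# Model A: train on the first half, predict the second half and the test set
model_a = Ridge(alpha=3.3)
model_a.fit(X_1, y_1)
oof_2 = model_a.predict(X_2)
test_a = model_a.predict(X_test)
# Model B: train on the second half, predict the first half and the test set
model_b = Ridge(alpha=3.3)
model_b.fit(X_2, y_2)
oof_1 = model_b.predict(X_1)
test_b = model_b.predict(X_test)
# OOF predictions cover every training row, each predicted by a model that never saw it
oof_preds = np.concatenate((oof_1, oof_2), axis=0)  # same row order as y
# Test predictions are averaged across the two models for stability
test_preds = (test_a + test_b) / 2.0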
Stay focused on your goals and don't let distractions derail you from your path.
- Max Holloway -