
[Kaggle Study] #13 Mercari Price Suggestion Challenge

dongsunseng 2024. 12. 6. 01:07

Twelfth competition following Youhan Lee's curriculum; a natural language processing competition.

First Kernel: Mercari Interactive EDA + Topic Modelling

  • EDA kernel with matplotlib.

Insight / Summary:

1. Log transformation on target var

  • Our response or target variable is the price we are suggesting to Mercari's marketplace sellers.
  • The median price of all the items in the training set is about $17, but given the existence of some extreme values over $100 and a maximum of $2,009, the distribution of the variable is heavily skewed to the right.
  • So let's apply a log transformation on the price (we add +1 to the value before the transformation to avoid taking the log of zero).
import numpy as np
import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white', range=[0,250])
plt.xlabel('price', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()

2. Dealing with Item Description feature

  • It will be more challenging to parse this particular feature since it's unstructured text data. 
  • Does a more detailed and lengthy description result in a higher bidding price? 
  • We will strip out all punctuation, remove some English stop words (i.e. redundant words such as "a", "the", etc.) and any other words with 3 characters or fewer:
import re
import string
from sklearn.feature_extraction import stop_words  # newer scikit-learn: from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def wordCount(text):
    try:
        # convert to lower case and strip out punctuation / digits
        text = text.lower()
        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        txt = regex.sub(" ", text)
        # drop stop words and very short tokens
        words = [w for w in txt.split(" ")
                 if w not in stop_words.ENGLISH_STOP_WORDS and len(w) > 3]
        return len(words)
    except Exception:
        return 0

# add a column of word counts to both the training and test set
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))
  • We also need to check if there are any missing values in the item description (4 observations don't have a description) and remove those observations from our training set.
train.item_description.isnull().sum()
# result: 4

# remove missing values in item description
train = train[pd.notnull(train['item_description'])]

 

3. Pre-processing: tokenization

  • Most of the time, the first step of an NLP project is to "tokenize" your documents, whose main purpose is to normalize our texts. 
  • The fundamental stages will usually include:
    • break the descriptions into sentences and then break the sentences into tokens
    • remove punctuation and stop words
    • lowercase the tokens
    • herein, I will also only consider words that have length equal to or greater than 3 characters
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

stop = set(stopwords.words('english'))
def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try: 
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(" ", text) # remove punctuation
        
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w)>=3]
        
        return filtered_tokens
            
    except TypeError as e: print(text,e)
# apply the tokenizer to the item description column
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

 

4. WordCloud package

  • We could also use the WordCloud package to easily visualize which words have the highest frequencies within each category:
from collections import Counter
from wordcloud import WordCloud

# build dictionary with key=category and values as all the descriptions related
# (general_cats holds the unique general categories, extracted earlier in the kernel)
cat_desc = dict()
for cat in general_cats: 
    text = " ".join(train.loc[train['general_cat']==cat, 'item_description'].values)
    cat_desc[cat] = tokenize(text)


# find the most common words for the top 4 categories
women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate(str(tup))
    return wordcloud
fig,axes = plt.subplots(2, 2, figsize=(30, 15))

ax = axes[0, 0]
ax.imshow(generate_wordcloud(women100), interpolation="bilinear")
ax.axis('off')
ax.set_title("Women Top 100", fontsize=30)

ax = axes[0, 1]
ax.imshow(generate_wordcloud(beauty100))
ax.axis('off')
ax.set_title("Beauty Top 100", fontsize=30)

ax = axes[1, 0]
ax.imshow(generate_wordcloud(kids100))
ax.axis('off')
ax.set_title("Kids Top 100", fontsize=30)

ax = axes[1, 1]
ax.imshow(generate_wordcloud(electronics100))
ax.axis('off')
ax.set_title("Electronic Top 100", fontsize=30)

 

5. Pre-processing: tf-idf

  • tf-idf is the acronym for Term Frequency-Inverse Document Frequency.
  • It quantifies the importance of a particular word relative to the vocabulary of a collection of documents or corpus.
  • The metric depends on two factors:
    • Term Frequency: the number of occurrences of a word in a given document (i.e. bag of words)
    • Inverse Document Frequency: the inverse of the number of documents in the corpus in which the word occurs
  • Think about it this way: if a word is used extensively in all documents, its presence within a specific document does not tell us much about that document in particular. 
  • So the second term can be seen as a penalty that down-weights common words such as "a", "the", "and", etc. tf-idf can therefore be seen as a weighting scheme for word relevancy in a specific document. 
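
As a quick illustration (a toy example, not part of the kernel), fitting scikit-learn's TfidfVectorizer on three tiny documents shows how words shared by several documents receive lower idf weights:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["blue dress with tags", "blue shoes", "red dress never worn"]
toy_vec = TfidfVectorizer()
toy_vec.fit(toy_docs)
# "blue" and "dress" each appear in two documents, so their idf weights are
# lower than those of words that appear only once ("shoes", "tags", "worn", ...)
print(dict(zip(toy_vec.get_feature_names(), toy_vec.idf_.round(2))))  # get_feature_names_out() in newer scikit-learn

The kernel then fits the actual vectorizer on the item descriptions: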
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,
                             max_features=180000,
                             tokenizer=tokenize,
                             ngram_range=(1, 2))
                             
all_desc = np.append(train['item_description'].values, test['item_description'].values)
vz = vectorizer.fit_transform(list(all_desc))
  • vz is a tfidf matrix where:
    • the number of rows is the total number of descriptions
    • the number of columns is the total number of unique tokens across the descriptions
  • Given the high dimension of our tfidf matrix, we need to reduce its dimensionality using the Singular Value Decomposition (SVD) technique.
  • And to visualize our vocabulary, we could next use t-SNE to reduce the SVD output down to 2 dimensions.
  • t-SNE is more suitable for dimensionality reduction to 2 or 3.

SVD (Singular Value Decomposition):

  • Linear dimensionality reduction
  • Preserves major patterns in data
  • Relatively fast computation
  • Suitable for reduction to larger dimensions (e.g., 50 dimensions)

t-SNE (t-Distributed Stochastic Neighbor Embedding):

  • Non-linear dimensionality reduction
  • Preserves similarity relationships between data points
  • Optimized for visualization (2-3 dimensions)
  • Effectively reveals cluster structures

6. t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
  • The goal is to take a set of points in a high-dimensional space and find a representation of those points in a lower-dimensional space, typically the 2D plane.
  • It is based on probability distributions with random walk on neighborhood graphs to find the structure within the data.
  • But since t-SNE complexity is significantly high, usually we'd use other high-dimension reduction techniques before applying t-SNE.
  • First, let's take a sample from both the training and test item descriptions since t-SNE can take a very long time to execute.
  • We can then reduce the dimension of each tf-idf vector to n_comp components (30 below) using SVD.
trn = train.copy()
tst = test.copy()
trn['is_train'] = 1
tst['is_train'] = 0

sample_sz = 15000

combined_df = pd.concat([trn, tst])
combined_sample = combined_df.sample(n=sample_sz)
vz_sample = vectorizer.fit_transform(list(combined_sample['item_description']))

from sklearn.decomposition import TruncatedSVD

n_comp=30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)
# reduce the SVD output to 2 dimensions with t-SNE
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

 

7. tf-idf clustering of the item description

# plot_tfidf (a bokeh figure) and tfidf_df (a data source holding the t-SNE
# coordinates, descriptions, tokens and categories) are built earlier in the kernel
from bokeh.models import HoverTool
from bokeh.plotting import show

plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "tokens": "@tokens", "category":"@category"}
show(plot_tfidf)

8. K-Means Clustering

  • The objective of K-means clustering is to minimize the average squared Euclidean distance of the documents / descriptions from their cluster centroids.
from sklearn.cluster import MiniBatchKMeans

num_clusters = 30 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
                               init='k-means++',
                               n_init=1,
                               init_size=1000, batch_size=1000, verbose=0, max_iter=1000)

kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()

for i in range(num_clusters):
    print("Cluster %d:" % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        aux += terms[j] + ' | '
    print(aux)
    print()

In order to plot these clusters, we first need to reduce the dimension of the distances to 2 using t-SNE:

# repeat the same steps for the sample
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)
#combined_sample.reset_index(drop=True, inplace=True)
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['description'] = combined_sample['item_description']
kmeans_df['category'] = combined_sample['general_cat']
#kmeans_df['cluster']=kmeans_df.cluster.astype(str).astype('category')

# bp refers to bokeh.plotting and colormap is an array of cluster colors,
# both set up earlier in the kernel
plot_kmeans = bp.figure(plot_width=700, plot_height=600,
                        title="KMeans clustering of the description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
                                    color=colormap[kmeans_clusters],
                                    description=kmeans_df['description'],
                                    category=kmeans_df['category'],
                                    cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "category": "@category", "cluster":"@cluster" }
show(plot_kmeans)

9. Latent Dirichlet Allocation

  • Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus.
  • LDA starts from a fixed number of topics.
  • Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics.
  • Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.
  • Its input is a bag of words, i.e. each document is represented as a row, with each column containing the count of a word from the corpus vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

cvectorizer = CountVectorizer(min_df=4,
                              max_features=180000,
                              tokenizer=tokenize,
                              ngram_range=(1,2))
                              
cvz = cvectorizer.fit_transform(combined_sample['item_description'])

lda_model = LatentDirichletAllocation(n_components=20,
                                      learning_method='online',
                                      max_iter=20,
                                      random_state=42)
                                      
X_topics = lda_model.fit_transform(cvz)

n_top_words = 10
topic_summaries = []

topic_word = lda_model.components_  # get the topic words
vocab = cvectorizer.get_feature_names()

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(X_topics)
unnormalized = np.matrix(X_topics)
doc_topic = unnormalized/unnormalized.sum(axis=1)

lda_keys = []
for i, tweet in enumerate(combined_sample['item_description']):
    lda_keys += [doc_topic[i].argmax()]

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['description'] = combined_sample['item_description']
lda_df['category'] = combined_sample['general_cat']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)
plot_lda = bp.figure(plot_width=700,
                     plot_height=600,
                     title="LDA topic visualization",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    description=lda_df['description'],
                                    topic=lda_df['topic'],
                                    category=lda_df['category']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"description":"@description",
                "topic":"@topic", "category":"@category"}
show(plot_lda)

  • pyLDAvis is a powerful tool that gives us an interactive visualization for LDA.
  • It's a shame that embedding the pyLDAvis HTML here would distort the layout of the kernel, so I won't upload it.
  • But if you run the code below, an HTML file will be generated with a very interesting interactive bubble chart that visualizes the space of your topic clusters and the term components within each topic.
def prepareLDAData():
    data = {
        'vocab': vocab,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
        'term_frequency':cvectorizer.vocabulary_,
        'topic_term_dists': lda_model.components_
    } 
    return data
    
import pyLDAvis

lda_df['len_docs'] = combined_sample['tokens'].map(len)
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)

Second Kernel: A simple nn solution with Keras (~0.48611 PL)

  • Kernel using a neural network for modeling.

Insight / Summary:

1. Metric

import math

def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function
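
An equivalent vectorized version (a sketch using numpy, not from the original kernel) is usually much faster on large arrays:

import numpy as np

def rmsle_np(y, y_pred):
    # same metric, computed with numpy's vectorized log1p
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2))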

 

2. Missing value

#HANDLE MISSING VALUES
print("Handling missing values...")
def handle_missing(dataset):
    dataset.category_name.fillna(value="missing", inplace=True)
    dataset.brand_name.fillna(value="missing", inplace=True)
    dataset.item_description.fillna(value="missing", inplace=True)
    return (dataset)

train = handle_missing(train)
test = handle_missing(test)

 

3. Categorical data - label encoding

#PROCESS CATEGORICAL DATA
print("Handling categorical variables...")
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le

train.head(3)

 

4. raw text - tokenization

#PROCESS TEXT: RAW
print("Text to seq process...")
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])

print("   Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print("   Transforming text to seq...")

train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())

train.head(3)

 

5. Scaling target variable

#SCALE target variable
from sklearn.preprocessing import MinMaxScaler
train["target"] = np.log(train.price+1)
target_scaler = MinMaxScaler(feature_range=(-1, 1))
train["target"] = target_scaler.fit_transform(train.target.values.reshape(-1,1))
pd.DataFrame(train.target).hist()

 

6. Modeling GRU NN

1) Finding max values for NN

#EMBEDDINGS MAX VALUE
#Based on the histograms, we select the following lengths
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
MAX_TEXT = np.max([np.max(train.seq_name.max())
                   , np.max(test.seq_name.max())
                  , np.max(train.seq_item_description.max())
                  , np.max(test.seq_item_description.max())])+2
MAX_CATEGORY = np.max([train.category_name.max(), test.category_name.max()])+1
MAX_BRAND = np.max([train.brand_name.max(), test.brand_name.max()])+1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()])+1
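
These maxima size the embedding layers below; the padded inputs that feed the network (X_train["name"], X_train["item_desc"], ...) are assembled from the sequence columns. A minimal sketch assuming Keras' pad_sequences (the kernel's exact helper may differ slightly):

from keras.preprocessing.sequence import pad_sequences

def get_keras_data(dataset):
    # dictionary of arrays keyed by the names of the Input layers defined below
    return {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand_name': np.array(dataset.brand_name),
        'category_name': np.array(dataset.category_name),
        'item_condition': np.array(dataset.item_condition_id),
        'num_vars': np.array(dataset[["shipping"]]),
    }

X_train = get_keras_data(train)
X_test = get_keras_data(test)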

2) Actual modeling

#KERAS MODEL DEFINITION
from keras.layers import Input, Dropout, Dense, BatchNormalization, Activation, concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras import backend as K

def get_callbacks(filepath, patience=2):
    es = EarlyStopping('val_loss', patience=patience, mode="min")
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]

def rmsle_cust(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.sqrt(K.mean(K.square(first_log - second_log), axis=-1))

def get_model():
    #params
    dr_r = 0.1
    
    #Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand_name = Input(shape=[1], name="brand_name")
    category_name = Input(shape=[1], name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    num_vars = Input(shape=[X_train["num_vars"].shape[1]], name="num_vars")
    
    #Embeddings layers
    emb_name = Embedding(MAX_TEXT, 50)(name)
    emb_item_desc = Embedding(MAX_TEXT, 50)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
    emb_category_name = Embedding(MAX_CATEGORY, 10)(category_name)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    
    #rnn layer
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)
    
    #main layer
    main_l = concatenate([
        Flatten() (emb_brand_name)
        , Flatten() (emb_category_name)
        , Flatten() (emb_item_condition)
        , rnn_layer1
        , rnn_layer2
        , num_vars
    ])
    main_l = Dropout(dr_r) (Dense(128) (main_l))
    main_l = Dropout(dr_r) (Dense(64) (main_l))
    
    #output
    output = Dense(1, activation="linear") (main_l)
    
    #model
    model = Model([name, item_desc, brand_name
                   , category_name, item_condition, num_vars], output)
    model.compile(loss="mse", optimizer="adam", metrics=["mae", rmsle_cust])
    
    return model

    
model = get_model()
model.summary()
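
Training itself isn't shown above; a minimal sketch of how the pieces fit together (epochs, batch size and validation split here are illustrative assumptions, not the kernel's exact settings):

# Illustrative training call; hyperparameters are assumptions
model.fit(X_train, train.target,
          epochs=5, batch_size=20000,
          validation_split=0.1, verbose=1)

# Predictions are mapped back to prices by inverting both transformations:
# preds = target_scaler.inverse_transform(model.predict(X_test))
# preds = np.exp(preds) - 1   # undo log(price + 1)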

Third Kernel: Ridge (LB 0.41943)

  • Kernel mainly using a Ridge model.

Insight / Summary:

1. Overall Summary

  • This code implements a machine learning pipeline for product price prediction.
  • Here are the main steps:
    • Data Preprocessing:
      • Remove data with zero prices
      • Clean text data including categories, brand names, product names, and descriptions
      • Fill missing brand names using the SymSpell algorithm based on similarity
      • Split categories into major/medium/minor classifications
      • Combine text data to create rich features
    • Feature Engineering:
      • Vectorize text data using HashingVectorizer and CountVectorizer
      • Encode categorical variables using OneHotEncoder
      • Apply TF-IDF transformation to reflect text feature importance
      • Select only features common to both training and test sets
    • Modeling:
      • Use Ridge regression for price prediction
      • Use log-transformed prices as targets
      • Generate final prices through exponential transformation of predictions
  •  The main characteristics of this implementation are:
    • Focus on Text Data Processing:
      • Use HashingVectorizer for memory-efficient handling of large vocabularies
      • Extract context information through n-gram based features
      • Reflect word importance through TF-IDF transformation
    • Efficient Memory Management:
      • Immediately release unnecessary data from memory
      • Optimize memory usage with HashingVectorizer
      • Maintain only features common to training/test sets
    • Robust Preprocessing:
      • Intelligent filling of missing brand names
      • Hierarchical use of category information
      • Text data normalization and combination
    • Scalable Structure:
      • Modularization using Pipeline and FeatureUnion
      • Flexibility through custom transformer classes
      • Support for multiprocessing

2. Code Details

# Import essential libraries
import multiprocessing as mp  # Library for parallel processing
import pandas as pd  # pandas for data processing 
from time import time  # For measuring execution time
from scipy.sparse import csr_matrix  # For sparse matrix operations
import os  # For OS related functionality
from sklearn.linear_model import Ridge  # Ridge regression model
from sklearn.pipeline import FeatureUnion, Pipeline  # For feature processing pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer  # Text processing
from sklearn.metrics import mean_squared_log_error  # Evaluation metric
from sklearn.preprocessing import OneHotEncoder  # For encoding categorical variables
import numpy as np  # For numerical operations
import gc  # Garbage collection
from sklearn.base import BaseEstimator, TransformerMixin  # For creating custom transformers
import re  # Regular expressions
from pandas.api.types import is_numeric_dtype, is_categorical_dtype  # For checking data types

# Multithreading configuration
os.environ['MKL_NUM_THREADS'] = '4'  # Limit Intel Math Kernel Library threads
os.environ['OMP_NUM_THREADS'] = '4'  # Limit OpenMP threads
os.environ['JOBLIB_START_METHOD'] = 'forkserver'  # Set joblib parallel processing method

# Set input data path
INPUT_PATH = r'../input'

# Function to calculate Damerau-Levenshtein distance
def dameraulevenshtein(seq1, seq2):
   """Calculate the Damerau-Levenshtein distance between sequences.

    This method has not been modified from the original.
    Source: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
   # Implementation maintained as original using dynamic programming
   # Stores only current row and previous two rows for memory efficiency
   oneago = None
   thisrow = list(range(1, len(seq2) + 1)) + [0]
   for x in range(len(seq1)):
       twoago, oneago, thisrow = (oneago, thisrow, [0] * len(seq2) + [x + 1])
       for y in range(len(seq2)):
           delcost = oneago[y] + 1  # Deletion cost
           addcost = thisrow[y - 1] + 1  # Addition cost
           subcost = oneago[y - 1] + (seq1[x] != seq2[y])  # Substitution cost
           thisrow[y] = min(delcost, addcost, subcost)
           # Handle transpositions
           if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                   and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
               thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
   return thisrow[len(seq2) - 1]
   
class SymSpell:
   """
   A class implementing the SymSpell algorithm.
   This algorithm provides an efficient method for spell correction.
   """
   
   def __init__(self, max_edit_distance=3, verbose=0):
       """
       Parameters:
       max_edit_distance: Maximum edit distance (how many edits to allow)
       verbose: Verbosity level (0: top suggestion only, 1: all suggestions with minimal distance, 2: all possible suggestions)
       """
       self.max_edit_distance = max_edit_distance
       self.verbose = verbose
       self.dictionary = {}  # Word dictionary
       self.longest_word_length = 0  # Length of longest word

   def get_deletes_list(self, w):
       """
       Generates all possible combinations of the word with characters deleted up to max_edit_distance.
       Example: "word" -> ["ord", "wrd", "wod", "wor"]
       """
       deletes = []
       queue = [w]
       for d in range(self.max_edit_distance):
           temp_queue = []
           for word in queue:
               if len(word) > 1:
                   for c in range(len(word)):
                       word_minus_c = word[:c] + word[c + 1:]
                       if word_minus_c not in deletes:
                           deletes.append(word_minus_c)
                       if word_minus_c not in temp_queue:
                           temp_queue.append(word_minus_c)
           queue = temp_queue
       return deletes

   def create_dictionary_entry(self, w):
       """
       Adds a word and its derived deletion variants to the dictionary.
       Returns:
       bool: Whether a new real word was added
       """
       new_real_word_added = False
       if w in self.dictionary:
           # If word exists, increase frequency
           self.dictionary[w] = (self.dictionary[w][0], self.dictionary[w][1] + 1)
       else:
           # Add new word
           self.dictionary[w] = ([], 1)
           self.longest_word_length = max(self.longest_word_length, len(w))
           
       if self.dictionary[w][1] == 1:
           # If this is the first occurrence of the word in the corpus
           new_real_word_added = True
           
       deletes = self.get_deletes_list(w)
       for item in deletes:
           if item in self.dictionary:
               # Add original word to the deletion's entry
               self.dictionary[item][0].append(w)
           else:
               # Add new deletion form
               self.dictionary[item] = ([w], 0)
               
       return new_real_word_added
       
# Note: in the original kernel, the following functions (create_dictionary_from_arr,
# create_dictionary, get_suggestions, best_word) are methods of the SymSpell class;
# they are reproduced here without the class-level indentation.
def create_dictionary_from_arr(self, arr, token_pattern=r'[a-z]+'):
   """
   Creates a word dictionary from an array.
   Parameters:
   arr: Array containing words
   token_pattern: Regular expression pattern for extracting words
   Returns:
   dictionary: Generated word dictionary
   """
   total_word_count = 0  # Total words processed
   unique_word_count = 0  # Number of unique words
   
   for line in arr:
       # Split words by non-alphabetic characters
       words = re.findall(token_pattern, line.lower())
       for word in words:
           total_word_count += 1
           if self.create_dictionary_entry(word):
               unique_word_count += 1
   
   # Print processing results
   print("total words processed: %i" % total_word_count)
   print("total unique words in corpus: %i" % unique_word_count)
   print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))
   print(" edit distance for deletions: %i" % self.max_edit_distance)
   print(" length of longest word in corpus: %i" % self.longest_word_length)
   
   return self.dictionary

def create_dictionary(self, fname):
   """
   Creates a word dictionary from a file.
   
   Parameters:
   fname: Path to the file to read
   
   Returns:
   dictionary: Generated word dictionary
   
   How it works:
   1. Reads the file line by line.
   2. Extracts words containing only alphabetic characters from each line.
   3. Converts each word to lowercase and adds it to the dictionary.
   4. Prints processing results.
   """
   total_word_count = 0      # Total number of words processed
   unique_word_count = 0     # Number of unique words
   with open(fname) as file:  # Open file with context manager
       for line in file:
           # Split words by non-alphabetic characters
           # [a-z]+ pattern finds one or more consecutive lowercase letters
           words = re.findall('[a-z]+', line.lower())
           
           for word in words:
               total_word_count += 1  # Increase total word count
               # If create_dictionary_entry returns True (new word added)
               # Increase unique word count
               if self.create_dictionary_entry(word):
                   unique_word_count += 1
                   
   # Print processing results
   print("total words processed: %i" % total_word_count)           # Total words processed
   print("total unique words in corpus: %i" % unique_word_count)   # Unique words
   print("total items in dictionary (corpus words and deletions): %i" % len(self.dictionary))  # Dictionary size
   print("  edit distance for deletions: %i" % self.max_edit_distance)  # Maximum edit distance
   print("  length of longest word in corpus: %i" % self.longest_word_length)  # Length of longest word
   
   return self.dictionary  # Return generated dictionary

def get_suggestions(self, string, silent=False):
        """return list of suggested corrections for potentially incorrectly
           spelled word"""
        if (len(string) - self.longest_word_length) > self.max_edit_distance:
            if not silent:
                print("no items in dictionary within maximum edit distance")
            return []

        suggest_dict = {}
        min_suggest_len = float('inf')

        queue = [string]
        q_dictionary = {}  # items other than string that we've checked

        while len(queue) > 0:
            q_item = queue[0]  # pop
            queue = queue[1:]

            # early exit
            if ((self.verbose < 2) and (len(suggest_dict) > 0) and
                    ((len(string) - len(q_item)) > min_suggest_len)):
                break

            # process queue item
            if (q_item in self.dictionary) and (q_item not in suggest_dict):
                if self.dictionary[q_item][1] > 0:
                    # word is in dictionary, and is a word from the corpus, and
                    # not already in suggestion list so add to suggestion
                    # dictionary, indexed by the word with value (frequency in
                    # corpus, edit distance)
                    # note q_items that are not the input string are shorter
                    # than input string since only deletes are added (unless
                    # manual dictionary corrections are added)
                    assert len(string) >= len(q_item)
                    suggest_dict[q_item] = (self.dictionary[q_item][1],
                                            len(string) - len(q_item))
                    # early exit
                    if (self.verbose < 2) and (len(string) == len(q_item)):
                        break
                    elif (len(string) - len(q_item)) < min_suggest_len:
                        min_suggest_len = len(string) - len(q_item)

                # the suggested corrections for q_item as stored in
                # dictionary (whether or not q_item itself is a valid word
                # or merely a delete) can be valid corrections
                for sc_item in self.dictionary[q_item][0]:
                    if sc_item not in suggest_dict:

                        # compute edit distance
                        # suggested items should always be longer
                        # (unless manual corrections are added)
                        assert len(sc_item) > len(q_item)

                        # q_items that are not input should be shorter
                        # than original string
                        # (unless manual corrections added)
                        assert len(q_item) <= len(string)

                        if len(q_item) == len(string):
                            assert q_item == string
                            item_dist = len(sc_item) - len(q_item)

                        # item in suggestions list should not be the same as
                        # the string itself
                        assert sc_item != string

                        # calculate edit distance using, for example,
                        # Damerau-Levenshtein distance
                        item_dist = dameraulevenshtein(sc_item, string)

                        # do not add words with greater edit distance if
                        # verbose setting not on
                        if (self.verbose < 2) and (item_dist > min_suggest_len):
                            pass
                        elif item_dist <= self.max_edit_distance:
                            assert sc_item in self.dictionary  # should already be in dictionary if in suggestion list
                            suggest_dict[sc_item] = (self.dictionary[sc_item][1], item_dist)
                            if item_dist < min_suggest_len:
                                min_suggest_len = item_dist

                        # depending on order words are processed, some words
                        # with different edit distances may be entered into
                        # suggestions; trim suggestion dictionary if verbose
                        # setting not on
                        if self.verbose < 2:
                            suggest_dict = {k: v for k, v in suggest_dict.items() if v[1] <= min_suggest_len}

            # now generate deletes (e.g. a substring of string or of a delete)
            # from the queue item
            # as additional items to check -- add to end of queue
            assert len(string) >= len(q_item)

            # do not add words with greater edit distance if verbose setting
            # is not on
            if (self.verbose < 2) and ((len(string) - len(q_item)) > min_suggest_len):
                pass
            elif (len(string) - len(q_item)) < self.max_edit_distance and len(q_item) > 1:
                for c in range(len(q_item)):  # character index
                    word_minus_c = q_item[:c] + q_item[c + 1:]
                    if word_minus_c not in q_dictionary:
                        queue.append(word_minus_c)
                        q_dictionary[word_minus_c] = None  # arbitrary value, just to identify we checked this

        # queue is now empty: convert suggestions in dictionary to
        # list for output
        if not silent and self.verbose != 0:
            print("number of possible corrections: %i" % len(suggest_dict))
            print("  edit distance for deletions: %i" % self.max_edit_distance)

        # output option 1
        # sort results by ascending order of edit distance and descending
        # order of frequency
        #     and return list of suggested word corrections only:
        # return sorted(suggest_dict, key = lambda x:
        #               (suggest_dict[x][1], -suggest_dict[x][0]))

        # output option 2
        # return list of suggestions with (correction,
        #                                  (frequency in corpus, edit distance)):
        as_list = suggest_dict.items()
        # outlist = sorted(as_list, key=lambda (term, (freq, dist)): (dist, -freq))
        outlist = sorted(as_list, key=lambda x: (x[1][1], -x[1][0]))

        if self.verbose == 0:
            return outlist[0]
        else:
            return outlist

        '''
        Option 1:
        ['file', 'five', 'fire', 'fine', ...]

        Option 2:
        [('file', (5, 0)),
         ('five', (67, 1)),
         ('fire', (54, 1)),
         ('fine', (17, 1))...]  
        '''

def best_word(self, s, silent=False):
   """
   Returns the best correction for a given word.
   Parameters:
   s: Word to check
   silent: If True, don't print progress
   Returns:
   tuple or None: (corrected word, (frequency, edit distance)) or None if failed
   """
   try:
       return self.get_suggestions(s, silent)[0]
   except:
       return None
 
class ItemSelector(BaseEstimator, TransformerMixin):
   """
   A transformer for selecting specific fields from a pandas DataFrame and converting them to appropriate format.
   This is a custom transformer for use in scikit-learn Pipelines.
   """
   def __init__(self, field, start_time=time()):
       self.field = field  # Column name to select from DataFrame
       self.start_time = start_time  # Start time for processing time measurement

   def fit(self, x, y=None):
       return self

   def transform(self, dataframe):
       """
       Selects and transforms specific fields from the DataFrame.
       - Categorical data is converted to codes
       - Numeric data is kept as is 
       - Other data is treated as text
       """
       print(f'[{time()-self.start_time}] select {self.field}')
       dt = dataframe[self.field].dtype
       if is_categorical_dtype(dt):
           return dataframe[self.field].cat.codes[:, None]
       elif is_numeric_dtype(dt):
           return dataframe[self.field][:, None]
       else:
           return dataframe[self.field]

class DropColumnsByDf(BaseEstimator, TransformerMixin):
   """
   A transformer that filters features (columns) based on document frequency
   """
   def __init__(self, min_df=1, max_df=1.0):
       """
       Parameters:
       min_df: Minimum document frequency (features below this are removed)
       max_df: Maximum document frequency ratio (features above this are removed)
       """
       self.min_df = min_df
       self.max_df = max_df

   def fit(self, X, y=None):
       """
       Calculates document frequency for given data and determines which columns to filter.
       """
       # Convert to CSC (Compressed Sparse Column) format
       m = X.tocsc()
   
       # Process minimum document frequency (min_df) condition
       # (m != 0).sum(axis=0): Calculate number of non-zero values in each column 
   	   # >= self.min_df: Check if it's greater than minimum document frequency
       # .A1: Flatten array to 1 dimension
       self.nnz_cols = ((m != 0).sum(axis=0) >= self.min_df).A1
   
       # Process maximum document frequency (max_df) condition
       if self.max_df < 1.0:
           # Calculate maximum allowed number of documents
           max_df = m.shape[0] * self.max_df
           # AND operation with maximum document frequency condition
           self.nnz_cols = self.nnz_cols & ((m != 0).sum(axis=0) <= max_df).A1
       
   	   return self

   def transform(self, X, y=None):
       """
       Selects features according to the determined filtering criteria.
       """
       m = X.tocsc()
       # Select columns according to conditions determined in fit (self.nnz_cols)
       return m[:, self.nnz_cols]
 
def get_rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))


def split_cat(text):
    try:
        cats = text.split("/")
        return cats[0], cats[1], cats[2], cats[0] + '/' + cats[1]
    except:
        print("no category")
        return 'other', 'other', 'other', 'other/other'

# Function to fill in missing brand names
# Uses the SymSpell algorithm to find and fill brand names from product names and descriptions
# Processes single-word and multi-word brand names separately
def brands_filling(dataset):
    vc = dataset['brand_name'].value_counts()
    brands = vc[vc > 0].index
    brand_word = r"[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+"

    many_w_brands = brands[brands.str.contains(' ')]
    one_w_brands = brands[~brands.str.contains(' ')]

    ss2 = SymSpell(max_edit_distance=0)
    ss2.create_dictionary_from_arr(many_w_brands, token_pattern=r'.+')

    ss1 = SymSpell(max_edit_distance=0)
    ss1.create_dictionary_from_arr(one_w_brands, token_pattern=r'.+')

    two_words_re = re.compile(r"(?=(\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+))")

    def find_in_str_ss2(row):
        for doc_word in two_words_re.finditer(row):
            print(doc_word)
            suggestion = ss2.best_word(doc_word.group(1), silent=True)
            if suggestion is not None:
                return doc_word.group(1)
        return ''

    def find_in_list_ss1(list):
        for doc_word in list:
            suggestion = ss1.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    def find_in_list_ss2(list):
        for doc_word in list:
            suggestion = ss2.best_word(doc_word, silent=True)
            if suggestion is not None:
                return doc_word
        return ''

    print(f"Before empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_name]

    n_desc = dataset[dataset['brand_name'] == '']['item_description'].str.findall(
        pat=r"^[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+\s[a-z0-9*/+\-'’?!.,|&%®™ôèéü]+")
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss2(row) for row in n_desc]

    n_name = dataset[dataset['brand_name'] == '']['name'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in n_name]

    desc_lower = dataset[dataset['brand_name'] == '']['item_description'].str.findall(pat=brand_word)
    dataset.loc[dataset['brand_name'] == '', 'brand_name'] = [find_in_list_ss1(row) for row in desc_lower]

    print(f"After empty brand_name: {len(dataset[dataset['brand_name'] == ''].index)}")

    del ss1, ss2
    gc.collect()


def preprocess_regex(dataset, start_time=time()):
    karats_regex = r'(\d)([\s-]?)(karat|karats|carat|carats|kt)([^\w])'
    karats_repl = r'\1k\4'

    unit_regex = r'(\d+)[\s-]([a-z]{2})(\s)'
    unit_repl = r'\1\2\3'

    dataset['name'] = dataset['name'].str.replace(karats_regex, karats_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(karats_regex, karats_repl)
    print(f'[{time() - start_time}] Karats normalized.')

    dataset['name'] = dataset['name'].str.replace(unit_regex, unit_repl)
    dataset['item_description'] = dataset['item_description'].str.replace(unit_regex, unit_repl)
    print(f'[{time() - start_time}] Units glued.')


def preprocess_pandas(train, test, start_time=time()):
    train = train[train.price > 0.0].reset_index(drop=True)
    print('Train shape without zero price: ', train.shape)

    nrow_train = train.shape[0]
    y_train = np.log1p(train["price"])
    merge: pd.DataFrame = pd.concat([train, test])

    del train
    del test
    gc.collect()

    merge['has_category'] = (merge['category_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_category filled.')

    merge['category_name'] = merge['category_name'] \
        .fillna('other/other/other') \
        .str.lower() \
        .astype(str)
    merge['general_cat'], merge['subcat_1'], merge['subcat_2'], merge['gen_subcat1'] = \
        zip(*merge['category_name'].apply(lambda x: split_cat(x)))
    print(f'[{time() - start_time}] Split categories completed.')

    merge['has_brand'] = (merge['brand_name'].notnull()).astype('category')
    print(f'[{time() - start_time}] Has_brand filled.')

    merge['gencat_cond'] = merge['general_cat'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_1_cond'] = merge['subcat_1'].map(str) + '_' + merge['item_condition_id'].astype(str)
    merge['subcat_2_cond'] = merge['subcat_2'].map(str) + '_' + merge['item_condition_id'].astype(str)
    print(f'[{time() - start_time}] Categories and item_condition_id concatenated.')

    merge['name'] = merge['name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['brand_name'] = merge['brand_name'] \
        .fillna('') \
        .str.lower() \
        .astype(str)
    merge['item_description'] = merge['item_description'] \
        .fillna('') \
        .str.lower() \
        .replace(to_replace='No description yet', value='')
    print(f'[{time() - start_time}] Missing filled.')

    preprocess_regex(merge, start_time)

    brands_filling(merge)
    print(f'[{time() - start_time}] Brand name filled.')

    merge['name'] = merge['name'] + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Name concatenated.')

    merge['item_description'] = merge['item_description'] \
                                + ' ' + merge['name'] \
                                + ' ' + merge['subcat_1'] \
                                + ' ' + merge['subcat_2'] \
                                + ' ' + merge['general_cat'] \
                                + ' ' + merge['brand_name']
    print(f'[{time() - start_time}] Item description concatenated.')

    merge.drop(['price', 'test_id', 'train_id'], axis=1, inplace=True)

    return merge, y_train, nrow_train


def intersect_drop_columns(train: csr_matrix, valid: csr_matrix, min_df=0):
    t = train.tocsc()
    v = valid.tocsc()
    nnz_train = ((t != 0).sum(axis=0) >= min_df).A1
    nnz_valid = ((v != 0).sum(axis=0) >= min_df).A1
    nnz_cols = nnz_train & nnz_valid
    res = t[:, nnz_cols], v[:, nnz_cols]
    return res


if __name__ == '__main__':
    mp.set_start_method('forkserver', True)

    start_time = time()

    train = pd.read_table(os.path.join(INPUT_PATH, 'train.tsv'),
                          engine='c',
                          dtype={'item_condition_id': 'category',
                                 'shipping': 'category'}
                          )
    test = pd.read_table(os.path.join(INPUT_PATH, 'test.tsv'),
                         engine='c',
                         dtype={'item_condition_id': 'category',
                                'shipping': 'category'}
                         )
    print(f'[{time() - start_time}] Finished to load data')
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    submission: pd.DataFrame = test[['test_id']]

    merge, y_train, nrow_train = preprocess_pandas(train, test, start_time)

    meta_params = {'name_ngram': (1, 2),
                   'name_max_f': 75000,
                   'name_min_df': 10,

                   'category_ngram': (2, 3),
                   'category_token': '.+',
                   'category_min_df': 10,

                   'brand_min_df': 10,

                   'desc_ngram': (1, 3),
                   'desc_max_f': 150000,
                   'desc_max_df': 0.5,
                   'desc_min_df': 10}

    stopwords = frozenset(['the', 'a', 'an', 'is', 'it', 'this', ])
    # 'i', 'so', 'its', 'am', 'are'])

    vectorizer = FeatureUnion([
        ('name', Pipeline([
            ('select', ItemSelector('name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 2),
                n_features=2 ** 27,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('category_name', Pipeline([
            ('select', ItemSelector('category_name', start_time=start_time)),
            ('transform', HashingVectorizer(
                ngram_range=(1, 1),
                token_pattern='.+',
                tokenizer=split_cat,
                n_features=2 ** 27,
                norm='l2',
                lowercase=False
            )),
            ('drop_cols', DropColumnsByDf(min_df=2))
        ])),
        ('brand_name', Pipeline([
            ('select', ItemSelector('brand_name', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('gencat_cond', Pipeline([
            ('select', ItemSelector('gencat_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_1_cond', Pipeline([
            ('select', ItemSelector('subcat_1_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('subcat_2_cond', Pipeline([
            ('select', ItemSelector('subcat_2_cond', start_time=start_time)),
            ('transform', CountVectorizer(
                token_pattern='.+',
                min_df=2,
                lowercase=False
            )),
        ])),
        ('has_brand', Pipeline([
            ('select', ItemSelector('has_brand', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('shipping', Pipeline([
            ('select', ItemSelector('shipping', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_condition_id', Pipeline([
            ('select', ItemSelector('item_condition_id', start_time=start_time)),
            ('ohe', OneHotEncoder())
        ])),
        ('item_description', Pipeline([
            ('select', ItemSelector('item_description', start_time=start_time)),
            ('hash', HashingVectorizer(
                ngram_range=(1, 3),
                n_features=2 ** 27,
                dtype=np.float32,
                norm='l2',
                lowercase=False,
                stop_words=stopwords
            )),
            ('drop_cols', DropColumnsByDf(min_df=2)),
        ]))
    ], n_jobs=1)

    sparse_merge = vectorizer.fit_transform(merge)
    print(f'[{time() - start_time}] Merge vectorized')
    print(sparse_merge.shape)

    tfidf_transformer = TfidfTransformer()

    X = tfidf_transformer.fit_transform(sparse_merge)
    print(f'[{time() - start_time}] TF/IDF completed')

    X_train = X[:nrow_train]
    print(X_train.shape)

    X_test = X[nrow_train:]
    del merge
    del sparse_merge
    del vectorizer
    del tfidf_transformer
    gc.collect()

    X_train, X_test = intersect_drop_columns(X_train, X_test, min_df=1)
    print(f'[{time() - start_time}] Drop only in train or test cols: {X_train.shape[1]}')
    gc.collect()

    ridge = Ridge(solver='auto', fit_intercept=True, alpha=0.4, max_iter=200, normalize=False, tol=0.01)
    ridge.fit(X_train, y_train)
    print(f'[{time() - start_time}] Train Ridge completed. Iterations: {ridge.n_iter_}')

    predsR = ridge.predict(X_test)
    print(f'[{time() - start_time}] Predict Ridge completed.')

    submission.loc[:, 'price'] = np.expm1(predsR)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("submission_ridge.csv", index=False)

3. Damerau-Levenshtein distance

  • Basic Concept:
    • Calculates the minimum number of edits needed to transform one string into another
    • It's an extension of the Levenshtein distance that adds the operation of "transposing adjacent characters"
  • Allowed Edit Operations:
    • Character insertion
    • Example: "cat" → "cart" (insert r)
    • Character deletion
    • Example: "cart" → "cat" (delete r)
    • Character substitution
    • Example: "cat" → "cut" (substitute a with u)
    • Adjacent character transposition
    • Example: "cloud" → "could" (swap u and l)
  • Use Cases:
    • Spell checking
    • Similar word search
    • Measuring similarity between brand names or product names
    • Text matching in natural language processing
  • Example:
# Converting "kitten" to "sitting":
# 1. k → s (substitution)
# 2. e → i (substitution)
# 3. n → ng (insertion)
# Total distance: 3
  • In this code, it's used to find missing brand names by calculating similarity with existing brand names.
  • For example, it can help correct a misspelled brand name like "Nkie" to "Nike". 
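
For instance, calling the dameraulevenshtein function defined above on that kind of example:

# 'nkie' -> 'nike' needs one adjacent transposition (k and i), so the distance is 1
print(dameraulevenshtein('nkie', 'nike'))      # 1
# 'kitten' -> 'sitting' needs two substitutions and one insertion, so the distance is 3
print(dameraulevenshtein('kitten', 'sitting')) # 3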

4. Getting suggestion code detail

  • get_suggestions(self, string, silent=False) function:
    • Purpose: Generates a list of correction suggestions for potentially misspelled words
  • Main features:
    • Finds similar words to the input word in the dictionary
    • Calculates edit distance (character deletion/addition/change)
    • Finds and sorts all possible correction words
    • Provides (frequency, edit distance) information for each suggested word
  • Return values:
    • When verbose=0: Returns only the best suggestion
    • When verbose>0: Returns all possible suggestions in the form (word, (frequency, edit distance))
  •  best_word(self, s, silent=False) function:
    • Purpose: Finds the optimal correction word for a given word
  • Main features:
    • Calls get_suggestions function to get suggestion list
    • Selects the most appropriate word (smallest edit distance and highest frequency) from the suggestion list
  • Return values:
    • On success: (corrected word, (frequency, edit distance))
    • On failure: None
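
A small usage sketch of these two functions, assuming they are attached to the SymSpell class as in the original kernel (the dictionary words below are illustrative, and verbose=1 is chosen so that all minimal-distance suggestions are returned):

# Build a tiny brand dictionary and ask for corrections (illustrative example)
ss = SymSpell(max_edit_distance=2, verbose=1)
for brand in ['nike', 'apple', 'lularoe', 'sephora']:
    ss.create_dictionary_entry(brand)

print(ss.get_suggestions('nkie'))   # e.g. [('nike', (1, 1))] -> (frequency, edit distance)
print(ss.best_word('sephoraa'))     # e.g. ('sephora', (1, 1))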

5. More about Ridge Regression

  • Basic Concept:
    • It's an advanced form of linear regression
    • Created to prevent overfitting
    • A regression model that uses L2 regularization
  • How it works:
    • Basic linear regression equation: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
    • Ridge regression adds a penalty term
    • Penalty term: α(w₁² + w₂² + ... + wₙ²)
    • α is a hyperparameter that controls regularization strength
    • Applies penalty to the sum of squared weights
  • Advantages:
    • Can solve multicollinearity problems
    • Example: Handles strongly correlated features like 'height' and 'weight' well
    • Prevents overfitting and improves model generalization
    • Maintains all features while adjusting their influence
  • Difference from Regular Linear Regression:
# Cost function for regular linear regression
Cost = Σ(y - ŷ)²
# Cost function for Ridge regression
Cost = Σ(y - ŷ)² + α(w₁² + w₂² + ... + wₙ²)
  • When to use:
    • Data with many features
    • When features have strong correlations
    • When overfitting is a concern
  • For example, in the given code, Ridge regression is used because text vectorization creates many features that might be correlated:
ridge = Ridge(
    solver='auto',           # Automatically choose best solver
    fit_intercept=True,      # Use intercept term
    alpha=0.4,              # Regularization strength (α value)
    max_iter=200,           # Maximum iterations
    normalize=False,         # Whether to normalize (this parameter has been removed in newer scikit-learn versions)
    tol=0.01                # Convergence tolerance
)
  • This configured Ridge regression model helps predict prices while appropriately adjusting the influence of many features. 
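
For intuition on the penalty, here is a small self-contained sketch (synthetic data and made-up numbers, not part of the kernel) showing how Ridge shrinks the unstable coefficients that plain least squares can produce on nearly collinear features:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)      # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.randn(200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # tends to be unstable when features are nearly collinear
print("Ridge coefficients:", ridge.coef_)  # shrunk and spread more evenly across x1 and x2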

Fourth Kernel: LGB and FM [18th Place - 0.40604]

  • 18th place solution using LightGBM and FM_FTRL.
  • Final blend: 0.33 * FM_FTRL + 0.67 * LGB

Insight / Summary:

1. Overall Summary

  • WordBatch is a Python package designed for fast processing of large-scale text data.
  • FM_FTRL is a model that combines Factorization Machines with the Follow-The-Regularized-Leader algorithm.
    • Looking at each component:
      • FM (Factorization Machines)
        • A method for modeling interactions between features in high-dimensional sparse data
        • Particularly effective in tasks like recommendation systems and click-through rate (CTR) prediction
        • Represents potential interactions between features as low-dimensional vectors (see the scoring sketch after this summary)
      • FTRL (Follow-The-Regularized-Leader)
        • A type of online learning algorithm
        • Effectively handles L1 and L2 regularization
        • Particularly effective in learning sparse models
          • A sparse model refers to a model where many parameters (weights) are zero
        • Memory efficient and capable of incremental learning
    • Main advantages of this combination:
      • Suitable for large-scale sparse data processing
        • Data is very sparse due to numerous text and categorical variables
        • FM effectively learns feature interactions in sparse data
      • Automatically learns feature interactions
        • FM automatically learns second-order feature interactions
        • Example: Captures the impact of brand and category combinations on price
      • Enables online learning for real-time updates
        • FTRL is an online learning algorithm that is memory efficient
        • Can effectively train on large-scale datasets
      • High prediction performance
    • Commonly used in tasks such as advertising click-through rate prediction, recommendation systems, and tasks requiring real-time prediction.
  • Feature Engineering
    • Manual extraction of various statistical features (lengths, counts, ratios, etc.) from the text data
    • Text vectorization using WordBatch and TF-IDF
    • Label encoding for categorical variables
  • Model Ensemble
    • FM_FTRL: Factorization Machine effective for sparse data
    • Ridge Regression: Basic regression for text features
    • LightGBM: High-performance gradient boosting model
  • Ridge model?
    • The Ridge model is used to perform basic regression analysis on text data (product names and descriptions)
    • Basic linear regression model with L2 regularization applied.
  • Memory Optimization
    • Use of sparse matrices
    • Periodic garbage collection
    • Removal of unnecessary variables
  • Final Prediction
    • Weighted average of FM_FTRL and LightGBM predictions
    • Save results after reversing log transformation
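
The second-order FM part of FM_FTRL scores an example as a bias plus linear terms plus pairwise interactions expressed through latent factors. Below is a minimal NumPy sketch of that scoring rule; the names w0, w, V, and fm_predict are illustrative and do not correspond to wordbatch's actual FM_FTRL API.

import numpy as np

def fm_predict(x, w0, w, V):
    """Order-2 FM score: w0 + sum_i w_i*x_i + sum_{i<j} <V_i, V_j>*x_i*x_j.
    x: feature vector (n,), w0: bias, w: linear weights (n,), V: latent factors (n, k)."""
    linear = w0 + w @ x
    Vx = V.T @ x                               # shape (k,)
    V2x2 = (V ** 2).T @ (x ** 2)               # shape (k,)
    pairwise = 0.5 * np.sum(Vx ** 2 - V2x2)    # O(n*k) form of the pairwise interaction sum
    return linear + pairwise

rng = np.random.RandomState(0)
n, k = 6, 3                                    # 6 features, 3 latent dimensions
x = rng.randint(0, 2, size=n).astype(float)    # sparse-ish binary features
w0, w, V = 0.1, 0.1 * rng.randn(n), 0.1 * rng.randn(n, k)
print(fm_predict(x, w0, w, V))

Roughly speaking, the FTRL side then updates these parameters online with per-coordinate learning rates and L1/L2 regularization, which is what keeps training memory-efficient on the large sparse matrix built later in the code.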

2. Code Details

# Record start time for execution time measurement
import time
start_time = time.time()

# Set submission mode (True: train on full data, False: use validation split)
SUBMIT_MODE = True

# Import required libraries
import pandas as pd
import numpy as np
import time
import gc  # For garbage collection
import string
import re

# Use NLTK stopwords
from nltk.corpus import stopwords

# Import scipy for sparse matrix handling
from scipy.sparse import csr_matrix, hstack
# Import sklearn for text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Import sklearn for feature selection
from sklearn.feature_selection import SelectKBest, f_regression
# Import sklearn for label encoding
from sklearn.preprocessing import LabelBinarizer

# Import WordBatch related modules (for text processing)
import wordbatch
from wordbatch.extractors import WordBag
from wordbatch.models import FM_FTRL

# Import sklearn model related modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb

# Define RMSE calculation function
def rmse(predicted, actual):
    """
    Calculate Root Mean Squared Error between predicted and actual values
    Args:
        predicted: array of predicted values
        actual: array of actual values
    Returns:
        RMSE value
    """
    return np.sqrt(((predicted - actual) ** 2).mean())

# Category splitting function
def split_cat(text):
    """
    Split category string into subcategories
    Args:
        text: Category string with '/' delimiter
    Returns:
        Tuple of 3 subcategories (returns 'No Label' if missing)
    """
    try:
        return text.split("/")
    except:
        return ("No Label", "No Label", "No Label")

# Define Target Encoding class
class TargetEncoder:
    """
    Class for performing target encoding on categorical variables
    Numerically encodes categories based on mean target values
    """
    def __repr__(self):
        return 'TargetEncoder'

    def __init__(self, cols, smoothing=1, min_samples_leaf=1, noise_level=0, keep_original=False):
        """
        Args:
            cols: List of columns to encode
            smoothing: Smoothing parameter
            min_samples_leaf: Minimum number of samples
            noise_level: Level of noise to add
            keep_original: Whether to keep original columns
        """
        self.cols = cols
        self.smoothing = smoothing
        self.min_samples_leaf = min_samples_leaf
        self.noise_level = noise_level
        self.keep_original = keep_original

    @staticmethod
    def add_noise(series, noise_level):
        """
        Add noise to prevent overfitting
        """
        return series * (1 + noise_level * np.random.randn(len(series)))

    def encode(self, train, test, target):
        """
        Perform target encoding on categorical columns in given dataframe
        """
        for col in self.cols:
            if self.keep_original:
                train[col + '_te'], test[col + '_te'] = self.encode_column(train[col], test[col], target)
            else:
                train[col], test[col] = self.encode_column(train[col], test[col], target)
        return train, test

    def encode_column(self, trn_series, tst_series, target):
        """
        Perform target encoding on a single column
        """
        temp = pd.concat([trn_series, target], axis=1)
        # Calculate target means
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Calculate smoothing
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Calculate overall mean
        prior = target.mean()
        # Calculate smoothed means
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(['mean', 'count'], axis=1, inplace=True)
        
        # Apply encoding to train/test data
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=trn_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_trn_series.index = trn_series.index
        
        ft_tst_series = pd.merge(
            tst_series.to_frame(tst_series.name),
            averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
            on=tst_series.name,
            how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
        ft_tst_series.index = tst_series.index
        
        return self.add_noise(ft_trn_series, self.noise_level), self.add_noise(ft_tst_series, self.noise_level)

# Numeric value processing functions
def to_number(x):
    """
    Convert string to number (limit to 100 if greater than 100)
    """
    try:
        if not x.isdigit():
            return 0
        x = int(x)
        if x > 100:
            return 100
        else:
            return x
    except:
        return 0

def sum_numbers(desc):
    """
    Calculate sum of numbers in description text
    """
    if not isinstance(desc, str):
        return 0
    try:
        return sum([to_number(s) for s in desc.split()])
    except:
        return 0

# Set regex and stopwords for text preprocessing
stopwords = {x: 1 for x in stopwords.words('english')}
non_alphanums = re.compile(u'[^A-Za-z0-9]+')
non_alphanumpunct = re.compile(u'[^A-Za-z0-9\.?!,; \(\)\[\]\'\"\$]+')
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])

def normalize_text(text):
    """
    Text normalization:
    - Convert to lowercase
    - Remove special characters
    - Remove stopwords
    - Remove short words
    """
    return u" ".join(
        [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(" ")] \
         if len(x) > 1 and x not in stopwords])

def clean_name(x):
    """
    Extract first word from name
    """
    if len(x):
        x = non_alphanums.sub(' ', x).split()
        if len(x):
            return x[0].lower()
    return ''

# Load data
print('[{}] Finished defining stuff'.format(time.time() - start_time))

# Load training data
train = pd.read_table('../input/train.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            }, 
                     converters={'category_name': split_cat})
# Load test data
test = pd.read_table('../input/test.tsv', engine='c', 
                      dtype={'item_condition_id': 'category',
                             'shipping': 'category',
                            },
                    converters={'category_name': split_cat})
print('[{}] Finished load data'.format(time.time() - start_time))

# Add flag for train/test data distinction
train['is_train'] = 1
test['is_train'] = 0
print('[{}] Compiled train / test'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Remove data with price 0
train = train[train.price != 0].reset_index(drop=True)
print('[{}] Removed zero price'.format(time.time() - start_time))
print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

# Log transform target variable (price)
y = np.log1p(train['price'])
nrow_train = train.shape[0]

# Merge train/test data
merge = pd.concat([train, test])
submission = test[['test_id']]
print('[{}] Compiled merge'.format(time.time() - start_time))
print('Merge shape: ', merge.shape)

# Remove unnecessary columns and clear memory
del train
del test
merge.drop(['train_id', 'test_id', 'price'], axis=1, inplace=True)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split and process categories
merge['gencat_name'] = merge['category_name'].str.get(0).replace('', 'missing').astype('category')
merge['subcat1_name'] = merge['category_name'].str.get(1).fillna('missing').astype('category')
merge['subcat2_name'] = merge['category_name'].str.get(2).fillna('missing').astype('category')
merge.drop('category_name', axis=1, inplace=True)
print('[{}] Split categories completed.'.format(time.time() - start_time))

# Handle missing values
merge['item_condition_id'] = merge['item_condition_id'].cat.add_categories(['missing']).fillna('missing')
merge['shipping'] = merge['shipping'].cat.add_categories(['missing']).fillna('missing')
merge['item_description'].fillna('missing', inplace=True)
merge['brand_name'] = merge['brand_name'].fillna('missing').astype('category')
print('[{}] Handle missing completed.'.format(time.time() - start_time))

# Start feature engineering
# Name-related features
merge['name_first'] = merge['name'].apply(clean_name)
print('[{}] FE 1/37'.format(time.time() - start_time))
merge['name_first_count'] = merge.groupby('name_first')['name_first'].transform('count')
print('[{}] FE 2/37'.format(time.time() - start_time))

# Category-related features
merge['gencat_name_count'] = merge.groupby('gencat_name')['gencat_name'].transform('count')
print('[{}] FE 3/37'.format(time.time() - start_time))
merge['subcat1_name_count'] = merge.groupby('subcat1_name')['subcat1_name'].transform('count')
print('[{}] FE 4/37'.format(time.time() - start_time))
merge['subcat2_name_count'] = merge.groupby('subcat2_name')['subcat2_name'].transform('count')
print('[{}] FE 5/37'.format(time.time() - start_time))
merge['brand_name_count'] = merge.groupby('brand_name')['brand_name'].transform('count')
print('[{}] FE 6/37'.format(time.time() - start_time))

# Text-related features
merge['NameLower'] = merge.name.str.count('[a-z]')
print('[{}] FE 7/37'.format(time.time() - start_time))
merge['DescriptionLower'] = merge.item_description.str.count('[a-z]')
print('[{}] FE 8/37'.format(time.time() - start_time))
merge['NameUpper'] = merge.name.str.count('[A-Z]')
print('[{}] FE 9/37'.format(time.time() - start_time))
merge['DescriptionUpper'] = merge.item_description.str.count('[A-Z]')
print('[{}] FE 10/37'.format(time.time() - start_time))

# Length-related features
merge['name_len'] = merge['name'].apply(lambda x: len(x))
print('[{}] FE 11/37'.format(time.time() - start_time))
merge['des_len'] = merge['item_description'].apply(lambda x: len(x))
print('[{}] FE 12/37'.format(time.time() - start_time))
merge['name_desc_len_ratio'] = merge['name_len']/merge['des_len']
print('[{}] FE 13/37'.format(time.time() - start_time))

# Word count related features
merge['desc_word_count'] = merge['item_description'].apply(lambda x: len(x.split()))
print('[{}] FE 14/37'.format(time.time() - start_time))
merge['mean_des'] = merge['item_description'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 15/37'.format(time.time() - start_time))
merge['name_word_count'] = merge['name'].apply(lambda x: len(x.split()))
print('[{}] FE 16/37'.format(time.time() - start_time))
merge['mean_name'] = merge['name'].apply(lambda x: 0 if len(x) == 0 else float(len(x.split())) / len(x)) * 10
print('[{}] FE 17/37'.format(time.time() - start_time))

# Characters per word features
merge['desc_letters_per_word'] = merge['des_len'] / merge['desc_word_count']
print('[{}] FE 18/37'.format(time.time() - start_time))
merge['name_letters_per_word'] = merge['name_len'] / merge['name_word_count']
print('[{}] FE 19/37'.format(time.time() - start_time))

# Upper/lowercase ratio features
merge['NameLowerRatio'] = merge['NameLower'] / merge['name_len']
print('[{}] FE 20/37'.format(time.time() - start_time))
merge['DescriptionLowerRatio'] = merge['DescriptionLower'] / merge['des_len']
print('[{}] FE 21/37'.format(time.time() - start_time))
merge['NameUpperRatio'] = merge['NameUpper'] / merge['name_len']
print('[{}] FE 22/37'.format(time.time() - start_time))
merge['DescriptionUpperRatio'] = merge['DescriptionUpper'] / merge['des_len']
print('[{}] FE 23/37'.format(time.time() - start_time))

# Punctuation related features
merge['NamePunctCount'] = merge.name.str.count(RE_PUNCTUATION)
print('[{}] FE 24/37'.format(time.time() - start_time))
merge['DescriptionPunctCount'] = merge.item_description.str.count(RE_PUNCTUATION)
print('[{}] FE 25/37'.format(time.time() - start_time))
merge['NamePunctCountRatio'] = merge['NamePunctCount'] / merge['name_word_count']
print('[{}] FE 26/37'.format(time.time() - start_time))
merge['DescriptionPunctCountRatio'] = merge['DescriptionPunctCount'] / merge['desc_word_count']
print('[{}] FE 27/37'.format(time.time() - start_time))

# Number related features
merge['NameDigitCount'] = merge.name.str.count('[0-9]')
print('[{}] FE 28/37'.format(time.time() - start_time))
merge['DescriptionDigitCount'] = merge.item_description.str.count('[0-9]')
print('[{}] FE 29/37'.format(time.time() - start_time))
merge['NameDigitCountRatio'] = merge['NameDigitCount'] / merge['name_word_count']
print('[{}] FE 30/37'.format(time.time() - start_time))
merge['DescriptionDigitCountRatio'] = merge['DescriptionDigitCount']/merge['desc_word_count']
print('[{}] FE 31/37'.format(time.time() - start_time))

# Stopword and special character related features
merge['stopword_ratio_desc'] = merge['item_description'].apply(lambda x: len([w for w in x.split() if w in stopwords])) / merge['desc_word_count']
print('[{}] FE 32/37'.format(time.time() - start_time))
merge['num_sum'] = merge['item_description'].apply(sum_numbers)  # Sum of numbers in description
print('[{}] FE 33/37'.format(time.time() - start_time))
merge['weird_characters_desc'] = merge['item_description'].str.count(non_alphanumpunct)  # Count of special characters
print('[{}] FE 34/37'.format(time.time() - start_time))
merge['weird_characters_name'] = merge['name'].str.count(non_alphanumpunct)
print('[{}] FE 35/37'.format(time.time() - start_time))

# Price related keyword features
merge['prices_count'] = merge['item_description'].str.count('[rm]')  # Counts 'r'/'m' characters (regex character class), a rough proxy for the '[rm]' price-removed marker
print('[{}] FE 36/37'.format(time.time() - start_time))
merge['price_in_name'] = merge['item_description'].str.contains('[rm]', regex=False).astype('int')  # Whether the literal '[rm]' price-removed marker appears
print('[{}] FE 37/37'.format(time.time() - start_time))

# Feature normalization
cols = set(merge.columns.values)
basic_cols = {'name', 'item_condition_id', 'brand_name',
 'shipping', 'item_description', 'gencat_name',
 'subcat1_name', 'subcat2_name', 'name_first', 'is_train'}

# Separate columns to normalize and keep 
cols_to_normalize = cols - basic_cols - {'price_in_name'}
other_cols = basic_cols | {'price_in_name'}

# Perform Min-Max normalization
merge_to_normalize = merge[list(cols_to_normalize)]
merge_to_normalize = (merge_to_normalize - merge_to_normalize.mean()) / (merge_to_normalize.max() - merge_to_normalize.min())
print('[{}] FE Normalized'.format(time.time() - start_time))

# Merge normalized features and basic features
merge = merge[list(other_cols)]
merge = pd.concat([merge, merge_to_normalize], axis=1)
print('[{}] FE Merged'.format(time.time() - start_time))

# Memory cleanup
del(merge_to_normalize)
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# Split train/test data
df_test = merge.loc[merge['is_train'] == 0]
df_train = merge.loc[merge['is_train'] == 1]
del merge
gc.collect()
df_test = df_test.drop(['is_train'], axis=1)
df_train = df_train.drop(['is_train'], axis=1)

# Split validation data (if not in submit mode)
if SUBMIT_MODE:
   y_train = y
   del y
   gc.collect()
else:
   df_train, df_test, y_train, y_test = train_test_split(df_train, y, test_size=0.2, random_state=144)

print('[{}] Splitting completed.'.format(time.time() - start_time))

# Process name text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   "hash_ngrams": 2,  # Use up to 2-grams
   "hash_ngrams_weights": [1.5, 1.0],  # Weights for unigram and bigram
   "hash_size": 2 ** 29,  # Hash size
   "norm": None,  # No normalization
   "tf": 'binary',  # Use binary TF
   "idf": None,  # Don't use IDF
}), procs=8)
wb.dictionary_freeze = True
X_name_train = wb.fit_transform(df_train['name'])
X_name_test = wb.transform(df_test['name'])
del(wb)

# Remove low frequency features
mask = np.where(X_name_train.getnnz(axis=0) > 3)[0]
X_name_train = X_name_train[:, mask]
X_name_test = X_name_test[:, mask]
print('[{}] Vectorize `name` completed.'.format(time.time() - start_time))

# Process item description text using WordBatch
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, {
   "hash_ngrams": 2,
   "hash_ngrams_weights": [1.0, 1.0],
   "hash_size": 2 ** 28,
   "norm": "l2",  # Use L2 normalization
   "tf": 1.0,  # Use actual frequency
   "idf": None
}), procs=8)
wb.dictionary_freeze = True
X_description_train = wb.fit_transform(df_train['item_description'])
X_description_test = wb.transform(df_test['item_description'])
del(wb)

# Remove low frequency features
mask = np.where(X_description_train.getnnz(axis=0) > 3)[0]
X_description_train = X_description_train[:, mask]
X_description_test = X_description_test[:, mask]
print('[{}] Vectorize `item_description` completed.'.format(time.time() - start_time))

# Ridge regression modeling for description text
# Split data in half for cross validation
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_description_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

# First Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train desc ridge (1)'.format(time.time() - start_time))
desc_ridge_preds1 = model.predict(X_train_2)
desc_ridge_preds1f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (1)'.format(time.time() - start_time))

# Second Ridge model
model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train desc ridge (2)'.format(time.time() - start_time))
desc_ridge_preds2 = model.predict(X_train_1)
desc_ridge_preds2f = model.predict(X_description_test)
print('[{}] Finished to predict desc ridge (2)'.format(time.time() - start_time))

# Combine Ridge predictions
desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(desc_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(desc_ridge_preds_test, y_test)))

# Ridge regression modeling for name text (same process as above)
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_name_train, y_train,
                                                             test_size=0.5,
                                                             shuffle=False)
print('[{}] Finished splitting'.format(time.time() - start_time))

model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_1, y_train_1)
print('[{}] Finished to train name ridge (1)'.format(time.time() - start_time))
name_ridge_preds1 = model.predict(X_train_2)
name_ridge_preds1f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (1)'.format(time.time() - start_time))

model = Ridge(solver="sag", fit_intercept=True, random_state=205, alpha=3.3)
model.fit(X_train_2, y_train_2)
print('[{}] Finished to train name ridge (2)'.format(time.time() - start_time))
name_ridge_preds2 = model.predict(X_train_1)
name_ridge_preds2f = model.predict(X_name_test)
print('[{}] Finished to predict name ridge (2)'.format(time.time() - start_time))

name_ridge_preds_oof = np.concatenate((name_ridge_preds2, name_ridge_preds1), axis=0)
name_ridge_preds_test = (name_ridge_preds1f + name_ridge_preds2f) / 2.0
print('RMSLE OOF: {}'.format(rmse(name_ridge_preds_oof, y_train)))
if not SUBMIT_MODE:
   print('RMSLE TEST: {}'.format(rmse(name_ridge_preds_test, y_test)))

# Memory cleanup
del X_train_1
del X_train_2
del y_train_1
del y_train_2
del name_ridge_preds1
del name_ridge_preds1f
del name_ridge_preds2
del name_ridge_preds2f
del desc_ridge_preds1
del desc_ridge_preds1f
del desc_ridge_preds2
del desc_ridge_preds2f
gc.collect()
print('[{}] Finished garbage collection'.format(time.time() - start_time))

# Process categorical variables
# Label encode brand names
lb = LabelBinarizer(sparse_output=True)
X_brand_train = lb.fit_transform(df_train['brand_name'])
X_brand_test = lb.transform(df_test['brand_name'])
print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

# Label encode categories
X_cat_train = lb.fit_transform(df_train['gencat_name'])
X_cat_test = lb.transform(df_test['gencat_name'])
X_cat1_train = lb.fit_transform(df_train['subcat1_name'])
X_cat1_test = lb.transform(df_test['subcat1_name'])
X_cat2_train = lb.fit_transform(df_train['subcat2_name'])
X_cat2_test = lb.transform(df_test['subcat2_name'])
print('[{}] Finished label binarize categories'.format(time.time() - start_time))

# Create dummy variables for numeric features
X_dummies_train = csr_matrix(
   pd.get_dummies(df_train[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - train'.format(time.time() - start_time))

X_dummies_test = csr_matrix(
   pd.get_dummies(df_test[list(cols - (basic_cols - {'item_condition_id', 'shipping'}))],
                  sparse=True).values)
print('[{}] Create dummies completed - test'.format(time.time() - start_time))

# Combine all feature matrices
sparse_merge_train = hstack((X_dummies_train, X_description_train, X_brand_train, X_cat_train,
                            X_cat1_train, X_cat2_train, X_name_train)).tocsr()
del X_description_train, lb, X_name_train, X_dummies_train
gc.collect()
print('[{}] Create sparse merge train completed'.format(time.time() - start_time))

sparse_merge_test = hstack((X_dummies_test, X_description_test, X_brand_test, X_cat_test,
                            X_cat1_test, X_cat2_test, X_name_test)).tocsr()
del X_description_test, X_name_test, X_dummies_test
gc.collect()
print('[{}] Create sparse merge test completed'.format(time.time() - start_time))

# Set number of iterations for FM_FTRL model training
if SUBMIT_MODE:
   iters = 3
else:
   iters = 1
   rounds = 3

# Define FM_FTRL model
model = FM_FTRL(alpha=0.035, beta=0.001, L1=0.00001, L2=0.15, D=sparse_merge_train.shape[1],
               alpha_fm=0.05, L2_fm=0.0, init_fm=0.01,
               D_fm=100, e_noise=0, iters=iters, inv_link="identity", threads=4)

# Train and predict with FM_FTRL model
if SUBMIT_MODE:
   model.fit(sparse_merge_train, y_train)
   print('[{}] Train FM completed'.format(time.time() - start_time))
   predsFM = model.predict(sparse_merge_test)
   print('[{}] Predict FM completed'.format(time.time() - start_time))
else:
   # In validation mode, repeat multiple times to check performance
   for i in range(rounds):
       model.fit(sparse_merge_train, y_train)
       predsFM = model.predict(sparse_merge_test)
       print('[{}] Iteration {}/{} -- RMSLE: {}'.format(time.time() - start_time, i + 1, rounds, rmse(predsFM, y_test)))

del model
gc.collect()
if not SUBMIT_MODE:
   print("FM_FTRL dev RMSLE:", rmse(predsFM, y_test))

# Feature selection (SelectKBest)
fselect = SelectKBest(f_regression, k=48000)
train_features = fselect.fit_transform(sparse_merge_train, y_train)
test_features = fselect.transform(sparse_merge_test)
print('[{}] Select best completed'.format(time.time() - start_time))

# Memory cleanup
del sparse_merge_train
del sparse_merge_test
gc.collect()
print('[{}] Garbage collection'.format(time.time() - start_time))

# TF-IDF vectorization (name)
tv = TfidfVectorizer(max_features=250000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_name_train = tv.fit_transform(df_train['name'])
print('[{}] Finished TFIDF vectorize `name` (1/2)'.format(time.time() - start_time))
X_name_test = tv.transform(df_test['name'])
print('[{}] Finished TFIDF vectorize `name` (2/2)'.format(time.time() - start_time))

# TF-IDF vectorization (item description)
tv = TfidfVectorizer(max_features=500000,
                    ngram_range=(1, 3),
                    stop_words=None)
X_description_train = tv.fit_transform(df_train['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (1/2)'.format(time.time() - start_time))
X_description_test = tv.transform(df_test['item_description'])
print('[{}] Finished TFIDF vectorize `item_description` (2/2)'.format(time.time() - start_time))

# Prepare dataset for LightGBM model
d_train = lgb.Dataset(train_features, label=y_train)
del train_features; gc.collect()
if SUBMIT_MODE:
   watchlist = [d_train]
else:
   d_valid = lgb.Dataset(test_features, label=y_test)
   watchlist = [d_train, d_valid]

# Set LightGBM parameters
params = {
   'learning_rate': 0.15,
   'application': 'regression',
   'max_depth': 13,
   'num_leaves': 400,
   'verbosity': -1,  # Don't print training progress
   'metric': 'RMSE',
   'data_random_seed': 1,
   'bagging_fraction': 0.8,  # Data sampling ratio for bagging
   'feature_fraction': 0.6,  # Feature ratio to use in each tree
   'nthread': 4,  # Number of CPU threads to use
   'lambda_l1': 10,  # L1 regularization
   'lambda_l2': 10   # L2 regularization
}
print('[{}] Finished compiling LGB'.format(time.time() - start_time))

# Train LightGBM model
modelL = lgb.train(params,
                 train_set=d_train,
                 num_boost_round=1350,  # Number of boosting iterations
                 valid_sets=watchlist,
                 verbose_eval=50)  # Print evaluation results every 50 iterations

# LightGBM prediction
predsL = modelL.predict(test_features)
predsL[predsL < 0] = 0  # Adjust negative predictions to 0

if not SUBMIT_MODE:
   print("LGB RMSLE:", rmse(predsL, y_test))

# Memory cleanup
del d_train
del modelL
if not SUBMIT_MODE:
   del d_valid
gc.collect()

# Combine FM_FTRL and LightGBM predictions (weighted average)
preds_final = predsFM * 0.33 + predsL * 0.67
if not SUBMIT_MODE:
   print('Final RMSE: ', rmse(preds_final, y_test))

# Save final prediction results
if SUBMIT_MODE:
   preds_final = np.expm1(preds_final)  # Reverse log transformation
   submission['price'] = preds_final
   submission.to_csv('lgb_and_fm_separate_train_test.csv', index=False)
   print('[{}] Writing submission done'.format(time.time() - start_time))

 

3. Combining Ridge predictions

desc_ridge_preds_oof = np.concatenate((desc_ridge_preds2, desc_ridge_preds1), axis=0)
desc_ridge_preds_test = (desc_ridge_preds1f + desc_ridge_preds2f) / 2.0
  • OOF Predictions:
    • Each data point is predicted by a model that didn't use it for training
    • Can obtain predictions for the entire training data without overfitting
    • These predictions can be used as features for the next level models
  • Test Predictions Average:
    • More stable predictions by averaging predictions from two models
    • Offsets errors from individual models
    • Applies the basic principle of ensemble learning
  • Note, however, that in this kernel these Ridge OOF and averaged test predictions are only used to print intermediate RMSLE diagnostics during the run; they are NOT part of the final blended prediction. A compressed sketch of the two-fold OOF pattern follows below.
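
A compressed, self-contained sketch of the same two-fold OOF pattern, using generic names and synthetic data rather than the kernel's variables:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = 2 * X[:, 0] + 0.1 * rng.randn(1000)
X_test = rng.randn(200, 20)

# shuffle=False keeps the original row order, so the concatenation below lines up with y
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, shuffle=False)

m1 = Ridge(alpha=3.3).fit(X1, y1)   # trained on the first half, predicts the second half
m2 = Ridge(alpha=3.3).fit(X2, y2)   # trained on the second half, predicts the first half

preds_oof = np.concatenate((m2.predict(X1), m1.predict(X2)), axis=0)   # out-of-fold predictions
preds_test = (m1.predict(X_test) + m2.predict(X_test)) / 2.0           # averaged test predictions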

Stay focused on your goals and don't let distractions derail you from your path.
- Max Holloway -