Category: case study

    Comparison of NLP Data Sampling Strategies

    What is Sampling?

    A sample is a collection of people, objects, or other items drawn from a wider population for analysis. For the findings from a sample to extrapolate to the entire population, the sample must be representative of that population.

    Let’s go through a real-world scenario.

    We want to find the average annual salary of Mumbai’s adult population. As of 2022, Mumbai has a population of about 30 million. Men and women are split roughly 1:1 in this population (a simple generalization), and the two groups might have different average salaries. Similarly, there are many other ways to divide the adult population into groups with different income levels. As you may guess, determining the exact average adult income in Mumbai is incredibly difficult.

    Since it’s impossible to reach every adult in the whole population, what can be the solution? We can collect several samples and determine the average salary of the people in the chosen samples.

    How can we take a Sample?

    Taking the same scenario from above, imagine we only sample people in managerial positions. This would not be a good sample because, generally speaking, a manager earns more than the average adult, so it would give us a poor estimate of the average adult’s income. A sample must accurately reflect the population from which it was drawn.
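
    To see how much a managers-only sample can distort the estimate, here is a minimal simulation sketch. Everything in it is made up for illustration: the 10% share of managers, the salary levels, and the function name simulate_salary_estimates are assumptions, not real Mumbai data.

```python
import random
import statistics


def simulate_salary_estimates(seed=0, population_size=100_000, sample_size=1_000):
    """Compare a random sample with a managers-only sample on synthetic data."""
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        is_manager = rng.random() < 0.10            # assumed share of managers
        base = 900_000 if is_manager else 300_000   # assumed typical salaries
        salary = max(0.0, rng.gauss(base, base * 0.25))
        population.append((is_manager, salary))

    # Mean over the whole synthetic population (unknown in a real study).
    true_mean = statistics.fmean(s for _, s in population)

    # Representative sample: every adult is equally likely to be picked.
    random_sample = rng.sample(population, sample_size)
    random_mean = statistics.fmean(s for _, s in random_sample)

    # Biased sample: only managers are picked.
    managers = [person for person in population if person[0]]
    manager_sample = rng.sample(managers, sample_size)
    manager_mean = statistics.fmean(s for _, s in manager_sample)

    return true_mean, random_mean, manager_mean


true_mean, random_mean, manager_mean = simulate_salary_estimates()
print(f"population mean:      {true_mean:,.0f}")
print(f"random sample mean:   {random_mean:,.0f}")
print(f"managers-only mean:   {manager_mean:,.0f}")
```

    Under these assumptions the random sample lands close to the population mean, while the managers-only sample overshoots it by a wide margin.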

    There are several possible solutions, but we’ll be looking at three major techniques.

    Sampling strategies:

    • Most uncertain probability
    • Most certain data points
    • The basic mixture from different confidence intervals

    Most Uncertain Probability

    The aim behind uncertainty sampling is to focus on the data item that the present predictor is least certain about. To put it another way, uncertainty sampling typically finds points that are located near the decision boundary of the current model.

    Uncertainty Sampling

    Assume that a student is preparing for an exam and has 1000 questions to go through, but only has time for 100 of them. Naturally, the student should pick the 100 questions they are least confident about. By practicing on those questions, the student should improve faster than by reviewing questions they already know.
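
    For a binary classifier, "least confident" can be read as "predicted probability closest to 0.5". The sketch below ranks data points that way; the helper name most_uncertain and the example scores are illustrative assumptions, not part of the experiment that follows.

```python
def most_uncertain(probabilities, k):
    """Return the indices of the k probabilities closest to 0.5 (binary case)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),  # distance from the decision boundary
    )
    return ranked[:k]


# Hypothetical predicted probabilities for six data points.
scores = [0.97, 0.52, 0.08, 0.45, 0.71, 0.49]
picked = most_uncertain(scores, 3)
print(picked)  # indices of the three points nearest to 0.5
```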

    Most Certain Data Points

    This method chooses the data points with the highest certainty, i.e., the data points that the model predicts with the highest confidence. These data points have the best chance of being predicted correctly by the model, but they may not add much new information to the model’s learning.
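
    One way to express "highest confidence" for a binary classifier is the larger of p and 1 - p. The sketch below uses that definition, which is an assumption: the experiment later in this article instead sorts by the predicted probability of a single class. The helper name most_certain and the scores are illustrative only.

```python
def most_certain(probabilities, k):
    """Return the indices of the k predictions farthest from 0.5 (binary case)."""
    # Confidence in the predicted class, whichever class that is.
    confidence = [max(p, 1.0 - p) for p in probabilities]
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: confidence[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted probabilities for six data points.
scores = [0.97, 0.52, 0.08, 0.45, 0.71, 0.49]
picked = most_certain(scores, 3)
print(picked)  # indices of the three most confidently predicted points
```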

    Basic mixture of different Confidence Intervals

    Data points are grouped according to their confidence scores, and samples are drawn from every one of these intervals or groups. This way, no kind of data is left out, and the sampled data contains a balance of certain and uncertain data points. The model can then learn the decision boundary well without losing what it has already learned.
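
    The grouping idea can be sketched with equal-width probability bins. Note the assumptions: the experiment below uses percentile-based groups rather than equal-width bins, and the helper name sample_from_bins, the number of bins, and the scores are all illustrative.

```python
import random


def sample_from_bins(probabilities, n_bins, per_bin, seed=0):
    """Group indices into equal-width probability bins and draw up to per_bin from each."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for idx, p in enumerate(probabilities):
        # A probability of exactly 1.0 is placed in the last bin.
        bin_idx = min(int(p * n_bins), n_bins - 1)
        bins[bin_idx].append(idx)

    chosen = []
    for members in bins:
        # Take everything if a bin holds fewer than per_bin points.
        chosen.extend(rng.sample(members, min(per_bin, len(members))))
    return chosen, bins


# Hypothetical predicted probabilities for nine data points.
scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95, 0.99]
chosen, bins = sample_from_bins(scores, n_bins=3, per_bin=2)
print(bins)    # which indices fall in the low, middle, and high bins
print(chosen)  # two indices drawn from each bin
```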

    Code

    Now, let’s apply these sampling methods using some simple Python code!

    We’ll be working with a binary classification problem, using two datasets:

    1. IMDB Movie Review Dataset for sentiment analysis. Two classes in this dataset: Positive, Negative
    2. Emotion Dataset. Two classes in this dataset: Joy, Sadness

    Download the datasets:

    1. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
    2. https://huggingface.co/datasets/emotion

    We performed this experiment in a Jupyter Notebook.

    Loading Data & Preprocessing

    The availability of data is always a determining factor in machine learning, so loading the data comes first. After loading the dataset and the necessary modules, the dataframe has two columns, review and sentiment.

    Clean the data by replacing every pair of HTML line breaks with a single space.

    # vectorized replacement; avoids chained assignment on df['review'][idx]
    df['review'] = df['review'].str.replace('<br /><br />', ' ', regex=False)
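
    The replacement above only matches the exact string '<br /><br />'. If the reviews also contain other break variants, a regular expression is one option. This is a sketch under that assumption; the helper name strip_breaks is hypothetical, and collapsing repeated whitespace is an extra step not present in the original cleaning.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants.
_BREAK = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
_SPACES = re.compile(r"\s+")


def strip_breaks(text):
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    no_breaks = _BREAK.sub(" ", text)
    return _SPACES.sub(" ", no_breaks).strip()


print(strip_breaks("Great film.<br /><br />Loved it."))
print(strip_breaks("a<BR>b <br/> c"))
```

    With pandas, the same function could be applied to the review column via df['review'].map(strip_breaks).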

    For ease of the experiment, we’re using 10k paragraphs out of the whole dataset of 50k paragraphs.

    frac = 1/5
    df_new = df.sample(frac = frac, random_state = 0)
    df_new.shape
    [Out]: (10000, 2)

    Data Splitting

    A train-test split is created, so the test split can be used to evaluate the performance of the model trained using the train split.

    df_train, df_test = train_test_split(df_new, test_size=0.2, random_state = 0)
    df_train = df_train.reset_index(drop = True)
    df_train.shape, df_test.shape
    [Out]: ((8000, 2), (2000, 2))

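    Let’s separate out 100 paragraphs for training the initial model and keep the remaining 7900 paragraphs as the pool that the sampling strategies draw from.

    # assumption: the original does not show how df_100_train was built; a random draw of 100 rows is used here
    df_100_train = df_train.sample(n = 100, random_state = 0)
    df_stage1_test = df_train.drop(df_100_train.index).reset_index(drop = True)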

    Classifier Training

    Since the train and test sets have been constructed, the pipeline can be instantiated. The pipeline consists of three steps: data transformation, resampling, and model creation at the end.

    # The resulting matrices will have the shape of (`nr of examples`, `nr of word n-grams`)
    vectorizer = CountVectorizer(ngram_range=(1, 5))
    X_100_train = vectorizer.fit_transform(df_100_train.review)
    X_stage1_test = vectorizer.transform(df_stage1_test.review)
    X_test = vectorizer.transform(df_test.review)
    
    labelencoder = LabelEncoder()
    df_100_train['sentiment'] = labelencoder.fit_transform(df_100_train['sentiment'])
    df_stage1_test['sentiment'] = labelencoder.transform(df_stage1_test['sentiment'])
    df_test['sentiment'] = labelencoder.transform(df_test['sentiment'])

    Before moving on to the sampling strategies, an initial model is trained:

    logreg = LogisticRegression()
    logreg.fit(X=X_100_train, y=df_100_train['sentiment'].to_list())

    Calculating the Predicted Probabilities

    Predicted probabilities (confidence scores) are calculated on the pool of 7900 paragraphs.

    pred_proba = logreg.predict_proba(X=X_stage1_test)
    df_proba = pd.DataFrame()
    df_proba['review'] = df_stage1_test['review']
    df_proba['sentiment'] = df_stage1_test['sentiment']
    df_proba['predict_proba'] = list(pred_proba[:,0])
    df_proba

    Most Uncertain Probability Sampling

    Paragraphs are picked from the probability window with the highest degree of uncertainty, that is, predicted probabilities close to 0.5. The window must contain at least 1000 paragraphs. These paragraphs are sorted by predicted probability in descending order in a dataframe, and the mean of their predicted probabilities is computed.

    Next, the index of the row whose predicted probability is closest to that mean is located.

    Each training set is then built around this row: half of the paragraphs come from the rows above it (higher probabilities) and half from the rows below it (lower probabilities). The same method applies if the uncertain probability window is narrowed or widened.

    # window between 0.44 and 0.55
    df_uncertain = df_proba[(df_proba['predict_proba'] >= 0.44) & (df_proba['predict_proba'] <= 0.55)]
    df_uncertain_sorted = df_uncertain.sort_values(by = ['predict_proba'], ascending = False)
    df_uncertain_sorted

    # index of the row closest to the mean value of predicted probability;
    # the offset of 12 from the midpoint is hard-coded for this particular run
    mid_idx = int(len(df_uncertain_sorted)/2)
    mean_idx = mid_idx-12
    df_uncertain_sorted['predict_proba'].mean()
    [Out]: 0.4936057560087512
    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_uncertain_list = []
    
    for para in num_of_para:
        
        para_idx = int(para/2)
        
        #training set
        df_uncertain_new = df_uncertain_sorted.iloc[mean_idx-para_idx:mean_idx+para_idx]
        
        #preprocessing
        X_train_uncertain = vectorizer.transform(df_uncertain_new.review)
        
        #defining the classifier
        logreg_uncertain = LogisticRegression()
        
        #training the classifier
        logreg_uncertain.fit(X=X_train_uncertain, y=df_uncertain_new['sentiment'].to_list())    
        #calculating the accuracy score on the test set
        score_uncertain = logreg_uncertain.score(X_test, df_test['sentiment'].to_list())
        score_uncertain_list.append(score_uncertain)
    
    score_uncertain_list
    [Out]: [0.547, 0.5845, 0.584, 0.612, 0.6215, 0.6335, 0.6415, 0.663, 0.659, 0.6755]
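
    The row closest to the mean can also be located programmatically instead of through a hard-coded offset such as mid_idx-12. The sketch below works on a plain list of probabilities sorted in descending order; the helper name window_around_mean and the example values are assumptions for illustration. It also clamps the window to the list bounds, since a negative start index in a Python slice does not mean "from the beginning".

```python
import statistics


def window_around_mean(sorted_probs, size):
    """Return (start, stop) of a slice of `size` rows centred on the value nearest the mean."""
    mean = statistics.fmean(sorted_probs)
    # Position of the probability closest to the mean.
    centre = min(range(len(sorted_probs)), key=lambda i: abs(sorted_probs[i] - mean))
    half = size // 2
    # Clamp the window so it never runs past either end of the list.
    start = max(0, centre - half)
    stop = min(len(sorted_probs), start + size)
    start = max(0, stop - size)
    return start, stop


# Hypothetical probabilities inside the uncertainty window, sorted descending.
probs = [0.55, 0.54, 0.52, 0.50, 0.49, 0.47, 0.46, 0.44]
start, stop = window_around_mean(probs, 4)
print(probs[start:stop])  # two values above and two below the centre position
```

    With a dataframe, the returned pair could be used as df_uncertain_sorted.iloc[start:stop].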

    Most Certain Probability Sampling

    The dataframe with 7900 paragraphs is sorted in descending order of predicted probability. The top 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1000 paragraphs are selected as the most certain sets.

    df_proba_sorted = df_proba.sort_values(by = ['predict_proba'], ascending = False)
    df_proba_sorted
    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_certain_list = []
    
    for para in num_of_para:
        
        #training set
        df_certain = df_proba_sorted[:para]
        
        #preprocessing
        X_train_certain = vectorizer.transform(df_certain.review)
        
        #defining the classifier
        logreg_certain = LogisticRegression()
        
        #training the classifier
        logreg_certain.fit(X=X_train_certain, y=df_certain['sentiment'].to_list())
        
        #calculating the accuracy score on the test set
        score_certain = logreg_certain.score(X_test, df_test['sentiment'].to_list())
        
        score_certain_list.append(score_certain)
    
    score_certain_list
    [Out]: [0.5215, 0.54, 0.536, 0.5755, 0.5905, 0.6245, 0.641, 0.6735, 0.7355, 0.7145]

    Confidence Interval Grouping Sampling

    In this method, the 25th and 75th percentiles of the predicted probabilities are calculated. The 7900 paragraphs are then separated into 3 groups.

    • Group 1: predicted probabilities >= 75th percentile
    • Group 2: 25th percentile <= predicted probabilities < 75th percentile
    • Group 3: predicted probabilities < 25th percentile

    From these 3 groups, training sets of 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1000 paragraphs are sampled according to fixed fractions (0.4, 0.3, and 0.3 for Groups 1, 2, and 3 in the code below):

    #calculating the 25th and 75th percentile
    
    proba_arr = df_proba['predict_proba']
    percentile_75 = np.percentile(proba_arr, 75)
    percentile_25 = np.percentile(proba_arr, 25)
    
    print("25th percentile of arr : ",
           np.percentile(proba_arr, 25))
    [Out]: 25th percentile of arr :  0.28084100127515504
    print("75th percentile of arr : ",
           np.percentile(proba_arr, 75))
    [Out]: 75th percentile of arr :  0.7063559972435552
    
    #grouping of the paragraphs for following window 
    # group 1 : >= 75
    df_group_1 = df_proba[df_proba['predict_proba'] >= percentile_75]
    # group 2 : <75 and >= 25
    df_group_2 = df_proba[(df_proba['predict_proba'] >= percentile_25) & (df_proba['predict_proba'] < percentile_75)]
    # group 3 : < 25
    df_group_3 = df_proba[(df_proba['predict_proba'] < percentile_25)]
    
    df_group_1.shape, df_group_2.shape, df_group_3.shape
    [Out]: ((1975, 3), (3950, 3), (1975, 3))

    Ten models are then trained, one per training-set size, for each of the 3 sampling techniques (10 x 3 = 30 models in total). The accuracy score for each case is calculated by evaluating the trained model on the 2000-paragraph test set.

    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_conf_list = []
    
    #fractions
    frac1 = 0.4
    frac2 = 0.3
    frac3 = 0.3
    
    #sampling paragraphs from the 3 groups
    df_group_1_frac = df_group_1.sample(frac=frac1, random_state=1).reset_index(drop = True)
    df_group_2_frac = df_group_2.sample(frac=frac2, random_state=1).reset_index(drop = True)
    df_group_3_frac = df_group_3.sample(frac=frac3, random_state=1).reset_index(drop = True)
    
    for para in num_of_para:
        
        #sampling paragraphs from the 3 groups to build the training set
        df_group_1_new = df_group_1_frac[:int(frac1 * para)]
        df_group_2_new = df_group_2_frac[:int(frac2 * para)]
        df_group_3_new = df_group_3_frac[:int(frac3 * para)]
        
        df_list = [df_group_1_new, df_group_2_new, df_group_3_new]
        
        #training set
        df_conf = pd.concat(df_list).reset_index(drop = True)
        
        #preprocessing
        X_train_conf = vectorizer.transform(df_conf.review)
        
        #defining the classifier
        logreg_conf = LogisticRegression()
        
        #training the classifier
        logreg_conf.fit(X=X_train_conf, y=df_conf['sentiment'].to_list())
        
        #calculating the accuracy score on the test set
        score_conf = logreg_conf.score(X_test, df_test['sentiment'].to_list())
        score_conf_list.append(score_conf)
    
    score_conf_list
    [Out]: [0.6525, 0.6835, 0.7235, 0.7525, 0.766, 0.7735, 0.778, 0.7875, 0.796, 0.807]

    Results and Conclusion

    The accuracies of the three sampling strategies can now be compared. Sampling from a mixture of confidence intervals performs best at every training-set size: with 1000 paragraphs it reaches an accuracy of 0.807, against 0.7145 for the most certain paragraphs and 0.6755 for the most uncertain ones. This suggests that, along with learning new information from uncertain paragraphs, the model also needs to retain previously learned information. A balance of data from different confidence intervals therefore helps the model learn and maximizes the resulting overall accuracy.
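
    The comparison can be checked directly from the three accuracy lists printed above. This small sketch copies those reported numbers into a dictionary and finds the best strategy at each training-set size; the dictionary keys and the helper name best_per_size are just labels chosen here.

```python
sizes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

# Accuracy scores copied from the outputs reported in this article.
accuracy = {
    "most uncertain": [0.547, 0.5845, 0.584, 0.612, 0.6215, 0.6335, 0.6415, 0.663, 0.659, 0.6755],
    "most certain": [0.5215, 0.54, 0.536, 0.5755, 0.5905, 0.6245, 0.641, 0.6735, 0.7355, 0.7145],
    "mixture": [0.6525, 0.6835, 0.7235, 0.7525, 0.766, 0.7735, 0.778, 0.7875, 0.796, 0.807],
}


def best_per_size(sizes, accuracy):
    """Map each training-set size to the strategy with the highest accuracy."""
    winners = {}
    for i, n in enumerate(sizes):
        winners[n] = max(accuracy, key=lambda name: accuracy[name][i])
    return winners


winners = best_per_size(sizes, accuracy)
for n in sizes:
    print(n, winners[n])
```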