DataNeuron’s Active Learning approach to NLP data labeling outperforms competitors’ Sequential Learning by 20% with 50% less data

Active learning is closely related to conventional supervised learning. Because models are trained using both labeled and unlabeled data, it is a form of semi-supervised learning. The idea behind semi-supervised learning is that labeling only a small, well-chosen sample of the data can produce results as accurate as, or even more accurate than, training on fully labeled data; the hard part is finding that sample. In active learning, data is labeled incrementally and dynamically during training, so that the algorithm can determine which labels would be most informative for it to learn from.
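The "most informative label" idea is usually implemented with an acquisition strategy such as uncertainty sampling: the learner queries the unlabeled example it is least sure about. Here is a minimal, self-contained sketch of that idea; the paragraph names and class probabilities below are made up purely for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predicted class probabilities for four unlabeled paragraphs.
unlabeled_probs = {
    "para_a": [0.97, 0.02, 0.01],  # model is confident
    "para_b": [0.40, 0.35, 0.25],  # model is uncertain
    "para_c": [0.85, 0.10, 0.05],
    "para_d": [0.50, 0.45, 0.05],
}

# Query the paragraph whose prediction has the highest entropy:
# labeling it is expected to teach the model the most.
query = max(unlabeled_probs, key=lambda k: entropy(unlabeled_probs[k]))
# → "para_b"
```

In a real pipeline the probabilities would come from the current model (e.g. `predict_proba`), and the queried example would be sent to a human annotator, labeled, and added to the training set before retraining.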

Experiment

Let’s compare the two approaches, Sequential and Active Learning, through a simple experiment.

We’ll be using the Contract Understanding Atticus Dataset (CUAD) for the purpose of this experiment. The dataset contains 4414 paragraphs. We have randomly selected 10 types of legal clause classes for the text classification task.

Download the dataset: https://www.atticusprojectai.org/cuad

Loading data & preprocessing

We’ll start by loading the data, then extract the features and labels and make a train-test split; the test split is used to assess how well a model trained on the train split performs.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#train test split
df_train, df_test = train_test_split(df_total, test_size=0.2, random_state=0)

# The resulting matrices will have the shape of (`nr of examples`, `nr of word n-grams`)
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_train = vectorizer.fit_transform(df_train.para)
X_test = vectorizer.transform(df_test.para)

X_train.shape
[Out] : (4082, 1802587)

X_test.shape
[Out] : (1021, 1802587)
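For intuition on why `ngram_range=(1, 5)` produces such a wide matrix (roughly 1.8 million columns here), this pure-Python sketch mimics word n-gram extraction; the example sentence is hypothetical.

```python
def word_ngrams(tokens, n_min=1, n_max=5):
    """All contiguous word n-grams from n_min to n_max,
    mimicking CountVectorizer(ngram_range=(1, 5))."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "this agreement shall terminate".split()
grams = word_ngrams(tokens)
# 4 unigrams + 3 bigrams + 2 trigrams + 1 four-gram = 10 features from one 4-word sentence
```

Every distinct n-gram across the corpus becomes its own column, which is why the feature count grows so quickly as the upper n increases.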

With Sequential Learning

The Sequential Learning classifier will be trained only twice: once with 100 paragraphs and again with 1000 paragraphs. For the next 1000 paragraphs, the total number of correct and wrong predictions is calculated.

#Defining the classifier
classifier_seq = MultinomialNB()

#preprocessing of training set
x_100_train = vectorizer.transform(df_train[:100]['para'])
y_100_train = df_train[:100]['label']

#preprocessing of testing set
x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
y_1000_test = df_train[100:1100]['label']

#training the model
classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())

#calculating the accuracy on the test set
classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
[Out] : 0.681

Code for calculating the prediction accuracy on every successive batch of 10 paragraphs, sequentially:

#the number of paragraphs in validation set for each iteration
n_instances = 10

#first 100 paragraphs
n_start = 100 
n_para = 100

#defining the classifier
classifier_seq = MultinomialNB()
#preprocessing of training set
x_100_train = vectorizer.transform(df_train[:100]['para'])
y_100_train = df_train[:100]['label']

#preprocessing of testing set
x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
y_1000_test = df_train[100:1100]['label']

#training the classifier
classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
#calculating the accuracy on the testing set
acc_1000 = classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())

accuracies_seq = []
seq_correct = 0
seq_wrong = 0
y_test_list = []

while n_para <= 1100:
    
    x_test =  vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
    y_test = df_train[n_start:n_start+n_instances]['label']
    y_test_list.append(y_test.to_list())
    
    y_pred = classifier_seq.predict(X=x_test)
    y_pred = list(y_pred)
    
    #calculation of total number of correct and wrong predictions for the next 1000 paras
    if len(y_test_list) <= 100:
        for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
            if ele_pred == ele_test:
                seq_correct += 1
            else:
                seq_wrong += 1
    
    score = classifier_seq.score(X=x_test, y=y_test.to_list())
    accuracies_seq.append(score)
    
    n_start = n_start + n_instances
    n_para += n_instances
   
#total number of correctly classified and incorrectly classified samples
seq_correct, seq_wrong
[Out] : (681, 319)

With Active Learning

The Active Learning classifier will be trained iteratively.

After training on the first 100 paragraphs, the classifier predicts labels for the next 10 paragraphs, and we record how many of those unseen paragraphs are predicted correctly and incorrectly. These 10 paragraphs are then added to the training set, and the process repeats until all 1000 paragraphs have been used in training. As with the Sequential Learning classifier, the total number of correct and wrong predictions over those 1000 paragraphs is counted.

#the number of paragraphs in validation set for each iteration
n_instances = 10

#first 100 paragraphs
n_start = 100 

#defining the classifier
classifier = MultinomialNB()

#preprocessing of training set
x_train = vectorizer.transform(df_train[:100]['para'])
y_train = df_train[:100]['label']

#training the classifier
classifier.fit(X=x_train, y=y_train.to_list())

accuracies = []
test_accuracies = []

correct = 0
wrong = 0

n_para = 100
y_test_list = []

while x_train.shape[0] <= 1100:
    
    #calculating test accuracy for each iteration
    test_score = classifier.score(X=X_test, y=df_test.label.to_list())
    test_accuracies.append(test_score)
    
    #calculating validation accuracy for every next 10 paras
    x_test = vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
    y_test = df_train[n_start:n_start+n_instances]['label']
    y_test_list.append(y_test.to_list())
    y_pred = classifier.predict(X=x_test)
    y_pred = list(y_pred)
    
    #calculation of total number of correct and wrong predictions for the next 1000 paras
    if len(y_test_list) <= 100:
        for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
            if ele_pred == ele_test:
                correct += 1
            else:
                wrong += 1

    score = classifier.score(X=x_test, y=y_test.to_list())
    accuracies.append(score)
    #sequentially increasing the number of para
    x_train = vectorizer.transform(df_train[:n_start+n_instances]['para'])
    y_train = df_train[:n_start+n_instances]['label']
    classifier = MultinomialNB()
    classifier.fit(X=x_train, y=y_train.to_list())
    
    n_start = n_start + n_instances
    n_para += n_instances

#total number of correctly classified and incorrectly classified samples
correct, wrong
[Out] : (844, 156)
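Using the correct/wrong counts reported above for the two classifiers, the headline accuracies and the relative improvement can be checked directly:

```python
# Counts reported over the same 1000 held-out paragraphs
seq_correct, seq_wrong = 681, 319   # Sequential Learning
al_correct, al_wrong = 844, 156     # Active Learning

seq_acc = seq_correct / (seq_correct + seq_wrong)  # 0.681
al_acc = al_correct / (al_correct + al_wrong)      # 0.844

# Relative improvement of Active Learning over Sequential Learning
relative_gain = (al_acc - seq_acc) / seq_acc       # ≈ 0.24, i.e. about 24%
```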

Results

Let’s take a look at the results we obtained from the experiments:

The first graph shows that as more paragraphs are used for training, the accuracy of the Active Learning classifier increases, whereas the accuracy of the Sequential Learning algorithm does not improve.

The second graph shows the accuracy of the two classifiers on the unseen testing dataset after every addition of 10 paragraphs beyond the initial 100.

Conclusion

The results show that the Active Learning classifier outperforms the Sequential Learning classifier. Active Learning keeps learning through intermediate validations, whereas the Sequential Learning classifier is trained only at 100 and 1100 paragraphs and learns nothing in between. This is how the Active Learning classifier achieves higher accuracy while using less labeled data.
