DataNeuron helps you accelerate and automate human-in-the-loop annotation for developing AI solutions. Powered by a data-centric platform, we automate data labeling, model creation, and end-to-end ML lifecycle management.
Ingest
Upload Visualization
Users can upload all of their available data without performing any filtering to remove out-of-scope paragraphs.
The data can be uploaded in various file formats supported by the platform.
The platform has an in-built feature that can handle out-of-scope paragraphs and separate them from the classification data. This functionality is optional and can be toggled on or off anytime during the process.
Structure
Structure Visualization
The next step is the creation of the project structure.
Instead of a simple flat structure with just the classes defined, we give the user the option to create a multi-level (hierarchical) structure, so the data can be grouped into domains and subdomains and divided into further subparts indefinitely, depending on the user's needs.
Any of the defined nodes can be marked as a class for the data to be classified into, irrespective of its level in the hierarchy. This provides the flexibility to create any level of ontology for classification.
Validate
Validate Visualization
The user does not have to go through the entire dataset to find paragraphs that belong to a certain class and label them as training data for the model, which would be a tedious and difficult task.
We propose a validation based approach:
The platform provides the users with suggestions of paragraphs that are most likely to belong to a certain category/class based on an efficient context-based filtering criterion.
The user simply has to validate the suggestions, i.e., check whether or not the suggested class is correct.
This reduces the effort put in by the user in filtering out the paragraphs belonging to a category from the entire dataset by a large margin.
The strategic annotation technique lets the user adopt a ‘one-vs-all’ strategy, which makes the task far easier than having to weigh every defined class, potentially a large number depending on the problem at hand, while tagging a paragraph.
Our intelligent filtering algorithm ensures that “edge-case” paragraphs, i.e., paragraphs that do not have an obvious correlation with a class but still belong to it, are not left out.
This step is broken down into 2 stages:
The validation done by the user in the first stage is used for determining the annotation suggestions offered in the second stage.
Each batch of annotation is used to improve the accuracy of the filtering algorithm for the next batch.
The platform also provides a summary screen after each batch of validation, giving the user an idea of how many more paragraphs might need to be validated per class in order to achieve higher accuracy.
It also helps determine when to stop validating a specific class and focus instead on a class for which the platform projects low confidence.
Train
Train Visualization
The user invests virtually no effort in the model training step; training can be initiated with the simple click of a button.
The training process is fully automatic and performs preprocessing, feature engineering, model selection, model training, optimization, and k-fold validation.
After the final model is trained, a detailed report is presented to the user, which includes the overall accuracy of the model as well as the accuracy achieved per class.
Iterate
Iterate Visualisation
Once the model has been trained, we provide the user with 2 options:
Continue to the deployment stage if the trained model matches their expectations.
If the model does not achieve the desired results, the user can go back and provide more training paragraphs (by validating more paragraphs or uploading seed paragraphs) or alter the project structure to remove some classes, and then retrain the model to achieve better results.
Deploy (“No-Code” Prediction Service)
Deploy Visualisation
Apart from providing the final annotations on the uploaded data using the trained model, we also provide a prediction service that can be used to make predictions on any new paragraphs for a very minimal fee.
This requires no coding knowledge, and users can utilize the service for any input data from the platform.
This can also be integrated into other platforms by making use of the exposed prediction API or the deployed Python package.
No Requirement for a Data Science/Machine Learning Expert
The DataNeuron ALP is designed in such a way that no prerequisite knowledge of data science or machine learning is required to utilize the platform to its maximum potential.
For some very specific use cases, a Subject Matter Expert (SME) might be required, but for the majority of use cases an SME is not needed in the DataNeuron pipeline.
What is Sampling?
A sample is a collection of people, objects, or items taken from a wider population for analysis. To enable us to extrapolate the research sample’s findings to the entire population, the sample must be representative of the population.
Let’s go through a real-world scenario.
We’re looking for the average annual salary of Mumbai’s adult population. As of 2022, Mumbai has a population of about 30 million. Males and females in this population would be split roughly 1:1 (a simple generalization), and they might have different averages. Similarly, there are numerous other ways in which various adult population groups may have varying income levels. As you might guess, it is incredibly difficult to determine the average adult income in Mumbai.
Since it’s impossible to reach every adult in the whole population, what can be the solution? We can collect numerous samples and determine the average income of the people in the chosen samples.
How can we take a Sample?
Taking the same scenario from above, imagine we only take samples from the people in managerial positions. This won’t be regarded as a decent sample because, on generalizing, a manager earns more than the average adult, and it will provide us with a poor estimation of the income of the average adult. A sample must accurately reflect the universe from which it was drawn.
There are various different potential solutions, but we’ll be looking at three major techniques.
Sampling strategies:
Most uncertain probability
Most certain data points
The basic mixture from different confidence intervals
Most Uncertain Probability
The aim behind uncertainty sampling is to focus on the data item that the present predictor is least certain about. To put it another way, uncertainty sampling typically finds points that are located near the decision boundary of the current model.
Uncertainty Sampling
Assume that a student is preparing for an exam and has 1000 questions to go through, but only has time to prepare 100 of them. Naturally, the student should prepare the 100 questions they are least confident about. With each new batch of questions, the student should get smarter, faster.
Most Certain Data Points
This method chooses the data points with the highest certainty, i.e., data points that the model predicts with the highest confidence. These data points have the greatest chance of being predicted correctly, but they may or may not add much new information to the model's learning.
Basic mixture of different Confidence Intervals
Data points are grouped according to their confidence scores, and sampling is done from all of these intervals or groups. This way, we can make sure no kind of data is missed: the sampled set contains a balance of certain and uncertain points, so the model can learn the decision boundary well without losing information it has already learned.
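To make the three strategies concrete, here is a minimal sketch (not DataNeuron's implementation) that applies each selection rule to a toy array of predicted probabilities from a binary classifier; the probability values, the 0.5 decision boundary, and the quartile-based grouping are illustrative assumptions. The code section below applies the same ideas to real model outputs.

import numpy as np

# toy predicted probabilities for the positive class (illustrative values only)
proba = np.array([0.05, 0.22, 0.38, 0.49, 0.51, 0.63, 0.77, 0.91, 0.97, 0.99])
k = 4  # number of points to sample

# most uncertain: points closest to the 0.5 decision boundary
uncertain_idx = np.argsort(np.abs(proba - 0.5))[:k]

# most certain: points farthest from the 0.5 decision boundary
certain_idx = np.argsort(np.abs(proba - 0.5))[-k:]

# mixture of confidence intervals: split the scores into quartile-based groups
# and draw part of the sample from every group
low, high = np.percentile(proba, [25, 75])
groups = [proba < low, (proba >= low) & (proba <= high), proba > high]
mixed_idx = np.concatenate([np.where(g)[0][:max(1, k // len(groups))] for g in groups])

print(uncertain_idx, certain_idx, mixed_idx)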
Code
Now, let’s use these sampling methods and see their application using a simple code in Python!
We’ll be working with a binary classification problem, using two datasets:
IMDB Movie Review Dataset for sentiment analysis. Two classes in this dataset: Positive, Negative
Emotion Dataset. Two classes in this dataset: Joy, Sadness
We have performed this experiment on Jupyter Notebook.
Loading Data & Preprocessing
The availability of data is always a determining factor in machine learning, so loading the data comes first. After loading the dataset and the necessary modules, the dataframe should look like this.
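The loading step that produces this dataframe is not listed in the original post; a minimal sketch, assuming the standard Kaggle CSV release of the IMDB dataset with review and sentiment columns, is:

import pandas as pd

# load the IMDB review dataset (file name assumed from the Kaggle release)
df = pd.read_csv('IMDB Dataset.csv')
df.head()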
Clean the data by replacing all occurrences of breaks with single white space.
# replace HTML line breaks with a single whitespace
df['review'] = df['review'].str.replace('<br /><br />', ' ', regex=False)
For ease of the experiment, we’re using 10k paragraphs out of the whole dataset of 50k paragraphs.
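The exact split code is not shown in this walkthrough; the sketch below is one plausible way to build the three dataframes used later (df_100_train with 100 seed paragraphs, df_stage1_test with the 7,900-paragraph sampling pool, and df_test with the 2,000-paragraph test set) from the 10k subset. The sizes follow the numbers quoted later in this article; the random_state is an arbitrary assumption.

# take a 10k subset of the 50k reviews and split it into
# 100 seed paragraphs, a 7,900-paragraph pool, and a 2,000-paragraph test set
df_10k = df.sample(n=10000, random_state=42).reset_index(drop=True)
df_100_train = df_10k.iloc[:100]          # initial labelled set
df_stage1_test = df_10k.iloc[100:8000]    # pool scored and sampled by the strategies
df_test = df_10k.iloc[8000:]              # held-out test set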
Since the train and test sets have been constructed, the pipeline can be instantiated. The pipeline consists of three steps: data transformation, resampling, and model creation at the end.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# The resulting matrices will have the shape of (`nr of examples`, `nr of word n-grams`)
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_100_train = vectorizer.fit_transform(df_100_train.review)
X_stage1_test = vectorizer.transform(df_stage1_test.review)
X_test = vectorizer.transform(df_test.review)
labelencoder = LabelEncoder()
df_100_train['sentiment'] = labelencoder.fit_transform(df_100_train['sentiment'])
df_stage1_test['sentiment'] = labelencoder.transform(df_stage1_test['sentiment'])
df_test['sentiment'] = labelencoder.transform(df_test['sentiment'])
Before moving on to the sampling strategies, an initial model is trained.
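The initial model itself is not listed in this post; the sketch below shows one way it could be set up, training a logistic regression on the 100 seed paragraphs and scoring the 7,900-paragraph pool so that the df_proba, df_proba_sorted, and df_uncertain_sorted dataframes referenced below exist. The choice of classifier mirrors the per-strategy models used later, and the 0.35–0.65 uncertainty window is an illustrative assumption.

from sklearn.linear_model import LogisticRegression

# initial model trained on the 100 seed paragraphs
logreg = LogisticRegression()
logreg.fit(X_100_train, df_100_train['sentiment'].to_list())

# probability of the positive class for every paragraph in the pool
df_proba = df_stage1_test.copy()
df_proba['predict_proba'] = logreg.predict_proba(X_stage1_test)[:, 1]

# sorted by predicted probability in descending order (used for "most certain" sampling)
df_proba_sorted = df_proba.sort_values('predict_proba', ascending=False).reset_index(drop=True)

# paragraphs inside an assumed uncertainty window around 0.5, sorted increasingly
window = (df_proba['predict_proba'] > 0.35) & (df_proba['predict_proba'] < 0.65)
df_uncertain_sorted = df_proba[window].sort_values('predict_proba').reset_index(drop=True)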
Most Uncertain Probability Sampling
1000 or more paragraphs are picked from the probability window with the highest degree of uncertainty, and they are sorted in increasing order of predicted probability in a dataframe. The mean value of the predicted probability is then computed.
The index of the row whose predicted probability is closest to this mean value is located.
The paragraph sets are chosen around this mean-index row (half of them from the part with probabilities greater than the mean, and half from the part with probabilities less than the mean). The most uncertain sets of paragraphs are thus selected by shrinking and widening the uncertain-probability window around this point.
#index of the row closest to the mean value of predicted probability
mid_idx = int(len(df_uncertain_sorted)/2)
mean_idx = mid_idx-12
df_uncertain_sorted['predict_proba'].mean()
[Out]: 0.4936057560087512
num_of_para = [100,200,300,400,500,600,700,800,900,1000]
score_uncertain_list = []
for para in num_of_para:
    para_idx = int(para/2)
    #training set
    df_uncertain_new = df_uncertain_sorted.iloc[mean_idx-para_idx:mean_idx+para_idx]
    #preprocessing
    X_train_uncertain = vectorizer.transform(df_uncertain_new.review)
    #defining the classifier
    logreg_uncertain = LogisticRegression()
    #training the classifier
    logreg_uncertain.fit(X=X_train_uncertain, y=df_uncertain_new['sentiment'].to_list())
    #calculating the accuracy score on the test set
    score_uncertain = logreg_uncertain.score(X_test, df_test['sentiment'].to_list())
    score_uncertain_list.append(score_uncertain)
score_uncertain_list
[Out]: [0.547, 0.5845, 0.584, 0.612, 0.6215, 0.6335, 0.6415, 0.663, 0.659, 0.6755]
Most Certain Probability Sampling
The dataframe with 7900 paragraphs is sorted in descending order of their predicted probabilities. The top [100,200,300,400,500,600,700,800,900,1000] sets of paragraphs are selected as the most certain paragraphs.
num_of_para = [100,200,300,400,500,600,700,800,900,1000]
score_certain_list = []
for para in num_of_para:
    #training set
    df_certain = df_proba_sorted[:para]
    #preprocessing
    X_train_certain = vectorizer.transform(df_certain.review)
    #defining the classifier
    logreg_certain = LogisticRegression()
    #training the classifier
    logreg_certain.fit(X=X_train_certain, y=df_certain['sentiment'].to_list())
    #calculating the accuracy score on the test set
    score_certain = logreg_certain.score(X_test, df_test['sentiment'].to_list())
    score_certain_list.append(score_certain)
score_certain_list
[Out]: [0.5215, 0.54, 0.536, 0.5755, 0.5905, 0.6245, 0.641, 0.6735, 0.7355, 0.7145]
Confidence Interval Grouping Sampling
In this method, the 25th and 75th percentile of the predicted probabilities are calculated. Then the 7900 paragraphs are separated into 3 groups.
From these 3 groups, sets of [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] paragraphs are sampled according to these fractions:
import numpy as np

#calculating the 25th and 75th percentile
proba_arr = df_proba['predict_proba']
percentile_75 = np.percentile(proba_arr, 75)
percentile_25 = np.percentile(proba_arr, 25)
print("25th percentile of arr : ",
np.percentile(proba_arr, 25))
[Out]: 25th percentile of arr : 0.28084100127515504
print("75th percentile of arr : ",
np.percentile(proba_arr, 75))
[Out]: 75th percentile of arr : 0.7063559972435552
#grouping the paragraphs into the following windows
# group 1 : predict_proba >= 75th percentile
df_group_1 = df_proba[df_proba['predict_proba'] >= percentile_75]
# group 2 : between the 25th and 75th percentile
df_group_2 = df_proba[(df_proba['predict_proba'] >= percentile_25) & (df_proba['predict_proba'] < percentile_75)]
# group 3 : < 25th percentile
df_group_3 = df_proba[(df_proba['predict_proba'] < percentile_25)]
df_group_1.shape, df_group_2.shape, df_group_3.shape
[Out]: ((1975, 3), (3950, 3), (1975, 3))
A model is then trained on each set of paragraphs for each of the 3 sampling techniques (10 set sizes x 3 strategies = 30 models in total). The accuracy score is calculated for each case by evaluating the model on the 2000-paragraph test set.
num_of_para = [100,200,300,400,500,600,700,800,900,1000]
score_conf_list = []

#fractions
frac1 = 0.4
frac2 = 0.3
frac3 = 0.3

#sampling paragraphs from the 3 groups
df_group_1_frac = df_group_1.sample(frac=frac1, random_state=1).reset_index(drop=True)
df_group_2_frac = df_group_2.sample(frac=frac2, random_state=1).reset_index(drop=True)
df_group_3_frac = df_group_3.sample(frac=frac3, random_state=1).reset_index(drop=True)

for para in num_of_para:
    #sampling paragraphs from the 3 groups to build the training set
    df_group_1_new = df_group_1_frac[:int(frac1 * para)]
    df_group_2_new = df_group_2_frac[:int(frac2 * para)]
    df_group_3_new = df_group_3_frac[:int(frac3 * para)]
    df_list = [df_group_1_new, df_group_2_new, df_group_3_new]
    #training set
    df_conf = pd.concat(df_list).reset_index(drop=True)
    #preprocessing
    X_train_conf = vectorizer.transform(df_conf.review)
    #defining the classifier
    logreg_conf = LogisticRegression()
    #training the classifier
    logreg_conf.fit(X=X_train_conf, y=df_conf['sentiment'].to_list())
    #calculating the accuracy score on the test set
    score_conf = logreg_conf.score(X_test, df_test['sentiment'].to_list())
    score_conf_list.append(score_conf)

score_conf_list
[Out]: [0.6525, 0.6835, 0.7235, 0.7525, 0.766, 0.7735, 0.778, 0.7875, 0.796, 0.807]
Results and Conclusion
The accuracies of the three sampling strategies can now be compared, and it is clear that a combination of different confidence intervals performs better than the others. This shows that, along with learning new information from uncertain paragraphs, the model also needs to retain previously learned information. A balance of data from different confidence intervals therefore helps the model learn, maximizing the resulting overall accuracy.
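As a quick way to visualize this comparison, a matplotlib sketch over the three score lists computed above (not part of the original experiment output) could look like this:

import matplotlib.pyplot as plt

plt.plot(num_of_para, score_uncertain_list, marker='o', label='Most uncertain')
plt.plot(num_of_para, score_certain_list, marker='o', label='Most certain')
plt.plot(num_of_para, score_conf_list, marker='o', label='Confidence-interval mixture')
plt.xlabel('Number of training paragraphs')
plt.ylabel('Test accuracy')
plt.legend()
plt.show()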
We compare DataNeuron‘s Unsupervised learning algorithms to the most popular Language Models in benchmark experiments. This benchmarking is primarily intended to compare models capable of data annotation without any prior domain knowledge or pre-training (Zero-Shot).
Dataset
3 domain-specific datasets were used to compare the algorithms’ accuracy: OSHUMED, CUAD, and EMOTIONS.
GPT 3.5
The most capable GPT-3.5 model, optimized for chat, at one-tenth the cost of text-davinci-003. It supports up to 4,096 tokens, with training data up to September 2021.
We picked this GPT version for the benchmarking since ChatGPT hosts the same model version, which is available to the general public.
1. Setup
Install the OpenAI library in your Python environment using pip install openai. After installation, create an OpenAI API key, then set openai.api_key = ‘key-extracted-from-openai’ to authenticate requests.
2. ChatCompletion
Use the ChatCompletion class to call the “create” function, which sends requests to the model and receives the API responses.
3. Prompt
Please label each of the following 100 paragraphs strictly into one of these 6 classes: 1. sadness 2. anger 3. love 4. surprise 5. fear 6. joy. Separate answers with , for each of the paragraphs [“paragraph1”,”paragraph2”,………,”paragraph100”]
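A minimal sketch of such a request with the legacy (pre-1.0) openai Python client is shown below; the batching of paragraphs, the model name gpt-3.5-turbo, and the parsing of the comma-separated answer are assumptions rather than the exact benchmarking script.

import openai

openai.api_key = 'key-extracted-from-openai'

paragraphs = ["paragraph1", "paragraph2"]  # ... up to 100 paragraphs per request
prompt = ("Please label each of the following paragraphs strictly into one of these 6 classes: "
          "1. sadness 2. anger 3. love 4. surprise 5. fear 6. joy. "
          "Separate answers with , for each of the paragraphs " + str(paragraphs))

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
predicted_labels = response["choices"][0]["message"]["content"].split(",")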
4. Benchmarking
A classification report was generated by comparing the ground truth to the GPT 3.5 output after the paragraphs were classified using the OpenAI library.
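Such a report can be produced with scikit-learn; the sketch below uses illustrative placeholder lists rather than the actual benchmark outputs.

from sklearn.metrics import classification_report

# placeholders: ground-truth labels vs. labels parsed from the GPT 3.5 output
y_true = ["joy", "sadness", "anger", "joy"]
y_pred = ["joy", "sadness", "joy", "joy"]
print(classification_report(y_true, y_pred))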
DataNeuron Platform (Stage 1)
Stage 1 models predict paragraphs based on the user-defined Masterlist/Taxonomy, which is equivalent to providing prompts or the scope of the classification to zero-shot LLMs. Stage 1 consists of proprietary unsupervised models for annotation and DSEAL algorithms for strategic data sampling.
Result
Conclusion
DataNeuron Stage 1 performed better than the pre-trained LLMs BERT BASE and ROBERTA across all 3 datasets. Further, it has accuracies comparable to BART LARGE on the OSHUMED and CUAD datasets while outperforming it significantly on the EMOTIONS dataset. DataNeuron Stage 1 models also outperformed GPT 3.5 in the benchmarking on the CUAD and EMOTIONS datasets.
It is critical to note that the DataNeuron Stage 1 model was not given any sample paragraphs for pre-training, implying that Stage 1 models can automatically annotate with high accuracy without any prior domain knowledge.
Since DataNeuron models are lightweight, they scale much better for large data annotation workflows than LLMs. At the same time, DataNeuron achieves comparable or better accuracies with its proprietary unsupervised models and DSEAL algorithms than pre-trained LLMs, at lower cost and time.
The term Artificial General Intelligence (often abbreviated “AGI”) has no precise definition, but one of the most widely accepted ones is the capacity of an engineered system to display intelligence that is not tied to a highly specific set of tasks; to generalize what it has learned, including to contexts qualitatively very different from those it has seen before; and to take a broad view, interpreting its tasks in the context of the world at large and its relation thereto.
In essence, Artificial General Intelligence can be summarized as the ability of an intelligent agent to learn not just to do a highly specialized task but to use the skills it has learned to extract insight from data originating in multiple contexts or domains.
How does DataNeuron achieve Artificial General Intelligence?
The DataNeuron platform displays Artificial General Intelligence as it has the ability to perform well on:
NLP tasks belonging to multiple domains.
Text data originating from multiple contexts.
Masterlist: Machine learning is not binary, so we don’t rely on rules or predefined functions; we rely on a simpler structure, the Masterlist, where classes are allowed to overlap. Further, we support taxonomies or hierarchical ontologies on the Masterlist. The platform uses intelligent algorithms to assign paragraphs to each class, making the data annotation process automated.
Advanced Masterlist: We are also launching Advanced Masterlist to support subjective labeling of datasets (where clear class distribution is missing).
Apart from the ability to perform auto-annotation on data, the platform also provides complete automation for model training including automatic data processing, feature engineering, model selection, hyperparameter optimization, and cross-validation of results.
The DataNeuron Platform automatically deploys the algorithm and provides APIs which can be integrated to build any application with real-time no-code prediction capabilities. It also provides a continuous feedback and retraining framework for updating the model for achieving the best performance. All these features make it one step closer to achieving Explainable AI.
The DataNeuron platform has produced exceptional results in extremely specialized domains, such as document or text classification for Tax & Legal, Financial, and Life Sciences use cases, as well as general tasks like document or text clustering in any given context. DataNeuron reduces the time and effort required to label data and create models by ~95%, allowing users to extract up to ~99.98% of insights. DataNeuron is an advanced platform for complex data annotation, model training, prediction, and lifecycle management. We have achieved a major breakthrough by fully automating data labeling with accuracy comparable to state-of-the-art solutions using just 2% of labeled data compared to human-in-the-loop labeling on unseen data.
The impact created by DataNeuron’s General Intelligence
We observe that the DataNeuron platform can decrease the annotation time by up to ~98%. This vastly decreases the time and effort spent annotating huge amounts of data and allows teams to focus more on the task at hand by automating the process of data annotation and easing research.
Additionally, it can help reduce the SME effort by up to 96%, while incurring a fraction of the cost. Our platform also significantly reduces the overall cost of the project by nearly eradicating the need for data labeling/annotation teams. In some cases, the need for an SME is also diminished, as the annotation process is much simpler and anyone with knowledge of the domain can do it properly unless the project is highly complex.
Testing On Various Datasets
How DataNeuron performs for various use cases
The difference in the number of paragraphs annotated: we observe it is possible to reduce annotation effort by up to 96%.
The difference in time required: we observe it is possible to reduce the time required for annotation by up to 98%.
The above visualizations showcase the platform’s ability to perform extraordinarily in different domains. As opposed to the specialized systems that tend to perform well on only one type of task or domain, the DataNeuron platform breaks boundaries by performing exceptionally for a diversified set of domains.
What does it mean for the Future of AI?
As AI adoption has picked up among enterprises, the need for labelled and structured data has dramatically increased, and it has become a bottleneck in developing AI solutions.
DataNeuron, powered by a data-centric platform, provides a complete end-to-end solution, from training to Ensemble Model APIs, for faster deployment of AI.
Our research continues to focus on Artificial General Intelligence, further automation of data labeling/validation, and better explainability of AI.
Highlight: DataNeuron’s Automated Learning Platform achieves an F1 score of 82%, which is comparable to the best performing model.
Overview: Because data preparation necessitates the use of expensive expert annotators, implementing AI in the intent classification domain is exceedingly difficult. Automation platforms like DataNeuron can help eliminate this barrier. We used Kaggle’s “Covid-19 Intent Recognition Twitter” dataset to see how DataNeuron handles intent data such as Covid-19 tweets. The entire Covid-19 dataset has 10,558 paragraphs divided into 5 classes that can be utilized for NLP tasks such as intent categorization. This data serves as a baseline against which DataNeuron’s output can be compared. More information on the dataset can be found in the appendix or online.
A subset of https://crisisnlp.qcri.org/covid19 was used in this study. This dataset (which includes MERS (Middle East Respiratory Syndrome) and SARS (Severe Acute Respiratory Syndrome)) assisted us in compiling paragraphs on respiratory syndromes.
Class Description:
The masterlist is made up of classes and their associated keywords, which are intuitively written against their classes and serve as inputs for our platform. We train our model on the following classes: disease_signs_or_symptoms, disease_transmission, deaths_reports, prevention, and treatment.
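As an illustration only, such a masterlist can be thought of as a mapping from each class to its intuitive keywords; the keywords below are hypothetical examples, not the actual masterlist used in this experiment.

# hypothetical masterlist: class -> intuitive keywords (illustrative only)
masterlist = {
    "disease_signs_or_symptoms": ["fever", "cough", "shortness of breath"],
    "disease_transmission": ["spread", "close contact", "airborne"],
    "deaths_reports": ["death toll", "fatalities", "casualties"],
    "prevention": ["mask", "social distancing", "hand washing"],
    "treatment": ["vaccine", "drug", "therapy"],
}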
Background on the DataNeuron Automated Learning Platform:
The Automated Learning Platform (ALP) from DataNeuron is designed to ingest unstructured data like these Covid-19 tweets, build AI models with minimal human validation, and predict labels with high accuracy. The diagram below depicts DataNeuron’s ALP flow.
The platform performs automatic annotation and provides the users with a list of annotated paragraphs. The users simply have to validate whether the annotation is correct or incorrect. Instead of making users go through the entire dataset to label paragraphs, DataNeuron offers this validation-based approach which reduces the time taken to annotate each paragraph. Based on our estimate, it takes approximately 30 seconds for a user to identify whether a paragraph belongs to a particular class.
The Covid-19 data was fed into DataNeuron’s ALP flow, where a combination of machine learning models was built based on the input. In the first stage, these models were able to identify irrelevant paragraphs out of the 10558 raw paragraphs. The remaining relevant paragraphs were strategically annotated with one of the 5 target classes. User validation was required on 477 annotated paragraphs to achieve remarkable accuracy.
Simplifying SME Task:
With this same data, SMEs for human-in-loop labeling would have to go through each paragraph in the entire dataset to label with a target class. This exercise would take a tremendous amount of time and effort. DataNeuron’s recognize vs recall approach simplifies the validator’s task to a large extent.
Manual Effort Reduction:
Conservatively assuming 45 seconds are needed to manually annotate each paragraph, it would take 132 hours to manually annotate the complete dataset. Assuming it takes 30 seconds to validate one paragraph on DataNeuron, 477 paragraphs will take 4 hours for complete validation. This calculates to a 97% reduction in human effort required.
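The 97% figure follows directly from those assumptions, as this small check shows:

manual_hours = 10558 * 45 / 3600       # ~132 hours to annotate the full dataset manually
dataneuron_hours = 477 * 30 / 3600     # ~4 hours to validate 477 paragraphs
reduction = 1 - dataneuron_hours / manual_hours
print(round(manual_hours), round(dataneuron_hours), round(reduction * 100))  # 132 4 97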
Accuracy close to the best classifier model:
In another experiment, the SVM model was used to classify the paragraphs in this dataset. An overall 86.0% Precision at 85% Recall (85% F-1 score) was achieved. With 477 paragraphs for manual validation, which is just 4.51% of the raw data, DataNeuron was able to achieve a comparable F-1 score of 82%.
Calculating the Cost ROI:
The Annotation cost for the in-house data team is $1320 and for the Data-Science team, it is around $1405. Therefore, the total cost is $2725. The cost for annotating 477 paragraphs using the DataNeuron ALP is around $607.3. In this case, the reduction in cost is 77.71%, which is very significant with a cost ROI of 348.81%.
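Using the same figures, the quoted reduction and ROI can be checked as follows (the ROI works out to roughly 348.7%):

total_manual_cost = 1320 + 1405        # in-house data team + data-science team = $2725
dataneuron_cost = 607.3                # cost of validating 477 paragraphs on the ALP
cost_reduction = (total_manual_cost - dataneuron_cost) / total_manual_cost * 100
cost_roi = (total_manual_cost - dataneuron_cost) / dataneuron_cost * 100
print(round(cost_reduction, 2), round(cost_roi, 2))  # 77.71 348.71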
No requirement for a Data Science/Machine Learning Expert:
The DataNeuron ALP by design eliminates prerequisite knowledge of data science or machine learning to utilize the platform to its maximum potential. The only human intervention is to validate the output of the labeled data.
Conclusion:
From the above explanation, it is evident that automating data labeling using DataNeuron provides comparable accuracy with reduced human efforts and cost.
DataNeuron’s Active Learning approach to NLP data labeling outperforms the competitors’ Sequential Learning approach by 20% with 50% less data.
Active learning closely resembles conventional supervised learning. Given that models are trained utilizing both labeled and unlabeled data, it is a form of semi-supervised learning. The concept behind semi-supervised learning is that labeling a small sample of data may produce results that are as accurate as, or even more accurate than, fully labeled training data. Finding that sample is the only difficult part. In active learning, data is incrementally and dynamically labeled during training so that the algorithm can determine which labels would be most helpful for it to learn from.
Experiment
Let’s compare the two classifiers, Sequential and Active Learning, through this easy experiment.
We’ll be using the Contract Understanding Atticus Dataset (CUAD) for the purpose of this experiment. The dataset contains 4414 paragraphs. We have randomly selected 10 types of legal clause classes for the text classification task.
We’ll start by loading the data. The features and labels are then extracted from the data, and a train-test split is made, allowing the test split to be used to assess how well the model trained using the train split performed.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

#train test split
df_train, df_test = train_test_split(df_total, test_size=0.2, random_state=0)

# The resulting matrices will have the shape of (`nr of examples`, `nr of word n-grams`)
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_train = vectorizer.fit_transform(df_train.para)
X_test = vectorizer.transform(df_test.para)

X_train.shape
[Out] : (4082, 1802587)
X_test.shape
[Out] : (1021, 1802587)
With Sequential Learning
The Sequential Learning classifier will be trained only twice: once with 100 paragraphs and the next time with 1000 paragraphs. For the next 1000 paragraphs, the total number of correct and wrong predictions is calculated.
from sklearn.naive_bayes import MultinomialNB

#Defining the classifier
classifier_seq = MultinomialNB()
#preprocessing of training set
x_100_train = vectorizer.transform(df_train[:100]['para'])
y_100_train = df_train[:100]['label']
#preprocessing of testing set
x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
y_1000_test = df_train[100:1100]['label']
#training the model
classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
#calculating the accuracy on the test set
classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
[Out] : 0.681
Code for calculating the accuracy of the suggestions for every next 10 paragraphs, sequentially:
#the number of paragraphs in validation set for each iteration
n_instances = 10
#first 100 paragraphs
n_start = 100
n_para = 100

#defining the classifier
classifier_seq = MultinomialNB()

#preprocessing of training set
x_100_train = vectorizer.transform(df_train[:100]['para'])
y_100_train = df_train[:100]['label']

#preprocessing of testing set
x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
y_1000_test = df_train[100:1100]['label']

#training the classifier
classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())

#calculating the accuracy on the testing set
acc_1000 = classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())

accuracies_seq = []
seq_correct = 0
seq_wrong = 0
y_test_list = []

while n_para <= 1100:
    x_test = vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
    y_test = df_train[n_start:n_start+n_instances]['label']
    y_test_list.append(y_test.to_list())
    y_pred = classifier_seq.predict(X=x_test)
    y_pred = list(y_pred)
    #calculation of total number of correct and wrong prediction for the next 1000 paras
    if len(y_test_list) <= 100:
        for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
            if ele_pred == ele_test:
                seq_correct += 1
            else:
                seq_wrong += 1
    score = classifier_seq.score(X=x_test, y=y_test.to_list())
    accuracies_seq.append(score)
    n_start = n_start + n_instances
    n_para += n_instances

#total number of correctly classified and incorrectly classified samples
seq_correct, seq_wrong
[Out] : (681, 319)
With Active Learning
The Active Learning classifier will be trained iteratively.
After training on the first 100 paragraphs, the accuracy on the next 10 paragraphs is measured, and we take note of the correct and incorrect predictions on this unseen data. These 10 paragraphs are then added to the training set. This repeats until all 1000 paragraphs have been used in training. As with the sequential learning classifier, the total number of correct and wrong predictions over the next 1000 paragraphs is checked.
#the number of paragraphs in validation set for each iteration
n_instances = 10
#first 100 paragraphs
n_start = 100

#defining the classifier
classifier = MultinomialNB()

#preprocessing of training set
x_train = vectorizer.transform(df_train[:100]['para'])
y_train = df_train[:100]['label']

#training the classifier
classifier.fit(X=x_train, y=y_train.to_list())

accuracies = []
test_accuracies = []
correct = 0
wrong = 0
n_para = 100
y_test_list = []

while x_train.shape[0] <= 1100:
    #calculating test accuracy for each iteration
    test_score = classifier.score(X=X_test, y=df_test.label.to_list())
    test_accuracies.append(test_score)
    #calculating validation accuracy for every next 10 paras
    x_test = vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
    y_test = df_train[n_start:n_start+n_instances]['label']
    y_test_list.append(y_test.to_list())
    y_pred = classifier.predict(X=x_test)
    y_pred = list(y_pred)
    #calculation of total number of correct and wrong prediction for the next 1000 paras
    if len(y_test_list) <= 100:
        for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
            if ele_pred == ele_test:
                correct += 1
            else:
                wrong += 1
    score = classifier.score(X=x_test, y=y_test.to_list())
    accuracies.append(score)
    #sequentially increasing the number of para
    x_train = vectorizer.transform(df_train[:n_start+n_instances]['para'])
    y_train = df_train[:n_start+n_instances]['label']
    classifier = MultinomialNB()
    classifier.fit(X=x_train, y=y_train.to_list())
    n_start = n_start + n_instances
    n_para += n_instances

#total number of correctly classified and incorrectly classified samples
correct, wrong
[Out] : (844, 156)
Results
Let’s take a look at the results we obtained from the experiments:
The first graph indicates that the accuracy of the Active Learning classifier increases as more paragraphs are used for training, whereas the accuracy of the Sequential Learning classifier does not improve.
The second graph shows the accuracy of the two classifiers on each batch of 10 unseen paragraphs added after the initial 100 training paragraphs.
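The graphs themselves are not embedded here; a matplotlib sketch of how they could be reproduced from the quantities computed above (an assumption about which arrays feed each plot) is:

import matplotlib.pyplot as plt

paras = range(100, 100 + 10 * len(accuracies), 10)

# Graph 1: held-out test accuracy as the active learner sees more paragraphs,
# against the single sequential classifier trained on the first 100 paragraphs
plt.plot(range(100, 100 + 10 * len(test_accuracies), 10), test_accuracies, label='Active learning')
plt.axhline(acc_1000, linestyle='--', label='Sequential learning')
plt.xlabel('Number of training paragraphs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Graph 2: accuracy on every next batch of 10 unseen paragraphs
plt.plot(paras, accuracies, label='Active learning')
plt.plot(paras, accuracies_seq, label='Sequential learning')
plt.xlabel('Number of training paragraphs')
plt.ylabel('Accuracy on next 10 paragraphs')
plt.legend()
plt.show()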
Conclusion
The results show that the Active Learning classifier outperforms the Sequential Learning classifier. This is because Active Learning learns through intermediate validations, whereas the Sequential Learning classifier learns only at 100 and 1100 paragraphs and does not learn anything in between. This is how the Active Learning classifier achieves higher accuracy while using less labeled data.