How does DataNeuron compare to the rest?

Abstract

As the volume of data has exploded in recent years, collecting accurate and extensive annotated data has become one of the biggest hurdles in the field of data science and machine learning [1]. Various pre-existing methods try to tackle this scarcity; however, they can still be costly and require subject matter experts (SMEs). This paper proposes a novel active learning framework called Divisive Sampling and Ensemble Active Learning (DSEAL), which enables users to create large, high-quality, annotated datasets with very few manual validations, thus substantially reducing annotation cost and effort. We further provide a comparative study of the performance of DSEAL against existing methods, including Manual Annotation and Weak Supervision, using a multiclass dataset. The experimental results show that DSEAL performs better than the other studied solutions.

1 Introduction

Modern machine learning models have attained new state-of-the-art accuracies on a variety of traditionally difficult problems in recent years. The performance of these models is directly dependent on the quality and quantity of labeled training and testing datasets. With the advancement in the automation and commoditization of many ML models [2, 3, 4], researchers and organizations have shifted their focus to creating and acquiring high-quality labeled datasets for various business purposes [3, 4].

However, the cost of labeling data quickly becomes a significant operational expense [5, 6]: collecting labels at scale requires carefully developing labeling instructions that cover a wide range of edge cases; training subject matter experts to carry out those instructions; waiting sometimes months or longer for the full results; and dealing with the rapid depreciation of datasets as applications shift and evolve.

To address the problems above, we propose a novel solution that can quickly label datasets using methods such as divisive sampling and ensemble active learning.

This paper is organized as follows: Section 2 discusses the techniques related to this work, including Manual Annotation and Weak Supervision. Section 3 describes our proposed framework and methodology. Section 4 describes the experimental procedure, and Section 5 presents the results of our performance evaluation. Finally, Section 6 presents our conclusion.

2 Related Work

2.1 Manual Annotation

Manual Annotation is a human-supervised annotation technique where each example is labeled manually by SMEs. This requires a large number of SMEs to be hired specifically for labeling data, making it a costly solution. Further, the model may learn misleading patterns due to incorrect annotations arising from human errors and biases. Manual Annotation is also more prone to data privacy breaches, as every data point is directly exposed to humans.

2.2 Weak Supervision

Weak Supervision is a part of the machine learning process where higher-level and often noisier sources of supervision are used to programmatically generate a much larger set of labeled data that can be used in supervised learning [7]. The key idea in weak supervision is to train machine learning models utilizing information sources that are more accessible than hand-labeled data, even when that information is missing, inaccurate, or otherwise less reliable. Because the labels are noisy, they are referred to as "weak": the data measurements they represent carry a margin of error, which can have a significant negative impact on model performance, as substantiated by a text classification experiment conducted by Zhang and Yang [12]. In [13], it is demonstrated that more samples are required for Probably Approximately Correct (PAC) identification of labels when uniform label noise is present in the framework.

For weak supervision, the user must define multiple weak-learner labeling functions. Labeling functions are code snippets that map a data point to an estimated label. It is hard to get decent results if a user does not have appropriate labeling-function heuristics or coding knowledge.
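For illustration, a minimal sketch of what keyword-based labeling functions and a simple majority vote could look like is shown below. The class names and keywords are hypothetical, and real weak-supervision tools such as Snorkel [10] combine the noisy votes with a learned label model rather than a plain majority vote.

```python
# Illustrative sketch of keyword-heuristic labeling functions with a majority
# vote. Class names and keywords are assumptions, not any tool's actual API.
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote on a paragraph

def lf_governing_law(text: str):
    return "Governing Law" if "governed by the laws of" in text.lower() else ABSTAIN

def lf_insurance(text: str):
    return "Insurance" if "insurance" in text.lower() else ABSTAIN

def lf_renewal(text: str):
    return "Renewal Term" if "renew" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_governing_law, lf_insurance, lf_renewal]

def weak_label(text: str):
    """Apply all labeling functions and take a majority vote over non-abstains."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("This Agreement shall be governed by the laws of the State of New York."))
```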

3 DataNeuron Methodology – DSEAL

DSEAL stands for Divisive Sampling and Ensemble Active Learning. The cost of computation in semi-supervised learning is directly correlated with the volume of unlabeled data. Data sizes can often be very large, and it is not practical to use all of the unlabeled data. With DSEAL we can obtain high model performance even with a small portion of the data. Our method of divisive sampling helps in strategically selecting only relevant data points. As the model learns from each verified data point, it suggests increasingly critical data points for validation, so the number of required validations decreases.

3.1 Divisive Sampling

DSEAL first trains machine learning algorithms on validated data and performs predictions on unlabeled data. Then a subset of paragraphs is selectively sampled from the unlabeled data. The sampling strategy is to target the most valuable data points. To do so, we sample paragraphs from multiple classifiers. Multiple classifiers learn from each other's mistakes and hence reduce the probability of error.

Further, we divide every classifier's predictions into multiple groups. This is crucial because the model should perform well on any data seen in the real world, not just on a particular sample. In our method, selections are made by dividing the data into decile groups on the basis of similarity scores. The probability of selection depends on which decile a paragraph resides in. Based on this, we perform random sampling within each decile.

Data with low confidence, closer to the decision boundary, has a higher probability of selection, and data with high confidence has a lower probability of selection. After sampling, validation is done by the user in a continuous manner. Each batch of validations is used to improve the accuracy of the filtering algorithm for the next batch. The human annotator only needs to validate a small number of the discriminative samples, which significantly minimizes the effort required for data labeling.
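A minimal sketch of this decile-based sampling step is shown below, under the assumption that selection weights decrease linearly from the least to the most confident decile; the actual weighting used by DSEAL is not specified in this paper.

```python
# Illustrative sketch of decile-based divisive sampling (not DataNeuron's
# actual implementation): unlabeled paragraphs are bucketed into deciles by
# model confidence, and low-confidence deciles receive more sampling budget.
import numpy as np

rng = np.random.default_rng(0)

def divisive_sample(confidence_scores, n_samples):
    """Return indices of paragraphs selected for human validation."""
    scores = np.asarray(confidence_scores)
    # Decile index 0..9: 0 = least confident, 9 = most confident.
    ranks = np.argsort(np.argsort(scores))
    deciles = np.clip((ranks * 10) // len(scores), 0, 9)
    # Assumed weighting: deciles closer to the decision boundary are favored.
    weights = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1], dtype=float)
    probs = weights[deciles]
    probs /= probs.sum()
    return rng.choice(len(scores), size=n_samples, replace=False, p=probs)

scores = rng.uniform(size=1000)            # stand-in for model confidence scores
to_validate = divisive_sample(scores, 50)  # 50 paragraphs sent for validation
```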

3.2 Ensemble Active Learning

The DSEAL method utilizes an ensemble active learning method consisting of semi-supervised and supervised machine learning algorithms. After every small batch of validations by the user, all the model parameters are adjusted to incorporate the user's feedback. The recalibrated algorithms calculate a prediction score from the supervised model and a similarity score from the semi-supervised method for every unlabeled data point. In each batch, the models are additionally tested against user validations. If a model consistently underperforms in successive batches, it is replaced by a new model trained from scratch using all annotated data available so far. Further, a confidence score is assigned to every model in each batch, which is used to obtain the final label for a new data point.
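As a hedged illustration (not DataNeuron's actual implementation), combining a supervised prediction score and a semi-supervised similarity score with per-model confidence weights could look roughly like the sketch below; the equal blend between the two score types and all names are assumptions.

```python
# Sketch of an ensemble label for one unlabeled paragraph, formed from
# supervised class probabilities and semi-supervised similarity scores,
# weighted by per-model confidence from recent validation batches.
import numpy as np

def ensemble_label(pred_proba, sim_scores, model_confidences):
    """
    pred_proba:        (n_models, n_classes) probabilities from supervised models
    sim_scores:        (n_classes,) similarity scores from the semi-supervised method
    model_confidences: (n_models,) confidence of each model on validation batches
    """
    weights = np.asarray(model_confidences, dtype=float)
    weights /= weights.sum()
    # Confidence-weighted average of the supervised predictions.
    supervised = np.average(np.asarray(pred_proba), axis=0, weights=weights)
    # Blend with normalized semi-supervised similarity scores (assumed 50/50 mix).
    sims = np.asarray(sim_scores, dtype=float)
    sims /= sims.sum()
    combined = 0.5 * supervised + 0.5 * sims
    return int(np.argmax(combined)), float(np.max(combined))

label, score = ensemble_label(
    pred_proba=[[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    sim_scores=[0.5, 0.4, 0.1],
    model_confidences=[0.9, 0.6],
)
```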

The annotations can be applied to all unlabeled data at once after the algorithmic annotations demonstrate good performance against user validations. If the model does not achieve the desired results after training, the user can provide more training paragraphs or alter the project structure to remove some classes and then retrain the model. This iterative way of building models by making minor changes and progressively improving performance helps DSEAL achieve better results with less effort.

4 Experimental Procedure

4.1 The Dataset

For the purpose of the experiment, we have used the Contract Understanding Atticus Dataset (CUAD) [9]. The dataset contains a corpus of commercial legal contracts that have been manually labeled under the supervision of experienced lawyers. We have randomly selected 10 types of legal clause classes for the text classification task. The dataset contains 4414 paragraphs.

An additional 1000 noisy paragraphs were added to the dataset to make it more robust. The selected 10 classes are: Governing Law, Anti-assignment, Cap on Liability, Renewal Term, Insurance, IP Ownership Assignment, Change of Control, Volume Restriction, No-Solicit of Employees, Third Party Beneficiary. The class distribution is as follows:

4.2 Preparing the Train and Test Set

In the original dataset, paragraphs belonging to the same class are grouped together. The dataset was therefore shuffled so that a random sample would contain paragraphs from all classes. The whole dataset was then split into training and testing sets: 80% for training and the remaining 20% for testing.
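A minimal sketch of this shuffled 80/20 split using scikit-learn is shown below; the placeholder paragraphs, labels, and random seed are illustrative assumptions, as the paper does not specify them.

```python
# Shuffled 80/20 train/test split, as described above (placeholder data).
from sklearn.model_selection import train_test_split

paragraphs = ["para one ...", "para two ...", "para three ...", "para four ...", "para five ..."]
labels = ["Insurance", "Governing Law", "Insurance", "Renewal Term", "Governing Law"]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    paragraphs, labels, test_size=0.20, shuffle=True, random_state=42
)
```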

4.3 Annotating the Paragraphs

In this step, the paragraphs were annotated using three methods: Weak Supervision, Manual Labeling, and DSEAL (an active learning approach based on uncertainty measures). We used Snorkel (OSS) and Label Studio (Community) as the representative tools for Weak Supervision and Manual Labeling, respectively; DataNeuron is the representative of DSEAL. Paragraphs were annotated using the three platforms separately.

At the end of the annotation process, the same number of paragraphs annotated by each of the platforms were exported for training, to ensure a fair comparison.

4.4 Model Building

Here we have used a common ML pipeline with the default parameter settings of the scikit-learn library to build the model for each of the annotated datasets. The same preprocessing and feature transformation have been applied to the text data using the TF-IDF vectorizer provided by scikit-learn with default parameter settings.

We have trained three popular ML classifiers, namely Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF). For each of the three annotated datasets, we train the classifiers on the first 400, 500, 600, 700, and 800 paragraphs, along with their corresponding labels.
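A sketch of this setup is shown below: a TF-IDF plus classifier pipeline with scikit-learn defaults, trained on the first n annotated paragraphs for each value of n. The placeholder corpus and variable names are assumptions; in the experiment, the exported annotations from each platform take their place.

```python
# Sketch of the described training setup with scikit-learn defaults.
from sklearn.base import clone
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Placeholder annotated corpus standing in for one platform's exported labels.
train_texts = [f"sample paragraph {i}" for i in range(800)]
train_labels = [f"class_{i % 10}" for i in range(800)]

classifiers = {
    "LR": LogisticRegression(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
}

models = {}
for n in (400, 500, 600, 700, 800):
    for name, clf in classifiers.items():
        # clone() gives each (classifier, n) pair its own fresh estimator.
        pipeline = make_pipeline(TfidfVectorizer(), clone(clf))
        pipeline.fit(train_texts[:n], train_labels[:n])
        models[(name, n)] = pipeline
```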

4.5 Model Evaluation

For all three annotated datasets and each training set of paragraphs, the same test set was used to evaluate the model. We have calculated the Macro F1 score as the performance metric for comparing the cases.
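For reference, the Macro F1 score averages the per-class F1 scores with equal weight, which suits imbalanced clause classes; a minimal example with toy labels:

```python
# Macro F1 treats every class equally regardless of its frequency.
from sklearn.metrics import f1_score

y_true = ["Insurance", "Governing Law", "Insurance", "Renewal Term"]
y_pred = ["Insurance", "Governing Law", "Renewal Term", "Renewal Term"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 3))
```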

5 Results and Discussions

We plotted the Macro F1 score against the number of paragraphs used for training the model for each of the three annotated datasets. Fig. 4 provides a fair visual comparison.

For all three cases, the Macro F1 score increases as the number of paragraphs in the training set increases. The overall performance of DSEAL far exceeds that of Weak Supervision and Manual Annotation.

DSEAL utilizes techniques such as active learning, strategic sampling, and iterative annotation, which are not fully used in Weak Supervision or Manual Annotation [10, 11]. Moreover, the rule-based annotation applied in Weak Supervision is not used in DSEAL.

We computed the time required for each approach according to the number of paragraphs needed for annotation. With Manual Annotation and Weak Supervision, 4414 and 1376 paragraphs were annotated, taking roughly 66 hours 12 minutes and 20 hours 38 minutes, respectively. Meanwhile, DSEAL annotated 1137 paragraphs in just 17 hours. We also observe that DSEAL requires only 600 paragraphs, taking 9 hours, to reach the benchmark accuracy that is otherwise obtained with 1300 paragraphs in 19 hours 30 minutes in our experiments.
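As a sanity check on these figures, all three approaches work out to roughly the same per-paragraph validation rate (about 0.9 minutes per paragraph); the arithmetic can be reproduced as follows:

```python
# Back-of-the-envelope check on the reported annotation times.
reported = {
    "Manual Annotation": (4414, 66 * 60 + 12),   # paragraphs, minutes
    "Weak Supervision":  (1376, 20 * 60 + 38),
    "DSEAL":             (1137, 17 * 60),
}
for approach, (n_paragraphs, minutes) in reported.items():
    print(f"{approach}: {minutes / n_paragraphs:.2f} min per paragraph")
```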

6 Conclusion

In this paper we have proposed an efficient approach for strategic data annotation using divisive sampling and ensemble active learning. We have conducted experiments to comparatively analyze DSEAL against Weak Supervision and Manual Annotation. We find that the accuracy of DSEAL is 8.85% and 16.68% higher than Manual Annotation and Weak Supervision, respectively. There is also a reduction in effort of 74.32% compared to Manual Annotation and 17.63% compared to Weak Supervision. The projected numbers also show a 53.84% reduction in effort compared to fully annotated data for achieving similar accuracy.

These results reveal that DSEAL significantly outperforms Weak Supervision and Manual Annotation. Unlike Manual Annotation and Weak Supervision, DSEAL annotates only strategically selected paragraphs, which results in better performance.

7 References

[1] Clustering of non-annotated data url: link

[2] Nithya Sambasivan et al. “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021, pp. 1-15.

[3] David Stodder. “Improving data preparation for business analytics”. In: Transforming Data With Intelligence 1.1 (2016), p. 41.

[4] Global survey: The state of AI in 2020. url: link

[5] Pinar Donmez and Jaime G Carbonell. “Proactive learning: cost-sensitive active learning with multiple imperfect oracles”. In: Proceedings of the 17th ACM conference on Information and knowledge management. 2008, pp. 619-628.

[6] Google AI Platform Data Labeling Service pricing. url: https://cloud.google.com/ai-platform/data-labeling/pricing

[7] Weak Supervision: A New Programming Paradigm for Machine Learning. url: link

[8] A Tutorial on Active Learning. url: http://videolectures.net/icml09_dasgupta_langford_actl/

[9] Dataset: CUAD (atticus-open-contract-dataset-aok-beta). url: link

[10] Snorkel. url: https://snorkel.ai/

[11] Label Studio. url: https://labelstud.io/

[12] J. Zhang and Y. Yang, “Robustness of regularized linear classification methods in text categorization,” in Proc. 26th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Toronto, ON, Canada, Jul./Aug. 2003, pp. 190-197.

[13] D. Angluin and P. Laird, “Learning from noisy examples,” Mach. Learn., vol. 2, no. 4, pp. 343-370, 1988.
