Blog

  • How well does the DataNeuron ALP handle the Finance Use Case?

    [Table: overview of the dataset used in this case study]

    Explaining the DataNeuron Pipeline

    The DataNeuron Pipeline consists of seven stages: Ingest, Structure, Validate, Train, Predict, Deploy, and Iterate.

    Results of our Experiment

    Reduction in SME Labeling Effort

    In an in-house project, SMEs have to read every paragraph in the dataset to determine which ones actually belong to the 8 classes mentioned above, which takes a tremendous amount of time and effort.

    With the DataNeuron ALP, the algorithm performed strategic annotation on roughly 15000 raw paragraphs, filtered out the paragraphs that did not belong to the 8 classes, and provided 659 paragraphs to the user for validation.

    At roughly 45 seconds per paragraph, an in-house project would need an estimated 188 hours just to annotate all the paragraphs.

    Difference in paragraphs annotated between an in-house solution and DataNeuron.

    Advantage of Suggestion-Based Annotation

    Instead of making users go through the entire dataset to label paragraphs belonging to a certain class, DataNeuron uses a validation-based approach that makes the model training process considerably easier.

    Using context-based filtering and analysis of the masterlist, the platform provides users with a list of annotated/labeled paragraphs that are most likely to belong to a given class. Users simply validate whether each system-labeled paragraph belongs to the class suggested.

    This validation-based approach also reduces the time it takes to annotate each paragraph: we estimate it takes approximately 30 seconds for a user to decide whether a paragraph belongs to a particular class.

    At that rate, it would take an estimated 6 hours for users to validate the 659 paragraphs provided by the DataNeuron ALP. Compared to the 188 hours it would take an in-house team to complete the annotation process, DataNeuron offers a staggering 96.8% reduction in time spent.
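    The arithmetic behind these estimates is easy to reproduce. A quick sketch, using the 15067-paragraph count from the cost section below and the per-paragraph timings above:

    import math

    in_house_hours = round(15067 * 45 / 3600)      #~188 h at 45 s per paragraph
    validation_hours = math.ceil(659 * 30 / 3600)  #~6 h at 30 s per paragraph (5.5 h unrounded)
    reduction = 1 - validation_hours / in_house_hours
    print(f"{reduction:.1%}")                      #-> 96.8%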

    Difference in time spent annotating between an in-house solution and DataNeuron.

    The Accuracy Tradeoff

    In this case study, the model trained by the DataNeuron ALP achieved an accuracy of 93.9%, while the model trained in the in-house project achieved 98.2%.

    The savings in annotation time can offset this small difference in accuracy, and the accuracy of the ALP-trained model can be increased further by validating more paragraphs.

    Difference in accuracy between an in-house solution and DataNeuron.

    Calculating the Cost ROI

    An in-house project would need 15067 paragraphs annotated, at an approximate cost of $3288.

    With the DataNeuron ALP, only 659 paragraphs need to be annotated, since most of the paragraphs that did not belong to any of the 8 classes were discarded by context-based filtering. Annotating those 659 paragraphs using the DataNeuron ALP costs $575.

    The reduction in cost is a significant 82.5% and the cost ROI is an estimated 471.82%.
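    Both figures follow directly from the two cost estimates above; for reference:

    in_house_cost = 3288    #est. cost of annotating 15067 paragraphs in-house ($)
    dataneuron_cost = 575   #est. cost of validating 659 paragraphs on the ALP ($)

    savings = in_house_cost - dataneuron_cost    #$2713
    cost_reduction = savings / in_house_cost     #0.825 -> 82.5%
    cost_roi = savings / dataneuron_cost         #4.718 -> ~471.8%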

    Difference in cost between an in-house solution and DataNeuron.

    No Requirement for a Data Science/Machine Learning Expert

    The DataNeuron ALP is designed in such a way that no prerequisite knowledge of data science or machine learning is required to utilize the platform to its maximum potential.

    For some very specific use cases a Subject Matter Expert might be required, but for the majority of use cases an SME is not required in the DataNeuron Pipeline.

  • Artificial General Intelligence: DataNeuron is Redefining Data Labeling Across Domains

    The term Artificial General Intelligence (often abbreviated “AGI”) has no precise definition, but one of the most widely accepted ones describes the capacity of an engineered system to display intelligence that is not tied to a highly specific set of tasks; to generalize what it has learned, including to contexts qualitatively very different from those it has seen before; and to take a broad view, interpreting its tasks in the context of the world at large and its relation thereto.

    In essence, Artificial General Intelligence can be summarized as the ability of an intelligent agent to learn not just to do a highly specialized task but to use the skills it has learned to extract insight from data originating in multiple contexts or domains.

    How does DataNeuron achieve Artificial General Intelligence?

    The DataNeuron platform displays Artificial General Intelligence as it has the ability to perform well on:

    • NLP tasks belonging to multiple domains.
    • Text data originating from multiple contexts.

    Masterlist: Machine learning is not binary, so we don’t rely on rules or predefined functions; we rely on a simpler structure, the Masterlist, in which classes are allowed to overlap. The Masterlist also supports taxonomies and hierarchical ontologies. The platform uses intelligent algorithms to assign paragraphs to each class, automating the data annotation process.
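    As a rough illustration (a hypothetical sketch, not DataNeuron’s actual schema), a Masterlist can be pictured as a hierarchy of classes seeded with keywords, where the same term may appear under more than one class:

    #hypothetical Masterlist sketch; class names and keywords are illustrative
    masterlist = {
        "Finance": {
            "Revenue":  ["net sales", "turnover", "top line"],
            "Expenses": ["operating costs", "cost of goods sold"],
        },
        "Legal": {
            "Termination": ["terminate this agreement", "notice period"],
            #overlap is allowed: "notice period" also seeds Employment
            "Employment":  ["notice period", "severance", "employee benefits"],
        },
    }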

    Advanced Masterlist: We are also launching the Advanced Masterlist to support subjective labeling of datasets (where a clear class distribution is missing).

    Apart from the ability to perform auto-annotation on data, the platform also provides complete automation for model training including automatic data processing, feature engineering, model selection, hyperparameter optimization, and cross-validation of results.
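    Conceptually, that level of automation amounts to a search over preprocessing and model choices with cross-validated scoring. A minimal scikit-learn sketch of the idea (not the platform’s actual implementation):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    #toy stand-in for automated feature engineering, model selection,
    #hyperparameter optimization, and cross-validation
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5)
    #search.fit(texts, labels)  #texts/labels: your annotated paragraphs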

    The DataNeuron Platform automatically deploys the algorithm and provides APIs that can be integrated to build any application with real-time, no-code prediction capabilities. It also provides a continuous feedback and retraining framework for updating the model to achieve the best performance. All of these features bring it one step closer to Explainable AI.

    The DataNeuron platform has produced exceptional results in extremely specialized domains, like document or text classification in Tax & Legal, Financial, and Life Sciences use cases, as well as in general tasks like document or text clustering in any given context. DataNeuron reduces the time and effort required to label data and create models by ~95%, allowing users to extract up to ~99.98% of insights. It is an advanced platform for complex data annotation, model training, prediction, and lifecycle management. We have achieved a major breakthrough by fully automating data labeling, reaching accuracy comparable to state-of-the-art human-in-the-loop solutions on unseen data with just 2% of the labeled data.

    The impact created by DataNeuron’s General Intelligence

    We observe that the DataNeuron platform can decrease annotation time by up to ~98%. This vastly reduces the time and effort spent annotating huge amounts of data, and by automating the annotation process it allows teams to focus on the task at hand and eases research.

    Additionally, it can reduce SME effort by up to 96% while incurring a fraction of the cost. Our platform also significantly reduces the overall cost of a project by nearly eliminating the need for dedicated data labeling/annotation teams. In some cases the need for an SME is also diminished, since the annotation process is simple enough that anyone with knowledge of the domain can do it properly, unless the project is especially complex.

    Results Visualized

    The above visualizations showcase the platform’s ability to perform extraordinarily well across different domains. As opposed to specialized systems that tend to perform well on only one type of task or domain, the DataNeuron platform breaks boundaries by performing exceptionally across a diversified set of domains.

    What does it mean for the Future of AI?

    As AI adoption has picked up among enterprises, the need for labeled and structured data has dramatically increased, and data preparation has become the bottleneck in developing AI solutions.

    DataNeuron’s data-centric platform provides a complete end-to-end workflow, from training to Ensemble Model APIs, for faster deployment of AI.

    Our research continues to focus on Artificial General Intelligence, further automation of data labeling and validation, and better explainability of AI.

  • DataNeuron’s Active Learning to NLP data labeling outperforms the competitors’ Sequential Learning approach by 20% with 50% less data

    Active learning closely resembles conventional supervised learning, but because models are trained utilizing both labeled and unlabeled data, it is a form of semi-supervised learning. The concept behind semi-supervised learning is that labeling only a small, well-chosen sample of data may produce results as accurate as, or even more accurate than, fully labeled training data; finding that sample is the only difficult part. In active learning, data is labeled incrementally and dynamically during training, so that the algorithm can determine which label would be most helpful for it to learn from next.
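    A common way to pick that “most helpful” next batch is uncertainty sampling: query the examples the current model is least confident about. A minimal sketch of the idea (the experiment below uses a simpler fixed batch order instead):

    import numpy as np

    def least_confident(classifier, X_unlabeled, n_instances=10):
        """Return indices of the n least-confident unlabeled examples."""
        probs = classifier.predict_proba(X_unlabeled)
        confidence = probs.max(axis=1)           #top-class probability per row
        return np.argsort(confidence)[:n_instances]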

    Experiment

    Let’s compare the two classifiers, Sequential and Active Learning, through a simple experiment.

    We’ll be using the Contract Understanding Atticus Dataset (CUAD) for this experiment. The dataset contains 4414 paragraphs; we randomly selected 10 legal clause classes for the text classification task.

    Download the dataset: https://www.atticusprojectai.org/cuad

    Loading data & preprocessing

    We’ll start by loading the data. The features and labels are then extracted, and a train-test split is made so that the test split can be used to assess the model trained on the train split.
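    Assuming the CUAD clauses have been flattened into a two-column table of paragraphs and labels (the file name below is hypothetical; adjust it to your copy of the dataset):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer

    #hypothetical flattened export of CUAD: one paragraph and one clause label per row
    df_total = pd.read_csv("cuad_clauses.csv")   #columns: para, label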

    #train/test split of the loaded data
    df_train, df_test = train_test_split(df_total, test_size=0.2, random_state=0)

    #the resulting matrices have the shape (n_examples, n_word_ngrams)
    vectorizer = CountVectorizer(ngram_range=(1, 5))

    X_train = vectorizer.fit_transform(df_train.para)
    X_test = vectorizer.transform(df_test.para)

    X_train.shape
    [Out] : (4082, 1802587)

    X_test.shape
    [Out] : (1021, 1802587)

    With Sequential Learning

    The Sequential Learning classifier will be trained only twice: once with 100 paragraphs and again with 1000 paragraphs. For the next 1000 paragraphs, the total number of correct and wrong predictions is calculated.

    from sklearn.naive_bayes import MultinomialNB

    #defining the classifier
    classifier_seq = MultinomialNB()
    
    #preprocessing of training set
    x_100_train = vectorizer.transform(df_train[:100]['para'])
    y_100_train = df_train[:100]['label']
    
    #preprocessing of testing set
    x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
    y_1000_test = df_train[100:1100]['label']
    
    #training the model
    classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
    
    #calculating the accuracy on the test set
    classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
    [Out] : 0.681
    
    

    Code for calculating the suggestion accuracy on every successive batch of 10 paragraphs, sequentially:

    #the number of paragraphs in validation set for each iteration
    n_instances = 10
    
    #first 100 paragraphs
    n_start = 100 
    n_para = 100
    
    #defining the classifier
    classifier_seq = MultinomialNB()
    #preprocessing of training set
    x_100_train = vectorizer.transform(df_train[:100]['para'])
    y_100_train = df_train[:100]['label']
    
    #preprocessing of testing set
    x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
    y_1000_test = df_train[100:1100]['label']
    
    #training the classifier
    classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
    #calculating the accuracy on the testing set
    acc_1000 = classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
    
    accuracies_seq = []
    seq_correct = 0
    seq_wrong = 0
    y_test_list = []
    
    while n_para <= 1100:
        
        x_test =  vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
        y_test = df_train[n_start:n_start+n_instances]['label']
        y_test_list.append(y_test.to_list())
        
        y_pred = classifier_seq.predict(X=x_test)
        y_pred = list(y_pred)
        
        #count correct and wrong predictions over the next 1000 paras
        #(limit the tally to the first 100 batches of 10)
        if len(y_test_list) <= 100:
            for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
                if ele_pred == ele_test:
                    seq_correct += 1
                else:
                    seq_wrong += 1
        
        score = classifier_seq.score(X=x_test, y=y_test.to_list())
        accuracies_seq.append(score)
        
        n_start = n_start + n_instances
        n_para += n_instances
       
    #total number of correctly classified and incorrectly classified samples
    seq_correct, seq_wrong
    [Out] : (681, 319)

    With Active Learning

    The Active Learning classifier will be trained iteratively.

    After training on the first 100 paragraphs, the classifier predicts labels for the next 10 paragraphs, and we record which of these unseen paragraphs were predicted correctly and incorrectly. Those 10 paragraphs are then added to the training set, and the process repeats until all 1000 paragraphs have been used in training. As with the sequential learning classifier, the total number of correct and wrong predictions over the next 1000 paragraphs is tallied.

    #the number of paragraphs in validation set for each iteration
    n_instances = 10
    
    #first 100 paragraphs
    n_start = 100 
    
    #defining the classifier
    classifier = MultinomialNB()
    
    #preprocessing of training set
    x_train = vectorizer.transform(df_train[:100]['para'])
    y_train = df_train[:100]['label']
    
    #training the classifier
    classifier.fit(X=x_train, y=y_train.to_list())
    
    accuracies = []
    test_accuracies = []
    
    correct = 0
    wrong = 0
    
    n_para = 100
    y_test_list = []
    
    while x_train.shape[0] <= 1100:
        
        #calculating test accuracy for each iteration
        test_score = classifier.score(X=X_test, y=df_test.label.to_list())
        test_accuracies.append(test_score)
        
        #calculating validation accuracy for every next 10 paras
        x_test = vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
        y_test = df_train[n_start:n_start+n_instances]['label']
        y_test_list.append(y_test.to_list())
        y_pred = classifier.predict(X=x_test)
        y_pred = list(y_pred)
        
        #count correct and wrong predictions over the next 1000 paras
        #(limit the tally to the first 100 batches of 10)
        if len(y_test_list) <= 100:
            for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
                if ele_pred == ele_test:
                    correct += 1
                else:
                    wrong += 1
    
        score = classifier.score(X=x_test, y=y_test.to_list())
        accuracies.append(score)
        #grow the training set by the 10 newly validated paragraphs and retrain
        x_train = vectorizer.transform(df_train[:n_start+n_instances]['para'])
        y_train = df_train[:n_start+n_instances]['label']
        classifier = MultinomialNB()
        classifier.fit(X=x_train, y=y_train.to_list())
        
        n_start = n_start + n_instances
        n_para += n_instances
    
    #total number of correctly classified and incorrectly classified samples
    correct, wrong
    [Out] : (844, 156)

    Results

    Let’s take a look at the results we obtained from the experiments:

    The first graph indicates that, as more paragraphs are used for training, the accuracy of the Active Learning classifier increases, whereas the accuracy of the Sequential Learning classifier does not improve.

    The second graph shows each classifier’s accuracy on every successive batch of 10 unseen paragraphs beyond the initial 100.
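    The original charts are not reproduced here, but the second one can be recreated from the per-batch accuracies collected above; a minimal matplotlib sketch:

    import matplotlib.pyplot as plt

    #accuracies / accuracies_seq were collected in the two loops above
    plt.plot(accuracies, label="Active Learning")
    plt.plot(accuracies_seq, label="Sequential Learning")
    plt.xlabel("Iteration (batches of 10 paragraphs)")
    plt.ylabel("Accuracy on the next 10 paragraphs")
    plt.legend()
    plt.show()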

    Conclusion

    The results show that the Active Learning classifier outperforms the Sequential Learning classifier. This is because Active Learning keeps learning through intermediate validations, whereas the Sequential Learning classifier is trained only at 100 and 1100 paragraphs and learns nothing in between. This is how the Active Learning classifier achieves higher accuracy while using less labeled data.