Blog

  • How well does the DataNeuron ALP handle the Finance Use Case?

    [Table: overview of the dataset used in this case study]

    Explaining the DataNeuron Pipeline

    The DataNeuron Pipeline consists of seven stages: Ingest, Structure, Validate, Train, Predict, Deploy, and Iterate.

    Results of our Experiment

    Reduction in SME Labeling Effort

    In an in-house project, SMEs have to read every paragraph in the dataset to determine which ones actually belong to the 8 classes mentioned above, which takes a tremendous amount of time and effort.

    With the DataNeuron ALP, the algorithm performed strategic annotation on roughly 15000 raw paragraphs, filtered out the paragraphs that did not belong to the 8 classes, and provided 659 paragraphs to the user for validation.

    At roughly 45 seconds per paragraph, an in-house project would need an estimated 188 hours just to annotate all the paragraphs.

    Difference in paragraphs annotated between an in-house solution and DataNeuron.

    Advantage of Suggestion-Based Annotation

    Instead of making users go through the entire dataset to label paragraphs belonging to a certain class, DataNeuron uses a validation-based approach that makes the model training process considerably easier.

    Using context-based filtering and analysis of the masterlist, the platform provides users with a list of annotated/labeled paragraphs that are most likely to belong to a given class. Users simply validate whether each system-labeled paragraph belongs to the class suggested.

    This validation-based approach also reduces the time it takes to annotate each paragraph: we estimate it takes approximately 30 seconds for a user to decide whether a paragraph belongs to a particular class.

    At that rate, it would take an estimated 6 hours for users to validate the 659 paragraphs provided by the DataNeuron ALP. Compared to the 188 hours it would take an in-house team to complete the annotation process, DataNeuron offers a staggering 96.8% reduction in time spent.
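    The arithmetic behind these estimates is easy to reproduce. A quick sketch, using the 15067-paragraph count from the cost section below and the per-paragraph timings above:

    import math

    in_house_hours = round(15067 * 45 / 3600)      #~188 h at 45 s per paragraph
    validation_hours = math.ceil(659 * 30 / 3600)  #~6 h at 30 s per paragraph (5.5 h unrounded)
    reduction = 1 - validation_hours / in_house_hours
    print(f"{reduction:.1%}")                      #-> 96.8%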

    Difference in time spent annotating between an in-house solution and DataNeuron.

    The Accuracy Tradeoff

    In this case study, the model trained by the DataNeuron ALP achieved an accuracy of 93.9%, while the model trained in the in-house project achieved 98.2%.

    The savings in annotation time can offset this small difference in accuracy, and the accuracy of the ALP-trained model can be increased further by validating more paragraphs.

    Difference in accuracy between an in-house solution and DataNeuron.

    Calculating the Cost ROI

    An in-house project would need 15067 paragraphs annotated, at an approximate cost of $3288.

    With the DataNeuron ALP, only 659 paragraphs need to be annotated, since most of the paragraphs that did not belong to any of the 8 classes were discarded by context-based filtering. Annotating those 659 paragraphs using the DataNeuron ALP costs $575.

    The reduction in cost is a significant 82.5% and the cost ROI is an estimated 471.82%.
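    Both figures follow directly from the two cost estimates above; for reference:

    in_house_cost = 3288    #est. cost of annotating 15067 paragraphs in-house ($)
    dataneuron_cost = 575   #est. cost of validating 659 paragraphs on the ALP ($)

    savings = in_house_cost - dataneuron_cost    #$2713
    cost_reduction = savings / in_house_cost     #0.825 -> 82.5%
    cost_roi = savings / dataneuron_cost         #4.718 -> ~471.8%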

    Difference in cost between an in-house solution and DataNeuron.

    No Requirement for a Data Science/Machine Learning Expert

    The DataNeuron ALP is designed in such a way that no prerequisite knowledge of data science or machine learning is required to utilize the platform to its maximum potential.

    For some very specific use cases a Subject Matter Expert might be required, but for the majority of use cases an SME is not required in the DataNeuron Pipeline.

  • Artificial General Intelligence: DataNeuron is Redefining Data Labeling Across Domains

    The term Artificial General Intelligence (often abbreviated “AGI”) has no precise definition, but one of the most widely accepted ones describes the capacity of an engineered system to display intelligence that is not tied to a highly specific set of tasks; to generalize what it has learned, including to contexts qualitatively very different from those it has seen before; and to take a broad view, interpreting its tasks in the context of the world at large and its relation thereto.

    In essence, Artificial General Intelligence can be summarized as the ability of an intelligent agent to learn not just to do a highly specialized task but to use the skills it has learned to extract insight from data originating in multiple contexts or domains.

    How does DataNeuron achieve Artificial General Intelligence?

    The DataNeuron platform displays Artificial General Intelligence as it has the ability to perform well on:

    • NLP tasks belonging to multiple domains.
    • Text data originating from multiple contexts.

    Masterlist: Machine learning is not binary, so we don’t rely on rules or predefined functions; we rely on a simpler structure, the Masterlist, in which classes are allowed to overlap. The Masterlist also supports taxonomies and hierarchical ontologies. The platform uses intelligent algorithms to assign paragraphs to each class, automating the data annotation process.
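    As a rough illustration (a hypothetical sketch, not DataNeuron’s actual schema), a Masterlist can be pictured as a hierarchy of classes seeded with keywords, where the same term may appear under more than one class:

    #hypothetical Masterlist sketch; class names and keywords are illustrative
    masterlist = {
        "Finance": {
            "Revenue":  ["net sales", "turnover", "top line"],
            "Expenses": ["operating costs", "cost of goods sold"],
        },
        "Legal": {
            "Termination": ["terminate this agreement", "notice period"],
            #overlap is allowed: "notice period" also seeds Employment
            "Employment":  ["notice period", "severance", "employee benefits"],
        },
    }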

    Advanced Masterlist: We are also launching the Advanced Masterlist to support subjective labeling of datasets (where a clear class distribution is missing).

    Apart from the ability to perform auto-annotation on data, the platform also provides complete automation for model training including automatic data processing, feature engineering, model selection, hyperparameter optimization, and cross-validation of results.
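    Conceptually, that level of automation amounts to a search over preprocessing and model choices with cross-validated scoring. A minimal scikit-learn sketch of the idea (not the platform’s actual implementation):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    #toy stand-in for automated feature engineering, model selection,
    #hyperparameter optimization, and cross-validation
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5)
    #search.fit(texts, labels)  #texts/labels: your annotated paragraphs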

    The DataNeuron Platform automatically deploys the algorithm and provides APIs that can be integrated to build any application with real-time, no-code prediction capabilities. It also provides a continuous feedback and retraining framework for updating the model to achieve the best performance. All of these features bring it one step closer to Explainable AI.

    The DataNeuron platform has produced exceptional results in extremely specialized domains, like document or text classification in Tax & Legal, Financial, and Life Sciences use cases, as well as in general tasks like document or text clustering in any given context. DataNeuron reduces the time and effort required to label data and create models by ~95%, allowing users to extract up to ~99.98% of insights. It is an advanced platform for complex data annotation, model training, prediction, and lifecycle management. We have achieved a major breakthrough by fully automating data labeling, reaching accuracy comparable to state-of-the-art human-in-the-loop solutions on unseen data with just 2% of the labeled data.

    The impact created by DataNeuron’s General Intelligence

    We observe that the DataNeuron platform can decrease annotation time by up to ~98%. This vastly reduces the time and effort spent annotating huge amounts of data, and by automating the annotation process it allows teams to focus on the task at hand and eases research.

    Additionally, it can reduce SME effort by up to 96% while incurring a fraction of the cost. Our platform also significantly reduces the overall cost of a project by nearly eliminating the need for dedicated data labeling/annotation teams. In some cases the need for an SME is also diminished, since the annotation process is simple enough that anyone with knowledge of the domain can do it properly, unless the project is especially complex.

    Results Visualized

    The above visualizations showcase the platform’s ability to perform extraordinarily well across different domains. As opposed to specialized systems that tend to perform well on only one type of task or domain, the DataNeuron platform breaks boundaries by performing exceptionally across a diversified set of domains.

    What does it mean for the Future of AI?

    As AI adoption has picked up among enterprises, the need for labeled and structured data has dramatically increased, and data preparation has become the bottleneck in developing AI solutions.

    DataNeuron’s data-centric platform provides a complete end-to-end workflow, from training to Ensemble Model APIs, for faster deployment of AI.

    Our research continues to focus on Artificial General Intelligence, further automation of data labeling and validation, and better explainability of AI.

  • DataNeuron’s Active Learning to NLP data labeling outperforms the competitors’ Sequential Learning approach by 20% with 50% less data

    Active learning closely resembles conventional supervised learning, but because models are trained utilizing both labeled and unlabeled data, it is a form of semi-supervised learning. The concept behind semi-supervised learning is that labeling only a small, well-chosen sample of data may produce results as accurate as, or even more accurate than, fully labeled training data; finding that sample is the only difficult part. In active learning, data is labeled incrementally and dynamically during training, so that the algorithm can determine which label would be most helpful for it to learn from next.
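    A common way to pick that “most helpful” next batch is uncertainty sampling: query the examples the current model is least confident about. A minimal sketch of the idea (the experiment below uses a simpler fixed batch order instead):

    import numpy as np

    def least_confident(classifier, X_unlabeled, n_instances=10):
        """Return indices of the n least-confident unlabeled examples."""
        probs = classifier.predict_proba(X_unlabeled)
        confidence = probs.max(axis=1)           #top-class probability per row
        return np.argsort(confidence)[:n_instances]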

    Experiment

    Let’s compare the two classifiers, Sequential and Active Learning, through a simple experiment.

    We’ll be using the Contract Understanding Atticus Dataset (CUAD) for this experiment. The dataset contains 4414 paragraphs; we randomly selected 10 legal clause classes for the text classification task.

    Download the dataset: https://www.atticusprojectai.org/cuad

    Loading data & preprocessing

    We’ll start by loading the data. The features and labels are then extracted, and a train-test split is made so that the test split can be used to assess the model trained on the train split.
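    Assuming the CUAD clauses have been flattened into a two-column table of paragraphs and labels (the file name below is hypothetical; adjust it to your copy of the dataset):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer

    #hypothetical flattened export of CUAD: one paragraph and one clause label per row
    df_total = pd.read_csv("cuad_clauses.csv")   #columns: para, label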

    #train/test split of the loaded data
    df_train, df_test = train_test_split(df_total, test_size=0.2, random_state=0)

    #the resulting matrices have the shape (n_examples, n_word_ngrams)
    vectorizer = CountVectorizer(ngram_range=(1, 5))

    X_train = vectorizer.fit_transform(df_train.para)
    X_test = vectorizer.transform(df_test.para)

    X_train.shape
    [Out] : (4082, 1802587)

    X_test.shape
    [Out] : (1021, 1802587)

    With Sequential Learning

    The Sequential Learning classifier will be trained only twice: once with 100 paragraphs and again with 1000 paragraphs. For the next 1000 paragraphs, the total number of correct and wrong predictions is calculated.

    from sklearn.naive_bayes import MultinomialNB

    #defining the classifier
    classifier_seq = MultinomialNB()
    
    #preprocessing of training set
    x_100_train = vectorizer.transform(df_train[:100]['para'])
    y_100_train = df_train[:100]['label']
    
    #preprocessing of testing set
    x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
    y_1000_test = df_train[100:1100]['label']
    
    #training the model
    classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
    
    #calculating the accuracy on the test set
    classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
    [Out] : 0.681
    
    

    Code for calculating the suggestion accuracy on every successive batch of 10 paragraphs, sequentially:

    #the number of paragraphs in validation set for each iteration
    n_instances = 10
    
    #first 100 paragraphs
    n_start = 100 
    n_para = 100
    
    #defining the classifier
    classifier_seq = MultinomialNB()
    #preprocessing of training set
    x_100_train = vectorizer.transform(df_train[:100]['para'])
    y_100_train = df_train[:100]['label']
    
    #preprocessing of testing set
    x_1000_test = vectorizer.transform(df_train[100:1100]['para'])
    y_1000_test = df_train[100:1100]['label']
    
    #training the classifier
    classifier_seq.fit(X=x_100_train, y=y_100_train.to_list())
    #calculating the accuracy on the testing set
    acc_1000 = classifier_seq.score(X=x_1000_test, y=y_1000_test.to_list())
    
    accuracies_seq = []
    seq_correct = 0
    seq_wrong = 0
    y_test_list = []
    
    while n_para <= 1100:
        
        x_test =  vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
        y_test = df_train[n_start:n_start+n_instances]['label']
        y_test_list.append(y_test.to_list())
        
        y_pred = classifier_seq.predict(X=x_test)
        y_pred = list(y_pred)
        
        #count correct and wrong predictions over the next 1000 paras
        #(limit the tally to the first 100 batches of 10)
        if len(y_test_list) <= 100:
            for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
                if ele_pred == ele_test:
                    seq_correct += 1
                else:
                    seq_wrong += 1
        
        score = classifier_seq.score(X=x_test, y=y_test.to_list())
        accuracies_seq.append(score)
        
        n_start = n_start + n_instances
        n_para += n_instances
       
    #total number of correctly classified and incorrectly classified samples
    seq_correct, seq_wrong
    [Out] : (681, 319)

    With Active Learning

    The Active Learning classifier will be trained iteratively.

    After training on the first 100 paragraphs, the classifier predicts labels for the next 10 paragraphs, and we record which of these unseen paragraphs were predicted correctly and incorrectly. Those 10 paragraphs are then added to the training set, and the process repeats until all 1000 paragraphs have been used in training. As with the sequential learning classifier, the total number of correct and wrong predictions over the next 1000 paragraphs is tallied.

    #the number of paragraphs in validation set for each iteration
    n_instances = 10
    
    #first 100 paragraphs
    n_start = 100 
    
    #defining the classifier
    classifier = MultinomialNB()
    
    #preprocessing of training set
    x_train = vectorizer.transform(df_train[:100]['para'])
    y_train = df_train[:100]['label']
    
    #training the classifier
    classifier.fit(X=x_train, y=y_train.to_list())
    
    accuracies = []
    test_accuracies = []
    
    correct = 0
    wrong = 0
    
    n_para = 100
    y_test_list = []
    
    while x_train.shape[0] <= 1100:
        
        #calculating test accuracy for each iteration
        test_score = classifier.score(X=X_test, y=df_test.label.to_list())
        test_accuracies.append(test_score)
        
        #calculating validation accuracy for every next 10 paras
        x_test = vectorizer.transform(df_train[n_start:n_start+n_instances]['para'])
        y_test = df_train[n_start:n_start+n_instances]['label']
        y_test_list.append(y_test.to_list())
        y_pred = classifier.predict(X=x_test)
        y_pred = list(y_pred)
        
        #count correct and wrong predictions over the next 1000 paras
        #(limit the tally to the first 100 batches of 10)
        if len(y_test_list) <= 100:
            for ele_pred, ele_test in zip(y_pred, y_test.to_list()):
                if ele_pred == ele_test:
                    correct += 1
                else:
                    wrong += 1
    
        score = classifier.score(X=x_test, y=y_test.to_list())
        accuracies.append(score)
        #grow the training set by the 10 newly validated paragraphs and retrain
        x_train = vectorizer.transform(df_train[:n_start+n_instances]['para'])
        y_train = df_train[:n_start+n_instances]['label']
        classifier = MultinomialNB()
        classifier.fit(X=x_train, y=y_train.to_list())
        
        n_start = n_start + n_instances
        n_para += n_instances
    
    #total number of correctly classified and incorrectly classified samples
    correct, wrong
    [Out] : (844, 156)

    Results

    Let’s take a look at the results we obtained from the experiments:

    The first graph indicates that, as more paragraphs are used for training, the accuracy of the Active Learning classifier increases, whereas the accuracy of the Sequential Learning classifier does not improve.

    The second graph shows each classifier’s accuracy on every successive batch of 10 unseen paragraphs beyond the initial 100.
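    The original charts are not reproduced here, but the second one can be recreated from the per-batch accuracies collected above; a minimal matplotlib sketch:

    import matplotlib.pyplot as plt

    #accuracies / accuracies_seq were collected in the two loops above
    plt.plot(accuracies, label="Active Learning")
    plt.plot(accuracies_seq, label="Sequential Learning")
    plt.xlabel("Iteration (batches of 10 paragraphs)")
    plt.ylabel("Accuracy on the next 10 paragraphs")
    plt.legend()
    plt.show()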

    Conclusion

    The results show that the Active Learning classifier outperforms the Sequential Learning classifier. This is because Active Learning keeps learning through intermediate validations, whereas the Sequential Learning classifier is trained only at 100 and 1100 paragraphs and learns nothing in between. This is how the Active Learning classifier achieves higher accuracy while using less labeled data.