Category: whitepaper

  • DataNeuron Feature Catalogue

    DataNeuron Feature Catalogue

    The DataNeuron Pipeline

    DataNeuron helps you accelerate and automate human-in-the-loop annotation for developing AI solutions. Powered by a data-centric platform, we automate data labeling, model creation, and end-to-end ML lifecycle management.

    Ingest

    Upload Visualization

    Users can upload all the data available to them without performing any filtering to remove out-of-scope paragraphs.

    The data can be uploaded in various file formats supported by the platform.

    The platform has an in-built feature that can handle out-of-scope paragraphs and separate them from the classification data. This functionality is optional and can be toggled on or off anytime during the process.

    Structure

    Structure Visualization

    The next step is the creation of the project structure.

    Instead of a simple flat structure with just the classes defined, we give the user the option to create a multi-level (hierarchical) structure, so that the data can be grouped into domains and subdomains and divided into further subparts indefinitely, depending on the user's needs.

    Any of the defined nodes can be marked as a class for the data to be classified into, irrespective of its level in the hierarchy. This provides the flexibility to create an ontology of any depth for classification.
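
    As a rough illustration (and not DataNeuron's actual data format), such a structure can be thought of as a tree in which any node, at any depth, may be flagged as a classification target. The sketch below uses hypothetical node names and a hypothetical "is_class" flag.

    # Illustrative only: a hypothetical representation of a multi-level project
    # structure in which any node, at any depth, can be marked as a class.
    masterlist = {
        "Finance": {
            "is_class": False,
            "children": {
                "Banking": {"is_class": True, "children": {}},
                "Insurance": {
                    "is_class": False,
                    "children": {
                        "Life Insurance": {"is_class": True, "children": {}},
                        "General Insurance": {"is_class": True, "children": {}},
                    },
                },
            },
        },
        "Legal": {"is_class": True, "children": {}},
    }

    def collect_classes(node_map, path=()):
        # Return the full paths of all nodes marked as classes, at any level.
        classes = []
        for name, node in node_map.items():
            current = path + (name,)
            if node["is_class"]:
                classes.append(" > ".join(current))
            classes += collect_classes(node["children"], current)
        return classes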

    Validate

    Validate Visualization

    The user does not have to go through the entire dataset to sort out the paragraphs that belong to a certain class and label them to provide training data for the model, which can be a tedious and difficult task.

    We propose a validation-based approach:

    • The platform provides the users with suggestions of paragraphs that are most likely to belong to a certain category/class based on an efficient context-based filtering criterion.
    • The user simply has to validate the suggestions, i.e., check whether or not the suggested class is correct.

    This greatly reduces the effort the user must put into filtering out the paragraphs that belong to a category from the entire dataset.

    The strategic annotation technique allows the user to adopt a ‘one-vs-all’ strategy, which is far easier than having to consider all the defined classes, which can be numerous depending on the problem at hand, while tagging a paragraph.

    Our intelligent filtering algorithm ensures that “edge-case” paragraphs, i.e., paragraphs that have no obvious correlation with a class but still belong to that class, are not left out.

    This step is broken down into 2 stages:

    • The validation done by the user in the first stage is used for determining the annotation suggestions offered in the second stage.
    • Each batch of annotation is used to improve the accuracy of the filtering algorithm for the next batch.

    The platform also provides a summary screen after each batch of validation, giving the user an idea of how many more paragraphs they might need to validate per class in order to achieve higher accuracy.

    It also helps determine when to stop validating a specific class and focus instead on a class for which the platform projects low confidence.

    Train

    Train Visualization

    The user invests virtually no effort in the model training step; training can be initiated with the simple click of a button.

    The complete training process is automatic, performing preprocessing, feature engineering, model selection, model training, optimization, and k-fold validation.

    After the final model is trained, a detailed report on the trained model is presented to the user, which includes the overall accuracy of the model as well as the accuracy achieved per class.
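
    As a rough illustration of the kind of automated steps described above (and not DataNeuron's internal implementation), the sketch below uses scikit-learn to chain vectorization, hyper-parameter selection with k-fold cross-validation, and a per-class accuracy report; the function and parameter names are assumptions for this example.

    # Illustrative sketch only: an automated train-and-report loop with scikit-learn.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report

    def train_and_report(train_texts, train_labels, test_texts, test_labels):
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer()),                 # preprocessing / feature engineering
            ("clf", LogisticRegression(max_iter=1000)),   # model
        ])
        # simple model optimization with 5-fold cross-validation
        search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
        search.fit(train_texts, train_labels)
        # overall and per-class report on held-out data
        print(classification_report(test_labels, search.predict(test_texts)))
        return search.best_estimator_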

    Iterate

    Iterate Visualisation

    Once the model has been trained, we provide the user with 2 options:

    • Continue to the deployment stage if the trained model matches their expectations.
    • If the model does not achieve the desired results, the user can choose to go back and provide more training paragraphs (by validating more paragraphs or uploading seed paragraphs) or alter the project structure to remove some classes and then retrain the model to achieve better results.

    Deploy (“No-Code” Prediction Service)

    Deploy Visualisation

    Apart from providing the final annotations on the user-uploaded data using the trained model, we also provide a prediction service that can be used to make predictions on any new paragraphs for a minimal fee.

    This requires no knowledge of coding, and users can utilize the service for any input data from the platform.

    This can also be integrated into other platforms by making use of the exposed prediction API or the deployed Python package.
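
    As a purely hypothetical sketch of such an integration, a client could POST new paragraphs to the exposed prediction endpoint over HTTP; the endpoint URL, payload fields, and authentication header below are placeholders rather than DataNeuron's documented API.

    # Hypothetical illustration only: endpoint URL, payload shape, and auth header are placeholders.
    import requests

    def predict_paragraphs(paragraphs, api_key):
        response = requests.post(
            "https://api.example.com/predict",               # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},  # placeholder auth scheme
            json={"paragraphs": paragraphs},                 # placeholder payload shape
            timeout=30,
        )
        response.raise_for_status()
        return response.json()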

    No Requirement for a Data Science/Machine Learning Expert

    The DataNeuron ALP is designed in such a way that no prerequisite knowledge of data science or machine learning is required to utilize the platform to its maximum potential.

    For some very specific use cases, a Subject Matter Expert might be required but for the majority of use cases, an SME is not required in the DataNeuron Pipeline.

  • Comparison of NLP Data Sampling Strategies

    Comparison of NLP Data Sampling Strategies

    What is Sampling?

    A sample is a collection of people, items, or objects taken from a wider population for analysis. To enable us to extrapolate the research sample’s findings to the entire population, the sample must be representative of the population.

    Let’s go through a real-world scenario.

    We’re looking for the average annual salary of Mumbai’s adult population. As of 2022, Mumbai has a population of about 30 million. Males and females in this population would be split roughly 1:1 (these are simple generalizations), and they might have different averages. Similarly, there are numerous other ways in which various adult population groups may have varying income levels. As you may guess, it is incredibly difficult to determine the average adult income in Mumbai.

    Since it’s impossible to reach every adult in the whole population, what can be the solution? We can collect numerous samples and determine the average income of the people in the chosen samples.

    How can we take a Sample?

    Taking the same scenario from above, imagine we only take samples from the people in managerial positions. This won’t be regarded as a decent sample because, on generalizing, a manager earns more than the average adult, and it will provide us with a poor estimation of the income of the average adult. A sample must accurately reflect the universe from which it was drawn.

    There are various different potential solutions, but we’ll be looking at three major techniques.

    Sampling strategies:

    • Most uncertain probability
    • Most certain data points
    • The basic mixture from different confidence intervals

    Most Uncertain Probability

    The aim behind uncertainty sampling is to focus on the data item that the present predictor is least certain about. To put it another way, uncertainty sampling typically finds points that are located near the decision boundary of the current model.

    Uncertainty Sampling

    Assume that a student is preparing for an exam and has 1000 questions to go through, but only has time to go through 100 of them. Naturally, the student should prepare the 100 questions on which they are least confident. By focusing on these, the student should get smarter, faster.
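
    A minimal sketch of this idea for a binary classifier is shown below: the points selected are those whose predicted probability is closest to 0.5, i.e., closest to the decision boundary. It assumes a fitted scikit-learn-style model and is illustrative only.

    # Minimal sketch: pick the k pool points the current binary classifier is least certain about.
    import numpy as np

    def most_uncertain_indices(model, X_pool, k=100):
        proba = model.predict_proba(X_pool)[:, 1]   # P(class = 1) for each pool point
        uncertainty = np.abs(proba - 0.5)           # distance from the decision boundary
        return np.argsort(uncertainty)[:k]          # indices of the k most uncertain points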

    Most Certain Data Points

    This method chooses the data points with the highest certainty, i.e., data points that the model predicts with the highest confidence. These data points have the greatest chance of being predicted correctly by the model. Such data points may or may not add much new information to the model's learning.

    Basic mixture of different Confidence Intervals

    Data points are grouped according to their confidence scores, and sampling is done from all of these intervals or groups. This ensures that no kind of data is missed and that the sampled set contains a balance of certain and uncertain data points, so the model can learn the decision boundary well without losing already-learned information.

    Code

    Now, let’s use these sampling methods and see their application using a simple code in Python!

    We’ll be working with a binary classification problem, using two datasets:

    1. IMDB Movie Review Dataset for sentiment analysis. Two classes in this dataset: Positive, Negative
    2. Emotion Dataset. Two classes in this dataset: Joy, Sadness

    Download the datasets:

    1. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
    2. https://huggingface.co/datasets/emotion

    We have performed this experiment on Jupyter Notebook.

    Loading Data & Preprocessing

    The availability of data is always a determining factor in the field of machine learning, so loading the data comes first.
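
    A minimal loading step might look like the following sketch; the CSV filename and the exact import list are assumptions based on the Kaggle IMDB dataset and the code used later in this walkthrough.

    # Assumed filename for the Kaggle IMDB dataset (columns: 'review', 'sentiment').
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv('IMDB Dataset.csv')
    df.head()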

    Clean the data by replacing all occurrences of HTML line breaks with a single white space.

    # replace the HTML line-break markup with a single space (vectorized instead of a Python loop)
    df['review'] = df['review'].str.replace('<br /><br />', ' ', regex=False)

    For ease of the experiment, we’re using 10k paragraphs out of the whole dataset of 50k paragraphs.

    frac = 1/5
    df_new = df.sample(frac = frac, random_state = 0)
    df_new.shape
    [Out]: (10000, 2)

    Data Splitting

    A train-test split is created, so the test split can be used to evaluate the performance of the model trained using the train split.

    df_train, df_test = train_test_split(df_new, test_size=0.2, random_state = 0)
    df_train = df_train.reset_index(drop = True)
    df_train.shape, df_test.shape
    [Out]: ((8000, 2), (2000, 2))

    Let’s separate out 100 paragraphs for training and keep the remaining 7900 paragraphs as the stage 1 test set.

    df_100_train = df_train.sample(n=100, random_state=0)  # assumed: a random 100-paragraph training sample
    df_stage1_test = df_train[~df_train.isin(df_100_train)].dropna(how = 'all').reset_index(drop = True)

    Classifier Training

    Since the train and test sets have been constructed, the pipeline can be set up. The pipeline consists of three steps: text vectorization, label encoding, and model creation at the end.

    # The resulting matrices will have the shape of (`nr of examples`, `nr of word n-grams`)
    vectorizer = CountVectorizer(ngram_range=(1, 5))
    X_100_train = vectorizer.fit_transform(df_100_train.review)
    X_stage1_test = vectorizer.transform(df_stage1_test.review)
    X_test = vectorizer.transform(df_test.review)
    
    labelencoder = LabelEncoder()
    df_100_train['sentiment'] = labelencoder.fit_transform(df_100_train['sentiment'])
    df_stage1_test['sentiment'] = labelencoder.transform(df_stage1_test['sentiment'])
    df_test['sentiment'] = labelencoder.transform(df_test['sentiment'])

    Before moving on to the sampling strategies, an initial model is trained:

    logreg = LogisticRegression()
    logreg.fit(X=X_100_train, y=df_100_train['sentiment'].to_list())

    Calculating the Predicted Probabilities

    Predicted probability / confidence scores are calculated on the stage 1 test data.

    pred_proba = logreg.predict_proba(X=X_stage1_test)
    df_proba = pd.DataFrame()
    df_proba['review'] = df_stage1_test['review']
    df_proba['sentiment'] = df_stage1_test['sentiment']
    df_proba['predict_proba'] = list(pred_proba[:,0])
    df_proba

    Most Uncertain Probability Sampling

    Around 1000 paragraphs are picked from the probability window with the highest degree of uncertainty. These paragraphs are sorted by predicted probability in a dataframe, and the mean value of the predicted probability is then computed.

    The index of the row with the predicted probability value closest to the mean is calculated.

    The paragraph sets are chosen around this mean-value row (half of them from the rows above it and half from the rows below it), so that each set contains the paragraphs closest to the point of maximum uncertainty.

    #window between 0.44 and 0.55
    df_uncertain = df_proba[(df_proba['predict_proba'] >= 0.44) & (df_proba['predict_proba'] <= 0.55) ]
    df_uncertain_sorted = df_uncertain.sort_values(by = ['predict_proba'], ascending = False)
    df_uncertain_sorted
    #index of the row closest to the mean value of predicted probability
    
    mid_idx = int(len(df_uncertain_sorted)/2)
    mean_idx = mid_idx-12
    df_uncertain_sorted['predict_proba'].mean()
    [Out]: 0.4936057560087512
    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_uncertain_list = []
    
    for para in num_of_para:
        
        para_idx = int(para/2)
        
        #training set
        df_uncertain_new = df_uncertain_sorted.iloc[mean_idx-para_idx:mean_idx+para_idx]
        
        #preprocessing
        X_train_uncertain = vectorizer.transform(df_uncertain_new.review)
        
        #defining the classifier
        logreg_uncertain = LogisticRegression()
        
        #training the classifier
        logreg_uncertain.fit(X=X_train_uncertain, y=df_uncertain_new['sentiment'].to_list())    
        #calculating the accuracy score on the test set
        score_uncertain = logreg_uncertain.score(X_test, df_test['sentiment'].to_list())
        score_uncertain_list.append(score_uncertain)
    
    score_uncertain_list
    [Out]: [0.547, 0.5845, 0.584, 0.612, 0.6215, 0.6335, 0.6415, 0.663, 0.659, 0.6755]

    Most Certain Probability Sampling

    The dataframe with 7900 paragraphs is sorted in descending order of their predicted probabilities. The top [100,200,300,400,500,600,700,800,900,1000] sets of paragraphs are selected as the most certain paragraphs.

    df_proba_sorted = df_proba.sort_values(by = ['predict_proba'], ascending = False)
    df_proba_sorted
    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_certain_list = []
    
    for para in num_of_para:
        
        #training set
        df_certain = df_proba_sorted[:para]
        
        #preprocessing
        X_train_certain = vectorizer.transform(df_certain.review)
        
        #defining the classifier
        logreg_certain = LogisticRegression()
        
        #training the classifier
        logreg_certain.fit(X=X_train_certain, y=df_certain['sentiment'].to_list())
        
        #calculating the accuracy score on the test set
        score_certain = logreg_certain.score(X_test, df_test['sentiment'].to_list())
        
        score_certain_list.append(score_certain)
    
    score_certain_list
    [Out]: [0.5215, 0.54, 0.536, 0.5755, 0.5905, 0.6245, 0.641, 0.6735, 0.7355, 0.7145]

    Confidence Interval Grouping Sampling

    In this method, the 25th and 75th percentiles of the predicted probabilities are calculated. Then the 7900 paragraphs are separated into 3 groups.

    • Predicted probabilities > 75th percentile — Group 1
    • 25th percentile < Predicted probabilities < 75th percentile — Group 2
    • Predicted probabilities < 25th percentile — Group 3

    From these 3 groups, sets of [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] paragraphs are sampled according to the fractions below:

    #calculating the 25th and 75th percentile
    
    proba_arr = df_proba['predict_proba']
    percentile_75 = np.percentile(proba_arr, 75)
    percentile_25 = np.percentile(proba_arr, 25)
    
    print("25th percentile of arr : ",
           np.percentile(proba_arr, 25))
    [Out]: 25th percentile of arr :  0.28084100127515504
    print("75th percentile of arr : ",
           np.percentile(proba_arr, 75))
    [Out]: 75th percentile of arr :  0.7063559972435552
    
    #grouping of the paragraphs for following window 
    # group 1 : >= 75
    df_group_1 = df_proba[df_proba['predict_proba'] >= percentile_75]
    # group 2 : <75 and >= 25
    df_group_2 = df_proba[(df_proba['predict_proba'] >= percentile_25) & (df_proba['predict_proba'] < percentile_75)]
    # group 3 : < 25
    df_group_3 = df_proba[(df_proba['predict_proba'] < percentile_25)]
    
    df_group_1.shape, df_group_2.shape, df_group_3.shape
    [Out]: ((1975, 3), (3950, 3), (1975, 3))

    A model is then trained on each set of paragraphs for each of the 3 sampling techniques (10 sets x 3 techniques = 30 models in total). The accuracy score for each case is calculated by evaluating the model on the 2000-paragraph test set.

    num_of_para = [100,200,300,400,500,600,700,800,900,1000]
    score_conf_list = []
    
    #fractions
    frac1 = 0.4
    frac2 = 0.3
    frac3 = 0.3
    
    #sampling paragraphs from the 3 groups
    df_group_1_frac = df_group_1.sample(frac=frac1, random_state=1).reset_index(drop = True)
    df_group_2_frac = df_group_2.sample(frac=frac2, random_state=1).reset_index(drop = True)
    df_group_3_frac = df_group_3.sample(frac=frac3, random_state=1).reset_index(drop = True)
    
    for para in num_of_para:
        
        #sampling paragraphs from the 3 groups to build the training set
        df_group_1_new = df_group_1_frac[:int(frac1 * para)]
        df_group_2_new = df_group_2_frac[:int(frac2 * para)]
        df_group_3_new = df_group_3_frac[:int(frac3 * para)]
        
        df_list = [df_group_1_new, df_group_2_new, df_group_3_new]
        
        #training set
        df_conf = pd.concat(df_list).reset_index(drop = True)
        
        #preprocessing
        X_train_conf = vectorizer.transform(df_conf.review)
        
        #defining the classifier
        logreg_conf = LogisticRegression()
        
        #training the classifier
        logreg_conf.fit(X=X_train_conf, y=df_conf['sentiment'].to_list())
        
        #calculating the accuracy score on the test set
        score_conf = logreg_conf.score(X_test, df_test['sentiment'].to_list())
        score_conf_list.append(score_conf)
    
    score_conf_list
    [Out]: [0.6525, 0.6835, 0.7235, 0.7525, 0.766, 0.7735, 0.778, 0.7875, 0.796, 0.807]

    Results and Conclusion

    The accuracies of the three sampling strategies can now be compared, and it is clear that a combination of different confidence intervals performs better than the others. This shows that, along with learning new information from uncertain paragraphs, the model also needs to retain the previously learned information. A balance of data from different confidence intervals therefore helps the model learn, maximizing the resulting overall accuracy.
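
    A minimal sketch for comparing the three accuracy curves computed above (it assumes matplotlib is available in the notebook environment):

    # Plot the accuracy of each sampling strategy against the training-set size.
    import matplotlib.pyplot as plt

    plt.plot(num_of_para, score_uncertain_list, marker='o', label='Most uncertain')
    plt.plot(num_of_para, score_certain_list, marker='o', label='Most certain')
    plt.plot(num_of_para, score_conf_list, marker='o', label='Confidence-interval mixture')
    plt.xlabel('Number of training paragraphs')
    plt.ylabel('Accuracy on the 2000-paragraph test set')
    plt.legend()
    plt.show()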

  • DataNeuron vs Pre-Trained LLMs for Data Labeling

    DataNeuron vs Pre-Trained LLMs for Data Labeling

    Experiment

    We compare DataNeuron’s Unsupervised learning algorithms to the most popular Language Models in benchmark experiments. This benchmarking is primarily intended to compare models capable of data annotation without any prior domain knowledge or pre-training (Zero-Shot).

    Dataset

    3 domain-specific datasets were used to compare the algorithms’ accuracy:

    1. Ohsumed: a dataset of medical abstracts from MeSH categories from 1991 (https://paperswithcode.com/dataset/ohsumed)
    2. Contract Understanding Atticus Dataset (CUAD): a legal contract dataset with extensive expert annotations (https://www.atticusprojectai.org/cuad)
    3. Emotions: a dataset with classes such as “sadness, anger, love, surprise, fear, and joy.” (https://www.kaggle.com/datasets/parulpandey/emotion-dataset)

    Open-source Large Language Models (LLMs)

    We have selected three of the most popular LLMs that are capable of performing Zero-Shot learning:

    1. Nb-bert-base-mnli: Trained on MNLI dataset (https://huggingface.co/NbAiLab/nb-bert-base-mnli)
    2. Bart-large-mnli: Trained on the MultiNLI (MNLI) dataset (https://huggingface.co/facebook/bart-large-mnli)
    3. Nli-roberta-base: Trained on the SNLI and MultiNLI datasets (https://huggingface.co/cross-encoder/nli-roberta-base)

    GPT 3.5 Turbo

    The most capable GPT-3.5 model, optimized for chat, at one-tenth the cost of text-davinci-003. It supports up to 4,096 tokens, with training data up to September 2021.

    We have picked this GPT version for the benchmarking since ChatGPT hosts the same model version, which is available to the general public.

    Colab Link

    1. API Setup

    Install the OpenAI library in your Python environment using pip install openai. After installation, create an OpenAI API key, then set openai.api_key = ‘key-extracted-from-openai’ to authenticate.
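
    A minimal setup sketch for the steps above (the key string is a placeholder for your own API key):

    # pip install openai
    import openai

    openai.api_key = 'key-extracted-from-openai'  # placeholder: replace with your own key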

    2. ChatCompletion

    Use the ChatCompletion class to call the “create” function, which sends the request to the model and receives the API response.

    
    chat_completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello world"}]
    )

    3. Request Message

    [{"role": "user", "content": "Hello world"}]

    Content: Prompt Example

    Please label each of the following 100 paragraphs strictly into one of these 6 classes: 1. sadness 2. anger 3. love 4. surprise 5. fear 6. joy. Separate answers with , for each of the paragraphs [“paragraph1”,”paragraph2”,………,”paragraph100”]
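
    A hedged sketch of how such a request might be assembled and its answer parsed is shown below; it assumes the setup from step 1, uses the Emotion dataset's class names, batches the paragraphs into a single prompt, and assumes the model follows the comma-separated output format exactly.

    # Sketch: build the labeling prompt for a batch of paragraphs and parse the comma-separated answer.
    def label_batch(paragraphs):
        classes = "1. sadness 2. anger 3. love 4. surprise 5. fear 6. joy"
        prompt = (
            f"Please label each of the following {len(paragraphs)} paragraphs strictly "
            f"into one of these 6 classes: {classes}. "
            f"Separate answers with , for each of the paragraphs {paragraphs}"
        )
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response["choices"][0]["message"]["content"]
        return [label.strip() for label in answer.split(",")]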

    4. Benchmarking

    A classification report was generated by comparing the ground truth to the GPT-3.5 output after generating class labels for the paragraphs with the OpenAI library.
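
    A minimal sketch of this comparison with scikit-learn (true_labels and gpt_labels are placeholder variable names for the ground-truth labels and the parsed GPT-3.5 labels):

    # Compare ground-truth labels with the parsed GPT-3.5 labels.
    from sklearn.metrics import classification_report

    print(classification_report(true_labels, gpt_labels))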

    DataNeuron Platform (Stage 1)

    Stage 1 models predict paragraph labels based on the user-defined Masterlist/Taxonomy, which is equivalent to providing prompts or the scope of the classification to the Zero-Shot LLMs. Stage 1 consists of proprietary Unsupervised models for annotation and DSEAL algorithms for strategic data sampling.


    Result

    Conclusion

    DataNeuron Stage 1 performed better than the pre-trained LLMs BERT BASE and ROBERTA across all 3 datasets. Further, it achieves accuracies comparable to BART LARGE on the OHSUMED and CUAD datasets while outperforming it significantly on the EMOTIONS dataset. DataNeuron Stage 1 models also outperformed GPT 3.5 in the benchmarking on the CUAD and EMOTIONS datasets.

    It is critical to note that the DataNeuron Stage 1 model was not given any sample paragraphs for pre-training, implying that Stage 1 models can annotate automatically with high accuracy and without any prior domain knowledge.

    Since DataNeuron models are lightweight, they scale much better than LLMs for large data-annotation workflows. At the same time, DataNeuron achieves comparable or better accuracies than pre-trained LLMs with its proprietary Unsupervised models and DSEAL algorithms, at lower cost and in less time.