Experiment
We compare DataNeuron's Unsupervised learning algorithms to the most popular Language Models in benchmark experiments. This benchmarking is primarily intended to compare models capable of data annotation without any prior domain knowledge or pre-training (Zero-Shot).
Datasets
Three domain-specific datasets were used to compare the algorithms' accuracy:
- Ohsumed: a collection of medical abstracts from MeSH categories, from 1991 (https://paperswithcode.com/dataset/ohsumed)
- Contract Understanding Atticus Dataset (CUAD): a legal contract dataset with extensive expert annotations (https://www.atticusprojectai.org/cuad)
- Emotions: a dataset with six classes: sadness, anger, love, surprise, fear, and joy (https://www.kaggle.com/datasets/parulpandey/emotion-dataset)
Open-source Large Language Models (LLMs)
We have selected three of the most popular LLMs capable of performing Zero-Shot learning (a usage sketch follows the list):
- Nb-bert-base-mnli: Trained on the MNLI dataset (https://huggingface.co/NbAiLab/nb-bert-base-mnli)
- Bart-large-mnli: Trained on the MultiNLI (MNLI) dataset (https://huggingface.co/facebook/bart-large-mnli)
- Nli-roberta-base: Trained on the SNLI and MultiNLI datasets (https://huggingface.co/cross-encoder/nli-roberta-base)
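All three checkpoints are MNLI-style entailment models, so they can be queried through Hugging Face's zero-shot-classification pipeline. A minimal sketch, shown with facebook/bart-large-mnli; the example sentence and the use of the Emotions label set here are illustrative:

```python
from transformers import pipeline

# Load one of the benchmarked checkpoints into the zero-shot pipeline;
# the other two MNLI-trained models can be swapped in the same way.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "I can't stop smiling today.",  # illustrative input paragraph
    candidate_labels=["sadness", "anger", "love", "surprise", "fear", "joy"],
)
print(result["labels"][0])  # highest-scoring class
```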
GPT-3.5 Turbo
The most capable GPT-3.5 model, optimized for chat, at one-tenth the cost of text-davinci-003. It supports up to 4,096 tokens, with training data up to September 2021.
We picked this version for the benchmarking because ChatGPT, which is available to the general public, hosts the same model.
1. API Setup
Install the OpenAI library in your Python environment with `pip install openai`. After installation, create an OpenAI API key, then assign it to `openai.api_key` to authenticate.
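A minimal sketch of that setup, assuming the pre-1.0 openai package that exposes the module-level api_key attribute (the key value is a placeholder):

```python
# pip install openai
import openai

# Placeholder value; substitute the key generated in your OpenAI account.
openai.api_key = "key-extracted-from-openai"
```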
2. ChatCompletion
Use the ChatCompletion class's create method to send a request to the model and receive the API response:
```python
chat_completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello world"}],
)
```
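The assistant's reply can then be read from the response object returned above:

```python
# The reply text lives in the first (and here only) completion choice.
print(chat_completion.choices[0].message.content)
```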
Request Message
```python
[{"role": "user", "content": "Hello world"}]
```
Content: Prompt Example
Please label each of the following 100 paragraphs strictly into one of these 6 classes: 1. sadness 2. anger 3. love 4. surprise 5. fear 6. joy. Separate answers with , for each of the paragraphs ["paragraph1", "paragraph2", …, "paragraph100"]
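A sketch of how such a batched request could be assembled and parsed; build_prompt, the placeholder paragraphs, and the comma-splitting of the reply are illustrative assumptions rather than the exact benchmarking code:

```python
import openai  # client authenticated as in the API setup step


def build_prompt(paragraphs, classes):
    # Hypothetical helper mirroring the prompt template above.
    numbered = " ".join(f"{i}. {c}" for i, c in enumerate(classes, 1))
    quoted = ", ".join(f'"{p}"' for p in paragraphs)
    return (
        f"Please label each of the following {len(paragraphs)} paragraphs "
        f"strictly into one of these {len(classes)} classes: {numbered}. "
        f"Separate answers with , for each of the paragraphs [{quoted}]"
    )


classes = ["sadness", "anger", "love", "surprise", "fear", "joy"]
paragraphs = ["paragraph1", "paragraph2"]  # placeholder batch

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": build_prompt(paragraphs, classes)}],
)

# The prompt asks for comma-separated labels, one per paragraph.
predictions = [label.strip() for label in
               response.choices[0].message.content.split(",")]
```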
3. Benchmarking
A classification report was generated by comparing the GPT-3.5 output against the ground-truth labels after the paragraphs were classified through the OpenAI library.
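The report itself can be produced with scikit-learn; a minimal sketch, assuming ground_truth and predictions are aligned lists of class labels (the values below are placeholders):

```python
from sklearn.metrics import classification_report

# Placeholder labels; in the benchmark these come from the dataset
# annotations and from the parsed GPT-3.5 output, respectively.
ground_truth = ["joy", "anger", "sadness", "joy"]
predictions = ["joy", "fear", "sadness", "joy"]

print(classification_report(ground_truth, predictions))
```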
DataNeuron Platform (Stage 1)
Stage 1 models classify paragraphs based on the user-defined Masterlist/Taxonomy, which is equivalent to providing prompts or the scope of the classification to the Zero-Shot LLMs. Stage 1 consists of proprietary Unsupervised models for annotation and DSEAL algorithms for strategic data sampling.
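For the Emotions dataset, for instance, the user-supplied scope amounts to nothing more than the class names; the Python representation below is purely illustrative, since DataNeuron's actual Masterlist input format is not described here:

```python
# Illustrative Masterlist/Taxonomy: only class names, no labelled
# example paragraphs, are provided to the Stage 1 models.
masterlist = ["sadness", "anger", "love", "surprise", "fear", "joy"]
```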
Results
Conclusion
DataNeuron Stage 1 performed better than the pre-trained LLMs BERT Base and RoBERTa across all three datasets. Further, it achieved accuracy comparable to BART Large on the Ohsumed and CUAD datasets while outperforming it significantly on the Emotions dataset. DataNeuron Stage 1 models also outperformed GPT-3.5 in the benchmarking on the CUAD and Emotions datasets.
It is critical to note that the DataNeuron Stage 1 models were not given any sample paragraphs for pre-training, implying that Stage 1 models can automatically annotate with high accuracy and without any prior domain knowledge.
Since DataNeuron's models are lightweight, they scale much better than LLMs for large data annotation workflows. At the same time, DataNeuron achieves comparable or better accuracy with its proprietary Unsupervised models and DSEAL algorithms than pre-trained LLMs, at lower cost and in less time.