Explainer: how do you train your chatbot?

The DeepSeek app. Credit: Justin Sullivan/Getty Images

Chinese startup DeepSeek made waves this week, releasing its new open-source chatbot DeepSeek-R1 and instantly disrupting the field of artificial intelligence.

DeepSeek claims it achieved similar or equivalent performance to OpenAI’s latest model – OpenAI-o1-1217 – at a fraction of the development cost.

“…$5.6 million compared to the undisclosed billions that OpenAI consumes in building their models,” Daswin De Silva, professor of AI and analytics at La Trobe University, told Cosmos.

Wolfgang Mayer, associate professor of STEM at The University of South Australia, told Cosmos this was partly due to the smaller datasets required to train the model.

“In addition, DeepSeek also uses several engineering techniques and clever ways to create larger models from smaller models to make the training more efficient on less powerful computer chips,” says Mayer.

So, how did they do it?

What is DeepSeek-R1?

DeepSeek-R1 is a type of generative artificial intelligence (AI) known as a large language model (LLM). LLMs learn language in order to generate text – holding conversations, brainstorming, summarising, and creating content.

“It’s learning language through correlations or frequently occurring patterns in sentences,” says De Silva, who is also deputy director of the Centre for Data Analytics and Cognition (CDAC) at La Trobe.

“Given a series of words, can you predict the next word? This is the learning task for the large language model, and it gets it right most of the time.”

How do you “train” LLMs?

Training an LLM is a multi-stage process. In the first – pre-training – the model learns language patterns by processing massive amounts of text data.

Mayer says the LLM adjusts billions of internal parameters to minimise errors when predicting the next word in a sentence or answering questions.

“This phase uses a self-supervised learning approach: the model predicts the next word in a sentence – e.g., ‘The cat sat on the ___’ might predict ‘mat’ – and adjusts millions of internal parameters to minimise prediction errors,” he explains.
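To make that objective concrete, here is a toy sketch in Python. It uses simple word counts instead of a neural network, so it illustrates the next-word prediction task itself rather than how models like DeepSeek-R1 are actually implemented.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the web-scale text real LLMs learn from.
corpus = (
    "the cat sat on the mat . "
    "the cat chased the dog . "
    "the cat slept on the mat ."
).split()

# Count which word follows each two-word context. Real models use neural
# networks with billions of parameters instead of a lookup table, but the
# training objective is the same: predict the next word.
following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

def predict_next(context):
    """Return the continuation most often seen after this context."""
    return following[context].most_common(1)[0][0]

print(predict_next(("on", "the")))  # -> 'mat', as in "The cat sat on the ___"
```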

But while pre-training gives the LLM a broad understanding of language, it must be fine-tuned before it can perform specific tasks, like chatting with a human. This is where human feedback comes in: the language model is retrained into a more conversational chatbot.

“The model is trained further on curated datasets tailored to specific goals, like improving accuracy, adhering to ethical guidelines, or focusing on particular tasks or domains of interest,” says Mayer.

One way this is done is through supervised fine-tuning (SFT), which requires human trainers to provide the LLM with “labelled data”. According to Mayer, this may involve “input-output pairs”, such as crafting example responses to user questions.

“This ensures the model learns how to behave in specific scenarios, such as writing polite responses or solving technical problems,” he says.
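As a rough sketch of what that labelled data can look like, here are two hypothetical input-output pairs of the kind an SFT dataset might contain (the exact format varies between labs; these examples are invented):

```python
import json

# Hypothetical "input-output pairs": a user prompt and a human-written
# target response demonstrating the desired behaviour.
sft_examples = [
    {
        "prompt": "My order arrived damaged. What should I do?",
        "response": "I'm sorry to hear that. Please contact support with your "
                    "order number and a photo so they can arrange a replacement.",
    },
    {
        "prompt": "Explain recursion in one sentence.",
        "response": "Recursion is when a function solves a problem by calling "
                    "itself on a smaller version of the same problem.",
    },
]

# Datasets like this are commonly stored one JSON object per line (JSONL).
with open("sft_data.jsonl", "w") as f:
    for example in sft_examples:
        f.write(json.dumps(example) + "\n")
```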

“Chain-of-thought (CoT) prompting is sometimes incorporated to teach the model to reason step-by-step, breaking down complex tasks into logical sequences.”
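A chain-of-thought training example looks much like the pairs above, except the target response spells out its reasoning step by step. Again, this is an invented illustration:

```python
# A hypothetical chain-of-thought (CoT) training example: the target
# response shows intermediate reasoning steps before the final answer.
cot_example = {
    "prompt": "A book costs $8 and a pen costs $2. "
              "How much do 3 books and 2 pens cost?",
    "response": (
        "Step 1: 3 books cost 3 x $8 = $24.\n"
        "Step 2: 2 pens cost 2 x $2 = $4.\n"
        "Step 3: $24 + $4 = $28.\n"
        "Answer: $28."
    ),
}
print(cot_example["response"])
```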

Finally, reinforcement learning adds another layer of refinement.

“One common method is reinforcement learning from human feedback (RLHF), where humans rank the quality of the model’s responses,” Mayer says.

“These rankings guide the model to produce more helpful, accurate, and contextually appropriate outputs over time. In this process, the model generates responses, receives feedback, and updates its parameters to align with what users or evaluators find most useful.”
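In practice, those human rankings are often used to train a separate “reward model” that scores responses. A common formulation in RLHF work penalises the reward model whenever the human-preferred response does not score higher than the rejected one. A minimal sketch of that pairwise loss, with made-up scores:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: small when the human-preferred response scores
    higher than the rejected one, large otherwise."""
    gap = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# Made-up reward-model scores for two responses to the same prompt,
# where human evaluators ranked the first response as better.
print(preference_loss(2.0, 0.5))  # ~0.20 -- scores agree with the ranking
print(preference_loss(0.5, 2.0))  # ~1.70 -- scores disagree, larger loss
```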

But large amounts of labelled data are needed to train an LLM into a chatbot, and it is expensive to gather. In 2023, TIME Magazine reported that OpenAI outsourced data labelling to a San Francisco-based firm, Sama, which employed workers in Kenya to do the job for less than $2 per hour.

How is the DeepSeek-R1 chatbot different?

Starting with a pre-trained general LLM, DeepSeek used a relatively small dataset to fine-tune the model.

“In some ways, they’ve automated that second step, potentially because of funding limits,” says De Silva.

“What seems to be DeepSeek’s approach is to drop the human feedback part and just do reinforcement learning.

“They’ve created a data set of problems from math, from code and from logic or reasoning, and they’ve used this to conduct the reinforcement learning.”

In reinforcement learning, a model receives a reward or penalty based on its answer, and it updates its parameters to increase the reward it earns.

According to Mayer, general rules were used to evaluate whether the answer was in the correct format, whether it made sense, and whether it matched the expected writing style.
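In other words, where RLHF relies on human rankings, this approach computes the reward automatically. A hedged sketch of what such a rule-based reward might look like for a maths problem – the rules and function names here are illustrative, not DeepSeek’s actual code:

```python
import re

def rule_based_reward(model_output, correct_answer):
    """Illustrative reward rules for a maths problem: reward a properly
    formatted final answer, with a bonus if it matches the known solution."""
    reward = 0.0
    # Format rule: the response must end with a line like "Answer: 28".
    match = re.search(r"Answer:\s*(-?\d+)\s*$", model_output)
    if match:
        reward += 0.5                        # correctly formatted
        if int(match.group(1)) == correct_answer:
            reward += 1.0                    # and actually correct
    return reward

print(rule_based_reward("Step 1: 3 x $8 = $24...\nAnswer: 28", 28))  # 1.5
print(rule_based_reward("I think it's 28", 28))                      # 0.0
```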

“Next, the best results generated by the model were added to the training data, and supervised fine-tuning (SFT) was employed to improve the model further. That is, the model generated part of its own training data,” says Mayer.
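That loop can be sketched as: sample several candidate answers, score them with the reward rules, and keep only the best as new fine-tuning data. The sketch below reuses the rule_based_reward function above; generate_candidates is a stand-in for sampling from the real model:

```python
import random

def generate_candidates(prompt, n=4):
    """Stand-in for sampling n responses from the model being trained."""
    return [f"Answer: {random.randint(26, 30)}" for _ in range(n)]

def build_self_training_data(problems):
    """Keep only high-reward model outputs as new fine-tuning examples."""
    new_sft_data = []
    for prompt, correct_answer in problems:
        candidates = generate_candidates(prompt)
        best = max(candidates, key=lambda c: rule_based_reward(c, correct_answer))
        if rule_based_reward(best, correct_answer) > 1.0:  # keep correct answers only
            new_sft_data.append({"prompt": prompt, "response": best})
    return new_sft_data

problems = [("3 books at $8 each and 2 pens at $2 each: total cost?", 28)]
print(build_self_training_data(problems))
```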

This process was repeated until DeepSeek-R1 produced responses comparable to those from OpenAI-o1. However, Mayer cautions that we don’t yet understand where DeepSeek-R1’s responses are better or worse than those of other LLMs.

Datasets used to train DeepSeek-R1 remain unknown

While the DeepSeek-R1 model has been released as open source, meaning anyone can download and run it on their own computer, the training datasets have not been disclosed – as is the case with many other commercial systems.

“Training data tends to be treated as proprietary or intellectual property, and it’s very opaque how the training data is collected, how it is pre-processed to improve the quality of what’s received by the model,” says De Silva.

“For most AI companies, their success or failure is to some extent dependent on the quality of the training data sets.

“So we don’t know exactly what data sets they used.”

Mayer adds that this can make it challenging to understand the capabilities and biases that the system may exhibit.

“Initial experiments with DeepSeek show key differences in responses compared to other models, such as ChatGPT, which suggests that the training data deviated from that used to train other models in selected areas of concern, such as sensitive political topics,” says Mayer.

Ultimately, Mayer predicts DeepSeek-R1’s release may lead to more efficient and profitable AI systems and may also help organisations with fewer resources compete with the dominant players in the AI field.

“It will be interesting to see how the well-known AI companies respond and how DeepSeek develops further in the coming months,” he adds.
