“AI tutors” are touted as a way to revolutionize education.
The idea is that generative artificial intelligence tools (such as ChatGPT) can be adapted to any teaching style a teacher sets. The AI can guide students through questions step by step, offering hints without giving away answers. It can then provide accurate, immediate feedback tailored to each student's individual learning gaps.
Despite the enthusiasm, there is limited evidence of how well AI tutors work in educational settings, especially within structured university courses.
In our new research, we developed a custom AI tool for university law classes. We wanted to know: can it really support personalized learning, or are we expecting too much?
Our research
In 2022, we developed SmartTest, a customizable education chatbot, as part of a broader project to democratize access to AI tools in education.
Unlike generic chatbots, SmartTest is designed for educators, allowing them to embed questions, model answers and feedback prompts. This means the chatbot can ask relevant questions, provide accurate and consistent feedback, and minimize hallucinations (mistakes presented as fact). SmartTest is also instructed to use the Socratic method, encouraging students to think rather than spoon-feeding them answers.
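SmartTest's implementation is not public, but the configuration described above can be sketched in outline: an educator's question and model answer are combined with a standing Socratic instruction into the prompt sent to the underlying model. The function name and wording below are illustrative assumptions, not SmartTest's actual code.

```python
# Hypothetical sketch of an educator-configured tutoring prompt.
# All names and wording here are assumptions for illustration.

def build_tutor_prompt(question: str, model_answer: str) -> str:
    """Combine an educator's question and model answer into a system
    prompt instructing the model to tutor Socratically, grounding its
    feedback in the model answer without revealing it."""
    return (
        "You are a Socratic law tutor. Guide the student toward the "
        "answer with hints and follow-up questions; never reveal the "
        "model answer directly.\n\n"
        f"Question for the student: {question}\n"
        f"Model answer (for checking only, do not disclose): {model_answer}"
    )

prompt = build_tutor_prompt(
    "Is the accused in this scenario guilty of theft?",
    "Yes: the accused dishonestly appropriated property belonging to another.",
)
```

Anchoring the model to an educator-supplied answer in this way is one common approach to keeping feedback consistent and reducing hallucinations, though, as our results below show, it does not eliminate them.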
We trialed SmartTest in 2023 in a criminal law course (which one of us coordinated) at the University of Wollongong, across five test cycles.
Each cycle varied in complexity. The first three cycles used short fictional criminal law scenarios (for example, is a person in a given scenario guilty of theft?). In the final two cycles, we used simple short-answer questions (for example, what is the maximum sentencing discount for a guilty plea?).
An average of 35 students interacted with SmartTest in each cycle, across several criminal law tutorials. Participation was voluntary and anonymous, and students interacted with SmartTest on their own devices for up to ten minutes per session. Students' conversations with SmartTest (their attempts to answer questions and the immediate feedback they received from the chatbot) were recorded in our database.
After the final test cycle, we surveyed students about their experiences.
What we found
SmartTest showed promise in guiding students and helping them identify gaps in their understanding.
However, in the first three cycles (the scenario-based questions), between 40% and 54% of conversations contained at least one instance of inaccurate, misleading or false feedback.
When we shifted to the much simpler short-answer format in cycles four and five, the error rate dropped substantially, to between 6% and 27% of conversations. But even these best-performing cycles contained occasional errors. For example, SmartTest would sometimes confirm an incorrect answer before providing the correct one.
An important revelation was the effort required to make the chatbot work effectively in testing. Far from being a time-saving silver bullet, integrating SmartTest demanded extensive prompt engineering and rigorous manual checking by educators (in this case, us). This paradox raises questions about the practical benefit of tools promoted as labor-saving that demand significant labor from already time-poor educators.
Inconsistency is a core issue
SmartTest's behavior was also unpredictable. Under identical conditions, it would sometimes provide excellent feedback, and at other times incorrect, confusing or misleading information.
For an educational tool responsible for supporting students' learning, this raises serious concerns about reliability and trustworthiness.
To assess whether newer models improve performance, we replaced the generative AI model powering SmartTest (ChatGPT-4) with newer models, such as ChatGPT-4.5, released in 2025.
We tested these models by replaying instances from our study where SmartTest had provided inadequate feedback. The newer models were not consistently better than the older ones: their responses were sometimes more accurate or educationally useful, and sometimes less so. Newer, more advanced AI models therefore do not automatically translate into better educational outcomes.
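The comparison described above can be sketched as a simple replay harness: re-run each recorded failure case through a candidate model and count how often its feedback is judged inadequate. The data, responder stubs and judging rule below are invented for illustration; a real harness would call each model's API and use human (or rubric-based) judgment.

```python
# Illustrative replay-evaluation sketch; the cases, stub responders
# and adequacy rule are hypothetical stand-ins, not real study data.

def error_rate(cases, respond, is_adequate):
    """Replay each recorded case through `respond` and return the
    fraction of responses judged inadequate by `is_adequate`."""
    failures = sum(1 for case in cases if not is_adequate(respond(case)))
    return failures / len(cases)

cases = ["theft scenario", "sentencing discount", "burglary scenario"]

# Stub responders standing in for an older and a newer model.
older_model = {"theft scenario": "hint", "sentencing discount": "wrong",
               "burglary scenario": "hint"}.get
newer_model = {"theft scenario": "wrong", "sentencing discount": "hint",
               "burglary scenario": "hint"}.get

adequate = lambda reply: reply == "hint"

print(error_rate(cases, older_model, adequate))  # each stub errs on 1 of 3 cases
print(error_rate(cases, newer_model, adequate))
```

In this toy data both models have the same overall error rate while failing on different cases, which mirrors the pattern we observed: upgrading the model shifts the errors around rather than reliably removing them.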
What does this mean for students and teachers?
The implications for students and university staff are mixed.
Generative AI may support low-stakes, formative learning activities. However, in our study it could not deliver the reliability, nuance and subject-matter depth required for many educational contexts.
On the positive side, our findings show students appreciated SmartTest's immediate feedback and conversational tone. Some said it reduced their anxiety and made them more comfortable expressing uncertainty. But this benefit comes with a catch: incorrect or misleading answers can reinforce misunderstandings just as easily as clarify them.
Most students (76%) preferred having access to SmartTest over having no opportunity to practice questions. However, when asked to choose between immediate feedback from AI and feedback from human tutors delivered more than a day later, only 27% preferred the AI. Almost half preferred to wait for human feedback, while the rest were indifferent.
This suggests an important nuance: students value the convenience of AI tools but still place greater trust in human educators.
Care must be taken
Our findings suggest generative AI should still be treated as an experimental educational aid.
The potential is real, but so are the limitations. Relying heavily on AI without rigorous evaluation risks undermining the very educational outcomes we aim to enhance.