Introduction

In dentistry, the interpretation of radiological images is one of the most critical stages of the clinical decision-making process. Yet this process is not always fully consistent: that the same periapical radiograph can be read differently by different clinicians is a well-documented phenomenon in the literature. Especially for findings such as caries, periapical lesions and periodontal bone loss, inter-clinician variability shows that diagnostic quality depends not only on the image but also on experience, attention level and interpretation standards.

In this context, AI-based image analysis systems are drawing increasing interest in dentistry. But the real question here is not whether these systems "exist" but what the current performance levels actually say. The data show AI has reached clinically meaningful accuracy in some areas; however, it still faces significant limits around trust, integration, interpretability and workflow fit. The topic should therefore be evaluated less through technological excitement and more through diagnostic performance and clinical applicability.

1. Current Performance: What Level Has AI Reached?

Systematic reviews and meta-analyses published in recent years show that AI in dental radiological image analysis has moved beyond the experimental stage. An umbrella review focused on caries detection and covering 137 primary studies reported a pooled sensitivity of 0.85, specificity of 0.90 and AUC of 0.86. These figures indicate meaningful diagnostic capacity, especially at the screening and preliminary-evaluation level.

Bitewing approximal caries: sensitivity 0.94, specificity 0.91
Source: Approximal caries meta-analysis, ScienceDirect 2024

A 2024 meta-analysis on the detection of approximal caries in bitewing radiographs offers even stronger results: sensitivity of 0.94 and specificity of 0.91. This shows AI can deliver high performance in ruling out healthy surfaces and flagging likely caries areas. However, the wide range in positive predictive value across studies suggests that detected findings still require clinician confirmation.
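
Why does PPV swing across studies while sensitivity and specificity stay stable? Because PPV also depends on how common caries is in the examined population. A minimal Python sketch, using the pooled figures above with illustrative prevalence values (the prevalences are assumptions, not taken from the meta-analysis), makes the effect concrete:

```python
# Why positive predictive value (PPV) varies across studies even when
# sensitivity and specificity are fixed: it also depends on prevalence.
# SENS/SPEC are the pooled figures quoted above; prevalences are illustrative.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Bayes' rule: P(caries | positive finding)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

SENS, SPEC = 0.94, 0.91  # pooled bitewing approximal-caries figures

for prev in (0.05, 0.15, 0.40):
    print(f"prevalence {prev:.0%}: PPV = {ppv(SENS, SPEC, prev):.2f}")
# prevalence 5%: PPV = 0.35
# prevalence 15%: PPV = 0.65
# prevalence 40%: PPV = 0.87
```

At screening-level prevalence, roughly two out of three positive flags can be false alarms even with these strong pooled figures, which is exactly why detected findings still need clinician confirmation.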

The picture is more heterogeneous for periapical lesions. According to systematic review data, the accuracy of CNN-based models ranges from 70% to 99.65%. The wide range in sensitivity and specificity shows that performance in this area is highly sensitive to variables such as dataset quality, labelling standards and model architecture. In periodontal bone loss, performance is more stable: in large-scale studies, AUC ranges from 0.884 to 0.913, and sensitivity reaches 88.8–90.7%. Still, the relatively lower specificity means the false-positive burden should not be overlooked.
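
For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity and AUC are derived from raw model outputs. The labels and scores are invented toy data, not from any of the cited studies; the point is that sensitivity and specificity depend on a chosen decision threshold, while AUC summarises ranking performance across all thresholds.

```python
# Sensitivity/specificity at one threshold, and AUC over all thresholds.
# The labels and scores below are made-up toy data for illustration.

def sens_spec(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count half): the Mann-Whitney formulation of ROC AUC."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0, 1]                   # 1 = lesion present
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]   # model confidence
se, sp = sens_spec(labels, scores, threshold=0.5)
print(f"sensitivity={se:.2f} specificity={sp:.2f} auc={auc(labels, scores):.2f}")
# sensitivity=0.75 specificity=0.75 auc=0.94
```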

2. Why Is Inter-Clinician Variability So Important?

To understand AI performance, it must be compared not only to a theoretical "ideal diagnosis" but to actual clinical practice. Here, inter-clinician variability is one of the key references. In a reliability study where 14 dentists evaluated 150 radiographs, the inter-rater Cohen's kappa for caries was 0.659–0.704. For periapical lesions this dropped to 0.611–0.643, and for periodontal bone loss to 0.454–0.482.
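
Cohen's kappa expresses how much two raters agree beyond what chance alone would produce (1.0 is perfect agreement, 0 is chance level). A small Python sketch shows the computation; the per-tooth ratings below are hypothetical, invented for illustration:

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
# The ratings below are hypothetical illustration data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # chance agreement: both raters independently pick the same category
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# hypothetical per-tooth calls by two clinicians: 1 = caries, 0 = sound
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.58 for this toy data
```

Read against common benchmark scales, the reported 0.454–0.482 for bone loss is barely "moderate" agreement: a substantial share of the raw agreement observed between clinicians is what chance alone would give.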

The implication is clear: radiographic interpretation does not yield high absolute agreement, especially in areas such as bone loss and periapical inflammation. More experienced clinicians have been shown to deliver more consistent evaluations. AI systems therefore matter not only as tools for detecting the correct finding but also as a standardisation layer that can reduce inter-clinician variability.

"The better question is not 'does AI outperform the expert?' but 'does AI reduce variation in clinical practice?'"

This contribution is particularly meaningful for less experienced clinicians. The literature shows that AI can partially close the experience gap in lesion detection and improve decision confidence for some pathologies. For this reason, the performance debate should not be reduced to "does AI surpass the expert?" The better question is: "Does AI reduce variation in clinical practice?"

3. Diagnostic Error, Delay and Reporting Quality

Diagnostic error in radiological analysis is not just a theoretical performance issue; it has direct clinical consequences. Missing early caries, periapical lesions or periodontal bone loss can lead to delays in treatment planning, more invasive interventions and deterioration of patient experience. For this reason, reporting quality is as important as diagnostic accuracy.

In a 2025 clinical audit of dental students' intraoral periapical radiology reports, only 60% of cases documented caries localisation, restoration details were documented with 42% accuracy, and the retake rate reached 65%. These findings show that the problem is not only image recognition but also standardised reporting and process discipline.

This is exactly where AI has the potential to add value: beyond flagging findings, systems can standardise reports, surface missing fields and act as a second reader in the workflow. This becomes especially critical in busy clinics or in teams with heterogeneous training levels.

4. AI: Independent Diagnostic Tool or Decision Support System?

Current data suggests AI's strongest position is not as an independent decision-maker but as a complementary decision support system. Systematic reviews emphasise that in caries detection in particular, AI is strong at ruling out healthy surfaces but positive findings require expert confirmation. This framing points to a more realistic clinical use model.

The regulatory side supports this trend. The fact that the largest share of FDA-cleared dental AI/ML devices between 2011 and 2024 is concentrated in oral radiology shows the field is maturing. However, an increasing number of approved devices does not mean large-scale clinical adoption is occurring at the same pace. The gap between technical accuracy and day-to-day use is determined by factors such as integration, training and trust.

For this reason, AI's strongest role in the clinic appears to be prioritisation, second review, flagging areas that are hard to see, and supporting reporting standards — rather than replacing the clinician.

5. The Real Limits: Trust, Transparency and Integration

3.8%
Proportion of dentists who trust fully automated AI diagnostic decisions. Source: J. Medicine and Life / PMC, 2025

The key issue slowing adoption is not only performance but the trust architecture. In surveys, only 3.8% of dentists trust fully automated AI diagnostic decisions. The vast majority of respondents prefer that the final diagnosis be made by a human clinician. This data does not show that clinicians are entirely closed to AI, but that they place it on the periphery of the decision rather than at its centre.

There are several reasons for this hesitation. The first is the "black box" problem: deep learning models often cannot clearly show how they arrived at a conclusion. The second is integration: the lack of standard compatibility between imaging devices, PACS-style systems and patient management software makes it difficult to translate technical value into the workflow. The third is the regulatory and liability framework: when software is regulated as a medical device, it remains unclear where responsibility lies when an error occurs.
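
To make the integration point concrete: even before any workflow question, clinics typically need glue code just to move an image from a PACS export into a model. The sketch below uses the real pydicom library; the model object and its predict() interface are hypothetical placeholders, and the file name is illustrative.

```python
# Sketch of the glue code needed to get a PACS-exported radiograph into an
# AI model. pydicom is a real library; `caries_model` is a hypothetical stand-in.
import numpy as np
import pydicom

def load_radiograph(path: str) -> np.ndarray:
    """Read a DICOM file and normalise pixel values to [0, 1]."""
    ds = pydicom.dcmread(path)
    img = ds.pixel_array.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

img = load_radiograph("bitewing_001.dcm")   # file name is illustrative
# findings = caries_model.predict(img)      # hypothetical model interface
# The hard part in practice is not this call but routing the result back
# into the PACS / practice-management workflow in a standardised form.
```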

Therefore, the current picture points less to the conclusion that "AI is not good enough" and more to the conclusion that even where it is good enough, how it can be used reliably and in an integrated way still needs to be clarified.

Conclusion

AI systems for radiological image analysis in dentistry have reached clinically meaningful performance levels, especially in caries detection, periapical lesion analysis and periodontal bone loss measurement. Sensitivity, specificity and AUC data are strong in many subdomains. Even so, this performance offers a better fit for decision-support systems that reinforce the human clinician rather than for fully autonomous and independent diagnosis.

The real value lies not in AI "solving" the invisible alone, but in reducing inter-clinician variability, strengthening reporting standards and creating a second-reader layer especially for less experienced users. In the coming period, the decisive question will be less about the performance race and more about how these systems are positioned in a trustworthy, explainable and workflow-integrated way.