AI agents have become an integral part of real enterprise workflows, but they remain unreliable, failing one in three attempts on structured benchmarks. According to Stanford HAI’s latest AI Index report, this gap between capability and reliability is the main operational challenge facing IT leaders in 2026.
AI researcher Ethan Mollick calls this uneven, unpredictable performance the “jagged frontier”: AI excels at some tasks and then falls short abruptly at others. For example, models can perform well on challenging problems such as those of the International Mathematical Olympiad, yet still struggle with simple tasks like telling time.
In 2025, AI models advanced significantly across a range of fields. Frontier models showed a 30% improvement on Humanity’s Last Exam, a difficult test designed to challenge AI models. Leading models also excelled in multi-step reasoning tasks and knowledge-based exams.
Despite these advancements, AI models still face challenges in areas such as cybersecurity and video generation, and they continue to struggle with basic perception tasks and multi-step reasoning workflows. Benchmarking AI progress has also become harder because of reliability issues and discrepancies between developer-reported results and independent testing.
As AI capabilities surge, reliability lags behind, raising concerns about data quality and responsible AI practices. The need for transparency, reliability, and accountability in how AI technologies are developed and deployed is growing accordingly.
AI is evolving rapidly and reaching more people than ever before. However, the gap between what AI can do in a controlled setting and how it performs in the real world remains a significant challenge for developers and IT leaders in 2026. As the field advances, ensuring that these systems are reliable and transparent will be crucial to integrating them successfully into enterprise workflows.
