OpenAI's o1-preview: AI Diagnosis Outperforms Doctors

A new study suggests that OpenAI's o1-preview AI system may outperform human doctors in diagnosing complex medical cases. A team from Harvard Medical School and Stanford University conducted a comprehensive medical diagnosis test on o1-preview, revealing significant improvements over earlier versions.

The study found that o1-preview achieved a correct diagnosis rate of 78.3% across all tested cases. In a direct comparison on 70 specific cases, its accuracy was even higher at 88.6%, significantly surpassing its predecessor GPT-4's 72.9%. o1-preview's medical reasoning was also impressive. On the R-IDEA scale, a rubric for assessing the quality of medical reasoning, the AI system earned a perfect score in 78 of 80 cases. By comparison, experienced doctors earned a perfect score in only 28 cases, and medical residents in just 16.
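
For context on how such figures arise, a headline accuracy number like this is simply the fraction of cases with a correct diagnosis. The minimal Python sketch below uses hypothetical per-case results rather than the study's raw data; 62 correct answers out of 70 cases, for instance, rounds to the reported 88.6%.

```python
# Minimal sketch: diagnostic accuracy as the share of correctly solved cases.
# The per-case results below are hypothetical and are not the study's raw data.

def accuracy(results):
    """Return the percentage of cases marked as correctly diagnosed."""
    return 100 * sum(results) / len(results)

# 62 correct diagnoses out of 70 cases rounds to 88.6%.
hypothetical_results = [True] * 62 + [False] * 8
print(f"{accuracy(hypothetical_results):.1f}%")  # -> 88.6%
```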

The researchers also acknowledge that some of the test cases may have appeared in o1-preview's training data. However, when the system was tested on new cases, its performance dropped only slightly. Dr. Adam Rodman, one of the study's authors, emphasizes that although this is a benchmark study, the results have significant implications for medical practice.

o1-preview excelled at complex management cases designed by 25 experts. "Humans struggle with these difficult problems, but o1's performance is stunning," Rodman explains. On these complex cases, o1-preview scored 86%, while doctors using GPT-4 scored only 41% and doctors relying on traditional tools only 34%.

However, o1-preview is not without flaws. Its probability assessments did not improve significantly: when estimating the likelihood of pneumonia, for example, it gave a 70% probability, far above the 25%-42% range supported by the scientific literature. The researchers found that o1-preview performs exceptionally well on tasks requiring critical thinking but struggles with more abstract challenges such as estimating probabilities.
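
One way to make that calibration gap concrete is to check whether an estimated probability falls inside a reference range drawn from the literature. The sketch below is purely illustrative: the numbers come from the pneumonia example above, and the helper function is our own, not something from the study.

```python
# Illustrative calibration check: does a model's probability estimate fall
# inside a reference range taken from the literature? Values come from the
# pneumonia example above; the function name is hypothetical, not the study's.

def within_reference(estimate, low, high):
    """Return True if the estimated probability lies inside [low, high]."""
    return low <= estimate <= high

o1_estimate = 0.70               # o1-preview's estimated likelihood of pneumonia
reference_range = (0.25, 0.42)   # range supported by the scientific literature

print(within_reference(o1_estimate, *reference_range))  # -> False: an overestimate
```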

Furthermore, o1-preview usually gives detailed answers, which may have contributed to its high scores. The study also evaluated only o1-preview working on its own and did not assess how effective it is when collaborating with doctors. Some critics pointed out that the diagnostic tests o1-preview suggests are often costly and impractical.

Although OpenAI has since released newer o1 and o3 models, which perform well on complex reasoning tasks, these more powerful models have not resolved critics' concerns about practical applications and cost. Rodman calls for better evaluation methods for medical AI systems that capture the complexity of real-world medical decision-making. He emphasizes that the study does not mean doctors can be replaced; actual medical care still requires human involvement.

![image.png](https://www.qewen.com/wp-content/uploads/2024/12/1735118819-20241225092659-676bcfe320514.png)

The paper: [https://arxiv.org/abs/2412.10849](https://arxiv.org/abs/2412.10849)

🌟 o1-preview achieved 88.6% diagnostic accuracy in the 70-case comparison, surpassing GPT-4's 72.9%.
🧠 In medical reasoning, o1-preview scored full marks in 78 out of 80 cases, far outperforming doctors.
💰 Despite its strengths, o1-preview's high costs and impractical test recommendations in real-world applications need to be addressed.
