An Evaluation Framework for Fault Diagnosis Using Technical Manuals in Retrieval-Augmented Large Language Models


Published Oct 26, 2025
Sarah Lukens
Matthew Bishof
Nadir Siddiqui
Destiny West

Abstract

Fault diagnosis is a time-intensive maintenance task often reliant on the expertise of senior technicians. As this workforce ages and demand for digital tools grows, there is an increasing need to capture and automate this knowledge while maintaining the precision required for technical applications. This study introduces an evaluation-driven framework for fault code recommendation, applied to a ground vehicle diagnosis system. Two tasks were designed to reflect potential system configurations: (1) a chat-style task simulating large language model (LLM) interaction, and (2) a label-constrained task using structured fault codes from technical manuals. Multiple retrieval-augmented generation (RAG) configurations were compared against LLM-only and retrieval-only baselines. Results showed that retrieval-based methods outperformed LLM-based ones on the label-matching task, while the chat task exposed challenges in linking observations to fault codes from the manual. These results highlight the importance of aligning task design with evaluation goals and of considering retrieval-first approaches as viable alternatives to LLMs in technical language processing (TLP) applications. Beyond the experimental findings, we outline industrial lessons learned: the importance of aligning system design with use-case goals, the value of adopting evaluation-first validation, and the need to pilot LLM-based systems under realistic conditions. These lessons provide practical guidance for developing effective diagnostic support systems in industrial contexts.
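
To make the label-constrained task and the retrieval-only baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small catalogue of fault codes with short descriptions (the codes `F001` to `F004` and their text are invented for illustration), ranks them against a free-text observation with bag-of-words cosine similarity (a stand-in for whatever retriever a real system would use), and scores the ranking with a hit rate at k (an assumed metric, chosen only because it is simple to state).

```python
import math
import re
from collections import Counter

# Hypothetical fault-code catalogue: codes and descriptions are invented
# for illustration and do not come from any technical manual.
FAULT_CODES = {
    "F001": "engine cranks but fails to start",
    "F002": "engine overheats during operation",
    "F003": "brakes do not release air pressure low",
    "F004": "transmission does not shift gears",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(observation, catalogue, k=3):
    """Return the k fault codes whose descriptions best match the observation."""
    query = Counter(tokenize(observation))
    scored = [
        (cosine(query, Counter(tokenize(description))), code)
        for code, description in catalogue.items()
    ]
    # Highest similarity first; ties are broken by code for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:k]]


def hit_rate_at_k(examples, catalogue, k=3):
    """Fraction of (observation, gold_code) pairs whose gold code is in the top k."""
    hits = sum(1 for obs, gold in examples if gold in recommend(obs, catalogue, k))
    return hits / len(examples)


# Hypothetical labelled observations used only to exercise the metric.
EXAMPLES = [
    ("engine overheats on long climbs", "F002"),
    ("truck will not shift gears", "F004"),
]

if __name__ == "__main__":
    print(recommend("engine overheats on long climbs", FAULT_CODES, k=1))
    print(hit_rate_at_k(EXAMPLES, FAULT_CODES, k=1))
```

Because the output is constrained to codes that exist in the catalogue, every recommendation can be checked directly against a gold label, which is what makes an evaluation-first comparison between retrieval-only, LLM-only, and RAG configurations straightforward on this kind of task.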

How to Cite

Lukens, S., Bishof, M., Siddiqui, N., & West, D. (2025). An Evaluation Framework for Fault Diagnosis Using Technical Manuals in Retrieval-Augmented Large Language Models. Annual Conference of the PHM Society, 17(1). https://doi.org/10.36001/phmconf.2025.v17i1.4549


Keywords

LLM, RAG, TLP, NLP, technical language processing, large language models, retrieval-augmented-generation, fault diagnosis, maintenance troubleshooting, recommendation system, ground vehicle diagnosis, technical manual

Section
Industry Experience Papers
