Hybrid Question-Answering System: A FAISS and BM25 Approach for Extracting Information from Technical Document

Özlem Hakdağlı

doi:10.56038/oprd.v5i1.535

Research Article · Open Access · Orclever Native

Özlem Hakdağlı¹

¹Teracity Yazılım Teknolojileri A.Ş.

Published: December 31, 2024

Vol. 5, No. 1 · pp. 226–237

Abstract

This study presents a hybrid question-answering system developed to speed up access to information in corporate technical documents and to generate appropriate responses to user queries. The system combines dense vector-based retrieval (FAISS) with sparse text-based retrieval (BM25) and integrates both with the XLM-RoBERTa Large model. Evaluations on a dataset of 23 technical documents showed that the system responds effectively to both semantic and keyword-based queries. The approach enables fast and accurate access to information in technical documents and improves the efficiency of corporate knowledge management processes.
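The combination of dense and sparse retrieval described in the abstract can be illustrated with a minimal, standard-library-only sketch. Several parts are assumptions rather than details from the article: the abstract does not say how the two score lists are fused, so a weighted sum of min-max normalized scores (the hypothetical `alpha` parameter) is used here; the `embed` function is a toy hashed-trigram embedding standing in for a real sentence encoder; and the exhaustive inner-product search stands in for a FAISS index. The XLM-RoBERTa answer-extraction step is omitted entirely.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse retrieval: Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def embed(text, dim=256):
    """Toy embedding: hashed character trigrams, L2-normalized.
    Hypothetical stand-in for a real sentence encoder."""
    vec = [0.0] * dim
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        bucket = zlib.crc32(lowered[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dense_scores(query, docs):
    """Dense retrieval: exhaustive inner product over normalized vectors
    (cosine similarity); a vector index such as FAISS would replace this loop."""
    q = embed(query)
    return [sum(a * b for a, b in zip(q, embed(d))) for d in docs]


def min_max(values):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, alpha=0.5):
    """Return document indices, best first. alpha weights dense against sparse."""
    dense = min_max(dense_scores(query, docs))
    sparse = min_max(bm25_scores(query, docs))
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)


# Illustrative documents, not taken from the article's dataset.
docs = [
    "Reset the router by holding the power button for ten seconds.",
    "The database backup runs nightly at two in the morning.",
    "Firmware updates are installed from the admin panel.",
]
ranking = hybrid_rank("how do I reset the router", docs)
print([docs[i] for i in ranking])
```

Setting `alpha=0.0` reduces the ranking to pure BM25 and `alpha=1.0` to pure dense retrieval, which makes it easy to compare how each component handles keyword-heavy versus paraphrased queries.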

Keywords

Information Extraction, Question-Answering Systems, FAISS, BM25, Technical Documents, Corporate Knowledge Management

Cite This Article

Hakdağlı, Ö. (2024). Hybrid Question-Answering System: A FAISS and BM25 Approach for Extracting Information from Technical Document. *Orclever Proceedings of Research and Development*, 5(1), 226-237. https://doi.org/10.56038/oprd.v5i1.535

Bibliographic Info

Journal: Orclever Proceedings of Research and Development

Volume: 5

Issue: 1

Pages: 226–237

Published: December 31, 2024

eISSN: 2980-020X