BM25 vs. Sesame Street: assessing Sentence-BERT models for Information Retrieval within the Brazilian legislative scenario

Keywords: information retrieval, legislative documents, language models, BERT, BM25

Abstract

BERT-based models have been widely used, becoming the state of the art for many Natural Language Processing and Information Retrieval tasks. The Sentence-BERT architecture allowed these models to be easily used for semantic document search, as it generates contextual embeddings that can be compared using similarity measures. To further investigate the application of BERT-based models to Information Retrieval, this work assessed 12 publicly available Sentence-BERT models for document retrieval within the Brazilian legislative scenario. Two BM25 variants were used as baselines: Okapi BM25 and BM25L. BM25L achieved better results, with statistical significance, even when the documents were not preprocessed, while only one language model, fine-tuned on Brazilian legislative data, reached comparable performance, and only on one of the three datasets used.
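The comparison the abstract describes can be illustrated in a few lines of Python. The sketch below, which is not the paper's actual pipeline, ranks a toy corpus with both BM25 variants and with Sentence-BERT embeddings compared by cosine similarity; it assumes the rank_bm25 and sentence-transformers packages, and the multilingual checkpoint named here is chosen for illustration only, not necessarily one of the 12 models the paper evaluates.

# Lexical ranking (Okapi BM25 and BM25L) vs. semantic ranking
# (Sentence-BERT embeddings + cosine similarity). Illustrative only.
from rank_bm25 import BM25Okapi, BM25L
from sentence_transformers import SentenceTransformer, util

# Toy corpus standing in for Brazilian legislative documents.
corpus = [
    "Dispõe sobre a proteção de dados pessoais.",
    "Altera a lei de licitações e contratos administrativos.",
    "Institui o marco civil da internet no Brasil.",
]
query = "lei sobre dados pessoais"

# Lexical baselines: score documents by term overlap statistics.
tokenized_corpus = [doc.lower().split() for doc in corpus]
tokenized_query = query.lower().split()
for bm25_cls in (BM25Okapi, BM25L):
    scores = bm25_cls(tokenized_corpus).get_scores(tokenized_query)
    print(bm25_cls.__name__, scores)

# Semantic search: encode query and documents into contextual
# embeddings, then rank by cosine similarity. The checkpoint is an
# assumed example of a publicly available Sentence-BERT model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
cos_scores = util.cos_sim(query_emb, doc_emb)[0]
print("Sentence-BERT", cos_scores.tolist())

In this setup, preprocessing (lowercasing, tokenization) only affects the BM25 side; the Sentence-BERT model consumes raw text, which is why the paper can evaluate the baselines with and without preprocessing.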

Published
2025-06-30
How to Cite
Vitório, D., Souza, E., dos Santos, J. A., de Carvalho, A. C. P. L. F., Oliveira, A. L. I., & da Silva, N. F. F. (2025). BM25 vs. Sesame Street: assessing Sentence-BERT models for Information Retrieval within the Brazilian legislative scenario. Linguamática, 17(1), 17-33. https://doi.org/10.21814/lm.17.1.474
Section
Research Articles