BM25 vs. Sesame Street: assessing Sentence-BERT models for Information Retrieval within the Brazilian legislative scenario

Keywords: information retrieval, legislative documents, language models, BERT, BM25

Abstract

BERT-based models have been widely used, becoming the state of the art for many Natural Language Processing and Information Retrieval tasks. The Sentence-BERT architecture allowed these models to be easily used for semantic document search, as it generates contextual embeddings that can be compared using similarity measures. To further investigate the application of BERT-based models to Information Retrieval, this work assessed 12 publicly available Sentence-BERT models for document retrieval within the Brazilian legislative scenario. Two BM25 variants were used as baselines: Okapi BM25 and BM25L. BM25L achieved better results, with statistical significance, even when the documents were not preprocessed, while only one language model, fine-tuned on Brazilian legislative data, reached comparable performance, and only on one of the three datasets used.
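The comparison the abstract describes can be illustrated in a few lines of Python. The sketch below, which is not the paper's actual pipeline, ranks a toy corpus with both BM25 variants and with Sentence-BERT embeddings compared by cosine similarity; it assumes the rank_bm25 and sentence-transformers packages, and the multilingual checkpoint named here is chosen for illustration only, not necessarily one of the 12 models the paper evaluates.

# Lexical ranking (Okapi BM25 and BM25L) vs. semantic ranking
# (Sentence-BERT embeddings + cosine similarity). Illustrative only.
from rank_bm25 import BM25Okapi, BM25L
from sentence_transformers import SentenceTransformer, util

# Toy corpus standing in for Brazilian legislative documents.
corpus = [
    "Dispõe sobre a proteção de dados pessoais.",
    "Altera a lei de licitações e contratos administrativos.",
    "Institui o marco civil da internet no Brasil.",
]
query = "lei sobre dados pessoais"

# Lexical baselines: score documents by term overlap statistics.
tokenized_corpus = [doc.lower().split() for doc in corpus]
tokenized_query = query.lower().split()
for bm25_cls in (BM25Okapi, BM25L):
    scores = bm25_cls(tokenized_corpus).get_scores(tokenized_query)
    print(bm25_cls.__name__, scores)

# Semantic search: encode query and documents into contextual
# embeddings, then rank by cosine similarity. The checkpoint is an
# assumed example of a publicly available Sentence-BERT model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
cos_scores = util.cos_sim(query_emb, doc_emb)[0]
print("Sentence-BERT", cos_scores.tolist())

In this setup, preprocessing (lowercasing, tokenization) only affects the BM25 side; the Sentence-BERT model consumes raw text, which is why the paper can evaluate the baselines with and without preprocessing.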

Published
2025-06-30
How to Cite
Vitório, D., Souza, E., dos Santos, J. A., de Carvalho, A. C. P. L. F., Oliveira, A. L. I., & da Silva, N. F. F. (2025). BM25 vs. Sesame Street: assessing Sentence-BERT models for Information Retrieval within the Brazilian legislative scenario. Linguamática, 17(1), 17-33. https://doi.org/10.21814/lm.17.1.474
Section
Research Articles