Answer Span Extraction With Transformer Neural Networks

 

Question answering is an essential part of various applications, including emerging reading assistants. Since answers generated by Large Language Models (LLMs) can sometimes be hallucinated, users need evidence to verify the information provided. However, reading through entire retrieved passages to find answers is time-consuming. In contrast, highlighting answers within those passages is a more user-friendly solution. This is why I was excited to explore this topic in my Master’s thesis.

Answer Span Extraction

The Dataset

 

For my project, I chose the English version of the BiPaR dataset, which contains 14,668 question-answer-passage triples. The passages were collected from three English novels and three Chinese novels translated into English. My analysis considered only the training and development sets, excluding the publicly available test set so as not to influence the subsequent learning process; its results aligned with the statistics reported by Jing et al. (2019), which cover all the data.

Most of the passages are taken from The Duke of the Mount Deer (53%) by Jin Yong, Harry Potter (22%) by J. K. Rowling, and The Three-Body Problem (14%) by Liu Cixin. The proportions of the three other novels, The Great Gatsby (7%) by F. Scott Fitzgerald, The Old Man and the Sea (2%) by Ernest Hemingway, and Demi-Gods and Semi-Devils (2%) by Jin Yong, are relatively small. The variety of writing styles from these different authors is what drew me to this dataset.

My analysis also confirmed that 15.80% of the questions are why questions and 9.58% are how questions in all their forms, as shown in the pie chart below. Such questions, which are often non-factoid, are usually more challenging to answer.

Question Categories

I also determined answer categories using the spaCy library. This categorization was based on manually developed rules, applied in a fixed order, that check whether specific part-of-speech (POS) tags, syntactic dependency labels, or named entity types occur in the answers. The resulting categorization was not always accurate because the en_core_web_lg model from spaCy was used without fine-tuning, owing to the lack of labels. As illustrated in the pie chart below, 64% of the answers in BiPaR are clauses, which are longer and harder to identify than simple entities (11.92%) or numerical data (13.74%).

Answer Categories
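To make the rule-based categorization more concrete, here is a minimal sketch of how such rules could look with spaCy. The category names, the order of the rules, and the exact conditions are simplified assumptions for illustration; they do not reproduce the full rule set from the thesis.

import spacy

# The en_core_web_lg model was used in the thesis; en_core_web_sm would also
# work for this illustration. The rules below are simplified assumptions.
nlp = spacy.load("en_core_web_lg")

def categorize_answer(answer: str) -> str:
    """Assign a coarse answer category based on spaCy annotations.
    The rules are checked in a fixed order; the first match wins."""
    doc = nlp(answer)

    # Numerical data: any number-like token or a quantity/date entity.
    if any(tok.like_num for tok in doc) or any(
        ent.label_ in {"CARDINAL", "QUANTITY", "DATE", "TIME", "MONEY"} for ent in doc.ents
    ):
        return "numerical data"

    # Simple entity: the answer is covered by a single named entity.
    if len(doc.ents) == 1 and len(doc.ents[0]) == len(doc):
        return "entity"

    # Clause: contains a verb together with its own subject.
    if any(tok.pos_ in {"VERB", "AUX"} for tok in doc) and any(
        tok.dep_ in {"nsubj", "nsubjpass"} for tok in doc
    ):
        return "clause"

    # Fallback: treat everything else as a (noun) phrase.
    return "phrase"

print(categorize_answer("in the summer of 1922"))      # numerical data
print(categorize_answer("because he feared the sea"))  # clause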

The difference between factoid and non-factoid questions is that the former ask for simple facts, which can be described in a few words (Jurafsky and Martin, 2024, p. 293), while the latter require long-form answers, such as explanations or opinions (Bolotova et al., 2022, p. 1196). Who, what, when, and where questions are typically associated with the factoid group, while how and why questions are usually linked to the non-factoid group (Dzendzik et al., 2021, p. 8785), although counterexamples exist (Bolotova et al., 2022, p. 1197). I mapped the question categories in the BiPaR dataset to their corresponding answer categories to identify any noticeable tendencies. I also examined the distribution of answer lengths across each question category. The results of this analysis are shown in the figure below. The who, whom, whose, and where questions in the dataset typically have short answers, mostly consisting of entities or noun phrases. The answers to why questions are mainly clauses and tend to be quite long.

Distribution of Answer Categories Within Question Categories
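For readers who want to reproduce this kind of analysis, the snippet below computes answer-length statistics per question category with pandas. The column names and the example rows are hypothetical; the actual analysis used the BiPaR training and development sets.

import pandas as pd

# Hypothetical rows with the fields used in the analysis above.
rows = [
    {"question_category": "who", "answer": "the old fisherman"},
    {"question_category": "why", "answer": "because he had caught nothing for many days at sea"},
    {"question_category": "where", "answer": "in the harbor"},
]
df = pd.DataFrame(rows)

# Answer length in whitespace-separated tokens.
df["answer_length"] = df["answer"].str.split().str.len()

# Length statistics per question category.
print(df.groupby("question_category")["answer_length"].describe())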

 

Evaluation Metrics

 

To evaluate the performance of models on the span prediction task, I used two metrics proposed by Rajpurkar et al. (2016): Exact Match (EM) and the macro-averaged F1 score, which incorporates precision and recall. To compute the F1 score, predicted and ground truth answers were treated as bags of words, as illustrated in the example below. Before calculating the evaluation metrics, articles, punctuation, and extra whitespace were removed from answers, which were then converted to lowercase (see the implementation for the SQuAD dataset).

Evaluation Metrics for Answer Span Extraction
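The sketch below shows how both metrics can be computed after the normalization steps described above; it mirrors the logic of the official SQuAD evaluation script, but the function names and the toy examples are my own.

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    # Predicted and ground truth answers are treated as bags of words.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the old fisherman", "old fisherman"))             # 1.0 after normalization
print(round(f1_score("he was afraid of the sea", "afraid of the sea"), 2))  # 0.75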

 

Fine-Tuning

 

In the first approach, I fine-tuned several Transformer encoders (Vaswani et al., 2017), including BERT, RoBERTa, and ALBERT. The figure below visualizes supervised fine-tuning for span-based question answering. Only the start vector, which recognizes the first token of the answer, is shown; the end vector, which detects the last token of the answer, is used analogously. The weights of both vectors are learned from scratch during fine-tuning, while the weights of the pretrained layers are further adjusted for the task at hand. Typically, the question and passage are separated by a special token, such as [SEP], as shown in the figure. Bidirectional self-attention is computed between the question and the passage and vice versa. However, only the passage is relevant for extracting answer spans and is therefore processed through the additional output layer. The dot product between each contextual token embedding of the passage and the start vector is calculated, and the softmax function is applied to obtain a probability distribution over the passage tokens. The multi-class cross-entropy loss is then computed, where the ground truth is 1 for the token at which the answer begins and 0 for all other tokens. The loss for the end of the answer is determined in the same manner, and the negative log probabilities of the correct start and end tokens are summed (Devlin et al., 2019, p. 4176).

Fine-Tuning an Encoder for Extractive Question Answering
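As a minimal sketch of these mechanics, the snippet below runs a question-passage pair through a span-prediction model from the transformers library and computes the summed cross-entropy loss for the start and end positions. The checkpoint name, the toy example, and the ground-truth positions are placeholders; the thesis used its own fine-tuned BERT, RoBERTa, and ALBERT models.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Placeholder checkpoint; any encoder with a span-prediction head works here.
model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Why did the old man go far out to sea?"
passage = "The old man went far out to sea because he had caught nothing for eighty-four days."

# The question and passage are joined with the model's separator token(s).
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One score per token for the answer start and end (dot products with the
# start and end vectors); softmax inside cross_entropy turns them into
# probability distributions over the passage tokens.
start_logits, end_logits = outputs.start_logits, outputs.end_logits

# During fine-tuning, the ground-truth start and end token indices come from
# the dataset; the values here are hypothetical and only for illustration.
start_position = torch.tensor([12])
end_position = torch.tensor([20])

# Negative log probabilities of the correct start and end tokens, summed.
loss = F.cross_entropy(start_logits, start_position) + F.cross_entropy(end_logits, end_position)
print(loss.item())

# At inference time, the answer span is read off the most probable positions.
start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))

Note that library implementations may average the two losses instead of summing them; computing the loss manually keeps the sketch aligned with the description above.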

In addition, I conducted several experiments to improve the fine-tuning results, since the training data of 11,668 question-answer-passage triples was relatively small. These experiments were tracked with Weights & Biases (wandb), and the results are publicly available on my profile. Intermediate fine-tuning on the first version of the SQuAD dataset, which includes 87,599 training examples, followed by fine-tuning on BiPaR, improved results compared to learning from the BiPaR training set alone. I also performed continued pretraining with the Masked Language Modeling (MLM) objective and data augmentation with GPT-3.5, neither of which yielded noticeable improvements.
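For completeness, here is a minimal sketch of continued pretraining with the MLM objective, assuming the BiPaR passages are available as a plain list of strings. The checkpoint name, the placeholder passages, and the hyperparameters are illustrative rather than the values used in the thesis.

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder passages; in the thesis, these would be the BiPaR passages.
passages = [
    "The old man had gone many days without taking a fish.",
    "The boy had never imagined such a strange and splendid place.",
]

model_name = "roberta-large"  # the same encoder later fine-tuned for QA
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = Dataset.from_dict({"text": passages}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Randomly masks 15% of the tokens in each batch and uses them as labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-continued", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the resulting checkpoint is then fine-tuned for span extraction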

 

Prompting

 

While fine-tuning typically requires several thousand labeled examples to achieve satisfactory results, prompting decoder-only LLMs yields meaningful outcomes with zero or only a few examples. The challenge lies in crafting natural language prompts that elicit correct answers from LLMs. The figure below illustrates how a decoder operates during inference for a given passage, question, and prefix that introduces the answer to be generated step by step, or token by token. The token sampled at each time step is fed back into the model as input for the next step, resulting in autoregressive text generation (Jurafsky & Martin, 2024, p. 234). Unlike in encoders, the self-attention in decoders is masked so that only the preceding context is visible at each position. Although no predictions are required for the passage and question on the left side of the figure, the keys and values of their tokens are still needed to calculate the attention of the generated answer tokens. In practice, the keys and values from previous steps are computed once, cached, and reused during the attention calculation of subsequent steps (the key-value cache).

Autoregressive Answer Generation With a Decoder
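The sketch below makes this token-by-token loop with a key-value cache explicit, using a small GPT-2 model so that it runs on modest hardware; the thesis models were much larger decoders, and the prompt is only a toy example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small decoder stands in for the much larger LLMs used in the thesis.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Passage: The old man went out to sea.\nQuestion: Where did the old man go?\nAnswer:"
generated = tokenizer(prompt, return_tensors="pt").input_ids

past_key_values = None
with torch.no_grad():
    for _ in range(10):
        if past_key_values is None:
            # First step: the whole prompt is processed and its keys/values cached.
            outputs = model(input_ids=generated, use_cache=True)
        else:
            # Later steps: only the newest token is fed in; the cached keys and
            # values of the prompt and earlier answer tokens are reused.
            outputs = model(input_ids=generated[:, -1:], past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values

        # Greedy selection of the next token from the last position's logits.
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))

Greedy selection is used here for simplicity; the sampling strategies discussed below (temperature, top-k, top-p) would replace the argmax.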

I prompted two LLMs: one from the GPT-4 family (gpt-4-1106-preview, with an unknown number of parameters) and one from the Llama 2 family (meta-llama/Llama-2-70b-chat-hf, with 70B parameters). Since Llama 2 is available for research use in English, it could be downloaded and run on an NVIDIA A100-PCIe-80GB GPU in bwHPC after 4-bit quantization with the bitsandbytes library. This reduced the memory footprint to 35GB, compared to 280GB at full precision and 140GB at half precision (a loading and generation sketch follows the list and graphic below). I was therefore able to run multiple experiments on meta-llama/Llama-2-70b-chat-hf to improve the prompting results, as shown in the graphic below. Systematically exploring all possible parameter combinations, including prompts, was infeasible because of the near-infinite search space. Instead, one parameter was varied at each step while the others were held fixed; in subsequent steps, that parameter was set to the best value found. The BiPaR development set was used to search for the best parameter values. The following findings were valuable for my work:

– Various strategies for token selection can be used during text generation with decoders. One such strategy, temperature sampling, changes the shape of the probability distribution over the vocabulary to steer text generation toward either a more predictable or a more creative outcome, as discussed in Holtzman et al. (2020). Other strategies, such as top-k and top-p sampling, truncate the distribution by excluding rare tokens based on a fixed number of tokens or a specified probability mass, respectively (Jurafsky & Martin, 2024, pp. 235-236). For extracting exact answer spans from the given passages, a low temperature of 0.1 and a top-p value of 95% proved most suitable.

– I also experimented with various delimiters and prompt formulations. meta-llama/Llama-2-70b-chat-hf proved to be highly sensitive to wording and the choice of delimiters in prompts. As shown in the wandb report, different prompt formulations resulted in up to a 5.30% variation in recall. In terms of precision, the variation was as high as 9.87%.

– Initially, few-shot examples did not yield better results than the zero-shot prompt baseline because meta-llama/Llama-2-70b-chat-hf had already been instruction-tuned on various topics, including reading comprehension (Touvron et al., 2023, pp. 8-9). However, transforming few-shot prompting into automated dialogs led to considerable improvements. As shown in another wandb report, the F1 score and EM of the 2-shot trial in dialog format were 3.85% and 8.47% higher, respectively, than those of the zero-shot baseline.

Prompting of meta-llama/Llama-2-70b-chat-hf
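To tie these points together, here is a sketch of how meta-llama/Llama-2-70b-chat-hf can be loaded in 4-bit precision with bitsandbytes and prompted with a 2-shot dialog using a temperature of 0.1 and a top-p value of 95%. The system instruction, the placeholder passages, and the exact wording are not the prompts evaluated in the thesis, and access to the gated model weights is assumed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-chat-hf"  # gated; access must be requested

# 4-bit quantization via bitsandbytes reduces the footprint to roughly 35GB.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map="auto")

# Few-shot examples rewritten as an automated dialog: each example becomes a
# completed user/assistant turn. Passages and answers are placeholders.
messages = [
    {"role": "system", "content": "Extract the exact answer span from the passage. Do not add any words."},
    {"role": "user", "content": "Passage: <example passage 1>\nQuestion: <example question 1>"},
    {"role": "assistant", "content": "<example answer span 1>"},
    {"role": "user", "content": "Passage: <example passage 2>\nQuestion: <example question 2>"},
    {"role": "assistant", "content": "<example answer span 2>"},
    {"role": "user", "content": "Passage: <test passage>\nQuestion: <test question>"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# A low temperature and high top-p keep the output close to the passage wording.
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))

Each completed user/assistant exchange plays the role of one few-shot example, which is what turning few-shot prompting into the dialog format described above amounts to.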

 

Results

 

The following conclusions can be drawn from the figure below:

– Regarding precision, the F1 score, and EM, the large RoBERTa encoder with only 355M parameters, additionally fine-tuned on the first version of the SQuAD dataset, outperformed the significantly larger decoders.

– In terms of recall, the 4-shot prompted gpt-4-1106-preview model achieved the best results.

– Furthermore, while the recall and precision of the large RoBERTa model are nearly equal, the recall of both LLMs exceeds their precision. This suggests that generative models are more effective at identifying relevant words than excluding irrelevant ones in their responses.

– A closer examination of the F1 score across question categories reveals a slight tendency for why and how questions. In the BiPaR dataset, a large proportion of these questions are answered with longer clauses and are therefore non-factoid, as shown in the section The Dataset above. While why and how questions rank lowest in the large RoBERTa encoder's internal F1 score ranking of question categories, they rank higher for the two LLMs.

Comparison Between Fine-Tuning and Prompting Results

 

References

 

– Bolotova, V., Blinov, V., Scholer, F., Croft, W. B., & Sanderson, M. (2022). A Non-Factoid Question-Answering Taxonomy. In E. Amigó, P. Castells, & J. Gonzalo (Eds.), SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11-15, 2022, Madrid, Spain (pp. 1196–1207). ACM.
– Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
– Dzendzik, D., Foster, J., & Vogel, C. (2021). English Machine Reading Comprehension Datasets: A Survey. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8784–8804.
– Holtzman, A., Buys, J., Li, D., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. International Conference on Learning Representations (ICLR).
– Jing, Y., Xiong, D., & Zhen, Y. (2019). BiPaR: A Bilingual Parallel Dataset for Multilingual and Cross-lingual Reading Comprehension on Novels. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2452–2462.
– Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Third Edition draft).
– Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2383–2392.
– Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., . . . Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint.
– Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008.

 

Acknowledgments

 

I would like to acknowledge support by the State of Baden-Württemberg through High Performance Computing bwHPC. My numerous experiments would not have been possible without this environment.

 
