Automatically summarising scientific publications and making them accessible to as many people as possible: this is the ambitious objective of a collaboration combining the knowledge of Juan-Manuel Torres-Moreno, from the Laboratoire Informatique d’Avignon (LIA), the computer science laboratory of Avignon Université, and the science communication skills of Sabine Louët, founder of the digital publishing company SciencePOD. It is a way of disseminating research progress on a large scale, and of battling misinformation.
How can we keep abreast of scientific knowledge in a world where research-related information is proliferating? The number of scientific studies published annually worldwide has increased from 972,000 in 1996 to 2.5 million in 2018, according to US National Science Foundation figures. However, not all of these papers include a summary, and when they do, it often contains too many terms that are inaccessible to an uninformed public. In this context, using artificial intelligence (AI) to generate summaries automatically becomes highly relevant.
This ambition to summarise texts does not solely stem from the digital publishing era. The French essayist Joseph Joubert (1754-1824) already expressed this desire in his work Pensées: “If there is a man tormented by the cursed ambition to put a whole book in a page, a whole page in a sentence, and this sentence in a word, it is me” (1).
The 1950s onwards saw the development of the scientific basis for automated solutions, with significant advances in Natural Language Processing (NLP) and information retrieval (IR). Noteworthy contributions include the founding work of the German computer scientist Hans Peter Luhn in the 1950s, and that of the British researcher Karen Spärck Jones, in 1980, devoted to the summarisation of scientific texts. Let’s not forget, of course, the work of the American Julian Kupiec, who developed even more advanced information extraction systems in the 1990s.
These advances consolidated statistical summarisation by extraction from the 2000s onwards, followed, since 2016, by what are known as neural algorithms for generative summarisation. The first approach identifies, without the need for machine learning, the most prominent sentence fragments in a text by using features that represent sentences, such as words, their co-occurrences and probabilities, as well as the document structure, the position of each sentence, and so on. It then extracts these fragments and assembles them to form a usable summary. By contrast, neural algorithms (or deep learning algorithms) use complex representations of words and sentences built from artificial neural networks organised in interconnected units, themselves divided into layers. Training these networks requires large corpora, in order to establish correspondences between the inputs (i.e. sentences of a text) and the outputs of the network (e.g. the context of words or the generation of a sentence relevant to the content) (2).
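To give a concrete, if simplified, idea of the feature-based, learning-free approach, here is a minimal Python sketch that scores each sentence by the frequency of its content words and by its position in the document, then keeps the highest-scoring sentences. The stop-word list and the scoring formula are illustrative assumptions made for the example, not a reproduction of the historical systems mentioned above.

```python
# Minimal sketch of learning-free extractive scoring: frequent content words
# and early position in the document make a sentence more likely to be kept.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "its"}

def summarise(text: str, n_sentences: int = 2) -> str:
    """Return the n highest-scoring sentences, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    content_words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(content_words)

    def score(index: int, sentence: str) -> float:
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]
        frequency_score = sum(freq[t] for t in tokens) / (len(tokens) or 1)
        position_bonus = 1.0 / (index + 1)  # sentences near the start get a small boost
        return frequency_score + position_bonus

    ranked = sorted(range(len(sentences)), key=lambda i: score(i, sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))
```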
However, even the most powerful algorithms are still unable to analyse and understand a text the way humans do. This is because every language has a syntactic and semantic structure carried by sentences, which are themselves made up of words and complex linguistic constructions; written language also contains redundancies and even mistakes (in grammar, syntax or content), all of which make it difficult for machines to learn.
Transposing comprehension into the realm of the computable
Does an algorithm need to understand a text in depth to produce a useful extractive summary? The answer is no. It simply needs to be able to identify informative areas – pieces of text containing interesting linguistic objects such as action verbs, proper nouns and named entities – to extract relevant information, then prioritise and organise it in order to generate summaries. The important thing is to find a balance between the information one wants to retain and what can actually be extracted from a text. Yet, to automate this process in a computer, the problem of comprehension and summarisation must be transposed into the realm of the computable. This requires substituting linguistic objects with an abstract representation that machines can process, while preserving the information contained in the text.
In the case of automatic summarisation, how can the problem of comprehension be transposed into the computable domain? To do this, we need an adequate representation of the text, which can be a vector model. In a vector space, the words ‘w’ constituting the lexicon of the document are embedded in a space of sentences ‘s’, giving a matrix of dimensions [s × w]. Linguistic analysis makes it possible to normalise and reduce word variations (such as inflected verb forms) and to eliminate so-called “empty” words or symbols that carry little information (articles, conjunctions, punctuation). Rare or overly frequent words are treated statistically in order to weight their importance appropriately.
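As an illustration, such a sentence-by-word matrix can be built in a few lines with an off-the-shelf library such as scikit-learn. The sketch below is not the LIA implementation; it simply assumes TF-IDF as one possible weighting scheme and English stop-word removal as the pre-processing step.

```python
# Minimal sketch of the [s x w] sentence-by-word matrix: each row is a
# sentence, each column a word, each cell a weighted frequency (TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Automatic summarisation condenses a document into a short text.",
    "Extractive methods select the most informative sentences of the document.",
    "Stop words such as articles and conjunctions carry little information.",
]

# Lowercasing and stop-word removal approximate the pre-processing described
# above; stemming or lemmatisation would be added in a fuller pipeline.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
matrix = vectorizer.fit_transform(sentences)  # shape: [s sentences x w words]

print(matrix.shape)
print(vectorizer.get_feature_names_out()[:10])
```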
The text thus becomes an abstract object that can be processed with mathematical, statistical and probabilistic methods. Moreover, this abstract representation allows the same algorithms to work on several languages. Indeed, these methods have the advantage of being fairly language-independent, provided appropriate pre-processing is applied. Thus, linguistic features (inflection, gender, tense) are normalised and empty words are deleted. Then, the occurrences of the remaining words are transformed numerically, each word being represented by a weighted quantity. The information conveyed by the text is still present, but transposed into a different representation: no longer a textual space but a mathematical one. Graph approaches are very useful in this description, because the links between the rows of the matrix (the sentences), established through its columns (the words), can be deduced or calculated by co-occurrence. Although the order of the words may be lost in this so-called “bag-of-words” interpretation, the transformation presents many advantages. While coarse from a linguistic point of view, it is very efficient algorithmically, as it preserves important features of the lexicon carrying the information of the text, such as frequency, co-occurrence or rarity.
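The graph view can be derived directly from such a matrix. The sketch below, again purely illustrative and not our production system, links two sentences whenever their word vectors overlap (cosine similarity above an arbitrary threshold of 0.1), using scikit-learn and networkx.

```python
# Minimal sketch of a sentence graph: sentences are nodes, and an edge links
# two sentences whose word vectors share enough vocabulary.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Automatic summarisation condenses a document into a short text.",
    "Extractive methods select the most informative sentences of the document.",
    "Stop words carry little information.",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
similarity = cosine_similarity(matrix)  # [s x s] sentence-to-sentence similarities

graph = nx.Graph()
graph.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if similarity[i, j] > 0.1:  # keep only meaningful links
            graph.add_edge(i, j, weight=similarity[i, j])

print(graph.edges(data=True))
```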
Many automatic summarisation programmes have been developed on this model. However, few of these “ready-made” solutions are optimised to process texts as complex as scientific studies! This is why, in 2019, the computer science laboratory of Avignon Université (LIA) entered into a collaboration with SciencePOD, a publishing company specialising in science communication. Our objective: to make the content of scientific articles accessible to larger audiences, based on large quantities of documents processed in parallel. In our opinion, this required the development of algorithms capable of generating so-called contextualised summaries in just a few hundredths of a second.
Producing context for lay readers
To obtain this type of summary, we need to ensure that the algorithms extract the most important sentences from the source document in a selective and hierarchical way. To do this, we have chosen to use graph-based models: the n-grams (see the box “How to assess the quality of a summary?”) represent the vertices, and the arcs represent their statistical relationships (co-occurrence of words, probability of occurrence, statistical weighting or deletion of words, calculation of entropy, etc.). Linguistic methods are also used to normalise words (verbs reduced to the infinitive, nouns to the singular, etc.) or to determine their grammatical category (noun, verb, adjective, adverb, punctuation). Next, we designed our algorithm so that it provides context for readers who are not familiar with the subject matter of the study. Specifically, we automated keyword extraction and acronym expansion, and added short definitions of technical terms.
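As a rough illustration of two of these contextualisation steps, the sketch below extracts keywords by ranking a word co-occurrence graph with PageRank, and spots acronyms with a simple “long form (ACRONYM)” pattern. The library choice (networkx), the stop-word list and the regular expression are assumptions made for the example, not the system described here.

```python
# Minimal sketch: (1) keywords from a word co-occurrence graph ranked with
# PageRank; (2) acronym elucidation via a crude "long form (ACRONYM)" pattern.
import re
from itertools import combinations
import networkx as nx

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "with", "for"}

def keywords(text: str, k: int = 5) -> list[str]:
    graph = nx.Graph()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]
        # Words co-occurring in the same sentence are linked in the graph.
        graph.add_edges_from(combinations(set(tokens), 2))
    ranks = nx.pagerank(graph) if graph.number_of_nodes() else {}
    return sorted(ranks, key=ranks.get, reverse=True)[:k]

def acronyms(text: str) -> dict[str, str]:
    # Crude heuristic: keep up to five words preceding a parenthesised acronym.
    found = {}
    for match in re.finditer(r"((?:[A-Za-z]+\s+){1,5})\(([A-Z]{2,})\)", text):
        found[match.group(2)] = match.group(1).strip()
    return found

example = "Our work combines natural language processing (NLP) and information retrieval."
print(keywords(example, k=3))
print(acronyms(example))  # maps 'NLP' to the words immediately preceding it
```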
To complement this approach, we have developed a second type of algorithm designed to provide instant access to the key information of each study. In this case, by guiding the algorithm to locate certain terms, we programmed it to extract relevant meta-information about the authors, their institutions, and so on. We structured this type of summary around the following questions: when and where was the study published? Who are the authors? Where do they work? What did they find? How was the research conducted and what are the avenues for future work?
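A generic approximation of this kind of meta-information extraction can be obtained with off-the-shelf named entity recognition, as in the sketch below. It assumes spaCy and its small English model (installed with `python -m spacy download en_core_web_sm`); our own system relies on guided term search and is not reproduced here.

```python
# Minimal sketch of metadata extraction with generic named entity recognition:
# PERSON entities suggest authors, ORG entities institutions, DATE entities the
# publication date.
import spacy

nlp = spacy.load("en_core_web_sm")

header = (
    "Published in Nature Physics on 3 March 2021 by Jane Doe "
    "of the University of Lisbon."
)

metadata = {}
for ent in nlp(header).ents:
    metadata.setdefault(ent.label_, []).append(ent.text)

print(metadata)  # e.g. {'DATE': [...], 'PERSON': [...], 'ORG': [...]}
```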
Another issue of interest to us was producing contextualised summaries of articles from a variety of scientific fields. To do so, we chose to combine several complementary approaches: natural language processing, statistical analysis methods, information retrieval methods, shallow parsing, and the construction and use of openly available structured linguistic resources (ontologies, Wikipedia, the MeSH thesaurus for the medical domain).
More ambitious than extractive summarisation, abstractive summarisation is based on deep learning neural models. These produce new sentences, absent from the original text, that are supposed to be closer to a summary written by a human. The drawback is that this method requires massive training data (big data) and significant training time. Another limitation concerns the size of the input documents for the neural network: recent studies show that they should be limited to 2,000 words (or tokens) (3). This excludes many scientific articles, or forces them to be truncated before the algorithms can process them. Furthermore, the abstraction may produce ungrammatical sentences, or sentences that are not entirely faithful to the source document. Thus, in our experience, machine learning is not the best strategy for summarising scientific studies.
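For readers who want to see what abstractive summarisation looks like in practice, here is a minimal sketch using a pre-trained PEGASUS model (the system cited in reference 2) through the Hugging Face transformers library. It is a generic illustration rather than a recommended pipeline; note how the input must be truncated to the model's maximum length, which is precisely the limitation on document size discussed above.

```python
# Minimal sketch of abstractive summarisation with a pre-trained neural model.
from transformers import pipeline

summariser = pipeline("summarization", model="google/pegasus-xsum")

article = (
    "Automatic summarisation condenses a scientific article into a short text. "
    "Extractive methods select existing sentences, while abstractive methods "
    "generate new ones with neural networks trained on large corpora."
)

# truncation=True cuts the input at the model's maximum length: long scientific
# articles would simply lose their tail before summarisation.
result = summariser(article, truncation=True, max_length=40, min_length=10)
print(result[0]["summary_text"])
```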
By contrast, in adopting contextualised extractive summaries, we have chosen a type of summary that seeks to retain the most prominent sentences. After extracting these key sentences, we compress them further by eliminating, following a discourse analysis, the constituents that are not essential to understanding the sentence. Then, we assemble these sentences so as to give the summary a logical order, rather than merely delivering fragments of key sentences as many generic ‘summarisers’ do. Finally, an appropriate context and metadata complete the summary.
In fact, this extractive approach is easier to implement and faster, as it does not require machine learning. It also produces high-quality output, as it is based on well-established classical statistical and linguistic algorithms, improved thanks to our know-how. Moreover, since the extraction is done on complete sentences, these are grammatically correct, although this does not guarantee the coherence of the final summary. While they are still far from human performance, automatic summaries are already usable. They allow readers to decide whether or not to read the source text in its entirety. In addition, automatic summaries can be used in computerised information processing systems, such as document indexing systems, combining summaries, terms and keywords to make documents easier to find with Internet search engines. Lastly, they can help scientific publishers to pre-select studies for peer review. Major publishers such as Elsevier, Springer Nature and Wiley have launched experiments to obtain summaries of their publications.
We are now planning to refine the summaries by going even further with information extraction and AI techniques. Above all, at a time when more and more scientific publications appear in open access, we want to offer automated summaries and make them available on a large scale. Beware, though! Just because a scientific text has been summarised and its content understood does not mean that its scientific validity is guaranteed! The number of so-called predatory publishers is steadily increasing, and many published articles that appear respectable at first glance have limited scientific content. Automatic summarisation saves time in selecting relevant studies, but it is no substitute for individual judgement. Human acumen still has a bright future ahead of it!
Juan-Manuel Torres-Moreno, Associated Researcher, Laboratoire Informatique d’Avignon, Avignon Université, specialist in Natural Language Processing.
Sabine Louët, Editor and Entrepreneur, Founder of the digital publishing company SciencePOD, based in Dublin (Ireland).
Case Study: Summarising This Article
In order to test the effectiveness of our method, we made an extractive summary of this article. Here is the result:
Follow the story of a collaboration combining the automatic summarisation knowledge of Juan-Manuel Torres-Moreno from the Avignon computer laboratory and the scientific communication skills of Sabine Louët, founder of the digital publishing company SciencePOD. Imagine: the number of scientific studies published annually has risen from 972,000 in 1996 to 2.5 million in 2018, according to figures from the National Science Foundation in the United States! These studies do not necessarily have a summary or, when they do, they often contain too many technical terms that are unintelligible to a non-expert audience. It is, therefore, natural to wonder whether artificial intelligence could be useful in generating automatic summaries.
How to assess the quality of a summary?
The question of how to assess the quality of a summary remains an unresolved problem, for which researchers have provided only partial solutions. There are two main evaluation methods: manual and automatic. The former employs human evaluators, who read and score a summary according to pre-established criteria: coherence, grammaticality, relevance, etc. This is an objective approach, but impractical and expensive. The second is to use algorithms that assess the quality of the automatic summary against reference summaries created by humans. One approximate way of doing this is to bring the problem back into the realm of the computable, by counting elements in the automatic summary and then in the human summaries, in order to establish appropriate statistics for measuring lexical proximity.
Which elements can we count? Word sequences, or n-grams. If n = 1, they are unigrams (single words); if n = 2, they are bigrams (pairs of words). If summaries created by humans are available, they serve as references. Take, for example, a reference summary R = “I have a feather… it’s pretty! My aunt’s feather.” and two automatic summaries, r1 = “it’s pretty” and r2 = “my aunt’s feather.”. R, r1 and r2 are represented by their n-grams; for simplicity, we will use only bigrams. We then have: R = {I have, have a, a feather, feather it, it is, is pretty, my aunt, aunt is, is feather, feather.}, with a total of 10 bigrams; r1 = {it is, is pretty} and r2 = {my aunt, aunt is, is feather, feather.}. r1 shares 2/10 bigrams with R and r2 shares 4/10, so r2 is of “better quality”, i.e. closer to the human summary, than r1. Of course, real-world calculations are more complex, as they involve several statistics, but the spirit remains the same. The method has been implemented in an algorithm called ROUGE (i) and is widely used within the scientific community. Other reference-free methods exist, the idea being to measure the semantic content of a summary through approximations (lexical or mixed) to the source document (ii).
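For the curious, the bigram-overlap idea can be written in a few lines of Python. The sketch below uses a deliberately naive tokenisation (lowercasing, punctuation stripped), so the exact counts differ slightly from the worked example above, but the ranking of r1 and r2 comes out the same.

```python
# Minimal sketch of a ROUGE-2-style recall score: the fraction of the reference
# summary's bigrams that also appear in the candidate summary.
import re
from typing import Set, Tuple

def bigrams(text: str) -> Set[Tuple[str, str]]:
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenisation
    return set(zip(tokens, tokens[1:]))

def rouge2_recall(reference: str, candidate: str) -> float:
    ref, cand = bigrams(reference), bigrams(candidate)
    return len(ref & cand) / len(ref) if ref else 0.0

R = "I have a feather... it's pretty! My aunt's feather."
print(rouge2_recall(R, "it's pretty"))         # candidate r1
print(rouge2_recall(R, "my aunt's feather."))  # candidate r2: higher score
```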
(i) Ch.-Y. Lin, Text Summarization Branches Out, ACL, 74, 2004.
(ii) A. Louis and A. Nenkova, Computational Linguistics, 39, 267, 2013.
(1) Joseph Joubert, Pensées, essais, maximes, 1842 (on gallica.bnf.fr).
(2) J. Zhang et al., “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization”, Proceedings of the 37th International Conference on Machine Learning, PMLR 119, 2020.
(3) A. Cohan et al., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2, doi: 10.18653/v1/N18-2097, 2018.
TO FIND OUT MORE
J.-M. Torres-Moreno, Automatic Text Summarization, Wiley, 2014.
Reproduced with kind permission from La Recherche.