Artificial Intelligence as a tool to summarise scientific publications
To automatically summarise scientific publications and make them accessible to as many people as possible: this is the ambitious objective of a collaboration combining the expertise of Juan-Manuel Torres-Moreno, from the Laboratoire Informatique d’Avignon (LIA, Computer Science Laboratory) at Avignon Université, with the science communication skills of Sabine Louët, founder of the digital publishing company SciencePOD. It is a way of disseminating research progress on a large scale, and of battling misinformation.
How can we keep abreast of scientific knowledge in a world where information related to research is proliferating? The number of scientific studies published annually worldwide has increased from 972,000 in 1996 to 2.5 million in 2018, according to US National Science Foundation figures. However, not all of these papers contain a summary. And when they do, the summary often contains too many terms that are inaccessible to an uninformed public. In this context, artificial intelligence (AI) offers the possibility of automatically summarising scientific publications in a digestible manner.
This ambition to summarise texts does not solely stem from the digital publishing era. The French essayist Joseph Joubert (1754-1824) already expressed this desire in his work Pensées: “If there is a man tormented by the cursed ambition to put a whole book in a page, a whole page in a sentence, and this sentence in a word, it is me” (1).
The 1950s onwards saw the development of the scientific basis for automated solutions, with significant advances in natural language processing (NLP) and information retrieval (IR). Noteworthy contributions include the founding work of the German computer scientist Hans Peter Luhn in the 1950s, and that of the British researcher Karen Spärck Jones in the 1980s, devoted to the summarisation of scientific texts. Let’s not forget, of course, the work of the American Julian Kupiec, who developed even more advanced information extraction systems in the 1990s.
These advances consolidated statistical summarisation by extraction from the 2000s onwards, followed, since 2016, by what have been called neural algorithms for generative summarisation. The first approach identifies, without the need for machine learning, the most salient sentence fragments in a text by using features that represent sentences, such as words, their co-occurrences and probabilities, as well as document structure, sentence position, and so on. It then extracts and assembles these fragments to form a usable summary. By contrast, neural algorithms (or deep learning algorithms) build complex representations of words and sentences using artificial neural networks organised as interconnected units, themselves arranged in layers. Training these artificial intelligence networks requires large training corpora to establish correspondences between the network’s inputs (i.e. sentences in a text) and its outputs (e.g. the context of words or the generation of a sentence relevant to the content) (2).
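To make the extractive approach concrete, here is a minimal sketch of a frequency-based extractive summariser. It is an illustration of the general technique only, not the systems developed by the researchers mentioned above: each sentence is scored by the average document-wide frequency of its words, and the top-scoring sentences are kept in their original order.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the frequency of its words across the
    whole document, then keep the top-scoring sentences in their
    original order -- a minimal frequency-based extractive method."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        # Average frequency, so long sentences are not favoured unfairly.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Reassemble the selected sentences in their original order.
    return ' '.join(s for s in sentences if s in ranked)
```

Real extractive systems add many more features (sentence position, co-occurrences, document structure), but the principle of scoring and extracting rather than generating text is the same.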
However, even the most powerful algorithms are still unable to analyse and understand a text in the same way as humans do. This is because different languages have a syntactic and semantic structure carried by sentences, which are themselves made up of words and complex linguistic constructions; written language also contains redundancies and even mistakes (in grammar, syntax or content), which make it difficult for machines to learn from.
Transposing comprehension into the realm of the computable
Does an algorithm need to understand a text in depth to produce a useful extractive summary? The answer is no. It simply needs to be able to identify informative areas – pieces of text containing interesting linguistic objects such as action verbs, proper nouns and named entities – to extract relevant information, then prioritise and organise it in order to properly summarise scientific publications. The important thing is to find a balance between the information one wants to retain and what can be extracted from a text. Yet, to automate this process in a computer, the problem of comprehension and summarisation must be transposed into the realm of the computable. This requires substituting linguistic objects with an abstract depiction that can be understood by machines, while preserving the information contained in the text.
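One simple example of such an abstract, machine-readable depiction is the bag-of-words representation, in which each sentence becomes a vector of word counts over a shared vocabulary. The sketch below assumes nothing beyond this classic technique; it is one of many possible representations, not the specific one used by the project described here.

```python
import re

def bag_of_words(sentences):
    """Map each sentence to a vector of word counts over a shared
    vocabulary -- one simple 'computable' stand-in for its content."""
    # Lowercase and tokenise each sentence into words.
    tokenised = [re.findall(r'\w+', s.lower()) for s in sentences]
    # Shared vocabulary: every distinct word, in a fixed (sorted) order.
    vocab = sorted({w for tokens in tokenised for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for tokens in tokenised:
        vec = [0] * len(vocab)
        for w in tokens:
            vec[index[w]] += 1  # count occurrences of each word
        vectors.append(vec)
    return vocab, vectors
```

Such vectors discard word order and nuance, which is precisely the trade-off the text describes: the representation must stay computable while preserving as much of the information as possible.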