Associate Professor
University of A Coruña, Computer Science Faculty
I am an Associate Professor in the Computer Science Department at the University of A Coruña, Spain, and a member of the Information Retrieval Lab. My research interests include information retrieval, text mining, document engineering, text summarization, and recommender systems.
Recently, my work has focused on modeling item recommendation as a relevance ranking problem, exploring pseudo-relevance feedback for opinion mining, and applying text mining techniques for early risk prediction on the Internet. I have been organizing the eRisk Lab since 2017, where we specifically research early risk detection on social media.
For more information about my technology transfer activities, I invite you to visit the IRLab, which is part of our work with CITIC, the Research Center on Information and Communication Technologies at the University of A Coruña.
I completed my Ph.D. thesis under the supervision of Professor Álvaro Barreiro, focusing on new estimations and applications of Relevance-Based Language Models. After organizing the CERI 2014 conference in A Coruña, I was elected President of the Spanish Society for Information Retrieval (SERI). You can join us here.
University of A Coruña, Computer Science Faculty
University of A Coruña, Computer Science Faculty
Google Research, London, UK
University of A Coruña, Computer Science Faculty
Spanish Information Retrieval Society
Xunta de Galicia, University of A Coruña, Computer Science Faculty
University of A Coruña, Computer Science Faculty
Igalia Software Engineering
Ph.D. in Computer Science
University of A Coruña, Computer Science Department
Advanced Studies Diploma in CS&AI
University of A Coruña, Computer Science Department
Ingeniero en Informática
University of A Coruña/University of West of England, Faculty of Computer Science/Faculty of Environment and Technology
Most of my research is framed in the area of Information Retrieval (IR). IR techniques have become essential to the daily activity of most human beings. Nowadays, the homepage of almost every web browser installed on personal computers points to a web search engine such as Google, Yahoo! or Bing. This is not only for marketing purposes; more importantly, it is because search engines have become vital for accessing information. Those search engines would not be possible without the research efforts made in the Information Retrieval field. Information Retrieval is, in fact, the science of searching, or perhaps better described as the science of finding.
However, IR is not only about searching for relevant documents, nor is my research limited to IR. I have also worked with text mining and natural language processing techniques in tasks such as opinion mining and retrieval, text categorization, blog and news search, unsupervised and semi-supervised text categorization, pseudo-relevance feedback, item recommendation, document processing and engineering, retrieval of degraded information, and text summarization.
I have contributed to the IR community by reviewing articles for ACM SIGIR, ACM CIKM, ACM RecSys, WWW, BCS ECIR, SPIRE, ACM ICTIR, SERI CERI, and journals such as Elsevier Information Processing and Management, IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Information Systems, ACM Computing Surveys, Elsevier Data & Knowledge Engineering, and the Journal of the American Society for Information Science and Technology. I am an editorial board member of Elsevier Information Processing and Management.
Most of my work would not be possible without the help of many group members and external collaborators.
One of the most pressing challenges in Information Access today is combating the spread of misinformation. Existing methods for misinformation detection employ techniques such as neural network models, statistical methods, linguistic analysis, and fact-checking strategies. However, the threat of false information has intensified with the emergence of highly creative language models. Misinformation on the web and social media poses significant social, economic, and political repercussions, leading to serious consequences such as election interference, polarization, and violence. This issue becomes particularly critical during global health crises, as misinformation surrounding the COVID-19 pandemic can result in catastrophic health outcomes. We have witnessed numerous myths proliferate on social media regarding COVID-19 treatments, the virality of the virus, and misleading narratives targeting marginalized communities. This challenge is especially pronounced in developing countries, where low literacy rates and limited exposure to technology hinder effective fake news detection. Nonetheless, increased access to affordable internet makes these populations more susceptible to believing and acting on misinformation.
Web search is a prevalent means of seeking online information, particularly regarding health-related advice. This area of web search is commonly referred to as Consumer Health Search. Accessing reliable health-related information necessitates retrieval algorithms capable of promoting trustworthy documents while filtering out unreliable ones. To achieve this, we aim to integrate various components, including query-document matching features, passage relevance estimation, reliability assessments, and suitable recommendation models. Our project seeks to establish a comprehensive pipeline for misinformation detection by fusing multiple features and complementary tools. We aspire to intelligently combine advanced techniques from diverse fields such as Information Retrieval, Text Classification, Recommendation, and Natural Language Processing to design effective content curation strategies for consumer health search tasks. This project is distinctly multidisciplinary, encompassing aspects of Text Processing (Information Retrieval, Automatic Text Classification, Personalization, and Recommender Systems), Computational Linguistics (Discourse Analysis, Advanced Natural Language Processing), and High-Performance Computing for Big Data. Furthermore, our team includes experts in Psychology, who will tackle challenges related to incorporating expert knowledge into the models, validating the resulting technologies, and applying the project outcomes in real-world contexts.
False rumors, fake news, and hate speech against vulnerable minorities on social media are increasingly recognized as significant threats to democracies. A comprehensive global strategy to combat disinformation is essential, as open democratic societies depend on free citizens who can access verifiable information to form their own opinions on various political issues.
The primary scientific objective of the HYBRIDS project is to equip researchers with the knowledge necessary to design strategies and tools to address disinformation based on an in-depth analysis of public discourse.
There have been notable advancements in the automatic detection of disinformation using natural language processing and emerging artificial intelligence techniques in the fields of machine and deep learning. However, this remains a complex task that demands a high level of natural language understanding, inference, and reasoning. To enhance strategies for countering disinformation, HYBRIDS will integrate structured knowledge from social and human sciences into natural language processing tools and deep learning algorithms to develop new hybrid intelligence systems. The concept of Hybrid Intelligence entails the combination of machine and human intelligence to overcome the limitations of current artificial intelligence methods.
While hybrid systems are expected to become increasingly critical in the near future, there are very few experts capable of designing and developing such systems. This scarcity primarily arises from the multidisciplinary nature of the hybrid strategy and the challenge of finding researchers who are fully trained in traditionally distinct disciplines, such as computer engineering, social sciences, or linguistics. We believe the time is ripe to establish a Doctoral Network equipped to train researchers in hybrid methodologies for their application in social studies, with a focus on sustaining good democratic practices across Europe.
Mental health is a critical component of the World Health Organization's definition of health. It directly influences how we think, feel, and behave. Mental disorders are complex and can manifest in various ways. In 2017, approximately 792 million people lived with some form of mental health issue, affecting more than one in ten individuals worldwide. Experts have recently warned that the aftermath of the COVID-19 pandemic could result in a global mental health crisis.
Despite the severity of these disorders, many individuals do not receive timely treatment. Early diagnosis is crucial for effective intervention, as it can significantly reduce the adverse effects of disorders and lower costs for public health and social services. However, tools for detecting mental health issues are limited due to the stigmatization surrounding mental illness.
Social media has emerged as a prominent communication platform, where many people share their emotions, thoughts, and feelings. The vast quantity of daily posts can enhance our understanding of individuals' mental states. Research indicates that analyzing language use in online data can help detect mental disorders. Social media provides a unique opportunity for individuals to express themselves anonymously, making it easier for them to share their true feelings and seek support from others.
Since 2017, we have been advancing this line of research through eRisk (https://erisk.irlab.org/), which explores evaluation methodologies, effectiveness metrics, and practical applications for early risk detection on the Internet. Over the last five years of this international competition, we have released numerous datasets related to risks such as depression, eating disorders, self-harm, and pathological gambling. Various international teams have contributed their models to promote this new area of research. Our ambition is to produce resources in evaluation methodologies, datasets, and models that can scale to the magnitude of social data. We envision that the results of this project will help develop the first generation of tools to assist social and health systems in early identification of individuals at risk.
To address the challenges beyond a laboratory setting, we have formed a solid interdisciplinary team composed of the lead organizers of the eRisk international competition, mental health professionals, and computer scientists with expertise in machine learning, information retrieval, natural language processing, and high-performance computing. The team, led by the University of A Coruña, includes the University of Santiago (a co-organizer of the eRisk competition) and Linknovate, a research-intensive start-up with strong ties to university teams.
This project focuses on several areas of Information Technology, including search, recommendation, massive data processing, and computational linguistics, alongside Psychology. The two subprojects each bring unique expertise in their respective domains.
Subproject 2 (UDC) will leverage experience in search, recommendation, and psychology to tackle a series of challenges and activities, including:
Recommender Systems (RecSys) aim to generate personalised item recommendations for users, given a set of users, a set of items, and the ratings that users have assigned to items. Traditionally, RecSys can exploit information both from past interactions between users and items and from the content of the items to generate new suggestions. These systems have proven key to facilitating access to information, products, and services. Specifically, it is estimated that a significant percentage of e-commerce transactions are motivated by recommendations: for example, Amazon sales increased by 29% after integrating a recommendation engine.
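The collaborative-filtering setting described above (users, items, ratings) can be illustrated with a minimal sketch. This is a toy user-based neighbourhood method with invented data, not the models developed in the project:

```python
# Toy user-based collaborative filtering: find a user's most similar
# neighbours by cosine similarity over their rating vectors, then score
# the items they rated that the target user has not seen yet.
# All names and ratings are illustrative examples.

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (sum(x * x for x in u.values()) ** 0.5) * \
          (sum(x * x for x in v.values()) ** 0.5)
    return num / den if den else 0.0

def recommend(ratings, user, k=2, n=3):
    """Score unseen items for `user` via similarity-weighted neighbour ratings."""
    sims = sorted(
        ((cosine(ratings[user], ratings[v]), v) for v in ratings if v != user),
        reverse=True)[:k]
    scores = {}
    for s, v in sims:
        for item, r in ratings[v].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)[:n]

ratings = {
    "ana":  {"book": 5, "film": 3, "game": 4},
    "bob":  {"book": 4, "film": 2, "song": 5},
    "carl": {"film": 5, "game": 1, "song": 2},
}
print(recommend(ratings, "ana"))  # → ['song']
```

Content-based variants would instead (or additionally) compare item descriptions, as the paragraph above notes.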
The Spanish Strategy for Science, Technology and Innovation 2013-2020 establishes the need to provide Spanish companies with innovative models to increase the efficiency and competitiveness of their processes for commercialising new products and services. Given the migration of our economy towards the digital society, these models must provide innovative technological solutions that transform the way we do business, the sales channels, and the mechanisms of relationship with the consumer. Given these objectives and challenges, Recommender Systems for products and services play a central role. Thus, in this project, we want to advance the state of the art by proposing new recommendation models that, on a solid formal probabilistic basis, may help increase sales, improve products, and raise buyer satisfaction. These models, and their translation into domains and real use cases in the business community, contribute through the quality of their recommendations to the development of the digital economy.
A booming research area is the translation of classic Information Retrieval approaches to the recommendation task. In particular, in this research project, we propose applying probabilistic Language Models to item recommendation. We recently developed the first such formalisations, obtaining high effectiveness figures. Given this positive experience, we want to extend the applicability of these models beyond the collaborative filtering approach, considering new estimates and models that integrate different content information and capture contextual and temporal aspects. Furthermore, we propose integrating Bayesian optimization techniques to develop models that not only generate tailored product suggestions but also generate them in a personalised manner, adapting the recommendation models to the particularities of each user. All these objectives are constrained by a common transversal goal: the efficiency, scalability, and robustness of such methods when translated into real applications in the productive sector.
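The language-modelling intuition mentioned above can be sketched in miniature: estimate p(item | user) by smoothing a maximum-likelihood estimate from the user's neighbourhood with a collection-wide popularity prior, in the style of Jelinek-Mercer smoothing. This is a hedged toy illustration, not the actual formulation published by the group; the counts and the `lam` value are invented:

```python
# Jelinek-Mercer-style smoothed estimate of p(item | user):
# mix the neighbourhood's rating-mass ML estimate with a
# collection-wide popularity prior. Data is illustrative.

def smoothed_score(item, neighbour_counts, collection_counts, lam=0.8):
    """lam * p_ML(item | neighbourhood) + (1 - lam) * p(item | collection)."""
    p_ml = neighbour_counts.get(item, 0) / sum(neighbour_counts.values())
    p_c = collection_counts.get(item, 0) / sum(collection_counts.values())
    return lam * p_ml + (1 - lam) * p_c

neighbour_counts = {"book": 9, "song": 7}                  # rating mass among u's neighbours
collection_counts = {"book": 20, "song": 30, "game": 50}   # whole collection

for item in ("song", "game"):
    print(item, round(smoothed_score(item, neighbour_counts, collection_counts), 3))
```

Smoothing keeps unseen-but-popular items (here "game") from scoring zero, mirroring its role in query-likelihood retrieval.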
In many application domains, there is a growing need to exploit the opinions people express on the Web. Decision-making processes in companies and organizations can be enriched with software tools able to monitor the voice of the people about products and services and to estimate customer satisfaction. Similarly, governments could promptly gauge citizens' responses to political actions and, in broad terms, opinion-rich information can support political decision-making. On the other hand, users can take into account the opinions of others about products, services, or any other issue that affects their information needs in public, private, and professional domains.
In recent years, several research advances have been made in Web Information Retrieval (IR) and in the field of Opinion Mining and Sentiment Analysis. Analyzing and exploiting opinions from the web presents new challenges and requires techniques radically different from the relevance-based retrieval typical of web search. However, it is known that, for a sentiment mining and analysis system to be useful, effective topic retrieval must be available. This project proposes a number of complementary research lines to improve web retrieval and sentiment analysis. We will investigate models and techniques that have recently yielded promising results: improving pseudo-relevance feedback through three specific techniques (cluster-based pseudo-relevance feedback, adaptive pseudo feedback, and selective pseudo feedback); improving traditional techniques to detect opinions and estimate polarity with new models of sentiment flow; and improving feature mining methods to associate opinions with aspects or properties of the reviewed objects. The team proposing this project has experience in these research topics and has already made several contributions in these areas.
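The core pseudo-relevance feedback idea behind the techniques listed above can be sketched simply: assume the top-ranked documents are relevant, extract their most frequent terms, and expand the original query with them. The corpus, the overlap-based ranking, and the stopword list are all illustrative stand-ins, not the project's actual methods:

```python
# Toy pseudo-relevance feedback (PRF): retrieve, take the top documents
# as pseudo-relevant, and expand the query with their most frequent
# non-query, non-stopword terms.

from collections import Counter

STOP = {"of", "for", "the", "a", "an"}

def retrieve(query, docs):
    """Rank documents by simple term overlap with the query."""
    q = set(query.split())
    return sorted(docs, key=lambda d: len(q & set(d.split())), reverse=True)

def expand_query(query, docs, top_docs=2, top_terms=2):
    """Add the most frequent expansion terms from the top-ranked documents."""
    ranked = retrieve(query, docs)[:top_docs]
    qterms = set(query.split())
    counts = Counter(
        t for d in ranked for t in d.split()
        if t not in qterms and t not in STOP)
    return query + " " + " ".join(t for t, _ in counts.most_common(top_terms))

docs = [
    "opinion mining of product reviews",
    "sentiment analysis of product reviews",
    "weather forecast for tomorrow",
]
print(expand_query("product reviews", docs))  # → product reviews opinion mining
```

Cluster-based, adaptive, and selective PRF refine exactly this step: which documents count as pseudo-relevant and whether feedback should be applied at all for a given query.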
There is a growing realisation that relevant information will increasingly be accessible across media, contexts, and modalities. The retrieval of such information will depend on factors such as time, place, history of interaction, the task at hand, and current user interests. To achieve this, Information Retrieval (IR) models that go beyond the usual relevance-oriented approach will need to be developed so that they can be deployed effectively to enhance retrieval performance. This is important to meet the information access demands of today's users. Indeed, the growing need to deliver information on request, in a form that can be readily and easily digested, continues to be a challenge.
In this project, coordinated with the University of Granada and the Universidad Autónoma de Madrid, we tackled the IR problem from a multidimensional perspective. Beyond the dimension of relevance, we studied how to endow systems with advanced capabilities for novelty detection, redundancy filtering, subtopic detection, personalization, and context-based retrieval. These dimensions were considered not only for the basic retrieval task but also for other tasks such as automatic summarization, document clustering, and categorization. This research thus tried to open new ways to improve the quality of access to information sources.
The objective of this project was to improve NowOnWeb, an R&D platform for web-news retrieval developed by the IRLab. The project tasks centred, on the one hand, on efficiency improvements (faster indexing, more scalable query processing, and crawling engine improvements) and, on the other hand, on effectiveness improvements in terms of news relevance, construction of better summaries, and exploitation of query logs.
The aim of this project is to improve the performance of systems for sentence retrieval and novelty detection. This task, located in the field of Information Retrieval (IR), is a step beyond the basic problem of document retrieval. Given a user query that retrieves an ordered set of documents, this set is processed to identify the sentences relevant to the query, while avoiding redundant material. The task defined in this way has recently been introduced in the field of IR (as the "novelty task") and is highly related to other IR problems; hence, lessons learned in novelty are potentially beneficial in other IR subareas. Moreover, the state of the art in novelty clearly shows the need for more research effort on sentence retrieval and novelty detection. In this respect, several formalisms and tools successfully applied in other IR problems are especially promising for novelty. This project will address the application of Language Models, fuzzy quantification, and dimensionality reduction to sentence retrieval and novelty detection. We strongly believe that the variety of approaches taken is a good starting point for improving the effectiveness of the novelty task. Furthermore, this variety facilitates cross-fertilization between these research lines, which is an added value for this project. Throughout this document, we will provide evidence of the adequacy of these models and techniques for novelty. The implementation of the proposals derived from this project will be based on research tools and platforms available for experimentation, along with the development of our own code. The evaluation will be conducted with standard benchmarks, following the methodology of the IR field.
This is a list of my research publications; you can find more information on my DBLP profile or my Google Scholar profile.
Hate speech is a harmful form of online expression, often manifesting as derogatory posts. It is a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern about their potential to replicate hate speech patterns, given their training on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, research on the behaviour of LLMs towards hate speech has been limited. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing. Finally, we explore the models' responses to hate speech framed in politically correct language.
Automatic keyphrase labelling stands for the ability of models to retrieve words or short phrases that adequately describe documents' content. Previous work has put much effort into exploring extractive techniques to address this task; however, these methods cannot produce keyphrases not found in the text. Given this limitation, keyphrase generation approaches have arisen lately. This paper presents a keyphrase generation model based on the Text-to-Text Transfer Transformer (T5) architecture. Taking a document's title and abstract as input, we train a T5 model to generate keyphrases that adequately define its content. We name this model docT5keywords. We not only perform the classic inference approach, where the output sequence is directly selected as the predicted values, but also report results from a majority voting approach, in which multiple sequences are generated and the keyphrases are ranked by their frequency of occurrence across these sequences. Along with this model, we present a novel keyphrase filtering technique based on the T5 architecture: we train a T5 model to learn whether a given keyphrase is relevant to a document. We devise two evaluation methodologies to prove our model's capability to filter inadequate keyphrases. First, we perform a binary evaluation where our model has to predict if a keyphrase is relevant for a given document. Second, we filter the keyphrases predicted by several AKG models and check whether the evaluation scores improve. Experimental results demonstrate that our keyphrase generation model significantly outperforms all the baselines, with gains exceeding 100...
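The majority-voting step described in the abstract above can be sketched independently of any model: sample several candidate keyphrase sequences and rank keyphrases by how often they occur across them. The hard-coded sequences below are invented stand-ins for sampled model outputs; no real T5 model is involved:

```python
# Majority-voting ranking of generated keyphrases: each sampled
# output sequence is a semicolon-separated list; keyphrases are
# ranked by how many sequences they appear in.

from collections import Counter

def rank_keyphrases(sequences, top_n=3):
    """Rank keyphrases by frequency of occurrence across sequences."""
    counts = Counter(kp.strip() for seq in sequences for kp in seq.split(";"))
    return [kp for kp, _ in counts.most_common(top_n)]

# Pretend these are N sampled outputs of a keyphrase-generation model.
sampled = [
    "information retrieval; keyphrase generation; t5",
    "keyphrase generation; transformers; t5",
    "keyphrase generation; information retrieval; evaluation",
]
print(rank_keyphrases(sampled))
```

Aggregating over samples rewards keyphrases the model produces consistently, which is the intuition behind preferring the voting approach over a single decoded sequence.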
In the age of social media, user-generated content is critical for detecting early signs of mental disorders. In this study, we use thematic clustering to analyze the content of the social media platform Reddit. Our primary goal is to use clustering techniques for comprehensive topic discovery, with a focus on identifying common themes among user groups suffering from mental illnesses such as depression, anorexia, gambling addiction, and self-harm. Our findings show that certain clusters are more cohesive, e.g., with a higher proportion of texts indicating depression. Furthermore, we discovered subreddits that are strongly linked to texts from the depressed user group. These findings shed light on how online interactions and subreddit themes may impact users' mental health, paving the way for future research and more targeted interventions in the field of online mental health.
Hate speech represents a pervasive and detrimental form of online discourse, often manifested through an array of slurs, from hateful tweets to defamatory posts. As such speech proliferates, it connects people globally and poses significant social, psychological, and occasionally physical threats to targeted individuals and communities. Current computational linguistic approaches for tackling this phenomenon rely on labelled social media datasets for training. For unifying efforts, our study advances in the critical need for a comprehensive meta-collection, advocating for an extensive dataset to help counteract this problem effectively. We scrutinized over 60 datasets, selectively integrating those pertinent into MetaHate. This paper offers a detailed examination of existing collections, highlighting their strengths and limitations. Our findings contribute to a deeper understanding of the existing datasets, paving the way for training more robust and adaptable models. These enhanced models are essential for effectively combating the dynamic and complex nature of hate speech in the digital realm.
Users of social platforms often perceive these sites as supportive spaces to post about their mental health issues. Those conversations contain important traces about individuals’ health risks. Recently, researchers have exploited this online information to construct mental health detection models, which aim to identify users at risk on platforms like Twitter, Reddit or Facebook. Most of these models are focused on achieving good classification results, ignoring the explainability and interpretability of the decisions. Recent research has pointed out the importance of using clinical markers, such as the use of symptoms, to improve trust in the computational models by health professionals. In this paper, we introduce transformer-based architectures designed to detect and explain the appearance of depressive symptom markers in user-generated content from social media. We present two approaches: (i) train a model to...
In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. Since then, thanks to the participants' work, we have developed detection models and datasets for depression, anorexia, pathological gambling, and self-harm. 2024 will mark the eighth edition of the lab, where we will present a revision of the sentence ranking for depression symptoms and the third edition of the tasks on early alert of anorexia and eating disorder severity estimation. This paper outlines the work that we have done to date, discusses key lessons learned in previous editions, and presents our plans for eRisk 2024.
Depression is a global concern suffered by millions of people, significantly impacting their thoughts and behavior. Over the years, heightened awareness, spurred by health campaigns and other initiatives, has driven the study of this disorder using data collected from social media platforms. In our research, we aim to gauge the severity of symptoms related to depression among social media users. The ultimate goal is to estimate the user’s responses to a well-known standardized psychological questionnaire, the Beck Depression Inventory-II (BDI). This is a 21-question multiple-choice self-report inventory that covers multiple topics about how the subject has been feeling. Mining users’ social media interactions and understanding psychological states represents a challenging goal. To that end, we present here an approach based on search and summarization that extracts multiple BDI-biased summaries from the thread of users’ publications. We also leverage a robust large language model to estimate the potential answer for each BDI item. Our method involves several steps. First, we employ a search strategy based on sentence similarity to obtain pertinent extracts related to each topic in the BDI questionnaire. Next, we compile summaries of the content of these groups of extracts. Last, we exploit chatGPT to respond to the 21 BDI questions, using the summaries as contextual information in the prompt. Our model has undergone rigorous evaluation across various depression datasets, yielding encouraging results. The experimental report includes a comparison against an assessment done by expert humans and competes favorably with state-of...
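The first step of the pipeline described above (retrieving a user's sentences most related to each questionnaire topic) can be sketched as follows. Here similarity is plain Jaccard overlap of word sets rather than the stronger sentence representations a real system would use, and both the topic and the posts are invented examples, not actual BDI items:

```python
# Toy sentence retrieval for a questionnaire topic: rank a user's
# sentences by word-set (Jaccard) similarity to the topic and keep
# the top k as context for the later summarization/LLM steps.

def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def top_sentences(topic, sentences, k=2):
    """Return the k sentences most similar to a questionnaire topic."""
    return sorted(sentences, key=lambda s: jaccard(topic, s), reverse=True)[:k]

posts = [
    "I have trouble sleeping every night",
    "watched a great film yesterday",
    "lately I feel sad most of the day",
]
print(top_sentences("sadness and feeling sad", posts, k=1))
```

The retrieved extracts would then be summarised per topic and passed as prompt context when asking the language model to answer each questionnaire item.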
The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration of the dataset's utility, we fine-tuned LLaMA-7B to comprehend and respond in Galician, a language not originally supported by the model, by following the Alpaca format. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case, Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and Cabuxa-7B are publicly accessible on our Huggingface Hub, and we have made the source code available to facilitate...
Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity has motivated much work in developing methods for constructing benchmarks at lower assessment cost. In this respect, adjudication methods actively decide both which documents experts review and in which order, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements impact the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how the low-cost adjudication methods preserve the...
Many cases of violence against children occur in homes and other close environments. Machine learning is a novel approach that addresses important gaps in ways of examining this socially significant issue, illustrating innovative and emerging approaches to the use of computers from a psychological perspective. In this paper, we aim to use machine learning techniques to predict adolescents' involvement in family conflict in a sample of adolescents living with their families (community adolescents) and adolescents living in residential care centers, who are temporarily separated from their families because of adverse family conditions. Participants were 251 Spanish adolescents (Mage = 15.59), of whom 167 lived in residential care and 84 lived with their families. We measured perceived interparental and family conflict, adolescents' emotional security, emotional, cognitive, and behavioral immediate responses to...
This paper provides an overview of eRisk 2023, the seventh edition of the CLEF conference’s lab dedicated to early risk detection. Since its inception, our lab has aimed to explore evaluation methodologies, effectiveness metrics, and other processes associated with early risk detection. The applications of early alerting models are diverse and encompass various domains, including health and safety. eRisk 2023 consisted of three tasks. The first task involved ranking sentences based on their relevance to standardised depression symptoms. The second task focused on early detection of signs related to pathological gambling. The third task required participants to automatically estimate an eating disorders questionnaire by analysing user writings on social media.
Computational methods for depression detection aim to mine traces of depression from online publications posted by Internet users. However, solutions trained on existing collections exhibit limited generalisation and interpretability. To tackle these issues, recent studies have shown that identifying depressive symptoms can lead to more robust models. The eRisk initiative fosters research in this area and has recently proposed a new ranking task focused on developing search methods to find sentences related to depressive symptoms. This search challenge relies on the symptoms specified by the Beck Depression Inventory-II (BDI-II), a questionnaire widely used in clinical practice. Based on the participant systems' results, we present the DepreSym dataset, consisting of 21,580 sentences annotated according to their relevance to the 21 BDI-II symptoms. The labelled sentences come from a pool of diverse ranking methods, and the final dataset serves as a valuable resource for advancing the development of models that incorporate depressive markers such as clinical symptoms. Due to the complex nature of this relevance annotation, we designed a robust assessment methodology carried out by three expert assessors (including an expert psychologist). Additionally, we explore here the feasibility of employing recent Large Language Models (ChatGPT and GPT4) as potential assessors in this complex task. We undertake a comprehensive examination of their performance, determine their main limitations and analyze their role as a complement to or replacement for human annotators.
People tend to consider social platforms as convenient media for expressing their concerns and emotional struggles. With their widespread use, researchers could access and analyze user-generated content related to mental states. Computational models that exploit that data show promising results in detecting at-risk users based on engineered features or deep learning models. However, recent works revealed that these approaches have a limited capacity for generalization and interpretation when considering clinical settings. Grounding the models' decisions on clinical and recognized symptoms can help to overcome these limitations. In this paper, we introduce BDI-Sen, a symptom-annotated sentence dataset for depressive disorder. BDI-Sen covers all the symptoms present in the Beck Depression Inventory-II (BDI-II), a reliable questionnaire used for detecting and measuring depression. The annotations in...
Offline evaluation of information retrieval systems depends on test collections. These datasets provide the researchers with a corpus of documents, topics and relevance judgements indicating which documents are relevant for each topic. Gathering the latter is costly, requiring human assessors to judge the documents. Therefore, experts usually judge only a portion of the corpus. The most common approach for selecting that subset is pooling. By intelligently choosing which documents to assess, it is possible to optimise the number of positive labels for a given budget. For this reason, much work has focused on developing techniques to better select which documents from the corpus merit human assessments. In this article, we propose using relevance feedback to prioritise the documents when building new pooled test collections. We explore several state-of-the-art statistical feedback methods for prioritising the...
Nowadays, search engine users commonly rely on query suggestions to improve their initial inputs. Current systems are very good at recommending lexical adaptations or spelling corrections to users’ queries. However, they often struggle to suggest semantically related keywords given a user’s query. The construction of a detailed query is crucial in some tasks, such as legal retrieval or academic search. In these scenarios, keyword suggestion methods are critical to guide the user during the query formulation. This paper proposes two novel models for the keyword suggestion task trained on scientific literature. Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents’ keyword co-occurrence. Along with these models, we also present a specially tailored negative sampling approach that exploits how keywords appear in academic publications. We...
Depression is one of the most prevalent mental disorders. For its effective treatment, patients need a quick and accurate diagnosis. Mental health professionals use self-report questionnaires to serve that purpose. These standardized questionnaires consider different depression symptoms in their evaluations. However, mental health stigmas heavily influence patients when filling out a questionnaire. In contrast, many people feel more at ease discussing their mental health issues on social media. This demo paper presents a platform for assisted examination and tracking of symptoms of depression for social media users. To provide broader context, we have complemented our tool with user profiling. We show a platform that helps professionals with data labelling, relying on depression estimators and profiling models.
In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. Since then, thanks to the participants' work, we have developed detection models and datasets for depression, anorexia, pathological gambling and self-harm. The 2023 edition will be the seventh of the lab and will introduce a new type of task on sentence ranking for depression symptoms. This paper outlines the work that we have done to date, discusses key lessons learned in previous editions, and presents our plans for eRisk 2023.
Depressive disorders constitute a severe public health issue worldwide. However, public health systems have limited capacity for case detection and diagnosis. In this regard, the widespread use of social media has opened up a way to access public information on a large scale. Computational methods can serve as support tools for rapid screening by exploiting this user-generated social media content. This paper presents an efficient semantic pipeline to study depression severity in individuals based on their social media writings. We select test user sentences for producing semantic rankings over an index of representative training sentences corresponding to depressive symptoms and severity levels. Then, we use the sentences from those results as evidence for predicting users' symptom severity. For that, we explore different aggregation methods to answer one of four Beck Depression Inventory (BDI) options per symptom. We evaluate our methods on two Reddit-based benchmarks, achieving a 30% improvement over the state of the art in measuring depression severity.
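The aggregation step can be illustrated with a minimal sketch. The function name and the two strategies (majority vote and a rounded average) are illustrative assumptions, not the paper's exact methods:

```python
from collections import Counter

def predict_bdi_option(retrieved_labels, strategy="majority"):
    """Aggregate the severity labels (0-3) of the top retrieved training
    sentences into a single BDI answer for one symptom."""
    if strategy == "majority":
        return Counter(retrieved_labels).most_common(1)[0][0]
    # otherwise: average severity, rounded to the nearest valid option
    return round(sum(retrieved_labels) / len(retrieved_labels))

# e.g. labels of the top-5 training sentences retrieved for a user's
# writings about one symptom
print(predict_bdi_option([2, 2, 3, 1, 2]))  # -> 2
```

Each retrieved training sentence acts as one piece of evidence; the aggregation decides which of the four BDI options best summarises that evidence.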
Depression is one of the most common mental health illnesses. The biggest obstacle lies in efficient and early detection of the disorder. Self-report questionnaires are the instruments used by medical experts to elaborate a diagnosis. These questionnaires were designed by analyzing different depressive symptoms. However, factors such as social stigmas negatively affect the success of traditional methods. This paper presents a novel approach for automatically estimating the degree of depression in social media users. In this regard, we addressed the task Measuring the Severity of the Signs of Depression of eRisk 2020, an initiative in the CLEF Conference. We aimed to explore neural language models to exploit different aspects of the subject's writings depending on the symptom to capture. We devised two distinct methods based on the symptoms' sensitivity, in terms of users' willingness to comment on them…
eRisk stands for Early Risk Prediction on the Internet. It is concerned with the exploration of techniques for the early detection of mental health disorders which manifest in the way people write and communicate on the Internet, in particular in user-generated content (e.g. Facebook, Twitter, or other social media). Early detection technologies can be employed in several different areas but particularly in those related to health and safety. For instance, early alerts could be sent when the writing of a teenager starts showing increasing signs of depression, or when a social media user starts showing suicidal inclinations, or again when a potential offender starts publishing antisocial threats on a blog, forum or social network. eRisk has been the pioneer of a new interdisciplinary area of research that is potentially applicable to a wide variety of situations, problems and personal profiles. This book presents the best results of the first five years of the eRisk project, which started in 2017 and developed into one of the most successful tracks of CLEF, the Conference and Lab of the Evaluation Forum.
Depression is one of the most debilitating mental health diseases. Detecting the presence of depressive symptoms in the early stages of the disease is essential to reduce further consequences. As the study of language and behaviour is a pivotal component in mental health research, social network content positions itself as a helpful tool. This paper introduces a general framework to analyze variations in an individual's use of language over time on social media. We present a novel approach using temporal word representations to quantify the magnitude of word movements. This framework allows us to evaluate whether word evolution can reveal the presence of depressive tendencies. We adapted different temporal word embedding representations to our framework and assessed them on Reddit benchmark datasets. Our results achieve high competitiveness compared with state-of-the-art methods, showing the potential that time-aware word representation models can bring to early detection scenarios.
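One simple way to quantify a word's movement, assuming the embedding spaces of the two time slices have already been aligned (e.g. via orthogonal Procrustes), is the cosine distance between its vectors. This is a generic sketch of the idea, not the paper's specific measure:

```python
import numpy as np

def semantic_drift(emb_t1, emb_t2):
    """Cosine distance between a word's vectors in two aligned time
    slices; larger values mean the word moved more in embedding space."""
    cos = np.dot(emb_t1, emb_t2) / (np.linalg.norm(emb_t1) * np.linalg.norm(emb_t2))
    return 1.0 - cos

v_before = np.array([1.0, 0.0, 0.0])
v_after = np.array([0.6, 0.8, 0.0])   # the word's usage has shifted
print(round(semantic_drift(v_before, v_after), 2))  # -> 0.4
```

Tracking this quantity per word and per user over time yields the kind of temporal signal the framework evaluates.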
Automatic keyword labelling methods generate, for a given document, a set of short phrases that provide a brief and accurate description of its content. Those labels are critical in tasks such as exploratory search and for improving the information discovery experience. This paper presents a novel keyword labelling model based on text-to-text transfer transformers (T5). We train a T5 model to generate keywords from the content of academic documents. We name this model docT5keywords. We compare our proposal with the state-of-the-art EmbedRank model, based on Sent2Vec embeddings, and even with the keywords manually assigned by authors to represent their writings. Our proposal does not merely extract fragments of the texts but may also produce unseen labels. We commonly refer to these models as creative models. Classical evaluation based on matching against a set of gold-standard labels extracted from the texts is not the best alternative when examining the performance of creative methods. Therefore, we also present an alternative user-based evaluation methodology for creative keyword generation models. In our user study, we examine the performance of the tested models using four expert assessors while analysing the assessor agreement and the correlation with the classical offline evaluation methodologies.
In 2017, we launched eRisk as a CLEF Lab to encourage research on early risk detection on the Internet. eRisk 2021 was the fifth edition of the Lab. Since the beginning, we have created a large number of collections for early detection addressing different problems (e.g., depression, anorexia or self-harm). This paper outlines the work that we have done to date (2017, 2018, 2019, 2020, and 2021), discusses key lessons learned in previous editions, and presents our plans for eRisk 2022, which introduces a new challenge to assess the severity of eating disorders.
Social networks constitute a valuable source for documenting heritage constitution processes or obtaining a real-time snapshot of a cultural heritage research topic. Many heritage researchers use social networks as a social thermometer to study these processes, creating, for this purpose, collections that constitute born-digital archives potentially reusable, searchable, and of interest to other researchers or citizens. However, retrieval and archiving techniques used in social networks within heritage studies are still semi-manual, making them a time-consuming task and hindering the reproducibility, evaluation, and opening up of the collections created. By combining Information Retrieval strategies with emerging archival techniques, some of these weaknesses can be overcome. Specifically, pooling is a well-known Information Retrieval method to extract a sample of documents from an entire document set (posts in case of...
Recommender systems evaluation has evolved rapidly in recent years. However, for offline evaluation, accuracy is the de facto standard for assessing the superiority of one method over another, with most research comparisons focused on tasks ranging from rating prediction to ranking metrics for top-n recommendation. Simultaneously, recommendation diversity and novelty have become recognized as critical to users’ perceived utility, with several new metrics recently proposed for evaluating these aspects of recommendation lists. Consequently, the accuracy-diversity dilemma frequently shows up as a choice to make when creating new recommendation algorithms. We propose a novel adaptation of a unified metric, derived from one commonly used for search system evaluation, to Recommender Systems. The proposed metric combines topical diversity and accuracy, and we show it to satisfy a set of desired...
Automatic profiling models infer demographic characteristics of social network users from their generated content or interactions. Due to its use in business (targeted advertising, market studies...), automatic user profiling from social networks has become a popular task. Users' demographic data is also crucial information for more socially concerning tasks, such as automatic early detection of mental disorders. For this type of user analysis task, it has been demonstrated that the way users employ language is an essential indicator that contributes to the effectiveness of the models. For this reason, we believe that considering language usage from both psycho-linguistic and semantic perspectives is useful for detecting variables such as gender, age, and user's origin. A proper selection of features will be critical for the performance of retrieval, classification, and decision-making software systems, a...
eRisk, a CLEF lab oriented to early risk prediction on the Internet, started in 2017 as a forum to foster experimentation on early risk detection. After four editions (2017, 2018, 2019 and 2020), the lab has created many reference collections in the field and organized multiple early risk detection challenges using those datasets. Each challenge focused on a specific early risk detection problem (e.g., depression, anorexia or self-harm). This paper describes the work done so far, discusses the main lessons learned over the past editions and the plans for the eRisk 2021 edition, where we introduced pathological gambling as a new early risk detection challenge.
Information Retrieval is an area where evaluation is crucial to validate newly proposed models. As the first step in the evaluation of models, researchers carry out offline experiments on specific datasets. While the field started around ad-hoc search, the number of new tasks is continuously growing. These tasks demand the development of new test collections (documents, information needs, and judgments). The construction of those datasets relies on expensive campaigns like TREC. Due to the size of modern collections, obtaining the relevance for each document-topic pair is infeasible. To reduce this cost, organizers usually apply a technique called pooling. When building pooled test collections, assessors only judge a portion of the documents selected among the participants' results. Although the judgments will not be exhaustive, they will be sufficiently complete and unbiased if pooling is done correctly...
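The pooling technique mentioned above can be illustrated with the classic depth-k variant: assessors judge only the union of the top-k documents returned by each participant system for a topic. A minimal sketch:

```python
def depth_k_pool(runs, k):
    """Classic depth-k pooling: union of the top-k documents
    from every participant run for a given topic."""
    pool = set()
    for ranking in runs:          # each run is a ranked list of doc ids
        pool.update(ranking[:k])
    return pool

run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d2", "d5", "d1", "d6"]
print(sorted(depth_k_pool([run_a, run_b], k=2)))  # -> ['d1', 'd2', 'd5']
```

Only the pooled documents are judged; everything outside the pool is assumed non-relevant, which is why the choice of runs and depth matters for the completeness of the resulting judgments.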
Null Hypothesis Significance Testing (NHST) has been recurrently employed as the reference framework to assess the difference in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests, such as the t-test, the Wilcoxon Signed Rank test, the Permutation test, the Sign test or the Bootstrap test. However, the question of which of these tests is the most reliable in IR experimentation is still controversial. Different authors have tried to shed light on this issue, but their conclusions are not in agreement. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method creates models from the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge about the truth or...
Personalized recommender systems rely on knowledge of user preferences to produce recommendations. While those preferences are often obtained from past user interactions with the recommendation catalog, in some situations such observations are insufficient or unavailable. The most widely studied case is with new users, although other similar situations arise where explicit preference elicitation is valuable. At the same time, a seemingly disparate challenge is that there is a well-known popularity bias in many algorithmic approaches to recommender systems. The most common way of addressing this challenge is diversification, which tends to be applied to the output of a recommender algorithm, prior to items being presented to users. We tie these two problems together, showing a tight relationship. Our results show that popularity bias in preference elicitation contributes to popularity bias in recommendation...
This article provides an overview of eRisk 2023, the seventh edition of the CLEF conference’s lab dedicated to early risk detection. Our lab has been committed to exploring evaluation methodologies, effectiveness metrics, and other associated processes in the field of early risk detection since its inception. The applications of early alerting models are wide-ranging and span various domains, including health and safety. eRisk 2023 encompassed three tasks. The initial task involved ranking sentences based on their relevance to standardized depression symptoms. The second task concentrated on detecting signs associated with pathological gambling early. Lastly, the third task required participants to automatically estimate an eating disorders questionnaire by analyzing user writings on social media. In this extended overview, we include additional details about the participants’ proposals and more detailed explanations about metrics.
Automatic user profiling from social networks has become a popular task due to its commercial applications (targeted advertising, market studies...). Automatic profiling models infer demographic characteristics of social network users from their generated content or interactions. Users' demographic information is also precious for more socially concerning tasks such as automatic early detection of mental disorders. For this type of user analysis task, it has been shown that the way users employ language is an important indicator that contributes to the effectiveness of the models. Therefore, we also consider that, for identifying aspects such as gender, age or user's origin, it is interesting to consider the use of language from both psycho-linguistic and semantic features. A good selection of features will be vital for the performance of retrieval, classification, and decision-making software systems. In this paper, we address gender classification as a part of the automatic profiling task. We show an experimental analysis of the performance of existing gender classification models based on external corpora and baselines for automatic profiling. We analyse in depth the influence of the linguistic features on the classification accuracy of the model. After that analysis, we have put together a feature set for gender classification models in social networks with accuracy above existing baselines.
Recent studies on the impact of self-assessment mechanisms in Computer Engineering education point to the benefits of self-assessment as a formative tool in its own right, as a means to achieve genuine continuous assessment, and as a way to improve the effectiveness of teacher-student feedback mechanisms. This, together with the new teaching scenarios opened up by the Covid-19 pandemic, calls for the introduction of innovative feedback and assessment dynamics. In this article, we present the development and deployment of a self-assessment protocol as part of the teaching innovation activities in the Operating Systems course. The protocol was designed to incorporate self-assessment throughout all the laboratory assignments of the course. We then carried out a comparative empirical study analysing the accuracy of the self-assessments produced by an experimental group against the blind grading issued by the instructor of the lab group. The results provide an initial evaluation of the designed protocol and establish the students' starting point in self-assessment skills in the specific subject areas of Operating Systems.
In this article, we present a protocol developed as part of the teaching innovation activities in Operating Systems. The goal of the protocol is to incorporate self-assessment as a formative evaluation mechanism for Operating Systems competencies, fostering in students transversal skills of special relevance in the design and implementation of Operating Systems functionality: critical analysis, detection of possible improvements, and awareness of one's own learning process. To this end, we designed a self-assessment and peer-assessment protocol for the practical tests of the course, covering three subject areas in Operating Systems: file systems, memory management, and process management and scheduling. To allow both evaluation systems to coexist, the protocol was applied to part of the student body while instructor-based evaluation was maintained in parallel. We then conducted an empirical study of the accuracy of students' self-assessment and peer assessment, comparing the protocol with traditional evaluation and providing an initial analysis of its implications for the final grades obtained. The results allow us not only to make an initial evaluation of the designed protocol, but also to establish the students' starting point in self- and peer-assessment skills in the particular subject areas of Operating Systems.
A chatbot is a type of agent that allows people to interact with an information repository using natural language. Nowadays, chatbots have been incorporated in the form of conversational assistants on the most important mobile and desktop platforms. In this article, we present our design of an assistant developed with open-source and widely used components. Our proposal covers the process end-to-end, from information gathering and processing to visual and speech-based interaction. We have deployed a proof of concept over the website of our Computer Science Faculty.
The evaluation of recommender systems is an area with unsolved questions at several levels. Choosing the appropriate evaluation metric is one of such important issues. Ranking accuracy is generally identified as a prerequisite for recommendation to be useful. Ranking metrics have been adapted for this purpose from the Information Retrieval field into the recommendation task. In this article, we undertake a principled analysis of the robustness and the discriminative power of different ranking metrics for the offline evaluation of recommender systems, drawing from previous studies in the information retrieval field. We measure the robustness to different sources of incompleteness that arise from the sparsity and popularity biases in recommendation. Among other results, we find that precision provides high robustness while normalized discounted cumulative gain offers the best discriminative power. In dealing with...
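For concreteness, the two metrics the abstract singles out can be computed as follows over a binary relevance vector for a recommendation list (a textbook sketch, not the paper's evaluation code):

```python
import math

def precision_at_k(rels, k):
    """Fraction of relevant items among the top-k recommendations."""
    return sum(rels[:k]) / k

def ndcg_at_k(rels, k):
    """Binary-relevance nDCG with the usual log2 position discount."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

rels = [1, 0, 1, 0, 0]          # relevance of a top-5 list, top to bottom
print(precision_at_k(rels, 5))  # -> 0.4
```

Note that precision ignores where the relevant items appear, while nDCG rewards placing them near the top, which is one reason the two metrics behave differently under incomplete judgments.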
Evaluation is a mandatory task for Information Retrieval research. Under the Cranfield paradigm, this evaluation needs test collections. The creation of these is a time- and resource-consuming process. At the same time, new tasks and models are continuously appearing. These tasks demand the building of new test collections. Typically, researchers organize TREC-like competitions for building these evaluation benchmarks. This is very expensive, both for the organizers and for the participants. In this paper, we present a platform to easily and cheaply build datasets for Information Retrieval evaluation without the need to organize expensive campaigns. In particular, we propose the simulation of participant systems and the use of pooling strategies to make the most of the assessors' work. Our platform aims to cover the whole process of building the test collection, from document gathering to judgment creation.
This paper describes eRisk, the CLEF lab on early risk prediction on the Internet. eRisk started in 2017 as an attempt to set the experimental foundations of early risk detection. Over the last three editions of eRisk (2017, 2018 and 2019), the lab organized a number of early risk detection challenges oriented to the problems of detecting depression, anorexia and self-harm. We review in this paper the main lessons learned from the past and we discuss our future plans for the 2020 edition.
Nowadays, item recommendation is an increasing concern for many companies. Users tend to be more reactive than proactive when solving information needs. Recommendation accuracy has become the most studied aspect of the quality of the suggestions. However, novel and diverse suggestions also contribute to user satisfaction. Unfortunately, optimizing for recommendation accuracy commonly harms those two aspects. In this paper, we present EER, a linear model for the top-N recommendation task, which takes advantage of user and item embeddings to improve novelty and diversity without harming accuracy.
Query-by-example spoken document retrieval (QbESDR) consists of computing, given a collection of documents, how likely a spoken query is to be present in each document. This is usually done by means of pattern matching techniques based on dynamic time warping (DTW), which leads to acceptable results but is inefficient in terms of query processing time. In this paper, the use of probabilistic retrieval models for information retrieval is applied to the QbESDR scenario. First, each document is represented by means of a language model, as commonly done in information retrieval, obtained by estimating the probability of the different n-grams extracted from automatic phone transcriptions of the documents. Then, the score of a query given a document can be computed following the query likelihood retrieval model. Besides the adaptation of this model to QbESDR, this paper presents two techniques that aim at...
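The query likelihood idea over phone transcriptions can be sketched minimally as follows. The additive smoothing used here is an illustrative stand-in for whatever estimation the paper actually employs, and the parameter values are arbitrary:

```python
import math
from collections import Counter

def ngrams(phones, n):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def query_likelihood(query, doc, vocab_size, n=2, mu=10.0):
    """log P(query | doc) under an n-gram language model built from the
    document's automatic phone transcription, with additive smoothing."""
    counts = Counter(ngrams(doc, n))
    total = max(sum(counts.values()), 1)
    score = 0.0
    for g in ngrams(query, n):
        p = (counts[g] + mu / vocab_size) / (total + mu)
        score += math.log(p)
    return score

doc = ["k", "a", "s", "a", "b", "l", "a", "n", "k", "a"]
good = ["k", "a", "s", "a"]   # phone sequence present in the document
bad = ["t", "o", "r", "o"]    # phone sequence absent from the document
print(query_likelihood(good, doc, vocab_size=30) >
      query_likelihood(bad, doc, vocab_size=30))  # -> True
```

Because the score reduces to look-ups in precomputed n-gram counts, this kind of model avoids the per-query alignment cost that makes DTW slow.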
In the field of Information Retrieval, word embedding models have been shown to be effective in several tasks. In this paper, we show how one of these neural embedding techniques can be adapted to the recommendation task. This adaptation only makes use of collaborative filtering information, and the results show that it is able to produce effective recommendations efficiently.
Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. Following this type of analysis...
Word embeddings techniques have attracted a lot of attention recently due to their effectiveness in different tasks. Inspired by the continuous bag-of-words model, we present prefs2vec, a novel embedding representation of users and items for memory-based recommender systems that rely solely on user–item preferences such as ratings. To improve the performance and prevent overfitting, we use a variant of dropout as regularization, which can leverage existent word2vec implementations. Additionally, we propose a procedure for incremental learning of embeddings that boosts the applicability of our proposal to production scenarios. The experiments show that prefs2vec with a standard memory-based recommender system outperforms all the state-of-the-art baselines in terms of ranking accuracy, diversity, and novelty.
Information Retrieval is no longer exclusively about document ranking: new tasks are continuously proposed in this and sibling fields. With this proliferation of tasks, it becomes crucial to have a cheap way of constructing test collections to evaluate the new developments. Building test collections is time- and resource-consuming: it takes time to obtain the documents and to define the user needs, and it requires the assessors to judge many documents. To reduce the latter, pooling strategies aim to decrease the assessment effort by presenting to the assessors a sample of documents from the corpus containing the maximum number of relevant documents. In this paper, we propose the preliminary design of different techniques to easily and cheaply build high-quality test collections without the need for participant systems.
In this paper, we present PRIN, a probabilistic collaborative filtering approach for top-N recommendation. Our proposal relies on the continuous bag-of-words (CBOW) neural model. This fully connected feedforward network takes as input the item profile and produces as output the conditional probabilities of the users given the item. With that information, our model produces item recommendations through Bayesian inversion. The inversion requires the estimation of item priors. We propose different estimates based on centrality measures on a graph that models user-item interactions. An exhaustive evaluation of this proposal shows that our technique outperforms popular state-of-the-art baselines regarding ranking accuracy while showing good values of diversity and novelty.
This paper summarizes the activities related to the CLEF lab on early risk prediction on the Internet (eRisk). eRisk was initiated in 2017 as an attempt to set the experimental foundations of early risk detection. The first edition essentially focused on a pilot task on early detection of signs of depression. In 2018, the lab was enlarged and included an additional task oriented to early detection of signs of anorexia. We review here the main lessons learned and we discuss our plans for 2019.
Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In Information Retrieval, some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in Information Retrieval evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t-test. The sign test and Wilcoxon signed test also have a good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.
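The simulation methodology can be conveyed with a toy experiment: generate paired per-topic scores from two systems whose true difference is known by construction, apply a test, and count how often it rejects. The Gaussian score distributions, the effect size, and the exact sign test used here are illustrative assumptions, not the fitted models of the paper:

```python
import random
from math import comb

def sign_test_p(x, y):
    """Two-sided exact sign test on paired per-topic scores."""
    wins = sum(a > b for a, b in zip(x, y))
    n = sum(a != b for a, b in zip(x, y))   # drop ties
    k = min(wins, n - wins)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

random.seed(0)
# H1 is true by construction: system A is better by 0.08 on average
trials, rejections = 200, 0
for _ in range(trials):
    a = [random.gauss(0.58, 0.1) for _ in range(50)]   # 50 topics
    b = [random.gauss(0.50, 0.1) for _ in range(50)]
    if sign_test_p(a, b) < 0.05:
        rejections += 1
power = rejections / trials   # fraction of true differences detected
print(0.5 < power <= 1.0)     # -> True
```

Because the ground truth is known, the rejection rate directly estimates power; rerunning the loop with identical score distributions for both systems would instead estimate the type I error rate.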
In Information Retrieval evaluation, pooling is a well-known technique to extract a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this paper we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in Financial Trading. Experimental results on several representative collections show that some stopping methods can reduce up to 95% of the assessment effort and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100 and Rank Biased Precision. With all these measures, the correlations found between full pool rankings and reduced pool rankings are very high.
Query-by-example spoken document retrieval (QbESDR) aims at finding those documents in a set that include a given spoken query. Current approaches are, in general, not valid for real-world applications, since they are mostly focused on being effective (i.e. reliably detecting in which documents the query is present), but practical implementations must also be efficient (i.e. the search must be performed in a limited time) in order to allow for a satisfactory user experience. In addition, systems usually search for exact matches of the query, which limits the number of relevant documents retrieved by the search. This paper proposes a representation of the documents and queries for QbESDR based on combining different-sized phone n-grams obtained from automatic transcriptions, namely the phone multigram representation. Since phone transcriptions usually have errors, several hypotheses for the query transcriptions are combined in order to ease the impact of these errors. The proposed system stores the documents in inverted indices, which leads to fast and efficient search. Different combinations of the phone multigram strategy with a state-of-the-art system based on pattern matching using dynamic time warping (DTW) are proposed: one consists in a two-stage system that aims to be as effective as, but more efficient than, a DTW-based system, while the other aims at improving the performance achieved by these two systems by combining their output scores. Experiments performed on the MediaEval 2014 Query-by-Example Search on Speech (QUESST 2014) evaluation framework suggest that the phone multigram representation for QbESDR is a successful approach, and the assessed combinations with a DTW-based strategy lead to more efficient and effective QbESDR systems. In addition, the phone multigram approach succeeded in increasing the detection of non-exact matches of the queries.
Query expansion is a successful approach for improving Information Retrieval effectiveness. This work focuses on pseudo-relevance feedback (PRF), which provides an automatic method for expanding queries without explicit user feedback. These techniques perform an initial retrieval with the original query and select expansion terms from the top retrieved documents. We propose two linear methods for pseudo-relevance feedback, one document-based and another term-based, that model the PRF task as a matrix decomposition problem. These factorizations involve the computation of an inter-document or inter-term similarity matrix which is used for expanding the original query. These decompositions can be computed by solving a least squares regression problem with regularization and a non-negativity constraint. We evaluate our proposals on five collections against state-of-the-art baselines. We found that the term-based formulation provides high figures of MAP, nDCG and robustness index, whereas the document-based formulation provides very cheap computation at the cost of a slight decrease in effectiveness.
The evaluation of Recommender Systems is still an open issue in the field. Despite its limitations, offline evaluation usually constitutes the first step in assessing recommendation methods due to its reduced costs and high reproducibility. Selecting the appropriate metric is critical, and ranking accuracy usually attracts the most attention nowadays. In this paper, we aim to shed light on the advantages of different ranking metrics which were previously used in Information Retrieval and are now used for assessing top-N recommenders. We propose methodologies for comparing the robustness and the discriminative power of different metrics. On the one hand, we study cut-offs and we find that deeper cut-offs offer greater robustness and discriminative power. On the other hand, we find that precision offers high robustness and Normalised Discounted Cumulative Gain provides the best discriminative power.
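For reference, the two metrics singled out above have standard top-N definitions, which can be sketched as follows; the ranking and relevant set are toy data, not from the paper.

```python
# Standard top-N definitions of Precision@k and nDCG@k (binary relevance).
from math import log2

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(item in relevant for item in ranking[:k]) / k

def ndcg_at_k(ranking, relevant, k):
    """DCG of the top-k, normalised by the DCG of an ideal ranking."""
    dcg = sum(1 / log2(i + 2) for i, item in enumerate(ranking[:k]) if item in relevant)
    ideal = sum(1 / log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

ranking = ["a", "b", "c", "d", "e"]   # a hypothetical top-5 recommendation list
relevant = {"a", "c", "e"}            # hypothetical held-out relevant items
print(precision_at_k(ranking, relevant, 5))        # → 0.6
print(round(ndcg_at_k(ranking, relevant, 5), 3))   # → 0.885
```

The logarithmic discount is what gives nDCG its position sensitivity, and hence, plausibly, its stronger discriminative power relative to precision, which treats all top-k positions equally.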
Retrieval effectiveness has been traditionally pursued by improving the ranking models and by enriching the pieces of evidence about the information need beyond the original query. A successful method for producing improved rankings consists of expanding the original query. Pseudo-relevance feedback (PRF) has proved to be an effective method for this task in the absence of explicit user judgements about the initial ranking. This family of techniques obtains expansion terms using the top retrieved documents yielded by the original query. PRF techniques usually exploit the relationship between terms and documents or terms and queries. In this paper, we explore the use of linear methods for pseudo-relevance feedback. We present a novel formulation of the PRF task as a matrix decomposition problem which we called LiMe. This factorisation involves the computation of an inter-term similarity matrix which is used for expanding the original query. We use linear least squares regression with regularisation to solve the proposed decomposition with non-negativity constraints. We compare LiMe on five datasets against strong state-of-the-art baselines for PRF, showing that our novel proposal achieves improvements in terms of MAP, nDCG and robustness index.
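The expansion step that this family of methods rests on can be illustrated in a heavily simplified form: once an inter-term similarity matrix is available, scoring candidate terms against the query terms yields the expansion. In the sketch below the similarity matrix is a plain co-occurrence count over the pseudo-relevant set rather than the learnt least-squares matrix of LiMe, and all data are invented.

```python
# Simplified query expansion via an inter-term similarity matrix.
def expand_query(query, pseudo_docs, n_terms=2):
    vocab = sorted({t for d in pseudo_docs for t in d})
    # sim[t][u]: number of pseudo-relevant documents where t and u co-occur
    sim = {t: {u: sum(t in d and u in d for d in pseudo_docs) for u in vocab}
           for t in vocab}
    # score each candidate term against all query terms
    scores = {u: sum(sim[t][u] for t in query if t in sim) for u in vocab}
    new = [u for u in sorted(scores, key=scores.get, reverse=True) if u not in query]
    return list(query) + new[:n_terms]

# Hypothetical top retrieved documents as term sets
docs = [{"neural", "retrieval", "ranking"},
        {"neural", "retrieval", "feedback"},
        {"retrieval", "feedback", "query"}]
print(expand_query(["retrieval"], docs))  # → ['retrieval', 'feedback', 'neural']
```

LiMe's contribution is precisely in how the similarity matrix is obtained: as the solution of a regularised, non-negative least squares decomposition rather than raw co-occurrence.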
The research community has historically addressed the collaborative filtering task in several fashions. Although model-based approaches such as matrix factorisation attract substantial research efforts, neighbourhood-based recommender systems are effective and interpretable techniques. The performance of neighbour-based methods is strongly tied to the clustering strategies. In this paper, we show that there is room for improvement in this type of recommenders. To show this, we build an oracle which yields approximately optimal neighbourhoods. We obtain ground truth neighbourhoods using the oracle and perform an analytical study of those to characterise them. As a result of our analysis, we propose to change the user profile size normalisation that cosine similarity employs in order to improve the neighbourhoods computed with the k-NN algorithm. Additionally, we present a more appropriate oracle for current grouping strategies, which leads us to include the IDF effect in the cosine formulation. Extensive experimentation on four datasets shows an increase in ranking accuracy, diversity and novelty using these cosine variants. This work sheds light on the benefits of this type of analysis and paves the way for future research in the characterisation of good neighbourhoods for collaborative filtering.
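The idea of injecting an IDF effect into cosine similarity can be sketched on binary user profiles. The exact weighting below (squared IDF on co-rated items) is our illustrative choice, not necessarily the paper's formulation, and the profiles are toy data.

```python
# Cosine neighbour similarity over binary user profiles, with an IDF-style
# discount so that very popular items contribute less to user-user similarity.
from math import log, sqrt

users = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i1", "i2"},
    "u3": {"i2", "i4"},
}

def idf(item, users):
    n = len(users)
    df = sum(item in profile for profile in users.values())
    return log(n / df)

def idf_cosine(a, b, users):
    """Cosine over binary profiles, each item weighted by its IDF."""
    num = sum(idf(i, users) ** 2 for i in users[a] & users[b])
    den = sqrt(sum(idf(i, users) ** 2 for i in users[a])) * \
          sqrt(sum(idf(i, users) ** 2 for i in users[b]))
    return num / den if den else 0.0
```

With these toy profiles, "i2" is rated by every user, so its IDF is zero and the overlap between "u1" and "u3" (who share only "i2") contributes nothing: popularity alone no longer makes two users neighbours.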
In this paper we study how to prioritize relevance assessments in the process of creating an Information Retrieval test collection. A test collection consists of a set of queries, a document collection, and a set of relevance assessments. For each query, only a sample of documents from the collection can be manually assessed for relevance. Multiple retrieval strategies are typically used to obtain such a sample of documents, and rank fusion plays a fundamental role in creating it by combining multiple search results. We propose effective rank fusion models that are adapted to the characteristics of this evaluation task. Our models are based on the distribution of retrieval scores supplied by the search systems, and our experiments show that this formal approach leads to natural and competitive solutions when compared to state-of-the-art methods. We also demonstrate the benefits of including pseudo-relevance evidence into the estimation of the score distribution models.
Relevance-Based Language Models are a formal probabilistic approach for explicitly introducing the concept of relevance in the Statistical Language Modelling framework. Recently, they have proved to be a very effective way of computing recommendations. When combining this new recommendation approach with Posterior Probabilistic Clustering for computing neighbourhoods, the item ranking is further improved, radically surpassing rating prediction recommendation techniques. Nevertheless, in the current landscape, where the number of recommendation scenarios reaching the big data scale is increasing day after day, high figures of effectiveness are not enough. In this paper, we address one urgent and common need of recommender systems: algorithm scalability. Particularly, we adapted those highly effective algorithms to the functional MapReduce paradigm, which has previously been proved an adequate tool for enabling recommender scalability. We evaluated the performance of our approach under realistic circumstances, showing good scalability behaviour with the number of nodes in the MapReduce cluster. Additionally, as a result of being able to execute our algorithms in a distributed fashion, we can report measurements on a much bigger collection, supporting the results presented in the seminal paper.
This paper provides an overview of eRisk 2018. This was the second year that this lab was organized at CLEF. The main purpose of eRisk was to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. The second edition of eRisk had two tasks: a task on early risk detection of depression and a task on early risk detection of anorexia.
This paper provides an overview of eRisk 2017. This was the first year that this lab was organized at CLEF. The main purpose of eRisk was to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. The first edition of eRisk had two possible ways to participate: a pilot task on early risk detection of depression, and a workshop open to the submission of papers related to the topics of the lab.
Pseudo-relevance feedback (PRF) provides an automatic method for query expansion in Information Retrieval. These techniques find relevant expansion terms using the top retrieved documents with the original query. In this paper, we present an approach based on linear methods called LiMe that formulates the PRF task as a matrix factorization problem. LiMe learns an inter-term similarity matrix from the pseudo-relevant set and the query, which it uses for computing expansion terms. The experiments on five datasets show that LiMe outperforms state-of-the-art baselines in most cases.
In this paper we describe our recent research on effective construction of Information Retrieval collections. Relevance assessments are a core component of test collections, but they are expensive to produce. For each test query, only a sample of documents in the corpus can be assessed for relevance. We discuss here a class of document adjudication methods that iteratively choose documents based on reinforcement learning. Given a pool of candidate documents supplied by multiple retrieval systems, the production of relevance assessments is modeled as a multi-armed bandit problem. These bandit-based algorithms identify relevant documents with minimal effort. One instance of these models has been adopted by NIST to build the test collection of the TREC 2017 common core track.
Given the diversity of recommendation algorithms, choosing one technique is becoming increasingly difficult. In this paper, we explore methods for combining multiple recommendation approaches. We studied rank aggregation methods that have been proposed for the metasearch task (i.e., fusing the outputs of different search engines) but have never been applied to merge top-N recommender systems. These methods require neither training data nor parameter tuning. We analysed two families of methods: voting-based and score-based approaches. These rank aggregation techniques yield significant improvements over state-of-the-art top-N recommenders. In particular, score-based methods yielded good results; however, some voting techniques were also competitive without using score information, which may be unavailable in some recommendation scenarios. The studied methods not only improve the state of the art of recommendation algorithms, but they are also simple and efficient.
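One classical voting-based aggregation method of the kind studied here is the Borda count, which needs only the positions in each list. The sketch below uses two invented top-N lists and is illustrative, not the paper's code.

```python
# Borda count rank aggregation: each list awards len(list) - rank points
# to every item it contains; items are merged by total points.
from collections import defaultdict

def borda_fuse(rankings, n=3):
    """Merge several top-N lists into a single top-n list by Borda count."""
    points = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            points[item] += len(ranking) - rank
    return sorted(points, key=points.get, reverse=True)[:n]

# Hypothetical outputs of two recommenders for the same user
rec_a = ["i1", "i2", "i3", "i4"]
rec_b = ["i2", "i4", "i1", "i5"]
print(borda_fuse([rec_a, rec_b]))  # → ['i2', 'i1', 'i4']
```

Note that no retrieval or recommendation scores are used, only ranks, which is exactly why such voting methods remain applicable when score information is unavailable.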
Evaluating Information Retrieval systems is crucial to making progress in search technologies. Evaluation is often based on assembling reference collections consisting of documents, queries and relevance judgments done by humans. In large-scale environments, exhaustively judging relevance becomes infeasible. Instead, only a pool of documents is judged for relevance. By selectively choosing documents from the pool we can optimize the number of judgments required to identify a given number of relevant documents. We argue that this iterative selection process can be naturally modeled as a reinforcement learning problem and propose innovative and formal adjudication methods based on multi-armed bandits. Casting document judging as a multi-armed bandit problem is not only theoretically appealing, but also leads to highly effective adjudication methods. Under this bandit allocation framework, we consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). Our paper also reports a series of experiments performed to thoroughly compare our new methods against current adjudication methods. This comparative study includes existing methods designed for pooling-based evaluation and existing methods designed for metasearch. Our experiments show that our theoretically grounded adjudication methods can substantially minimize the assessment effort.
Language Models constitute an effective framework for text retrieval tasks. Recently, it has been extended to various collaborative filtering tasks. In particular, relevance-based language models can be used for generating highly accurate recommendations using a memory-based approach. On the other hand, the query likelihood model has proven to be a successful strategy for neighbourhood computation. Since relevance-based language models rely on user neighbourhoods for producing recommendations, we propose to use the query likelihood model for computing those neighbourhoods instead of cosine similarity. The combination of both techniques results in a formal probabilistic recommender system which has not been used before in collaborative filtering. A thorough evaluation on three datasets shows that the query likelihood model provides better results than cosine similarity. To understand this improvement, we devise two properties that a good neighbourhood algorithm should satisfy. Our axiomatic analysis shows that the query likelihood model always enforces those constraints while cosine similarity does not.
Automatically summarizing a document requires conveying the important points of a large document in only a few sentences. Extractive strategies for summarization are based on selecting the most important sentences from the input document(s). We claim here that standard features for estimating sentence importance can be effectively combined with innovative features that encode psychological aspects of communication. We employ Quantitative Text Analysis tools for estimating psychological features and we inject them into state-of-the-art extractive summarizers. Our experiments demonstrate that this novel set of features provides good guidance for selecting salient sentences. Our empirical study concludes that psychological features are best suited for hard summarization cases. This motivated us to formally define and study the problem of predicting the difficulty of summarization. We propose a number of predictors to model the difficulty of every summarization problem and we evaluate several learning methods to perform this prediction task.
This paper provides an overview of eRisk 2017. This was the first year that this lab was organized at CLEF. The main purpose of eRisk was to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. The first edition of eRisk included a pilot task on early risk detection of depression.
This paper provides an overview of eRisk 2017. This was the first year that this lab was organized at CLEF. The main purpose of eRisk was to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. The first edition of eRisk had two possible ways to participate: a pilot task on early risk detection of depression, and a workshop open to the submission of papers related to the topics of the lab.
Recommender systems are a growing research field due to their immense potential for helping users to select products and services. Recommenders are useful in a broad range of domains such as films, music, books, restaurants, hotels, social networks, news, etc. Traditionally, recommenders tend to promote certain products or services of a company that are already popular among the user community. An important research concern is how to formulate recommender systems centred on those items that are not very popular: the long tail products. A special case are items that result from overstocking by the vendor. Overstock, that is, excess inventory, is a source of revenue loss. In this paper, we propose that recommender systems can be used to liquidate long tail products while maximising business profit. First, we propose a formalisation for this task with the corresponding evaluation methodology and datasets. Then, we design a specially tailored algorithm, based on item relevance models, centred on getting rid of those unpopular products. A comparison with existing proposals demonstrates that the advocated method significantly outperforms other state-of-the-art techniques on this task.
Recently, Relevance-Based Language Models have been demonstrated as an effective Collaborative Filtering approach. Nevertheless, this family of Pseudo-Relevance Feedback techniques is computationally too expensive to apply to web-scale data. Also, they require the use of smoothing methods which need to be tuned. These facts lead us to study other similar techniques with better trade-offs between effectiveness and efficiency. Specifically, in this paper, we analyse the applicability to the recommendation task of four well-known query expansion techniques with multiple probability estimates. Moreover, we analyse the effect of neighbourhood length and devise a new probability estimate that takes this property into account, yielding better recommendation rankings. Finally, we find that the proposed algorithms are dramatically faster than those based on Relevance-Based Language Models, they do not have any parameter to tune (apart from those of the neighbourhood) and they provide a better trade-off between accuracy and diversity/novelty.
Language Models are state-of-the-art methods in Information Retrieval. Their sound statistical foundation and high effectiveness in several retrieval tasks are key to their current success. In this paper, we explore how to apply these models to the task of computing user or item neighbourhoods in a collaborative filtering scenario. Our experiments showed that this approach is superior to other neighbourhood strategies and also very efficient. Our proposal, in conjunction with a simple neighbourhood-based recommender, showed great performance compared to state-of-the-art methods (NNCosNgbr and PureSVD) while its computational complexity is low.
Evaluation is crucial in Information Retrieval. The Cranfield paradigm allows reproducible system evaluation by fostering the construction of standard and reusable benchmarks. Each benchmark or test collection comprises a set of queries, a collection of documents and a set of relevance judgements. Relevance judgements are often done by humans and thus expensive to obtain. Consequently, relevance judgements are customarily incomplete. Only a subset of the collection, the pool, is judged for relevance. In TREC-like campaigns, the pool is formed by the top retrieved documents supplied by systems participating in a certain evaluation task. With multiple retrieval systems contributing to the pool, an exploration/exploitation trade-off arises naturally. Exploiting effective systems could find more relevant documents, but exploring weaker systems might also be valuable for the overall judgement process. In this paper, we cast document judging as a multi-armed bandit problem. This formal modelling leads to theoretically grounded adjudication strategies that improve over the state of the art. We show that simple instantiations of multi-armed bandit models are superior to all previous adjudication strategies.
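A Thompson-sampling instantiation of this bandit view is easy to sketch (this is an illustrative sketch of the general idea, not one of the exact strategies from the paper): each participating system is an arm whose "precision" has a Beta posterior, and every judgment updates the posterior of the arm that supplied the document.

```python
# Thompson-sampling document adjudication: exploit systems that have been
# supplying relevant documents, while still exploring weaker systems.
import random

def adjudicate(runs, judge, budget):
    """runs: {system: ranked doc list}; judge(doc) -> True/False (relevance)."""
    alpha = {s: 1 for s in runs}   # 1 + relevant docs contributed so far
    beta = {s: 1 for s in runs}    # 1 + non-relevant docs contributed so far
    judged = {}
    for _ in range(budget):
        # sample a plausible precision for every arm; pick the most promising
        s = max(runs, key=lambda a: random.betavariate(alpha[a], beta[a]))
        doc = next((d for d in runs[s] if d not in judged), None)
        if doc is None:          # this system's run is exhausted
            continue
        judged[doc] = judge(doc)
        alpha[s] += judged[doc]
        beta[s] += not judged[doc]
    return judged
```

Systems that keep contributing relevant documents get sharper, higher posteriors and are selected more often (exploitation), yet every arm retains a non-zero probability of being sampled (exploration), which is the trade-off the abstract describes.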
The use of Relevance-Based Language Models for top-N recommendation has become a promising line of research. Previous works have used collection-based smoothing methods for this task. However, a recent analysis of RM1 (an estimation of Relevance-Based Language Models) in document retrieval showed that this type of smoothing method demotes the IDF effect in pseudo-relevance feedback. In this paper, we claim that the IDF effect from retrieval is closely related to the concept of novelty in recommendation. We perform an axiomatic analysis of the IDF effect on RM2, concluding that this kind of smoothing method also demotes the IDF effect in recommendation. By axiomatic analysis, we find that a collection-agnostic method, Additive smoothing, does not demote this property. Our experiments confirm that this alternative improves the accuracy, novelty and diversity figures of the recommendations.
Automatic Text Summarisation is an essential technology to cope with the overwhelming amount of documents that are daily generated. Given an information source, such as a webpage or a news article, text summarisation consists of extracting content from it and presenting it in a condensed form for human consumption. Summaries are crucial to facilitate information access. The reader is provided with the key information in a concise and fluent way. This speeds up navigation through large repositories of data. With the rapid growth of online contents, creating manual summaries is not an option. Extractive summarisation methods are based on selecting the most important sentences from the input. To meet this aim, a ranking of candidate sentences is often built from a reduced set of sentence features. In this paper, we show that many features derived from psychological studies are valuable for constructing extractive summaries. These features encode psychological aspects of communication and are good guidance for selecting salient sentences. We use Quantitative Text Analysis tools for extracting these features and inject them into state-of-the-art extractive summarisers. Incorporating these novel components into existing extractive summarisers requires combining and weighting a high number of sentence features. In this respect, we show that Particle Swarm Optimisation is a viable approach to set the features' weights. Following standard evaluation practice (DUC benchmarks), we also demonstrate that our novel summarisers are highly competitive.
Language models represent a successful framework for many Information Retrieval tasks: ad hoc retrieval, pseudo-relevance feedback or expert finding are some examples. We show how language models can effectively compute user or item neighbourhoods in a collaborative filtering scenario (this idea was originally proposed in ECIR 2016). The experiments support the applicability of this approach for neighbourhood-based recommendation, surpassing the rest of the baselines. Additionally, the computational cost of this approach is small, since language models have been efficiently applied to large-scale retrieval tasks such as web search.
Probabilistic modelling of recommender systems naturally introduces the concept of prior probability into the recommendation task. Relevance-Based Language Models, a principled probabilistic query expansion technique in Information Retrieval, have recently been adapted to the item recommendation task with success. In this paper, we study the effect of the item and user prior probabilities under that framework. We adapt two priors from the document retrieval field and then propose two new probabilistic priors. Evidence gathered from experimentation indicates that a linear prior for the neighbour and a probabilistic prior based on Dirichlet smoothing for the items improve the quality of the item recommendation ranking.
Language Models have been traditionally used in several fields like speech recognition or document retrieval. It was only recently that their use was extended to collaborative Recommender Systems. In this field, a Language Model is estimated for each user based on the probabilities of the items. A central issue in the estimation of such a Language Model is smoothing, i.e., how to adjust the maximum likelihood estimator to compensate for rating sparsity. This work is devoted to exploring how the classical smoothing approaches (Absolute Discounting, Jelinek-Mercer and Dirichlet priors) perform in the field of Recommender Systems. We tested the different methods under the recently presented Relevance-Based Language Models for collaborative filtering, and compared how the smoothing techniques behave in terms of precision and stability. We found that Absolute Discounting is practically insensitive to the parameter value, making it an almost parameter-free method and, at the same time, its performance is similar to Jelinek-Mercer and Dirichlet priors.
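The three classical smoothing estimates compared above have compact standard forms; written for a user profile of item counts, they look as follows. The parameter defaults and toy numbers are ours, for illustration only.

```python
# Classical LM smoothing estimates of p(item | user), where c is the item's
# count in the user profile, profile_len the total count, p_bg the item's
# background (collection) probability, and n_distinct the number of distinct
# items in the profile.

def jelinek_mercer(c, profile_len, p_bg, lam=0.5):
    """Linear interpolation with the background model, weight lam."""
    return (1 - lam) * c / profile_len + lam * p_bg

def dirichlet(c, profile_len, p_bg, mu=100):
    """Bayesian smoothing with a Dirichlet prior of mass mu."""
    return (c + mu * p_bg) / (profile_len + mu)

def absolute_discounting(c, profile_len, p_bg, n_distinct, delta=0.7):
    """Subtract a constant delta from each seen count; redistribute the mass."""
    return max(c - delta, 0) / profile_len + delta * n_distinct / profile_len * p_bg
```

All three assign non-zero probability to unseen items (c = 0), which is what makes the estimators usable under rating sparsity; they differ in how the discounted mass depends on the profile.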
In the blogosphere, different actors express their opinions about multiple topics. Users, companies or editors socially interact by commenting, recommending and linking blogs and posts. This social media content is growing rapidly. As a matter of fact, the size of the blogosphere is estimated to double every six months. In this context, the problem of finding a topically relevant blog to subscribe to becomes a Big Data challenge. Moreover, combining multiple types of evidence is essential for this search task. In this paper we propose a group of textual and social-based signals, and apply different Information Fusion algorithms for a Blog Distillation Search task. Information fusion through the combination of the different types of evidence requires optimisation for appropriately weighting each source of evidence. To this end, we analyse well-established population-based search methods: global search methods (Particle Swarm Optimisation and Differential Evolution) and a local search method (Line Search) that has been effective in various Information Retrieval tasks. Moreover, we propose hybrid combinations between the global search and the local search methods and compare all the alternatives following a standard methodology. Efficiency is an imperative here and, therefore, we focus not only on achieving high search effectiveness but also on designing efficient solutions.
Relevance-Based Language Models, commonly known as Relevance Models, are successful approaches to explicitly introduce the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models achieve state-of-the-art retrieval performance in the Pseudo Relevance Feedback task. It is known that one of the factors that most affects Pseudo Relevance Feedback robustness is the selection, for some queries, of harmful expansion terms. To minimise this effect in these methods, a crucial point is to reduce the number of non-relevant documents in the pseudo-relevant set. In this paper, we propose an original approach to tackle this problem: we try to automatically determine, for each query, how many documents should be selected as the pseudo-relevant set. To achieve this, we study the score distributions of the initial retrieval, trying to distinguish relevant from non-relevant documents on the basis of those distributions. Evaluation of our proposal showed important improvements in terms of robustness.
The Digital Age has brought great benefits for the human race but also some drawbacks. Nowadays, people from opposite corners of the world can communicate online via instant messaging services. Unfortunately, this has introduced new kinds of crime. Sexual predators have adapted their predatory strategies to these platforms and, usually, the target victims are children. The authorities cannot manually track all threats because massive amounts of online conversations take place on a daily basis. Automatic methods for alerting about these crimes need to be designed. This is the main motivation of this paper, where we present a Machine Learning approach to identify suspicious subjects in chat-rooms. We propose novel types of features for representing the chatters and we evaluate different classifiers against the largest benchmark available. This empirical validation shows that our approach is promising for the identification of predatory behaviour. Furthermore, we carefully analyse the characteristics of the learnt classifiers. This preliminary analysis is a first step towards profiling the behaviour of sexual predators when chatting on the Internet.
Nowadays, scalability is a critical factor in the design of any system working with big data. In particular, it has been recognised as a main challenge in the construction of recommender systems. In this paper, we present a recommender architecture capable of making personalised recommendations using collaborative filtering in a big data environment. We aim to build highly scalable systems without any single point of failure. Replication and data distribution as well as caching techniques are used to achieve this goal. We suggest specific technologies for each subsystem of our proposed architecture considering scalability and fault tolerance. Furthermore, we evaluate, under realistic scenarios, the performance of different alternatives (RDBMS and NoSQL) for storing, generating and serving recommendations.
Relevance-Based Language Models, commonly known as Relevance Models, are successful approaches to explicitly introduce the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models achieve state-of-the-art retrieval performance in the pseudo relevance feedback task. On the other hand, the field of recommender systems is a fertile research area where users are provided with personalised recommendations in several applications. In this paper, we propose an adaptation of the Relevance Modelling framework to effectively suggest recommendations to a user. We also propose a probabilistic clustering technique to perform the neighbour selection process as a way to achieve a better approximation of the set of relevant items in the pseudo relevance feedback process. These techniques, although well known in the Information Retrieval field, have not been applied yet to recommender systems, and, as the empirical evaluation results show, both proposals outperform individually several baseline methods. Furthermore, by combining both approaches even larger effectiveness improvements are achieved.
Relevance-Based Language Models are an effective IR approach which explicitly introduces the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models have been shown to achieve state-of-the-art retrieval performance in the pseudo relevance feedback task. In this paper we propose a novel adaptation of this language modelling approach to rating-based Collaborative Filtering. In a memory-based approach, we apply the model to the formation of user neighbourhoods and the generation of recommendations based on such neighbourhoods. We report experimental results where our method outperforms other standard memory-based algorithms in terms of ranking precision.
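As a rough illustration of this memory-based idea, the sketch below scores unseen items with a relevance-model-style estimate: each neighbour is treated as a "document" whose smoothed rating distribution acts as a language model, and the target user's rated items play the role of the query. The toy data, the smoothing weight `lam` and the exact estimate are illustrative assumptions, not the paper's formulation.

```python
def rm_recommend(user, ratings, lam=0.1):
    """Relevance-model-style CF sketch: score item i for `user` by
    sum over neighbours v of p(i|v) * prod_{j rated by user} p(j|v),
    with Jelinek-Mercer-style smoothing against the global popularity."""
    all_total = sum(sum(r.values()) for r in ratings.values())

    def p(item, v):
        # smoothed probability of `item` under neighbour v's rating model
        total = sum(ratings[v].values())
        p_c = sum(r.get(item, 0) for r in ratings.values()) / all_total
        return (1 - lam) * ratings[v].get(item, 0) / total + lam * p_c

    seen = set(ratings[user])
    scores = {}
    for v in ratings:
        if v == user:
            continue
        q_lik = 1.0                      # "query likelihood" of user's items
        for j in seen:
            q_lik *= p(j, v)
        for i in ratings[v]:
            if i not in seen:
                scores[i] = scores.get(i, 0.0) + p(i, v) * q_lik
    return max(scores, key=scores.get)

ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 3, "C": 5},   # tastes similar to u1
    "u3": {"D": 5, "E": 4},           # unrelated user
}
rec = rm_recommend("u1", ratings)      # u2's item "C" dominates
```

Because u2 shares u1's rated items, its query likelihood dwarfs u3's, so u2's unseen item is recommended.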
In recent years, Pseudo Relevance Feedback techniques have become one of the most effective query expansion approaches for document retrieval. Particularly, Relevance-Based Language Models have been applied in several domains as an effective and efficient way to enhance topic retrieval. Recently, some extensions to the original RM methods have been proposed to apply query expansion in other scenarios, such as opinion retrieval. Such approaches rely on mixture models that combine the query expansion provided by Relevance Models with opinionated terms obtained from external resources (e.g., opinion lexicons). However, these methods ignore the structural aspects of a document, which are valuable for extracting topic-dependent opinion expressions. For instance, the sentiments conveyed in blogs are often located in specific parts of the blog posts and their comments. We argue here that the comments are a good guide to finding on-topic opinion terms that help to move the query towards burning aspects of the topic. We study the role of the different parts of a blog document in enhancing blog opinion retrieval through query expansion. The proposed method does not require external resources or additional knowledge, and our experiments show that this is a promising and simple way to produce a more accurate ranking of blog posts in terms of their sentiment towards the query topic. Our approach compares well with other opinion finding methods, obtaining high precision performance without harming mean average precision.
Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints over the clustering output. Those constraints (typically pair-wise constraints between documents) are usually introduced by designing new clustering algorithms that enforce their accomplishment. In this paper we present an alternative approach for constrained clustering where, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation by means of its language modelling. More precisely, the constraints are modelled using the well-known Relevance Models successfully used in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge this is the first attempt to try such an approach. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.
Spectral clustering techniques have become some of the most popular clustering algorithms, mainly because of their simplicity and effectiveness. In this work, we make use of one of these techniques, Normalised Cut, to derive a cluster-based collaborative filtering algorithm which outperforms other standard state-of-the-art techniques in terms of ranking precision. We frame this technique as a method for neighbour selection, and we show its effectiveness when compared with other cluster-based methods. Furthermore, the performance of our method can be further improved when standard similarity metrics -- such as Pearson's correlation -- are also used to predict the user's preferences.
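The neighbour-selection step can be illustrated with a minimal two-way Normalised Cut over a user similarity matrix: threshold the sign of the second-smallest eigenvector of the symmetric normalised Laplacian. The matrix, the two-way split and the NumPy implementation are illustrative assumptions, not the full cluster-based CF algorithm.

```python
import numpy as np

def normalised_cut_2way(S):
    """Two-way Normalised Cut sketch: build the symmetric normalised
    Laplacian of similarity matrix S and split users by the sign of its
    second-smallest eigenvector (the Fiedler vector)."""
    d = S.sum(axis=1)                          # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - d_inv_sqrt @ S @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L)                # eigenvalues ascending
    fiedler = vecs[:, 1]                       # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Two obvious user groups: {0,1} similar to each other, {2,3} likewise.
S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.9],
              [0.0, 0.1, 0.9, 1.0]])
labels = normalised_cut_2way(S)
```

On this block-structured similarity matrix the sign split recovers the two user groups, which then serve as candidate neighbourhoods.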
The existence of sexual predators that enter chat rooms or forums and try to convince children to provide some sexual favour is a socially worrying issue. Manually monitoring these interactions is one way to attack this problem. However, this manual approach simply cannot keep pace because of the high number of conversations and the huge number of chat rooms or forums where these conversations take place daily. We need tools that automatically process massive amounts of conversations and alert about possible offenses. The sexual predator identification challenge within PAN 2012 is a valuable way to promote research in this area. Our team faced this task as a Machine Learning problem and we designed several innovative sets of features that guide the construction of classifiers for identifying sexual predation. Our methods are driven by psycholinguistic, chat-based, and tf/idf features and yield very effective classifiers.
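To make the tf/idf part of such a representation concrete, here is a minimal sketch that turns each chatter's concatenated messages into a tf-idf vector. The tokenisation, weighting variant and toy data are generic illustrations; the actual systems combine this with psycholinguistic and chat-based features not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each document (e.g. a chatter's concatenated messages)
    as a sparse tf-idf vector: weight = (1 + log tf) * log(N / df)."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenised for t in set(doc))  # document freq.
    n = len(docs)
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

chats = ["hi how old are you", "the meeting is at noon", "how are you today"]
vecs = tfidf_vectors(chats)
```

Terms unique to one chatter get the highest idf, so distinctive vocabulary dominates each vector.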
Recently, a new family of semi-supervised clustering algorithms, coined constrained clustering, has emerged. These new algorithms can incorporate a priori domain knowledge into the clustering process, allowing the user to guide the method. The vast majority of studies about the effectiveness of these approaches have been performed using information, in the form of constraints, which was totally accurate. This would be the ideal case, but such a situation is impossible in most realistic settings, due to errors in the constraint creation process, misjudgements by the user, inconsistent information, etc. Hence, the robustness of constrained clustering algorithms when dealing with erroneous constraints is bound to play an important role in their final effectiveness. In this paper we study the behaviour of four constrained clustering algorithms (Constrained k-Means, Soft Constrained k-Means, Constrained Normalised Cut and Normalised Cut with Imposed Constraints) when not all the information supplied to them is accurate. The experimentation over text and numeric datasets using two different noise models, one of them an original approach based on similarities, highlighted the strengths and weaknesses of each method when working with positive and negative constraints, indicating the scenarios in which each algorithm is most appropriate.
Text preprocessing and segmentation are critical tasks in search and text mining applications. Due to the huge number of documents that are available exclusively in PDF format, most Data Mining (DM) and Information Retrieval (IR) systems must extract content from PDF files. On some occasions this is a difficult task: the result of the extraction process from a PDF file is plain text, and it should be returned in the same order as a human would read the original PDF file. However, current tools for PDF text extraction fail in this objective when working with complex documents with multiple columns. For instance, this is the case of official government bulletins with legal information. In this task, it is mandatory to obtain correct and ordered text as the result of applying the PDF extractor. It is very common for a legal article in a document to refer to a previous article, and they should be offered in the right sequential order. To overcome these difficulties we have designed a new method for text extraction from PDFs that simulates the human reading order. We evaluated our method and compared it against other PDF extraction tools and algorithms. The evaluation shows that our approach significantly outperforms the existing tools and algorithms.
Information Retrieval techniques traditionally depend on the setting of one or more parameters. Depending on the problem and the techniques, the number of parameters can range from one or two to dozens. One crucial problem in Information Retrieval research is achieving a good parameter setting for its methods. The tuning process, when dealing with several parameters, is a time-consuming and critical step. In this paper we introduce the use of Particle Swarm Optimisation for the automatic tuning of the parameters of Information Retrieval methods. We compare our proposal with the Line Search method, previously adopted in Information Retrieval. The comparison shows that our approach is faster and achieves better results than Line Search. Furthermore, Particle Swarm Optimisation algorithms are suitable for parallelisation, improving the algorithm behaviour in terms of time to convergence.
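A minimal one-dimensional PSO sketch of this tuning loop follows. The objective here is a toy surrogate for a retrieval-quality surface (e.g. MAP as a function of a smoothing parameter); the particle count, inertia and acceleration weights are conventional illustrative values, not those used in the paper.

```python
import random

def pso(objective, bounds, n_particles=20, iters=60, seed=42):
    """Maximise `objective` over a 1-D interval with basic Particle
    Swarm Optimisation: each particle tracks its own best and is pulled
    towards the global best."""
    rnd = random.Random(seed)
    lo, hi = bounds
    pos = [rnd.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best = pos[:]                                  # per-particle best
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    gbest, gbest_val = best[g], best_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5                      # inertia / acceleration
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rnd.random(), rnd.random()
            vel[i] = (w * vel[i] + c1 * r1 * (best[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            val = objective(pos[i])
            if val > best_val[i]:
                best[i], best_val[i] = pos[i], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val

# Toy effectiveness surface peaking at mu = 2000 (stand-in for MAP(mu)).
surrogate = lambda mu: -((mu - 2000.0) / 1000.0) ** 2
mu, score = pso(surrogate, (0.0, 10000.0))
```

In a real setting each `objective` call would run a full retrieval evaluation, which is why the inner loop parallelises naturally across particles.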
The adaptation of Computer Science Engineering degrees to the European Higher Education Area (EHEA) has involved, on the one hand, a renewal of the courses on offer and, on the other, a change in the established teaching paradigm. In particular, the Computer Science Faculty of the University of A Coruña has introduced an Information Retrieval course into its curriculum. Information Retrieval is, nowadays, a mature and established subject within computer science. The University of A Coruña, which has been a reference in computing in the autonomous community of Galicia since its foundation, has included it as a fundamental subject in its new study plans. Specifically, in the B.Sc. in Computer Science Engineering at the University of A Coruña, the Information Retrieval course belongs to the Computing itinerary and carries 6 credits. In the recently proposed M.Sc. in Computer Science Engineering, the course on Information Retrieval and the Semantic Web also carries 6 ECTS credits. A large part of the teaching associated with these new courses will be practical in nature, as they belong to Engineering degrees. In this scenario, there is a fundamental need for suitable tools adapted to the new educational paradigm where, in the spirit of the EHEA, the student's autonomous work increases and the contact hours guided by a lecturer decrease. It is therefore our intention, in the light of the new teaching and methodological situation, to review the existing tools for the practical teaching of Information Retrieval, with particular emphasis on the factors introduced by the constraints associated with the adaptation to the EHEA.
Specifically, in this work we analyse software tools considering several factors relevant to teaching, without aiming to be exhaustive: programming language, licence, community, documentation, support, available models, ease of evaluation, etc. Although some comparisons of software tools exist from the point of view of commercial or research use, in this document we consider it important to analyse the tools from the point of view of their suitability for teaching and learning. This work falls within the methodological and teaching-resources line of the adaptation to the EHEA, and we will answer some important questions such as: which tools are most suitable for students' autonomous work? which tools are most suitable given the background acquired by students in the context of the University of A Coruña study plans? which tools will allow lecturers to put into practice the syllabus explained in the lectures? and which tools will facilitate the continuous assessment of students?
In recent years, cluster-based retrieval has proven to be an effective tool for both interactive retrieval and pseudo relevance feedback techniques. In this paper we propose a new cluster-based retrieval function which uses the best and worst clusters of a document in the cluster ranking to improve retrieval effectiveness. The evaluation shows improvements over state-of-the-art techniques in precision and robustness on several standard TREC collections.
Traditionally, pseudo relevance feedback (PRF) techniques for query expansion have proven very effective. In particular, the use of Relevance Models (RM) within the Language Modelling framework has been established as a high-performance approach to beat. In this paper we present an alternative estimation for the RM that promotes terms which, being present in the relevance set, are also distant from the language model of the collection. We compared this approach with RM3 and with an adaptation to the Language Modelling framework of Rocchio's KLD-based term ranking function. The evaluation showed that this alternative estimation of RM consistently reports better results than RM3, being on average the most stable across collections in terms of robustness.
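The general idea of promoting terms that are prominent in the pseudo-relevant set yet distant from the collection model can be sketched with a KLD-style term score, p(t|R) * log(p(t|R) / p(t|C)). This is a generic member of that family of estimations, with crude smoothing and toy data; it is not the exact formula proposed in the paper.

```python
import math
from collections import Counter

def kld_term_scores(feedback_docs, collection_docs):
    """Score candidate expansion terms by p(t|R) * log(p(t|R) / p(t|C)):
    terms frequent in the pseudo-relevant set R but rare in the
    collection C are promoted; common terms are pushed down."""
    rel = Counter(t for d in feedback_docs for t in d.lower().split())
    col = Counter(t for d in collection_docs for t in d.lower().split())
    rel_len, col_len = sum(rel.values()), sum(col.values())
    scores = {}
    for t, c in rel.items():
        p_r = c / rel_len
        p_c = col.get(t, 0.5) / col_len   # crude smoothing for unseen terms
        scores[t] = p_r * math.log(p_r / p_c)
    return scores

feedback = ["neural ranking models", "ranking with language models"]
collection = feedback + ["weather today is sunny", "sunny weather forecast"] * 10
scores = kld_term_scores(feedback, collection)
top = max(scores, key=scores.get)
```

Topic-bearing terms like "ranking" score highest because they concentrate in the feedback set while staying rare in the collection.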
In this paper we study the use of social bookmarking to improve the quality of text clustering. Recently, constrained clustering algorithms have been presented as a successful tool for introducing domain knowledge into the clustering process. This paper uses the tags saved by the users of Delicious to generate non-artificial constraints for constrained clustering algorithms. The study demonstrates that it is possible to achieve a high percentage of good constraints with this simple approach and, more importantly, the evaluation shows that the use of these constraints produces a great improvement (up to 91.25%) in the effectiveness of the clustering algorithms.
This book arises from the need for material that, with an eminently didactic approach, gives a general overview of the discipline of Information Retrieval, from the fundamentals to current research proposals. The idea is to offer the reader the inner workings of an area of knowledge whose advances translate directly into the programs we use every day for all sorts of everyday tasks. To achieve these goals, we have relied on the collaboration of a panel of experts internationally recognised for their research in the field of Information Retrieval. Each of them has focused on the chapters whose topics they specialise in and know thoroughly. Moreover, the vast majority of them have invaluable teaching experience in Information Retrieval courses, so their experience and knowledge in disseminating this discipline have been transferred directly to their chapters, and implicitly to the whole book.
We report on the University of Lugano's participation in the Blog and Session tracks of TREC 2010. In particular, we describe our system for performing blog distillation, faceted search, top stories identification and session reranking.
In recent years, Blog Search has become an exciting new task in Information Retrieval. The presence of user-generated information with valuable opinions makes this field of huge interest. In this poster we use part of this information, the readers' comments, to improve the quality of post snippets with the objective of enhancing the user's access to the relevant posts in a result list. We propose a simple method for snippet generation based on sentence selection, using the comments to guide the selection process. We evaluated our approach with standard TREC methodology on the Blogs06 collection, showing significant improvements of up to 32% in terms of MAP over the baseline.
Novelty detection is a difficult task, particularly at sentence level. Most of the approaches proposed in the past consist of re-ordering all sentences according to their novelty scores. However, this re-ordering usually has little value. In fact, a naive baseline with no novelty detection capabilities often yields better performance than state-of-the-art novelty detection mechanisms. We argue here that this is because current methods initiate the novelty detection process too early. When few sentences have been seen, it is unlikely that the user is negatively affected by redundancy. Therefore, re-ordering the first sentences may be harmful in terms of performance. We propose a query-dependent method based on cluster analysis to determine where redundancy filtering must start.
The problems of finding alternative clusterings and avoiding bias have gained popularity over the last years. In this paper we focus on the quality of these alternative clusterings, proposing two approaches based on the use of negative constraints in conjunction with spectral clustering techniques. The first approach tries to introduce these constraints in the core of the constrained normalised cut clustering, while the second combines spectral clustering and soft constrained k-means. The experiments performed on textual collections showed that the first method does not yield good results, whereas the second attains large improvements in the quality of the clustering results while keeping low similarity with the avoided grouping.
This paper focuses on the extraction of certain parts of a blog, the post and the comments, presenting a technique based on the blog structure and the attributes of its elements, exploiting similarities and conventions among different blog providers and Content Management Systems (CMS). The impact of the extraction process on retrieval tasks is also explored. Separate evaluation is performed for both goals: extraction is evaluated through human inspection of the results of the extraction technique over a sample of blogs, while retrieval performance is automatically evaluated through standard TREC methodology and the resources provided by the Blog Track. The results show important and significant improvements over a baseline which does not incorporate the extraction approach.
Robust Information Retrieval (IR) systems are in demand due to the widespread and multipurpose use of document images and the high number of document image repositories available nowadays. This paper presents a novel approach to support the automatic generation of relationships among document images by exploiting Latent Semantic Indexing (LSI) and Optical Character Recognition (OCR). The LinkDI service extracts and indexes document image content, obtains its latent semantics, and defines relationships among images as hyperlinks. LinkDI was evaluated on document image repositories by comparing the quality of the relationships created among textual documents with those created among their respective document images. Results show the feasibility of LinkDI even when relating highly degraded OCR output.
In this paper we present a new clustering algorithm which extends traditional batch k-means by enabling the introduction of domain knowledge in the form of Must, Cannot, May and May-Not rules between the data points. In addition, we applied the presented method to the task of avoiding bias in clustering. The evaluation, carried out on standard collections, showed considerable improvements in effectiveness over previous constrained and non-constrained algorithms for this task.
The inclusion of document length factors has been a major topic in the development of retrieval models. We believe that current models can be further improved by more refined estimations of the document's scope. In this poster we present a new document length prior that uses the size of the compressed document. This new prior is introduced in the context of Language Modeling with Dirichlet smoothing. The evaluation performed on several collections shows significant improvements in effectiveness.
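A minimal sketch of the idea follows: a Dirichlet-smoothed query log-likelihood plus a log prior derived from the zlib-compressed size of the document, used as a rough proxy for its scope. The exact form of the prior, the smoothing of collection probabilities and the toy data are illustrative assumptions, not necessarily the paper's estimation.

```python
import math
import zlib
from collections import Counter

def score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Dirichlet-smoothed query log-likelihood plus a compressed-length
    log prior: log p(q|d) + log(len(zlib(d)))."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    dlen = len(tokens)
    ll = 0.0
    for q in query.lower().split():
        p_c = (collection_tf.get(q, 0) + 0.5) / collection_len  # smoothed
        ll += math.log((tf.get(q, 0) + mu * p_c) / (dlen + mu))
    prior = math.log(len(zlib.compress(doc.encode("utf-8"))))
    return ll + prior

docs = ["retrieval models and smoothing", "cats cats cats cats"]
ctf = Counter(t for d in docs for t in d.lower().split())
clen = sum(ctf.values())
s = [score("retrieval smoothing", d, ctf, clen) for d in docs]
```

Repetitive, low-content documents compress well and therefore receive a smaller prior than documents with broader scope.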
This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by trimming down the document size using fingerprinting methods. A thorough evaluation was performed over three different collections considering four different metrics. The presented approach to document clustering achieved good effectiveness with considerable savings in memory space and computation time.
Traditional retrieval models based on term matching are not effective on collections of degraded documents (the output of OCR or ASR systems, for instance). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out on both the TREC Confusion Track and Legal Track collections, showing that the presented approach outperforms, in terms of effectiveness, the classical term-centred approach and most of the participant systems in the TREC Confusion Track.
We present an approach to document clustering based on winnowing fingerprints that achieves good effectiveness with considerable savings in memory space and computation time.
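The winnowing step can be sketched as follows: hash every character k-gram and, within each sliding window of hashes, keep the rightmost minimum, so that each document is reduced to a small set of fingerprints. The parameter values and the use of Python's built-in `hash` are illustrative assumptions (a real system would use a stable rolling hash).

```python
def winnow(text, k=5, window=4):
    """Winnowing fingerprinting: hash k-grams of the normalised text and
    keep, for each window of `window` consecutive hashes, the rightmost
    minimum.  The selected (position, hash) pairs form the sketch."""
    s = "".join(text.lower().split())            # drop case and whitespace
    hashes = [hash(s[i:i + k]) for i in range(len(s) - k + 1)]
    fingerprints = set()
    for i in range(len(hashes) - window + 1):
        win = hashes[i:i + window]
        m = min(win)
        # rightmost occurrence of the window minimum
        pos = i + max(j for j, h in enumerate(win) if h == m)
        fingerprints.add((pos, m))
    return fingerprints

a = winnow("document clustering with fingerprints")
b = winnow("document clustering with fingerprints!")
overlap = len({h for _, h in a} & {h for _, h in b})
```

Near-duplicate documents share most fingerprints, so clustering can operate on these small sketches instead of the full texts.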
The amount of legal information is continuously growing. New legislative documents appear every day on the Web. Legal documents are produced on a daily basis in briefing format, containing changes to the current legislation, notifications, decisions, resolutions, etc. The scope of these documents includes countries, states, provinces and even city councils. This legal information is produced in a semi-structured format and distributed daily on official websites; however, the huge amount of published information makes it difficult for a user to find a specific issue. Lawyers are probably the most representative example, as they need to access these sources regularly. This motivates the need for legislative information search engines. Standard general web search engines return full documents (typically web pages) to the user, which can span hundreds of pages. As users expect only the relevant part of the document, techniques that recognise and extract these relevant bits of documents are needed to offer quick and effective results. In this paper we present a method to perform segmentation based on domain-specific lexicon information. Our method was tested with a manually tagged data set coming from different sources of Spanish legislative documents. Results show that this technique is suitable for the task, achieving 97.85% recall and 95.99% precision.
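A heavily simplified sketch of lexicon-driven segmentation: split a legislative text at occurrences of a domain marker such as "Artículo N". The single-marker lexicon and the regex are illustrative assumptions; the actual method uses a much richer domain-specific lexicon.

```python
import re

def segment_articles(text):
    """Split a Spanish legislative document into article-level segments,
    cutting just before each 'Artículo <number>' marker (a lookahead so
    the marker stays attached to its own segment)."""
    parts = re.split(r"(?=Art[ií]culo\s+\d+)", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("Disposiciones generales. Artículo 1. Objeto de la ley. "
       "Artículo 2. Ámbito de aplicación.")
segments = segment_articles(doc)
```

Each segment can then be indexed and returned on its own, so the user receives the relevant article instead of the whole bulletin.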
The Information Retrieval Lab is affiliated to the Department of Computer Science of the University of A Coruña (code G000494 in the University catalogue). The group has been researching basic issues of Information Retrieval for more than ten years.
Web information extraction, and in particular web news extraction, is an open research problem and a key component of news IR systems. Current techniques fall short on result quality, computational cost or the need for human intervention, all of them critical issues in a real system. We present an automated approach to news recognition and extraction based on a set of heuristics about article structure, which is currently applied in an operational system. We also built a data set to evaluate web news extraction methods. Our results on this collection of international news, composed of 4869 web pages from 15 different on-line sources, achieved 97% precision and 94% recall for the news recognition and extraction task.
The Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing is a project on which the Muste Group has been working since 2003 in the University of A Coruña (Spain). It has been designed as a tool for the study of language change in English scientific writing in general as well as within the different scientific disciplines. Its purpose is to facilitate investigation at all linguistic levels, though, in principle, phonology is not included among our intended research topics.
This poster presents an efficiency-oriented approach to summary generation for operational news retrieval systems, where summaries are appreciated by users. This work shows that relevant-sentence extraction techniques are suitable for this task because of the compressibility of the generated summaries and their low computational cost. To minimise the cost of summary construction at retrieval time, we propose an efficient storage of the summaries as sentence offsets inside the documents. At indexing time the user query is not available for selecting the relevant sentences, so the article's title was chosen to generate a title-biased summary, since titles are high-quality descriptions of the news. The sentence offsets were included in the direct file so that summaries can be reconstructed from this information at processing time. This strategy achieves a very large improvement in retrieval time with a very small increase in index size compared with query-biased summaries generated at retrieval time. As future work we will approach the evaluation of summary quality based on the DUC measures and the improvement of the relevance score formulas.
The Coruña Corpus of scientific writing will be used for the diachronic study of scientific discourse at most linguistic levels, thereby contributing to the study of the historical development of English. The Coruña Corpus Tool is an information retrieval system that allows the extraction of knowledge from the corpus.
Nowadays there are thousands of news sites available on-line, and traditional methods to access this huge news repository are overwhelmed. In this paper we present NowOnWeb, a news retrieval system that crawls articles from Internet publishers and provides news searching and browsing.
Agile access to the huge amount of information published by the thousands of news sites available on-line calls for the application of Information Retrieval techniques to this problem. The aim of this paper is to present NowOnWeb, a news retrieval system that obtains articles from different on-line sources and provides news searching and browsing. The main problems solved during the development of NowOnWeb were article recognition and extraction, redundancy detection and text summarization. For these problems we provide effective solutions which, put together, result in a system that satisfies, in a reasonable way, the daily information needs of the user.
I am part of the Computer Science Department of the University of A Coruña. In recent years I have taught several courses on Computer Science degrees, as well as specific courses on different topics and technologies.
Mandatory course for Software Engineering specialization (4th year) on the B.Sc. Eng. in Computer Science (OBL. EI 4º 2C SE).
Mandatory course (3rd year) on the B.Sc. Data Science and Engineering (OBL. CED 3º 2C).
Elective course (4th year) on the B.Sc. Data Science and Engineering (OPT. CED 4º 2C).
Mandatory course on the M.Sc. Eng. in Computer Science (OBL. MsC EI 1C).
Mandatory course for Information Systems specialization (3rd year) on the B.Sc. Eng. in Computer Science, elective course on the Information Technologies specialization (4th year) (OBL. EI 3º 2C IS/ OPT. EI 4º 2C IT).
Mandatory course for Software Engineering specialization (4th year) on the B.Sc. Eng. in Computer Science, elective course on the Information Systems specialization (4th year) (OBL. EI 4º 1C SE/ OPT. EI 4º 1C IS).
Mandatory course for Software Engineering specialization (4th year) on the B.Sc. Eng. in Computer Science (OBL. EI 4º 2C SE).
B.Sc. Eng. in Computer Science Degree Projects (old plans) in the Software Engineering and Information Technologies specializations (3rd year).
Elective course on the M.Sc. Eng. and B.Sc. Eng. in Computer Science (old plans, in extinction).
Mandatory course (2nd year) on the B.Sc. Eng. in Computer Science.
Mandatory course (4th year) on the M.Sc. Eng. + B.Sc. Eng. in Computer Science (old plans, in extinction).
Elective course on the B.Sc. Eng. in Computer Science (old plans, in extinction).
Elective course on the B.Sc. Eng. in Computer Science (old plans, in extinction).
Mandatory course on the B.Sc. Eng. in Computer Science (old plans, in extinction).
Aula de Formación Informática
Aula de Formación Informática
Consejo Social UDC
Confederación de Empresarios de Ferrol
I would be happy to talk to you if you need my assistance in your research, or if your company needs help related to my research topics.
You can also find me at my office: S4.2 at the Facultad de Informática, Campus de Elviña.