Academic Positions

  • 2012-Present

    Assistant Professor

    University of A Coruña, Computer Science Faculty

  • 2009-2011

    María Barbeito, Predoctoral Grant

    Xunta de Galicia, University of A Coruña, Computer Science Faculty

  • 2006-2009

    Research Assistant & PhD student

    University of A Coruña, Computer Science Faculty

  • Summer 2005

    Software Engineer intern

    Igalia Software Engineering

Education & Training

  • Ph.D. 2013

    Ph.D. in Computer Science

    University of A Coruña, Computer Science Department

  • DEA 2008

    Advanced Studies Diploma in CS&AI

    University of A Coruña, Computer Science Department

  • M.Sc. Eng. + B.Sc. Eng. 2006

    Ingeniero en Informática

    University of A Coruña / University of the West of England, Faculty of Computer Science / Faculty of Environment and Technology

Honors, Awards and Grants

  • ACM RecSys 2012
    Best Short Paper Award
    With Alejandro Bellogín: Spectral clustering techniques have become one of the most popular families of clustering algorithms, mainly because of their simplicity and effectiveness. In this work, we make use of one of these techniques, Normalised Cut, to derive a cluster-based collaborative filtering algorithm which outperforms other standard state-of-the-art techniques in terms of ranking precision. We frame this technique as a method for neighbour selection, and we show its effectiveness when compared with other cluster-based methods. Furthermore, the performance of our method could be improved if standard similarity metrics (such as Pearson correlation) are also used when predicting the rating score.
  • 2009-2012
    María Barbeito Predoctoral Grant, Xunta de Galicia
    The María Barbeito Predoctoral Grant Program is a competitive call carried out annually by the Government of Galicia, Spain (Xunta de Galicia). The objective of the program is to provide the first step in the scientific career of young Galician researchers, with the final aim of integrating those researchers into the Galician R&D system after the defence of their Ph.D. thesis.

Co-authors and collaborators

Álvaro Barreiro

University of A Coruña

web

David E. Losada

University of Santiago de Compostela

web

Alejandro Bellogín

Centrum Wiskunde & Informatica / Universidad Autónoma de Madrid

web

Fabio Crestani

University of Lugano

web

Mark Carman

Monash University

web

Giacomo Inches

University of Lugano

web

Shima Gerani

University of Lugano

web

Parvaz Mahdabi

University of Lugano

web

Mostafa Keikha

University of Massachusetts Amherst

web

Pablo Castells

Universidad Autónoma de Madrid

web

Renato de Freitas

Federal University of Goias

web

Alessandra Alaniz

Universidade de São Paulo

web

Roi Blanco

Yahoo! Research

web

Ronald T. Fernández

University of Santiago de Compostela

web

José M Chenlo

University of Santiago de Compostela

web

M. Eduardo Ares

University of A Coruña

web

José Santos

University of A Coruña

web

Daniel Valcarce

University of A Coruña

web

People with whom I have worked.

Most of my work would not be possible without the help of many group members and external collaborators.

Research Projects

  • TIN2015-64282-R

    Probabilistic Personalized Information Access Systems

    Recommender Systems (RecSys) aim, given a set of users, a set of items and a set of users' ratings of items, to generate personalised item recommendations for users. Traditionally, RecSys can exploit information both from the past interaction of users and products and from the content of the items to generate new suggestions for users. These systems have proven key to facilitating access to information, products and services. Specifically, it is estimated that a significant percentage of e-commerce transactions are motivated by recommendations: for example, Amazon sales increased by 29% after integrating a recommendation engine.

    The Spanish Strategy for Science, Technology and Innovation 2013-2020 establishes the need to provide Spanish companies with innovative models to increase the efficiency and competitiveness of their processes for commercialising new products and services. Given the migration of our economy in the context of the digital society, these models must necessarily provide innovative technological solutions that will transform the way we do business, the sales channels and the mechanisms of relationship with the consumer. Given these objectives and challenges, Recommendation Systems for products and services play a central role. Thus, in this project, we want to advance the state of the art, proposing new models of recommendation that, with a solid formal probabilistic basis, may help increase sales and improve products and the satisfaction of buyers. These models and their translation into domains and instances of actual use in the business community contribute, through the quality of their recommendations, to the development of the digital economy.

    A booming research area is the translation of classic Information Retrieval approaches to the Recommendation task. In particular, in this research project, we propose the use of probabilistic Language Models for the item recommendation task. Recently, we developed the first formalisations, obtaining high effectiveness figures. Given the previous positive experience, we want to extend the predictability of these models beyond the collaborative filtering approach, considering new estimates and models that include and integrate different content information, capturing contextual and temporal aspects. Furthermore, we propose the integration of Bayesian optimization techniques to develop models that not only generate tailored product suggestions but also generate them in a personalised manner, adapting the recommendation models to the particularities of the users. All these objectives are constrained by a common, transversal core objective: the efficiency, scalability and robustness of such methods in relation to their translation into real applications in the productive sector.

  • TIN2012-33867

    Information Retrieval & Sentiment Analysis in Social Web

    In many application domains, there is a growing need to exploit opinions expressed by people on the Web. Decision making processes in companies and organizations can be potentially enriched with software tools able to monitor the voice of the people about products/services, and able to estimate customer satisfaction. Similarly, governments could promptly obtain the response of the citizens to political actions and, in broad terms, opinion-rich information can help in political decision making processes. On the other hand, users can take into account the opinions of others about products, services or any other issue that affects their information needs in public, private and professional domains.

    In recent years, several research advances have been made in Web Information Retrieval (IR) and in the field of Opinion Mining and Sentiment Analysis. Analyzing and exploiting opinions from the web presents new challenges and needs techniques radically different from those of relevance-based retrieval, which is typical of web search. However, it is known that, for a sentiment mining and analysis system to be useful, effective topic retrieval should be available. This project proposes a number of complementary research lines in order to improve web retrieval and sentiment analysis. We will conduct research into models and techniques that have recently yielded promising results: improving pseudo-relevance feedback using three specific techniques (cluster-based pseudo-relevance feedback, adaptive pseudo feedback and selective pseudo feedback); improving traditional techniques to detect opinions and to estimate polarity with new models of sentiment flow; and improving feature mining methods to associate opinions with aspects or properties of the reviewed objects. The team that proposes this project has experience in these research topics and has already made several contributions in these areas.

  • TIN2008-06566-C04-04

    Information Retrieval on different media based on multidimensional models: Relevance, novelty, personalization and context.

    There is a growing realisation that relevant information will be accessible increasingly across media, across contexts and across modalities. The retrieval of such information will depend on factors such as time, place, and history of interaction, task in hand, current user interests, etc. To achieve this, Information Retrieval (IR) models that go beyond the usual relevance-oriented approach will need to be developed so that they can be deployed effectively to enhance retrieval performance. This is important to meet the information access demands of today's users. As a matter of fact, the growing need to deliver information on request in a form that can be readily and easily digested continues to be a challenge.

    In this coordinated project with the University of Granada and Universidad Autónoma de Madrid, we tackled the IR problem from a multidimensional perspective. Besides the dimension of relevance, we studied how to endow the systems with advanced capabilities for novelty detection, redundancy filtering, subtopic detection, personalization and context-based retrieval. These dimensions have been considered not only for the basic retrieval task but also for other tasks such as automatic summarization, document clustering and categorization. This research therefore tried to open new ways to improve the quality of access to sources of information.

  • 07SIN005206PR

    Improving news retrieval and access to financial information: web news retrieval

    The objective of this project was to improve NowOnWeb, an R&D platform for web-news retrieval developed by the IRLab. The tasks of the project were centred, on the one hand, on efficiency improvements for faster indexing, more scalable query processing, and crawling engine improvements; and, on the other hand, on effectiveness improvements in terms of news relevance, construction of better summaries and exploitation of the query logs.

  • TIN2005-08521-C02-02

    Retrieval of relevant and novel sentences using IR models and techniques

    The aim of this project is to improve the performance of systems for sentence retrieval and novelty. This task, located in the field of Information Retrieval (IR), is a step forward from the basic problem of document retrieval. Given a user query which retrieves an ordered set of documents, this set is processed to identify those sentences which are relevant to the query. This selection of sentences has to be done avoiding redundant material. The task defined in this way has been recently introduced in the field of IR (called “novelty task”) and it is highly related to other IR problems. Hence, lessons learned in novelty are potentially beneficial in other IR subareas. Moreover, the state of the art in novelty clearly shows the need for more research efforts in sentence retrieval and novelty detection. In this respect, several formalisms and tools which have been successfully applied in other IR problems are especially promising for novelty. This project will address the application of Language Models, fuzzy quantification and dimensionality reduction for sentence retrieval and novelty detection. We strongly believe that the variety of approaches taken is a good starting point for improving the effectiveness of the novelty task. Furthermore, this facilitates cross-fertilization between these research lines, which is an added value for this project. Across this document we will provide evidence on the adequacy of these models and techniques for novelty. The implementation of the proposals derived from this project will be based on research tools and platforms available for experimentation along with the development of our own code. The evaluation will be conducted with standard benchmarks and using the methodology of the field of IR.

Publications

Item-based relevance modelling of recommendations for getting rid of long tail products

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Journal Knowledge-Based Systems, vol. 103, pp. 41-51, 2016 | ISSN: 0950-7051

Abstract

Recommender systems are a growing research field due to their immense potential for helping users to select products and services. Recommenders are useful in a broad range of domains such as films, music, books, restaurants, hotels, social networks, news, etc. Traditionally, recommenders tend to promote certain products or services of a company that are fairly popular among the communities of users. An important research concern is how to formulate recommender systems centred on those items that are not very popular: the long tail products. A special case of those items are the ones that are the product of overstocking by the vendor. Overstock, that is, the excess of inventory, is a source of revenue loss. In this paper, we propose that recommender systems can be used to liquidate long tail products, maximising the business profit. First, we propose a formalisation for this task with the corresponding evaluation methodology and datasets. Then, we design a specially tailored algorithm, based on item relevance models, centred on getting rid of those unpopular products. Comparison with existing proposals demonstrates that the advocated method is a significantly better algorithm for this task than other state-of-the-art techniques.

Efficient Pseudo-Relevance Feedback Methods for Collaborative Filtering Recommendation

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 38th European Conference on Information Retrieval, ECIR 2016, Padova, Italy, 20-23 March, 2016, Lecture Notes in Computer Science vol. 9626, pp 602-613 | ISBN: 978-3-319-30670-4

Abstract

Recently, Relevance-Based Language Models have been demonstrated as an effective Collaborative Filtering approach. Nevertheless, this family of Pseudo-Relevance Feedback techniques is computationally expensive for applying them to web-scale data. Also, they require the use of smoothing methods which need to be tuned. These facts lead us to study other similar techniques with better trade-offs between effectiveness and efficiency. Specifically, in this paper, we analyse the applicability to the recommendation task of four well-known query expansion techniques with multiple probability estimates. Moreover, we analyse the effect of neighbourhood length and devise a new probability estimate that takes into account this property yielding better recommendation rankings. Finally, we find that the proposed algorithms are dramatically faster than those based on Relevance-Based Language Models, they do not have any parameter to tune (apart from the ones of the neighbourhood) and they provide a better trade-off between accuracy and diversity/novelty.

Language Models for Collaborative Filtering Neighbourhoods

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 38th European Conference on Information Retrieval, ECIR 2016, Padova, Italy, 20-23 March, 2016, Lecture Notes in Computer Science vol. 9626, pp 614-625 | ISBN: 978-3-319-30670-4

Abstract

Language Models are state-of-the-art methods in Information Retrieval. Their sound statistical foundation and high effectiveness in several retrieval tasks are key to their current success. In this paper, we explore how to apply these models to deal with the task of computing user or item neighbourhoods in a collaborative filtering scenario. Our experiments showed that this approach is superior to other neighbourhood strategies and also very efficient. Our proposal, in conjunction with a simple neighbourhood-based recommender, showed a great performance compared to state-of-the-art methods (NNCosNgbr and PureSVD) while its computational complexity is low.

Feeling Lucky? Multi-armed Bandits for Ordering Judgements in Pooling-based Evaluation

David E. Losada, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 31st ACM Symposium on Applied Computing, SAC 2016, Pisa, Italy, 4-8 April, 2016, pp 1027-1034 | ISBN: 978-1-4503-3739-7

Abstract

Evaluation is crucial in Information Retrieval. The Cranfield paradigm allows reproducible system evaluation by fostering the construction of standard and reusable benchmarks. Each benchmark or test collection comprises a set of queries, a collection of documents and a set of relevance judgements. Relevance judgements are often done by humans and thus expensive to obtain. Consequently, relevance judgements are customarily incomplete. Only a subset of the collection, the pool, is judged for relevance. In TREC-like campaigns, the pool is formed by the top retrieved documents supplied by systems participating in a certain evaluation task. With multiple retrieval systems contributing to the pool, an exploration/exploitation trade-off arises naturally. Exploiting effective systems could find more relevant documents, but exploring weaker systems might also be valuable for the overall judgement process. In this paper, we cast document judging as a multi-armed bandit problem. This formal modelling leads to theoretically grounded adjudication strategies that improve over the state of the art. We show that simple instantiations of multi-armed bandit models are superior to all previous adjudication strategies.

Additive Smoothing for Relevance-Based Language Modelling of Recommender Systems

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 4th Spanish Conference on Information Retrieval, CERI 2016, Article 9, Granada, Spain, 14-16 June, 2016 | ISBN: 978-1-4503-4141-7

Abstract

The use of Relevance-Based Language Models for top-N recommendation has become a promising line of research. Previous works have used collection-based smoothing methods for this task. However, a recent analysis of RM1 (an estimation of Relevance-Based Language Models) in document retrieval showed that this type of smoothing method demotes the IDF effect in pseudo-relevance feedback. In this paper, we claim that the IDF effect from retrieval is closely related to the concept of novelty in recommendation. We perform an axiomatic analysis of the IDF effect on RM2, concluding that this kind of smoothing method also demotes the IDF effect in recommendation. By axiomatic analysis, we find that a collection-agnostic method, Additive smoothing, does not demote this property. Our experiments confirm that this alternative improves the accuracy, novelty and diversity figures of the recommendations.

Injecting Multiple Psychological Features into Standard Text Summarisers

David E. Losada, Javier Parapar
Conference Proceedings of the 4th Spanish Conference on Information Retrieval, CERI 2016, Article 1, Granada, Spain, 14-16 June, 2016 | ISBN: 978-1-4503-4141-7

Abstract

Automatic Text Summarisation is an essential technology to cope with the overwhelming amount of documents that are generated daily. Given an information source, such as a webpage or a news article, text summarisation consists of extracting content from it and presenting it in a condensed form for human consumption. Summaries are crucial to facilitate information access. The reader is provided with the key information in a concise and fluent way. This speeds up navigation through large repositories of data. With the rapid growth of online content, creating manual summaries is not an option. Extractive summarisation methods are based on selecting the most important sentences from the input. To meet this aim, a ranking of candidate sentences is often built from a reduced set of sentence features. In this paper, we show that many features derived from psychological studies are valuable for constructing extractive summaries. These features encode psychological aspects of communication and are a good guide for selecting salient sentences. We use Quantitative Text Analysis tools for extracting these features and inject them into state-of-the-art extractive summarisers. Incorporating these novel components into existing extractive summarisers requires combining and weighting a large number of sentence features. In this respect, we show that Particle Swarm Optimisation is a viable approach to set the feature weights. Following standard evaluation practice (DUC benchmarks), we also demonstrate that our novel summarisers are highly competitive.

Computing Neighbourhoods with Language Models in a Collaborative Filtering Scenario

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Workshop Proceedings of the 7th Italian Information Retrieval Workshop, IIR 2016, pp. x-y, Venice, Italy, 30 - 31 May, 2016

Abstract

Language models represent a successful framework for many Information Retrieval tasks: ad hoc retrieval, pseudo-relevance feedback and expert finding are some examples. We present how language models can effectively compute user or item neighbourhoods in a collaborative filtering scenario (this idea was originally proposed in ECIR 2016). The experiments support the applicability of this approach for neighbourhood-based recommendation, surpassing the rest of the baselines. Additionally, the computational cost of this approach is small since language models have been efficiently applied to large-scale retrieval tasks such as web search.

A Study of Priors for Relevance-Based Language Modelling of Recommender Systems

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, 16-20 September, 2015, pp 237-240 | ISBN: 978-1-4503-3692-5

Abstract

Probabilistic modelling of recommender systems naturally introduces the concept of prior probability into the recommendation task. Relevance-Based Language Models, a principled probabilistic query expansion technique in Information Retrieval, have been recently adapted to the item recommendation task with success. In this paper, we study the effect of the item and user prior probabilities under that framework. We adapt two priors from the document retrieval field and then we propose two other new probabilistic priors. Evidence gathered from experimentation indicates that a linear prior for the neighbour and a probabilistic prior based on Dirichlet smoothing for the items improve the quality of the item recommendation ranking.

A Study of Smoothing Methods for Relevance-Based Language Modelling of Recommender Systems

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 37th European Conference on Information Retrieval, ECIR 2015, Vienna, Austria, 29 March - 2 April, 2015, Lecture Notes in Computer Science vol. 9022, pp 346-351 | ISBN: 978-3-319-16353-6

Abstract

Language Models have been traditionally used in several fields like speech recognition or document retrieval. Only recently was their use extended to collaborative Recommender Systems. In this field, a Language Model is estimated for each user based on the probabilities of the items. A central issue in the estimation of such a Language Model is smoothing, i.e., how to adjust the maximum likelihood estimator to compensate for rating sparsity. This work is devoted to exploring how the classical smoothing approaches (Absolute Discounting, Jelinek-Mercer and Dirichlet priors) perform in the field of Recommender Systems. We tested the different methods under the recently presented Relevance-Based Language Models for collaborative filtering, and compared how the smoothing techniques behave in terms of precision and stability. We found that Absolute Discounting is practically insensitive to the parameter value, making it an almost parameter-free method, and, at the same time, its performance is similar to that of Jelinek-Mercer and Dirichlet priors.

Finding a Needle in the Blogosphere: An Information Fusion Approach for Blog Distillation Search

Jose M. Chenlo, Javier Parapar, David E. Losada, Jose Santos
Journal Information Fusion, vol. 23, pp. 58-68, 2015 | ISSN: 1566-2535

Abstract

In the blogosphere, different actors express their opinions about multiple topics. Users, companies or editors socially interact by commenting, recommending and linking blogs and posts. This social media content is growing rapidly. As a matter of fact, the size of the blogosphere is estimated to double every six months. In this context, the problem of finding a topically relevant blog to subscribe to becomes a Big Data challenge. Moreover, combining multiple types of evidence is essential for this search task. In this paper we propose a group of textual and social-based signals, and apply different Information Fusion algorithms for a Blog Distillation Search task. Information fusion through the combination of the different types of evidence requires optimisation for appropriately weighting each source of evidence. To this end, we analyse well-established population-based search methods, namely global search (Particle Swarm Optimisation and Differential Evolution) and a local search method (Line Search) that has been effective in various Information Retrieval tasks. Moreover, we propose hybrid combinations between the global search and the local search method and compare all the alternatives following a standard methodology. Efficiency is an imperative here and, therefore, we focus not only on achieving high search effectiveness but also on designing efficient solutions.

Score Distributions for Pseudo Relevance Feedback

Javier Parapar, Manuel A Presedo-Quindimil, Álvaro Barreiro
Journal Information Sciences, vol. 273, pp. 171-181, 2014 | ISSN: 0020-0255

Abstract

Relevance-Based Language Models, commonly known as Relevance Models, are successful approaches to explicitly introduce the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models achieve state-of-the-art retrieval performance in the Pseudo Relevance Feedback task. It is known that one of the factors that most affects the robustness of Pseudo Relevance Feedback is the selection, for some queries, of harmful expansion terms. To minimise this effect in these methods, a crucial point is to reduce the number of non-relevant documents in the pseudo-relevant set. In this paper, we propose an original approach to tackle this problem. We try to automatically determine, for each query, how many documents we should select as the pseudo-relevant set. To achieve this objective, we study the score distributions of the initial retrieval and try to discern, on the basis of their distribution, between relevant and non-relevant documents. Evaluation of our proposal showed important improvements in terms of robustness.

Combining Psycho-linguistic, Content-based and Chat-based Features to Detect Predation in Chatrooms

Javier Parapar, David E. Losada, Álvaro Barreiro
Journal Journal of Universal Computer Science, vol. 20, issue 2, pp. 213-239, 2014 | ISSN: 0948-695X

Abstract

The Digital Age has brought great benefits for the human race but also some drawbacks. Nowadays, people from opposite corners of the world can communicate online via instant messaging services. Unfortunately, this has introduced new kinds of crime. Sexual predators have adapted their predatory strategies to these platforms and, usually, the target victims are kids. The authorities cannot manually track all threats because massive amounts of online conversations take place on a daily basis. Automatic methods for alerting about these crimes need to be designed. This is the main motivation of this paper, where we present a Machine Learning approach to identify suspicious subjects in chat-rooms. We propose novel types of features for representing the chatters and we evaluate different classifiers against the largest benchmark available. This empirical validation shows that our approach is promising for the identification of predatory behaviour. Furthermore, we carefully analyse the characteristics of the learnt classifiers. This preliminary analysis is a first step towards profiling the behaviour of sexual predators when chatting on the Internet.

When Recommenders Met Big Data: An Architectural Proposal and Evaluation

Daniel Valcarce, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 3rd Spanish Conference on Information Retrieval, CERI'14, A Coruña, Spain, June 19-20, pp. 73-84, 2014 | ISBN: 978-84-9749-591-2

Abstract

Nowadays, scalability is a critical factor in the design of any system working with big data. In particular, it has been recognised as a main challenge in the construction of recommender systems. In this paper, we present a recommender architecture capable of making personalised recommendations using collaborative filtering in a big data environment. We aim to build highly scalable systems without any single point of failure. Replication and data distribution as well as caching techniques are used to achieve this goal. We suggest specific technologies for each subsystem of our proposed architecture considering scalability and fault tolerance. Furthermore, we evaluate the performance under realistic scenarios of different alternatives (RDBMS and NoSQL) for storing, generating and serving recommendations.

Relevance-Based Language Modelling for Recommender Systems

Javier Parapar, Alejandro Bellogín, Pablo Castells, Álvaro Barreiro
Journal Information Processing and Management, vol. 49, issue 4, pp. 966-980, 2013 | ISSN: 0306-4573

Abstract

Relevance-Based Language Models, commonly known as Relevance Models, are successful approaches to explicitly introduce the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models achieve state-of-the-art retrieval performance in the pseudo relevance feedback task. On the other hand, the field of recommender systems is a fertile research area where users are provided with personalised recommendations in several applications. In this paper, we propose an adaptation of the Relevance Modelling framework to effectively suggest recommendations to a user. We also propose a probabilistic clustering technique to perform the neighbour selection process as a way to achieve a better approximation of the set of relevant items in the pseudo relevance feedback process. These techniques, although well known in the Information Retrieval field, have not been applied yet to recommender systems, and, as the empirical evaluation results show, both proposals outperform individually several baseline methods. Furthermore, by combining both approaches even larger effectiveness improvements are achieved.

Probabilistic Collaborative Filtering with Negative Cross Entropy

Alejandro Bellogín, Javier Parapar, Pablo Castells
Conference Proceedings of the 7th ACM Conference on Recommender Systems, pp. 387-390, ACM RecSys 2013, Hong Kong, October 2013 | ISBN: 978-1-4503-2409-0

Abstract

Relevance-Based Language Models are an effective IR approach which explicitly introduces the concept of relevance in the statistical Language Modelling framework of Information Retrieval. These models have been shown to achieve state-of-the-art retrieval performance in the pseudo relevance feedback task. In this paper we propose a novel adaptation of this language modelling approach to rating-based Collaborative Filtering. In a memory-based approach, we apply the model to the formation of user neighbourhoods, and the generation of recommendations based on such neighbourhoods. We report experimental results where our method outperforms other standard memory-based algorithms in terms of ranking precision.

Comments-Oriented Query Expansion for Opinion Retrieval in Blogs

José M. G. Chenlo, Javier Parapar, David E. Losada
Conference Proceedings of the 15th Conference of the Spanish Association for Artificial Intelligence, CAEPIA 2013, Madrid, Spain, September 2013, Lecture Notes in Computer Science vol. 8109, pp. 32-41 | ISBN: 978-3-642-40642-3

Abstract

In recent years, Pseudo Relevance Feedback techniques have become one of the most effective query expansion approaches for document retrieval. In particular, Relevance-Based Language Models have been applied in several domains as an effective and efficient way to enhance topic retrieval. Recently, some extensions to the original RM methods have been proposed to apply query expansion in other scenarios, such as opinion retrieval. Such approaches rely on mixture models that combine the query expansion provided by Relevance Models with opinionated terms obtained from external resources (e.g., opinion lexicons). However, these methods ignore the structural aspects of a document, which are valuable for extracting topic-dependent opinion expressions. For instance, the sentiments conveyed in blogs are often located in specific parts of the blog posts and their comments. We argue here that the comments are a good guide for finding on-topic opinion terms that help to move the query towards burning aspects of the topic. We study the role of the different parts of a blog document in enhancing blog opinion retrieval through query expansion. The proposed method does not require external resources or additional knowledge, and our experiments show that this is a promising and simple way to produce a more accurate ranking of blog posts in terms of their sentiment towards the query topic. Our approach compares well with other opinion finding methods, obtaining high precision performance without harming mean average precision.

Language Modelling of Constraints for Text Clustering

Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 34th European Conference on Information Retrieval Research, ECIR 2012, Barcelona, Spain, April 2012, Lecture Notes in Computer Science vol. 7224, pp. 352-363 | ISBN: 978-3-642-28996-5

Abstract

Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints over the clustering output. Those constraints (typically pair-wise constraints between documents) are usually introduced by designing new clustering algorithms that enforce them. In this paper we present an alternative approach for constrained clustering where, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation by means of language modelling. More precisely, the constraints are modelled using the well-known Relevance Models, successfully used in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge, this is the first attempt at such an approach. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.

Using Graph Partitioning Techniques for Neighbour Selection in User-Based Collaborative Filtering

Alejandro Bellogín, Javier Parapar
Conference Proceedings of the 6th ACM Conference on Recommender Systems, pp. 213-216, ACM RecSys 2012, Dublin, Ireland, September 2012 | ISBN: 978-1-4503-1270-7

Abstract

Spectral clustering techniques have become one of the most popular clustering algorithms, mainly because of their simplicity and effectiveness. In this work, we make use of one of these techniques, Normalised Cut, in order to derive a cluster-based collaborative filtering algorithm which outperforms other standard techniques in the state-of-the-art in terms of ranking precision. We frame this technique as a method for neighbour selection, and we show its effectiveness when compared with other cluster-based methods. Furthermore, the performance of our method could be improved if standard similarity metrics -- such as Pearson's correlation -- are also used when predicting the user's preferences.

A learning-based approach for the identification of sexual predators in chat logs

Javier Parapar, David E. Losada, Álvaro Barreiro
Working Notes Proceedings of the CLEF 2012 Evaluation Labs and Workshop Online Working Notes. PAN 2012, Rome, Italy, September 2012 | ISBN: 978-88-904810-3-1

Abstract

The existence of sexual predators who enter chat rooms or forums and try to convince children to provide some sexual favour is a socially worrying issue. Manually monitoring these interactions is one way to attack this problem. However, this manual approach simply cannot keep pace because of the high number of conversations and the huge number of chat rooms or forums where these conversations take place daily. We need tools that automatically process massive amounts of conversations and alert about possible offences. The sexual predator identification challenge within PAN 2012 is a valuable way to promote research in this area. Our team approached this task as a Machine Learning problem and we designed several innovative sets of features that guide the construction of classifiers for identifying sexual predation. Our methods are driven by psycholinguistic, chat-based, and tf/idf features and yield very effective classifiers.

An Experimental Study of Constrained Clustering Effectiveness in Presence of Erroneous Constraints

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Journal Information Processing and Management, vol. 48, issue 3, pp. 537-551, 2012 | ISSN: 0306-4573

Abstract

Recently a new kind of semi-supervised clustering algorithm, known as constrained clustering, has emerged. These new algorithms can incorporate some a priori domain knowledge into the clustering process, allowing the user to guide the method. The vast majority of studies of the effectiveness of these approaches have been performed using information, in the form of constraints, which was totally accurate. This would be the ideal case, but such a situation will be impossible in most realistic settings, due to errors in the constraint creation process, misjudgements of the user, inconsistent information, etc. Hence, the robustness of constrained clustering algorithms when dealing with erroneous constraints is bound to play an important role in their final effectiveness. In this paper we study the behaviour of four constrained clustering algorithms (Constrained k-Means, Soft Constrained k-Means, Constrained Normalised Cut and Normalised Cut with Imposed Constraints) when not all the information supplied to them is accurate. The experimentation over text and numeric datasets using two different noise models, one of them an original approach based on similarities, highlighted the strengths and weaknesses of each method when working with positive and negative constraints, indicating the scenarios in which each algorithm is more appropriate.

Improving the Extraction of Text in PDFs by Simulating the Human Reading Order

Ismael Hasan, Javier Parapar, Álvaro Barreiro
Journal Journal of Universal Computer Science, vol. 18, issue 5, pp. 623-649, 2012 | ISSN: 0948-695X

Abstract

Text preprocessing and segmentation are critical tasks in search and text mining applications. Due to the huge number of documents that are exclusively presented in PDF format, most Data Mining (DM) and Information Retrieval (IR) systems must extract content from PDF files. On some occasions this is a difficult task: the result of the extraction process from a PDF file is plain text, and it should be returned in the same order as a human would read the original PDF file. However, current tools for PDF text extraction fail to meet this objective when working with complex documents with multiple columns. For instance, this is the case of official government bulletins with legal information. In this task, it is mandatory to get correct and ordered text as a result of the application of the PDF extractor. It is very common for a legal article in a document to refer to a previous article, so they should be offered in the right sequential order. To overcome these difficulties we have designed a new method for extraction of text in PDFs that simulates the human reading order. We evaluated our method and compared it against other PDF extraction tools and algorithms. Evaluation of our approach shows that it significantly outperforms the existing tools and algorithms.

Finding the Best Parameter Setting: Particle Swarm Optimisation

Javier Parapar, María M. Vidal, José Santos
Conference Proceedings of the 2nd Spanish Conference on Information Retrieval, CERI'12, Valencia, Spain, June 18-19, pp. 49-60, 2012 | ISBN: 978-84-8021-860-3

Abstract

Information Retrieval techniques traditionally depend on the setting of one or more parameters. Depending on the problem and the techniques, the number of parameters can be one, two or even dozens. One crucial problem in Information Retrieval research is to achieve a good parameter setting for its methods. The tuning process, when dealing with several parameters, is a time-consuming and critical step. In this paper we introduce the use of Particle Swarm Optimisation for the automatic tuning of the parameters of Information Retrieval methods. We compare our proposal with the Line Search method, previously adopted in Information Retrieval. The comparison shows that our approach is faster and achieves better results than Line Search. Furthermore, Particle Swarm Optimisation algorithms are suitable for parallelisation, improving the algorithm behaviour in terms of convergence time.

Análisis de herramientas para la docencia práctica en Recuperación de Información

Javier Parapar, Álvaro Barreiro
Book Chapter FECIES 2012 | ISBN: 978-84-695-6734-0

Chapter in the book of the 9th International Forum on the Quality Assessment of Higher Education and Research (FECIES 2012)

Loreto Del Río Bermúdez, Inmaculada Teva Álvarez (eds.)

The adaptation of Computer Engineering degrees to the European Higher Education Area (EHEA) has brought, on the one hand, a renewal of the courses on offer and, on the other, a change in the established teaching paradigm. In particular, the Faculty of Computer Science of the University of A Coruña has introduced the subject of Information Retrieval into its curricula. Information Retrieval is, by now, a mature and established subject within computer science. The University of A Coruña, which since its founding has been a reference in computing within the autonomous community of Galicia, has included it as a core subject in its new study plans. Specifically, in the Degree in Computer Engineering of the University of A Coruña, the Information Retrieval course belongs to the Computing track and carries 6 credits. In the recently proposed Master in Computer Engineering, the course on Information Retrieval and the Semantic Web also carries 6 ECTS credits. Much of the teaching associated with these new subjects will be practical, since these are Engineering degrees. In this scenario there is therefore a fundamental need for suitable tools adapted to the new educational paradigm in which, in keeping with the spirit of the EHEA, the student's autonomous work increases and the hours of instructor-led classroom teaching decrease. Our intention, in light of the new teaching and methodological situation, is thus to review the existing tools for the practical teaching of Information Retrieval, with special emphasis on the factors introduced by the constraints associated with the adaptation to the EHEA.
Specifically, in this work we analyse software tools considering several factors that are important for teaching, without aiming to be exhaustive: programming language, licence, community, documentation, support, available models, ease of evaluation, etc. Although some comparisons of software tools exist from the point of view of commercial or research use, in this document we consider it important to analyse the tools in terms of their suitability for teaching and learning. This work falls within the line of methodology and teaching resources in the framework of the adaptation to the EHEA, and we answer some important questions, such as: which tools are most suitable for students' autonomous work? Which tools are most suitable given the background students have acquired in the context of the University of A Coruña's study plans? Which tools will allow the instructor to put into practice the syllabus explained in lectures? Which tools will facilitate the continuous assessment of students?

A Cluster Based Pseudo Feedback Technique which Exploits Good and Bad Clusters

Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 14th Conference of the Spanish Association for Artificial Intelligence, CAEPIA 2011, 7-11 November 2011, Tenerife, Spain, Lecture Notes in Artificial Intelligence vol. 7023, pp. 403-412 | ISBN: 978-3-642-25273-0

Abstract

In recent years, cluster based retrieval has been shown to be an effective tool for both interactive retrieval and pseudo relevance feedback techniques. In this paper we propose a new cluster based retrieval function which uses the best and worst clusters of a document in the cluster ranking to improve retrieval effectiveness. The evaluation shows improvements in precision and robustness over state-of-the-art techniques on some standard TREC collections.

Promoting Divergent Terms in the Estimation of Relevance Models

Javier Parapar, Álvaro Barreiro
Conference Proceedings of the Third International Conference on the Theory of Information Retrieval, ICTIR 2011, 12-14 September 2011, Bertinoro, Italy, Lecture Notes in Computer Science vol. 6931, pp. 77-88 | ISBN: 978-3-642-23317-3

Abstract

Traditionally, the use of pseudo relevance feedback (PRF) techniques for query expansion has proven very effective. In particular, the use of Relevance Models (RM) in the context of the Language Modelling framework has been established as a high-performance approach to beat. In this paper we present an alternative estimation for the RM that promotes terms which, while present in the relevance set, are also distant from the language model of the collection. We compared this approach with RM3 and with an adaptation to the Language Modelling framework of Rocchio's KLD-based term ranking function. The evaluation showed that this alternative estimation of RM yields consistently better results than RM3 and is, on average, the most stable across collections in terms of robustness.

Improving text clustering with social tagging

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media ICWSM 2011, 17-21 July 2011, Barcelona, pp. 430-433 | ISBN: 978-1-57735-505-2

Abstract

In this paper we study the use of social bookmarking to improve the quality of text clustering. Recently, constrained clustering algorithms have been presented as a successful tool to introduce domain knowledge in the clustering process. This paper uses the tags saved by the users of Delicious to generate non-artificial constraints for constrained clustering algorithms. The study demonstrates that it is possible to achieve a high percentage of good constraints with this simple approach and, more importantly, the evaluation shows that the use of these constraints produces a great improvement (up to 91.25%) in the effectiveness of the clustering algorithms.

The Use of Latent Semantic Indexing to Mitigate OCR Effects of Related Document Images

Renato de Freitas Bulcão-Neto, José Antonio Camacho-Guerrero, Márcio Dutra, Álvaro Barreiro, Javier Parapar, Alessandra Alaniz Macedo
Journal Journal of Universal Computer Science, vol. 17, issue 1, pp. 64-80, 2011 | ISSN: 0948-695X


Agrupamiento documental

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Book Chapter RA-MA, September 2011 | ISBN: 978-84-9964-112-6

Recuperación de Información. Un enfoque práctico y multidisciplinar

F. Cacheda Seijo, J.M. Fernández-Luna and J. Huete (eds.)

This book arises from the need for material that, with an eminently didactic approach, provides an overview of the discipline of Information Retrieval, ranging from its foundations to current research proposals. The idea is to show the reader the inner workings of an area of knowledge whose advances translate directly into programs that we all use every day for a variety of everyday tasks. To achieve these goals, the book draws on the collaboration of a group of experts internationally recognised for their research in the field of Information Retrieval. Each of them has focused on the chapters whose topics they specialise in and know thoroughly. Moreover, the great majority of them have invaluable teaching experience in Information Retrieval courses, so their experience and knowledge in disseminating this discipline have been carried over directly into their chapters, and implicitly into the book as a whole.

University of Lugano at TREC 2010

Mostafa Keikha, Parvaz Mahdabi, Shima Gerani, Giacomo Inches, Javier Parapar, Mark Carman, Fabio Crestani
Working Notes Proceedings of the Nineteenth Text Retrieval Conference (TREC 2010), Gaithersburg, Maryland, November 16-19, 2010

Abstract

We report on the University of Lugano's participation in the Blog and Session tracks of TREC 2010. In particular we describe our system for performing blog distillation, faceted search, top stories identification and session reranking.

Blog Snippets: A Comments-Biased Approach

Javier Parapar, Jorge López-Castro, Álvaro Barreiro
Conference Proceedings of the 33rd ACM International Conference on Research and Development in Information Retrieval SIGIR'10, Geneva, Switzerland, July 19-23, pp. 711-712 | ISBN: 978-1-60558-896-4

Abstract

In recent years, Blog Search has been an exciting new task in Information Retrieval. The presence of user-generated information with valuable opinions makes this field of huge interest. In this poster we use part of this information, the readers' comments, to improve the quality of post snippets with the objective of enhancing user access to the relevant posts in a result list. We propose a simple method for snippet generation based on sentence selection, using the comments to guide the selection process. We evaluated our approach with standard TREC methodology on the Blogs06 collection, showing significant improvements of up to 32% in terms of MAP over the baseline.

Where to Start Filtering Redundancy? A Cluster-Based Approach

Ronald T. Fernández, Javier Parapar, David E. Losada, Álvaro Barreiro
Conference Proceedings of the 33rd ACM International Conference on Research and Development in Information Retrieval SIGIR'10, Geneva, Switzerland, July 19-23, pp. 735-736 | ISBN: 978-1-60558-896-4

Abstract

Novelty detection is a difficult task, particularly at sentence level. Most of the approaches proposed in the past consist of re-ordering all sentences according to their novelty scores. However, this re-ordering usually has little value. In fact, a naive baseline with no novelty detection capabilities often yields better performance than any state-of-the-art novelty detection mechanism. We argue here that this is because current methods initiate the novelty detection process too early. When few sentences have been seen, it is unlikely that the user is negatively affected by redundancy. Therefore, re-ordering the first sentences may be harmful in terms of performance. We propose here a query-dependent method based on cluster analysis to determine where we must start filtering redundancy.

Improving Alternative Text Clustering Quality in the Avoiding Bias Task with Spectral and Flat Partition Algorithms

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 21st International Conference on Database and Expert Systems Applications DEXA'10, Bilbao, Spain, August 30 - September 3 2010, Lecture Notes in Computer Science, vol. 6262, Part II, pp. 407-421, 2010 | ISBN: 978-3-642-15250-4

Abstract

The problems of finding alternative clusterings and avoiding bias have gained popularity in recent years. In this paper we focus on the quality of these alternative clusterings, proposing two approaches based on the use of negative constraints in conjunction with spectral clustering techniques. The first approach tries to introduce these constraints in the core of the constrained normalised cut clustering, while the second one combines spectral clustering and soft constrained k-means. The experiments performed on textual collections showed that the first method does not yield good results, whereas the second one attains large increases in the quality of the clustering results while keeping low similarity with the avoided grouping.

Blog Posts and Comments Extraction and Impact on Retrieval Effectiveness

Javier Parapar, Jorge López-Castro, Álvaro Barreiro
Conference Proceedings of the 1st Spanish Conference on Information Retrieval, CERI'10, Madrid, Spain, June 15-17, pp. 5-16, 2010 | ISBN: 978-84-693-2200-0

Abstract

This paper focuses on the extraction of certain parts of a blog, the post and the comments, presenting a technique based on the blog structure and its elements' attributes, exploiting similarities and conventions among different blog providers or Content Management Systems (CMS). The impact of the extraction process on retrieval tasks is also explored. Separate evaluation is performed for both goals: extraction is evaluated through human inspection of the results of the extraction technique over a sample of blogs, while retrieval performance is automatically evaluated through standard TREC methodology and the resources provided by the Blog Track. The results show important and significant improvements over a baseline which does not incorporate the extraction approach.

An Automatic Linking Service of Document Images Reducing the Effects of OCR Errors with Latent Semantics

Renato de Freitas Bulcão-Neto, José Antonio Camacho-Guerrero, Álvaro Barreiro, Javier Parapar, Alessandra Alaniz Macedo
Conference Proceedings of the 25th ACM Symposium On Applied Computing, ACM SAC 2010, pp. 13-17, Switzerland, March 22-26 2010 | ISBN: 978-1-60558-638-0

Abstract

Robust Information Retrieval (IR) systems are in demand due to the widespread and multipurpose use of document images and the high number of document image repositories available nowadays. This paper presents a novel approach to support the automatic generation of relationships among document images by exploiting Latent Semantic Indexing (LSI) and Optical Character Recognition (OCR). The LinkDI service extracts and indexes document image content, obtains its latent semantics, and defines relationships among images as hyperlinks. LinkDI was tested on document image repositories, and its performance was evaluated by comparing the quality of the relationships created among textual documents and among their respective document images. Results show the feasibility of using LinkDI to relate OCR output with high degradation.

Avoiding Bias in Text Clustering Using Constrained K-means and May-Not-Links

M. Eduardo Ares, Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 2nd International Conference on the Theory of Information Retrieval ICTIR 2009, Cambridge, UK, September 10-12, 2009, Lecture Notes in Computer Science vol. 5766, pp. 322-329, 2009 | ISBN: 978-3-642-04416-8

Abstract

In this paper we present a new clustering algorithm which extends the traditional batch k-means enabling the introduction of domain knowledge in the form of Must, Cannot, May and May-Not rules between the data points. Besides, we have applied the presented method to the task of avoiding bias in clustering. Evaluation carried out in standard collections showed considerable improvements in effectiveness against previous constrained and non-constrained algorithms for the given task.

Compression-based document length prior for language models

Javier Parapar, David E. Losada, Álvaro Barreiro
Conference Proceedings of the 32nd ACM International Conference on Research and Development in Information Retrieval SIGIR'09, pp. 652-653, Boston, July 19-23 2009 | ISBN: 978-1-60558-483-6

Abstract

The inclusion of document length factors has been a major topic in the development of retrieval models. We believe that current models can be further improved by more refined estimations of the document's scope. In this poster we present a new document length prior that uses the size of the compressed document. This new prior is introduced in the context of Language Modeling with Dirichlet smoothing. The evaluation performed on several collections shows significant improvements in effectiveness.

Evaluation of text clustering algorithms with n-gram-based document fingerprints

Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 31st European Conference on Information Retrieval Research ECIR 2009, Toulouse, France, April 2009, Lecture Notes in Computer Science vol. 5478, pp. 645-653, 2009 | ISBN: 978-3-642-00957-0

Abstract

This paper presents a new approach designed to reduce the computational load of existing clustering algorithms by trimming down document size using fingerprinting methods. A thorough evaluation was performed over three different collections and considering four different metrics. The presented approach to document clustering achieved good effectiveness with considerable savings in memory space and computation time.

Revisiting n-gram based models for retrieval in degraded large collections

Javier Parapar, Ana Freire, Álvaro Barreiro
Conference Proceedings of the 31st European Conference on Information Retrieval Research ECIR 2009, Toulouse, France, April 2009, Lecture Notes in Computer Science vol. 5478, pp. 680-684, 2009 | ISBN: 978-3-642-00957-0

Abstract

Traditional retrieval models based on term matching are not effective in collections of degraded documents (the output of OCR or ASR systems, for instance). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out with both the TREC Confusion Track and Legal Track collections, showing that the presented approach outperforms, in terms of effectiveness, the classical term-centred approach and most of the participant systems in the TREC Confusion Track.

Winnowing-based text clustering

Javier Parapar, Álvaro Barreiro
Conference Proceedings of the 17th ACM Conference on Information and Knowledge Management CIKM 2008, pp. 1353-1354, Napa Valley, California, October 2008 | ISBN: 978-1-59593-991-3

Abstract

We present an approach to document clustering based on winnowing fingerprints that achieved good effectiveness with considerable savings in memory space and computation time.

Segmentation of legislative documents using a domain-specific lexicon

Ismael Hasan, Javier Parapar, Roi Blanco
Conference DEXA WS 2008, IEEE Press Proceedings, pp. 665-669, Torino, Italy, September 2008 | ISBN: 978-0-7695-3299-8

Workshop

The amount of legal information is continuously growing. New legislative documents appear every day on the Web. Legal documents are produced on a daily basis in briefing format, containing changes in the current legislation, notifications, decisions, resolutions, etc. The scope of these documents includes countries, states, provinces and even city councils. This legal information is produced in a semi-structured format and distributed daily on official web sites; however, the huge amount of published information makes it difficult for a user to find a specific issue, lawyers being probably the most representative example, since they need to access these sources regularly. This motivates the need for legislative information search engines. Standard general web search engines return full documents (typically web pages) to the user, often hundreds of pages long. As users expect only the relevant part of the document, techniques that recognise and extract these relevant parts of documents are needed to offer quick and effective results. In this paper we present a method to perform segmentation based on domain-specific lexicon information. Our method was tested with a manually tagged data set coming from different sources of Spanish legislative documents. Results show that this technique is suitable for the task, achieving 97.85% recall and 95.99% precision.

The IRLab at the University of A Coruña

Javier Parapar, Álvaro Barreiro
Dissemination Notes BCS-IRSG Informer Vol 25, pp. 5-7 | ISSN: 0950-4974

Abstract

The Information Retrieval Lab is affiliated with the Department of Computer Science of the University of A Coruña (code G000494 in the University catalogue). The group has been researching basic issues of Information Retrieval for more than ten years.

An Effective and Efficient Web News Extraction Technique for an Operational NewsIR System

Javier Parapar, Álvaro Barreiro
Conference 12th Conference of the Spanish Association for Artificial Intelligence, CAEPIA - TTIA 2007. Salamanca, Spain. 12-16 November 2007. Proceedings Vol II. pp. 319-32 | ISBN: 978-84-611-8848-2

Abstract

Web information extraction, and web news extraction in particular, is an open research problem and a key point in NewsIR systems. Current techniques fall short in the quality of the results, the high computational cost or the need for human intervention, all of them critical issues in a real system. We present an automated approach to news recognition and extraction based on a set of heuristics about article structure, which is currently applied in an operational system. We also built a data set to evaluate web news extraction methods. Our results on this collection of international news, composed of 4869 web pages from 15 different on-line sources, reached 97% precision and 94% recall for the news recognition and extraction task.

Writing Science, Compiling Science: The Coruña Corpus of English Scientific Writing

Isabel Moskowich-Spiegel, Javier Parapar
Conference Proceedings from the 31st AEDEAN Conference (XXXI Congreso Internacional de la Asociación Española de Estudios Anglo-Norteamericanos), A Coruña, Spain. 14-17 November 2007, pp. 531-545 | ISBN: 978-84-9749-278-2

Abstract

The Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing is a project on which the Muste Group has been working since 2003 in the University of A Coruña (Spain). It has been designed as a tool for the study of language change in English scientific writing in general as well as within the different scientific disciplines. Its purpose is to facilitate investigation at all linguistic levels, though, in principle, phonology is not included among our intended research topics.

Generating News Summaries at Indexing Time

Javier Parapar
Symposium BCS IRSG Symposium: Future Directions in Information Access, FDIA 2007, Glasgow, Scotland. 28-29 August 2007 | ISSN: 1477-9358

Abstract

This poster presents an efficiency-oriented approach to summary generation for operational news retrieval systems, where users value summaries. We show that relevant sentence extraction techniques suit this task because of the compressibility of the generated summaries and their low computational cost. To minimize the cost of building summaries at retrieval time, we propose storing them efficiently as sentence offsets inside the documents. At indexing time the user query is not available to guide the selection of relevant sentences, so the article's title is used to generate a title-biased summary, since titles are high-quality descriptions of the news. The sentence offsets are included in the direct file, so summaries can be reconstructed from this information at processing time. This strategy greatly improves retrieval time, with a very small increase in index size, compared with query-biased summaries generated at retrieval time. As future work we will evaluate summary quality based on the DUC measurements and improve the relevance score formulas.
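
The idea in this abstract can be illustrated with a small sketch. It is not the implementation from the poster: the sentence splitter, the term-overlap score and all function names (`sentence_spans`, `title_biased_offsets`, `rebuild_summary`) are assumptions chosen for brevity. It only shows how a title-biased summary might be stored as sentence offsets at indexing time and rebuilt later by slicing the document.

```python
import re

def sentence_spans(text):
    """Return (start, end) character offsets of each sentence (naive splitter, an assumption)."""
    return [(m.start(), m.end())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)]

def tokens(s):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", s.lower()))

def title_biased_offsets(title, body, k=2):
    """Indexing time: keep the k sentences sharing the most terms with the title.

    Plain term overlap is a stand-in for whatever relevance formula is really used.
    Offsets are returned in document order.
    """
    title_terms = tokens(title)
    spans = sentence_spans(body)
    # sorted() is stable, so ties keep document order.
    ranked = sorted(
        spans,
        key=lambda sp: -len(title_terms & tokens(body[sp[0]:sp[1]])),
    )
    return sorted(ranked[:k])

def rebuild_summary(body, offsets):
    """Retrieval time: rebuild the summary from stored offsets, with no scoring."""
    return " ".join(body[start:end] for start, end in offsets)
```

Only the list of offset pairs needs to be stored next to the document, which is why the index grows very little, and rebuilding the summary is a handful of string slices.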

The Coruña Corpus Tool

Javier Parapar, Isabel Moskowich-Spiegel
Conference Presented in XXIII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural, SEPLN 2007. Sevilla, Spain, 10-12 September 2007. Published in Revista del Procesamiento de Lenguaje Natural Vol 39, pp. 289-290 | ISSN: 1135-5948

Abstract

The Coruña Corpus of scientific writing will be used for the diachronic study of scientific discourse at most linguistic levels and will thereby contribute to the study of the historical development of English. The Coruña Corpus Tool is an information retrieval system that allows the extraction of knowledge from the corpus.

NowOnWeb: A NewsIR System

Javier Parapar, Álvaro Barreiro
Conference Presented in XXIII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural, SEPLN 2007. Sevilla, Spain, 10-12 September 2007. Published in Revista del Procesamiento de Lenguaje Natural Vol 39, pp. 287-288 | ISSN: 1135-5948

Abstract

Nowadays there are thousands of news sites available on-line, and traditional methods of accessing this huge news repository are overwhelmed. In this paper we present NowOnWeb, a news retrieval system that crawls articles from internet publishers and provides news searching and browsing.

Now On Web: News Search and Summarization

Javier Parapar, José M. Casanova, Álvaro Barreiro
Conference Proceedings of EUROCAST 2007, Las Palmas de Gran Canaria, Spain, February 12-16, 2007, Lecture Notes in Computer Science LNCS Vol.4739 pp. 225-232 | ISBN: 978-3-540-75866-2

Abstract

Agile access to the huge amount of information published by the thousands of news sites available on-line calls for the application of Information Retrieval techniques. The aim of this paper is to present NowOnWeb, a news retrieval system that obtains articles from different on-line sources and provides news searching and browsing. The main problems solved during the development of NowOnWeb were article recognition and extraction, redundancy detection and text summarization. For each of them we provided effective solutions that, taken together, gave rise to a system that reasonably satisfies the daily information needs of the user.

Current Teaching

  • Present 2015

    Software Development Methodologies (614G01051) (Metodologías de Desarrollo)

    Mandatory course for Software Engineering specialization (4th year) on the B.Sc. Eng. in Computer Science, elective course on the Information Systems specialization (4th year) (OBL. EI 4º 1C SE/ OPT. EI 4º 1C IS).

  • Present 2013

    Degree Projects in CS Eng (614G01227,614G01106) (Trabajos Fin de Grado)

    B.Sc. Eng. in Computer Science Degree Projects in the Software Engineering and Computer Science specializations (4th year) (OBL. CS)(OBL. SE).

  • Present 2012

    Information Systems Control (614G01044) (Calidad en Sistemas de Información)

    Mandatory course for Information Systems specialization (3rd year) on the B.Sc. Eng. in Computer Science, elective course on the Information Technologies specialization (4th year) (OBL. EI 3º 2C IS/ OPT. EI 4º 2C IT).

  • Present 2012

    Information Retrieval and Semantic Web (614502010) (Recuperación de Información y Web Semántica)

    Mandatory course on the M.Sc. Eng. in Computer Science (OBL. MsC EI 1C).

Past Teaching

  • 2015 2013

    Software Development Tools (614G01054) (Herramientas de Desarrollo)

    Mandatory course for Software Engineering specialization (4th year) on the B.Sc. Eng. in Computer Science (OBL. EI 4º 2C SE).

  • 2013 2012

    Degree Projects in B.Sc. Eng. (old plans) (Proyecto Fin de Carrera)

    B.Sc. Eng. in Computer Science Degree Projects (old plans) in the Software Engineering and Information Technologies specializations (3rd year).

  • 2012 2011

    Information Technology Audit (614111607) (Auditoría Informática)

    Elective course on the M.Sc. Eng. and B.Sc. Eng. in Computer Science (old plans, in extinction).

  • 2012 2011

    Programming II (614G01006) (Programación II)

    Mandatory course (2nd year) on the B.Sc. Eng. in Computer Science.

  • 2011 2010

    Information Systems Design (614111403) (Diseño de Sistemas de Información)

    Mandatory course (4th year) on the M.Sc. Eng.+ B.Sc. Eng. in Computer Science (old plans, in extinction).

  • 2011 2009

    Artificial Intelligence (614211654,614311654) (Inteligencia Artificial)

    Elective course on the B.Sc. Eng. in Computer Science (old plans, in extinction).

  • 2011 2009

    Cognitive Science (614211609,614311609) (Ciencia Cognitiva)

    Elective course on the B.Sc. Eng. in Computer Science (old plans, in extinction).

  • 2009 2008

    Programming Technology (614211203) (Tecnología de la Programación)

    Mandatory course on the B.Sc. Eng. in Computer Science (old plans, in extinction).

Other courses

  • May 2010

    Persistencia y manejo de documentos XML en Java

    Aula de Formación Informática

  • 2009 2007

    El sistema operativo Linux. Conceptos Básicos

    Aula de Formación Informática

  • March 2007

    El S.O. GNU/Linux. OpenOffice 2.0.

    Consejo Social UDC

  • Nov. 2007

    El S.O. GNU/Linux. OpenOffice 2.0.

    Confederación de Empresarios de Ferrol

At My Lab

You can find me at my lab, located in the Computer Science Faculty on the Elviña Campus:
  • Javier Parapar
  • Facultad de Informática, Campus de Elviña s/n
  • 15071, A Coruña, Spain

At My Office

You can also find me at my office (D1.14), located in the new building in the Research Area (behind the Faculty of Civil Engineering).