Roi Blanco's academic home page

Roi Blanco

University of A Coruña
Spain

Short Bio

Former senior Research Scientist at Yahoo Labs London in the Semantic Search group (formerly known as the Natural Language Retrieval group in the old glory days). Before moving to London, he spent six years in Yahoo Labs Barcelona. He is generally interested in applications of natural language processing for information retrieval, web search and mining and large scale information access in general, and has been active publishing at international conferences in those areas.

As part of this long tenure at Yahoo he has been a core contributor to company products, most notably Yahoo Search.

Previously he taught computer science at A Coruña University, from which he received his Ph.D. degree (cum laude) in 2008.

Academic Positions

2017 2016

Research Fellow

RMIT University, Melbourne, AU

2016 2009

Senior Researcher

Yahoo Labs, London UK
2012 2007

Associate Professor

University of A Coruña
2006

Research intern

ISTI-CNR, Pisa
2005

Research intern

University of Glasgow

Education & Training

Ph.D. 2008

Ph.D. in Computer Science

University of A Coruña
M.Sc. Eng.+B.Sc. Eng 2001

Ingeniero en Informática

University of A Coruña

Research

In this page I'm collecting references to research projects I've been involved in, so that they are not lost forever in the void of time and space. I'm including software that has been released as open sourced as well as demos that go along with a few research papers we've published. I'm also including some links to organization initiatives I've collaborated into, as well as some patents (owned by Yahoo).

Whenever I have time/feel like it I may write a larger essay about my research interests, the interesting people I've worked with and the kind of things I've done, but that would require starting a blog or something like that. Or not. In any case, a brief story is that I got my PhD degree from the University of A Coruña in 2008, directed by the very knowledgeable Alvaro Barreiro. My thesis title was Index Compression for Information Systems, and that might give an idea of what topics I was interested into at the time. At 2009 I joined Yahoo Research's Natural Language Retrieval group in Barcelona, which at the time was led by Hugo Zaragoza, working in the interesection of NLP and IR. When Hugo left to venture into the start-up business and opened Websays (which, coincidently, I'm a co-founder of), we started the Semantic Search group at Yahoo Labs, under Peter Mika, first in Barcelona, and later in London. At Yahoo I've done a lot of work on Web search and other projects mostly dealing with applied Machine Learning (for text) and content understanding using Natural Language Processing. We've published a good number of side-products of our research (see the Publications and Software pages) and even received an Outstanding Paper award at SIGIR, best system award at HCIR and other not-so-lucky best paper nominations at SIGIR, ISWC and other places.

Over the years I've had the opportunity to collaborate with many very talented researchers and to host smart, energetic PhD students. I owe a great deal of gratitude to all of them, for all the things that I've learned teaming up. I've worked on joint projects with outstanding academics like Sebastiano Vigna, Paolo Boldi, Soumen Chakrabarti, Mark Sanderson, Arjen de Vries, Roberto Grossi, Benjamin Piwowarski and Abraham Bernstein (just to cite a few) and co-wrote papers with many others like Raffaelle Perego, Harry Halpin, Falk Scholer, David Vallet, Joemon Jose, Hideo Joho, Adam Jatow, Kjetil Norvag among others. I was fortunate enough to have had outstanding colleagues in Yahoo Labs: Ricardo Baeza, Flavio Junqueira, Edgar Meij, Barla Cambazouglu, Jordi Atserias, Michael Matthews, Ronny Lempel, Fabrizio Silvestri, Nicola Barbieri, Ulf Brefeld, Gianmarco Defrancisi Morales ... (the list could go on forever).

Possibly the part of my job I'm the most proud of is to have worked with dynamic, eager, tiredless ex-interns and students, most of which are well positioned now on high tech cutting-edge companies or rallying on to a professorship tenured track. You are too many to list you all here without forgetting someone!

I've been a reviewer for a large number of conferences (SIGIR, WWW, ACL, CIKM, WSDM, IJCAI, ECIR, NAACL, EACL, ICTIR, WISE, INFOSCALE, AIRS, CERI, AICCSA among others), journals (Artificial Intelligence, Knowledge and Information Systems, Information Processing and Management, Journal of Discrete Algorithms, Natural Language Engineering, WWW Journal, Information Sytems, etc.) and many interesting workshops.

Software

Below you can find a list of software products that I've contributed to directly or that have been developed as a side result of a research project.

Fast Entity Linker (FEL)

FEL is and unsupervised, accurate, and extensible multilingual named entity recognition and linking system. At this time, Fast Entity Linker is one of only three freely-available multilingual named entity recognition and linking systems (others are DBpedia Spotlight and Babelfy). In addition to a stand-alone entity linker, the software includes tools for creating and compressing word/entity embeddings and datapacks for different languages from Wikipedia data. As an example, the datapack containing information from all of English Wikipedia is only ~2GB. The technical contributions of this system are described in two scientific papers: Fast and space-efficient entity linking in queries and Lightweight multilingual entity extraction and linking.
Anthelion

Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages. The project is open-sourced under the Apache License 2.0. and implements the techniques described in this CIKM paper.
Glimmer

Glimmer provides support for offline distributed indexing of RDF data using Hadoop MapReduce. It also contains an online ranking component using a state-of-the-art method based on BM25F released under the Apache License 2.0. For further information, the code implements the techniques described in this ISWC paper.
Time Explorer

Time Explorer is an application designed for analyzing how news changes over time. Time Explorer extends upon current time-based systems in many important ways. First, Time Explorer is designed to help users discover how entities such as people and locations associated with a query change over time. Second, by searching on time expressions extracted automatically from text, the application allows the user to explore not only how topics evolved the past, but also how they will continue to evolve in the future. Finally, Time Explorer is designed around an intuitive interface that allows users to interact with time and entities in a powerful way. While aspects of these features can be found in other systems, they are combined in Time Explorer in a way that allows searching through time in no time at all. This paper describes the architecture and techniques we used in Time Explorer.
Entity2Vec

This library generates semantic embeddings of entities from text that describes them. It can also quantize and compress the obtained models. The training code is written in Python and it requires Numpy, Scipy, Numexpr, Theano, and it also relies on gensim, which is included as a git submodule. The code for model compression and entity scoring is instead written in Java..

This WSDM paper describes the algorithms implemented in Entity2Vec with an application to Entity Linking.
Diversity Engine

DiversityEngine is a framework that provides different tools to create diversity-aware multimedia search applications. This IJMR paper describes the architecture and services deployed within this framework (mostly discontinued).
Terrier

Terrier is a highly flexible, efficient, and effective open source search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities, and provides an ideal platform for the rapid development and evaluation of large-scale retrieval applications.

During the first year of my PhD I did an internship at Glasgow University, hosted by Iadh Ounis. At that time I implemented different indexing and compression mechanisms for the search engine, most of them explained in this seminar slides. Some of the code I wrote is also used in other projects like Ivory.

Data

Semantic Search Competition
NCTIR Temporalia
Yahoo Webscope

The Yahoo Webscope Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists.

Yahoo is pleased to make these datasets available to researchers who are advancing the state of knowledge and understanding in web sciences. The datasets are only available for academic use by faculty and university researchers who agree to the Data Sharing Agreement.

I've contributed to creating some Webscope datasets, like the L22 - Yahoo! News Sessions Content which was used in this SIGIR paper, and we've used the L24 - Yahoo Search Query Log To Entities in this WSDM paper.
A Dataset for Evaluating Entity Retrieval over Time (DEERT v.0)

The TREC Novelty track in 2004 consisted on a collection of news articles and a set of topics for evaluating retrieval of novel information over lists of documents ordered in time for each topic. The systems had to retrieve information (i.e., sentences in this case) relevant to the topic and not yet present in the retrieved results. A time-stamped list of documents is provided for every topic reflecting the temporal flow of the story the topic is about. We created a new collection based on the one developed at the TREC 2004 Novelty track for evaluating entity retrieval over time.

This dataset was used in this SIGIR poster and this CIKM paper.
Support Sentences Evaluation Dataset (SEE v.0)

We created a dataset to study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, had not been addressed before.

This dataset was used in this SIGIR paper where we propose several alternatives for selecting descriptive sentences.
Keyword search over RDF graphs

Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the SemanticWeb data model RDF as a representation model. Querying these knowledge bases is typically done using structured queries utilizing graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users which limits the accessibility to such data sources. To overcome this, keyword search must be supported.

We created a benchmark that contains a set of structured queries, possibly augmented with keywords, along with their descriptions and gathered relevance assessment for each result using at least 4 different human judges. Overall, we had about 15,000 unique relevance assessments for the 30 queries.

This dataset was used in this CIKM paper.
Ranking Related News Predictions

We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access this information. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions to future events, which we call ranking related news predictions. We formally defined this task and proposed a learning to rank approach based on 4 classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity.

We created a benchmark that contains 52 queries. For each one of them we retrieved up to 100 sentences that contained predictions. On average 94 sentences with future mentions were retrieved, with an average of 1.2 future dates per prediction. Assessors evaluated 4,888 query/prediction pairs (approximately 6,032 of triples) using 5 levels of relevance: 4 for excellent (very relevant prediction), 3 for good (relevant prediction), 2 for fair (related prediction), 1 for bad (non-relevant prediction), and 0 for non prediction (incorrect tagged date).

This dataset was used in this SIGIR paper.

Organization

I've co-organized and served as a PC on many major Information Retrieval/Natural Language conferences and journals. Currently I'm SPC in WWW and CIKM and have been area chair in CIKM and AIRS. I've also helped to co-organize workshops and competitions for researchers (listed below for references to data and proceedings).

Conference organization:

International Conference on Informatation and Knowledge Management CIKM 2014 (Information Retrieval co-chair)

The Ninth Asia Information Retrieval Societies Conference AIRS 2013 (Machine learning and Data mining co-chair)

34th European Conference on Information Retrieval ECIR 2012

The Third International ICST Conference on Scalable Information Systems Infoscale 2008
Semantic Search Competition

Semsearch was a competition that run for two years (2010 and 2011) that required participants to answer queries of varying complexity based on a set of structured data collected from the Web. The competition had two tracks. The first "Entity Search Track" consists of queries that refer to one particular entity. The second "List Search Track" consists of complex queries with multiple possible answers.

Queries 2010, Relevance Assessments 2010

Queries Entity Track 2011, Relevance Assessments Entity Track 2011

Queries Types Track 2011, Relevance Assessments Type Track 2011
NTCIR Temporalia

Temporalia is a competition that has been running since 2012 for two different editions, each one spanning approximately one and a half years.

The objective of this task is to foster research in temporal information access. Given the fact that time plays crucial role in estimating information relevance and validity we believe that successful search engines must consider temporal aspects of information in greater detail. We propose a challenge that establishes common grounds for designing and analyzing temporally-aware information access systems.

Temporal Information Retrieval has been gaining a lot of interest in IR and related research communities. It can be defined as a subset of document retrieval in which time plays crucial role in estimating document relevance. The objective of this task is to systematize various requirements in Temporal IR and offer a standardized challenge based on which competing systems can be compared and analyzed. Our analysis suggests that although many of temporal information needs seek for recent (fresh) information, a good proportion of them also search for information about past incidents as well as future incidents. Although there are several evaluation tasks that involve search and filtering over time (e.g., TDT, NTCIR GeoTime, TREC Temporal Summarization), there is no test collection to measure the performance of search applications across temporal information needs categories such as Past, Recent, Seasonal, and Future in a systematic way.

This paper describes the Temporalia-1 outcome (the second edition is still running).
Efficiency Issues in Information Retrieval Workshop

In 2008 I co-organized the first Efficiency Issues in Information Retrieval Workshop, which was collocated with ECIR in Glasgow.

Today's technological advancements have allowed for vast amounts of information to be widely generated, disseminated and stored. This exponentially increasing amount of information has rendered the retrieval of relevant information a necessary and cumbersome task. The field of Information Retrieval addresses this task by developing systems in an effective and efficient way.

Specifically, IR effectiveness deals with retrieving the most relevant information to a user need, while IR efficiency deals with providing fast and ordered access to large amounts of information. The efficiency of IR systems is of utmost importance, because it ensures that systems scale up to the vast amounts of information needing retrieval. This is an important topic of research for both academic and corporative environments. In academia, it is imperative for new ideas and techniques to be evaluated on as near-realistic environments as possible; this is reflected in the past Terabyte track and recent Million Query track organised by the Text REtrieval Evaluation Conferences (TREC).

In corporate environments, it is important that systems response time is kept low, and the amount of data processed high. These efficiency concerns need to be addressed in a principled way, so that they can be adapted to new platforms and environments, such as information retrieval from mobile devices, desktop search, distributed peer to peer, expert search, collaborative filtering, multimedia retrieval, and so on. Efficiency research over the past years has focused on efficient indexing, storage (compression) and retrieval of data (query processing strategies). Some of the questions that we targetted with this workshop were: What are the efficiency concernsregarding IR applications (both new and traditional)? Do new applications create novel efficiency problems? Can existing efficiency related technology deal with these new applications? Has there been any advance in the last decade on state-of-the-art efficiency, or is it at a stand-still?

The proceedings are available online.
LSDSIR 2010

The workshop series Large Scale and Distributed Systems for Information Retrieval has been running for a good number of years. In 2010 I helped to co-organize its 8th edition, which was held collocated with SIGIR'10 in Geneva.

The Web is estimated to contain at least 21 billion Web pages as of January 2010. In parallel to the growth of the Web, the population of Web users has also rapidly increased, and the patterns of interaction of those users with search engines have changed. Today, the main research challenge is to scale large-scale search engines with the growth of the Web, without sacrificing from user satisfaction.

Traditionally, large-scale commercial search engines operate on one or more data centers, each containing multiple, very large clusters of computers. Coping with the increasing number of user requests as well as crawling and indexing more pages requires adding more computational resources to data centers. There are, however, constraints in the scalability of such systems, such as financial costs and physical space constraints. Consequently, distributing the functionality of a search engine along with improving the utilization of resources in a single data center becomes of ultimate importance to the success of future generations of search systems.

Distributed search engines do not only solve the scalability issue, but they enable new classes of applications that leverage the distribution of processing units across a geographical area. As another facet of distributed IR, we are interested in contributions that propose different ways of using diversity and multiplicity of resources available in such distributed systems. More specifically, we are interested in novel applications, models, and architectures that deal with efficiency and scalability issues as well as data diversity and community-oriented information sharing.

This workshop brought together both experienced and young researchers from distributed IR, including work on P2P search and efficiency of distributed systems for information processing. This edition of the workshop favored novel, perhaps even outrageous ideas as opposed to finished research work, thus strongly encouraging the submission of position papers in addition to research papers. Position papers are important to foster discussion upon controversial and intriguing ideas on new ways of building distributed infrastructures for information processing.

The proceedings are available online.

Patents and defensive publications

Method or system for ranking related news predictions

United States US Patent App. 13/538,798
Quote-Based Search

United States 20130159340
Using cache invalidation to assign documents to indexes in a distributed search engine

United States Defensive publication, Docket No. ID‐11‐7640
Caching Search Engine Results over Incremental Indices

United States Defensive publication, Docket No. ID‐10‐6334
Interactive interface for object search

United States 20130159222

Random Stuff

Other software I've used extensively for research projects:

Managing Gigabyes for Java (mg4j)

Some (old) teaching resources in Spanish:

Cognitive Science (Opt. II-ITIG-ITIS, 2C). Labs webpage.

"Modelos y Técnicas Avanzadas en Recuperación de Información" (MTARI) in the Doctorate Programme of the Computer Science Department and in the Master Degree in Computing of the University of A Coruña.

"El sistema operativo Linux. Conceptos Básicos" [Slides](Spanish)
Aula de Formación Informática, Since 2005.

"El S.O. GNU/Linux. OpenOffice 2.0." [Slides Part I](Spanish) [Slides Part II](Spanish)
Consejo Social UDC, March 2007.

Publications

This is a list of my publications along with their pdfs and bibtext. For more information you can check my Google Scholar or DBLP profile pages.

Filter by type:

Sort by year:

Click Through Rate Prediction for Local Search Results

Fidel Cacheda, Nicola Barbieri, Roi Blanco

Paper WSDM 2017 - 10th ACM International Conference on Web Search and Data Mining

Abstract

With the ubiquity of internet access and location services provided by smartphone devices, the volume of queries issued by users to find products and services that are located near them is rapidly increasing. Local search engines help users in this task by matching queries with a predefined geographical connotation (“local queries”) against a database of local business listings. Local search differs from traditional web-search because to correctly capture users’ click behavior, the estimation of relevance between query and candidate results must be integrated with geographical signals, such as distance. The intuition is that users prefer businesses that are physically closer to them. However, this notion of closeness is likely to depend upon other factors, like the category of the business, the quality of the service provided, the density of businesses in the area of interest, etc. In this paper we perform an extensive analysis of online users’ behavior and investigate the problem of estimating the click-through rate on local search (LCTR) by exploiting the combination of standard retrieval methods with a rich collection of geo and business-dependent features. We validate our approach on a large log collected from a real-world local search service. Our evaluation shows that the non-linear combination of business information, geo-local and textual relevance features leads to a significant improvements over state of the art alternative approaches based on a combination of relevance, distance and business reputation.

A Concise Integer Linear Programming Formulation for Implicit Search Result Diversification

Hai-Tao Yu, Adam Jatowt, Roi Blanco, Hideo Joho, Joemon M. Jose, Long Chen, Fajie Yuan

Paper WSDM 2017 - 10th ACM International Conference on Web Search and Data Mining

Abstract

To cope with ambiguous and/or underspecified queries, search result diversification (SRD) is a key technique that has attracted a lot of attention. This paper focuses on implicit SRD, where the possible subtopics underlying a query are unknown beforehand. We formulate implicit SRD as a process of selecting and ranking k exemplar documents that utilizes integer linear programming (ILP). Unlike the common practice of relying on approximate methods, this formulation enables us to obtain the optimal solution of the objective function. Based on four benchmark collections, our extensive empirical experiments reveal that: (1) The factors, such as different initial runs, the number of input documents, query types and the ways of computing document similarity significantly affect the performance of diversification models. Careful examinations of these factors are highly recommended in the development of implicit SRD methods. (2) The proposed method can achieve substantially improved performance over the state-of-the-art unsupervised methods for implicit SRD.

Lightweight Multilingual Entity Extraction and Linking

Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, Kapil Thadani

Paper WSDM 2017 - 10th ACM International Conference on Web Search and Data Mining

Abstract

Text analytics systems often rely heavily on detecting and linking entity mentions in documents to knowledge bases for downstream applications such as sentiment analysis, question answering and recommender systems. A major challenge for this task is to be able to accurately detect entities in new languages with limited labeled resources. In this paper we present an accurate and lightweight 1 multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) Lightweight named entity recognition with competitive accuracy; 2) Candidate entity retrieval that uses search clicklog data and entity embeddings to achieve high precision with a low memory footprint; and 3) efficient entity disambiguation. Our system achieves state-of-the-art performance on tac kbp 2013 multilingual data and on English aida-conll data

Exploiting Green Energy to Reduce the Operational Costs of Multi-Center Web Search Engines

Roi Blanco, Matteo Catena, Nicola Tonellotto

Paper WWW 2016 - 25th International Conference on World Wide Web

Abstract

Carbon dioxide emissions resulting from fossil fuels (brown energy) combustion are the main cause of global warming due to the greenhouse effect. Large IT companies have recently increased their efforts in reducing the carbon dioxide footprint originated from their data center electricity consumption. On one hand, better infrastructure and modern hardware allow for a more efficient usage of electric resources. On the other hand, data-centers can be powered by renewable sources (green energy) that are both environmental friendly and economically convenient. In this paper, we tackle the problem of targeting the usage of green energy to minimize the expenditure of running multi-center Web search engines, i.e., systems composed by multiple, geographically remote, computing facilities. We propose a mathematical model to minimize the operational costs of multi-center Web search engines by exploiting renewable energies whenever available at different locations. Using this model, we design an algorithm which decides what fraction of the incoming query load arriving into one processing facility must be forwarded to be processed at different sites to use green energy sources. We experiment using real traffic from a large search engine and we compare our model against state of the art baselines for query forwarding. Our experimental results show that the proposed solution maintains an high query throughput, while reducing by up to ~25% the energy operational costs of multi-center search engines. Additionally, our algorithm can reduce the brown energy consumption by almost 6% when energy-proportional servers are employed.

@inproceedings{Blanco:2016:EGE:2872427.2883021,
 author = {Blanco, Roi and Catena, Matteo and Tonellotto, Nicola},
 title = {Exploiting Green Energy to Reduce the Operational Costs of Multi-Center Web Search Engines},
 booktitle = {Proceedings of the 25th International Conference on World Wide Web},
 series = {WWW '16},
 year = {2016},
 isbn = {978-1-4503-4143-1},
 location = {Montr\&\#233;al, Qu\&\#233;bec, Canada},
 pages = {1237--1247},
 numpages = {11},
 url = {https://doi.org/10.1145/2872427.2883021},
 doi = {10.1145/2872427.2883021},
 acmid = {2883021},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {green energy, query forwarding, web search engines},
}

Term-by-Term Query Auto-Completion for Mobile Search

Saul Vargas, Roi Blanco, Peter Mika

Paper WSDM 2016 - Ninth ACM International Conference on Web Search and Data Mining

Abstract

With the ever increasing usage of mobile search, where text input is typically slow and error-prone, assisting users to formulate their queries contributes to a more satisfactory search experience. Query auto-completion (QAC) techniques, which predict possible completions for user queries, are the archetypal example of query assistance and are present in most search engines. We argue, however, that classic QAC, which operates by suggesting whole-query completions, may be sub-optimal for the case of mobile search as the available screen real estate to show suggestions is limited and editing is typically slower than in desktop search. In this paper we propose the idea of term-by-term QAC, which is a new technique inspired by predictive keyboards that suggests to the user one term at a time, instead of whole-query completions. We describe an efficient mechanism to implement this technique and an adaptation of a prior user model to evaluate the effectiveness of both standard and term-by-term QAC approaches using query log data. Our experiments with a mobile query log from a commercial search engine show the validity of our approach according to this user model with respect to saved characters, saved terms and examination effort. Finally, a user study provides further insights about our term-by-term technique compared with standard QAC with respect to the variables analyzed in the query log-based evaluation and additional variables related to the successfulness, the speed of the interactions and the properties of the submitted queries.

@inproceedings{Vargas:2016:TQA:2835776.2835813,
 author = {Vargas, Sa\'{u}l and Blanco, Roi and Mika, Peter},
 title = {Term-by-Term Query Auto-Completion for Mobile Search},
 booktitle = {Proceedings of the Ninth ACM International Conference on Web Search and Data Mining},
 series = {WSDM '16},
 year = {2016},
 isbn = {978-1-4503-3716-8},
 location = {San Francisco, California, USA},
 pages = {143--152},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2835776.2835813},
 doi = {10.1145/2835776.2835813},
 acmid = {2835813},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {query auto completion, query logs, user models, word prediction},
}

Memory-based Recommendations of Entities for Web Search Users

Ignacio Fernandez-Tobias, Roi Blanco

Paper CIKM 2016 - 25th ACM International on Conference on Information and Knowledge Management.

Abstract

Modern search engines have evolved from mere document retrieval systems to platforms that assist the users in discovering new information. In this context, entity recommendation systems exploit query log data to proactively provide the users with suggestions of entities (people, movies, places, etc.) from knowledge bases that are relevant for their current information need. Previous works consider the problem of ranking facts and entities related to the user's current query, or focus on specific recommendation domains requiring supervised selection and extraction of features from knowledge bases. In this paper we propose a set of domain-agnostic methods based on nearest neighbors collaborative filtering that exploit query log data to generate entity suggestions, taking into account the user's full search session. Our experimental results on a large dataset from a commercial search engine show that the proposed methods are able to compute relevant entity recommendations outperforming a number of baselines. Finally, we perform an analysis on a cross-domain scenario using different entity types, and conclude that even if knowing the right target domain is important for providing effective recommendations, some inter-domain user interactions are helpful for the task at hand.

@inproceedings{Fernandez-Tobias:2016:MRE:2983323.2983823,
 author = {Fern\'{a}ndez-Tob\'{\i}as, Ignacio and Blanco, Roi},
 title = {Memory-based Recommendations of Entities for Web Search Users},
 booktitle = {Proceedings of the 25th ACM International on Conference on Information and Knowledge Management},
 series = {CIKM '16},
 year = {2016},
 isbn = {978-1-4503-4073-1},
 location = {Indianapolis, Indiana, USA},
 pages = {35--44},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2983323.2983823},
 doi = {10.1145/2983323.2983823},
 acmid = {2983823},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {entity recommendation, recommender systems, web search},
}

An In-Depth Study of Implicit Search Result Diversification

Haitao Yu, Adam Jatow, Roi Blanco, Hideo Joho, Joemon M. Jose, Long Chen, Fajie Yuan

Paper AIRS 2016 - 12th Asia Information Retrieval Societies Conference.

Abstract

In this paper, we present a novel Integer Linear Programming formulation (termed ILP4ID) for implicit search result diversification (SRD). The advantage is that the exact solution can be achieved, which enables us to investigate to what extent using the greedy strategy affects the performance of implicit SRD. Specifically, a series of experiments are conducted to empirically compare the state-of-the-art methods with the proposed approach. The experimental results show that: (1) The factors, such as different initial runs and the number of input documents, greatly affect the performance of diversification models. (2) ILP4ID can achieve substantially improved performance over the state-of-the-art methods in terms of standard diversity metrics.

@Inbook{Yu2016,
author="Yu, Hai-Tao
and Jatowt, Adam
and Blanco, Roi
and Joho, Hideo
and Jose, Joemon
and Chen, Long
and Yuan, Fajie",
editor="Ma, Shaoping
and Wen, Ji-Rong
and Liu, Yiqun
and Dou, Zhicheng
and Zhang, Min
and Chang, Yi
and Zhao, Xin",
title="An In-Depth Study of Implicit Search Result Diversification",
bookTitle="Information Retrieval Technology: 12th Asia Information Retrieval Societies Conference, AIRS 2016, Beijing, China, November 30 -- December 2, 2016, Proceedings",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="342--348",
isbn="978-3-319-48051-0",
doi="10.1007/978-3-319-48051-0_29",
url="http://dx.doi.org/10.1007/978-3-319-48051-0_29"
}

Building Test Collections for Evaluating Temporal IR

Hideo Joho, Adam Jatow, Roi Blanco, Haitao Yu, Shuhei Yamamoto

Paper SIGIR 2016 - 39th International ACM SIGIR conference on Research and Development in Information Retrieval.

Abstract

Research on temporal aspects of information retrieval has recently gained considerable interest within the Information Retrieval (IR) community. This paper describes our efforts for building test collections for the purpose of fostering temporal IR research. In particular, we overview the test collections created at the two recent editions of Temporal Information Access (Temporalia) task organized at NTCIR-11 and NTCIR-12, report on selected results and discuss several observations we made during the task design and implementation. Finally, we outline further directions for constructing test collections suitable for temporal IR.

@inproceedings{Joho:2016:BTC:2911451.2914673,
 author = {Joho, Hideo and Jatowt, Adam and Blanco, Roi and Yu, Haitao and Yamamoto, Shuhei},
 title = {Building Test Collections for Evaluating Temporal IR},
 booktitle = {Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '16},
 year = {2016},
 isbn = {978-1-4503-4069-4},
 location = {Pisa, Italy},
 pages = {677--680},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/2911451.2914673},
 doi = {10.1145/2911451.2914673},
 acmid = {2914673},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {temporal ir, temporal query intents, temporal search result diversification, test collections},
}

Re-finding Behavior in Vertical Domains

Sargol Sadeghi, Roi Blanco, Peter Mika, Mark Sanderson, Falk Scholer and David Vallet

Journal Paper Transacions on Information Systems (2016)

Abstract

We differentiate and model user re-finding behavior within different media and topic domains that are related to the domains of vertical search engines (e.g. images, news, reference material, and movies). We distinguish the re-finding behavior from general search, and engineer features that are effective in differentiating re-finding across the domains. The features are then used to build machine learned models. While the accuracy of detection is 85.7%, compared with 97.5% when examining re-finding from general search tasks, to the best of our knowledge such an examination of vertical re-finding has not been tried before. We attempt to differentiate re-finding behavior when the history of a searchers interactions are not available. In this scenario we achieve an average accuracy of 77.5% across the domains. We also examine early detection of re-finding during a searchers session. Finally, we investigate in which types of domains is re-finding most difficult. Here, it would appear that re-finding images is particularly challenging for users. This research has implications for search engine design, in terms of adapting search results by predicting the type of user tasks and enabling the presentation of vertical-specific results when re-finding is identified.

Temporal Information Retrieval

Nattiya Kanhabua, Roi Blanco, Kjetil Norvag

Book Foundations and Trends in Information Retrieval 2015.

Temporal dynamics and how they impact upon various components of information retrieval (IR) systems have received a large share of attention in the last decade. In particular, the study of relevance in information retrieval can now be framed within the so-called temporal IR approaches, which explain how user behavior, document content and scale vary with time, and how we can use them in our favor in order to improve retrieval effectiveness. This survey provides a comprehensive overview of temporal IR approaches, centered on the following questions: what are temporal dynamics, why do they occur, and when and how to leverage temporal information throughout the search cycle and architecture. We first explain the general and wide aspects associated to temporal dynamics by focusing on the web domain, from content and structural changes to variations of user behavior and interactions. Next, we pinpoint several research issues and the impact of such temporal characteristics on search, essentially regarding processing dynamic content, temporal query analysis and time-aware ranking. We also address particular aspects of temporal information extraction (for instance, how to timestamp documents and generate temporal profiles of text). To this end, we present existing temporal search engines and applications in related research areas, e.g., exploration, summarization, and clustering of search results, as well as future event retrieval and prediction, where the time dimension also plays an important role.

@article{INR-043,
url = {http://dx.doi.org/10.1561/1500000043},
year = {2015},
volume = {9},
journal = {Foundations and Trends® in Information Retrieval},
title = {Temporal Information Retrieval},
doi = {10.1561/1500000043},
issn = {1554-0669},
number = {2},
pages = {91-208},
author = {Nattiya Kanhabua and Roi Blanco and Kjetil Nørvåg}

Using graph distances for named-entity linking

Roi Blanco, Paolo Boldi, Andrea Marino

Journal Paper Science of computer programming (2016)

Abstract

Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base. When the knowledge base is Wikipedia, the problem is also referred to as wikification (in this case, items are Wikipedia articles). Entity-linking consists conceptually of many different phases: identifying the portions of text that may refer to an entity (sometimes called “entity detection”), determining a set of concepts (candidates) from the knowledge base that may match each such portion, and choosing one candidate for each set; the latter step, known as candidate selection, is the phase on which this paper focuses. One instance of candidate selection can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between the selected items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; we propose several heuristics trying to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset. Finally, in the appendix, we show an exact linear time algorithm that works under some more restrictive assumptions.


@article{BlancoLinking2015,
title = "Using graph distances for named-entity linking ",
journal = "Science of Computer Programming ",
volume = "",
number = "",
pages = " - ",
year = "2015",
note = "",
issn = "0167-6423",
doi = "http://dx.doi.org/10.1016/j.scico.2015.10.013",
url = "http://www.sciencedirect.com/science/article/pii/S0167642315003160",
author = "Roi Blanco and Paolo Boldi and Andrea Marino",
}

Predicting primary categories of business listings for local search ranking

Changsung Kahn, Jeehaeng Lee, Roi Blanco, Yi Chang

Journal Paper Neurocomputing (2015)

Abstract

We consider the problem of identifying primary categories of a business listing among the categories provided by the owner of the business, in order to enhance local search and browsing. The category information submitted by business owners cannot be trusted with absolute certainty since they may purposefully add some secondary or irrelevant categories to increase recall in local search results, which makes category search very challenging for local search engines. Thus, identifying primary categories of a business is a crucial problem in local search. This problem can be cast as a multi-label classification problem with a large number of categories. However, the large scale of the problem makes it infeasible to use conventional supervised-learning-based text categorization approaches. We propose a large-scale classification framework that leverages multiple types of classification labels to produce a highly accurate classifier with fast training time. We effectively combine the complementary label sources to refine prediction. The experimental results indicate that our framework achieves very high precision and recall and outperforms a competitive baseline using a centroid-based method. We also propose a new ranking feature based on the mapping of queries and documents to category space and show that the new feature leads to ranking relevance improvements for local search.

@article{Khan2015961,
title = "Predicting primary categories of business listings for local search ranking ",
journal = "Neurocomputing ",
volume = "168",
number = "",
pages = "961 - 969",
year = "2015",
note = "",
issn = "0925-2312",
doi = "http://dx.doi.org/10.1016/j.neucom.2015.05.029",
url = "http://www.sciencedirect.com/science/article/pii/S0925231215006815",
author = "Changsung Khan and Jeehaeng Lee and Roi Blanco and Yi Chang",
keywords = "Vertical search",
keywords = "Text categorization",
keywords = "Primary category",
keywords = "Ranking relevance ",
}

IntoNews: Online news retrieval using closed captions

Roi Blanco,Gianmarco De Francisci Morales, Fabrizio Silvestri

Journal Paper Information Processing and Management (2015) .

Abstract

We present INTONEWS, a system to match online news articles with spoken news from a television newscasts represented by closed captions. We formalize the news matching problem as two independent tasks: closed captions segmentation and news retrieval. The system segments closed captions by using a windowing scheme: sliding or tumbling window. Next, it uses each segment to build a query by extracting representative terms. The query is used to retrieve previously indexed news articles from a search engine. To detect when a new article should be surfaced, the system compares the set of retrieved articles with the previously retrieved one. The intuition is that if the difference between these sets is large enough, it is likely that the topic of the newscast currently on air has changed and a new article should be displayed to the user. In order to evaluate INTONEWS, we build a test collection using data coming from a second screen application and a major online news aggregator. The dataset is manually segmented and annotated by expert assessors, and used as our ground truth. It is freely available for download through the Webscope program. Our evaluation is based on a set of novel time-relevance metrics that take into account three different aspects of the problem at hand: precision, timeliness and coverage. We compare our algorithms against the best method previously proposed in literature for this problem. Experiments show the trade-offs involved among precision, timeliness and coverage of the airing news. Our best method is four times more accurate than the baseline.


@article{DBLP:journals/ipm/BlancoMS15,
  author    = {Roi Blanco and
               Gianmarco De Francisci Morales and
               Fabrizio Silvestri},
  title     = {IntoNews: Online news retrieval using closed captions},
  journal   = {Inf. Process. Manage.},
  volume    = {51},
  number    = {1},
  pages     = {148--162},
  year      = {2015},
  url       = {http://dx.doi.org/10.1016/j.ipm.2014.07.010},
  doi       = {10.1016/j.ipm.2014.07.010},
  timestamp = {Wed, 05 Nov 2014 00:00:00 +0100},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/ipm/BlancoMS15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Ranking of daily deals with concept expansion

Roi Blanco, Michael Mattews, Peter Mika

Journal Paper Information Processing and Management (2015) .

Abstract

We formalize the problem of retrieving daily deals in the context of Web search.We effectively combine keyword-based retrieval with automated classification.Our solution outperforms state-of-the-art query expansion and prior ad ranking work. Daily deals have emerged in the last three years as a successful form of online advertising. The downside of this success is that users are increasingly overloaded by the many thousands of deals offered each day by dozens of deal providers and aggregators. The challenge is thus offering the right deals to the right users i.e., the relevance ranking of deals. This is the problem we address in our paper. Exploiting the characteristics of deals data, we propose a combination of a term- and a concept-based retrieval model that closes the semantic gap between queries and documents expanding both of them with category information. The method consistently outperforms state-of-the-art methods based on term-matching alone and existing approaches for ad classification and ranking.

@article{Blanco:2015:RDD:2793724.2793930,
 author = {Blanco, Roi and Matthews, Michael and Mika, Peter},
 title = {Ranking of Daily Deals with Concept Expansion},
 journal = {Information Processing and Management},
 issue_date = {July 2015},
 volume = {51},
 number = {4},
 month = jul,
 year = {2015},
 issn = {0306-4573},
 pages = {359--372},
 numpages = {14},
 url = {http://dx.doi.org/10.1016/j.ipm.2015.01.003},
 doi = {10.1016/j.ipm.2015.01.003},
 acmid = {2793930},
 publisher = {Pergamon Press, Inc.},
 address = {Tarrytown, NY, USA},
 keywords = {Deals ranking, Query expansion, Semantic search, Text classification},
}

Temporal information searching behaviour and strategies

Hideo Joho, Adam Jatow, Roi Blanco

Journal Paper Information Processing and Management 2015 .

Abstract

Temporal information searching behaviour and strategies were investigated.Searching patterns were identified for past, recency and future search tasks.Implications for the development of temporal IR systems are discussed. Temporal aspects have been receiving a great deal of interest in Information Retrieval and related fields. Although previous studies have proposed, designed and implemented temporal-aware systems and solutions, understanding of people's temporal information searching behaviour is still limited. This paper reports the findings of a user study that explored temporal information searching behaviour and strategies in a laboratory setting. Information needs were grouped into three temporal classes (Past, Recency, and Future) to systematically study their characteristics. The main findings of our experiment are as follows. (1) It is intuitive for people to augment topical keywords with temporal expressions such as history, recent, or future as a tactic of temporal search. (2) However, such queries produce mixed results and the success of query reformulations appears to depend on topics to a large extent. (3) Search engine interfaces should detect temporal information needs to trigger the display of temporal search options. (4) Finding a relevant Wikipedia page or similar summary page is a popular starting point of past information needs. (5) Current search engines do a good job for information needs related to recent events, but more work is needed for past and future tasks. (6) Participants found it most difficult to find future information. Searching for domain experts was a key tactic in Future search, and file types of relevant documents are different from other temporal classes. Overall, the comparison of search across temporal classes indicated that Future search was the most difficult and the least successful followed by the search for the Past and then for Recency information. This paper discusses the implications of these findings on the design of future temporal IR systems.


@article{Joho:2015:TIS:2829380.2829564,
 author = {Joho, Hideo and Jatowt, Adam and Blanco, Roi},
 title = {Temporal Information Searching Behaviour and Strategies},
 journal = {Inf. Process. Manage.},
 issue_date = {November 2015},
 volume = {51},
 number = {6},
 month = nov,
 year = {2015},
 issn = {0306-4573},
 pages = {834--850},
 numpages = {17},
 url = {http://dx.doi.org/10.1016/j.ipm.2015.03.006},
 doi = {10.1016/j.ipm.2015.03.006},
 acmid = {2829564},
 publisher = {Pergamon Press, Inc.},
 address = {Tarrytown, NY, USA},
 keywords = {Information searching behaviour, Search strategies, Temporal information retrieval, User study},
}

Predicting Re-finding Activity and Difficulty

Sargol Sadeghi, Roi Blanco, Peter Mika, Mark Sanderson, Falk Scholer, David Vallet

Paper ECIR 2015 - 37th European conference on Advances in Information Retrieval.

Abstract

In this study, we address the problem of identifying if users are attempting to re-find information and estimating the level of difficulty of the re-finding task. We propose to consider the task information (e.g. multiple queries and click information) rather than only queries. Our resultant prediction models are shown to be significantly more accurate (by 2%) than the current state of the art. While past research assumes that previous search history of the user is available to the prediction model, we examine if re-finding detection is possible without access to this information. Our evaluation indicates that such detection is possible, but more challenging. We further describe the first predictive model in detecting re-finding difficulty, showing it to be significantly better than existing approaches for detecting general search difficulty.


@Inbook{Sadeghi2015,
author="Sadeghi, Sargol
and Blanco, Roi
and Mika, Peter
and Sanderson, Mark
and Scholer, Falk
and Vallet, David",
editor="Hanbury, Allan
and Kazai, Gabriella
and Rauber, Andreas
and Fuhr, Norbert",
chapter="Predicting Re-finding Activity and Difficulty",
title="Advances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29 - April 2, 2015. Proceedings",
year="2015",
publisher="Springer International Publishing",
address="Cham",
pages="715--727",
isbn="978-3-319-16354-3",
doi="10.1007/978-3-319-16354-3_78",
url="http://dx.doi.org/10.1007/978-3-319-16354-3_78"
}

Online News Tracking for Ad-Hoc Information Needs

Jeroen B. P. Vuurens, Arjen P. de Vries, Roi Blanco, Peter Mika

Paper ICTIR 2015 - Fith International Conference on the Theory of Information Retrieval.

Abstract

Following online news about a specific event can be a difficult task as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We propose a three-step approach to online news tracking for ad-hoc information needs. First, we continuously cluster the titles of all incoming news articles. Then, we select the clusters that best fit a user's ad-hoc information need and identify salient sentences. Finally, we select sentences for the summary based on novelty and relevance to the information seen, without requiring an a-priori model of events of interest. We evaluate this approach using the 2013 TREC Temporal Summarization test set and show that compared to existing systems our approach retrieves news facts with significantly higher F-measure and Latency-Discounted Expected Gain.

@inproceedings{Vuurens:2015:ONT:2808194.2809474,
 author = {Vuurens, Jeroen B.P. and de Vries, Arjen P. and Blanco, Roi and Mika, Peter},
 title = {Online News Tracking for Ad-Hoc Information Needs},
 booktitle = {Proceedings of the 2015 International Conference on The Theory of Information Retrieval},
 series = {ICTIR '15},
 year = {2015},
 isbn = {978-1-4503-3833-2},
 location = {Northampton, Massachusetts, USA},
 pages = {221--230},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2808194.2809474},
 doi = {10.1145/2808194.2809474},
 acmid = {2809474},
 publisher = {ACM},
 address = {New York, NY, USA},
}

Insights into Entity Recommendation in Web Search

Nitish Aggarwal, Peter Mika, Roi Blanco, Paul Buitelaar

Workshop Paper IWSD@ISWC '15 .

Abstract

User engagement is a fundamental goal for search engines. Recommendations of entities that are related to the user’s original search query can increase engagement by raising interest in these entities and thereby extending the user’s search session. Related entity recommendations have thus become a standard feature of the interfaces of modern search engines. These systems typically combine a large number of individual signals (features) extracted from the content and interaction logs of a variety of sources. Such studies, however, do not reveal the contribution of individual features, their importance and interaction, or the quality of the sources. In this work, we measure the performance of entity recommendation features individually and by combining them based on a novel dataset of 4.5K search queries and their related entities, which have been evaluated by human assessors.

@inproceedings{DBLP:conf/semweb/AggarwalMBB15,
  author    = {Nitish Aggarwal and
               Peter Mika and
               Roi Blanco and
               Paul Buitelaar},
  title     = {Insights into Entity Recommendation in Web Search},
  booktitle = {Proceedings of the 4th International Workshop on Intelligent Exploration
               of Semantic Data {(IESD} 2015) co-located with the 14th International
               Semantic Web Conference {(ISWC} 2015), Bethlehem, Pennsylvania , USA,
               October 12, 2015.},
  year      = {2015},
  url       = {http://ceur-ws.org/Vol-1472/IESD_2015_paper_6.pdf},
  timestamp = {Mon, 26 Oct 2015 13:42:19 +0100},
  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/semweb/AggarwalMBB15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Leveraging Wikipedia Knowledge for Entity Recommendations

Nitish Aggarwal, Peter Mika, Roi Blanco, Paul Buitelaar

Poster ISWC '15 - 14th International Semantic Web Conference.

Abstract

User engagement is a fundamental goal of commercial search engines. In order to increase it, they provide the users an opportunity to explore the entities related to the queries. As most of the queries can be linked to entities in knowledge bases, search engines recommend the enti- ties that are related to the users’ search query. In this paper, we present Wikipedia-based Features for Entity Recommendation (WiFER) that combines different features extracted from Wikipedia in order to provide related entity recommendations. We evaluate WiFER on a dataset of 4.5K search queries where each query has around 10 related entities tagged by human experts on 5-level label scale.

@inproceedings{DBLP:conf/semweb/AggarwalMBB15a,
  author    = {Nitish Aggarwal and
               Peter Mika and
               Roi Blanco and
               Paul Buitelaar},	
  title     = {Leveraging Wikipedia Knowledge for Entity Recommendations},
  booktitle = {Proceedings of the {ISWC} 2015 Posters and Demonstrations Track
               co-located with the 14th International Semantic Web Conference (ISWC-2015),
               Bethlehem, PA, USA, October 11, 2015.},
  year      = {2015},
  url       = {http://ceur-ws.org/Vol-1486/paper_81.pdf},
  timestamp = {Tue, 22 Dec 2015 13:07:50 +0100},
  biburl    = {http://dblp.uni-trier.de/rec/bib/conf/semweb/AggarwalMBB15a},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Timely Semantics: A Study of a Stream-Based Ranking System for Entity Relationships

Lorenz Fischer, Roi Blanco, Peter Mika, Abraham Bernstein

Paper ISWC '15 - 14th International Semantic Web Conference.

Abstract

In recent years, search engines have started presenting semantically relevant entity information together with document search results. Entity ranking systems are used to compute recommendations for related entities that a user might also be interested to explore. Typically, this is done by ranking relationships between entities in a semantic knowledge graph using signals found in a data source as well as type annotations on the nodes and links of the graph. However, the process of producing these rankings can take a substantial amount of time. As a result, entity ranking systems typically lag behind real-world events and present relevant entities with outdated relationships to the search term or even outdated entities that should be replaced with more recent relations or entities. This paper presents a study using a real-world stream-processing based implementation of an entity ranking system, to understand the effect of data timeliness on entity rankings. We describe the system and the data it processes in detail. Using a longitudinal case-study, we demonstrate (i) that low-latency, large-scale entity relationship ranking is feasible using moderate resources and (ii) that stream-based entity ranking improves the freshness of related entities while maintaining relevance.

Inbook{Fischer2015,
author="Fischer, Lorenz
and Blanco, Roi
and Mika, Peter
and Bernstein, Abraham",
editor="Arenas, Marcelo
and Corcho, Oscar
and Simperl, Elena
and Strohmaier, Markus
and d'Aquin, Mathieu
and Srinivas, Kavitha
and Groth, Paul
and Dumontier, Michel
and Heflin, Jeff
and Thirunarayan, Krishnaprasad
and Staab, Steffen",
chapter="Timely Semantics: A Study of a Stream-Based Ranking System for Entity Relationships",
title="The Semantic Web - ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II",
year="2015",
publisher="Springer International Publishing",
address="Cham",
pages="429--445",
isbn="978-3-319-25010-6",
doi="10.1007/978-3-319-25010-6_28",
url="http://dx.doi.org/10.1007/978-3-319-25010-6_28"
}

Local Ranking Problem on the BrowseGraph

Michele Trevisiol, Luca Maria Aiello, Paolo Boldi, Roi Blanco

Paper SIGIR'15 - 38th international ACM SIGIR conference on Research and development in information retrieval

Abstract

The "Local Ranking Problem" (LRP) is related to the computation of a centrality-like rank on a local graph, where the scores of the nodes could significantly differ from the ones computed on the global graph. Previous work has studied LRP on the hyperlink graph but never on the BrowseGraph, namely a graph where nodes are webpages and edges are browsing transitions. Recently, this graph has received more and more attention in many different tasks such as ranking, prediction and recommendation. However, a web-server has only the browsing traffic performed on its pages (local BrowseGraph) and, as a consequence, the local computation can lead to estimation errors, which hinders the increasing number of applications in the state of the art. Also, although the divergence between the local and global ranks has been measured, the possibility of estimating such divergence using only local knowledge has been mainly overlooked. These aspects are of great interest for online service providers who want to: (i) gauge their ability to correctly assess the importance of their resources only based on their local knowledge, and (ii) take into account real user browsing fluxes that better capture the actual user interest than the static hyperlink network. We study the LRP problem on a BrowseGraph from a large news provider, considering as subgraphs the aggregations of browsing traces of users coming from different domains. We show that the distance between rankings can be accurately predicted based only on structural information of the local graph, being able to achieve an average rank correlation as high as 0.8.

@inproceedings{Trevisiol:2015:LRP:2766462.2767704,
 author = {Trevisiol, Michele and Aiello, Luca Maria and Boldi, Paolo and Blanco, Roi},
 title = {Local Ranking Problem on the BrowseGraph},
 booktitle = {Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '15},
 year = {2015},
 isbn = {978-1-4503-3621-5},
 location = {Santiago, Chile},
 pages = {173--182},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2766462.2767704},
 doi = {10.1145/2766462.2767704},
 acmid = {2767704},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {browsegraph, centrality algorithms, domain-specific browsing graphs, local ranking problem, pagerank},
}

Hierarchy Construction for News Summarizations

Jeroen B. P. Vuurens, Arjen P. de Vries, Roi Blanco, Peter Mika

Workshop paper SIGIR'15 - 38th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Following online news about a specific event can be a difficult task as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. Several approaches exist to compose a summary of salient sentences that are extracted from an online news stream for a given topic. Summaries often consist of multiple news stories, that when entwined may make it harder to read. We propose a general approach to convert non-hierarchical temporal summarizations into a hierarchical structure, that can be used to further compress the summary to provide more overview, that allows the user to navigate to specific subtopics of interest, and can be used to provide feedback to improve results. This approach reorganizes the sentences in a summary using a divisive clustering approach to capture the sentences per news story in a hierarchy.


@inproceedings{Vuurens:2015:ONT:2766462.2767872,
 author = {Vuurens, Jeroen B.P. and de Vries, Arjen P. and Blanco, Roi and Mika, Peter},
 title = {Hierarchy Construction for News Summarizations},
 booktitle = {Proceedings of the SIGIR 2015 Workshop on Temporal, Social and Spatially-aware Information Access},
 series = {TAIA2015},
 year = {2015},
 isbn = {978-1-4503-3621-5},
 location = {Santiago, Chile},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {content analysis, information extraction, information presentation},
}

Online News Tracking for Ad-Hoc Queries

Jeroen B. P. Vuurens, Arjen P. de Vries, Roi Blanco, Peter Mika

Demo SIGIR'15 - 38th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Following news about a specific event can be a difficult task as new information is often scattered across web pages. An up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We demonstrate an approach that is feasible for online tracking of news that is relevant to a user's ad-hoc query.


@inproceedings{Vuurens:2015:ONT:2766462.2767872,
 author = {Vuurens, Jeroen B.P. and de Vries, Arjen P. and Blanco, Roi and Mika, Peter},
 title = {Online News Tracking for Ad-Hoc Queries},
 booktitle = {Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '15},
 year = {2015},
 isbn = {978-1-4503-3621-5},
 location = {Santiago, Chile},
 pages = {1047--1048},
 numpages = {2},
 url = {http://doi.acm.org/10.1145/2766462.2767872},
 doi = {10.1145/2766462.2767872},
 acmid = {2767872},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {content analysis, information extraction, information presentation},
}

Fast and Space-Efficient Entity Linking for Queries

Roi Blanco, Giuseppe Ottaviano, Edgar Meij

Paper WSDM '15 - 8th International Conference on Web Seach and Data Mining

Abstract

Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for a O(k2) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times---at least two orders of magnitude faster than existing systems.


@inproceedings{Blanco:2015:FSE:2684822.2685317,
 author = {Blanco, Roi and Ottaviano, Giuseppe and Meij, Edgar},
 title = {Fast and Space-Efficient Entity Linking for Queries},
 booktitle = {Proceedings of the Eighth ACM International Conference on Web Search and Data Mining},
 series = {WSDM '15},
 year = {2015},
 isbn = {978-1-4503-3317-7},
 location = {Shanghai, China},
 pages = {179--188},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2684822.2685317},
 doi = {10.1145/2684822.2685317},
 acmid = {2685317},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {entity linking, queries, web search, wikipedia},
}

From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search

Iris Milliaraki, Roi Blanco, Mounia Lalmas

Paper WWW '15 - 24th international conference on World Wide Web

Abstract

Consider a user who submits a search query "Shakira" having a specific search goal in mind (such as her age) but at the same time willing to explore information for other entities related to her, such as comparable singers. In previous work, a system called Spark, was developed to provide such search experience. Given a query submitted to the Yahoo search engine, Spark provides related entity suggestions for the query, exploiting, among else, public knowledge bases from the Semantic Web. We refer to this search scenario as explorative entity search. The effectiveness and efficiency of the approach has been demonstrated in previous work. The way users interact with these related entity suggestions and whether this interaction can be predicted have however not been studied. In this paper, we perform a large-scale analysis into how users interact with the entity results returned by Spark. We characterize the users, queries and sessions that appear to promote an explorative behavior. Based on this analysis, we develop a set of query and user-based features that reflect the click behavior of users and explore their effectiveness in the context of a prediction task.


@inproceedings{Miliaraki:2015:SGM:2736277.2741284,
 author = {Miliaraki, Iris and Blanco, Roi and Lalmas, Mounia},
 title = {From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search},
 booktitle = {Proceedings of the 24th International Conference on World Wide Web},
 series = {WWW '15},
 year = {2015},
 isbn = {978-1-4503-3469-3},
 location = {Florence, Italy},
 pages = {765--775},
 numpages = {11},
 url = {http://dl.acm.org/citation.cfm?id=2736277.2741284},
 acmid = {2741284},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {explorative search, log analysis, related entity, user click behavior, yahoo spark system},
}

Using Wikipedia for Cross-Language Named Entity Recognition

Eraldo R. Fernandes, Ulf Brefeld, Roi Blanco, Jordi Atserias

Workshop Paper Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014

Abstract

Named entity recognition and classification (NERC) is fun- damental for natural language processing tasks such as information ex- traction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by ex- ploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a par- tially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple ex- tensions of hidden Markov models and structural perceptrons. Empiri- cally, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effec- tively exploit the partially annotated data and outperforms their baseline counterparts in all settings.

Focused Crawling for Structured Data

Robert Meusel, Peter Mika, Roi Blanco

Paper CIKM '14 - 23rd ACM International Conference on Conference on Information and Knowledge Management.

Abstract

The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. In our work, we study the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant pages and non-relevant pages collected within a given budget.


@inproceedings{Meusel:2014:FCS:2661829.2661902,
 author = {Meusel, Robert and Mika, Peter and Blanco, Roi},
 title = {Focused Crawling for Structured Data},
 booktitle = {Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management},
 series = {CIKM '14},
 year = {2014},
 isbn = {978-1-4503-2598-1},
 location = {Shanghai, China},
 pages = {1039--1048},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2661829.2661902},
 doi = {10.1145/2661829.2661902},
 acmid = {2661902},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {bandit-based selection, focused crawling, microdata, online learning},
}

Entity-Linking via Graph-Distance Minimization

Roi Blanco, Paolo Boldi, Andrea Marino

Workshop Paper Graphite 2014 - 3rd Workshop on GRAPH Inspection and Traversal

Abstract

Entity-linking is a natural-language-processing task that consists in identifying the entities mentioned in a piece of text, linking each to an appropriate item in some knowledge base; when the knowledge base is Wikipedia, the problem comes to be known as wikification (in this case, items are wikipedia articles). One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, under some restrictive assumptions, it turns out to be solvable in linear time. For the general case, we propose two heuristics: one tries to enforce the above assumptions and another one is based on the notion of hitting distance; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset.

@inproceedings{BlancoBM14,
  author    = {Roi Blanco and
               Paolo Boldi and
               Andrea Marino},
  title     = {Entity-Linking via Graph-Distance Minimization},
  booktitle = {Proceedings 3rd Workshop on {GRAPH} Inspection and Traversal Engineering,
               {GRAPHITE} 2014, Grenoble, France, 5th April 2014.},
  year      = {2014},
  pages     = {30--43},
  doi       = {10.4204/EPTCS.159.4}
}

NTCIR temporalia: a test collection for temporal information access research

Hideo Joho, Adam Jatow, Roi Blanco

Workshop Paper TempWeb 2014 - 4th Temporal Web Analytics Workshop.

Abstract

Time is one of the key constructs of information quality. Following an upsurge of research in temporal aspects of information search, it has become clear that the community needs standardized evaluation benchmark for fostering research in Temporal Information Access. This paper introduces Temporalia (Temporal Information Access), a new pilot task run at NTCIR-11 to create re-usable datasets for those who are interested in temporal aspects of search technologies, and discusses its task design in detail.


@inproceedings{Joho:2014:NTT:2567948.2579044,
 author = {Joho, Hideo and Jatowt, Adam and Blanco, Roi},
 title = {NTCIR Temporalia: A Test Collection for Temporal Information Access Research},
 booktitle = {Proceedings of the 23rd International Conference on World Wide Web},
 series = {WWW '14 Companion},
 year = {2014},
 isbn = {978-1-4503-2745-9},
 location = {Seoul, Korea},
 pages = {845--850},
 numpages = {6},
 url = {http://dx.doi.org/10.1145/2567948.2579044},
 doi = {10.1145/2567948.2579044},
 acmid = {2579044},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {NTCIR, data challenge, temporal IR},
}

Identifying Re-finding Difficulty from User Query Logs

Sargol Sadeghi, Roi Blanco, Peter Mika, Mark Sanderson, Falk Scholer, David Vallet

Workshop Paper ACDS 2014 Australasian Document Computing Symposium.

Abstract

This paper presents a first study of how consistently human assessors are able to identify, from query logs, when searchers are facing difficulties re-finding documents. Using 12 assessors, we investigate the effect of two variables on assessor agreement: the assessment guideline detail, and assessor experience. The results indicate statistically significant better agreement when using detailed guidelines. An upper agreement of 78.9% was achieved, which is comparable to the levels of agreement in other information retrieval contexts. The effects of two contextual factors, representative of system performance and user effort, were studied. Significant differences between agreement levels were found for both factors, suggesting that contextual factors may play an important role in obtaining higher agreement levels. The findings contribute to a better understanding of how to generate ground truth data both in the re-finding and other labeling contexts, and have further implications for building automatic re-finding difficulty prediction models.

@inproceedings{Sadeghi:2014:IRD:2682862.2682867,
 author = {Sadeghi, Sargol and Blanco, Roi and Mika, Peter and Sanderson, Mark and Scholer, Falk and Vallet, David},
 title = {Identifying Re-finding Difficulty from User Query Logs},
 booktitle = {Proceedings of the 2014 Australasian Document Computing Symposium},
 series = {ADCS '14},
 year = {2014},
 isbn = {978-1-4503-3000-8},
 location = {Melbourne, VIC, Australia},
 pages = {105:105--105:108},
 articleno = {105},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/2682862.2682867},
 doi = {10.1145/2682862.2682867},
 acmid = {2682867},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Assessor Agreement, Difficulty Detection, Re-finding},
}

Information extraction from multimedia web documents: an open-source platform and testbed

David Dupplaw, Michael Matthews , Richard Johansson, Giulia Boato, Andrea Costanzo, Marco Fontani, Enrico Minack, Elena Demidova, Roi Blanco , Thomas Griffiths, Paul Lewis, Jonathon Hare, Alessandro Moschitti

Journal Paper International Journal of Multimedia Information Retrieval (2014)

Abstract

The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques have been integrated into a single, but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval.


@Article{Dupplaw2014,
author="Dupplaw, David Paul
and Matthews, Michael
and Johansson, Richard
and Boato, Giulia
and Costanzo, Andrea
and Fontani, Marco
and Minack, Enrico
and Demidova, Elena
and Blanco, Roi
and Griffiths, Thomas
and Lewis, Paul
and Hare, Jonathon
and Moschitti, Alessandro",
title="Information extraction from multimedia web documents: an open-source platform and testbed",
journal="International Journal of Multimedia Information Retrieval",
year="2014",
volume="3",
number="2",
pages="97--111",
abstract="The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques have been integrated into a single, but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval.",
issn="2192-662X",
doi="10.1007/s13735-014-0051-2",
url="http://dx.doi.org/10.1007/s13735-014-0051-2"
}

Overview of NTCIR-11 Temporal Information Access (Temporalia) Task

Hideo Joho, Adam Jatowt, Roi Blanco, Hajime Naka, Shuhei Yamamoto

Paper NTCIR'14 - 11th NTCIR Conference

Abstract

This paper describes the overview of NTCIR-11 Temporal Information Access (Temporalia) task. This pilot task aims to foster research in temporal aspects of information re- trieval and search. Temporalia is composed of two subtasks: Temporal Query Intent Classification (TQIC) and Tempo- ral Information Retrieval (TIR) subtask. TQIC attracted 6 teams which submitted a total of 17 runs, while 6 teams took part in TIR proposing 18 runs. In this paper we describe both subtasks, datasets, evaluation methods and results of meta analyses.


@inproceedings{Joho:2014,
 author = {Joho, Hideo and Jatowt, Adam and Blanco, Roi and Naka, Hajime and Yamamoto, Shuhei},
 title = {Web Usage Mining with Semantic Analysis},
 booktitle = {Overview of NTCIR-11 Temporal Information Access (Temporalia) Task,
 series = {Proceedings of the 11th NTCIR Conference},
 year = {2014},
 location = {Tokyo, Japan},
}

User Generated Content Search

Roi Blanco , Manuel Eduardo Ares Brea, and Christina Lioma

Book Chapter Mining User Generated Content Search (2014)

Abstract

Due to developing technologies that are now readily available, user generated content (UGC) is growing rapidly and becoming one of the most prevalent and dynamic sources of information on the Web. Increasingly more data ap- pears online representing human judgement and interpretation about almost every aspect of the world: discussions, news, comments and other forms of ‘socialising’ on the Web. The increasing availability of such UGC from heterogeneous sources resembles a terra incognita of data and drives the need for advanced information retrieval (IR) technology that enables humans to search and retrieve it, navigate through it, and make sense of it. As such, UGC crosses paths with information retrieval (IR): it creates new IR scenarios, needs and expectations. This article presents (i) an overview of the main challenges and the respec- tive state-of-the-art (section 1.2), and (ii) a novel and effective approach for using UGC in IR.

inbook{BlancoUGC2014,
  author       = {Blanco, Roi and Ares, Eduardo and Lioma, Christina }, 
  title        = {User Generated Content Search},
  chapter      = 7,
  pages        = {167-188},
  publisher    = {CRC Press},
  year         = 2014,
  booktitle      = {Mining User Generated Content},
  edition      = 1,  
}

Marie-Francine Moens, Juanzi Li, Tat-Seng Chua (eds.)

Originating from Facebook, LinkedIn, Twitter, Instagram, YouTube, and many other networking sites, the social media shared by users and the associated metadata are collectively known as user generated content (UGC). To analyze UGC and glean insight about user behavior, robust techniques are needed to tackle the huge amount of real-time, multimedia, and multilingual data. Researchers must also know how to assess the social aspects of UGC, such as user relations and influential users. Mining User Generated Content is the first focused effort to compile state-of-the-art research and address future directions of UGC. It explains how to collect, index, and analyze UGC to uncover social trends and user habits. Divided into four parts, the book focuses on the mining and applications of UGC. The first part presents an introduction to this new and exciting topic. Covering the mining of UGC of different medium types, the second part discusses the social annotation of UGC, social network graph construction and community mining, mining of UGC to assist in music retrieval, and the popular but difficult topic of UGC sentiment analysis. The third part describes the mining and searching of various types of UGC, including knowledge extraction, search techniques for UGC content, and a specific study on the analysis and annotation of Japanese blogs. The fourth part on applications explores the use of UGC to support question-answering, information summarization, and recommendations.

Web Usage Mining with Semantic Analysis

Laura Hollink, Peter Mika, Roi Blanco

Paper WWW '13 - 22nd international conference on World Wide Web

Abstract

Web usage mining has traditionally focused on the individual queries or query words leading to a web site or web page visit, mining patterns in such data. In our work, we aim to characterize websites in terms of the semantics of the queries that lead to them by linking queries to large knowledge bases on the Web. We demonstrate how to exploit such links for more effective pattern mining on query log data. We also show how such patterns can be used to qualitatively describe the differences between competing websites in the same domain and to quantitatively predict website abandonment


@inproceedings{Hollink:2013:WUM:2488388.2488438,
 author = {Hollink, Laura and Mika, Peter and Blanco, Roi},
 title = {Web Usage Mining with Semantic Analysis},
 booktitle = {Proceedings of the 22Nd International Conference on World Wide Web},
 series = {WWW '13},
 year = {2013},
 isbn = {978-1-4503-2035-1},
 location = {Rio de Janeiro, Brazil},
 pages = {561--570},
 numpages = {10},
 url = {http://dl.acm.org/citation.cfm?id=2488388.2488438},
 acmid = {2488438},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {query log, query session, semantic analysis},
}

Entity Recommendations in Web Search

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec

Paper ISWC'13 - 12th International Semantic Web Conference

Abstract

While some web search users know exactly what they are looking for, others are willing to explore topics related to an initial interest. Often, the user’s initial interest can be uniquely linked to an entity in a knowledge base. In this case, it is natural to recommend the explicitly linked entities for further exploration. In real world knowledge bases, however, the number of linked entities may be very large and not all related entities may be equally relevant. Thus, there is a need for ranking related entities. In this paper, we describe Spark, a recommendation engine that links a user’s initial query to an entity within a knowledge base and provides a ranking of the related entities. Spark extracts several signals from a variety of data sources, including Yahoo! Web Search, Twitter, and Flickr, using a large cluster of computers running Hadoop. These signals are combined with a machine learned ranking model in order to produce a final recommendation of entities to user queries. This system is currently powering Yahoo! Web Search result pages.

@Inbook{Blanco2013,
author="Blanco, Roi
and Cambazoglu, Berkant Barla
and Mika, Peter
and Torzec, Nicolas",
editor="Alani, Harith
and Kagal, Lalana
and Fokoue, Achille
and Groth, Paul
and Biemann, Chris
and Parreira, Josiane Xavier
and Aroyo, Lora
and Noy, Natasha
and Welty, Chris
and Janowicz, Krzysztof",
chapter="Entity Recommendations in Web Search",
title="The Semantic Web -- ISWC 2013: 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II",
year="2013",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="33--48",
isbn="978-3-642-41338-4",
doi="10.1007/978-3-642-41338-4_3",
url="http://dx.doi.org/10.1007/978-3-642-41338-4_3"
}

Federated Entity Search Using On-the-Fly Consolidation

Daniel Herzig, Peter Mika, Roi Blanco

Paper ISWC'13 - 12th International Semantic Web Conference

Abstract

Nowadays, search on the Web goes beyond the retrieval of textual Web sites and increasingly takes advantage of the growing amount of structured data. Of particular interest is entity search, where the units of retrieval are structured entities instead of textual documents. These entities reside in different sources, which may provide only limited information about their content and are therefore called “uncooperative”. Further, these sources capture complementary but also redundant information about entities. In this environment of uncooperative data sources, we study the problem of federated entity search, where redundant information about entities is reduced on-the-fly through entity consolidation performed at query time. We propose a novel method for entity consolidation that is based on using language models and completely unsupervised, hence more suitable for this on-the-fly uncooperative setting than state-of-the-art methods that require training data. Further, we apply the same language model technique to deal with the federated search problem of ranking results returned from different sources. Particular novel are the mechanisms we propose to incorporate consolidation results into this ranking. We perform experiments using real Web queries and data sources. Our experiments show that our approach for federated entity search with on-the-fly consolidation improves upon the performance of a state-of-the-art preference aggregation baseline and also benefits from consolidation.

@inproceedings{Herzig:2013:FES:2717129.2717141,
 author = {Herzig, Daniel M. and Mika, Peter and Blanco, Roi and Tran, Thanh},
 title = {Federated Entity Search Using On-the-Fly Consolidation},
 booktitle = {Proceedings of the 12th International Semantic Web Conference - Part I},
 series = {ISWC '13},
 year = {2013},
 isbn = {978-3-642-41334-6},
 pages = {167--183},
 numpages = {17},
 url = {http://dx.doi.org/10.1007/978-3-642-41335-3_11},
 doi = {10.1007/978-3-642-41335-3_11},
 acmid = {2717141},
 publisher = {Springer-Verlag New York, Inc.},
 address = {New York, NY, USA},
}

Repeatable and Reliable Semantic Search Evaluation

Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, and Thanh Tran Duc

Journal Paper Journal Web Semantics (2013): Science, Services and Agents on the World Wide Web

Abstract

An increasing amount of structured data on the Web has attracted industry attention and renewed research interest in what is collectively referred to as semantic search. These solutions exploit the explicit semantics captured in structured data such as RDF for enhancing document representation and retrieval, or for finding answers by directly searching over the data. These data have been used for different tasks and a wide range of corresponding semantic search solutions have been proposed in the past. However, it has been widely recognized that a standardized setting to evaluate and analyze the current state-of-the-art in semantic search is needed to monitor and stimulate further progress in the field. In this paper, we present an evaluation framework for semantic search, analyze the framework with regard to repeatability and reliability, and report on our experiences on applying it in the Semantic Search Challenge 2010 and 2011.


@article{Blanco:2013:RRS:2528552.2528622,
 author = {Blanco, Roi and Halpin, Harry and Herzig, Daniel M. and Mika, Peter and Pound, Jeffrey and Thompson, Henry S. and Tran, Thanh},
 title = {Repeatable and Reliable Semantic Search Evaluation},
 journal = {Web Semant.},
 issue_date = {August, 2013},
 volume = {21},
 month = aug,
 year = {2013},
 issn = {1570-8268},
 pages = {14--29},
 numpages = {16},
 url = {http://dx.doi.org/10.1016/j.websem.2013.05.005},
 doi = {10.1016/j.websem.2013.05.005},
 acmid = {2528622},
 publisher = {Elsevier Science Publishers B. V.},
 address = {Amsterdam, The Netherlands, The Netherlands},
 keywords = {RDF, Semantic search, Semantic search evaluation, Structured data, Web data, Web search},
}

Influence of Timeline and Named-Entity Components on User Engagement

Yashar Moshfeghi, Michael Mattews, Roi Blanco, Joemon M. Jose

Paper ECIR'13 - 35th European conference on Advances in Information Retrieval

Abstract

Nowadays, successful applications are those which contain features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies where the importance of these components were studied from a retrieval effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate if we can predict user-centred metrics through user's interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings might steer systems that apply a more personalised user experience, tailored to the user's preferences.

@Inbook{Moshfeghi2013,
author="Moshfeghi, Yashar
and Matthews, Michael
and Blanco, Roi
and Jose, Joemon M.",
editor="Serdyukov, Pavel
and Braslavski, Pavel
and Kuznetsov, Sergei O.
and Kamps, Jaap
and R{\"u}ger, Stefan
and Agichtein, Eugene
and Segalovich, Ilya
and Yilmaz, Emine",
chapter="Influence of Timeline and Named-Entity Components on User Engagement",
title="Advances in Information Retrieval: 35th European Conference on IR Research, ECIR 2013, Moscow, Russia, March 24-27, 2013. Proceedings",
year="2013",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="305--317",
isbn="978-3-642-36973-5",
doi="10.1007/978-3-642-36973-5_26",
url="http://dx.doi.org/10.1007/978-3-642-36973-5_26"
}

Learning Relevance of Web Resources across Domains to Make Recommendations

Julia Hoxha, Peter Mika, Roi Blanco

Paper ICMLA '13 - 12th International Conference in Machine Learning and Applications

Abstract

Most traditional recommender systems focus on the objective of improving the accuracy of recommendations in a single domain. However, preferences of users may extend over multiple domains, especially in the Web where users often have browsing preferences that span across different sites, while being unaware of relevant resources on other sites. This work tackles the problem of recommending resources from various domains by exploiting the semantic content of these resources in combination with patterns of user browsing behavior. We overcome the lack of overlaps between domains by deriving connections based on the explored semantic content of Web resources. We present an approach that applies Support Vector Machines for learning the relevance of resources and predicting which ones are the most relevant to recommend to a user, given that the user is currently viewing a certain page. In real-world datasets of semantically-enriched logs of user browsing behavior at multiple Web sites, we study the impact of structure in generating accurate recommendations and conduct experiments that demonstrate the effectiveness of our approach.

@inproceedings{Hoxha:2013,
 author = {Hoxha, Julia and Mika, Peter and Blanco, Roi},
 title = {Towards Leveraging Closed Captions for News Retrieval},
 series = {Proceedings of the 12th International Conference in Machine Learning and Applications (ICMLA ’13)},
 year = {2013},
 location = {Miami},
 publisher = {IEEE Computer Society}, 
}

Towards Leveraging Closed Captions for News Retrieval

Roi Blanco, Gianmarco Defrancisi Morales, Fabrizio Silvestri

Poster WWW '13 - Companion Proceedings of the 22nd international conference on World Wide Web

Abstract

IntoNow from Yahoo! is a second screen application that enhances the way of watching TV programs. The application uses audio from the TV set to recognize the program being watched, and provides several services for different use cases. For instance, while watching a football game on TV it can show statistics about the teams playing, or show the title of the song performed by a contestant in a talent show. The additional content provided by IntoNow is a mix of editorially curated and automatically selected one. From a research perspective, one of the most interesting and challenging use cases addressed by IntoNow is related to news programs (newscasts). When a user is watching a newscast, IntoNow detects it and starts showing online news articles from the Web. This work presents a preliminary study of this problem, i.e., to find an online news article that matches the piece of news discussed in the newscast currently airing on TV, and display it in real-time.


@inproceedings{Blanco:2013:TLC:2487788.2487853,
 author = {Blanco, Roi and De Francisci Morales, Gianmarco and Silvestri, Fabrizio},
 title = {Towards Leveraging Closed Captions for News Retrieval},
 booktitle = {Proceedings of the 22Nd International Conference on World Wide Web},
 series = {WWW '13 Companion},
 year = {2013},
 isbn = {978-1-4503-2038-2},
 location = {Rio de Janeiro, Brazil},
 pages = {135--136},
 numpages = {2},
 url = {http://dl.acm.org/citation.cfm?id=2487788.2487853},
 acmid = {2487853},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {continuous retrieval, intonow, news retrieval},
}

A Survey of Temporal Web Search Experience

Hideo Joho, Adam Jatow, Roi Blanco

Workshop paper WWW '13 Companion Proceedings of the 22nd international conference on World Wide Web

Abstract

Temporal aspects of web search have gained a great level of attention in the recent years. However, many of the research attempts either focused on technical development of various tools or behavioral analysis based on log data. This paper presents the results of user survey carried out to investigate the practice and experience of temporal web search. A total of 110 people was recruited and answered 18 questions regarding their recent experience of web search. Our results suggest that an interplay of seasonal interests, technicality of information needs, target time of information, re-finding behaviour, and freshness of information can be important factors for the application of temporal search. These findings should be complementary to log analyses for further development of temporally aware search engines.

@inproceedings{Joho:2013:STW:2487788.2488126,
 author = {Joho, Hideo and Jatowt, Adam and Roi, Blanco},
 title = {A Survey of Temporal Web Search Experience},
 booktitle = {Proceedings of the 22Nd International Conference on World Wide Web},
 series = {WWW '13 Companion},
 year = {2013},
 isbn = {978-1-4503-2038-2},
 location = {Rio de Janeiro, Brazil},
 pages = {1101--1108},
 numpages = {8},
 url = {http://dl.acm.org/citation.cfm?id=2487788.2488126},
 acmid = {2488126},
 publisher = {International World Wide Web Conferences Steering Committee},
 address = {Republic and Canton of Geneva, Switzerland},
 keywords = {survey, temporal web search, user experience},
}

Extending BM25 with multiple query operators

Roi Blanco, Paolo Boldi

Paper SIGIR '12 - 35th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Traditional probabilistic relevance frameworks for informational retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of virtual region: a virtual region is a part of the document that, like a BM25F-field, can provide a (larger or smaller, depending on a tunable weighting parameter) evidence of relevance of the document; differently from BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, positional-aware) operators to the query. This technique fits nicely in the eliteness model behind BM25 and provides a principled explanation to BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on Web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.


@inproceedings{Blanco:2012:EBM:2348283.2348406,
 author = {Blanco, Roi and Boldi, Paolo},
 title = {Extending BM25 with Multiple Query Operators},
 booktitle = {Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '12},
 year = {2012},
 isbn = {978-1-4503-1472-5},
 location = {Portland, Oregon, USA},
 pages = {921--930},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2348283.2348406},
 doi = {10.1145/2348283.2348406},
 acmid = {2348406},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {BM25, query processing, query segmentation, ranking},
}

Language intent models for inferring user browsing behavior

Manos Tsagkias, Roi Blanco

Paper SIGIR '12 - 35th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Modeling user browsing behavior is an active research area with tangible real-world applications, e.g., organizations can adapt their online presence to their visitors browsing behavior with positive effects in user engagement, and revenue. We concentrate on online news agents, and present a semi-supervised method for predicting news articles that a user will visit after reading an initial article. Our method tackles the problem using language intent models trained on historical data which can cope with unseen articles. We evaluate our method on a large set of articles and in several experimental settings. Our results demonstrate the utility of language intent models for predicting user browsing behavior within online news sites.


@inproceedings{Tsagkias:2012:LIM:2348283.2348330,
 author = {Tsagkias, Manos and Blanco, Roi},
 title = {Language Intent Models for Inferring User Browsing Behavior},
 booktitle = {Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '12},
 year = {2012},
 isbn = {978-1-4503-1472-5},
 location = {Portland, Oregon, USA},
 pages = {335--344},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2348283.2348330},
 doi = {10.1145/2348283.2348330},
 acmid = {2348330},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {article intent models, behavior, browsing, online news, user},
}

You should read this! Let me explain to you why! Explaining News Recommendations to Users

Roi Blanco, Diego Ceccarelli, Claudio Lucchese, Raffaele Perego, Fabrizio Silvestri

Paper CIKM '12 - 21st ACM international conference on Information and knowledge management

Abstract

Recommender systems have become ubiquitous in content-based web applications, from news to shopping sites. Nonetheless, an aspect that has been largely overlooked so far in the recommender system literature is that of automatically building explanations for a particular recommendation. This paper focuses on the news domain, and proposes to enhance effectiveness of news recommender systems by adding, to each recommendation, an explanatory statement to help the user to better understand if, and why, the item can be her interest. We consider the news recommender system as a black-box, and generate different types of explanations employing pieces of information associated with the news. In particular, we engineer text-based, entity-based, and usage-based explanations, and make use of a Markov Logic Networks to rank the explanations on the basis of their effectiveness. The assessment of the model is conducted via a user study on a dataset of news read consecutively by actual users. Experiments show that news recommender systems can greatly benefit from our explanation module as it allows users to discriminate between interesting and not interesting news in the majority of the cases.


@inproceedings{Blanco:2012:YRT:2396761.2398559,
 author = {Blanco, Roi and Ceccarelli, Diego and Lucchese, Claudio and Perego, Raffaele and Silvestri, Fabrizio},
 title = {You Should Read This! Let Me Explain You Why: Explaining News Recommendations to Users},
 booktitle = {Proceedings of the 21st ACM International Conference on Information and Knowledge Management},
 series = {CIKM '12},
 year = {2012},
 isbn = {978-1-4503-1156-4},
 location = {Maui, Hawaii, USA},
 pages = {1995--1999},
 numpages = {5},
 url = {http://doi.acm.org/10.1145/2396761.2398559},
 doi = {10.1145/2396761.2398559},
 acmid = {2398559},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {markov logic networks, news recommendation, query log analysis, recommendation snippets},
}

Characterizing Web Search Queries that Match Few or No Results

Ismail Sengor Altingovde, Roi Blanco, Berkant Barla Cambazoglu, Rifat Ozcan, Erdem Sarigil, Ozgur Ulusoy

Paper CIKM '12 - 21st ACM international conference on Information and knowledge management

Abstract

Query logs record the actual usage of search systems and their analysis has proven critical to improving search engDespite the continuous efforts to improve the web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.


@inproceedings{Altingovde:2012:CWS:2396761.2398560,
 author = {Altingovde, Ismail Sengor and Blanco, Roi and Cambazoglu, Berkant Barla and Ozcan, Rifat and Sarigil, Erdem and Ulusoy, \"{O}zg\"{u}r},
 title = {Characterizing Web Search Queries That Match Very Few or No Results},
 booktitle = {Proceedings of the 21st ACM International Conference on Information and Knowledge Management},
 series = {CIKM '12},
 year = {2012},
 isbn = {978-1-4503-1156-4},
 location = {Maui, Hawaii, USA},
 pages = {2000--2004},
 numpages = {5},
 url = {http://doi.acm.org/10.1145/2396761.2398560},
 doi = {10.1145/2396761.2398560},
 acmid = {2398560},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {query difficulty, search result quality, web search engines},
}

Measuring Website Similarity Using an Entity-Aware Click Graph

Pablo Mendes, Peter Mika, Hugo Zaragoza, Roi Blanco

Paper CIKM '12 - 21st ACM international conference on Information and knowledge management

Abstract

Query logs record the actual usage of search systems and their analysis has proven critical to improving search engine functionality. Yet, despite the deluge of information, query log analysis often suffers from the sparsity of the query space. Based on the observation that most queries pivot around a single entity that represents the main focus of the user's need, we propose a new model for query log data called the entity-aware click graph. In this representation, we decompose queries into entities and modifiers, and measure their association with clicked pages. We demonstrate the benefits of this approach on the crucial task of understanding which websites fulfill similar user needs, showing that using this representation we can achieve a higher precision than other query log-based approaches.

@inproceedings{Mendes:2012:MWS:2396761.2398500,
 author = {Mendes, Pablo N. and Mika, Peter and Zaragoza, Hugo and Blanco, Roi},
 title = {Measuring Website Similarity Using an Entity-aware Click Graph},
 booktitle = {Proceedings of the 21st ACM International Conference on Information and Knowledge Management},
 series = {CIKM '12},
 year = {2012},
 isbn = {978-1-4503-1156-4},
 location = {Maui, Hawaii, USA},
 pages = {1697--1701},
 numpages = {5},
 url = {http://doi.acm.org/10.1145/2396761.2398500},
 doi = {10.1145/2396761.2398500},
 acmid = {2398500},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {click graph, query logs, website similarity},
}

Graph-based term weighting for information retrieval

Roi Blanco, Christina Lioma

Journal Paper Information Retrieval (2012), Springer.

Abstract

A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph, whose vertices represent words, and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach of (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph, whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank) to derive weights for each vertex, i.e. term weights, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline, such as BM25. In addition, we integrate into ranking graph properties, such as the average path length, or clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC datasets and evaluation measures.


@article{Blanco:2012:GTW:2158560.2158588,
 author = {Blanco, Roi and Lioma, Christina},
 title = {Graph-based Term Weighting for Information Retrieval},
 journal = {Inf. Retr.},
 issue_date = {February  2012},
 volume = {15},
 number = {1},
 month = feb,
 year = {2012},
 issn = {1386-4564},
 pages = {54--92},
 numpages = {39},
 url = {http://dx.doi.org/10.1007/s10791-011-9172-x},
 doi = {10.1007/s10791-011-9172-x},
 acmid = {2158588},
 publisher = {Kluwer Academic Publishers},
 address = {Hingham, MA, USA},
 keywords = {Graph theory, Information retrieval, Natural language processing},
}

BM-Yahoo! at RepLab 2012

Jose M. Chenlo, Carlos Rodriguez, Jordi Atserias, Roi Blanco

Workshop Paper CLEF 2012

Abstract

This paper describes FBM-Yahoo!'s participation in the profiling task of RepLab 2012, which aims at determining whether a given tweet is related to a specific company and, in if this being the case, whether it contains a positive or negative statement related to the company's reputation or not. We addressed both problems (ambiguity and polarity reputation) using Support Vector Machines (SVM) classifiers and lexicon-based techniques, building automatically company profiles and bootstrapping background data. Concretely, for the ambiguity task we employed a linear SVM classifier with a token-based representation of relevant and irrelevant information extracted from the tweets and Freebase resources. With respect to polarity classification, we combined SVM lexicon-based approaches with bootstrapping in order to determine the final polarity label of a tweet.


@inproceedings{conf/clef/ChenloARB12,
  added-at = {2012-10-01T11:19:47.000+0200},
  author = {Chenlo, Jose M. and Atserias, Jordi and Rodriguez, Carlos and Blanco, Roi},
  bibsource = {DBLP, http://dblp.uni-trier.de},
  biburl = {http://www.bibsonomy.org/bibtex/23c6fbc5bdba5fb745d7823689fdd6ecf/dbenz_test},
  booktitle = {CLEF (Online Working Notes/Labs/Workshop)},
  crossref = {conf/clef/2012w},
  editor = {Forner, Pamela and Karlgren, Jussi and Womser-Hacker, Christa},
  ee = {http://www.clef-initiative.eu/documents/71612/bd18de7e-435a-4ae6-bbc0-5b9bc135d52e},
  interhash = {a7df30b406fb6ab22f57ce48fbabc494},
  intrahash = {3c6fbc5bdba5fb745d7823689fdd6ecf},
  isbn = {978-88-904810-3-1},
  keywords = {dblp},
  timestamp = {2012-10-01T11:19:54.000+0200},
  title = {FBM-Yahoo! at RepLab 2012.},
  url = {http://dblp.uni-trier.de/db/conf/clef/clef2012w.html#ChenloARB12},
  year = 2012
}

Machine Learning for Spammer Detection in Crowd-Sourcing

Harry Halpin, Roi Blanco

Workshop Paper HCOMP 2012 - The Second AAAI Conference on Human Computation and Crowdsourcing

Abstract

Over a series of evaluation experiments conducted using naive judges recruited and managed via Amazon’s Mechanical Turk facility using a task from information retrieval (IR), we show that a SVM shows itself to have a very high accuracy when the machine-learner is trained and tested on a single task and that the method was portable from more complex tasks to simpler tasks, but not vice versa.

@inproceedings{Halpin:2012,
 author = {Halpin, Harry and Blanco, Roi},
 title = {Machine Learning for Spammer Detection in Crowd-Sourcing},
 booktitle = {HThe Second AAAI Conference on Human Computation and Crowdsourcing },
 series = {COMP 2012},
 year = {2012},
}

Effective and Efficient Entity Search in RDF Data

Roi Blanco, Sebastiano Vigna, Peter Mika

Paper ISWC'11 - 10th international conference on The Semantic Web

Abstract

Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e. access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.


@inproceedings{Blanco:2011:EEE:2063016.2063023,
 author = {Blanco, Roi and Mika, Peter and Vigna, Sebastiano},
 title = {Effective and Efficient Entity Search in RDF Data},
 booktitle = {Proceedings of the 10th International Conference on The Semantic Web - Volume Part I},
 series = {ISWC'11},
 year = {2011},
 isbn = {978-3-642-25072-9},
 location = {Bonn, Germany},
 pages = {83--97},
 numpages = {15},
 url = {http://dl.acm.org/citation.cfm?id=2063016.2063023},
 acmid = {2063023},
 publisher = {Springer-Verlag},
 address = {Berlin, Heidelberg},
}

Repeatable and Reliable Search System Evaluation using Crowd-Sourcing

Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, and Thanh Tran Duc

Paper SIGIR '11 - 34th international ACM SIGIR conference on Research and development in Information Retrieval

Abstract

The primary problem confronting any new kind of search task is how to boot-strap a reliable and repeatable evaluation campaign, and a crowd-sourcing approach provides many advantages. However, can these crowd-sourced evaluations be repeated over long periods of time in a reliable manner? To demonstrate, we investigate creating an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional search over web-pages, object search aims at the retrieval of information from factual assertions about real-world objects rather than searching over web-pages with textual descriptions. Using the first large-scale evaluation campaign that specifically targets the task of ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowd-sourced evaluation campaigns can be repeated over time and still maintain reliable results. Furthermore, we show how these results are comparable to expert judges when ranking systems and that the results hold over different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.

@inproceedings{Blanco:2011:RRS:2009916.2010039,
 author = {Blanco, Roi and Halpin, Harry and Herzig, Daniel M. and Mika, Peter and Pound, Jeffrey and Thompson, Henry S. and Tran Duc, Thanh},
 title = {Repeatable and Reliable Search System Evaluation Using Crowdsourcing},
 booktitle = {Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '11},
 year = {2011},
 isbn = {978-1-4503-0757-4},
 location = {Beijing, China},
 pages = {923--932},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2009916.2010039},
 doi = {10.1145/2009916.2010039},
 acmid = {2010039},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {crowdsourcing, evaluation, retrieval, search engines},
}

Energy-Price-Driven Query Processing in Multi-center Web Search Engines

Enver Kayaaslan, Berkant Barla Cambazoglu, Roi Blanco, Flavio Junqueira, Cevdet Aykanat

Paper SIGIR '11 - 34th international ACM SIGIR conference on Research and development in Information Retrieval

Abstract

Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.


@inproceedings{Kayaaslan:2011:EQP:2009916.2010047,
 author = {Kayaaslan, Enver and Cambazoglu, B. Barla and Blanco, Roi and Junqueira, Flavio P. and Aykanat, Cevdet},
 title = {Energy-price-driven Query Processing in Multi-center Web Search Engines},
 booktitle = {Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '11},
 year = {2011},
 isbn = {978-1-4503-0757-4},
 location = {Beijing, China},
 pages = {983--992},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2009916.2010047},
 doi = {10.1145/2009916.2010047},
 acmid = {2010047},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {data center, energy, query processing, web search engine},
}

Enhanced Results for Web Search

Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco

Paper SIGIR '11 - 34th international ACM SIGIR conference on Research and development in Information Retrieval

Abstract

"Ten blue links" have defined web search results for the last fifteen years -- snippets of text combined with document titles and URLs. In this paper, we establish the notion of enhanced search results that extend web search results to include multimedia objects such as images and video, intent-specific key value pairs, and elements that allow the user to interact with the contents of a web page directly from the search results page. We show that users express a preference for enhanced results both explicitly, and when observed in their search behavior. We also demonstrate the effectiveness of enhanced results in helping users to assess the relevance of search results. Lastly, we show that we can efficiently generate enhanced results to cover a significant fraction of search result pages.


@inproceedings{Haas:2011:ERW:2009916.2010014,
 author = {Haas, Kevin and Mika, Peter and Tarjan, Paul and Blanco, Roi},
 title = {Enhanced Results for Web Search},
 booktitle = {Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '11},
 year = {2011},
 isbn = {978-1-4503-0757-4},
 location = {Beijing, China},
 pages = {725--734},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2009916.2010014},
 doi = {10.1145/2009916.2010014},
 acmid = {2010014},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {search results, semantic web, user interfaces, web search},
}

Ranking Related News Predictions

Nattiya Kanhabua, Roi Blanco, Michael Matthews

Paper SIGIR'11 - 34th Annual ACM SIGIR Conference

Abstract

We estimate that nearly one third of news articles contain references to future events. While this information can prove crucial to understanding news stories and how events will develop for a given topic, there is currently no easy way to access this information. We propose a new task to address the problem of retrieving and ranking sentences that contain mentions to future events, which we call ranking related news predictions. In this paper, we formally define this task and propose a learning to rank approach based on 4 classes of features: term similarity, entity-based similarity, topic similarity, and temporal similarity. Through extensive evaluations using a corpus consisting of 1.8 millions news articles and 6,000 manually judged relevance pairs, we show that our approach is able to retrieve a significant number of relevant predictions related to a given topic.

@inproceedings{Kanhabua:2011:RRN:2009916.2010018,
 author = {Kanhabua, Nattiya and Blanco, Roi and Matthews, Michael},
 title = {Ranking Related News Predictions},
 booktitle = {Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '11},
 year = {2011},
 isbn = {978-1-4503-0757-4},
 location = {Beijing, China},
 pages = {755--764},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2009916.2010018},
 doi = {10.1145/2009916.2010018},
 acmid = {2010018},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {future events, news predictions, sentence retrieval and ranking},
}

Hybrid models for future event prediction

Giuseppe Amodeo, Roi Blanco, Ulf Brefeld

Paper CIKM '11 - 20th ACM international conference on Information and knowledge management

Abstract

We present a hybrid method to turn off-the-shelf information retrieval (IR) systems into future event predictors. Given a query, a time series model is trained on the publication dates of the retrieved documents to capture trends and periodicity of the associated events. The periodicity of historic data is used to estimate a probabilistic model to predict future bursts. Finally, a hybrid model is obtained by intertwining the probabilistic and the time-series model. Our empirical results on the New York Times corpus show that autocorrelation functions of time-series suffice to classify queries accurately and that our hybrid models lead to more accurate future event predictions than baseline competitors.

@inproceedings{Amodeo:2011:HMF:2063576.2063870,
 author = {Amodeo, Giuseppe and Blanco, Roi and Brefeld, Ulf},
 title = {Hybrid Models for Future Event Prediction},
 booktitle = {Proceedings of the 20th ACM International Conference on Information and Knowledge Management},
 series = {CIKM '11},
 year = {2011},
 isbn = {978-1-4503-0717-8},
 location = {Glasgow, Scotland, UK},
 pages = {1981--1984},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/2063576.2063870},
 doi = {10.1145/2063576.2063870},
 acmid = {2063870},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {arima, arma, event prediction, future prediction, information retrieval, sarima, time series, web search},
}

Assigning documents to master sites in distributed search

Roi Blanco, Berkant Barla Cambazoglu, Flavio Junqueira, Ivan Kelly, Vincent Leroy

Paper CIKM '11 20th ACM international conference on Information and knowledge management

Abstract

An appealing solution to scale Web search with the growth of the Internet is the use of distributed architectures. Distributed search engines rely on multiple sites deployed in distant regions across the world, where each site is specialized to serve queries issued by the users of its region. This paper investigates the problem of assigning each document to a master site. We show that by leveraging similarities between a document and the activity of the users, we can accurately detect which site is the most relevant to place a document. We conduct various experiments using two document assignment approaches, showing performance improvements of up to 20.8% over a baseline technique which assigns the documents to search sites based on their language.


@inproceedings{Blanco:2011:ADM:2063576.2063591,
 author = {Blanco, Roi and Cambazoglu, B. Barla and Junqueira, Flavio P. and Kelly, Ivan and Leroy, Vincent},
 title = {Assigning Documents to Master Sites in Distributed Search},
 booktitle = {Proceedings of the 20th ACM International Conference on Information and Knowledge Management},
 series = {CIKM '11},
 year = {2011},
 isbn = {978-1-4503-0717-8},
 location = {Glasgow, Scotland, UK},
 pages = {67--76},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2063576.2063591},
 doi = {10.1145/2063576.2063591},
 acmid = {2063591},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {distributed index, document assignment, multi-site web search engine},
}

Keyword search over RDF graphs

Shady Elbassuoni, Roi Blanco

Paper CIKM '11 - 20th ACM international conference on Information and knowledge management

Abstract

Large knowledge bases consisting of entities and relationships be- tween them have become vital sources of information for many ap- plications. Most of these knowledge bases adopt the Semantic-Web data model RDF as a representation model. They contain a large set of subject-predicate-object (SPO) triples where subjects and ob- jects are entities and predicates express relationships between them. Alternatively, such RDF collections can be also viewed as large graphs where subjects and objects represent nodes and predicates represent typed edges. Querying such knowledge bases can be done by using structured queries utilizing graph-pattern languages such as SPARQL. Even though structured queries are very expressive and enable users to formulate their information needs precisely, they are also very restrictive. Users are accustomed to keyword search which has become the paradigm to perform IR tasks on the Web. Thus, it is crucial to free the users from the burden of posing structured queries, and enable them to pose keyword queries to search RDF data. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords, and ranks them based on statis- tical language models. We show that our retrieval model outper- forms the-state-of-the-art IR and DB models for keyword search over structured data using experiments over two real-world datasets.

@inproceedings{Elbassuoni:2011:KSO:2063576.2063615,
 author = {Elbassuoni, Shady and Blanco, Roi},
 title = {Keyword Search over RDF Graphs},
 booktitle = {Proceedings of the 20th ACM International Conference on Information and Knowledge Management},
 series = {CIKM '11},
 year = {2011},
 isbn = {978-1-4503-0717-8},
 location = {Glasgow, Scotland, UK},
 pages = {237--242},
 numpages = {6},
 url = {http://doi.acm.org/10.1145/2063576.2063615},
 doi = {10.1145/2063576.2063615},
 acmid = {2063615},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {RDF, entity, graph, keywords, relationship, search, structured},
}

Coreference aware web object retrieval

Jeff Dalton, Roi Blanco, Peter Mika

Paper CIKM '11 - 20th ACM international conference on Information and knowledge management

Abstract

As user demands become increasingly sophisticated, search engines today are competing in more than just returning document results from the Web. One area of competition is providing web object results from structured data extracted from a multitude of information sources. We address the problem of performing keyword retrieval over a collection of objects containing a large degree of duplication as different Web-based information sources provide descriptions of the same object. We develop a method for coreference aware retrieval that performs topic-specific coreference resolution on retrieved objects in order to improve object search results. Our results demonstrate that coreference has a significant impact on the effectiveness of retrieval in the domain of local search. Our results show that a coreference aware system outperforms naive object retrieval by more than 20% in P5 and P10.


@inproceedings{Dalton:2011:CAW:2063576.2063612,
 author = {Dalton, Jeffrey and Blanco, Roi and Mika, Peter},
 title = {Coreference Aware Web Object Retrieval},
 booktitle = {Proceedings of the 20th ACM International Conference on Information and Knowledge Management},
 series = {CIKM '11},
 year = {2011},
 isbn = {978-1-4503-0717-8},
 location = {Glasgow, Scotland, UK},
 pages = {211--220},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/2063576.2063612},
 doi = {10.1145/2063576.2063612},
 acmid = {2063612},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {coreference, object retrieval, semantic search, structured data, vertical search},
}

Beware of Relatively Large but Meaningless Improvements

Roi Blanco, Hugo Zaragoza

Technical Report Yahoo Technical Report 2011

Abstract

In information retrieval (IR) it is customary to invent new features in order to enhance docu- ment ranking. Typically, these features are incorporated into a retrieval model and performance is optimized over a collection at hand. The objective is to find an improvement over a baseline model, measured using standard metrics (such as mean average precision); since the retrieval problem is very hard, small relative improvements in performance are considered interesting (and pub- lishable). However, in practice it is sometimes the case that fixing a bug or changing slightly some pre-processing step over the data produces this sort of improvement. We were interested in deter- mining how likely it is that pure random effects may lead to significant improvements. Results were sufficiently surprising to merit discussion and publication in our opinion.

@INPROCEEDINGS{Blanco11beware,
    author = {Roi Blanco and Hugo Zaragoza},
    title = {Beware of Relatively Large but Meaningless Improvements},
    booktitle = {Yahoo Technical Report 2011},
    year = {2011}
}

Entity Search Evaluation over Structured Web Data

Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, and Thanh Tran Duc

Workshop Paper EOS 2011 - Entity Oriented Search Workshop

Abstract

The search for entities is the most common search type on the web beside navigational searches. Whereas most common search techniques are based on the textual descriptions of web pages, semantic search approaches exploit the increasing amount of structured data on the Web in the form of annotations to web-pages and Linked Data. In many technologies, this structured data can consist of factual assertions about entities in which URIs are used to identify entities and their properties. The hypothesis is that this kind of structured data can improve entity search on the web. In order to test this hypothesis and to consistently progress in this field, a standardized evaluation is necessary. In this work, we discuss an evaluation campaign that specifically targets entity search over Linked Data by the means of keyword queries, including both queries that directly mention the entity as well as those that only describe the entities. We also discuss how crowd-sourcing was used to obtain relevance assessments from non-expert web users, the participating systems and the factors that contributed to positive results, and how the competition generalizes results from a previous crowd-sourced entity search evaluation.

@INPROCEEDINGS{Blanco11entitysearch,
    author = {Roi Blanco and Peter Mika and Harry Halpin and Jeffrey Pound and David R. Cheriton School Of and Daniel M. Herzig and Henry S. Thompson},
    title = {Entity search evaluation over structured web data},
    booktitle = {Proceedings of the Entity Oriented Workshop  EOS’11},
    year = {2011}
}

Recuperacion de Información - un enfoque práctico y multidisciplinar (in Spanish)

Book Chapters Ed. RA-MA

C.1 Introducción a la recuperación de información (Benjamin Piwowarski, Roi Blanco)

C.2 Indexación de documentos y procesado de consultas (Roi Blanco)

C.8 Construcción y compresión de índices (Roi Blanco)

Recuperación de Información. Un enfoque práctico y multidisciplinar

F. Cacheda Seijo, J.M. Fernández-Luna and J. Huete (eds.)

Este libro surge de la necesidad de disponer de un material que, con un enfoque eminentemente didáctico, permita dar una visión general de la disciplina de la Recuperación de Información, abarcando desde los fundamentos hasta las propuestas de investigación actuales. La idea es ofrecer al lector los entresijos de un área de conocimiento cuyos avances se trasladan directamente a programas que empleamos todos los días para diversas tareas cotidianas. Para alcanzar estos objetivos se ha contado con la colaboración de un plantel de expertos reconocidos internacionalmente por su investigación en el campo de la Recuperación de Información. Cada uno de ellos se ha centrado en aquellos capítulos de cuyas temáticas son especialistas y ampliamente conocedores. Además, la gran mayoría de ellos posee una inestimable experiencia docente en asignaturas de Recuperación de Información, con lo que sus experiencias y conocimientos a la hora de diseminar esta disciplina se han exportado a sus capítulos de forma directa, e implícitamente al libro completo

Caching search engine results over incremental indices

Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza

Paper SIGIR '10 - 33rd international ACM SIGIR conference on Research and development in information retrieval

Abstract

A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index. To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.


@inproceedings{Blanco:2010:CSE:1835449.1835466,
 author = {Blanco, Roi and Bortnikov, Edward and Junqueira, Flavio and Lempel, Ronny and Telloli, Luca and Zaragoza, Hugo},
 title = {Caching Search Engine Results over Incremental Indices},
 booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '10},
 year = {2010},
 isbn = {978-1-4503-0153-4},
 location = {Geneva, Switzerland},
 pages = {82--89},
 numpages = {8},
 url = {http://doi.acm.org/10.1145/1835449.1835466},
 doi = {10.1145/1835449.1835466},
 acmid = {1835466},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {real-time indexing, search engine caching},
}

Finding Support Sentences for Entities

Roi Blanco, Hugo Zaragoza

Paper SIGIR '10 - 33rd international ACM SIGIR conference on Research and development in information retrieval

Abstract

We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately


@inproceedings{Blanco:2010:FSS:1835449.1835507,
 author = {Blanco, Roi and Zaragoza, Hugo},
 title = {Finding Support Sentences for Entities},
 booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '10},
 year = {2010},
 isbn = {978-1-4503-0153-4},
 location = {Geneva, Switzerland},
 pages = {339--346},
 numpages = {8},
 url = {http://doi.acm.org/10.1145/1835449.1835507},
 doi = {10.1145/1835449.1835507},
 acmid = {1835507},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {entity ranking, sentence retrieval},
}

The 8th workshop on large-scale distributed systems for information retrieval (LSDS-IR'10)

Roi Blanco,Berkant Barla Cambazoglu, Claudio Lucchese

Journal Paper SIGIR Forum (2010)

Abstract

The size of the Web as well as user bases of search systems continue to grow exponentially. Consequently, providing subsecond query response times and high query throughput become quite challenging for large-scale information retrieval systems. Distributing different aspects of search (e.g., crawling, indexing, and query processing) is essential to achieve scalability in large-scale information retrieval systems. The 8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR’10) has provided a venue to discuss the current research challenges and identify new directions for distributed information retrieval. The workshop contained two industry talks as well as six research paper presentations. The hot topics in this year’s workshop were collection selection architectures, application of MapReduce to information retrieval problems, similarity search, geographically distributed web search, and optimization techniques for search efficiency.

@article{DBLP:journals/sigir/BlancoCL10,
  author    = {Roi Blanco and
               Berkant Barla Cambazoglu and
               Claudio Lucchese},
  title     = {The 8th workshop on large-scale distributed systems for information
               retrieval (LSDS-IR'10)},
  journal   = {{SIGIR} Forum},
  volume    = {44},
  number    = {2},
  pages     = {54--58},
  year      = {2010},
  url       = {http://doi.acm.org/10.1145/1924475.1924486},
  doi       = {10.1145/1924475.1924486},
  timestamp = {Wed, 19 Sep 2012 09:11:00 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/sigir/BlancoCL10},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Searching through time in the New York Times

Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, Hugo Zaragoza

Workshop Paper Fourth Workshop on Human-Computer Interaction and Information Retrieval, HCIR 2010

Abstract

In this paper we describe the Time Explorer, an application optimized for analyzing how news changes over time. We attempt to extend on current time-based systems in several important ways. First, Time Explorer is designed to help users discover how entities such as people and locations associated with a query change over time. Second, the application not only works on publication date, but also on event dates that are extracted automatically from text allowing for not only searching in the past, but also into the future. Finally, Time Explorer is designed around an intuitive interface that allows users to interact with time and entities in a powerful way. While aspects of these features can be found in other systems, they are combined in Time Explorer in a way that allows searching through time in no time at all.


@inproceedings{MatthewsHCIR10,
        author = "Michael Matthews and Pancho Tolchinsky and Roi Blanco and Jordi Atserias and Peter Mika and Hugo Zaragoza",
        booktitle = "HCIR Workshop on Bridging Human-Computer Interaction and Information Retrieval",
        series = "HCIR '10",
        title = "Searching Through Time in the New York Times",
        year = "2010"
}

Probabilistic static pruning of inverted files

Roi Blanco, Alvaro Barreiro

Journal PaperACM Transactions on information Systems (2010)

Abstract

Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content. We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion. We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.

@article{Blanco:2010:PSP:1658377.1658378,
 author = {Blanco, Roi and Barreiro, Alvaro},
 title = {Probabilistic Static Pruning of Inverted Files},
 journal = {ACM Transactions on Information Systems},
 issue_date = {January 2010},
 volume = {28},
 number = {1},
 month = jan,
 year = {2010},
 issn = {1046-8188},
 pages = {1:1--1:33},
 articleno = {1},
 numpages = {33},
 url = {http://doi.acm.org/10.1145/1658377.1658378},
 doi = {10.1145/1658377.1658378},
 acmid = {1658378},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Pruning, compression, efficiency, inverted files, probabilistic models},
}

TAER: Time-Aware Entity Retrieval. Exploiting the Past to find Relevant Entities in News Articles

Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, Hugo Zaragoza

Paper CIKM 2010 - 19th ACM conference on information and knowledge management

Abstract

Retrieving entities instead of just documents has become an important task for search engines. In this paper we study entity retrieval for news applications, and in particular the importance of the news trail history (i.e., past related articles) in determining the relevant entities in current articles. This is an important problem in applications that display retrieved entities to the user, together with the news article. We analyze and discuss some statistics about entities in news trails, unveiling some unknown fi ndings such as the persistence of relevance over time. We focus on the task of query dependent entity retrieval over time. For this task we evaluate several features, and show that their combination signifi cantly improves performance.

@inproceedings{Demartini:2010:TTE:1871437.1871661,
 author = {Demartini, Gianluca and Missen, Malik Muhammad Saad and Blanco, Roi and Zaragoza, Hugo},
 title = {TAER: Time-aware Entity Retrieval-exploiting the Past to Find Relevant Entities in News Articles},
 booktitle = {Proceedings of the 19th ACM International Conference on Information and Knowledge Management},
 series = {CIKM '10},
 year = {2010},
 isbn = {978-1-4503-0099-5},
 location = {Toronto, ON, Canada},
 pages = {1517--1520},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/1871437.1871661},
 doi = {10.1145/1871437.1871661},
 acmid = {1871661},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {entity retrieval, time-aware search},
}

Entity Summarization of News Articles

Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, Hugo Zaragoza

Poster SIGIR '10 33rd international ACM SIGIR conference on Research and development in information retrieval

Abstract

In contrast to traditional search, semantic search aims at the retrieval of information from factual assertions about real-world objects rather than searching over web-pages with textual descriptions. One of the key tasks to address in this context is ad-hoc object retrieval, i.e. the retrieval of objects in response to user formulated keyword queries. Despite the significant commercial interest, this kind of semantic search has not been evaluated in a thorough and systematic manner. In this work, we discuss the first evaluation campaign that specifically targets the task of ad-hoc object retrieval. We also discuss the submitted systems, the factors that contributed to positive results and the potential for future improvements in semantic search.

@inproceedings{Demartini:2010:ESN:1835449.1835620,
 author = {Demartini, Gianluca and Missen, Malik Muhammad Saad and Blanco, Roi and Zaragoza, Hugo},
 title = {Entity Summarization of News Articles},
 booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '10},
 year = {2010},
 isbn = {978-1-4503-0153-4},
 location = {Geneva, Switzerland},
 pages = {795--796},
 numpages = {2},
 url = {http://doi.acm.org/10.1145/1835449.1835620},
 doi = {10.1145/1835449.1835620},
 acmid = {1835620},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {entity summarization, time-aware search},
}

Evaluating Ad-Hoc Object Retrieval

Harry Halpin, Daniel M. Herzig, Peter Mika, Roi Blanco, Jeffrey Pound, Henry S. Thompson, and Thanh Tran Duc

Workshop PaperIWEST 2010 - International Workshop on Evaluation of Semantic Technologies

Abstract



@inproceedings{Halpin:iswc,
        author = {Halpin, Harry and Herzig, Daniel M. and Mika, Peter and Blanco, Roi and Pound, Jeffrey and Thompson, Henry S. and Tran, Duc Thanh},
        title = {Evaluating {A}d-{H}oc {O}bject {R}etrieval},
        booktitle = {Int. Workshop on Evaluation of Semantic Technologies (IWEST 2010) at ISWC},
        url = {http://people.csail.mit.edu/pcm/tempISWC/workshops/IWEST2010/paper9.pdf},
        year = {2010}
}

Part of Speech Based Term Weighting for Information Retrieval

Christina Lioma, Roi Blanco

PaperECIR 2009 - 31th European Conference on IR Research

Abstract

Automatic language processing tools typically assign to terms so-called ‘weights’ corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval, by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF & BM25 as baselines, show that integrating our POS-based term weights to retrieval always leads to gains (up to +33.7 from the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight, show retrieval gains always and consistently across the whole smoothing range of the baseline

@inproceedings{Lioma:2009:PSB:1533720.1533768,
 author = {Lioma, Christina and Blanco, Roi},
 title = {Part of Speech Based Term Weighting for Information Retrieval},
 booktitle = {Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval},
 series = {ECIR '09},
 year = {2009},
 isbn = {978-3-642-00957-0},
 location = {Toulouse, France},
 pages = {412--423},
 numpages = {12},
 url = {http://dx.doi.org/10.1007/978-3-642-00958-7_37},
 doi = {10.1007/978-3-642-00958-7_37},
 acmid = {1533768},
 publisher = {Springer-Verlag},
 address = {Berlin, Heidelberg},
}

A Belief Model of Query Difficulty that uses Subjective Logic

Christina Lioma, Roi Blanco, Raquel Mochales Palau, Marie-Francine Moens

PosterICTIR 2009 - Second International Conference on the Theory of Information Retrieval

Abstract

The difficulty of a user query can affect the performance of Information Retrieval (IR) systems. This work presents a formal model for quantifying and reasoning about query difficulty as follows: Query difficulty is considered to be a subjective belief, which is formulated on the basis of various types of evidence. This allows us to define a belief model and a set of operators for combining evidence of query difficulty. The belief model uses subjective logic, a type of probabilistic logic for modeling uncertainties. An application of this model with semantic and pragmatic evidence about 150 TREC queries illustrates the potential flexibility of this framework in expressing and combining evidence. To our knowledge, this is the first application of subjective logic to IR.

@Inbook{Lioma2009,
author="Lioma, Christina
and Blanco, Roi
and Mochales Palau, Raquel
and Moens, Marie-Francine",
editor="Azzopardi, Leif
and Kazai, Gabriella
and Robertson, Stephen
and R{\"u}ger, Stefan
and Shokouhi, Milad
and Song, Dawei
and Yilmaz, Emine",
chapter="A Belief Model of Query Difficulty That Uses Subjective Logic",
title="Advances in Information Retrieval Theory: Second International Conference on the Theory of Information Retrieval, ICTIR 2009 Cambridge, UK, September 10-12, 2009 Proceedings",
year="2009",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="92--103",
isbn="978-3-642-04417-5",
doi="10.1007/978-3-642-04417-5_9",
url="http://dx.doi.org/10.1007/978-3-642-04417-5_9"
}

A Logical Inference Approach to Query Expansion with Social Tags

Christina Lioma, Roi Blanco, Marie-Francine Moens

Conference ICTIR 2009 - Second International Conference on the Theory of Information Retrieval

Abstract

Query Expansion (QE) refers to the Information Retrieval (IR) technique of adding assumed relevant terms to a query in order to render it more informative, and hence more likely to retrieve relevant documents. A key problem is how to identify the terms to be added, and how to integrate them into the original query. We address this problem by using as expansion terms social tags that are freely available on the Web. We integrate these tags into the query by treating the QE process as a logical inference and by considering the addition of tags as an extra deduction to this process. This work extends Nie’s logical inference formalisation of QE to process social tags, and proposes an estimation of tag salience, which is experimentally shown to yield competitive retrieval performance.

@incollection{LiomaICTIR2009
year={2009},
isbn={978-3-642-04416-8},
booktitle={Advances in Information Retrieval Theory},
volume={5766},
series={Lecture Notes in Computer Science},
editor={Azzopardi, Leif and Kazai, Gabriella and Robertson, Stephen and Rueger, Stefan and Shokouhi, Milad and Song, Dawei and Yilmaz, Emine},
doi={10.1007/978-3-642-04417-5_39},
title={A Logical Inference Approach to Query Expansion with Social Tags},
url={http://dx.doi.org/10.1007/978-3-642-04417-5_39},
publisher={Springer Berlin Heidelberg},
author={Lioma, Christina and Blanco, Roi and Moens, Marie-Francine},
pages={358-361},
language={English}
}

Mixed monolingual homepage finding in 34 languages - The role of language script and search domain

Roi Blanco, Christina Lioma

Journal PaperInformation Retrieval (2009), Springer

Abstract

The information that is available or sought on the World Wide Web (Web) is increasingly multilingual. Information Retrieval systems, such as the freely available search engines on the Web, need to provide fair and equal access to this information, regardless of the language in which a query is written or where the query is posted from. In this work, we ask two questions: How do existing state of the art search engines deal with languages written in different alphabets (scripts)? Do local language-based search domains actually facilitate access to information? We conduct a thorough study on the effect of multilingual queries for homepage finding, where the aim of the retrieval system is to return only one document, namely the homepage described in the query. We evaluate the effect of multilingual queries in retrieval performance with regard to (i) the alphabet in which the queries are written (e.g., Latin, Russian, Arabic), and (ii) the language domain where the queries are posted (e.g., google.com, google.fr). We query four major freely available search engines with 764 queries in 34 different languages, and look for the correct homepage in the top retrieved results. A series of thorough experiments involving over 10,000 runs, with queries both in their correct and in Latin characters, and also using both global-domain and local-domain searches, reveal that queries issued in the correct script of a language are more likely to be found and ranked in the top 3, while queries in non-Latin script languages which are however issued in Latin script are less likely to be found; also, queries issued to the correct local domain of a search engine, e.g., French queries to yahoo.fr, are likely to have better retrieval performance than queries issued to the global domain of a search engine. To our knowledge, this is the first Web retrieval study that uses such a wide range of languages.

@article{Blanco:2009:MMH:1527580.1527590,
 author = {Blanco, Roi and Lioma, Christina},
 title = {Mixed Monolingual Homepage Finding in 34 Languages: The Role of Language Script and Search Domain},
 journal = {Infornation Retrieval},
 issue_date = {June 2009},
 volume = {12},
 number = {3},
 month = jun,
 year = {2009},
 issn = {1386-4564},
 pages = {324--351},
 numpages = {28},
 url = {http://dx.doi.org/10.1007/s10791-008-9082-8},
 doi = {10.1007/s10791-008-9082-8},
 acmid = {1527590},
 publisher = {Kluwer Academic Publishers},
 address = {Hingham, MA, USA},
 keywords = {Multilingual information retrieval, Search engines and evaluation, Web information retrieval},
}

Probabilistic documents length priors for language models

Roi Blanco, Alvaro Barreiro

PaperECIR 2008 - 30th European Conference on Advances in Information Retrieval

Abstract

This paper addresses the issue of devising a new document prior for the language modeling (LM) approach for Information Retrieval. The prior is based on term statistics, derived in a probabilistic fashion and portrays a novel way of considering document length. Furthermore, we developed a new way of combining document length priors with the query likelihood estimation based on the risk of accepting the latter as a score. This prior has been combined with a document retrieval language model that uses Jelinek-Mercer (JM), a smoothing technique which does not take into account document length. The combination of the prior boosts the retrieval performance, so that it outperforms a LM with a document length dependent smoothing component (Dirichlet prior) and other state of the art high-performing scoring function (BM25). Improvements are significant, robust across different collections and query sizes.

@inproceedings{Blanco:2008:PDL:1793274.1793322,
 author = {Blanco, Roi and Barreiro, Alvaro},
 title = {Probabilistic Document Length Priors for Language Models},
 booktitle = {Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval},
 series = {ECIR'08},
 year = {2008},
 isbn = {3-540-78645-7, 978-3-540-78645-0},
 location = {Glasgow, UK},
 pages = {394--405},
 numpages = {12},
 url = {http://dl.acm.org/citation.cfm?id=1793274.1793322},
 acmid = {1793322},
 publisher = {Springer-Verlag},
 address = {Berlin, Heidelberg},
}

ECIR 2008 Workshop on Efficiency Issues on Information Retrieval

Roi Blanco, Fabrizio Silvestri

Journal PaperSIGIR Forum (2008)

Abstract

The goal of EIIR 2008, the first Workshop on Efficiency Issues in Information Retrieval, was to shed light on efficiency-related issues of modern high-scale information retrieval (IR), e.g., Web, distributed technologies, peer to peer architectures and also new IR environments such as desktop search, enterprise/expert search, mobile devices, etc. In addition, the workshop aimed at fostering collaboration between different research groups in this area.

@article{Blanco:2008:EWE:1394251.1394263,
 author = {Blanco, Roi and Silvestri, Fabrizio},
 title = {ECIR 2008 Workshop on Efficiency Issues on Information Retrieval},
 journal = {SIGIR Forum},
 issue_date = {June 2008},
 volume = {42},
 number = {1},
 month = jun,
 year = {2008},
 issn = {0163-5840},
 pages = {59--62},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/1394251.1394263},
 doi = {10.1145/1394251.1394263},
 acmid = {1394263},
 publisher = {ACM},
 address = {New York, NY, USA},
}

Efficiency Issues in Information Retrieval Workshop

Roi Blanco, Fabrizio Silvestri

Workshop reportECIR 2008 - 30th European Conference on Advances in Information Retrieval

Abstract

Today’s technological advancements allow for vast amounts of information to be widely generated, disseminated, and stored. This exponentially increasing amount of information renders the retrieval of relevant information a necessary and cumbersome task. The field of Information Retrieval (IR) addresses this task by developing systems in an effective and efficient way. Specifically, IR effectiveness deals with retrieving the most relevant information to a user need, while IR efficiency deals with providing fast and ordered access to large amounts of information.

@Inbook{Blanco2008,
author="Blanco, Roi
and Silvestri, Fabrizio",
editor="Macdonald, Craig
and Ounis, Iadh
and Plachouras, Vassilis
and Ruthven, Ian
and White, Ryen W.",
chapter="Efficiency Issues in Information Retrieval Workshop",
title="Advances in Information Retrieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings",
year="2008",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="711--711",
isbn="978-3-540-78646-7",
doi="10.1007/978-3-540-78646-7_84",
url="http://dx.doi.org/10.1007/978-3-540-78646-7_84"
}

Segmentation of legislative documents using a domain-specific lexicon

Ismael Hasan, Javier Parapar, Roi Blanco

Workshop Paper DEXA '08 - 19th International Workshop on Database and Expert Systems Application, 2008

Abstract

The amount of legal information is continuously growing. New legislative documents appear everyday in the Web. Legal documents are produced on a daily basis in briefing-format, containing changes in the current legislation, notifications, decisions, resolutions, etc. The scope of these documents includes countries, states, provinces and even city councils. This legal information is produced in a semi-structured format and distributed daily on official Web-sites; however, the huge amount of published information makes difficult for an user to find a specific issue, being lawyers probably the most representative example, who need to access to these sources regularly. This motivates the need of legislative information search engines. Standard general Web search engines return to the user full documents (Web pages typically), within hundreds of pages. As users expect only the relevant part of the document, techniques that recognise and extract these relevant bits of documents are needed to offer quick and effective results. In this paper we present a method to perform segmentation based on domain-specific lexicon information. Our method was tested with a manually tagged data-set coming from different sources of Spanish legislative documents. Results show that this technique is suitable for the task achieving values of 97'85% recall and 95'99% precision.

@inproceedings{conf/dexaw/HasanPB08,
  author = {Hasan, Ismael and Parapar, Javier and Blanco, Roi},
  biburl = {http://www.bibsonomy.org/bibtex/27c06179923a679e00574eb8b73f3607d/dblp},
  booktitle = {DEXA Workshops},
  crossref = {conf/dexaw/2008},
  ee = {http://doi.ieeecomputersociety.org/10.1109/DEXA.2008.45},
  interhash = {03ab92c5540378f4c984078fd362c940},
  intrahash = {7c06179923a679e00574eb8b73f3607d},
  isbn = {978-0-7695-3299-8},
  keywords = {dblp},
  pages = {665-669},
  publisher = {IEEE Computer Society},
  timestamp = {2015-06-19T13:03:31.000+0200},
  title = {Segmentation of Legislative Documents Using a Domain-Specific Lexicon.},
  url = {http://dblp.uni-trier.de/db/conf/dexaw/dexaw2008.html#HasanPB08},
  year = 2008
}

Boosting static pruning of inverted files

Roi Blanco, Alvaro Barreiro

PosterSIGIR '07 - 0th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

This paper revisits the static term-based pruning technique presented in Carmel et al., SIGIR 2001 for ad-hoc retrieval, addressing different issues concerning its algorithmic design not yet taken into account. Although the original technique is able to retain precision when a considerable part of the inverted file is removed, we show that it is possible to improve precision in some scenarios if some key design features are properly selected.

@inproceedings{Blanco:2007:RWT:1277741.1277930,
 author = {Blanco, Roi and Lioma, Christina},
 title = {Random Walk Term Weighting for Information Retrieval},
 booktitle = {Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '07},
 year = {2007},
 isbn = {978-1-59593-597-7},
 location = {Amsterdam, The Netherlands},
 pages = {829--830},
 numpages = {2},
 url = {http://doi.acm.org/10.1145/1277741.1277930},
 doi = {10.1145/1277741.1277930},
 acmid = {1277930},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {TextRank, random walk algorithm},
}

Random walk term weighting for information retrieval

Roi Blanco, Christina Lioma

PosterSIGIR '07 - 30th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

We present a way of estimating term weights for Information Retrieval (IR), using term co-occurrence as a measure of dependency between terms.We use the random walk graph-based ranking algorithm on a graph that encodes terms and co-occurrence dependencies in text, from which we derive term weights that represent a quantification of how a term contributes to its context. Evaluation on two TREC collections and 350 topics shows that the random walk-based term weights perform at least comparably to the traditional tf-idf term weighting, while they outperform it when the distance between co-occurring terms is between 6 and 30 terms.

@inproceedings{Blanco:2007:RWT:1277741.1277930,
 author = {Blanco, Roi and Lioma, Christina},
 title = {Random Walk Term Weighting for Information Retrieval},
 booktitle = {Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '07},
 year = {2007},
 isbn = {978-1-59593-597-7},
 location = {Amsterdam, The Netherlands},
 pages = {829--830},
 numpages = {2},
 url = {http://doi.acm.org/10.1145/1277741.1277930},
 doi = {10.1145/1277741.1277930},
 acmid = {1277930},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {TextRank, random walk algorithm},
}

Static Pruning of Terms in Inverted Files

Roi Blanco, Alvaro Barreiro

PaperECIR 2007 - 29th European Conference on IR Research,

Abstract

This paper addresses the problem of identifying collection dependent stop-words in order to reduce the size of inverted files. We present four methods to automatically recognise stop-words, analyse the tradeoff between efficiency and effectiveness, and compare them with a previous pruning approach. The experiments allow us to conclude that in some situations stop-words pruning is competitive with respect to other inverted file reduction techniques.

@Inbook{Blancoecir2007,
author="Blanco, Roi
and Barreiro, {\'A}lvaro",
editor="Amati, Giambattista
and Carpineto, Claudio
and Romano, Giovanni",
chapter="Static Pruning of Terms in Inverted Files",
title="Advances in Information Retrieval: 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007. Proceedings",
year="2007",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="64--75",
isbn="978-3-540-71496-5",
doi="10.1007/978-3-540-71496-5_9",
url="http://dx.doi.org/10.1007/978-3-540-71496-5_9"
}

A comparative performance evaluation of different implementations of the SOAP protocol

Jose A. Garcia, Roi Blanco, Antonio Blanco, Javier Paris

Conference ECOWS '07 - Fifth European Conference on Web Services

Abstract

This paper presents a study evaluation of the SOAP protocol performance between two different implementations: Java (Axis2) and Erlang. This comparison has been carried out using several testbeds with input and output data of different sizes. More concretely, we developed three different web services representing typical scenarios likely to be found in real environments. The evaluation is two-fold: we measured both the number of requests per second answered (throughput) by each server and the response to a common server workload, mixing stress and stand-by phases. The Erlang functional programming language claims to be specifically designed and suited for distributed, reliable and soft real-time concurrent systems. Moreover, its built-in lightweight process management and easiness of replication within distributed environments stand out Erlang as an appealing choice for service oriented architectures (SOAs). On the other hand, we compared this new approximation with the well-known Apache Axis2 project, as it is widely employed on the Web Services field by the Java community. This work allows us to conclude that the Erlang server is better when the computational cost of the web service is low, whereas the Axis2 server is more efficient as the service workload increases.

@article{Garcia2007,
author = {Jose A. Garcia and Roi Blanco and Antonio Blanco and Javier Paris},
title = {A Comparative Performance Evaluation of Different Implementations of the SOAP Protocol.},
journal ={Web Services, European Conference on},
volume = {0},
isbn = {0-7695-3044-3},
year = {2007},
pages = {109-118},
doi = {http://doi.ieeecomputersociety.org/10.1109/ECOWS.2007.16},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
}

TSP and cluster-based solutions to the reassignment of document identifiers

Roi Blanco, Alvaro Barreiro

Journal PaperInformation Retrieval (2006), Springer

Abstract

Recent studies demonstrated that it is possible to reduce Inverted Files (IF) sizes by reassigning the document identifiers of the original collection, as this lowers the distance between the positions of documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. This paper presents an efficient solution to the reassignment problem, which consists in reducing the input data dimensionality using a SVD transformation, as well as considering it a Travelling Salesman Problem (TSP). We also present some efficient solutions based on clustering. Finally, we combine both the TSP and the clustering strategies for reordering the document identifiers. We present experimental tests and performance results in two text TREC collections, obtaining good compression ratios with low running times, and advance the possibility of obtaining scalable solutions for web collections based on the techniques presented here.

@article{Blanco:2006:TCS:1147841.1147844,
 author = {Blanco, Roi and Barreiro, \'{A}lvaro},
 title = {TSP and Cluster-based Solutions to the Reassignment of Document Identifiers},
 journal = {Inf. Retr.},
 issue_date = {September 2006},
 volume = {9},
 number = {4},
 month = sep,
 year = {2006},
 issn = {1386-4564},
 pages = {499--517},
 numpages = {19},
 url = {http://dx.doi.org/10.1007/s10791-006-6614-y},
 doi = {10.1007/s10791-006-6614-y},
 acmid = {1147844},
 publisher = {Kluwer Academic Publishers},
 address = {Hingham, MA, USA},
 keywords = {Clustering, Compression, Document identifier reassignment, Indexing, SVD, TSP},
}

Document Identifier Reassignment Through Dimensionality Reduction

Roi Blanco, Alvaro Barreiro

PaperECIR 2005 - 27th European Conference on IR Research

Abstract

Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as it lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. However, approximations developed so far requires great amounts of time or use an uncontrolled memory size. This paper presents an efficient solution to the reassignment problem consisting in reducing the input data dimensionality using a SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and one more efficient variant based on dividing the original problem in sub-problems. We present experimental tests and performance results in two TREC collections, obtaining good compression ratios with low running times. We also show experimental results about the tradeoff between dimensionality reduction and compression, and time performance.

@Inbook{Blanco2005,
author="Blanco, Roi
and Barreiro, {\'A}lvaro",
editor="Losada, David E.
and Fern{\'a}ndez-Luna, Juan M.",
chapter="Document Identifier Reassignment Through Dimensionality Reduction",
title="27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005.",
year="2005",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="375--387",
isbn="978-3-540-31865-1",
doi="10.1007/978-3-540-31865-1_27",
url="http://dx.doi.org/10.1007/978-3-540-31865-1_27"
}

Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem

Roi Blanco, Alvaro Barreiro

PosterSIGIR '05 - 28th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

In this poster, we analyze recent work in the document identifiers reassignment problem. After that, we present a formalization of a simple case of the problem as a PSP (Pattern Sequencing Problem). This may facilitate future work as it opens a new research line to solve the general problem.

@inproceedings{Blanco:2005:CSC:1076034.1076141,
 author = {Blanco, Roi and Barreiro, Alvaro},
 title = {Characterization of a Simple Case of the Reassignment of Document Identifiers As a Pattern Sequencing Problem},
 booktitle = {Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR '05},
 year = {2005},
 isbn = {1-59593-034-5},
 location = {Salvador, Brazil},
 pages = {587--588},
 numpages = {2},
 url = {http://doi.acm.org/10.1145/1076034.1076141},
 doi = {10.1145/1076034.1076141},
 acmid = {1076141},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {compression, document identifier reassignment, inverted files},
}

Talks

Slide decks of paper presentations, seminars and keynotes. All of these are uploaded into my slideshare page.

Keynotes

From queries to answers in the Web

Web Intelligence Summer School (2015). Saint-Etienne, France

Keynote of the WI summer school that took place in Saint-Etienne (France), also used at the Galician Symposium of Natural Language Processing (2015) at CITIUS. The talk touches upon several topics related to semantic search and how search engines have been evolving from just displaying the infamous "ten blue links" to presenting information coming directly from (semi) structured data sources in the search results page. It also showcases many applications of this technnology inside one of the major search engines available.

Slideshare link
Introduction to Big Data

Big Data and Data Science Summer school (2013). Santiago (Spain)

Big data is the new trend in software engineering. When the term was coined it was mostly employed to refer to techniques that needed or were able to work with amounts of data over the general limit reachable by a single machine. Nowadays, the term has a broader acceptance, encompassing the techniques that process digital information for decision making using large amounts of information which change rapidly and has different formats. This kind of data can be found in several corporations that undergo software engineering or even marketing and business analytics companies, going through research labs in different scientific areas. The magnitude of the data requires, in general, of specific tools and a particular job role to process, analyze and make decisions using the data. In this talk, we present a wide perspective of the tools available to deal with massive amounts of data and of the difficulties and benefits to use them. Furthermore, we go deeper in the role of Data Science in modern corporations and how decision making using data can benefit certain corporate processes and applications. Finally, we discuss some future projects related to Big Data.

Big Data es la nueva moda en ingeniería del software. Cuando se acuñó el término, este se utilizaba para técnicas software que necesitaban o eran capaces de trabajar con cantidades datos que superaran el límite alcanzable por una única máquina. Hoy en día, el término tiene una acepción más amplia como una disciplina que trata de investigar formas de procesamiento digital para la toma de decisiones mediante el uso de activos de información de gran volumen, que cambian rápidamente y son muy variados en forma. Este tipo de datos se puede encontrar actualmente en múltiples organizaciones de diferentes tamaños, que abarcan desde compañías que realizan procesos de creación de software hasta empresas de marketing, pasando por laboratorios de investigación en diferentes áreas científicas. El orden de magnitud de los datos a procesar requiere, en general, de herramientas especiales, y de un rol de trabajo particular para procesar, analizar y tomar decisiones respecto a los datos. En esta charla presentaremos una perspectiva tanto de las herramientas disponibles para tratar cantidades masivas de datos, como de las dificultades y los beneficios de operar con ellas. Además, profundizaremos en el rol de Data Science en las corporaciones de hoy en día, y en como tomar decisiones utilizando datos puede beneficiar ciertos procesos corporativos y aplicaciones. Finalmente, discutiremos algunos de los proyectos futuros relacionados con Big Data.

Slideshare link
Mining Web content for Enhanced Search

5th International iDB workshop. Sapporo, Japan (2013)

Typically, Web mining approaches have focused on enhancing or learning about user seeking behavior, from query log analysis and click through usage, employing the web graph structure for ranking to detecting spam or web page duplicates. Lately, there's a trend on mining web content semantics and dynamics in order to enhance search capabilities by either providing direct answers to users or allowing for advanced interfaces or capabilities. In this tutorial we will look into different ways of mining textual information from Web archives, with a particular focus on how to extract and disambiguate entities, and how to put them in use in various search scenarios. Further, we will discuss how web dynamics affects information access and how to exploit them in a search context.

Slideshare link
Introduction to Information Retrieval

Yahoo Bangalore Summer School on Information Retrieval and the Semantic Web (2013).

Broad introduction to information retrieval and web search (slides are a mash-up from my own and other people's presentations).

Information Retrieval is a the core of modern search engines, which makes it one of the most engaging modern technologies, and the most dominating form of information access. At their core, information retrieval techniques cover a broader appeal, spanning tools from the realm of information theory to optimized structures for fast data access. In this tutorial we will review the roots of information retrieval and web search systems, from both a theoretical and practical perspective. We will mostly focus on search over unstructured data, how to store it, how to access it and how to model user search behavior. We will review the foundations of probabilistic models of user finding information and give an overview of the key ingredients of modern search engines, which range from crawling to indexing to fighting spam.

Slideshare link
Beyond document retrieval using semantic annotations

University of Santiago de Compostela, Spain (2012)

Traditional information retrieval approaches deal with retrieving full-text document as a response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing is able to convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.

Slideshare link
Searching over the past, present and future

Big-Data Analytics for the Temporal Web, Paris (2011)

Searching over the Past, Present and Future helps users to analyze how news topics evolve over time and, in particular, how topics are likely to evolve in the future. Thus, users are able to learn about what people are saying about the future. What are the hot topics? Is the sentiment positive or negative? What are the information sources and are they biased in anyway? To this end, the Time Explorer has been developed. It is an application designed for analyzing how news changes over time that extends upon current time-based systems in many important ways. First, Time Explorer is designed to help users discover how entities such as people and locations associated with a query change over time. Second, by searching on time expressions extracted automatically from text, the application allows the user to explore not only how topics evolved in the past, but also how they will continue to evolve in the future. Finally, Time Explorer is designed around an intuitive interface that allows users to interact with time and entities in a powerful way. Within Time Explorer searching through time becomes possible in no time at all.

Slideshare link
Large-Scale Semantic Search

Overview on large scale semantic search approaches, knowledge representation, retrieval models and evaluation.

Slideshare link

Paper presentations

Entity Linking via Graph-Distance Minimization

Joint work with Paolo Boldi and Andrea Marino (University of Milan)

Entity-linking is a natural-language--processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base.
One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items.

Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, it turns out to be solvable in linear time under some more restrictive assumptions. For the general case, we propose several heuristics: one of these tries to enforce the above assumptions while the others try to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset.

Slideshare link
Influence of Timeline and Named-entity Components on User Engagement

Joint work with Yashar Moshfeghi, Joemon Jose (Glasgow University) and Michael Matthews (Yahoo Labs).

Nowadays, successful applications are those which contain features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies where the importance of these components were studied from a retrieval effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate if we can predict user-centred metrics through user's interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings might steer systems that apply a more personalised user experience, tailored to the user's preferences..

Slideshare link
Keyword Search over RDF Graphs

Joint work with Shady Elbassuoni (Max-Planck)

Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the Semantic-Web data model RDF as a representation model. Querying these knowledge bases is typically done using structured queries utilizing graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users which limits the accessibility to such data sources. To overcome this, keyword search must be supported. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords, and ranks them based on statistical language models. We show that our retrieval model outperforms the-state-of-the-art IR and DB models for keyword search over structured data using experiments over two real-world datasets.

Slideshare link
Extending BM25 with multiple query operators

Joint work with Paolo Boldi (University of Milan)

Traditional probabilistic relevance frameworks for informational retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of virtual region: a virtual region is a part of the document that, like a BM25F-field, can provide a (larger or smaller, depending on a tunable weighting parameter) evidence of relevance of the document; differently from BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, positional-aware) operators to the query. This technique fits nicely in the eliteness model behind BM25 and provides a principled explanation to BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on Web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.

Slideshare link
Energy-Price-Driven Query Processing in Multi-center Web Search Engines

Joint work with Enver Kayaaslan, Cevdet Aykanat (Bilken University), B. Barla Cambazouglu and Flavio Junqueira (Yahoo)

Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.

Slideshare link
Effective and Efficient Entity Search in RDF data

Joint work with Peter Mika (Yahoo) and Sebastiano Vigna (University of Milan)

Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e. access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.

Slideshare link
Repeatable and Reliable Search System Evaluation using Crowdsourcing

Joint work with Harry Halpin, Henry S. Thompson (University of Edinburgh), Daniel M. Herzig, Than Tran Duc (Institute AFB), Peter Mika (Yahoo) and Jeffrey Pound (University of Waterloo)

The primary problem confronting any new kind of search task is how to boot-strap a reliable and repeatable evaluation campaign, and a crowd-sourcing approach provides many advantages. However, can these crowd-sourced evaluations be repeated over long periods of time in a reliable manner? To demonstrate, we investigate creating an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional search over web-pages, object search aims at the retrieval of information from factual assertions about real-world objects rather than searching over web-pages with textual descriptions. Using the first large-scale evaluation campaign that specifically targets the task of ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowd-sourced evaluation campaigns can be repeated over time and still maintain reliable results. Furthermore, we show how these results are comparable to expert judges when ranking systems and that the results hold over different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.

Slideshare link
Caching Search Engine Results over Incremental Indices

Joint work with Edward Bornikov, Flavio Junqueira, Ronny Lempel, Luca Telloli and Hugo Zaragoza (Yahoo)

A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.

To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.

Slideshare link
Finding support sentences for entities

Joint work with Hugo Zaragoza (Yahoo)

We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately.

Slideshare link
Static Pruning of Terms in Inverted Files

Joint work with Alvaro Barreiro (University of A Coruña)

Slides used for a paper presentation at ECIR 2007. The paper addresses the problem of identifying collection dependent stop-words in order to reduce the size of inverted files. We present four methods to automatically recognise stop-words, analyse the tradeoff between efficiency and effectiveness, and compare them with a previous pruning approach. The experiments allow us to conclude that in some situations stop-words pruning is competitive with respect to other inverted file reduction techniques.

Slides PDF
Document Identifier Reassignment Through Dimensionality Reduction

Joint work with Alvaro Barreiro (University of A Coruña)

Slides used for a paper presentation at ECIR 2005. Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as it lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. However, approximations developed so far requires great amounts of time or use an uncontrolled memory size. This paper presents an efficient solution to the reassignment problem consisting in reducing the input data dimensionality using a SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and one more efficient variant based on dividing the original problem in sub-problems. We present experimental tests and performance results in two TREC collections, obtaining good compression ratios with low running times. We also show experimental results about the tradeoff between dimensionality reduction and compression, and time performance.

Slides PDF
Building/Compression techniques for Inverted files

Slides used at a Seminar in Glasgow University, describing how to sort and manage large amounts of data effectively, algorithms for indexing text collections and compression techniques for inverted files. Many of those techniques were implemented in the Terrier search engine.

Slides PDF

Gallery

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Contact

I'm currently based in Melbourne. Feel free to reach out for questions related to my work or interesting research projects.

roicho
@roiblanco
roi-blanco

Roi Blanco

University of A Coruña

Roi Blanco

Short Bio

Academic Positions

Research Fellow

Senior Researcher

Associate Professor

Research intern

Research intern

Education & Training

Research

Links

Software

Data

Organization

Conference organization:

Patents and defensive publications

Method or system for ranking related news predictions

Quote-Based Search

Using cache invalidation to assign documents to indexes in a distributed search engine

Caching Search Engine Results over Incremental Indices

Interactive interface for object search

Random Stuff

Other software I've used extensively for research projects:

Some (old) teaching resources in Spanish:

Publications

Filter by type:

Click Through Rate Prediction for Local Search Results

Abstract

A Concise Integer Linear Programming Formulation for Implicit Search Result Diversification

Abstract

Lightweight Multilingual Entity Extraction and Linking

Abstract

Exploiting Green Energy to Reduce the Operational Costs of Multi-Center Web Search Engines

Abstract

Term-by-Term Query Auto-Completion for Mobile Search

Abstract

Memory-based Recommendations of Entities for Web Search Users

Abstract

An In-Depth Study of Implicit Search Result Diversification

Abstract

Building Test Collections for Evaluating Temporal IR

Abstract

Re-finding Behavior in Vertical Domains

Abstract

Temporal Information Retrieval

Using graph distances for named-entity linking

Abstract

Predicting primary categories of business listings for local search ranking

Abstract

IntoNews: Online news retrieval using closed captions

Abstract

Ranking of daily deals with concept expansion

Abstract

Temporal information searching behaviour and strategies

Abstract

Predicting Re-finding Activity and Difficulty

Abstract

Online News Tracking for Ad-Hoc Information Needs

Abstract

Insights into Entity Recommendation in Web Search

Abstract

Leveraging Wikipedia Knowledge for Entity Recommendations

Abstract

Timely Semantics: A Study of a Stream-Based Ranking System for Entity Relationships

Abstract

Local Ranking Problem on the BrowseGraph

Abstract

Hierarchy Construction for News Summarizations

Abstract

Online News Tracking for Ad-Hoc Queries

Abstract

Fast and Space-Efficient Entity Linking for Queries

Abstract

From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search

Abstract

Using Wikipedia for Cross-Language Named Entity Recognition

Abstract

Focused Crawling for Structured Data