
Understanding and predicting Web content credibility using the Content Credibility Corpus

Mar 18, 2018 | Artificial Intelligence, Machine Learning, Machine Vision, Web Search

Michal Kakol, Radoslaw Nielek, Adam Wierzbicki (Polish-Japanese Academy of Information Technology, Warsaw, Poland)


The goal of our research is to create a predictive model of Web content credibility evaluations, based on human evaluations.

The model has to be based on a comprehensive set of independent factors that can be used to guide users' credibility evaluations in crowdsourced systems like WOT, but also to design machine classifiers of Web content credibility.

The factors described in this article are based on empirical data. We have created a dataset obtained from an extensive crowdsourced Web credibility assessment study (over 15 thousand evaluations of over 5000 Web pages from over 2000 participants). First, online participants evaluated a multi-domain corpus of selected Web pages. Using the acquired data and text mining techniques, we prepared a code book and conducted another crowdsourcing round to label the textual justifications of the former responses.

We have extended the list of significant credibility assessment factors described in previous research and analyzed their relationships to credibility evaluation scores. The discovered factors that affect Web content credibility evaluations are also only weakly correlated with one another, which makes them more useful for modeling and predicting credibility evaluations. Based on the newly identified factors, we propose a predictive model for Web content credibility. The model can be used to determine the significance and impact of the discovered factors on credibility evaluations. These findings can guide future research on the design of automatic or semi-automatic systems for Web content credibility evaluation support. This study also contributes the largest credibility dataset currently publicly available for research: the Content Credibility Corpus (C3).
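To illustrate why weak correlation between factors matters for modeling, the following minimal Python sketch computes pairwise Pearson correlations between binary factor labels. It is not the authors' method or data: the factor names (`has_references`, `is_objective`, `has_ads`) and the toy values are hypothetical, chosen only so that the factors are mutually uncorrelated.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(factors):
    """Map each unordered pair of factor names to their Pearson correlation."""
    return {
        (a, b): pearson(factors[a], factors[b])
        for a, b in combinations(factors, 2)
    }


# Hypothetical toy labels: one entry per evaluated page,
# 1 if the factor was assigned to that page, 0 otherwise.
factors = {
    "has_references": [1, 0, 1, 0, 1, 0, 1, 0],
    "is_objective":   [1, 1, 0, 0, 1, 1, 0, 0],
    "has_ads":        [1, 1, 1, 1, 0, 0, 0, 0],
}

for pair, r in pairwise_correlations(factors).items():
    print(pair, round(r, 3))
```

When the pairwise correlations are close to zero, as in this constructed example, each factor carries information the others do not, so the weights of a predictive model fitted on them are easier to interpret.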

©2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license.

1. Introduction 

According to the findings of a 2011 survey (Purcell, 2011), 92% of American adult Internet users use search engines to find information on the Web, and 59% do so on a typical day. This and other studies confirm our intuitions regarding the important role of Web information. The Web continues to provide an extremely low-cost means of publishing information, often coupled with high incentives for doing so, since Web content can affect purchasing behaviors, opinions, and other important decisions of Web users. This combination of factors has led to large volumes of non-credible and unreliable information being published on the Web.
