Privacy by Design: Adding Privacy to Publishing Platforms

By Joakim Aleksander Olsen Valla

Supervisor: Gisle Hannemyr

Privacy can be considered a measure of the degree of control over personal data. Information system owners have control over personal data as long as it is stored on a system under their immediate control, but this control diminishes when data is copied from the information system (e.g., a website) to a different system. This happens today when a web robot crawls a website and copies the personal data it exposes into a database belonging to a web search engine provider. The web is full of personal names, which are usually attached to some contextual data. If these personal names are indexed by web search engines, along with the contextual data attached to them, both will be discoverable by anyone searching for a specific name. While some such discoveries may be beneficial to the subject, others may be harmful.

The aim of this thesis is to highlight the need for better methods of controlling personal data published online. Specifically, the focus is on methods for shielding personal names from web search engines. Popular content management systems, like WordPress, do very little to help publishers hide personal data from search engines. This thesis conducts an empirical study of the Robots Exclusion Protocol, investigating its effectiveness as a method for controlling indexing and crawling by search engines. Our study, which is limited to the Google search engine, suggests that, as long as the different directives are used correctly, the protocol can be considered quite reliable as a method for preventing content from being indexed by Google.

As a response to the privacy challenges related to personal names being discoverable through web search engines, as well as the lack of fine-grained control offered by the Robots Exclusion Protocol, we have proposed and implemented a solution for increasing control over personal data published online. It accomplishes this by preventing search engines from indexing personal names along with the contextual data attached to those names. By allowing the user to specify that specific parts of the content are to be kept out of the index, the tool offers publishers more fine-grained control over their data than the Robots Exclusion Protocol. The solution has been implemented as a plugin for the WordPress publishing platform. It has been installed on an experimental website, and we have verified that while the rest of the content has been indexed and is discoverable through the Google search engine, personal names posted on the website are no longer discoverable.
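To illustrate the granularity issue mentioned above, the following is a minimal sketch of how a robots.txt rule is evaluated, using Python's standard `urllib.robotparser` module. It is not taken from the thesis: the robots.txt content, the `example.org` URLs, and the `/members/` path are hypothetical, chosen only to show that a rule applies to a whole URL path rather than to an individual name within a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Every URL under /members/ is excluded from crawling as a whole.
blocked = parser.can_fetch("Googlebot", "https://example.org/members/alice.html")

# A page outside that path may be crawled in full, including any names it contains.
allowed = parser.can_fetch("Googlebot", "https://example.org/news/2014/07/post.html")

print(blocked, allowed)
```

In this sketch the decision is made per URL: a page is either open to the crawler or not, so a publisher cannot use such a rule to hide a single personal name while leaving the rest of the page available.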

The master's thesis is available here.


Tags: privacy, robots, internet, search engines, crawling, indexing, robot exclusion protocol, wordpress, plugin
Published July 7, 2014 3:18 PM - Last modified July 7, 2014 4:59 PM