WeSearch: Language Technology for the Web
The objective of the WeSearch project is to prepare general-purpose semantic parsing technology: automated large-scale analysis of user-generated Web content (UGC), mapping from human language to formal representations of meaning. Technology will be developed for English, but the research will result in techniques and representations that are directly applicable to other human languages.
Human Language Technology (HLT) is the interdisciplinary sub-field of computer science that aims to enable computers to ‘make sense’ of human language, for example to extract structured information from unstructured text, to determine the sentiment of a product review, to translate from one language into another, or to organize and retrieve Internet content on the basis of abstract meaning, rather than just keywords. As such, language technology is a key enabler of next-generation, user-centric ICT services and a link between technology R&D and the humanities.
The WeSearch project sets out to enable novel Web technologies in the broad realm of social networking. In user-centric information exchange and peer-to-peer interaction, so-called user-generated content (UGC) is often at the core of social networking and already accounts for a large proportion of Internet traffic. UGC has many facets, including audio and video sharing, community resources like Wikipedia, user forums and bulletin boards, professional or personal blogs, and of course iconic services like MySpace, Facebook, and Twitter. Human language is the ‘fabric’ of the Web, and this project targets a representative sample of textual UGC. The focus here is on language technology to automatically assign formal representations of meaning to unstructured content, a general-purpose process dubbed semantic parsing. In the gradual transition from the Web 2.0 paradigm towards the Semantic Web (or Web 3.0), semantic parsing will be an essential tool.
The explosion in user-generated content has prompted extensive interest in the field of text mining, and in particular in automatic methods for tapping into the so-called ‘wisdom of the masses’: opinions and sentiments published on-line. In HLT, recent years have seen growing interest in accessing the knowledge implicit in these massive information sources, a task known as sentiment analysis or opinion mining. Recent research, however, shows that intelligent opinion mining requires a deeper analysis of the data, in particular analyses that more fully reflect the meaning of human language. A full semantic sentiment analysis application is beyond the scope of a single project, but it provides a technology vision characteristic of the use scenarios made possible by Web-scale semantic parsing.
The project combines the horizontal theme of social networking with the vertical pillar of user interfaces, information management, and software technology. Quite generally, ‘intelligent’ access to steadily growing volumes of on-line content in the so-called knowledge society demands language-enabled ICT services. The Semantic Web is a powerful vision, but to move from predominantly unstructured Web content to rich semantic annotation, we will require the ability to automatically relate human language content to structured, formal semantics. In addressing a broad variety of user-generated content, the project facilitates integration across services and resources. More importantly, information retrieval today confines itself to matching keywords, emphasizing recall over precision, with poor results against dynamic, low-density content (user forums, for example). Next-generation services for intelligent information access need to overcome these limitations and scale better to structured queries or advanced retrieval problems (e.g. searching specifically for critical reviews).
Finally, the project also addresses the vertical pillar on social, economic, and cultural challenges and opportunities, if mostly in terms of expected mid- and long-term impact. Multilingualism (the presence and use of different languages side by side) is a challenge of growing importance: less than one third of Web content today is in English (compared to about half around the year 2000). The EU embraces multilingualism as an economic and cultural strength, and future Internet technology will demand services that work transparently across language barriers.