Robots for web analysis
A number of robots (a.k.a. crawlers, spiders, etc.) for analysing web pages already exist.
In this master's thesis project, the student will study the field of web analytics and, in addition, design and implement a set of web robots that collect data on selected aspects of the web, to be used in projects that analyse those aspects over time.
There are four major parts to this thesis.
The first part is to study the field of web analytics and summarize the types of analysis available through packages such as Google Analytics, link checkers (e.g. WebCheck), and the various robots that already exist as free software or toolkits.
The second part is to propose a set of robots that can either be designed from scratch or adapted from existing free software tools for web analysis.
The robots should be easy to configure for a researcher interested in exploring certain aspects of the web. Exactly which aspects is left to the student, but the following are offered as suggestions:
- Page size, complexity, and content (media, links, etc.)
- Size, growth, rate of change
- Problems (broken links, etc.)
- Quality (latency, packet loss, reachability)
- Adoption of the «Semantic Web»-vision (semantic markup)
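As a rough illustration of the first suggestion, a page-metrics robot could derive size, link, and media counts from fetched HTML using only the Python standard library. The sketch below is a hypothetical starting point, not a prescribed design; the class and function names are the author's own invention.

```python
from html.parser import HTMLParser

class PageMetrics(HTMLParser):
    """Counts links and embedded media while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.media = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag in ("img", "video", "audio", "object", "embed"):
            self.media += 1

def analyse_page(html: str) -> dict:
    """Return simple size and complexity metrics for one page."""
    parser = PageMetrics()
    parser.feed(html)
    return {
        "size_bytes": len(html.encode("utf-8")),
        "links": parser.links,
        "media": parser.media,
    }
```

A robot built on this could fetch each target URL on a schedule, run `analyse_page`, and hand the resulting dictionary to the storage layer.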
The data and metadata collected by the robot should be stored in a database.
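A minimal sketch of such a store, assuming SQLite (Python's bundled `sqlite3` module) and an illustrative schema; the table and column names are assumptions, not requirements from the project description:

```python
import sqlite3

# One row per observation of a page; the columns mirror the metrics a
# page-analysis robot might collect. Schema is illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page_observation (
    id          INTEGER PRIMARY KEY,
    project     TEXT NOT NULL,   -- project the robot run belongs to
    url         TEXT NOT NULL,
    fetched_at  TEXT NOT NULL,   -- ISO 8601 timestamp
    size_bytes  INTEGER,
    link_count  INTEGER,
    media_count INTEGER
);
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the observation database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record(conn, project, url, fetched_at, size_bytes, link_count, media_count):
    """Insert one observation row."""
    conn.execute(
        "INSERT INTO page_observation "
        "(project, url, fetched_at, size_bytes, link_count, media_count) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (project, url, fetched_at, size_bytes, link_count, media_count),
    )
    conn.commit()
```

Keying each row on a project name keeps the later time-series queries simple: one `SELECT ... WHERE project = ?` yields the whole history for a study.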
The robots should adhere to all standards and conventions for web robots, such as meta tags and robots.txt.
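Adherence to robots.txt can be checked with Python's standard `urllib.robotparser`. The helper below is a sketch that takes an already-fetched robots.txt body (so it works offline); in a real robot one would fetch the file from the target host first.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Each robot would call such a check before every fetch and skip (and log) any URL the site's policy disallows.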
The robots should be implemented in a widely available scripting language (e.g. Perl or Python) for portability, and the student should try to engage a user community on sourceforge.net to augment the design and testing of the robots and analytics package.
The third part is to design a command and control centre (CCC) for the collection of robots. The CCC should allow a user to configure, deploy, and direct the robots to accomplish specific tasks. The tasks should be organized into projects. One project, for instance, may be to inspect a certain web page daily for six months and measure the number of changes to the page from one inspection to the next.
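The example project above needs some measure of how much a page changed between two inspections. One simple option (an assumption on my part, not specified in the project description) is the similarity ratio from Python's standard `difflib`:

```python
import difflib

def change_ratio(old: str, new: str) -> float:
    """Fraction of the page that changed between two snapshots, in [0, 1].

    Computed as 1 minus difflib's SequenceMatcher similarity ratio:
    0.0 means the snapshots are identical, 1.0 means nothing matches.
    """
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()
```

A daily-inspection task would store each snapshot, compute `change_ratio` against the previous one, and record the value as one point in the project's time series.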
The fourth part is to design and build a graphical user interface (GUI) that can be used as a front end both to the CCC and for displaying time series, views, and visualisations of the data belonging to a specific project. Ideally, the GUI should offer an integrated visual query generator for the project database (e.g. similar to the "Views" module that comes with the Drupal CMS), and a visual front end to a suitable language for statistical computing and graphics, such as R.