
Mapping the social web with PostRank

Monday, August 23rd, 2010

This weekend I had a chance to play with the PostRank data to see what it reveals about user engagement patterns across the major social media sites. Since this might be interesting to people studying human-based computation, I decided to share my preliminary results here.

I used the PostRank metrics API to retrieve data for a set of URLs. It provides counts of individual user interactions with a single web page, gathered from all the social sites PostRank monitors. The metrics update in real time as new user activity occurs and reflect the amount of user engagement the page has accumulated so far. If you haven’t used PostRank metrics before, the easiest way to try them is their new Google Reader extension, which is pretty nice.

The data for an individual web page comes in the following form:

"de3d4d72ebac1e886232f4ab27bd7b46": {
    "brightkite": 2.0,
    "reddit_votes": 1880.0,
    "delicious": 304.0,
    "reddit_comments": 369.0,
    "views": 9.0,
    "identica": 5.0,
    "gr_shared": 13.0,
    "google": 40.0,
    "fb_post": 5.0,
    "diigo": 2.0,
    "clicks": 62.0,
    "blip": 1.0,
    "digg": 103.0,
    "buzz": 5.0,
    "bookmarks": 4.0,
    "twitter": 988.0,
    "jaiku": 2.0,
    "ff_comments": 2.0
}

Different social media sites implement different human-based computation techniques, so in general their activity metrics are not directly comparable. We can compare the same metric across different web pages, but that doesn’t tell us much about the site or algorithm that produced the metric. One way to analyse this data is to look at pairwise correlations between the metrics across many web pages. A pairwise correlation may indicate some interaction between the metrics. It could be an overlap in the user base (e.g. a user shares the same pages to both diigo and delicious), common interests among users of different sites (users of each site share to their respective sites independently because of similar preferences), or some other factor.

I took a sample of 2169 URLs pulled from about 200 feeds in my Google Reader. Those feeds cover a pretty diverse set of topics, including science, engineering, entrepreneurship, business, management, psychology, legal, photography, music, humor, lifestyle, etc. I pulled the PostRank metrics for each of those URLs into a user engagement matrix. Each row of the matrix represents one URL, and each column holds the values of a single engagement metric (e.g. number of posts on twitter) across all 2169 URLs. I computed the Pearson correlation between every pair of columns. The resulting matrix is visualized below:

Social media correlation matrix
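
As a rough sketch of the correlation step described above (not the code used for this post), the following builds the metric columns from per-URL records and computes the Pearson correlation for every pair. The function names, the made-up sample counts, and the choice to treat a missing metric as 0 are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(pages, metrics):
    """Correlate every pair of metric columns.

    pages:   one {metric: count} dict per URL (a row of the matrix)
    metrics: column names; a metric missing from a page counts as 0 (assumption)
    """
    columns = {m: [page.get(m, 0.0) for page in pages] for m in metrics}
    return {a: {b: pearson(columns[a], columns[b]) for b in metrics}
            for a in metrics}


# Made-up counts for three URLs, only to show the shapes involved.
pages = [
    {"twitter": 10.0, "delicious": 2.0, "diigo": 1.0},
    {"twitter": 40.0, "delicious": 8.0},
    {"twitter": 5.0, "delicious": 1.0, "diigo": 3.0},
]
corr = correlation_matrix(pages, ["twitter", "delicious", "diigo"])
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

With real data, each dict in `pages` would be the record returned for one URL, and `metrics` would list every metric name seen in the sample.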

We can see that the Hacker News score and the Hacker News comments correlate highly with each other (a correlation of 0.9 suggests that one is nearly a linear function of the other). However, very high correlations between different sites (orange spots in the matrix) are less expected. A likely reason is the availability of tools that let users export their activity from one site to another. This might be responsible for the correlation of 0.8 between magnolia and delicious and the correlation of 0.6 between diigo and delicious. Such import/export ability is enabled by APIs, so we can expect the sum of correlations in each row to be indicative of the quality and usage of a site's APIs for data portability. Here are the top 10 social sites according to this metric; the top three are hardly surprising:

twitter 11.894999
fb_post 11.187575
ff_comments 10.898951
buzz 9.911882
identica 9.897181
ff_likes 9.319641
hn_comments 8.370196
blip 8.366225
diigo 8.334757
hn_score 8.180564
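
The ranking above amounts to summing each row of the correlation matrix and sorting. A minimal sketch follows, using a hypothetical three-metric matrix rather than the real one. Whether the original sums included the diagonal is not stated, so leaving it out here is an assumption; including it would add 1 to every total and leave the order unchanged.

```python
def rank_by_row_sum(corr, top=10):
    """Sum each metric's correlations with all other metrics, largest first."""
    totals = {
        name: sum(value for other, value in row.items() if other != name)
        for name, row in corr.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical correlations between three metrics, for illustration only.
toy_corr = {
    "a": {"a": 1.0, "b": 0.5, "c": 0.2},
    "b": {"a": 0.5, "b": 1.0, "c": 0.9},
    "c": {"a": 0.2, "b": 0.9, "c": 1.0},
}
for name, total in rank_by_row_sum(toy_corr):
    print(name, round(total, 6))
```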

Finally, to better visualize the relationships among these sites and metrics, I used MDS (multidimensional scaling), a technique often used to map multi-dimensional points onto a plane so that the distances between them on the plane best approximate the distances in the original multi-dimensional space. For this case, I used 1 - correlation as the input dissimilarity for the MDS. This way, sites showing similar user engagement patterns end up close to each other.

One use of this map could be finding alternative sites with a like-minded community to explore. For example, if you are using delicious to share your bookmarks, you might consider exploring its nearest neighbors: diigo, tumblr, magnolia, and hatena.

Social media map
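
To make the MDS step more concrete, here is a dependency-free sketch of classical (Torgerson) MDS, one common variant. Which MDS variant produced the map above is not stated, so this is an assumption, and in practice a library implementation (scikit-learn provides one) would be the usual choice. The function names and the toy correlation matrix are hypothetical. The sketch finds the leading eigenvectors by power iteration and assumes the largest-magnitude eigenvalues of the centered matrix are positive, which is not guaranteed for arbitrary 1 - correlation dissimilarities.

```python
import math


def correlation_to_distance(corr):
    """Turn a correlation matrix (list of lists) into 1 - correlation."""
    return [[1.0 - c for c in row] for row in corr]


def classical_mds(dist, dims=2, iters=200):
    """Place n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of b.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        if eigenvalue <= 1e-12:
            break  # nothing positive left; remaining axes stay at 0
        for i in range(n):
            coords[i][k] = math.sqrt(eigenvalue) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Hypothetical correlations between three metrics, for illustration only.
names = ["a", "b", "c"]
toy_corr = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
for name, point in zip(names, classical_mds(correlation_to_distance(toy_corr))):
    print(name, [round(x, 3) for x in point])
```

Metrics with a high correlation get a small distance and so land near each other, which is the property the map relies on.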

Unfortunately, not every social media site allows access to its user engagement data via activity streams. I hope more sites will do this in the near future, so that this map can be more complete. The landscape of social media sites is changing fast, and many new sites are appearing. Some of these new sites might not be getting the attention they deserve, and this kind of data-driven social media mapping may help users find the sites that offer them the best experience.