Adrian Kuhn shared the following image of the ICSE ’12 tag cloud built from the 1,373,011 words sampled from the main conference submissions:

To make it fun (!), I decided to compute the ICSE ’12 information entropy from the tag cloud. Here’s the fun I experienced:

First, I assigned a relative weight to the tags based merely on the visual perception of their font size, which in a tag cloud implies “gravity” or “importance”. For instance, the word ‘code’ got a relative weight of 85 whereas the word ‘pattern’ got a relative weight of 9. The total weight for the 150 tags listed below was 2387.5 (the index column stops at 149 only because index 46 appears twice).

Next, I computed the empirical probabilities of each tag by using a frequency approach: the probability of a tag equals its relative weight over 2387.5, the total weight. It turns out the tags appear to follow a power-law distribution:

Having the probability distribution, I could compute the Shannon entropy, which turned out to be 7.01 bits per tag. That is, in principle, you’d need on average 7.01 bits to encode each tag in the tag cloud. Note that other constructs (such as articles, verbs, etc.) are excluded, since they fall outside the software engineering context. If one were to include those, then no worries: the entropy of the English language, especially in academia, is probably less than 1 bit per character!

Naturally, a more accurate first-order entropy can be computed if the count for each word is provided. That way, the empirical probabilities would be derived directly by dividing each word count by the total number of words (1,373,011). The perceived entropy value of 7.01 bits/tag, as well as the more accurate entropy value, could help determine the lossless compression limit of the 393 MB ICSE Proceedings archive! But, hey, bring it all to Canada; there’s no need for data compression in our land.
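For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the two steps described above: normalizing the relative weights into probabilities, then summing the per-tag terms -p·log2(p). The function names are mine and purely illustrative; the original numbers were not necessarily produced this way.

```python
import math

def tag_probabilities(weights):
    """Turn relative weights into empirical probabilities that sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def entropy_term(p):
    """Contribution -p * log2(p) of one tag, in bits (0 for p == 0)."""
    return -p * math.log2(p) if p > 0 else 0.0

def shannon_entropy_bits(weights):
    """First-order Shannon entropy, in bits per tag, of a list of weights."""
    return sum(entropy_term(p) for p in tag_probabilities(weights))

# The 'code' row: weight 85 out of a total weight of 2387.5.
print(entropy_term(85 / 2387.5))  # about 0.1713, the H value listed for 'code'

# Four equally weighted tags need exactly 2 bits each.
print(shannon_entropy_bits([1, 1, 1, 1]))  # 2.0
```

Passing the full weight column of the raw data to `shannon_entropy_bits` should reproduce the total at the bottom of the table. Note that the weights are kept in a plain list rather than keyed by tag, because a couple of tags appear twice in the data.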

Here’s the raw data, for reference:

i Tag Relative weight p_i H_i (bits)
1 code 85 0.0356020942408377 0.171313506572643
2 requirements 10 0.00418848167539267 0.0330863117190497
3 find 9 0.0037696335078534 0.0303506765014925
4 engineering 12 0.0050261780104712 0.0383815163162604
5 programs 13 0.00544502617801047 0.0409511995374668
6 cases 15 0.00628272251308901 0.0459543105059809
7 similar 11 0.00460732984293194 0.0357614188024733
8 result 8 0.00335078534031414 0.0275477613162236
9 pattern 9 0.0037696335078534 0.0303506765014925
10 three 13 0.00544502617801047 0.0409511995374668
11 language 12 0.0050261780104712 0.0383815163162604
12 user 13 0.00544502617801047 0.0409511995374668
13 often 9 0.0037696335078534 0.0303506765014925
14 large 11 0.00460732984293194 0.0357614188024733
15 search 9 0.0037696335078534 0.0303506765014925
16 students 15 0.00628272251308901 0.0459543105059809
17 models 17 0.00712041884816754 0.0507958018854544
18 source 19 0.00795811518324607 0.0554947822337051
19 tool 18.5 0.00774869109947644 0.0543325175142862
20 data 38 0.0159162303664921 0.095073334100918
21 number 32 0.0134031413612565 0.0833847625423812
22 one 32 0.0134031413612565 0.0833847625423812
23 II 10 0.00418848167539267 0.0330863117190497
24 also 32 0.0134031413612565 0.0833847625423812
25 approach 38 0.0159162303664921 0.095073334100918
26 e.g 14 0.00586387434554974 0.0434743544881844
27 projects 13 0.00544502617801047 0.0409511995374668
28 ACM 13 0.00544502617801047 0.0409511995374668
29 call 10 0.00418848167539267 0.0330863117190497
30 information 20.5 0.00858638743455497 0.0589346708986153
31 systems 17 0.00712041884816754 0.0507958018854544
32 feature 10 0.00418848167539267 0.0330863117190497
33 ICSE 10 0.00418848167539267 0.0330863117190497
34 task 11 0.00460732984293194 0.0357614188024733
35 based 18 0.00753926701570681 0.0531620859872783
36 file 9 0.0037696335078534 0.0303506765014925
37 approaches 10.5 0.0043979057591623 0.034431061674485
38 terms 9 0.0037696335078534 0.0303506765014925
39 generated 9 0.0037696335078534 0.0303506765014925
40 use 32 0.0134031413612565 0.0833847625423812
41 bugs 13 0.00544502617801047 0.0409511995374668
42 pages 11 0.00460732984293194 0.0357614188024733
43 classes 11 0.00460732984293194 0.0357614188024733
44 process 19 0.00795811518324607 0.0554947822337051
45 Figure 31 0.0129842931937173 0.0813737172482226
46 problem 12 0.0050261780104712 0.0383815163162604
46 quality 12 0.0050261780104712 0.0383815163162604
47 execution 16.5 0.00691099476439791 0.0495994554238569
48 shown 10 0.00418848167539267 0.0330863117190497
49 et 11.5 0.00481675392670157 0.0370780377843622
50 knowledge 10 0.00418848167539267 0.0330863117190497
51 line 11 0.00460732984293194 0.0357614188024733
52 several 8 0.00335078534031414 0.0275477613162236
53 class 13 0.00544502617801047 0.0409511995374668
54 usage 8 0.00335078534031414 0.0275477613162236
55 project 20.5 0.00858638743455497 0.0589346708986153
56 IEEE 13 0.00544502617801047 0.0409511995374668
57 need 10 0.00418848167539267 0.0330863117190497
58 existing 9.5 0.00397905759162304 0.0317264487084756
59 tasks 9.5 0.00397905759162304 0.0317264487084756
60 features 12 0.0050261780104712 0.0383815163162604
61 first 18 0.00753926701570681 0.0531620859872783
62 state 14 0.00586387434554974 0.0434743544881844
63 However 15 0.00628272251308901 0.0459543105059809
64 example 24 0.0100523560209424 0.0667106766115785
65 well 8 0.00335078534031414 0.0275477613162236
66 used 38 0.0159162303664921 0.095073334100918
67 testing 16.5 0.00691099476439791 0.0495994554238569
68 changes 14 0.00586387434554974 0.0434743544881844
69 paper 12 0.0050261780104712 0.0383815163162604
70 possible 10.5 0.0043979057591623 0.034431061674485
71 support 12 0.0050261780104712 0.0383815163162604
72 pp 32 0.0134031413612565 0.0833847625423812
73 function 8 0.00335078534031414 0.0275477613162236
74 system 24 0.0100523560209424 0.0667106766115785
75 using 26 0.0108900523560209 0.0710123467189126
76 framework 8 0.00335078534031414 0.0275477613162236
77 reports 10 0.00418848167539267 0.0330863117190497
78 Section 15 0.00628272251308901 0.0459543105059809
79 level 10 0.00418848167539267 0.0330863117190497
80 development 19 0.00795811518324607 0.0554947822337051
81 University 12 0.0050261780104712 0.0383815163162604
82 design 11 0.00460732984293194 0.0357614188024733
83 vol 10 0.00418848167539267 0.0330863117190497
84 important 9 0.0037696335078534 0.0303506765014925
85 Conference 10 0.00418848167539267 0.0330863117190497
86 analysis 24 0.0100523560209424 0.0667106766115785
87 methods 18 0.00753926701570681 0.0531620859872783
88 evaluation 9 0.0037696335078534 0.0303506765014925
89 Java 12 0.0050261780104712 0.0383815163162604
90 algorithm 10.5 0.0043979057591623 0.034431061674485
91 programming 12 0.0050261780104712 0.0383815163162604
92 time 24 0.0100523560209424 0.0667106766115785
93 method 24 0.0100523560209424 0.0667106766115785
94 participants 11 0.00460732984293194 0.0357614188024733
95 values 12 0.0050261780104712 0.0383815163162604
96 provide 10.5 0.0043979057591623 0.034431061674485
97 new 19 0.00795811518324607 0.0554947822337051
98 bug 23 0.00963350785340314 0.0645225677153213
99 i.e 11.5 0.00481675392670157 0.0370780377843622
100 work 23 0.00963350785340314 0.0645225677153213
101 study 18 0.00753926701570681 0.0531620859872783
102 application 16 0.00670157068062827 0.0483939519518189
103 case 19.5 0.00816753926701571 0.0566490951118284
104 API 13 0.00544502617801047 0.0409511995374668
105 tools 12 0.0050261780104712 0.0383815163162604
106 International 9 0.0037696335078534 0.0303506765014925
107 program 24 0.0100523560209424 0.0667106766115785
108 techniques 12 0.0050261780104712 0.0383815163162604
109 input 16 0.00670157068062827 0.0483939519518189
110 related 8 0.00335078534031414 0.0275477613162236
111 research 14 0.00586387434554974 0.0434743544881844
112 specific 10 0.00418848167539267 0.0330863117190497
113 developers 24 0.0100523560209424 0.0667106766115785
114 applications 12 0.0050261780104712 0.0383815163162604
115 Proceedings 10 0.00418848167539267 0.0330863117190497
116 shows 11 0.00460732984293194 0.0357614188024733
117 performance 14 0.00586387434554974 0.0434743544881844
118 variables 8 0.00335078534031414 0.0275477613162236
119 Software 32 0.0134031413612565 0.0833847625423812
120 order 10 0.00418848167539267 0.0330863117190497
121 rules 9 0.0037696335078534 0.0303506765014925
122 software 70 0.0293193717277487 0.149294299501816
123 al 12 0.0050261780104712 0.0383815163162604
124 following 9 0.0037696335078534 0.0303506765014925
125 many 12 0.0050261780104712 0.0383815163162604
126 elements 10.5 0.0043979057591623 0.034431061674485
127 type 10 0.00418848167539267 0.0330863117190497
128 tests 11 0.00460732984293194 0.0357614188024733
129 patterns 14 0.00586387434554974 0.0434743544881844
130 show 8 0.00335078534031414 0.0275477613162236
131 Engineering 16.5 0.00691099476439791 0.0495994554238569
132 developer 13 0.00544502617801047 0.0409511995374668
133 context 10 0.00418848167539267 0.0330863117190497
134 type 9 0.0037696335078534 0.0303506765014925
135 tests 10 0.00418848167539267 0.0330863117190497
136 control 9 0.0037696335078534 0.0303506765014925
137 given 10 0.00418848167539267 0.0330863117190497
138 two 32 0.0134031413612565 0.0833847625423812
139 found 12 0.0050261780104712 0.0383815163162604
140 test 45 0.018848167539267 0.107989292760895
141 results 24 0.0100523560209424 0.0667106766115785
142 Table 13 0.00544502617801047 0.0409511995374668
143 set 26 0.0108900523560209 0.0710123467189126
144 value 12 0.0050261780104712 0.0383815163162604
145 refactoring 10 0.00418848167539267 0.0330863117190497
146 model 32 0.0134031413612565 0.0833847625423812
147 technique 11 0.00460732984293194 0.0357614188024733
148 different 24 0.0100523560209424 0.0667106766115785
149 change 13.5 0.00565445026178011 0.0422183733869045
Total 2387.5 1.0000 7.01030881472941
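
The power-law impression mentioned earlier can be sanity-checked against these raw weights. One rough way, sketched below under my own assumptions, is to rank the weights from heaviest to lightest and fit a least-squares line to log(weight) versus log(rank); a roughly straight line with a negative slope is suggestive of a power law, though it is by no means a rigorous test. The helper name `loglog_slope` is hypothetical.

```python
import math

def loglog_slope(weights):
    """Least-squares slope of log(weight) against log(rank), rank 1 = heaviest."""
    ranked = sorted(weights, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(w) for w in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# The six heaviest tags from the table: code, software, test, data, approach, used.
top_weights = [85, 70, 45, 38, 38, 38]
print(loglog_slope(top_weights))
```

Feeding in the whole weight column gives the slope for the full tag cloud. Since the weights were eyeballed from font sizes, the fitted exponent should be read as indicative at best.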

Under the auspices of my PhD program at UVic, I’ve started an ethnographic study at the IBM Software Lab at 770 Palladium Drive, Kanata, Ottawa, ON. I will be collaborating with the software engineering team on the following two objectives:

  1. Understand how the IBM CLM team members coordinate work items and collaborate, given the fact that the team is not collocated. As part of this objective, I’ll be presenting ProxiScientia, a visualization tool I have developed with Adrian Schröter to measure software development proximity between collaborating developers.
  2. Conduct interviews with individual software developers to gather data about requirements ecosystems.

In particular, I’ll be involved with highly distributed teams working on several aspects of the CLM solution. The Collaborative Lifecycle Management (CLM) infrastructure integrates three core products: change and configuration management (a.k.a. Rational Team Concert, or RTC), requirements gathering (a.k.a. Rational Requirements Composer, or RRC), and testing (a.k.a. Rational Quality Manager, or RQM). All three products are built on the IBM Jazz Foundation, also referred to as the Jazz Application Framework, or JAF. Pretty exciting stuff! To learn more about the Jazz metaphor and its collaboration principles, this book by Adrian Cho is sufficiently elucidating, on top of the dedicated website.

I’ll be working extensively with the teams until August, after which time I’ll be disseminating related research.