Web-Scraping (StackOverflow)

Kevin Vo

PART I

Loading the Library and Parsing the Website

Load the XML library and parse the first web page that we want to scrape:

library(XML)
u= "http://stackoverflow.com/questions/tagged/r?page=1&sort=active&pagesize=50"
doc = htmlParse(u)

Processing the information of the first page:

There are two ways of looking at the HTML code. The first method: after we call htmlParse on u, the string containing the website URL, we print the parsed document in R:

> doc
<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>Recently Active &#39;r&#39; Questions - Page 1 - Stack Overflow</title>
    <link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
....
        _qevents.push({ qacct: "p-c1rF4kxgLUzNc" });
    </script>
            
    </body>
</html>

The second method is more convenient because we can keep the R code and the HTML in separate places. Open the website in a browser (Chrome, in my case), right-click on the page > View Page Source, then copy the HTML code into an editor.
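
A small convenience along these lines (a sketch, not part of the original workflow): the XML package can serialize the parsed document back out, so we can inspect in an editor exactly what R saw rather than what the browser rendered. The filename here is arbitrary.

```r
# Write the parsed HTML to a file for inspection in an editor;
# saveXML() is from the XML package loaded above.
writeLines(saveXML(doc), "r_questions_page1.html")
```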

Extract information about who posted questions on Stackoverflow R first page:

However, skimming through thousands of lines of HTML code to find the class containing the poster's name is not practical. I used the following trick to get the job done faster: search the copied page source for a username that is visible on the page, which jumps straight to the div with class 'user-details'. That div comes in two forms.

- First:

<div class="user-details">
        <a href="/users/149223/ira-cooke">Ira Cooke</a><br>

- Second:

<div class="user-details">
        <a id="history-927358" href="/posts/927358/revisions" title="show revision history for this post">
        35 revs, 20 users 29%<br></a><a href="/users/63550">Peter Mortensen</a>

Therefore, I build an XPath expression that goes into the div with class 'user-details', then into any <a> element whose href attribute contains '/users/':

> who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
> who=sapply(who, function(who) xmlValue(who))
> length(who)
[1] 50
> who[1:10]
 [1] "Pork Chop"       "javlacalle"      "eipi10"          "sparkle"
 [5] "ialm"            "Jordan Browne"   "Jordan Browne"   "SimonB"
 [9] "C8H10N4O2"       "Richard Scriven"

Even though we extract exactly 50 posters here, a problem showed up after I repeated the whole process many times: sometimes we get only 49 posters out of 50. What is the cause? The page source has the answer:

<div class="user-details">
        anon<br>
    </div>

It turns out that Stackoverflow allows questions to be posted anonymously. To solve this problem, I go up to the parent node with class 'started fr':

> who = getNodeSet(doc, "//div[@class = 'started fr']")
> length(who)
[1] 50

From there, I find where the anonymous posters are and return their indices. After that, we extract the names of all identified posters into a who vector as before, then insert "anonymous" into the who vector at the anonymous indices.

> who = getNodeSet(doc, "//div[@class = 'started fr']")
> index= 1:length(who)
> anon_index= which(sapply(who,function(who) is.na(xmlValue(getNodeSet(who,".//div[@class = 'user-details']//a[contains(@href,'/users/')]")[1][[1]]))))
> if (length(anon_index) == 0){
+     who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
+     who=sapply(who, function(who) xmlValue(who))
+ }else{
+     who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
+     who=sapply(who, function(who) xmlValue(who))
+     index= index[-anon_index]
+     who = data.frame(index=index, who = who)
+     anon= data.frame(index=anon_index,who= "anonymous")
+     who = rbind(who,anon)
+     who= who[order(who$index),2]
+ }
> length(who)
[1] 50

Check:

> who[1:10]
 [1] "Pork Chop"       "javlacalle"      "eipi10"          "sparkle"
 [5] "ialm"            "Jordan Browne"   "Jordan Browne"   "SimonB"
 [9] "C8H10N4O2"       "Richard Scriven"

Extract information about when the questions were posted on Stackoverflow R first page:

> when = getNodeSet(doc,"//div[@class = 'user-action-time']/a/span")
> when= sapply(when,function(when) unname(xmlAttrs(when)[1]))
> length(when)
[1] 50
> when[1:10]
 [1] "2015-12-09 18:48:03Z" "2015-12-09 18:45:19Z" "2015-12-09 18:45:15Z"
 [4] "2015-12-09 18:41:27Z" "2015-12-09 18:40:42Z" "2015-12-09 18:40:02Z"
 [7] "2015-12-09 18:38:22Z" "2015-12-09 18:38:16Z" "2015-12-09 18:36:37Z"
[10] "2015-12-09 18:33:18Z"

Note: In fact I could use xpathSApply to write the code in one line instead of two. The reason I do not is that it is more convenient to debug the program when I only need to change the getNodeSet expression. For example:

```r
when = getNodeSet(doc, "//div[@class = 'user-action-time']/a/span")
when[[1]]
# 2 mins ago
when[[19]]
# 37 mins ago
```

By doing this, I can inspect the contents of the first node and the 19th node. For someone just getting familiar with HTML and XML, this method maximized my learning for this type of assignment. Nevertheless, here is one example using xpathSApply for this question:

> xpathSApply(doc,"//div[@class = 'user-action-time']/a/span",function(i) xmlGetAttr(i,'title'))
 [1] "2015-12-09 18:48:03Z" "2015-12-09 18:45:19Z" "2015-12-09 18:45:15Z"
 [4] "2015-12-09 18:41:27Z" "2015-12-09 18:40:42Z" "2015-12-09 18:40:02Z"
 [7] "2015-12-09 18:38:22Z" "2015-12-09 18:38:16Z" "2015-12-09 18:36:37Z"
[10] "2015-12-09 18:33:18Z" "2015-12-09 18:30:56Z" "2015-12-09 18:30:34Z"
[13] "2015-12-09 18:30:01Z" "2015-12-09 18:29:38Z" "2015-12-09 18:27:38Z"
[16] "2015-12-09 18:27:14Z" "2015-12-09 18:22:08Z" "2015-12-09 18:18:25Z"
[19] "2015-12-09 18:13:06Z" "2015-12-09 18:03:16Z" "2015-12-09 17:38:25Z"
[22] "2015-12-09 17:34:17Z" "2015-12-09 17:29:25Z" "2015-12-09 17:26:30Z"
[25] "2015-12-09 17:26:18Z" "2015-12-09 17:25:33Z" "2015-12-09 17:20:02Z"
[28] "2015-12-09 17:19:55Z" "2015-12-09 17:17:36Z" "2015-12-09 17:08:23Z"
[31] "2015-12-09 17:08:08Z" "2015-12-09 17:07:39Z" "2015-12-09 17:04:41Z"
[34] "2015-12-09 17:02:50Z" "2015-12-09 16:58:54Z" "2015-12-09 16:56:52Z"
[37] "2015-12-09 16:46:48Z" "2015-12-09 16:46:04Z" "2015-12-09 16:43:00Z"
[40] "2015-12-09 16:42:56Z" "2015-12-09 16:41:45Z" "2015-12-09 16:39:19Z"
[43] "2015-12-09 16:37:15Z" "2015-12-09 16:35:50Z" "2015-12-09 16:35:24Z"
[46] "2015-12-09 16:33:51Z" "2015-12-09 16:30:10Z" "2015-12-09 16:30:05Z"
[49] "2015-12-09 16:27:07Z" "2015-12-09 16:19:03Z"

Extract information about the title of posted questions on Stackoverflow R first page:

HTML format of the title:

 <div class="summary">
        <h3><a href="/questions/1374842/building-and-installing-an-r-package-library-with-a-jnilib-extension" 
        class="question-hyperlink">Building and installing an R package library with a jnilib extension</a></h3>

To get the title of each question, we select the node inside the div with class 'summary', then go down through h3/a:

> title = getNodeSet(doc,"//div[@class = 'summary']/h3/a")
> title = sapply(title, function(title) xmlValue(title))
> length(title)
[1] 50
> title[1:10]
 [1] "Two distributions in with googleVis in R"
 [2] "Replacing intercept with dummy variables in ARIMAX models in R"
 [3] "Stacked barplot with ggplot2 depending on two variables"
 [4] "Plotly: add_trace in a loop"
 [5] "vline legends not showing on geom_histogram type plot"
 [6] "microarray data, calculating mean gene expression and effect"
 [7] "How to code for independent 2 sample t-test (x,y)"
 [8] "Linear model with repeated measures factors"
 [9] "How to calculate a table of pairwise counts from long-form data frame"
[10] "Determine week number from date over several years"

Extract information about the reputation of those who posted questions on Stackoverflow R first page:

HTML format of the reputation:

<div class="user-details">
        <a href="/users/149223/ira-cooke">Ira Cooke</a><br>
        <span class="reputation-score" title="reputation score " dir="ltr">1,117</span>

We can see that the information we need is under the div with class 'user-details', in the span with class 'reputation-score'. Notice that this is the same 'user-details' div that holds the poster's name, so in some cases there is an anonymous poster without any reputation score. We can therefore reuse the anon_index from the who extraction to solve this problem, assigning a reputation of 0 to anonymous posters.

> if (length(anon_index) == 0){
+     reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
+     reputation = sapply(reputation, function(reputation) xmlValue(reputation))
+     reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
+     }else{
+     reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
+     reputation = sapply(reputation, function(reputation) xmlValue(reputation))
+     reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
+     reputation = data.frame(index=index, reputation = reputation)
+     anon_reputation= data.frame(index=anon_index,reputation= 0)
+     reputation = rbind(reputation,anon_reputation)
+     reputation = reputation[order(reputation$index),2]
+     }
> length(reputation)
[1] 50
> reputation[1:10]
 [1]   2046    709 134000   1522   3409      8      8    166   2502 438000
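
One caveat with the substitution above: because "k" becomes "000" before the decimal point is deleted, a value like "43.8k" parses as 438000 rather than 43800. A more careful conversion might look like the following sketch (the results above were produced with the simpler substitution):

```r
parse_reputation = function(x) {
    x = gsub(",", "", x)               # "1,117"  -> "1117"
    k = grepl("k$", x)                 # flag abbreviated thousands
    x = as.numeric(gsub("k$", "", x))  # "43.8k"  -> 43.8
    x[k] = x[k] * 1000                 # 43.8     -> 43800
    x
}
parse_reputation(c("1,117", "43.8k", "134k"))
# [1]   1117  43800 134000
```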

Extract information about number of views for the posted questions on Stackoverflow R first page:

HTML format of number of views of a question:

<div class="views " title="385 views">
    385 views
</div>
<div class="views warm" title="1,870 views">
    2k views
</div>

Besides the classes 'views ' and 'views warm', there are also 'views hot' and 'views supernova'. I read the exact count from the title attribute rather than the element text, since the displayed text may be rounded (e.g. '2k views' for 1,870 views).

> views = getNodeSet(doc, "//div[@class = 'views ' or @class = 'views warm' or @class = 'views hot' or @class = 'views supernova']")
> views = sapply(views,function(views) as.numeric(gsub("\\D","",xmlGetAttr(views,"title"))))
> length(views)
[1] 50
> views[1:10]
 [1]  13   5   9   5   5  20  48  16 488  13

Extract information about number of answers of posted questions on Stackoverflow R first page:

HTML format of the number of answers of posted questions:

<div class="status answered-accepted">
    <strong>1</strong>answer
</div>

Besides the class 'status answered-accepted', there are also the alternative classes 'status unanswered' and 'status answered'. I use the regex "\\D" to keep the digits only after getting the xmlValue of the node.

> answers = getNodeSet(doc, "//div[@class = 'status unanswered' or @class = 'status answered-accepted' or @class = 'status answered']")
> answers = sapply(answers, function(answers) as.numeric(gsub("\\D","",xmlValue(answers))))
> length(answers)
[1] 50
> answers[1:10]
 [1] 1 1 1 0 1 0 1 0 4 0

Extract information about number of votes of posted questions on Stackoverflow R first page:

HTML format of number of votes:

<div class="votes">
                    <span class="vote-count-post "><strong>0</strong></span>
                    <div class="viewcount">votes</div>
                </div>

The information is in the div with class 'votes':

> votes = getNodeSet(doc, "//div[@class = 'votes']")
> votes = sapply(votes, function(votes) as.numeric(gsub("\\D","",xmlValue(votes))))
> length(votes)
[1] 50
> votes[1:10]
 [1] 1 0 1 0 0 2 0 0 2 1

Extract information about URL of posted questions on Stackoverflow R first page:

HTML format of question URL:

<div class="summary">
        <h3><a href="/questions/1374842/building-and-installing-an-r-package-library-with-a-jnilib-extension" 
        class="question-hyperlink">Building and installing an R package library with a jnilib extension</a></h3>

The URL is in the href attribute of the link inside the class 'summary'. I prepend the base URL to make it absolute:

```r
questionURL = getNodeSet(doc, "//div[@class = 'summary']/h3/a")
baseURL = "http://stackoverflow.com"
questionURL = sapply(questionURL, function(questionURL) paste(baseURL, unname(xmlAttrs(questionURL)[1]), sep = ""))
length(questionURL)
[1] 50
questionURL[1:10]
 [1] "http://stackoverflow.com/questions/34183203/two-distributions-in-with-googlevis-in-r"
 [2] "http://stackoverflow.com/questions/34182971/replacing-intercept-with-dummy-variables-in-arimax-models-in-r"
 [3] "http://stackoverflow.com/questions/34186123/stacked-barplot-with-ggplot2-depending-on-two-variables"
 [4] "http://stackoverflow.com/questions/34186560/plotly-add-trace-in-a-loop"
 [5] "http://stackoverflow.com/questions/34186081/vline-legends-not-showing-on-geom-histogram-type-plot"
 [6] "http://stackoverflow.com/questions/34123983/microarray-data-calculating-mean-gene-expression-and-effect"
 [7] "http://stackoverflow.com/questions/34147163/how-to-code-for-independent-2-sample-t-test-x-y"
 [8] "http://stackoverflow.com/questions/34185719/linear-model-with-repeated-measures-factors"
 [9] "http://stackoverflow.com/questions/13176741/how-to-calculate-a-table-of-pairwise-counts-from-long-form-data-frame"
[10] "http://stackoverflow.com/questions/34186408/determine-week-number-from-date-over-several-years"
```

Extract information about id of posted questions on Stackoverflow R first page:

HTML format of id of posted questions:

<div class="question-summary" id="question-summary-1374842">

It is the id attribute of the div with class 'question-summary'. However, the attribute value contains the prefix 'question-summary-'. To strip it, I use the regex "\\D" to keep only the digits:

> id = getNodeSet(doc, "//div[@class = 'question-summary']")
> id = sapply(id, function(id) as.numeric(gsub("\\D","",xmlAttrs(id)[2])))
> length(id)
[1] 50
> id[1:10]
 [1] 34183203 34182971 34186123 34186560 34186081 34123983 34147163 34185719
 [9] 13176741 34186408
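
Indexing xmlAttrs(id)[2] relies on 'id' being the second attribute, which is fragile if the markup changes. Fetching the attribute by name is more robust; a sketch of the same extraction:

```r
# same result, but addressing the attribute by name instead of position
id2 = sapply(getNodeSet(doc, "//div[@class = 'question-summary']"),
             function(node) as.numeric(gsub("\\D", "", xmlGetAttr(node, "id"))))
```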

Extract information about tags of posted questions on Stackoverflow R first page:

HTML format for the tags of posted questions:

<div class="tags t-python t-r t-matrix">
            <a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a> <a href="/questions/tagged/r" class="post-tag" title="show questions tagged 'r'" rel="tag">r</a> <a href="/questions/tagged/matrix" class="post-tag" title="show questions tagged 'matrix'" rel="tag">matrix</a> 
        </div>

Getting the right node is quite easy: I use the XPath "//div[@class = 'summary']//div[contains(@class,'tags')]". However, after we get the xmlValue of the node, it looks like this:

> tags = getNodeSet(doc, "//div[@class = 'summary']//div[contains(@class,'tags')]")
> xmlValue(tags[[1]])
[1] "\r\n            r shiny googlevis density-plot \r\n        "

So I use the regex "^\\r\\n\\s+((\\S+\\s)+)\\r\\n\\s+" to remove the \r\n and surrounding whitespace at the beginning and end of the string, then replace the remaining spaces with "; ", matching Professor Duncan's example format for tags.

> tags= sapply(tags, function(tags) gsub("\\s","; ",gsub("\\s$","",gsub("^\\\r\\\n\\s+((\\S+\\s)+)\\\r\\\n\\s+","\\1",xmlValue(tags)))))
> length(tags)
[1] 50
> tags[1:10]
 [1] "r; shiny; googlevis; density-plot" "r; time-series; intercept"
 [3] "r; ggplot2"                        "r; plot; ggplot2; plotly"
 [5] "r; ggplot2"                        "r; expression; effect"
 [7] "r"                                 "r; mixed-models"
 [9] "r; count; data.frame; long-form"   "r; date; cycle"

Extract information about URL of the next button on Stackoverflow R first page:

HTML format of next button:

<div class = "pager fl">
<a href="/questions/tagged/r?page=2&amp;sort=active&amp;pagesize=50" rel="next" title="go to page 2"> <span class="page-numbers next"> next</span> </a>

It is the <a> element with rel = 'next':

> next_button  = getNodeSet(doc, "//a[@rel='next']")
> next_button_url = paste(baseURL,xmlGetAttr(next_button[[1]],"href"),sep ="")
> next_button_url
[1] "http://stackoverflow.com/questions/tagged/r?page=2&sort=active&pagesize=50"

Creating a data frame containing the summary information of the first page

> df= data.frame(id=id, date= when, tags=tags, title=title, url = questionURL, views= views, votes= votes,answers= answers, user= who, reputation = reputation)
> head(df)
        id                 date                              tags
1 34183203 2015-12-09 18:48:03Z r; shiny; googlevis; density-plot
2 34182971 2015-12-09 18:45:19Z         r; time-series; intercept
3 34186123 2015-12-09 18:45:15Z                        r; ggplot2
4 34186560 2015-12-09 18:41:27Z          r; plot; ggplot2; plotly
5 34186081 2015-12-09 18:40:42Z                        r; ggplot2
6 34123983 2015-12-09 18:40:02Z             r; expression; effect
                                                           title
1                       Two distributions in with googleVis in R
2 Replacing intercept with dummy variables in ARIMAX models in R
3        Stacked barplot with ggplot2 depending on two variables
4                                    Plotly: add_trace in a loop
5          vline legends not showing on geom_histogram type plot
6   microarray data, calculating mean gene expression and effect
                                                                                                         url
1                       http://stackoverflow.com/questions/34183203/two-distributions-in-with-googlevis-in-r
2 http://stackoverflow.com/questions/34182971/replacing-intercept-with-dummy-variables-in-arimax-models-in-r
3        http://stackoverflow.com/questions/34186123/stacked-barplot-with-ggplot2-depending-on-two-variables
4                                     http://stackoverflow.com/questions/34186560/plotly-add-trace-in-a-loop
5          http://stackoverflow.com/questions/34186081/vline-legends-not-showing-on-geom-histogram-type-plot
6    http://stackoverflow.com/questions/34123983/microarray-data-calculating-mean-gene-expression-and-effect
  views votes answers          user reputation
1    13     1       1     Pork Chop       2046
2     5     0       1    javlacalle        709
3     9     1       1        eipi10     134000
4     5     0       0       sparkle       1522
5     5     0       1          ialm       3409
6    20     2       0 Jordan Browne          8
> dim(df)
[1] 50 10
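
One detail worth flagging: data.frame converts character vectors to factors by default in R versions of this era, so columns like tags, title, and url end up as factors. If character columns are preferred, the construction can pass stringsAsFactors = FALSE, as in this sketch:

```r
# same construction, keeping text columns as character rather than factor
df = data.frame(id = id, date = when, tags = tags, title = title, url = questionURL,
                views = views, votes = votes, answers = answers, user = who,
                reputation = reputation, stringsAsFactors = FALSE)
```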

PART II

Create a function to scrape a whole Stackoverflow page, then scrape all the pages of Stackoverflow R questions into one data frame.

I create a function called page_df that scrapes all the necessary information from one Stackoverflow page into a data frame.


page_df = 
function(url_link)
{
    doc = htmlParse(url_link)

    # Who posted: start from the parent 'started fr' node so anonymous posters are counted
    who = getNodeSet(doc, "//div[@class = 'started fr']")
    index = 1:length(who)
    anon_index = which(sapply(who, function(who) is.na(xmlValue(getNodeSet(who, ".//div[@class = 'user-details']//a[contains(@href,'/users/')]")[1][[1]]))))
    if (length(anon_index) == 0){
        who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
        who = sapply(who, function(who) xmlValue(who))
    }else{
        who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
        who = sapply(who, function(who) xmlValue(who))
        # insert "anonymous" at the positions where no user link was found
        index = index[-anon_index]
        who = data.frame(index = index, who = who)
        anon = data.frame(index = anon_index, who = "anonymous")
        who = rbind(who, anon)
        who = who[order(who$index), 2]
    }

    # When posted: the title attribute holds the exact UTC timestamp
    when = getNodeSet(doc, "//div[@class = 'user-action-time']/a/span")
    when = sapply(when, function(when) unname(xmlAttrs(when)[1]))

    # Question titles
    title = getNodeSet(doc, "//div[@class = 'summary']/h3/a")
    title = sapply(title, function(title) xmlValue(title))

    # Reputation: reuse anon_index, assigning 0 to anonymous posters
    if (length(anon_index) == 0){
        reputation = getNodeSet(doc, "//div[@class = 'user-details']//span[@class = 'reputation-score']")
        reputation = sapply(reputation, function(reputation) xmlValue(reputation))
        reputation = as.numeric(gsub("[,|.]", "", gsub("[k]", "000", reputation)))
    }else{
        reputation = getNodeSet(doc, "//div[@class = 'user-details']//span[@class = 'reputation-score']")
        reputation = sapply(reputation, function(reputation) xmlValue(reputation))
        reputation = as.numeric(gsub("[,|.]", "", gsub("[k]", "000", reputation)))
        reputation = data.frame(index = index, reputation = reputation)
        anon_reputation = data.frame(index = anon_index, reputation = 0)
        reputation = rbind(reputation, anon_reputation)
        reputation = reputation[order(reputation$index), 2]
    }

    # Views: the exact count is in the title attribute; the class varies with popularity
    views = getNodeSet(doc, "//div[@class = 'views ' or @class = 'views warm' or @class = 'views hot' or @class = 'views supernova']")
    views = sapply(views, function(views) as.numeric(gsub("\\D", "", xmlGetAttr(views, "title"))))

    # Answer counts
    answersNode = getNodeSet(doc, "//div[@class='status unanswered' or @class='status answered-accepted' or @class='status answered']")
    answers = sapply(answersNode, function(answer) as.numeric(gsub("\\D", "", xmlValue(answer))))

    # Vote counts
    votes = getNodeSet(doc, "//div[@class='votes']")
    votes = sapply(votes, function(votes) as.numeric(gsub("\\D", "", xmlValue(votes))))

    # Question URLs: prepend the base URL to the relative href
    questionURL = getNodeSet(doc, "//div[@class = 'summary']/h3/a")
    baseURL = "http://stackoverflow.com"
    questionURL = sapply(questionURL, function(questionURL) paste(baseURL, unname(xmlAttrs(questionURL)[1]), sep = ""))

    # Question ids: strip the "question-summary-" prefix, keeping digits only
    id = getNodeSet(doc, "//div[@class = 'question-summary']")
    id = sapply(id, function(id) as.numeric(gsub("\\D", "", xmlAttrs(id)[2])))

    # Tags: strip surrounding whitespace and separate with "; "
    tags = getNodeSet(doc, "//div[@class = 'summary']//div[contains(@class,'tags')]")
    tags = sapply(tags, function(tags) gsub("\\s", "; ", gsub("\\s$", "", gsub("^\\\r\\\n\\s+((\\S+\\s)+)\\\r\\\n\\s+", "\\1", xmlValue(tags)))))

    # Return one row per question
    data.frame(id = id, date = when, tags = tags, title = title, url = questionURL,
               views = views, votes = votes, answers = answers, user = who,
               reputation = reputation)
}

For convenience and code reuse, I put it into an R file named function.R.

Now we move on to scraping all the Stackoverflow R questions.

u= "http://stackoverflow.com/questions/tagged/r?page=1&sort=active&pagesize=50"
doc= htmlParse(u)
df = page_df(u)
repeat{
    next_button  = getNodeSet(doc, "//a[@rel='next']")
    next_button_url = paste(baseURL,xmlGetAttr(next_button[[1]],"href"),sep ="")
    doc = htmlParse(next_button_url)
    new_df = page_df(next_button_url)
    df = rbind(df,new_df)
    if (grepl("next",xmlValue(next_button[1][[1]])) == FALSE){
        break
    }
}
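
A note on performance: growing df with rbind on every iteration copies the accumulated data frame each time, which is quadratic over 2331 pages. Collecting the per-page results in a list and binding once at the end is usually faster. A sketch of that variant, under the same assumptions as above:

```r
pages = list()
u = "http://stackoverflow.com/questions/tagged/r?page=1&sort=active&pagesize=50"
repeat {
    doc = htmlParse(u)
    pages[[length(pages) + 1]] = page_df(u)      # scrape the current page
    next_button = getNodeSet(doc, "//a[@rel='next']")
    if (length(next_button) == 0) break          # last page reached
    u = paste(baseURL, xmlGetAttr(next_button[[1]], "href"), sep = "")
}
df = do.call(rbind, pages)
```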

After about 40 minutes, the script finished scraping 2331 pages of R questions, producing a data frame named df with 116541 rows.

> dim(df)
[1] 116541     10
> head(df)
        id                 date                         tags
1 34126854 2015-12-07 05:09:51Z   r; if-statement; lubridate
2 34126806 2015-12-07 05:07:04Z r; dplyr; caret; subsampling
3 34124928 2015-12-07 05:02:04Z          r; join; data.table
4 34126026 2015-12-07 05:00:36Z                            r
5 34126467 2015-12-07 04:58:02Z                     regex; r
6 34126673 2015-12-07 04:52:46Z                            r
                                                                                                                                     title
1                                                                                                                Run script else quit in R
2                                                                                      k-fold cross validation with different sample sizes
3                                                    Can I use the R data.table join capability to select rows and perform some operation?
4                                                                                                    Mean and standard deviation by groups
5                                                                                        merging two data sets on the basis of two columns
6 Is there any way to use blast+ to blast a query protein sequence against a single genome without making a database of it in R? [on hold]
                                                                                                                           url
1                                                        http://stackoverflow.com/questions/34126854/run-script-else-quit-in-r
2                              http://stackoverflow.com/questions/34126806/k-fold-cross-validation-with-different-sample-sizes
3 http://stackoverflow.com/questions/34124928/can-i-use-the-r-data-table-join-capability-to-select-rows-and-perform-some-opera
4                                            http://stackoverflow.com/questions/34126026/mean-and-standard-deviation-by-groups
5                                http://stackoverflow.com/questions/34126467/merging-two-data-sets-on-the-basis-of-two-columns
6  http://stackoverflow.com/questions/34126673/is-there-any-way-to-use-blast-to-blast-a-query-protein-sequence-against-a-singl
  views votes answers            user reputation
1     3     0       0            Rime        277
2     5     0       0          Pascal       6461
3    25     4       1 Richard Scriven     437000
4    31     3       3           akrun     121000
5    11     1       1        bramtayl       2006
6    13     4       0          Pascal       6461

PART III

> load("rQAs.rda")
> data = rQAs
> dim(data)
[1] 58096    10

What is the distribution of the number of questions each person answered?

Tabulating the first column (the user) of the answer rows gives the number of questions each person answered:

answer_classification = data[data$type=="answer",1]
answer_distribution = table(answer_classification)
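
answer_distribution holds, for each user, how many answers they wrote. To see the distribution itself, one can tabulate those counts again; a sketch (output omitted, since it depends on the data):

```r
# how many users wrote 1 answer, 2 answers, 3 answers, ...
table(answer_distribution)
summary(as.numeric(answer_distribution))
```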

What are the most common tags?
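
The original write-up does not show code for this question. Assuming rQAs has a tags column formatted like the tags scraped above (e.g. "r; ggplot2; plotly"), a sketch would split on the separator and tabulate:

```r
# hypothetical: assumes a data$tags column with "; "-separated tags
tag_list = strsplit(as.character(data$tags[data$type == "question"]), "; ")
tag_counts = sort(table(unlist(tag_list)), decreasing = TRUE)
head(tag_counts, 10)   # the most common tags
```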

How many questions are about ggplot?

Subset data$text by the rows where data$type is "question". The total number of questions is 10004, and 956 of them mention ggplot:

```r
text_by_question = data$text[data$type == "question"]
length(text_by_question)
[1] 10004
length(which(grepl("ggplot", text_by_question)))
[1] 956
```

How many questions involve XML, HTML or Web Scraping?

682 questions mention XML, 125 mention HTML, and 0 match the exact phrase "Web Scraping":

> length(which(grepl("XML", text_by_question)))
[1] 682
> length(which(grepl("HTML", text_by_question)))
[1] 125
> length(which(grepl("Web Scraping", text_by_question)))
[1] 0
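
The zero count for "Web Scraping" is partly an artifact of grepl matching the exact, case-sensitive phrase. A case-insensitive variant would likely match more posts; a sketch (not run for the results above):

```r
# case-insensitive search for the phrase
length(which(grepl("web scraping", text_by_question, ignore.case = TRUE)))
```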