Load the XML library and parse the first web page that we want to scrape:
library(XML)
u= "http://stackoverflow.com/questions/tagged/r?page=1&sort=active&pagesize=50"
doc = htmlParse(u)
There are two ways of getting at the HTML code. First method: after calling htmlParse on u, the string containing the website URL, we print the parsed document in R:
> doc
<html itemscope itemtype="http://schema.org/QAPage">
<head>
<title>Recently Active 'r' Questions - Page 1 - Stack Overflow</title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
....
_qevents.push({ qacct: "p-c1rF4kxgLUzNc" });
</script>
</body>
</html>
The second method is more convenient because it keeps the R code and the HTML in separate places. Open the website in a browser (Chrome, in my case), right-click on the page > View Page Source, then copy the HTML code into an editor.
However, skimming through thousands of lines of HTML is not a practical way to find the class containing the information about who posted each question. I used the following trick to get the job done faster: search the page source for several known poster names. Most questions on the first page follow one of two formats. First:
<div class="user-details">
<a href="/users/149223/ira-cooke">Ira Cooke</a><br>
Second:
<div class="user-details">
<a id="history-927358" href="/posts/927358/revisions" title="show revision history for this post">
35 revs, 20 users 29%<br></a><a href="/users/63550">Peter Mortensen</a>
Therefore, I build an XPath query that descends into the div with class 'user-details', then into any a element whose href attribute contains '/users/':
> who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
> who=sapply(who, function(who) xmlValue(who))
> length(who)
[1] 50
> who[1:10]
[1] "Pork Chop" "javlacalle" "eipi10" "sparkle"
[5] "ialm" "Jordan Browne" "Jordan Browne" "SimonB"
[9] "C8H10N4O2" "Richard Scriven"
Even though we are able to extract exactly 50 posters from the first page, a problem surfaced after repeating the whole process many times: sometimes we get only 49 posters out of 50. What is the cause? Inspecting the page reveals entries like this:
<div class="user-details">
anon<br>
</div>
It turns out that Stack Overflow allows questions to be posted anonymously. To handle this, I go up to the parent node with class 'started fr':
> who = getNodeSet(doc, "//div[@class = 'started fr']")
> length(who)
[1] 50
From there, I find where the anonymous posters are and record their indices. Then I extract the names of all identified posters into a who vector, as in the method above, and insert "anonymous" into the who vector at each recorded index.
> who = getNodeSet(doc, "//div[@class = 'started fr']")
> index= 1:length(who)
> anon_index= which(sapply(who,function(who) is.na(xmlValue(getNodeSet(who,".//div[@class = 'user-details']//a[contains(@href,'/users/')]")[1][[1]]))))
> if (length(anon_index) == 0){
+ who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
+ who=sapply(who, function(who) xmlValue(who))
+ }else{
+ who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
+ who=sapply(who, function(who) xmlValue(who))
+ index= index[-anon_index]
+ who = data.frame(index=index, who = who)
+ anon= data.frame(index=anon_index,who= "anonymous")
+ who = rbind(who,anon)
+ who= who[order(who$index),2]
+ }
> length(who)
[1] 50
Check:
> who[1:10]
[1] "Pork Chop" "javlacalle" "eipi10" "sparkle"
[5] "ialm" "Jordan Browne" "Jordan Browne" "SimonB"
[9] "C8H10N4O2" "Richard Scriven"
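The index-merge trick above can be illustrated on a toy vector (the names and positions below are made up for the illustration, not page data):

```r
# Toy illustration of re-inserting "anonymous" at the recorded positions.
# Suppose a page of 4 posts, of which post 2 was anonymous:
who = c("alice", "bob", "carol")            # the 3 named posters
anon_index = 2                              # position of the anonymous post
index = (1:4)[-anon_index]                  # positions of the named posters

# Attach the original page positions, append the anonymous row,
# then restore page order by sorting on the index column.
merged = rbind(data.frame(index = index, who = who, stringsAsFactors = FALSE),
               data.frame(index = anon_index, who = "anonymous", stringsAsFactors = FALSE))
merged = merged[order(merged$index), "who"]
# merged is now c("alice", "anonymous", "bob", "carol")
```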
> when = getNodeSet(doc,"//div[@class = 'user-action-time']/a/span")
> when= sapply(when,function(when) unname(xmlAttrs(when)[1]))
> length(when)
[1] 50
> when[1:10]
[1] "2015-12-09 18:48:03Z" "2015-12-09 18:45:19Z" "2015-12-09 18:45:15Z"
[4] "2015-12-09 18:41:27Z" "2015-12-09 18:40:42Z" "2015-12-09 18:40:02Z"
[7] "2015-12-09 18:38:22Z" "2015-12-09 18:38:16Z" "2015-12-09 18:36:37Z"
[10] "2015-12-09 18:33:18Z"
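Since the extracted timestamps are plain strings, it can be handy to convert them to POSIXct for any later date arithmetic. A small sketch on two of the values above (the trailing "Z" marks UTC):

```r
# Parse the title-attribute timestamps into POSIXct date-times.
# The literal "Z" in the format string consumes the trailing UTC marker.
when_chr = c("2015-12-09 18:48:03Z", "2015-12-09 18:45:19Z")
when_time = as.POSIXct(when_chr, format = "%Y-%m-%d %H:%M:%SZ", tz = "UTC")

# With real date-times, differences are straightforward:
gap = difftime(when_time[1], when_time[2], units = "secs")
# gap is 164 seconds (2 minutes 44 seconds)
```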
Note: I could have used xpathSApply to write each extraction in one line instead of two. The reason I did not is that it is more convenient to debug the program when I only need to change the argument of getNodeSet. For example:
```r
when = getNodeSet(doc, "//div[@class = 'user-action-time']/a/span")
when[[1]]
# 2 mins ago
when[[19]]
# 37 mins ago
```
By doing this, I can check the contents of the first node and the 19th node. Especially for someone just getting familiar with HTML and XML, this approach made the learning process much faster. That said, here is one example using xpathSApply for this question:
> xpathSApply(doc,"//div[@class = 'user-action-time']/a/span",function(i) xmlGetAttr(i,'title'))
[1] "2015-12-09 18:48:03Z" "2015-12-09 18:45:19Z" "2015-12-09 18:45:15Z"
[4] "2015-12-09 18:41:27Z" "2015-12-09 18:40:42Z" "2015-12-09 18:40:02Z"
[7] "2015-12-09 18:38:22Z" "2015-12-09 18:38:16Z" "2015-12-09 18:36:37Z"
[10] "2015-12-09 18:33:18Z" "2015-12-09 18:30:56Z" "2015-12-09 18:30:34Z"
[13] "2015-12-09 18:30:01Z" "2015-12-09 18:29:38Z" "2015-12-09 18:27:38Z"
[16] "2015-12-09 18:27:14Z" "2015-12-09 18:22:08Z" "2015-12-09 18:18:25Z"
[19] "2015-12-09 18:13:06Z" "2015-12-09 18:03:16Z" "2015-12-09 17:38:25Z"
[22] "2015-12-09 17:34:17Z" "2015-12-09 17:29:25Z" "2015-12-09 17:26:30Z"
[25] "2015-12-09 17:26:18Z" "2015-12-09 17:25:33Z" "2015-12-09 17:20:02Z"
[28] "2015-12-09 17:19:55Z" "2015-12-09 17:17:36Z" "2015-12-09 17:08:23Z"
[31] "2015-12-09 17:08:08Z" "2015-12-09 17:07:39Z" "2015-12-09 17:04:41Z"
[34] "2015-12-09 17:02:50Z" "2015-12-09 16:58:54Z" "2015-12-09 16:56:52Z"
[37] "2015-12-09 16:46:48Z" "2015-12-09 16:46:04Z" "2015-12-09 16:43:00Z"
[40] "2015-12-09 16:42:56Z" "2015-12-09 16:41:45Z" "2015-12-09 16:39:19Z"
[43] "2015-12-09 16:37:15Z" "2015-12-09 16:35:50Z" "2015-12-09 16:35:24Z"
[46] "2015-12-09 16:33:51Z" "2015-12-09 16:30:10Z" "2015-12-09 16:30:05Z"
[49] "2015-12-09 16:27:07Z" "2015-12-09 16:19:03Z"
HTML format of a question title:
<div class="summary">
<h3><a href="/questions/1374842/building-and-installing-an-r-package-library-with-a-jnilib-extension"
class="question-hyperlink">Building and installing an R package library with a jnilib extension</a></h3>
To extract the question titles, we query the nodes inside the div with class 'summary', then go down through h3/a:
> title = getNodeSet(doc,"//div[@class = 'summary']/h3/a")
> title = sapply(title, function(title) xmlValue(title))
> length(title)
[1] 50
> title[1:10]
[1] "Two distributions in with googleVis in R"
[2] "Replacing intercept with dummy variables in ARIMAX models in R"
[3] "Stacked barplot with ggplot2 depending on two variables"
[4] "Plotly: add_trace in a loop"
[5] "vline legends not showing on geom_histogram type plot"
[6] "microarray data, calculating mean gene expression and effect"
[7] "How to code for independent 2 sample t-test (x,y)"
[8] "Linear model with repeated measures factors"
[9] "How to calculate a table of pairwise counts from long-form data frame"
[10] "Determine week number from date over several years"
HTML format of the reputation:
<div class="user-details">
<a href="/users/149223/ira-cooke">Ira Cooke</a><br>
<span class="reputation-score" title="reputation score " dir="ltr">1,117</span>
The information we need is inside the div with class 'user-details', in the span with class 'reputation-score'. Note that it sits in the same 'user-details' class as the poster's name, so an anonymous poster has no reputation score at all. We can therefore reuse the anon_index computed for who to handle this case:
> if (length(anon_index) == 0){
+ reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
+ reputation = sapply(reputation, function(reputation) xmlValue(reputation))
+ reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
+ }else{
+ reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
+ reputation = sapply(reputation, function(reputation) xmlValue(reputation))
+ reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
+ reputation = data.frame(index=index, reputation = reputation)
+ anon_reputation= data.frame(index=anon_index,reputation= 0)
+ reputation = rbind(reputation,anon_reputation)
+ reputation = reputation[order(reputation$index),2]
+ }
> length(reputation)
[1] 50
> reputation[1:10]
[1] 2046 709 134000 1522 3409 8 8 166 2502 438000
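To make the reputation cleanup concrete, here is the same gsub pipeline applied to a few sample strings (made up for the illustration, not actual page data):

```r
# "k" is expanded to "000", then commas, pipes, and periods are removed
# before converting to numeric.
rep_raw = c("1,117", "134k", "2,046")
rep_num = as.numeric(gsub("[,|.]", "", gsub("k", "000", rep_raw)))
# rep_num is c(1117, 134000, 2046)

# Caveat: a fractional score such as "10.1k" would come out as 101000
# rather than 10100, because the period is dropped only after the
# "k" expansion. None of the scraped pages here seem to hit that case.
```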
HTML format of number of views of a question:
<div class="views " title="385 views">
385 views
</div>
<div class="views warm" title="1,870 views">
2k views
</div>
Besides the classes 'views ' and 'views warm', there are also 'views hot' and 'views supernova':
> views = getNodeSet(doc, "//div[@class = 'views ' or @class = 'views warm' or @class = 'views hot' or @class = 'views supernova']")
> views = sapply(views,function(views) as.numeric(gsub("\\D","",xmlGetAttr(views,"title"))))
> length(views)
[1] 50
> views[1:10]
[1] 13 5 9 5 5 20 48 16 488 13
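Note why the code reads the title attribute rather than the visible text: the title holds the exact count ("1,870 views") while the displayed text may be rounded ("2k views"). A small sketch on a sample string:

```r
# Stripping every non-digit character with "\\D" recovers the exact
# view count from the title attribute (sample string, not page data).
title_attr = "1,870 views"
views = as.numeric(gsub("\\D", "", title_attr))
# views is 1870
```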
HTML format of number of answers of posted questions:
<div class="status answered-accepted">
<strong>1</strong>answer
</div>
Besides the class 'status answered-accepted', there are also the alternative classes 'status unanswered' and 'status answered'. I use the regex "\D" to keep only the digits after taking the xmlValue of each node:
> answers = getNodeSet(doc, "//div[@class = 'status unanswered' or @class = 'status answered-accepted' or @class = 'status answered']")
> answers = sapply(answers, function(answers) as.numeric(gsub("\\D","",xmlValue(answers))))
> length(answers)
[1] 50
> answers[1:10]
[1] 1 1 1 0 1 0 1 0 4 0
HTML format of number of votes:
<div class="votes">
<span class="vote-count-post "><strong>0</strong></span>
<div class="viewcount">votes</div>
</div>
The information is in the div with class 'votes':
> votes = getNodeSet(doc, "//div[@class = 'votes']")
> votes = sapply(votes, function(votes) as.numeric(gsub("\\D","",xmlValue(votes))))
> length(votes)
[1] 50
> votes[1:10]
[1] 1 0 1 0 0 2 0 0 2 1
HTML format of question URL:
<div class="summary">
<h3><a href="/questions/1374842/building-and-installing-an-r-package-library-with-a-jnilib-extension"
class="question-hyperlink">Building and installing an R package library with a jnilib extension</a></h3>
It is in the href attribute of the a element inside the div with class 'summary':
```r
questionURL = getNodeSet(doc, "//div[@class = 'summary']/h3/a")
baseURL = "http://stackoverflow.com"
questionURL = sapply(questionURL, function(questionURL) paste(baseURL, unname(xmlAttrs(questionURL)[1]), sep = ""))
length(questionURL)
[1] 50
questionURL[1:10]
 [1] "http://stackoverflow.com/questions/34183203/two-distributions-in-with-googlevis-in-r"
 [2] "http://stackoverflow.com/questions/34182971/replacing-intercept-with-dummy-variables-in-arimax-models-in-r"
 [3] "http://stackoverflow.com/questions/34186123/stacked-barplot-with-ggplot2-depending-on-two-variables"
 [4] "http://stackoverflow.com/questions/34186560/plotly-add-trace-in-a-loop"
 [5] "http://stackoverflow.com/questions/34186081/vline-legends-not-showing-on-geom-histogram-type-plot"
 [6] "http://stackoverflow.com/questions/34123983/microarray-data-calculating-mean-gene-expression-and-effect"
 [7] "http://stackoverflow.com/questions/34147163/how-to-code-for-independent-2-sample-t-test-x-y"
 [8] "http://stackoverflow.com/questions/34185719/linear-model-with-repeated-measures-factors"
 [9] "http://stackoverflow.com/questions/13176741/how-to-calculate-a-table-of-pairwise-counts-from-long-form-data-frame"
[10] "http://stackoverflow.com/questions/34186408/determine-week-number-from-date-over-several-years"
```
HTML format of id of posted questions:
<div class="question-summary" id="question-summary-1374842">
It is in the id attribute of the div with class 'question-summary'. However, that id value contains the prefix question-summary-; to remove it, I use the regex "\D":
> id = getNodeSet(doc, "//div[@class = 'question-summary']")
> id = sapply(id, function(id) as.numeric(gsub("\\D","",xmlAttrs(id)[2])))
> length(id)
[1] 50
> id[1:10]
[1] 34183203 34182971 34186123 34186560 34186081 34123983 34147163 34185719
[9] 13176741 34186408
HTML format for the tags of posted questions:
<div class="tags t-python t-r t-matrix">
<a href="/questions/tagged/python" class="post-tag" title="show questions tagged 'python'" rel="tag">python</a> <a href="/questions/tagged/r" class="post-tag" title="show questions tagged 'r'" rel="tag">r</a> <a href="/questions/tagged/matrix" class="post-tag" title="show questions tagged 'matrix'" rel="tag">matrix</a>
</div>
Getting the right node is quite easy; I use the XPath "//div[@class = 'summary']//div[contains(@class,'tags')]". However, the xmlValue of such a node looks like this:
> tags = getNodeSet(doc, "//div[@class = 'summary']//div[contains(@class,'tags')]")
> xmlValue(tags[[1]])
[1] "\r\n r shiny googlevis density-plot \r\n "
So I use the regex "^\\r\\n\s+((\S+\s)+)\\r\\n\s+" to strip the \r\n and surrounding whitespace from the beginning and end of the string, then insert "; " between the tags, following professor Duncan's example format of tags.
> tags= sapply(tags, function(tags) gsub("\\s","; ",gsub("\\s$","",gsub("^\\\r\\\n\\s+((\\S+\\s)+)\\\r\\\n\\s+","\\1",xmlValue(tags)))))
> length(tags)
[1] 50
> tags[1:10]
[1] "r; shiny; googlevis; density-plot" "r; time-series; intercept"
[3] "r; ggplot2" "r; plot; ggplot2; plotly"
[5] "r; ggplot2" "r; expression; effect"
[7] "r" "r; mixed-models"
[9] "r; count; data.frame; long-form" "r; date; cycle"
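As an aside, the same cleanup can be written with trimws() (in base R since 3.2.0), which may be easier to read than the \r\n regex. A sketch on a sample string shaped like the raw xmlValue above:

```r
# trimws() strips whitespace (including \r\n) from both ends, then a
# single gsub collapses the inner whitespace runs into "; " separators.
raw = "\r\n        r shiny googlevis density-plot \r\n    "
tags = gsub("\\s+", "; ", trimws(raw))
# tags is "r; shiny; googlevis; density-plot"
```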
HTML format of next button:
<div class = "pager fl">
<a href="/questions/tagged/r?page=2&sort=active&pagesize=50" rel="next" title="go to page 2"> <span class="page-numbers next"> next</span> </a>
It is in the a element with rel = 'next':
> next_button = getNodeSet(doc, "//a[@rel='next']")
> next_button_url = paste(baseURL,xmlGetAttr(next_button[[1]],"href"),sep ="")
> next_button_url
[1] "http://stackoverflow.com/questions/tagged/r?page=2&sort=active&pagesize=50"
> df= data.frame(id=id, date= when, tags=tags, title=title, url = questionURL, views= views, votes= votes,answers= answers, user= who, reputation = reputation)
> head(df)
id date tags
1 34183203 2015-12-09 18:48:03Z r; shiny; googlevis; density-plot
2 34182971 2015-12-09 18:45:19Z r; time-series; intercept
3 34186123 2015-12-09 18:45:15Z r; ggplot2
4 34186560 2015-12-09 18:41:27Z r; plot; ggplot2; plotly
5 34186081 2015-12-09 18:40:42Z r; ggplot2
6 34123983 2015-12-09 18:40:02Z r; expression; effect
title
1 Two distributions in with googleVis in R
2 Replacing intercept with dummy variables in ARIMAX models in R
3 Stacked barplot with ggplot2 depending on two variables
4 Plotly: add_trace in a loop
5 vline legends not showing on geom_histogram type plot
6 microarray data, calculating mean gene expression and effect
url
1 http://stackoverflow.com/questions/34183203/two-distributions-in-with-googlevis-in-r
2 http://stackoverflow.com/questions/34182971/replacing-intercept-with-dummy-variables-in-arimax-models-in-r
3 http://stackoverflow.com/questions/34186123/stacked-barplot-with-ggplot2-depending-on-two-variables
4 http://stackoverflow.com/questions/34186560/plotly-add-trace-in-a-loop
5 http://stackoverflow.com/questions/34186081/vline-legends-not-showing-on-geom-histogram-type-plot
6 http://stackoverflow.com/questions/34123983/microarray-data-calculating-mean-gene-expression-and-effect
views votes answers user reputation
1 13 1 1 Pork Chop 2046
2 5 0 1 javlacalle 709
3 9 1 1 eipi10 134000
4 5 0 0 sparkle 1522
5 5 0 1 ialm 3409
6 20 2 0 Jordan Browne 8
> dim(df)
[1] 50 10
I create a function called page_df to scrape all the necessary information from a Stack Overflow page into a data frame:
page_df =
function(url_link)
{
doc = htmlParse(url_link)
who = getNodeSet(doc, "//div[@class = 'started fr']")
index= 1:length(who)
anon_index= which(sapply(who,function(who) is.na(xmlValue(getNodeSet(who,".//div[@class = 'user-details']//a[contains(@href,'/users/')]")[1][[1]]))))
if (length(anon_index) == 0){
who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
who=sapply(who, function(who) xmlValue(who))
}else{
who = getNodeSet(doc, "//div[@class = 'user-details']//a[contains(@href,'/users/')]")
who=sapply(who, function(who) xmlValue(who))
index= index[-anon_index]
who = data.frame(index=index, who = who)
anon= data.frame(index=anon_index,who= "anonymous")
who = rbind(who,anon)
who= who[order(who$index),2]
}
when = getNodeSet(doc,"//div[@class = 'user-action-time']/a/span")
when= sapply(when,function(when) unname(xmlAttrs(when)[1]))
title = getNodeSet(doc,"//div[@class = 'summary']/h3/a")
title = sapply(title, function(title) xmlValue(title))
if (length(anon_index) == 0){
reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
reputation = sapply(reputation, function(reputation) xmlValue(reputation))
reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
}else{
reputation = getNodeSet(doc,"//div[@class = 'user-details']//span[@class = 'reputation-score']")
reputation = sapply(reputation, function(reputation) xmlValue(reputation))
reputation = as.numeric(gsub("[,|.]","",gsub("[k]","000",reputation)))
reputation = data.frame(index=index, reputation = reputation)
anon_reputation= data.frame(index=anon_index,reputation= 0)
reputation = rbind(reputation,anon_reputation)
reputation = reputation[order(reputation$index),2]
}
views = getNodeSet(doc, "//div[@class = 'views ' or @class = 'views warm' or @class = 'views hot' or @class = 'views supernova']")
views = sapply(views,function(views) as.numeric(gsub("\\D","",xmlGetAttr(views,"title"))))
answersNode = getNodeSet(doc,"//div[@class='status unanswered' or @class='status answered-accepted' or @class='status answered']")
answers = sapply(answersNode, function(answer) as.numeric(gsub("\\D","",xmlValue(answer))))
votes = getNodeSet(doc, "//div[@class='votes']")
votes = sapply(votes, function(votes) as.numeric(gsub("\\D","",xmlValue(votes))))
questionURL = getNodeSet(doc, "//div[@class = 'summary']/h3/a")
baseURL = "http://stackoverflow.com"
questionURL = sapply(questionURL,function(questionURL) paste(baseURL,unname(xmlAttrs(questionURL)[1]),sep = ""))
id = getNodeSet(doc, "//div[@class = 'question-summary']")
id = sapply(id, function(id) as.numeric(gsub("\\D","",xmlAttrs(id)[2])))
tags = getNodeSet(doc, "//div[@class = 'summary']//div[contains(@class,'tags')]")
tags= sapply(tags, function(tags) gsub("\\s","; ",gsub("\\s$","",gsub("^\\\r\\\n\\s+((\\S+\\s)+)\\\r\\\n\\s+","\\1",xmlValue(tags)))))
df= data.frame(id=id, date= when, tags=tags, title=title, url = questionURL, views= views, votes= votes,answers= answers, user= who, reputation = reputation)
}
For convenience and code reuse, I put it into an R file named function.R.
Now we move on to scraping all of the Stack Overflow R questions:
u= "http://stackoverflow.com/questions/tagged/r?page=1&sort=active&pagesize=50"
doc= htmlParse(u)
df = page_df(u)
repeat{
  next_button = getNodeSet(doc, "//a[@rel='next']")
  if (length(next_button) == 0) break   # last page: no 'next' link left
  next_button_url = paste(baseURL, xmlGetAttr(next_button[[1]], "href"), sep = "")
  doc = htmlParse(next_button_url)
  new_df = page_df(next_button_url)
  df = rbind(df, new_df)
}
After about 40 minutes, it finished scraping 2331 pages of R questions, producing a data frame named df with 116541 rows:
> dim(df)
[1] 116541 10
> head(df)
id date tags
1 34126854 2015-12-07 05:09:51Z r; if-statement; lubridate
2 34126806 2015-12-07 05:07:04Z r; dplyr; caret; subsampling
3 34124928 2015-12-07 05:02:04Z r; join; data.table
4 34126026 2015-12-07 05:00:36Z r
5 34126467 2015-12-07 04:58:02Z regex; r
6 34126673 2015-12-07 04:52:46Z r
title
1 Run script else quit in R
2 k-fold cross validation with different sample sizes
3 Can I use the R data.table join capability to select rows and perform some operation?
4 Mean and standard deviation by groups
5 merging two data sets on the basis of two columns
6 Is there any way to use blast+ to blast a query protein sequence against a single genome without making a database of it in R? [on hold]
url
1 http://stackoverflow.com/questions/34126854/run-script-else-quit-in-r
2 http://stackoverflow.com/questions/34126806/k-fold-cross-validation-with-different-sample-sizes
3 http://stackoverflow.com/questions/34124928/can-i-use-the-r-data-table-join-capability-to-select-rows-and-perform-some-opera
4 http://stackoverflow.com/questions/34126026/mean-and-standard-deviation-by-groups
5 http://stackoverflow.com/questions/34126467/merging-two-data-sets-on-the-basis-of-two-columns
6 http://stackoverflow.com/questions/34126673/is-there-any-way-to-use-blast-to-blast-a-query-protein-sequence-against-a-singl
views votes answers user reputation
1 3 0 0 Rime 277
2 5 0 0 Pascal 6461
3 25 4 1 Richard Scriven 437000
4 31 3 3 akrun 121000
5 11 1 1 bramtayl 2006
6 13 4 0 Pascal 6461
> load("rQAs.rda")
> data = rQAs
> dim(data)
[1] 58096 10
The distribution of the number of questions each person answered is:
answer_classification = data[data$type=="answer",1]
answer_distribution = table(answer_classification)
Subset data$text by question type (data$type). The total number of questions is 10004, and 956 of them mention ggplot:
```r
text_by_question = data$text[data$type == "question"]
length(text_by_question)
[1] 10004
length(which(grepl("ggplot", text_by_question)))
[1] 956
```
There are 682 questions mentioning XML, 125 mentioning HTML, and 0 mentioning "Web Scraping":
> length(which(grepl("XML", text_by_question)))
[1] 682
> length(which(grepl("HTML", text_by_question)))
[1] 125
> length(which(grepl("Web Scraping", text_by_question)))
[1] 0
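One likely reason the last count is 0 is that grepl() is case-sensitive by default, so posts writing "web scraping" in lowercase would not match. A small sketch on made-up strings (not the rQAs data):

```r
# grepl() matches case-sensitively unless ignore.case = TRUE is set.
x = c("help with web scraping in R", "Web Scraping question")
sum(grepl("Web Scraping", x))                       # matches only the second
sum(grepl("web scraping", x, ignore.case = TRUE))   # matches both
```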