R跟單車熱有什麼關係?基本上是沒關係,但是我們可以用R來分析到底有沒有退(亂掰~)
比如說我們想了解在01的登山車討論區,每年發表的文章數情況。
首先輸入網址(先用第一頁做測試),並且讀取網頁原始碼
x <- "http://www.mobile01.com/topiclist.php?f=315&p=1"
txt <- readLines(x, encoding = "UTF-8")
如果用Google-Chrome的話可以按右鍵->檢查元素
大概就可以知道我們要找的資料在哪邊了
在一大堆網頁原始碼中找到我們要的資料, grep( 條件 , 資料來源 ) 可以告訴你,符合條件的資料在哪裡。
其中 <p>2012-07-03 11:36</p> 裡面的日期就是我們要的資料。
txt <- txt[grep("img=0", txt)]
txt[1]
## [1] " <td width=\"17%\" class=\"authur\"><a href=\"topicdetail.php?f=315&t=2822797&p=1&img=0\"><p>2012-07-03 11:36</p><p>mine_city</p></a></td>"
temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt),
byrow = T)[, 2]
temp
## [1] "2012-07-03 11:36</p>" "2012-07-05 20:23</p>" "2009-06-15 21:45</p>"
## [4] "2010-10-11 01:35</p>" "2012-07-03 23:04</p>" "2012-07-04 21:01</p>"
## [7] "2012-03-30 21:03</p>" "2012-06-28 16:27</p>" "2010-07-30 10:08</p>"
## [10] "2012-07-04 09:50</p>" "2012-07-03 10:55</p>" "2012-07-02 12:26</p>"
## [13] "2012-07-02 14:31</p>" "2012-06-19 23:55</p>" "2012-06-25 13:12</p>"
## [16] "2011-04-28 23:23</p>" "2007-10-20 00:00</p>" "2012-06-29 23:06</p>"
## [19] "2007-05-09 22:35</p>" "2008-07-21 01:00</p>" "2012-06-24 20:12</p>"
## [22] "2012-06-29 01:30</p>" "2012-06-29 09:19</p>" "2012-06-27 13:51</p>"
## [25] "2011-11-07 17:41</p>" "2012-06-18 02:35</p>" "2009-12-11 02:53</p>"
## [28] "2012-06-22 18:38</p>" "2012-06-21 00:35</p>"
去頭去尾後就是我們需要的資料啦
temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt),
byrow = T)[, 1:5]
temp
## [,1] [,2] [,3] [,4] [,5]
## [1,] "2012" "07" "03" "11" "36"
## [2,] "2012" "07" "05" "20" "23"
## [3,] "2009" "06" "15" "21" "45"
## [4,] "2010" "10" "11" "01" "35"
## [5,] "2012" "07" "03" "23" "04"
## [6,] "2012" "07" "04" "21" "01"
## [7,] "2012" "03" "30" "21" "03"
## [8,] "2012" "06" "28" "16" "27"
## [9,] "2010" "07" "30" "10" "08"
## [10,] "2012" "07" "04" "09" "50"
## [11,] "2012" "07" "03" "10" "55"
## [12,] "2012" "07" "02" "12" "26"
## [13,] "2012" "07" "02" "14" "31"
## [14,] "2012" "06" "19" "23" "55"
## [15,] "2012" "06" "25" "13" "12"
## [16,] "2011" "04" "28" "23" "23"
## [17,] "2007" "10" "20" "00" "00"
## [18,] "2012" "06" "29" "23" "06"
## [19,] "2007" "05" "09" "22" "35"
## [20,] "2008" "07" "21" "01" "00"
## [21,] "2012" "06" "24" "20" "12"
## [22,] "2012" "06" "29" "01" "30"
## [23,] "2012" "06" "29" "09" "19"
## [24,] "2012" "06" "27" "13" "51"
## [25,] "2011" "11" "07" "17" "41"
## [26,] "2012" "06" "18" "02" "35"
## [27,] "2009" "12" "11" "02" "53"
## [28,] "2012" "06" "22" "18" "38"
## [29,] "2012" "06" "21" "00" "35"
最後把文字轉成數字
matrix(as.integer(temp), nrow = length(txt))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2012 7 3 11 36
## [2,] 2012 7 5 20 23
## [3,] 2009 6 15 21 45
## [4,] 2010 10 11 1 35
## [5,] 2012 7 3 23 4
## [6,] 2012 7 4 21 1
## [7,] 2012 3 30 21 3
## [8,] 2012 6 28 16 27
## [9,] 2010 7 30 10 8
## [10,] 2012 7 4 9 50
## [11,] 2012 7 3 10 55
## [12,] 2012 7 2 12 26
## [13,] 2012 7 2 14 31
## [14,] 2012 6 19 23 55
## [15,] 2012 6 25 13 12
## [16,] 2011 4 28 23 23
## [17,] 2007 10 20 0 0
## [18,] 2012 6 29 23 6
## [19,] 2007 5 9 22 35
## [20,] 2008 7 21 1 0
## [21,] 2012 6 24 20 12
## [22,] 2012 6 29 1 30
## [23,] 2012 6 29 9 19
## [24,] 2012 6 27 13 51
## [25,] 2011 11 7 17 41
## [26,] 2012 6 18 2 35
## [27,] 2009 12 11 2 53
## [28,] 2012 6 22 18 38
## [29,] 2012 6 21 0 35
到這邊才完成20%,可是通常需要80%的努力 XD
接下來把程式包起來
f <- function(x) {
txt <- readLines(x, encoding = "UTF-8")
txt <- txt[grep("img=0", txt)]
temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt), byrow = T)[,
2]
temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt),
byrow = T)[, 1:5]
matrix(as.integer(temp), nrow = length(txt))
}
設定好網址後,開始翻滾啦!!
Sys.sleep(1)讓每次計算都延遲一秒後在繼續,因為如果連續大量讀取網頁的話,會被當成是在攻擊XD
所以這邊為求方便就用 for 啦!(自從學了Mathematica之後,是幾乎不用迴圈的)
x <- paste("http://www.mobile01.com/topiclist.php?f=315&p=", 1:433,
sep = "")
txt <- f(x[1])
for (i in 2:length(x)) {
txt <- rbind(txt, f(x[i]))
Sys.sleep(1)
}
這麼重要的東西還是先存起來比較妥當
write.csv(txt, file = "mytxt.csv", row.names = F)
read.csv("mytxt.csv")
畫些圖看看
barplot(table(txt[, 1]))
嗯~單車熱真的退了~(其實是回復正常,之前那叫不正常)
以月份來看
barplot(table(txt[, 2]))
看起來好像差不多,可以檢定看看
chisq.test(table(txt[, 2]))
##
## Chi-squared test for given probabilities
##
## data: table(txt[, 2])
## X-squared = 132.1, df = 11, p-value < 2.2e-16
##
以時間來看
barplot(table(txt[, 4]))
大多在凌晨,清晨最少
以每年每月來看
barplot(table(apply(txt[, 1:2], 1, function(x) paste(x[1], "-", x[2]))))
就這樣。
結論:
雖然回不到過去那種盛況,也證明了"蛋塔熱"、"單車熱"這種一窩蜂的熱潮遲早會退,
不過熱陶退去後騎車的人還是有比以前增加,也算好事嚕。
聽說這種現像12年一輪,蛋塔、單車,下一個會是?
我希望是R,大家一窩蜂來學R吧!!!!!XD
MGM Grand, Las Vegas: The Dream at the MGM Grand
回覆刪除MGM Grand Hotel 목포 출장마사지 in 울산광역 출장안마 Las Vegas features 50000 square 서울특별 출장안마 feet of meeting and event space, including 제주 출장안마 a retail arcade, six restaurants, 서귀포 출장안마