統計與R: R 與自行車（單車熱已退！？）

R跟單車熱有什麼關係？基本上是沒關係，但是我們可以用R來分析到底有沒有退（亂掰～）

比如說我們想了解在01的登山車討論區，每年發表的文章數情況。

首先輸入網址（先用第一頁做測試），並且讀取網頁原始碼

x <- "http://www.mobile01.com/topiclist.php?f=315&p=1"
txt <- readLines(x, encoding = "UTF-8")

如果用Google-Chrome的話可以按右鍵->檢查元素

大概就可以知道我們要找的資料在哪邊了

在一大堆網頁原始碼中找到我們要的資料， grep( 條件 , 資料來源 ) 可以告訴你，符合條件的資料在哪裡。

其中 <p>2012-07-03 11:36</p> 裡面的日期就是我們要的資料。

txt <- txt[grep("img=0", txt)]
txt[1]

## [1] "                <td width=\"17%\" class=\"authur\"><a href=\"topicdetail.php?f=315&t=2822797&p=1&img=0\"><p>2012-07-03 11:36</p><p>mine_city</p></a></td>"

找到後還要經過一些處理，資料萃取出來。利用 strsplit( 來源資料 , 條件 ) 把資料切開。

temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt), 
    byrow = T)[, 2]
temp

##  [1] "2012-07-03 11:36</p>" "2012-07-05 20:23</p>" "2009-06-15 21:45</p>"
##  [4] "2010-10-11 01:35</p>" "2012-07-03 23:04</p>" "2012-07-04 21:01</p>"
##  [7] "2012-03-30 21:03</p>" "2012-06-28 16:27</p>" "2010-07-30 10:08</p>"
## [10] "2012-07-04 09:50</p>" "2012-07-03 10:55</p>" "2012-07-02 12:26</p>"
## [13] "2012-07-02 14:31</p>" "2012-06-19 23:55</p>" "2012-06-25 13:12</p>"
## [16] "2011-04-28 23:23</p>" "2007-10-20 00:00</p>" "2012-06-29 23:06</p>"
## [19] "2007-05-09 22:35</p>" "2008-07-21 01:00</p>" "2012-06-24 20:12</p>"
## [22] "2012-06-29 01:30</p>" "2012-06-29 09:19</p>" "2012-06-27 13:51</p>"
## [25] "2011-11-07 17:41</p>" "2012-06-18 02:35</p>" "2009-12-11 02:53</p>"
## [28] "2012-06-22 18:38</p>" "2012-06-21 00:35</p>"

去頭去尾後就是我們需要的資料啦

temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt), 
    byrow = T)[, 1:5]
temp

##       [,1]   [,2] [,3] [,4] [,5]
##  [1,] "2012" "07" "03" "11" "36"
##  [2,] "2012" "07" "05" "20" "23"
##  [3,] "2009" "06" "15" "21" "45"
##  [4,] "2010" "10" "11" "01" "35"
##  [5,] "2012" "07" "03" "23" "04"
##  [6,] "2012" "07" "04" "21" "01"
##  [7,] "2012" "03" "30" "21" "03"
##  [8,] "2012" "06" "28" "16" "27"
##  [9,] "2010" "07" "30" "10" "08"
## [10,] "2012" "07" "04" "09" "50"
## [11,] "2012" "07" "03" "10" "55"
## [12,] "2012" "07" "02" "12" "26"
## [13,] "2012" "07" "02" "14" "31"
## [14,] "2012" "06" "19" "23" "55"
## [15,] "2012" "06" "25" "13" "12"
## [16,] "2011" "04" "28" "23" "23"
## [17,] "2007" "10" "20" "00" "00"
## [18,] "2012" "06" "29" "23" "06"
## [19,] "2007" "05" "09" "22" "35"
## [20,] "2008" "07" "21" "01" "00"
## [21,] "2012" "06" "24" "20" "12"
## [22,] "2012" "06" "29" "01" "30"
## [23,] "2012" "06" "29" "09" "19"
## [24,] "2012" "06" "27" "13" "51"
## [25,] "2011" "11" "07" "17" "41"
## [26,] "2012" "06" "18" "02" "35"
## [27,] "2009" "12" "11" "02" "53"
## [28,] "2012" "06" "22" "18" "38"
## [29,] "2012" "06" "21" "00" "35"

最後把文字轉成數字

matrix(as.integer(temp), nrow = length(txt))

##       [,1] [,2] [,3] [,4] [,5]
##  [1,] 2012    7    3   11   36
##  [2,] 2012    7    5   20   23
##  [3,] 2009    6   15   21   45
##  [4,] 2010   10   11    1   35
##  [5,] 2012    7    3   23    4
##  [6,] 2012    7    4   21    1
##  [7,] 2012    3   30   21    3
##  [8,] 2012    6   28   16   27
##  [9,] 2010    7   30   10    8
## [10,] 2012    7    4    9   50
## [11,] 2012    7    3   10   55
## [12,] 2012    7    2   12   26
## [13,] 2012    7    2   14   31
## [14,] 2012    6   19   23   55
## [15,] 2012    6   25   13   12
## [16,] 2011    4   28   23   23
## [17,] 2007   10   20    0    0
## [18,] 2012    6   29   23    6
## [19,] 2007    5    9   22   35
## [20,] 2008    7   21    1    0
## [21,] 2012    6   24   20   12
## [22,] 2012    6   29    1   30
## [23,] 2012    6   29    9   19
## [24,] 2012    6   27   13   51
## [25,] 2011   11    7   17   41
## [26,] 2012    6   18    2   35
## [27,] 2009   12   11    2   53
## [28,] 2012    6   22   18   38
## [29,] 2012    6   21    0   35

到這邊才完成20%，可是通常需要80%的努力 XD

接下來把程式包起來

f <- function(x) {
    txt <- readLines(x, encoding = "UTF-8")
    txt <- txt[grep("img=0", txt)]
    temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt), byrow = T)[, 
        2]
    temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt), 
        byrow = T)[, 1:5]
    matrix(as.integer(temp), nrow = length(txt))
}

設定好網址後，開始翻滾啦！！

Sys.sleep(1)讓每次計算都延遲一秒後在繼續，因為如果連續大量讀取網頁的話，會被當成是在攻擊XD
所以這邊為求方便就用 for 啦！（自從學了Mathematica之後，是幾乎不用迴圈的）

x <- paste("http://www.mobile01.com/topiclist.php?f=315&p=", 1:433, 
    sep = "")
txt <- f(x[1])
for (i in 2:length(x)) {
    txt <- rbind(txt, f(x[i]))
    Sys.sleep(1)
}

這麼重要的東西還是先存起來比較妥當

write.csv(txt, file = "mytxt.csv", row.names = F)
read.csv("mytxt.csv")

畫些圖看看

barplot(table(txt[, 1]))

嗯～單車熱真的退了～（其實是回復正常，之前那叫不正常）

以月份來看

barplot(table(txt[, 2]))

看起來好像差不多，可以檢定看看

chisq.test(table(txt[, 2]))

## 
##  Chi-squared test for given probabilities
## 
## data:  table(txt[, 2]) 
## X-squared = 132.1, df = 11, p-value < 2.2e-16
##

以時間來看

barplot(table(txt[, 4]))

大多在凌晨，清晨最少

以每年每月來看

barplot(table(apply(txt[, 1:2], 1, function(x) paste(x[1], "-", x[2]))))

就這樣。

結論：

雖然回不到過去那種盛況，也證明了"蛋塔熱"、"單車熱"這種一窩蜂的熱潮遲早會退，

不過熱陶退去後騎車的人還是有比以前增加，也算好事嚕。

聽說這種現像12年一輪，蛋塔、單車，下一個會是？

我希望是R，大家一窩蜂來學R吧！！！！！XD

統計與R

網頁

2012年7月5日星期四

R 與自行車（單車熱已退！？）

1 則留言:

網頁

2012年7月5日 星期四

R 與自行車（單車熱已退！？）

1 則留言:

2012年7月5日星期四