2012年7月5日 星期四

R 與自行車(單車熱已退!?)

R跟單車熱有什麼關係?基本上是沒關係,但是我們可以用R來分析到底有沒有退(亂掰~)


比如說我們想了解在01的登山車討論區,每年發表的文章數情況。




首先輸入網址(先用第一頁做測試),並且讀取網頁原始碼
x <- "http://www.mobile01.com/topiclist.php?f=315&p=1"
txt <- readLines(x, encoding = "UTF-8")


如果用Google-Chrome的話可以按右鍵->檢查元素


大概就可以知道我們要找的資料在哪邊了


在一大堆網頁原始碼中找到我們要的資料, grep( 條件 , 資料來源 ) 可以告訴你,符合條件的資料在哪裡。
其中 <p>2012-07-03 11:36</p> 裡面的日期就是我們要的資料。
txt <- txt[grep("img=0", txt)]
txt[1]
## [1] "                <td width=\"17%\" class=\"authur\"><a href=\"topicdetail.php?f=315&t=2822797&p=1&img=0\"><p>2012-07-03 11:36</p><p>mine_city</p></a></td>"


找到後還要經過一些處理,資料萃取出來。利用 strsplit( 來源資料 , 條件 ) 把資料切開。
temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt), 
    byrow = T)[, 2]
temp
##  [1] "2012-07-03 11:36</p>" "2012-07-05 20:23</p>" "2009-06-15 21:45</p>"
##  [4] "2010-10-11 01:35</p>" "2012-07-03 23:04</p>" "2012-07-04 21:01</p>"
##  [7] "2012-03-30 21:03</p>" "2012-06-28 16:27</p>" "2010-07-30 10:08</p>"
## [10] "2012-07-04 09:50</p>" "2012-07-03 10:55</p>" "2012-07-02 12:26</p>"
## [13] "2012-07-02 14:31</p>" "2012-06-19 23:55</p>" "2012-06-25 13:12</p>"
## [16] "2011-04-28 23:23</p>" "2007-10-20 00:00</p>" "2012-06-29 23:06</p>"
## [19] "2007-05-09 22:35</p>" "2008-07-21 01:00</p>" "2012-06-24 20:12</p>"
## [22] "2012-06-29 01:30</p>" "2012-06-29 09:19</p>" "2012-06-27 13:51</p>"
## [25] "2011-11-07 17:41</p>" "2012-06-18 02:35</p>" "2009-12-11 02:53</p>"
## [28] "2012-06-22 18:38</p>" "2012-06-21 00:35</p>"


去頭去尾後就是我們需要的資料啦
temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt), 
    byrow = T)[, 1:5]
temp
##       [,1]   [,2] [,3] [,4] [,5]
##  [1,] "2012" "07" "03" "11" "36"
##  [2,] "2012" "07" "05" "20" "23"
##  [3,] "2009" "06" "15" "21" "45"
##  [4,] "2010" "10" "11" "01" "35"
##  [5,] "2012" "07" "03" "23" "04"
##  [6,] "2012" "07" "04" "21" "01"
##  [7,] "2012" "03" "30" "21" "03"
##  [8,] "2012" "06" "28" "16" "27"
##  [9,] "2010" "07" "30" "10" "08"
## [10,] "2012" "07" "04" "09" "50"
## [11,] "2012" "07" "03" "10" "55"
## [12,] "2012" "07" "02" "12" "26"
## [13,] "2012" "07" "02" "14" "31"
## [14,] "2012" "06" "19" "23" "55"
## [15,] "2012" "06" "25" "13" "12"
## [16,] "2011" "04" "28" "23" "23"
## [17,] "2007" "10" "20" "00" "00"
## [18,] "2012" "06" "29" "23" "06"
## [19,] "2007" "05" "09" "22" "35"
## [20,] "2008" "07" "21" "01" "00"
## [21,] "2012" "06" "24" "20" "12"
## [22,] "2012" "06" "29" "01" "30"
## [23,] "2012" "06" "29" "09" "19"
## [24,] "2012" "06" "27" "13" "51"
## [25,] "2011" "11" "07" "17" "41"
## [26,] "2012" "06" "18" "02" "35"
## [27,] "2009" "12" "11" "02" "53"
## [28,] "2012" "06" "22" "18" "38"
## [29,] "2012" "06" "21" "00" "35"


最後把文字轉成數字
matrix(as.integer(temp), nrow = length(txt))
##       [,1] [,2] [,3] [,4] [,5]
##  [1,] 2012    7    3   11   36
##  [2,] 2012    7    5   20   23
##  [3,] 2009    6   15   21   45
##  [4,] 2010   10   11    1   35
##  [5,] 2012    7    3   23    4
##  [6,] 2012    7    4   21    1
##  [7,] 2012    3   30   21    3
##  [8,] 2012    6   28   16   27
##  [9,] 2010    7   30   10    8
## [10,] 2012    7    4    9   50
## [11,] 2012    7    3   10   55
## [12,] 2012    7    2   12   26
## [13,] 2012    7    2   14   31
## [14,] 2012    6   19   23   55
## [15,] 2012    6   25   13   12
## [16,] 2011    4   28   23   23
## [17,] 2007   10   20    0    0
## [18,] 2012    6   29   23    6
## [19,] 2007    5    9   22   35
## [20,] 2008    7   21    1    0
## [21,] 2012    6   24   20   12
## [22,] 2012    6   29    1   30
## [23,] 2012    6   29    9   19
## [24,] 2012    6   27   13   51
## [25,] 2011   11    7   17   41
## [26,] 2012    6   18    2   35
## [27,] 2009   12   11    2   53
## [28,] 2012    6   22   18   38
## [29,] 2012    6   21    0   35

到這邊才完成20%,可是通常需要80%的努力 XD


接下來把程式包起來
f <- function(x) {
    txt <- readLines(x, encoding = "UTF-8")
    txt <- txt[grep("img=0", txt)]
    temp <- matrix(unlist(strsplit(txt, "<p>")), nrow = length(txt), byrow = T)[, 
        2]
    temp <- matrix(unlist(strsplit(temp, "[[:punct:] ]")), nrow = length(txt), 
        byrow = T)[, 1:5]
    matrix(as.integer(temp), nrow = length(txt))
}


設定好網址後,開始翻滾啦!!


Sys.sleep(1)讓每次計算都延遲一秒後在繼續,因為如果連續大量讀取網頁的話,會被當成是在攻擊XD
所以這邊為求方便就用 for 啦!(自從學了Mathematica之後,是幾乎不用迴圈的)
x <- paste("http://www.mobile01.com/topiclist.php?f=315&p=", 1:433, 
    sep = "")
txt <- f(x[1])
for (i in 2:length(x)) {
    txt <- rbind(txt, f(x[i]))
    Sys.sleep(1)
}


這麼重要的東西還是先存起來比較妥當
write.csv(txt, file = "mytxt.csv", row.names = F)
read.csv("mytxt.csv")


畫些圖看看
barplot(table(txt[, 1]))
plot of chunk unnamed-chunk-10
嗯~單車熱真的退了~(其實是回復正常,之前那叫不正常)


以月份來看
barplot(table(txt[, 2]))
plot of chunk unnamed-chunk-11

看起來好像差不多,可以檢定看看
chisq.test(table(txt[, 2]))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(txt[, 2]) 
## X-squared = 132.1, df = 11, p-value < 2.2e-16
## 


以時間來看
barplot(table(txt[, 4]))
plot of chunk unnamed-chunk-12
大多在凌晨,清晨最少


以每年每月來看
barplot(table(apply(txt[, 1:2], 1, function(x) paste(x[1], "-", x[2]))))
plot of chunk unnamed-chunk-13
就這樣。



結論:

雖然回不到過去那種盛況,也證明了"蛋塔熱"、"單車熱"這種一窩蜂的熱潮遲早會退,

不過熱陶退去後騎車的人還是有比以前增加,也算好事嚕。

聽說這種現像12年一輪,蛋塔、單車,下一個會是?

我希望是R,大家一窩蜂來學R吧!!!!!XD




1 則留言:

  1. MGM Grand, Las Vegas: The Dream at the MGM Grand
    MGM Grand Hotel 목포 출장마사지 in 울산광역 출장안마 Las Vegas features 50000 square 서울특별 출장안마 feet of meeting and event space, including 제주 출장안마 a retail arcade, six restaurants, 서귀포 출장안마

    回覆刪除