[Data Science] R Learning Notes: Getting Data
Published: 2019-06-23



Obtaining Data: Motivation

  • This course covers the basic ideas behind getting data ready for analysis:
    • Finding and extracting raw data
    • Tidy data principles and how to make data tidy
    • Practical implementation through a range of R packages
  • What this course depends on
  • What would be useful:
    • Exploratory analysis
    • Reporting Data and Reproducible Research

PS:

GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication

Raw and Processed Data

Data are values of qualitative or quantitative variables, belonging to a set of items.

  • Qualitative: Country of origin, sex, treatment
  • Quantitative: Height, weight, blood pressure

The components of tidy data

  1. The raw data.
  2. A tidy data set.
  3. A code book describing each variable and its values in the tidy data set.
  4. An explicit and exact recipe you used to go from 1 -> 2,3.

The tidy data

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.

Others:

  • Include a row at the top of each file with variable names.
  • Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
  • In general data should be saved in one file per table.
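As a toy illustration of the tidy principles above (the table, column names, and values here are invented for the example), a "messy" wide table with one column per year can be reshaped so that each variable is a column and each observation is a row, e.g. with the reshape2 package:

```r
library(reshape2)  # melt() turns wide data into long/tidy form

## Messy: each year's measurement sits in its own column
messy <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(12, 25),
  check.names = FALSE
)

## Tidy: one column per variable (country, year, value),
## one row per observation
tidy <- melt(messy, id.vars = "country",
             variable.name = "year", value.name = "value")
tidy
```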

Downloading Data

  1. Get/set your working directory
  • getwd()
  • setwd()
  2. Checking for and creating directories
  • file.exists("directoryName")
  • dir.create("directoryName")
  3. Getting data from the internet
  • download.file()
if (!file.exists("db")) {
    dir.create("db")
}
fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files("./db")

PS: When downloading the data file with the method above, I got the error message below. It was caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it.

Warning messages:
1: running command 'curl "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD" -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv", :
  download had nonzero exit status

Loading flat files

  • read.table()
classData <- read.table("./db/callsforservice.csv", sep = ",", header = TRUE)
head(classData)

All Reading Functions

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

Reading XML Data

  • Extensible markup language
  • Frequently used to store structured data
  • Particularly widely used in internet applications
  • Extracting XML is the basis for most web scraping
  • Components
    • Markup - labels that give the text structure
    • Content - the actual text of the document
library(XML)
html <- "http://stackoverflow.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternal = TRUE)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)

Reading JSON Data

  • jsonlite
install.packages("jsonlite")
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)
## print JSON data in a pretty way
myjson <- toJSON(jsonData$owner, pretty = TRUE)
cat(myjson)

data.table()

> library(data.table)
> DF = data.frame(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
> head(DF, 3)
         x y          z
1 1.239493 a -0.3917245
2 1.090748 a  0.3640152
3 2.462106 a  1.3424369
> DT = data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
> head(DT)
            x y           z
1:  0.1235667 a  0.94765708
2: -1.1491418 a  1.23264715
3: -2.3339784 a -0.70625463
4:  0.4896532 b  0.07144038
5:  0.7731791 b  0.45262096
6:  0.1601838 b -0.30345490
DT[2, ]
DT[DT$y == "a", ]
DT[, c(2, 3)]
DT[, list(mean(x), sum(z))]
DT[, table(y)]
DT[, w := z^2]
DT[, m := {tmp <- (x + z); log2(tmp + 5)}]

Reading from MySQL

install.packages("RMySQL")
library(RMySQL)
ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;")
dbDisconnect(ucscDb)
hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
dbListFields(hg19, "affyU133Plus2")
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
## processing a big data table in chunks
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query)
quantile(affyMis$misMatches)
affyMisSmall <- fetch(query, n = 10)
dbClearResult(query)
dim(affyMisSmall)
dbDisconnect(hg19)  ## close db connection

Reading from HDF5

  • Used for storing large data sets.
  • Supports storing a range of data types
  • Hierarchical data format
  • Groups containing zero or more data sets and metadata
    • Have a group header with group name and list of attributes
    • Have a group symbol table with a list of objects in the group
  • Datasets: multidimensional arrays of data elements with metadata
    • Have a header with name, datatype, dataspace, and storage layout
    • Have a data array with the data
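The course reads and writes HDF5 through the Bioconductor rhdf5 package. A minimal sketch (the file name, group name, and data here are made up for illustration):

```r
## rhdf5 comes from Bioconductor, not CRAN:
## install.packages("BiocManager"); BiocManager::install("rhdf5")
library(rhdf5)

h5createFile("example.h5")                  # create an empty HDF5 file
h5createGroup("example.h5", "foo")          # add a group
h5write(matrix(1:10, nrow = 5),
        "example.h5", "foo/A")              # write a dataset inside the group
h5write(rnorm(3), "example.h5", "dd")       # write a top-level dataset
h5ls("example.h5")                          # list the groups and datasets
readA <- h5read("example.h5", "foo/A")      # read a dataset back
```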

Reading from web

Get web document

  1. Use the built-in functions url() and readLines()

> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode

  2. Use the XML package

> library(XML)
> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)
> xpathSApply(html, "//div", xmlValue)

  3. Use the httr and XML packages

install.packages("httr")
library(httr)
url <- "http://www.baidu.com"
html <- GET(url)
content = content(html, as = "text")
library(XML)
parsedHtml = htmlParse(content, asText = TRUE)
xpathSApply(parsedHtml, "//div", xmlValue)

Accessing websites with passwords

  1. Before login

> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:33
  Status: 401
  Content-Type:

  2. Logging in

> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:34
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
> names(pg2)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"     "content"     "date"        "times"
 [9] "request"     "handle"

  3. Use a handle to keep cookies, sessions, and so on across requests

> pg = handle("http://httpbin.org")
> login = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")

Reading data from APIs

library(httr)  ## oauth_app() and sign_oauth1.0() are httr functions
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 = content(homeTL)
json2 = jsonlite::fromJSON(jsonlite::toJSON(json1))
json2[1, 1:4]

Reposted from: https://my.oschina.net/skyler/blog/713563
