Obtaining Data: Motivation
- This course covers the basic ideas behind getting data ready for analysis
- Finding and extracting raw data
- Tidy data principles and how to make data tidy
- Practical implementation through a range of R packages
- What this course depends on
- What would be useful
  - Exploratory analysis
  - Reporting Data and Reproducible Research
PS:
GOAL: Raw data -> Processing script -> tidy data -> data analysis -> data communication
Raw and Processed Data
Data are values of qualitative or quantitative variables, belonging to a set of items.
- Qualitative: Country of origin, sex, treatment
- Quantitative: Height, weight, blood pressure
The components of tidy data
- The raw data.
- A tidy data set.
- A code book describing each variable and its values in the tidy data set.
- An explicit and exact recipe you used to go from 1 -> 2, 3.
The tidy data
- Each variable you measure should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
Others:
- Include a row at the top of each file with variable names.
- Make variable names human readable: AgeAtDiagnosis instead of AgeDx.
- In general, data should be saved in one file per table (a minimal example of these principles follows below).
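Here is a minimal sketch of these principles in base R; the tables, variable names, and values are invented for illustration.

```r
# Hypothetical tidy table: each variable in its own column, each observation
# in its own row, with human-readable variable names in the header.
patients <- data.frame(
  PatientId      = c(1, 2, 3),
  Sex            = c("male", "female", "male"),
  AgeAtDiagnosis = c(54, 61, 47)
)

# A second table of a different "kind"; the shared PatientId column
# allows the two tables to be linked.
treatments <- data.frame(
  PatientId = c(1, 2, 3),
  Treatment = c("drug A", "drug B", "placebo")
)

merge(patients, treatments, by = "PatientId")  # join on the linking column
```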
Downloading Data
- Get/set your working directory
getwd()
setwd()
- Checking for and creating directories
file.exists("directoryName")
dir.create("directoryName")
- Getting data from the internet
download.file()
if (!file.exists("db")) { dir.create("db") }
fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "curl")
list.files("./db")
PS: When downloading the data file with the call above, the error below appeared. It was caused by curl not being installed on my system; changing method = "curl" to method = "auto" fixed it (a corrected sketch follows after the error message).
Warning messages:
1: running command 'curl "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD" -o "./db/callsforservice.csv"' had status 127
2: In download.file(fileUrl, destfile = "./db/callsforservice.csv", : download had non-zero exit status
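A corrected version of the download, as a minimal sketch: with method = "auto", R picks whatever download method is available on the system instead of requiring the external curl tool.

```r
# Same download as above, but portable across systems without curl installed.
if (!file.exists("db")) { dir.create("db") }
fileUrl <- "https://data.baltimorecity.gov/api/views/xviu-ezkt/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./db/callsforservice.csv", method = "auto")
dateDownloaded <- date()  # record when the file was downloaded
```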
Loading flat files
read.table()
classData <- read.table("./db/callsforservice.csv", sep = ",", header = TRUE)
head(classData)
All Reading Functions
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
           numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names,
           as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE,
           blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown",
           text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
Reading XML Data
- Extensible markup language
- Frequently used to store structured data
- Particularly widely used in internet applications
- Extracting XML is the basis for most web scraping
- Components
- Markup - labels that give the text structure
- Content - the actual text of the document
library(XML)
html <- "http://stackoverflow.com/search?q=XML+content+does+not+seem+to+be+XML%3A"
doc <- htmlTreeParse(html, useInternal = TRUE)
content <- xpathSApply(doc, "//div[@class = 'result-link']", xmlValue)
Reading JSON Data
jsonlite
install.packages("jsonlite")
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
names(jsonData$owner)  # nested objects come back as nested data frames
jsonData$owner$login
## print JSON data in a pretty way
myjson <- toJSON(jsonData$owner, pretty = TRUE)
cat(myjson)
The data.table Package
> library(data.table)
> DF = data.frame(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
> head(DF, 3)
         x y          z
1 1.239493 a -0.3917245
2 1.090748 a  0.3640152
3 2.462106 a  1.3424369
> DT = data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
> head(DT)
            x y           z
1:  0.1235667 a  0.94765708
2: -1.1491418 a  1.23264715
3: -2.3339784 a -0.70625463
4:  0.4896532 b  0.07144038
5:  0.7731791 b  0.45262096
6:  0.1601838 b -0.30345490
DT[2, ]                                      # subset the second row
DT[DT$y == "a", ]                            # subset rows with a logical condition
DT[, c(2, 3)]                                # subset columns
DT[, list(mean(x), sum(z))]                  # apply functions inside j
DT[, table(y)]                               # tabulate a column
DT[, w := z^2]                               # add a new column by reference
DT[, m := {tmp <- (x + z); log2(tmp + 5)}]   # multi-step expression (x + z, since y is character)
Reading from MySQL
install.packages("RMySQL")library(RMySQL)ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);hg19 <- dbConnect(MySQL(), user = "genome", db = "hg19", host = "genome-mysql.cse.ucsc.edu")allTables <- dbListTables(hg19)length(allTables)dbListFields(hg19, "affyU133Plus2")dbGetQuery(hg19, "select count(*) from affyU13Plus2")affyData <- dbReadTable(hg19, "affyU133Plus2")head(affyData)## processing big data tablequery <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")affyMis <- fetch(query); quantitle(affyMis$misMatches)affyMisSmall <- fetch(query, n = 10); dbClearResult(query);dim(affyMisSmall)dbDisconnect(hg19) ## close db connection
Reading from HDF5
- Used for storing large data sets.
- Supports storing a range of data types.
- Hierarchical data format (see the sketch after this list).
- Groups containing zero or more data sets and metadata
  - Have a group header with a group name and list of attributes
  - Have a group symbol table with a list of objects in the group
- Datasets: multidimensional arrays of data elements with metadata
  - Have a header with name, datatype, dataspace, and storage layout
  - Have a data array with the data
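A minimal sketch of writing and reading HDF5 from R, assuming the Bioconductor rhdf5 package is installed (e.g. via BiocManager::install("rhdf5")); the file, group, and dataset names are invented for illustration.

```r
library(rhdf5)

h5createFile("example.h5")              # create an empty HDF5 file
h5createGroup("example.h5", "foo")      # add a group to hold data sets

A <- matrix(1:10, nrow = 5)
h5write(A, "example.h5", "foo/A")       # write a data set into the group

h5ls("example.h5")                      # list the groups and data sets in the file
readA <- h5read("example.h5", "foo/A")  # read the data set back
```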
Reading from the Web
Getting a web document
- Use the built-in functions url() and readLines()
> con = url("http://www.baidu.com")
> htmlCode = readLines(con)
> close(con)
> htmlCode
- Use the XML package
> library(XML)
> url <- "http://www.baidu.com"
> html <- htmlTreeParse(url, useInternalNodes = T)
> xpathSApply(html, "//div", xmlValue)
- Use the httr and XML packages
install.packages("httr")library(httr)url <- "http://www.baidu.com"html <- GET(url)content = content(html, as="text")library(XML)parsedHtml = htmlParse(content, asText = T)xpathSApply(parsedHtml, "//div", xmlValue)
Accessing websites with passwords
- Before logging in
> pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
> pg1
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:33
  Status: 401
  Content-Type:
- Logging in
> pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
> pg2
Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2016-07-17 15:34
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true,
  "user": "user"
}
> names(pg2)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"     "content"     "date"        "times"
 [9] "request"     "handle"
- Use a handle so that cookies and sessions persist across requests to the same site.
> pg = handle("http://httpbin.org")
> login = GET(handle = pg, path = "/basic-auth/user/passwd", authenticate("user", "passwd"))  # auth is remembered on the handle
> pg1 = GET(handle = pg, path = "/")
> pg2 = GET(handle = pg, path = "about")
Reading data from APIs
library(httr)      # oauth_app() and friends live in httr, not in a package of their own
library(jsonlite)
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 = content(homeTL)                    # extract the content of the response
json2 = jsonlite::fromJSON(toJSON(json1))  # reshape the nested list into a data frame
json2[1, 1:4]