【数据分析·实战】北京的雾霾是大风吹走的吗

【数据分析·实战】北京的雾霾是大风吹走的吗

前言

2013年9月,国务院专门出台治理大气污染的条例,王安顺代表北京市与中央签订责任状,立下壮士断腕的决心。“也是生死状,因为中央领导说,2017年实现不了空气治理就‘提头来见’。既是玩笑话,也说明了这句话的分量很重。”

1月3日,北京市环保局发布了2016年北京市全年空气质量状况报告,“2016年我市空气质量达标天数198天,其中,一级优68天,二级良130天,达标天数较2015年增加12天;2016年共发生重污染39天,其中O3重污染1天,PM2.5重污染38天,较2015年减少7天。”

网友们看到这份报告纷纷调侃,“不过是今年多刮了几天风罢了”。

空气质量的改善是政府的功绩,还是“风”的垂怜呢?本文将基于天气以及空气质量数据,用R语言一探究竟!



一、准备工作

library(RCurl)
#现成的数据找不到,需要用爬虫爬网页上的数据
library(XML)
#提取网页中的表格
library(dplyr)
#数据清理
library(ggplot2)
#画图
library(stringr)
#文本处理

二、数据收集

本文的空气质量数据来自2016年12月北京空气质量指数查询(AQI)_12月份北京PM2.5历史数据查询_天气后报,天气数据来自北京历史天气预报查询_2016年12月份北京天气记录_北京2016年12月份天气情况_天气后报

时间跨度为2014-01-01至2016-12-31,总共应该是1096天,但2014年的空气质量数据有4天缺失,本文根据2014年08月北京空气质量指数AQI_PM2.5日历史数据_中国空气质量在线监测分析平台历史数据提供的数据予以补全(分别是2014-01-23、2014-03-24、2014-08-08、2014-08-22)。


空气质量数据和天气数据的爬虫代码:

#Air Quality
airdata <- matrix(data = NA, nrow = 1, ncol = 10)
airdata <- data.frame(airdata)
colnames(airdata) <- c("date", "airrank", "aqi", "aqirank", "pm2.5", "pm10","no2","so2", "co", "o3")

months <- sprintf("%02d", 1:12)
years <- 2014:2016
for(i in years) {
  for(j in months) {
    print(paste(i, j, sep = ""))
    url_a="http://www.tianqihoubao.com/aqi/beijing-"
    url_b=".html"
    url_all=paste(url_a, i, j, url_b, sep = "")
    temp <- getURL(url_all, .encoding="GBK")
    temp2 <- iconv(temp,"gb2312","UTF-8") 
    doc <- htmlParse(temp2, asText=T, encoding="UTF-8")
    tables <-readHTMLTable(doc)
    airdatatemp <- as.data.frame(tables)
    colnames(airdatatemp) <- c("date", "airrank", "aqi", "aqirank", "pm2.5", "pm10","no2","so2", "co", "o3")
    airdata <- rbind(airdata, airdatatemp)
  }
}


airdata1 <- c("2014-01-23", "重度污染", "271", "182", "261","263","125", "118", "4.06", "10")
airdata2 <- c("2014-03-24", "重度污染", "270", "188", "220","249","93",  "53",  "1.99", "144")
airdata3 <- c("2014-08-08", "轻度污染", "107", "178", "64", "103","45",  "8",   "1.08", "220")
airdata4 <- c("2014-08-22", "良",       "80",  "98",  "51", "80", "58",  "3",   "0.80", "93")

airdata <- rbind(airdata, airdata1, airdata2, airdata3, airdata4)

#Wind
winddata <- matrix(data = NA, nrow = 1, ncol = 4)
winddata <- data.frame(winddata)
colnames(winddata) <- c("date", "weather", "temperature","wind")

months <- sprintf("%02d", 1:12)
years <- 2014:2016
for(i in years) {
  for(j in months) {
    url_a="http://www.tianqihoubao.com/lishi/beijing/month/"
    url_b=".html"
    url_all=paste(url_a, i, j, url_b, sep = "")
    temp <- getURL(url_all, .encoding="GBK")
    temp2 <- iconv(temp,"gb2312","UTF-8") 
    doc <- htmlParse(temp2, asText=T, encoding="UTF-8")
    tables <-readHTMLTable(doc)
    winddatatemp <- as.data.frame(tables)
    colnames(winddatatemp) <- c("date", "weather", "temperature","wind")
    winddata <- rbind(winddata, winddatatemp)
  }
}

三、数据清洗与整理

#根据日期合并两组数据
airdata$date <- as.Date(airdata$date)
winddata$date <- as.Date(winddata$date, "%Y年%m月%d日")

mydata <- merge(airdata, winddata, by="date")
mydata <- na.omit(mydata)
mydata <- mydata[-which(duplicated(mydata$date, )),]

airdata$date <- as.Date(airdata$date)
mydata[,3:10] <- as.numeric(unlist(mydata[,3:10]))


#日期等数据整理
mydata$weekday <- weekdays(mydata$date, abbreviate = T)
mydata$weekday <- factor(mydata$weekday, levels = c("周一","周二","周三","周四","周五","周六","周日"))
mydata$month <- months(mydata$date, abbreviate = T)
mydata$month <- factor(x = mydata$month, levels = c("1月","2月","3月","4月","5月","6月","7月","8月","9月","10月","11月","12月"))
mydata$year <- format(mydata$date, "%Y")
mydata$airrank <- factor(x = mydata$airrank, levels =c("优","良","轻度污染","中度污染","重度污染","严重污染"))

#气温数据整理
mydata$temp.max <- substr(mydata$temperature, start = 1, stop=str_locate(mydata$temperature, "/")[,1]-1)
mydata$temp.min <- substr(mydata$temperature, start = str_locate(mydata$temperature, "/")[,1]+1, stop = nchar(mydata$temperature))
mydata$temp.max <- as.numeric(substr(mydata$temp.max, start = 1, stop = nchar(mydata$temp.max)-43))
mydata$temp.min <- as.numeric(substr(mydata$temp.min, start = 43, stop =nchar(mydata$temp.min)-1))
mydata$temp.avg <- (mydata$temp.max + mydata$temp.min) / 2 

#风速数据整理
mydata$wind.day <- substr(mydata$wind, start = 1, stop=str_locate(mydata$wind, "/")[,1]-1)
mydata$wind.night <- substr(mydata$wind, start =str_locate(mydata$wind, "/")[,1]+1, stop = nchar(mydata$wind))
mydata$wind.day[which(is.na(mydata$wind.day))] <- mydata$wind[which(is.na(mydata$wind.day))]
mydata$wind.night[which(is.na(mydata$wind.night))] <- mydata$wind[which(is.na(mydata$wind.night))]
mydata$wind.day.direction <- str_split_fixed(mydata$wind.day, pattern = " ", n = 2)[,1]
mydata$wind.night.direction <- str_split_fixed(mydata$wind.night, pattern = " ", n = 2)[,1]
mydata$wind.day.speed <- str_split_fixed(mydata$wind.day, pattern = " ", n = 2)[,2]
mydata$wind.day.speed <- str_split_fixed(mydata$wind.day.speed, pattern = "\r", n = 2)[,1]
mydata$wind.day.speed[which(mydata$wind.day.speed=="")] <- "≤3级"
mydata$wind.night.speed <- str_split_fixed(mydata$wind.night, pattern = " ", n = 2)[,2]
mydata$wind.night.speed <- str_split_fixed(mydata$wind.night.speed, pattern = "\r", n = 2)[,1]
mydata$wind.night.speed[which(mydata$wind.night.speed=="")] <- "≤3级"

四、绘图与分析

首先,我们根据所得到的AQI(空气质量指数)数据绘制了如下图所示的箱线图+小提琴图

ggplot(data = mydata, aes(x = year, y = aqi)) + 
  geom_violin(fill="lightblue") +
  geom_boxplot(fill="lightgreen",  width=.2) +
  labs(x="",y="", title="北京AQI分布2014-2016") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))

可以很明显地看到,2014-2016年北京的空气质量整体上有所改善,AQI四分位线、平均值线均向下移动,总体的分布也逐渐向下偏。

参与空气质量评价的主要污染物为细颗粒物、可吸入颗粒物、二氧化硫、二氧化氮、臭氧、一氧化碳等。其中PM2.5最受关注,秋冬时节PM2.5爆表的新闻常见于报端,根据同样的方法我们构造了PM2.5的趋势图,结果与AQI类似。

ggplot(data = mydata, aes(x = year, y = pm2.5)) + 
  geom_violin(fill="lightblue") +
  geom_boxplot(fill="lightgreen",  width=.2) +
  labs(x="",y="", title="北京PM2.5分布2014-2016") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))

那我们不禁要问了,近年来北京空气质量的改善,“风”——这一因素到底起了多大的作用呢?

首先我们考察不同风速情况下的PM2.5分布,看看大风是不是真的吹走了雾霾。显然,空气污染(PM2.5>100)主要分布在风速小于等于3级的气象条件下,风力大于3级的日子空气污染较少。但如果从各风速看,白天和夜间呈现不同的特征,白天风速越大,PM2.5均值水平越低;夜间风速越大,PM2.5均值水平变化不太大(5-6级、6-7级均值水平较高,可能与样本太少有关)。这其中的道理,不太清楚。

ggplot(data = mydata, aes(x = wind.day.speed, y = pm2.5)) +
  geom_boxplot(fill="cornflowerblue", col=1, na.rm = T) +
  geom_point(position = "jitter", alpha=0.3, col="blue", na.rm = T) +
  labs(x="",y="", title="白天风速与PM2.5") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))

ggplot(data = mydata, aes(x = wind.night.speed, y = pm2.5)) +
  geom_boxplot(fill="cornflowerblue", col=1, na.rm = T) +
  geom_point(position = "jitter", alpha=0.3, col="blue", na.rm = T) +
  labs(x="",y="", title="夜间风速与PM2.5") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))

进一步地我们对比下不同年份各空气质量等级天数与各风速天数。可以看到北京的空气质量整体向好,且趋势较为明显:空气质量为优的天数明显增加,轻度污染的天数有所增加;良、中度污染的天数大致不变;中度、严重污染的天数则有所减少。而反观214-2016年的整体风速情况,则无法看到的较为明显的趋势,相对来说,2014年的整体风速较小,而2015年和2016年则大致相当。

ggplot(data = mydata, aes(x = year, fill=airrank)) +
  geom_bar(col=1) +
  scale_fill_brewer(palette = "YlOrRd") +
  labs(x="",y="", title="空气质量2014-2016") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold")) 

ggplot(data = mydata, aes(x = year, fill=wind.day.speed)) +
  geom_bar(col=1, width = 0.4) +
  scale_fill_brewer() +
  labs(x="",y="", title="白天风速2014-2016") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold")) 

ggplot(data = mydata, aes(x = year, fill=wind.night.speed)) +
  geom_bar(col=1, width = 0.4) +
  scale_fill_brewer() +
  labs(x="",y="", title="夜间风速2014-2016") +
  theme(plot.title = element_text(hjust=0.5, size = 15, face = "bold"))

综合本文的分析来看,2014-2016年北京的空气质量确实是在逐渐改善,而且这其中“风”并没有扮演十分重要的角色,所以呢...

编辑于 2017-02-02

文章被以下专栏收录