Extracting a Table from a Webpage and Analyzing It Using R

Saurav Das

March 7, 2020

Extracting the coronavirus case counts from the website “https://www.worldometers.info/coronavirus/”

Loading the libraries

library(rvest)
## Loading required package: xml2
library(xml2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Extracting the data from the webpage

webpage_url <- "https://www.worldometers.info/coronavirus/"
webpage <- xml2::read_html(webpage_url)
ExOffndrsRaw <- rvest::html_table(webpage)[[1]] %>% 
  tibble::as_tibble(.name_repair = "unique") # repair the repeated columns
CV <- ExOffndrsRaw %>% dplyr::glimpse(45)
## Observations: 106
## Variables: 9
## $ `Country,Other`    <chr> "China", "S. ...
## $ TotalCases         <chr> "80,703", "7,...
## $ NewCases           <chr> "+52", "+272"...
## $ TotalDeaths        <chr> "3,098", "50"...
## $ NewDeaths          <int> 28, 2, 49, NA...
## $ TotalRecovered     <chr> "57,333", "13...
## $ ActiveCases        <chr> "20,272", "7,...
## $ `Serious,Critical` <chr> "5,264", "36"...
## $ `Tot Cases/1M pop` <dbl> 56.1, 142.6, ...
str(CV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    106 obs. of  9 variables:
##  $ Country,Other   : chr  "China" "S. Korea" "Iran" "Italy" ...
##  $ TotalCases      : chr  "80,703" "7,313" "6,566" "5,883" ...
##  $ NewCases        : chr  "+52" "+272" "+743" "" ...
##  $ TotalDeaths     : chr  "3,098" "50" "194" "233" ...
##  $ NewDeaths       : int  28 2 49 NA NA NA NA 7 1 NA ...
##  $ TotalRecovered  : chr  "57,333" "130" "2,134" "589" ...
##  $ ActiveCases     : chr  "20,272" "7,133" "4,238" "5,061" ...
##  $ Serious,Critical: chr  "5,264" "36" "" "567" ...
##  $ Tot Cases/1M pop: num  56.1 142.6 78.2 97.3 12.2 ...
View(CV)
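
The code above picks the table by position with [[1]], which depends on the page layout staying the same. As a minimal sketch of an alternative, the example below selects a table by a CSS selector. It runs on a small inline HTML string rather than the live site, and the id "cases" is made up for illustration; the real page may use different markup.

```r
library(xml2)
library(rvest)

# Toy HTML standing in for the live page; the id "cases" is hypothetical.
html <- '<table id="cases">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>China</td><td>80,703</td></tr>
  <tr><td>S. Korea</td><td>7,313</td></tr>
</table>'

page <- xml2::read_html(html)

# Select the table by CSS selector instead of by position, then parse it.
cases <- page %>%
  rvest::html_node("table#cases") %>%
  rvest::html_table()

str(cases)
```

As with the real page, the counts that contain commas come back as character strings, not numbers.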

Plotting

Loading ggplot2 for plotting

library(ggplot2)
CV %>% ggplot(aes(`Country,Other`, TotalCases)) + geom_bar(stat = "identity")

As you can see, the x-axis is overcrowded and it is hard to tell what is going on, so the plot needs modifying. If you look closely at the y-axis, the values also look off: TotalCases was read as a character column because some of its values contain commas (for example “80,703”), so it cannot be used on a continuous y-axis until the commas are removed and the column is converted to numeric.

To fix that, I used the mutate function from the dplyr package together with gsub to strip the commas, and stored the result in a new numeric column named “Total_cases”.

# strip the thousands separators, then convert the strings to numbers
CV <- mutate(CV, Total_cases = as.numeric(gsub(",", "", TotalCases)))
str(CV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    106 obs. of  10 variables:
##  $ Country,Other   : chr  "China" "S. Korea" "Iran" "Italy" ...
##  $ TotalCases      : chr  "80,703" "7,313" "6,566" "5,883" ...
##  $ NewCases        : chr  "+52" "+272" "+743" "" ...
##  $ TotalDeaths     : chr  "3,098" "50" "194" "233" ...
##  $ NewDeaths       : int  28 2 49 NA NA NA NA 7 1 NA ...
##  $ TotalRecovered  : chr  "57,333" "130" "2,134" "589" ...
##  $ ActiveCases     : chr  "20,272" "7,133" "4,238" "5,061" ...
##  $ Serious,Critical: chr  "5,264" "36" "" "567" ...
##  $ Tot Cases/1M pop: num  56.1 142.6 78.2 97.3 12.2 ...
##  $ Total_cases     : num  80703 7313 6566 5883 1018 ...
View(CV)
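
Several other columns (NewCases, TotalDeaths and so on) are also stored as text. As a rough sketch, the same cleaning can be applied to more than one column at once. The helper to_number is a hypothetical name introduced here, not part of the original analysis, and across() assumes dplyr 1.0.0 or later. The toy rows reuse the first four values shown in the str() output.

```r
library(dplyr)

# Hypothetical helper: drop commas and leading plus signs, then convert.
# Empty strings become NA.
to_number <- function(x) as.numeric(gsub("[,+]", "", x))

# Toy data using the first four rows from the str() output above.
toy <- tibble::tibble(
  Country    = c("China", "S. Korea", "Iran", "Italy"),
  TotalCases = c("80,703", "7,313", "6,566", "5,883"),
  NewCases   = c("+52", "+272", "+743", "")
)

# Apply the helper to every listed column in one step.
toy_clean <- toy %>%
  mutate(across(c(TotalCases, NewCases), to_number))

str(toy_clean)
```

Unlike the mutate call above, this overwrites the original columns instead of adding new ones; keep the originals if you want to check the conversion.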

Modifying plot

Here I modified the x-axis for better readability by rotating the axis text 45 degrees.

CV %>% ggplot(aes(`Country,Other`, Total_cases)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

However, the x-axis is still hard to read, so we can drop the countries that have reported only a few cases, say fewer than 15. I will use the filter function from dplyr to remove them.

CV %>% filter(Total_cases >= 15) %>% ggplot(aes(`Country,Other`, Total_cases)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
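
Another option for readability, not used in the plots above, is to order the bars by case count instead of alphabetically. The sketch below does this with reorder() on a toy data frame built from the first four rows of the str() output; the column name Country is a simplification of `Country,Other`.

```r
library(ggplot2)

# Toy data: the first four rows from the str() output above.
toy <- data.frame(
  Country     = c("China", "S. Korea", "Iran", "Italy"),
  Total_cases = c(80703, 7313, 6566, 5883)
)

# reorder() turns Country into a factor whose levels are sorted by the
# second argument; negating it puts the largest count first.
p <- ggplot(toy, aes(reorder(Country, -Total_cases), Total_cases)) +
  geom_bar(stat = "identity") +
  labs(x = "Country") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# print(p) draws the chart
```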

Next, let's transform the y-axis to make the countries easier to compare. I used a log10 scale here; you can choose a different transformation if it suits your data better.

CV %>% filter(Total_cases >= 15) %>% ggplot(aes(`Country,Other`, Total_cases)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_y_continuous(trans = "log10")
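
To see why the log10 scale helps, here is a small worked example on the first four case counts from the str() output. It is only an illustration of the arithmetic, not part of the plotting code: a ratio on the raw scale becomes a difference on the log scale, so one very large country no longer squashes the others.

```r
# Total case counts for the first four rows shown in the str() output.
total_cases <- c(China = 80703, `S. Korea` = 7313, Iran = 6566, Italy = 5883)

# Ratio between the largest and smallest value on the raw scale.
raw_ratio <- max(total_cases) / min(total_cases)

# On the log10 scale the same gap is additive:
# log10(a) - log10(b) equals log10(a / b).
log_cases <- log10(total_cases)
log_gap   <- max(log_cases) - min(log_cases)

round(log_cases, 2)
all.equal(log_gap, log10(raw_ratio))
```

In ggplot2, scale_y_log10() is a shorthand for scale_y_continuous(trans = "log10").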
