Share this:

R Exercise: Working with Data from JSON API and Performing EDA

Import any data from JSON API

This was prepared on Google Colab. Please refer to this link for running R in Google Colab.

We are using data from https://data.askbhunte.com/api/v1/covid.

We are going to use below packages.

  • jsonlite : It implements a bidirectional mapping between JSON data and the most important R data types.
  • RCurl : Which help to provides the necessary tools for accessing URIs, data and services via HTTP.
install.packages('RCurl')
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘bitops’
library(jsonlite)
library(RCurl)
url <- getURL("https://data.askbhunte.com/api/v1/covid",
.opts=list(followlocation=TRUE, ssl.verifyhost=FALSE, ssl.verifypeer=FALSE))

Now, save the imported data as "covidtbl" data frame in R

covidtbl <- fromJSON(txt=url, flatten=TRUE)
#covidtbl 

Check whether the saved "covidtbl" data passes the three conditions of the "tidy" data or not! If not, make it a tidy data with explanations of rows, columns and cells in the data

The three conditions of tidy data are,

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell.

Apply these condition to our covidtbl data set satisfy all the conditions mention above so, without a doubt our dataset is tidy data.

covidtbl

Check if there are duplicate cases in the data using "id" variable, remove duplicate cases, if found, using R base functions: duplicated or unique

Function duplicated() ensure that wether there is present of duplicated value or not. If duplicate value is found it return TRUE other wise FALSE. So, in our care there no present of duplicated values.

similarly function unique() is used to erase the duplicated value or to find the unique value in dataframe.

duplicated(covidtbl)
unique(covidtbl)

Clean the "gender" variable and show the number and percentage of males and females in the data

We must take care of following points while cleaning our data,

  • Free of duplicate rows/values
  • Error-free (e.g. free of misspellings)
  • Relevant (e.g. free of special characters)
  • The appropriate data type for analysis
  • Free of outliers (or only contain outliers have been identified/understood), and
  • Follows a “tidy data” structure

In case of our data there are some missing value as well as starting letter of some data are in small letter or in capital letter. To slove this problem we install the package

stringer : It is used to convert the first letter of every word of a string to Uppercase and the rest of the letters are converted to lower case.

install.packages("stringr")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Following code loads the stringr library and remove the missing values present in our gender column.

df<- covidtbl[complete.cases(covidtbl$gender),]

In following case, first load the string library and then change all the gender data in upper case and by using str_to_title() function change all letter except first to lower case.

library(stringr)
df <-toupper(df$gender)
df <- str_to_title(df)

Now, it is time to calculate total number of male and female from gender column.

df1<- table(df)
df1
df
Female   Male 
 21403  56355 

To find the percentage of male and female we can do,

prop.table(df1)
df
   Female      Male 
0.2752514 0.7247486 

Clean the "age" variable and show the summary statistics of the age variable

In our age column there are so many missing value. First we should remove the missing values.


df_age <- covidtbl[complete.cases(covidtbl$age),]
age <- df_age$age

age
  1. 28
  2. 34
  3. 26
  4. 29
  5. 20
  6. 72
  7. 60
  8. 24
  9. 58
  10. 52
  11. 41
  12. 41
  13. 34
  14. 41
  15. 28
  16. 28
  17. 28
  18. 22
  19. 25
  20. 40
  21. 33
  22. 18
  23. 65
  24. 32
  25. 20
  26. 37
  27. 55
  28. 36
  29. 55
  30. 19
  31. 65
  32. 21
  33. 34
  34. 65
  35. 81
  36. 19
  37. 27
  38. 32
  39. 57
  40. 26
  41. 44
  42. 35
  43. 9
  44. 32
  45. 34
  46. 50
  47. 25
  48. 62
  49. 21
  50. 34
  51. 40
  52. 55
  53. 60
  54. 30
  55. 55
  56. 29
  57. 65
  58. 80
  59. 17
  60. 60
  61. 26
  62. 27
  63. 58
  64. 60
  65. 52
  66. 30
  67. 65
  68. 37
  69. 4
  70. 35
  71. 11
  72. 40
  73. 22
  74. 24
  75. 65
  76. 32
  77. 18
  78. 18
  79. 38
  80. 45
  81. 63
  82. 30
  83. 25
  84. 27
  85. 28
  86. 28
  87. 45
  88. 32
  89. 60
  90. 16
  91. 61
  92. 22
  93. 24
  94. 25
  95. 28
  96. 28
  97. 32
  98. 59
  99. 55
  100. 61
  101. 22
  102. 28
  103. 32
  104. 32
  105. 4
  106. 10
  107. 28
  108. 20
  109. 35
  110. 22
  111. 40
  112. 18
  113. 36
  114. 32
  115. 27
  116. 25
  117. 25
  118. 22
  119. 49
  120. 36
  121. 29
  122. 20
  123. 28
  124. 23
  125. 30
  126. 36
  127. 75
  128. 36
  129. 9
  130. 23
  131. 47
  132. 10
  133. 28
  134. 29
  135. 65
  136. 17
  137. 18
  138. 18
  139. 30
  140. 36
  141. 37
  142. 26
  143. 25
  144. 25
  145. 32
  146. 25
  147. 48
  148. 39
  149. 48
  150. 28
  151. 30
  152. 33
  153. 21
  154. 25
  155. 25
  156. 29
  157. 42
  158. 21
  159. 30
  160. 22
  161. 18
  162. 19
  163. 19
  164. 22
  165. 23
  166. 20
  167. 28
  168. 30
  169. 40
  170. 27
  171. 42
  172. 41
  173. 51
  174. 27
  175. 37
  176. 26
  177. 26
  178. 28
  179. 30
  180. 46
  181. 17
  182. 18
  183. 53
  184. 55
  185. 22
  186. 22
  187. 45
  188. 74
  189. 19
  190. 35
  191. 34
  192. 35
  193. 24
  194. 30
  195. 27
  196. 34
  197. 19
  198. 38
  199. 45
  200. 27
  201. 62
  202. 73
  203. 53
  204. 67
  205. 54
  206. 56
  207. 33
  208. 58
  209. 62
  210. 74
  211. 47
  212. 30
  213. 67
  214. 29
  215. 65
  216. 51
  217. 30
  218. 64
  219. 66
  220. 66
  221. 76
  222. 74
  223. 22
  224. 88
  225. 48
  226. 66
  227. 41
  228. 82
  229. 31
  230. 75
  231. 77
  232. 65
  233. 70
  234. 70
  235. 62
  236. 50
  237. 32
  238. 23
  239. 37
  240. 90
  241. 83
  242. 76
  243. 69
  244. 39
  245. 63
  246. 59
  247. 87
  248. 62
  249. 35
  250. 34
  251. 86
  252. 54
  253. 82
  254. 76
  255. 67
  256. 39
  257. 56
  258. 65
  259. 83
  260. 52
  261. 55
  262. 45
  263. 51
  264. 52
  265. 71
  266. 72
  267. 72
  268. 92
  269. 77
  270. 40
  271. 77
  272. 65
  273. 40
  274. 70
  275. 50
  276. 20
  277. 74
  278. 88
  279. 65
  280. 72
  281. 86
  282. 65
  283. 74
  284. 55
  285. 89
  286. 70
  287. 75
  288. 55
  289. 18
  290. 65
  291. 20
  292. 70
  293. 47
  294. 40
  295. 97
  296. 60
  297. 60
  298. 72
  299. 80
  300. 67
  301. 40
  302. 75
  303. 35
  304. 70
  305. 64
  306. 45
  307. 77
  308. 73
  309. 80
  310. 67
  311. 83
  312. 38
  313. 61
  314. 76
  315. 43
  316. 78
  317. 85
  318. 51
  319. 45
  320. 70
  321. 59
  322. 83
  323. 55
  324. 62
  325. 85
  326. 70
  327. 61
  328. 73
  329. 47
  330. 40
  331. 73
  332. 55
  333. 63
  334. 72
  335. 60
  336. 55
  337. 36
  338. 61
  339. 55
  340. 55
  341. 58
  342. 35
  343. 64
  344. 86
  345. 34
  346. 96
  347. 66
  348. 69
  349. 20
  350. 61
  351. 62
  352. 50
  353. 51
  354. 50
  355. 34
  356. 38
  357. 47
  358. 86
  359. 60
  360. 66
  361. 85
  362. 50
  363. 59
  364. 72
  365. 49
  366. 58
  367. 61
  368. 17
  369. 68
  370. 88
  371. 58
  372. 73
  373. 84
  374. 62
  375. 55
  376. 52
  377. 63
  378. 50
  379. 34
  380. 75
  381. 78
  382. 55
  383. 50
  384. 39
  385. 68
  386. 70
  387. 32
  388. 84
  389. 39
  390. 73
  391. 75
  392. 51
  393. 58
  394. 72
  395. 60
  396. 70
  397. 57
  398. 53
  399. 39
  400. 76

Now, we removed the NA values. To show summary statistics of the age,

summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   23.00   35.00   38.93   53.00  523.00 

Looking at summary statistics of age minimum age of people in covidbl table is 1,

  • first quartile is 23
  • median age of group is 35
  • average age of people is 38.93
  • third quartile age group is 53
  • maximum age group of people in this table is 523.

Transform cleaned age variable into broad age groups i.e. <15, 15-59, 60+ years, define it as factor variable and get number and percentage of this variable

less_then_15 <- df_age$age < 15
below_15<- df_age[less_then_15,]$age
below_15<- factor(below_15)
table(below_15)
below_15
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 
 4  8  2 12  5  5  4  7  6  5  5  5  3  6 
prop.table(table(below_15))*100
below_15
        1         2         3         4         5         6         7         8 
 5.194805 10.389610  2.597403 15.584416  6.493506  6.493506  5.194805  9.090909 
        9        10        11        12        13        14 
 7.792208  6.493506  6.493506  6.493506  3.896104  7.792208 
between_15_59 <- (df_age$age <= 59)
between_15_59<- df_age[between_15_59,]$age
between_15_59 <- between_15_59[between_15_59>=15]
between_15_59<- factor(between_15_59)
table(between_15_59)
between_15_59
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
11 19 27 50 35 47 24 45 34 24 47 26 36 32 21 43 16 36 14 31 41 34 17 24 16 48 
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
18 14  6  3 34 11  8 14  9 28 13 14  9 10 27  8  7 16 11 
prop.table(table(between_15_59))*100
between_15_59
       15        16        17        18        19        20        21        22 
1.0396975 1.7958412 2.5519849 4.7258979 3.3081285 4.4423440 2.2684310 4.2533081 
       23        24        25        26        27        28        29        30 
3.2136106 2.2684310 4.4423440 2.4574669 3.4026465 3.0245747 1.9848771 4.0642722 
       31        32        33        34        35        36        37        38 
1.5122873 3.4026465 1.3232514 2.9300567 3.8752363 3.2136106 1.6068053 2.2684310 
       39        40        41        42        43        44        45        46 
1.5122873 4.5368620 1.7013233 1.3232514 0.5671078 0.2835539 3.2136106 1.0396975 
       47        48        49        50        51        52        53        54 
0.7561437 1.3232514 0.8506616 2.6465028 1.2287335 1.3232514 0.8506616 0.9451796 
       55        56        57        58        59 
2.5519849 0.7561437 0.6616257 1.5122873 1.0396975 
greater_than_60 <- df_age$age >= 60
above_60 <- df_age[greater_than_60,]$age
above_60 <- factor(above_60)
print(above_60)
  [1] 72  60  65  65  65  81  62  60  65  80  60  60  65  65  63  60  61  61 
 [19] 75  65  74  69  70  62  60  60  60  76  76  60  70  60  74  85  65  85 
 [37] 83  72  61  62  82  77  61  68  74  68  85  69  75  62  60  64  60  72 
 [55] 75  73  67  78  65  64  62  60  84  62  70  72  78  68  523 60  63  70 
 [73] 77  80  84  75  60  70  60  60  60  70  60  60  68  65  70  65  65  82 
 [91] 75  63  78  70  62  70  76  64  72  65  68  65  68  60  70  64  80  85 
[109] 71  75  78  67  72  61  60  62  60  63  82  79  80  82  85  73  68  76 
[127] 71  74  87  60  76  60  81  77  65  65  70  64  69  65  76  68  60  70 
[145] 70  71  74  65  70  68  77  62  73  67  62  74  67  65  64  66  66  76 
[163] 74  88  66  82  75  77  65  70  70  62  90  83  76  69  63  87  62  86 
[181] 82  76  67  65  83  71  72  72  92  77  77  65  70  74  88  65  72  86 
[199] 65  74  89  70  75  65  70  97  60  60  72  80  67  75  70  64  77  73 
[217] 80  67  83  61  76  78  85  70  83  62  85  70  61  73  73  63  72  60 
[235] 61  64  86  96  66  69  61  62  86  60  66  85  72  61  68  88  73  84 
[253] 62  63  75  78  68  70  84  73  75  72  60  70  76 
35 Levels: 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 ... 523

Number of days between recovered and reported dates, clean it if required, and get the summary statistics of this variable

In column reportedOn and reportedOn we can see many values are missing so our first attempt is removing these values and then our columns are in string format another task to do is change them in date formate finally difference is calculate.

colnames(covidtbl)
  1. 'id'
  2. 'province'
  3. 'district'
  4. 'municipality'
  5. 'createdOn'
  6. 'modifiedOn'
  7. 'label'
  8. 'gender'
  9. 'age'
  10. 'occupation'
  11. 'reportedOn'
  12. 'recoveredOn'
  13. 'deathOn'
  14. 'currentState'
  15. 'isReinfected'
  16. 'source'
  17. 'comment'
  18. 'type'
  19. 'nationality'
  20. 'ward'
  21. 'relatedTo'
  22. 'point.type'
  23. 'point.coordinates'
covidtbl_2 <-covidtbl[complete.cases(covidtbl$recoveredOn),]
date_of_recoveredOn <-as.Date(covidtbl_2$recoveredOn)
covidtbl_1 <-covidtbl[complete.cases(covidtbl$reportedOn),]
date_of_reportedOn<- as.Date(covidtbl_2$reportedOn)

diff1<- date_of_recoveredOn - date_of_reportedOn
#print(diff1)
fivenum(diff1)
Time differences in days
[1]   0   1  15  27 179

From the above fivenum summary average days different between recovered and reported dates is 15 and median days difference is 27 similarly maximum days difference is 179. Data are highly skewed.

Number of days between deaths and reported dates, and summary statistics of this variable

covidtbl_3 <-covidtbl[complete.cases(covidtbl$deathOn),]
death_date <- covidtbl_3$deathOn

death_date<- as.Date(death_date)
diff2 <- death_date - date_of_reportedOn
#list(diff2)
Warning message in unclass(time1) - unclass(time2):
“longer object length is not a multiple of shorter object length”
fivenum(diff2) 
Time differences in days
[1] -135   -8   16   53  156

From above data we can see that average days difference beween deaths and reported dates is 16 and median days different beween corresponding variable is 53 maximum days difference is 156.

Which measures of central tendency and dispersion is most appropriate for the age, diff1 and diff2 variables?

In case of age median is appropriate measure of central tendency. Because of following reasons.

  • There are a few extreme scores in the distribution of the data. (NOTE: R
  • single outlier can have a great effect on the mean.
  • There are some missing or undetermined values in your data. c.
  • There is an open ended distribution
    Corresponding measure of dispersion for age is IQR.

In case of diff1 median is approprate measure of central tendency. Because while watching data there are negative value, extermely differnt values. And corresponding measure of disperson is IQR.

Similarly, for diff2 data are skewed more so appropriate measure of central tendency is median and corresponding measure of dispersion is IRQ.

hist(df_age$age, xlim=c(1,100), ylim=c(0,300), breaks = 100)

png

From histogram above we can notice that age groups with heigest frequency belong to the interval 20 - 40.

summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   23.00   35.00   38.93   53.00  523.00 
boxplot(age)

png

From boxt plot we can see minimum value of age are 1 average value is appromately 39. We can see that there is presence of outliers.

hist(as.numeric(diff1))

png

colnames(covidtbl)
  1. 'id'
  2. 'province'
  3. 'district'
  4. 'municipality'
  5. 'createdOn'
  6. 'modifiedOn'
  7. 'label'
  8. 'gender'
  9. 'age'
  10. 'occupation'
  11. 'reportedOn'
  12. 'recoveredOn'
  13. 'deathOn'
  14. 'currentState'
  15. 'isReinfected'
  16. 'source'
  17. 'comment'
  18. 'type'
  19. 'nationality'
  20. 'ward'
  21. 'relatedTo'
  22. 'point.type'
  23. 'point.coordinates'

Number and percentage of the "current state" variable

current_state <- covidtbl$currentState
num_current_sate<-table(current_state)
prop.table(num_current_sate)*100
current_state
   active     death recovered 
26.846319  0.639963 72.513718 

Current state variable have three types of values active, death, recovered and about 72% people are recovered similarly approximately 27% people have covid posite and as compared to recoved ratio only 0.63% were deaths. From the data above maximum people can recovered from covid.

Number and percentage of the "isReinfected" variable, what percentage of cases were re-infected in Nepal at the given time period in the database? Was it realistic?

isreinfected <- covidtbl$isReinfected
num_isreinfected <- table(isreinfected)
prop.table(num_isreinfected)*100
isreinfected
       FALSE         TRUE 
99.996144801  0.003855199 

About 0.0038% people were reinfected from covid In Nepal. That ratio is too small few of the recovered people might be reinfected.

Number and percentage of "type" variable

type <- covidtbl$type
type_count <- table(type)
prop.table(type_count)*100
type
          imported local_transmission 
          57.89474           42.10526 

Number and percentage of "nationality" variable

nationality <- covidtbl$nationality
covidtbl_4<- covidtbl[complete.cases(covidtbl$nationality),]
currted_nationality <- covidtbl_4$nationality
nationality_count <- table(currted_nationality)
print(nationality_count)
prop.table(nationality_count)*100
currted_nationality
 2  3  4 
42 14  1 

currted_nationality
        2         3         4 
73.684211 24.561404  1.754386 

Nationality column is also suffered from missing values. Due to the missing values our analysis might not be accurate.

Cross-tabulation of province (row variable) and current status (column variable) with row percentage

cross_tabul<- table(covidtbl$province, covidtbl$currentState)
cross_tabul
    active death recovered
  1    643    56      6276
  2   1825   133     13061
  3  13086   187     16462
  4   1108    28      3293
  5   2496    76      8563
  6    593     5      2950
  7   1140    13      5823
prop.table(cross_tabul)*100
          active        death    recovered
  1  0.826297596  0.071963710  8.065075755
  2  2.345245897  0.170913811 16.784250228
  3 16.816376884  0.240307388 21.154760528
  4  1.423853400  0.035981855  4.231723145
  5  3.207525348  0.097665035 11.004022257
  6  0.762044283  0.006425331  3.790945423
  7  1.464975519  0.016705861  7.482940746

Maximum corona active were in provience 3, minimum corona active were in provience 6 similarly maximum people were died from corona in provience 3 also maximum people recovered from coron in same provience.

Cross-tabulation of sex (row variable) and current status (column variable) with row percentage

gender<- covidtbl$gender
gender <- str_to_title(gender)
cross_tabul<- table(gender, covidtbl$currentState)
cross_tabul
gender   active death recovered
  Female   6960   149     14294
  Male    13931   348     42076
prop.table(cross_tabul)*100
gender       active      death  recovered
  Female  8.9508475  0.1916202 18.3826745
  Male   17.9158415  0.4475424 54.1114741

Male were affected from corona more than female similarly death, recovered ratio of male is maximum than female.

Cross-tabulation of broad age groups (row variable) and current status (column variable) with row percentage

combined <- unlist(list(below_15,above_60, between_15_59))
borad_age <- as.numeric(combined)
cross_tabul<- table(borad_age, df_age$currentState)
cross_tabul

Above data show that death rate of broad age above 60 is maximum compared to the others. Recovered rate of broad age group between 15 to 59 is maximum compared others.

Scatterplot of province (x-axis) and cleaned age (y-axis) and appropriate correlation coefficient for this bi-variate data

plot(covidtbl$province,covidtbl$age)

png

Scatter plot does not show any specific pattern, it is not linear so in above case spearman rank correlation is appropriate.

cor.test(covidtbl$province,covidtbl$age, method = "spearman")
Warning message in cor.test.default(covidtbl$province, covidtbl$age, method = "spearman"):
“Cannot compute exact p-value with ties”

    Spearman's rank correlation rho

data:  covidtbl$province and covidtbl$age
S = 509628932, p-value = 1.796e-05
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.1143495 

There is low degree of negative correlation

Scatterplot of age (x-axis) and diff1 (y-axis) and appropriate correlation coefficient

plot(covidtbl$age,diff1)

png

Above scatterplot do not show any specific pattern so spearman rank correlation coefficent is appropriate.

cor.test(covidtbl$age,as.numeric(diff1), method = "spearman")
Warning message in cor.test.default(covidtbl$age, as.numeric(diff1), method = "spearman"):
“Cannot compute exact p-value with ties”

    Spearman's rank correlation rho

data:  covidtbl$age and as.numeric(diff1)
S = 592527576, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.2956149 

Between age and difference there is low degree of negative correlation.

Leave a Reply

Share this:

Subscribe to our Newsletter

Hello surfer, thank you for being here. We are as excited as you are to share what we know about data. Please subscribe to our newsletter for weekly data blogs and many more. If you’ve already done it, please close this popup.



No, thank you. I do not want.
100% secure.
Scroll to Top