Insights from Data with BigQuery: Challenge Lab
I have recently completed google cloud “Insights from Data with BigQuery” quest and have earned a badge. The lab is for 9 credits and the time to complete the lab is 1 hour 30 minutes. Before starting the lab, I looked at all the scenarios and queries that needs to be executed and I was sure enough to complete it before time and I had looked at the big query covid 19 open dataset as part of other Qwiklabs.
However, while I was doing this challenge lab, and after query 1, when I started running queries 2, 3 and so on, I was getting everything as incorrect and I started wondering what was going wrong. I was sure my queries were correct and the initial queries were pretty straight forward. Not sure how qwiklabs is programmed to monitor the results but I realized it the validation is done based on specific format and also columns.
For example, I had used country_code and subregion1_code for most of my queries but looks like Qwiklabs is looking for country_name and subregion1_name:-)
For example, Query 5, which is to identify specific day, I executed below query
SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_code="IT" and cumulative_deceased>10000
ORDER BY date
LIMIT 1
However, Qwiklabs was giving error but successfully validated below query even though the output from previous query is same as this query.One more thing that I realized here was “order by”. Your query will not be validated if you do not add order by.
SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name="Italy" and cumulative_deceased>10000
ORDER BY date
LIMIT 1
Hence, I thought of writing this post so that most of the people who are trying to take this challenge lab can get through the challenge for the first time without paying any extra credit.
I have tried my best to provide the queries here that I executed during the lab challenge as I did not take the screen shots or saved the queries while I was doing the challenge. However, I am sure it will help you in terms of what Qwiklabs is exactly looking for in the query.
Query 1: Total Confirmed Cases
SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where date='2020-04-15'
Query 2: Worst Affected Areas
SELECT count(*) as count_of_states
FROM (
SELECT sum(cumulative_deceased) as total,subregion1_name,date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='2020-04-10' and country_name='United States of America'
AND subregion1_code is not null and new_deceased is not null
GROUP BY subregion1_name,date
HAVING total >100
)
The same query can also be written using WITH as below.
WITH deaths_by_states AS (
SELECT subregion1_name as state, sum(cumulative_deceased) AS death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name='United States of America' and date='2020-04-10' and subregion1_code is not null and new_deceased is not null
GROUP BY subregion1_name
)SELECT count(*) as count_of_states
FROM deaths_by_states
WHERE death_count > 100
Query 3: Identifying Hotspots
SELECT subregion1_code as state,SUM(cumulative_confirmed) as total_confirmed_cases
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='2020-04-10' and country_code='US'
GROUP BY state
HAVING total_confirmed_cases > 1000
ORDER BY state desc;
Query 4: Fatality Ratio
SELECT SUM(cumulative_confirmed) AS total_confirmed_cases, sum(cumulative_deceased) as total_deaths,
(sum(cumulative_deceased)/SUM(cumulative_confirmed)*100) as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date between '2020-04-01' and '2020-04-30' and country_code='IT'
GROUP BY country_code
Query 5: Identifying specific day
SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name='Italy' and cumulative_deceased>10000
ORDER BY date
LIMIT 1
Query 6: Finding days with zero net new cases
WITH india_cases_by_date AS (
SELECT
date,
SUM(cumulative_confirmed) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="India"
AND date between '2020-02-21' and '2020-03-15'
GROUP BY
date
ORDER BY
date ASC
), india_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
SELECT count(*)
FROM india_previous_day_comparison
WHERE net_new_cases=0
Query 7: Doubling rate
WITH US_cases_by_date AS (
SELECT
date,
SUM(new_confirmed) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_code="US"
AND date between '2020-03-22' and '2020-04-22'
GROUP BY
date
ORDER BY
date ASC
)
, US_previous_day_comparison AS
(SELECT
date as DATE,
cases AS Confirmed_Cases_On_Day,
LAG(cases) OVER(ORDER BY date) AS Confirmed_Cases_Previous_Day,
cases - LAG(cases) OVER(ORDER BY date) AS Increase,
(cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase
FROM US_cases_by_date
)SELECT DATE,Confirmed_Cases_On_Day,Confirmed_Cases_Previous_Day,percentage_increase AS Percentage_Increase_In_Cases
FROM US_previous_day_comparison
WHERE percentage_increase > 10
Query 8: Recovery rate
WITH cases_by_country AS (
SELECT
country_name AS country,
SUM(cumulative_confirmed) AS cases,
SUM(cumulative_recovered) AS recovered_cases,
(sum(cumulative_recovered)/sum( cumulative_confirmed))*100 as recovery_rate
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date='2020-05-10'
GROUP BY
country_name
)SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM
cases_by_country
WHERE
cases > 50000
ORDER BY recovery_rate DESC
LIMIT 10
Query 9: CDGR — Cumulative Daily Growth Rate
WITH
france_cases AS (
SELECT
date,
SUM(cumulative_confirmed) AS total_cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="France"
AND date IN ('2020-01-24',
'2020-05-10')
GROUP BY
date
ORDER BY
date)
, summary as (
SELECT
total_cases AS first_day_cases,
LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases,
DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
france_cases
LIMIT 1
)SELECT first_day_cases, last_day_cases, days_diff, pow((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary
Create a Datastudio report
SELECT
date, SUM(cumulative_confirmed) AS country_cases,
SUM(cumulative_deceased) AS country_deaths
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date BETWEEN '2020-03-15'
AND '2020-04-30'
AND country_name='United States of America'
GROUP BY date
Once the above query is executed, execute below steps,
- Click on explore data and “Explore with data studio”
- Authorize Data Studio to access BigQuery.
- For the first time the report will fail to create. Click the blank report and accept the terms of service.
- Go back to the BigQuery page and click Explore with Data Studio again.
- In the new Data Studio report, select Add a chart > Time series Chart.
- Add
country_cases
andcountry_deaths
to the Metric field. - Click Save to commit the change.
Hope this helps:-)
Thank you all!
Regards
Radhika