finished the lab #4
@@ -49,11 +49,12 @@ pop = read_csv(file.path('data', 'county_population.csv'))

#' 1. *For the county-level cumulative data, are there any limitations or caveats that we should keep in mind for our analysis?*
#'
#' Yes. The data were gathered by many independent journalists, and the docs acknowledge some discrepancies in the data-gathering methods, such as cases being changed or even disappearing from the record. It's a big dataset, so I don't think this dooms the analysis, but it is something to keep in mind.
#'

#' 2. *How about the Google Mobility data?*
#'
#' Yes. The mobility data only cover users who opted in to Google's location tracking, so the sample may not be representative of the whole population.
#'

@@ -66,18 +67,18 @@ data("nytcovcounty")

#' 1. *How current are these data?*
#'
#' The most recent observations are from 2020-10-01, so the data are less than a week old as of this lab.
#'

#' 2. *What is the `fips` variable?*
#'
#' It's a geographic code: the Federal Information Processing Standards (FIPS) code, a standard numeric identifier for each county, which makes it useful for joining datasets.
#'

#' 3. *`cases` and `deaths` are nominally 100% complete. Does this mean that we have case and death counts for every county over the entire time span of the data?*
#'
#' No. This only means there are no empty cells or incomplete rows in the data as recorded; it doesn't mean there couldn't have been errors in data recording in the first place.
#'

@@ -91,6 +92,10 @@ data("nytcovcounty")

#' 1. *Write a pipe that starts with `nytcovcounty` as input, filters down to California and April 1-August 31, 2020, and assigns the result to a variable `filtered_df`. Hint: You can compare dates as though they were strings, e.g., `date <= '1980-05-17'` gives you dates on or before May 17, 1980.*
#'

filtered_df = nytcovcounty %>%
    filter(state == "California", date >= '2020-04-01', date <= '2020-08-31')

#' 2. To go from cumulative counts to daily changes, we'll use the following function.

daily_diff = function(x, order_var) {

@@ -101,25 +106,49 @@ daily_diff = function(x, order_var) {
#' *Write a pipe that takes `filtered_df` as input, groups the data by county and FIPS code, sorts the dataframe by date, and then converts the cumulative cases and death counts to daily changes using `daily_diff`. (`date` is the order variable.) Assign the result to `daily_df`. Hint: `mutate()` can replace the value of existing variables.*
#'

daily_df = filtered_df %>%
    group_by(county, fips) %>%
    arrange(date) %>%
    mutate(cases = daily_diff(cases, date), deaths = daily_diff(deaths, date)) %>%
    ungroup()

# Note: group_by() doesn't reorder rows; it only changes how grouped operations
# are computed. To actually sort the dataframe, use arrange().

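The `group_by()` confusion above comes up a lot: `group_by()` never reorders rows, while `arrange()` does. A minimal sketch with toy data (assuming dplyr is installed):

```r
library(dplyr)

# Toy data: rows deliberately out of order
toy = tibble(g = c('b', 'a', 'b', 'a'), x = 1:4)

# group_by() only changes how grouped operations compute; row order is untouched
grouped = toy %>% group_by(g)
print(grouped$g)  # still b, a, b, a

# arrange() is what actually sorts the rows
sorted = toy %>% arrange(g)
print(sorted$g)   # a, a, b, b
```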
#' 3. *Finally we need to calculate rates per 1 million residents. Write a pipe that takes `daily_df` as input, joins it with the `pop` dataframe using appropriate variables, removes any rows with missing FIPS codes, and constructs the variables `cases_per_pop` and `deaths_per_pop`. When constructing these variables, multiply by `per_pop` to get rates per 1 million residents. Assign the result to `covid_df`, since this contains the Covid data for our analysis.*
#'

covid_df = daily_df %>%
    left_join(pop, by = c('state', 'county', 'fips')) %>%
    filter(!is.na(fips)) %>%
    mutate(cases_per_pop = cases/population * per_pop, deaths_per_pop = deaths/population * per_pop)

# There are many NA values for cases and deaths in this. Did I lose some in the
# join, or did I misunderstand the "nominally complete" part before?

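As a sanity check on the per-capita arithmetic, here is a toy join-and-rate sketch (hypothetical numbers; assumes dplyr and that `per_pop` is 1 million, as the problem statement implies):

```r
library(dplyr)

per_pop = 1e6  # rates per 1 million residents, as in the lab

# Toy stand-ins for daily_df and pop (made-up values, for illustration only)
toy_daily = tibble(fips = '06007', cases = 50)
toy_pop   = tibble(fips = '06007', population = 100000)

toy_covid = toy_daily %>%
    left_join(toy_pop, by = 'fips') %>%
    mutate(cases_per_pop = cases / population * per_pop)

# 50 cases among 100,000 residents is 500 cases per million
print(toy_covid$cases_per_pop)
```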
#' # Problem 4 #
#' 1. *To explore these time-series data visually, we'll want to use line plots of cases or deaths over time. The line group needs the `group` aesthetic to determine which values should be treated as part of a single line. Uncomment and fill in the blanks to plot cases per 1,000,000 residents over time for each county.*
#'

# ggplot(covid_df, aes(---, ---, group = ---)) +
#     geom_line()
ggplot(covid_df, aes(date, cases_per_pop, group = county)) +
    geom_line()

#' 2. *Because there are so many counties, the lines are heavily overplotted. Modify your code from the last problem to facet by county. Try both `scales = 'fixed'` and `scales = 'free_y'`.*
#'

ggplot(covid_df, aes(date, cases_per_pop, group = county)) +
    geom_line() +
    facet_wrap(~county, scales = 'free_y')

ggplot(covid_df, aes(date, cases_per_pop, group = county)) +
    geom_line() +
    facet_wrap(~county, scales = 'fixed')

#' 3. *The plot indicates that, on a few days, some counties gain or lose thousands of cases per million residents. What's up with that? Hints: Use `plotly::ggplotly()` to create an interactive version of the most recent plot. Use `filter()` to narrow down the data to particular counties during short periods of time, and `View()` to peruse the data after filtering.*
#'

# I don't know how to do this.

# Reviewer comment:
# A plotly approach: generate the static plots as above, then call
# plotly::ggplotly() on them.
# A filter approach: look at the static plot to find a good threshold for
# "large" swings. Next, filter down to those counties, then check things like
# the population.

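The filter approach can be sketched on toy data (assuming dplyr; the toy values and the 1000-per-million threshold are illustrative guesses, not from the lab):

```r
library(dplyr)

# Toy stand-in for covid_df; 1000 is an assumed threshold for a "large" swing
toy_covid = tibble(
    county        = c('A', 'A', 'B', 'B'),
    cases_per_pop = c(50, 2500, 40, -1200)
)

# Counties with at least one large daily swing in either direction
jumpy = toy_covid %>%
    filter(abs(cases_per_pop) > 1000) %>%
    distinct(county)

print(jumpy$county)  # both A and B cross the threshold
```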
#' 4. *We can pass data through a pipe before calling `ggplot()`. Let's focus on 4 counties of interest, two rural and two urban: Butte, Merced, Sacramento, and Santa Clara. Uncomment and fill in the blanks:*

@@ -133,9 +162,16 @@ focal_counties = c('Butte', 'Merced', 'Sacramento', 'Santa Clara')
#' Note that we need to use plus `+` to connect ggplot layers, not the pipe `%>%`. You can get weird errors if you accidentally use the wrong one. I do this all the time.
#'

covid_df %>%
    filter(county %in% focal_counties) %>%
    ggplot(aes(date, cases_per_pop, group = county)) +
    geom_line() +
    facet_wrap(~county, scales = 'fixed')

#' 5. *The common narrative of Covid-19 in California runs something like this: "Despite being one of the locations where Covid-19 was detected early on, California mostly avoided the large outbreak that hit New York in the spring. About a month after stay-at-home rules were relaxed in late May, cases began to increase in June, leading to a large outbreak across the state that peaked in July. This outbreak has largely faded by September." Based on your (brief) visual EDA of the data, does this narrative seem accurate?*
#'
#' Not quite: Merced peaked in late August, and the other counties don't seem to have clearly peaked yet; cases may still be growing.
#'

#' *(Just an aside. Most presentations of Covid-19 data use 7-day rolling averages. Either they don't show raw counts at all, or they emphasize the rolling averages rather than the raw counts. In the `plots` folder, `chronicle.png` shows an example from the _San Francisco Chronicle_. Because this lab is already super long and complicated, I decided to skip the rolling averages. Two common packages for calculating rolling averages are [zoo](https://cran.r-project.org/web/packages/zoo/index.html) and [slider](https://cran.r-project.org/web/packages/slider/).)*

@@ -153,30 +189,53 @@ mob_df = read_csv(file.path('data', 'mobility.csv'))

#' 1. *How many distinct values of `type` are there? What does this variable indicate?*
#'
#' There are 6 distinct values; `type` indicates the type of location being visited (e.g., parks, residential, retail).
#'
mob_df %>%
    pull(type) %>%
    unique()

#' 2. *How about `pct_diff`?*
#'
#' It looks like the percent change in visits relative to a pre-pandemic baseline. I couldn't find where the docs say this explicitly.

mob_df %>%
    pull(pct_diff) %>%
    unique()

#' 3. *During our time period of interest, does `mob_df` contain mobility data for every county in California? If some counties are missing data, which ones are they? Hints: There are 58 counties in California. Try counting and then filtering to identify counties with outlying row counts.*
#'
#' Counting rows per county shows that some counties are missing data:
#' Trinity, for example, has only 666 rows. Very ominous, lol.

count(mob_df, county) %>%
    filter(n < max(n))

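The count-then-filter hint can be checked on toy data (assuming dplyr; the counts are made up for illustration):

```r
library(dplyr)

# Hypothetical row counts: 'Trinity' has fewer rows than the others
toy = tibble(county = rep(c('Butte', 'Trinity', 'Yolo'), times = c(3, 2, 3)))

# Counties whose row count falls short of the maximum
short_counties = count(toy, county) %>%
    filter(n < max(n))

print(short_counties$county)  # only Trinity is short
```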
#' 4. *In the `plots` folder, take a look at `mobility.png`. Recreate this plot. (Use whatever theme and colors that you like. To create a horizontal line: `geom_hline(yintercept = 0, alpha = .5)`. You don't need to save to disk.)*
#'
#' My first attempt drew jagged lines: with group = county and color = type,
#' each county's points across all three types were connected as one line.
#' Grouping by type instead fixes it.

locs_we_want = c('parks', 'residential', 'retail')

mob_df %>%
    filter(county %in% focal_counties) %>%
    filter(type %in% locs_we_want) %>%
    ggplot(aes(date, pct_diff, color = type, group = type)) +
    geom_line() +
    geom_hline(yintercept = 0, alpha = .5) +
    facet_wrap(~county, scales = 'fixed')

#' 5. *Again, the standard narrative of Covid-19 in California says that people were staying home in the spring, then going out more in May-June as stay-at-home orders were lifted. Does this data support that narrative?*
#'
#' Roughly, yes: parks mobility definitely went up over May-June, and residential mobility seems to have gone down over the same period.

#' 6. *What other potentially interesting patterns do you see in these mobility data?*
#'
#' Retail mobility in Santa Clara still hadn't recovered, so people weren't going shopping; and the parks spike suggests people really missed being outside.

@@ -188,24 +247,46 @@ mob_df = read_csv(file.path('data', 'mobility.csv'))

#' 1. *This is just one way we could get at the relationship between stay-at-home in June and the peak of the outbreak in July. What are some other approaches we might take?*
#'
#' We could be more granular, e.g., weekly averages instead of monthly. We could also look at rates of change rather than totals, and check whether mobility correlates with cases at a two-week lag.
#'

#' 2. *Construct a dataframe `parks_june` that reports the mean level of "parks" mobility for each county in June 2020. Call the variable `parks`. (Just so the automatic checks know where to look.) Note that the Google data has a lot of gaps for the `parks` type. Use the `na.rm = TRUE` argument in `mean()` to handle missing values, and then filter out missing values. The final dataframe should have three columns: county name, FIPS code, and `parks`. And it should have one row for each county in the mobility data for which we have an estimate for "parks".*
#'

parks_june = mob_df %>%
    filter(type == 'parks',
           date >= '2020-06-01',
           date <= '2020-06-30') %>%
    group_by(county, fips) %>%
    summarize(parks = mean(pct_diff, na.rm = TRUE)) %>%
    ungroup() %>%
    filter(!is.na(parks))

# Reviewer comment: Double check the number of days in July.
#     date <= '2020-07-30'  ->  date <= '2020-07-31'

# Reviewer comment: The hanging pipe here is also causing the script to fail
# when it's run.
#     ungroup() %>%  ->  ungroup()

# Reviewer comment: With this hanging `+`, R tries to combine this ggplot
# object with the thing next (the ggplot object on line 137). It can't do
# this, so the lab script fails before running the tests. This is why
# `testthat::test_dir('tests')` is failing.
+, R tries to combine this ggplot object with the thing next (the ggplot object on line 137). It can't do this so the lab script fails before running the tests. This is whytestthat::test_dir('tests')is failing.