(practice with the babynames data)
The babynames package exists to give us the data table babynames.
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
How popular has the name “Mary” been over the years?
How popular has your name been, over the years? (Work with percentages, not absolute number of births.)
Let’s investigate the popularity of the name “Prince” as a name for boys, since the year 1970.
We will make special note of 1984, the year that Prince released his classic album Purple Rain.
babynames |>
  filter(name == "Prince" & year >= 1970 & sex == "M") |>
  # convert the proportion to a percentage:
  mutate(perc = prop * 100) |>
  ggplot(aes(x = year, y = perc)) +
  geom_line() +
  # mark the release year of Purple Rain:
  geom_vline(xintercept = 1984, color = "purple") +
  labs(
    x = NULL,
    y = 'Percentage of males named "Prince"',
    title = "There are more Princes, Now!",
    subtitle = "(after Purple Rain was released in 1984)"
  )
Which name for males has been more popular over the years: “Homer”, or {insert unusual name of your choice}?
What are the top 5 most popular female names for each decade from the 1950s through the 2010s?
To do this, we would need to find the decade in which each year occurs.
Try this, for various years:
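As a minimal sketch of that calculation: dividing a year by 10 and taking the floor drops its last digit, and pasting "0s" onto the result gives a decade label. The helper name decade_label() and the sample years below are illustrative assumptions, not part of the original exercise.

```r
# Hypothetical helper (the name is an assumption, not from the text):
# floor(year / 10) drops the final digit of the year,
# and pasting "0s" onto it produces a label such as "1980s".
decade_label <- function(year) {
  paste(floor(year / 10), "0s", sep = "")
}

# A few arbitrary years to try:
years <- c(1950, 1959, 1987, 2003)
decade_label(years)
#> [1] "1950s" "1950s" "1980s" "2000s"
```

Because paste() and floor() are vectorized, the same expression can be used inside mutate() to label every row of a data table at once.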
tops <-
  babynames |>
  # just 1950 through 2019, females:
  filter(year >= 1950 & year <= 2019 & sex == "F") |>
  # make the decade column:
  mutate(
    decade = paste(
      floor(year / 10),
      "0s", sep = ""
    )
  ) |>
  # get total babies for each name, in each decade:
  summarize(total = sum(n), .by = c(decade, name)) |>
  # get top 5 names in each decade:
  slice_max(n = 5, order_by = total, by = decade) |>
  arrange(decade, desc(total))
… you had reversed the order of grouping?
bad_tops <-
  babynames |>
  filter(year >= 1950 & year <= 2019 & sex == "F") |>
  mutate(
    decade = paste(
      floor(year / 10),
      "0s", sep = ""
    )
  ) |>
  # group by decades within each name,
  # rather than names within each decade:
  summarize(total = sum(n), .by = c(name, decade)) |>
  # get top 5 names in each decade:
  slice_max(n = 5, order_by = total, by = decade) |>
  arrange(decade, desc(total))