Description
There appears to be a problem with group_by() in some instances. If we look at a non-authoritative species list (such as this one), we can use the 'View occurrence records' button to go to biocache. The resulting URL then contains a query ID that we can use in galah:
galah_call() |>
  filter(qid == "1744761573646") |>
  count() |>
  collect()
# A tibble: 1 × 1
   count
   <int>
1 510928
A natural next step is to add a group_by() statement to investigate which taxonConceptIDs are associated with each species in the list:
result <- galah_call() |>
  filter(qid == "1744761573646") |>
  group_by(species, taxonConceptID) |>
  count() |>
  collect()
This has two problems. First, the second column name is parsed incorrectly:
> colnames(result)
[1] "species" "taxonConceptID.https://biodiversity.org"
[3] "count"
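Until the parsing problem is fixed, the malformed column can be renamed after collect(). The following is only a sketch of a possible workaround, not a fix for the underlying bug: the toy tibble and its values are made up to stand in for `result`, and it assumes that exactly one column name starts with "taxonConceptID".

```r
library(dplyr)
library(tibble)

# Toy data standing in for the `result` tibble (values are invented).
result <- tibble(
  species = c("Species a", "Species b"),
  `taxonConceptID.https://biodiversity.org` = c("id-1", "id-2"),
  count = c(10L, 20L)
)

# Rename the single column whose name starts with "taxonConceptID"
# back to the field name that was requested in group_by().
fixed <- result |>
  rename_with(~ "taxonConceptID", .cols = starts_with("taxonConceptID"))

colnames(fixed)
```

This only repairs the label; it does not address whether the values in that column are correct.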
Second, and more importantly, taxonConceptIDs are repeated across species, which is not only incorrect but should be impossible:
> result |>
+ group_by(`taxonConceptID.https://biodiversity.org`) |>
+ summarize(count = n())
# A tibble: 15 × 2
`taxonConceptID.https://biodiversity.org` count
<chr> <int>
1 https://biodiversity.org.au/afd/taxa/0480b9ae-ba82-46a1-902e-fdcf4bd8e7c7 8
2 https://biodiversity.org.au/afd/taxa/23a8017a-3a2b-4a52-8ca6-d168bf52659c 8
3 https://biodiversity.org.au/afd/taxa/428ea60d-7f8b-401e-b63b-83910f9ef8b8 8
4 https://biodiversity.org.au/afd/taxa/617e069f-eb5c-40ec-a027-f5fd40e5145d 8
5 https://biodiversity.org.au/afd/taxa/645b287c-e547-4602-9275-ad3f972328bb 8
6 https://biodiversity.org.au/afd/taxa/715a2874-1942-4762-866c-1194990e7a91 8
7 https://biodiversity.org.au/afd/taxa/8c178318-773f-4ddf-a4c1-01967698054c 8
8 https://biodiversity.org.au/afd/taxa/a4ef7496-ba95-481c-b3a5-a6ed66f37394 8
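To list exactly which identifiers are shared between species, it may help to count distinct species per identifier rather than rows. A minimal sketch, using made-up data and assuming the identifier column has already been renamed to `taxonConceptID`:

```r
library(dplyr)
library(tibble)

# Invented rows: "id-1" is attached to two species, "id-2" to one.
toy <- tibble(
  species = c("Species a", "Species b", "Species c"),
  taxonConceptID = c("id-1", "id-1", "id-2"),
  count = c(5L, 7L, 3L)
)

# For each identifier, count the distinct species it appears with,
# then keep only identifiers linked to more than one species.
shared_ids <- toy |>
  group_by(taxonConceptID) |>
  summarize(n_species = n_distinct(species), .groups = "drop") |>
  filter(n_species > 1)

shared_ids
```

If each identifier mapped to a single species, as expected, `shared_ids` would have zero rows.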
Interestingly, if we use an authoritative list that is indexed in biocache, we get the former problem but not the latter:
> result2 <- galah_call() |>
+ filter(species_list_uid == "dr656") |>
+ group_by(species, taxonConceptID) |>
+ count() |>
+ collect()
> colnames(result2)
[1] "species" "taxonConceptID.https://biodiversity.org"
[3] "count"
> table(result2[, 2]) |>
+ max()
[1] 1 # i.e. no duplicate taxonConceptIDs
It is unclear at present whether the two issues are related, or what is causing them.