group_by() not performing correctly when taxonConceptID and species are supplied #267

@mjwestgate


There appears to be a problem with group_by() in some instances. If we look at a non-authoritative species list (such as this one), we can use the 'View occurrence records' button to go to biocache. The resulting URL then contains a query ID that we can use in galah:

library(galah)

galah_call() |>
    filter(qid == "1744761573646") |>
    count() |>
    collect()
# A tibble: 1 × 1
   count
   <int>
1 510928

We might then add a group_by() statement to investigate which taxonConceptIDs are associated with each species in the list:

result <- galah_call() |>
    filter(qid == "1744761573646") |>
    group_by(species, taxonConceptID) |>
    count() |>
    collect()

This has two problems. The first is that the second column name is parsed incorrectly:

> colnames(result)
[1] "species"                                 "taxonConceptID.https://biodiversity.org"
[3] "count"  
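Until the parsing is fixed, the column can be renamed after `collect()`. The following is only a minimal sketch on mock data: `result_mock` and its values are hypothetical stand-ins for the real result, and it assumes the bad name always begins with `taxonConceptID.`, as it does in the output above.

```r
# Mock result with the mis-parsed column name reported above (no API call is made).
result_mock <- data.frame(
  species = c("Species a", "Species b"),
  `taxonConceptID.https://biodiversity.org` = c("id-1", "id-2"),
  count = c(10L, 20L),
  check.names = FALSE  # keep the non-syntactic column name as-is
)

# Restore the requested field name by matching on the known prefix.
bad <- startsWith(names(result_mock), "taxonConceptID.")
names(result_mock)[bad] <- "taxonConceptID"

names(result_mock)
```

This only works around the symptom; it does not address why the name is split in the first place.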

The second, and more important, is that taxonConceptIDs are repeated across species, which is not only incorrect but should be actively impossible:

> result |>
+     group_by(`taxonConceptID.https://biodiversity.org`) |>
+     summarize(count = n())
# A tibble: 15 × 2
   `taxonConceptID.https://biodiversity.org`                                 count
   <chr>                                                                     <int>
 1 https://biodiversity.org.au/afd/taxa/0480b9ae-ba82-46a1-902e-fdcf4bd8e7c7     8
 2 https://biodiversity.org.au/afd/taxa/23a8017a-3a2b-4a52-8ca6-d168bf52659c     8
 3 https://biodiversity.org.au/afd/taxa/428ea60d-7f8b-401e-b63b-83910f9ef8b8     8
 4 https://biodiversity.org.au/afd/taxa/617e069f-eb5c-40ec-a027-f5fd40e5145d     8
 5 https://biodiversity.org.au/afd/taxa/645b287c-e547-4602-9275-ad3f972328bb     8
 6 https://biodiversity.org.au/afd/taxa/715a2874-1942-4762-866c-1194990e7a91     8
 7 https://biodiversity.org.au/afd/taxa/8c178318-773f-4ddf-a4c1-01967698054c     8
 8 https://biodiversity.org.au/afd/taxa/a4ef7496-ba95-481c-b3a5-a6ed66f37394     8
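To see which species share an identifier, rather than only how many rows each identifier has, a check along the following lines may help. It is a sketch in base R on mock data: `dup_mock`, its values, and the clean `taxonConceptID` column name are all assumptions for illustration, not output from the query above.

```r
# Mock data: "id-1" is (wrongly) attached to two different species.
dup_mock <- data.frame(
  species = c("Species a", "Species b", "Species c"),
  taxonConceptID = c("id-1", "id-1", "id-2"),
  count = c(5L, 7L, 3L)
)

# Number of distinct species recorded against each taxonConceptID.
species_per_id <- tapply(
  dup_mock$species,
  dup_mock$taxonConceptID,
  function(x) length(unique(x))
)

# Identifiers that appear under more than one species, and the rows involved.
shared_ids <- names(species_per_id)[species_per_id > 1]
dup_mock[dup_mock$taxonConceptID %in% shared_ids, ]
```

If every identifier maps to exactly one species, `shared_ids` is empty and the final subset has zero rows.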

Interestingly, if we use an authoritative list that is indexed in biocache, we get the former problem but not the latter:

> result2 <- galah_call() |>
+     filter(species_list_uid == "dr656") |>
+     group_by(species, taxonConceptID) |>
+     count() |>
+     collect()

> colnames(result2)
[1] "species"                                 "taxonConceptID.https://biodiversity.org"
[3] "count"

> table(result2[, 2]) |>
+     max()
[1] 1 # i.e. no duplicate taxonConceptIDs

It is currently unclear whether the two issues are related, or what is causing them.

Labels: bug (Something isn't working)
