Every now and them I spend some of my time playing with bird-related data. Lately I’ve been tinkering with code to interface with eBird, a global citizen science initiative that records bird sightings throughout the world, and even helps you maintain your own lists (for those of you who are listers, that is…).
The eBird API interface allows for downloading their most up-to-date latest checklist, and can be accessed in R using the ebirdtaxonomy
function in the rebird package. Unfortunately, this checklist doesn’t include any high-level taxonomic information, such as order and family, for each species.
I was looking for a way to compare my eBird sightings across taxa, so after a while of searching for different taxonomic checklists and ways of merging them, I’ve come up with a way that’s only slightly convoluted to include higher taxonomic info into any eBird checklist.
library(dplyr)
library(tidyr)
library(lazyeval)
library(rebird)
library(tidyr)
library(xml2)
eBird’s taxonomic checklist
First, we load eBird’s taxonomic checklist and separate the scientific name into two columns, genus
and species
:
ebirdtax <- ebirdtaxonomy(cat = "species") %>%
separate(sciName, c("genus", "species"), sep = " ", remove = FALSE)
ebirdtax %>% select(genus, species, everything())
## Source: local data frame [10,473 x 11]
##
## genus species speciesCode category comName
## (chr) (chr) (chr) (chr) (chr)
## 1 Struthio camelus ostric2 species Common Ostrich
## 2 Struthio molybdophanes ostric3 species Somali Ostrich
## 3 Rhea americana grerhe1 species Greater Rhea
## 4 Rhea pennata lesrhe2 species Lesser Rhea
## 5 Nothocercus julius tabtin1 species Tawny-breasted Tinamou
## 6 Nothocercus bonapartei higtin1 species Highland Tinamou
## 7 Nothocercus nigrocapillus hootin1 species Hooded Tinamou
## 8 Tinamus tao grytin1 species Gray Tinamou
## 9 Tinamus solitarius soltin1 species Solitary Tinamou
## 10 Tinamus osgoodi blatin1 species Black Tinamou
## .. ... ... ... ... ...
## Variables not shown: sciNameCodes (chr), sciName (chr), taxonID (chr),
## taxonOrder (dbl), comNameCodes (chr), bandingCodes (chr)
This checklist has 10473 species and, as already mentioned, doesn’t contain any columns with order or family information. We need to obtain that information somewhere else and match it to this checklist. For ease of comparison throughout this post, we’ll just focus on distinct genera.
ebirdgen <- select(ebirdtax, genus) %>%
distinct(genus)
ebirdgen
## Source: local data frame [2,226 x 1]
##
## genus
## (chr)
## 1 Struthio
## 2 Rhea
## 3 Nothocercus
## 4 Tinamus
## 5 Crypturellus
## 6 Rhynchotus
## 7 Nothoprocta
## 8 Nothura
## 9 Taoniscus
## 10 Eudromia
## .. ...
This leaves us with 2226 unique genera.
BirdLife’s taxonomic checklist
There are other organizations that also maintain their own complete bird species checklists, but do include information on order and families for each species. One of these is BirdLife International. We need to download the latest version of this checklist; for simplicity I’ve already turned this list into a .csv file which you can download here (you can also download the original .zip file with the complete information here).
A little bit of wrangling is required to make this checklist comparable with eBird’s taxonomy:
birdlifetax <- read.csv("~/Projects/stats-ebird/BirdLife-v8.csv",
stringsAsFactors = FALSE) %>% tbl_df %>%
select(Order, Family_name, Scientific.name, everything())
birdlifetax
## Source: local data frame [11,862 x 14]
##
## Order Family_name Scientific.name Family
## (chr) (chr) (chr) (chr)
## 1 STRUTHIONIFORMES Struthionidae Struthio camelus Ostriches
## 2 STRUTHIONIFORMES Struthionidae Struthio molybdophanes Ostriches
## 3 STRUTHIONIFORMES Struthionidae Struthio camelus Ostriches
## 4 STRUTHIONIFORMES Rheidae Pterocnemia tarapacensis Rheas
## 5 STRUTHIONIFORMES Rheidae Rhea americana Rheas
## 6 STRUTHIONIFORMES Rheidae Rhea pennata Rheas
## 7 STRUTHIONIFORMES Rheidae Rhea tarapacensis Rheas
## 8 STRUTHIONIFORMES Rheidae Rhea nana Rheas
## 9 STRUTHIONIFORMES Rheidae Rhea pennata Rheas
## 10 STRUTHIONIFORMES Tinamidae Nothocercus julius Tinamous
## .. ... ... ... ...
## Variables not shown: Common.name (chr), Authority (chr),
## BirdLife.taxonomic.treatment (chr), IUCN_RL_2015 (chr), Synonyms
## (chr), Alternative.common.names (chr), Taxonomic.notes (chr),
## Taxonomic.sources. (chr), SISRecID (int), SpcRecID (int)
As you can see, this second checklist contains more species than eBird (11862). We’ll split the scientific name into genus and species, capitalize only the first letter of each order (using a small function) , and select all unique genera with their respective order and family columns:
capFirst <- function(x) {
paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}
birdlifegen <- birdlifetax %>%
separate(Scientific.name, c("genus", "species"), sep = " ") %>%
mutate(order = capFirst(Order)) %>%
select(order, family = Family_name, genus) %>%
distinct(genus)
birdlifegen
## Source: local data frame [2,219 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Struthioniformes Struthionidae Struthio
## 2 Struthioniformes Rheidae Pterocnemia
## 3 Struthioniformes Rheidae Rhea
## 4 Struthioniformes Tinamidae Nothocercus
## 5 Struthioniformes Tinamidae Tinamus
## 6 Struthioniformes Tinamidae Crypturellus
## 7 Struthioniformes Tinamidae Rhynchotus
## 8 Struthioniformes Tinamidae Nothoprocta
## 9 Struthioniformes Tinamidae Nothura
## 10 Struthioniformes Tinamidae Taoniscus
## .. ... ... ...
The BirdLife checklist also contains more unique genera than eBird’s checklist. What is interesting to see is which ones (and how many) genera are different between these checklists.
anti_join(ebirdgen, birdlifegen, by = "genus") %>%
select(genus) %>% distinct(genus)
## Source: local data frame [173 x 1]
##
## genus
## (chr)
## 1 Euodice
## 2 Spermestes
## 3 Odontospiza
## 4 Paludipasser
## 5 Sporaeginthus
## 6 Granatina
## 7 Coccopygia
## 8 Pachyphantes
## 9 Carpospiza
## 10 Alario
## .. ...
There are 173 genera in the eBird checklist that are not in the BirdLife checklist! It’s important to know that the difference from anti_join
(and also setdiff
) is asymmetrical, therefore the order in which we write the arguments matters. Just out of curiosity, we can have a look at the symmetric difference between these sets:
sym_diff <- function(a,b) unique(c(setdiff(a,b), setdiff(b,a)))
ebirdbl.diff <- sym_diff(ebirdgen$genus, birdlifegen$genus)
data.frame(genus = ebirdbl.diff) %>% tbl_df
## Source: local data frame [339 x 1]
##
## genus
## (fctr)
## 1 Chen
## 2 Oressochen
## 3 Oceanodroma
## 4 Ichthyophaga
## 5 Crecopsis
## 6 Anurolimnas
## 7 Aenigmatolimnas
## 8 Porphyriops
## 9 Mustelirallus
## 10 Chroicocephalus
## .. ...
So between the eBird and BirdLife checklists, there are 339 genera that are are in one but not the other!
IOC’s taxonomic checklist
There are too many genera missing to enter by hand, so we need to find another data source to fill in the gaps. The IOC World Bird List is another well-maintained bird species checklist. Their spreadsheet file is way too messy to work with, however we can parse the XML file they provide. I have very little experience with XML code but I am very grateful that Scott Chamberlain, who works at ROpenSci, conjured some black magic and came up with code to turn the XML into a data frame (you can see his gist here):
# sourcing the gist creates a data frame named df
devtools::source_gist("https://gist.github.com/sckott/c0437a71a889793e30d5")
iocgen <- mutate(df, order = capFirst(order))
iocgen
## Source: local data frame [2,284 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Tinamiformes Tinamidae Tinamus
## 2 Tinamiformes Tinamidae Nothocercus
## 3 Tinamiformes Tinamidae Crypturellus
## 4 Tinamiformes Tinamidae Rhynchotus
## 5 Tinamiformes Tinamidae Nothoprocta
## 6 Tinamiformes Tinamidae Nothura
## 7 Tinamiformes Tinamidae Taoniscus
## 8 Tinamiformes Tinamidae Eudromia
## 9 Tinamiformes Tinamidae Tinamotis
## 10 Struthioniformes Struthionidae Struthio
## .. ... ... ...
The IOC checklist has more genera than either of the checklists examined previously. Now we combine the IOC and BirdLife genera and compare it with the eBird checklist:
allgenera <- full_join(iocgen, birdlifegen,
by = c("order", "family", "genus"))
allgenera %>% arrange(genus)
## Source: local data frame [2,833 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Apodiformes Trochilidae Abeillia
## 2 Caprimulgiformes Trochilidae Abeillia
## 3 Passeriformes Cettiidae Abroscopus
## 4 Passeriformes Sylviidae Abroscopus
## 5 Galliformes Cracidae Aburria
## 6 Passeriformes Meliphagidae Acanthagenys
## 7 Passeriformes Thraupidae Acanthidops
## 8 Passeriformes Emberizidae Acanthidops
## 9 Passeriformes Fringillidae Acanthis
## 10 Passeriformes Acanthisittidae Acanthisitta
## .. ... ... ...
By sorting the data frame by genus, we can see that in certain cases (like for the genus Abeillia) the order and family information does not match between these two checklists. Here I made a judgement call and after a quick scan through Wikipedia I decided that the IOC checklist had the most updated taxonomic placement. For example, the family Trochilidae are currently placed in the order Apodiformes rather than Caprimulgiformes, therefore, the first instance of this genus is the correct one. As full_join
places rows from the data frame specified first (iocgen
in this case) before those from the other data frame, we can use distinct
to keep the first instance of each genus:
allgenera <- allgenera %>% distinct(genus)
allgenera %>% arrange(genus)
## Source: local data frame [2,406 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Apodiformes Trochilidae Abeillia
## 2 Passeriformes Cettiidae Abroscopus
## 3 Galliformes Cracidae Aburria
## 4 Passeriformes Meliphagidae Acanthagenys
## 5 Passeriformes Thraupidae Acanthidops
## 6 Passeriformes Fringillidae Acanthis
## 7 Passeriformes Acanthisittidae Acanthisitta
## 8 Passeriformes Acanthizidae Acanthiza
## 9 Passeriformes Meliphagidae Acanthorhynchus
## 10 Passeriformes Acanthizidae Acanthornis
## .. ... ... ...
So, we kept the first instance of Abeillia, and so on. Now we a have a comprehensive list of genera, with their respective family and order, from two separate checklists. We can now match it to the eBird checklist:
ebirdorders <- left_join(ebirdgen, allgenera, by = "genus") %>%
select(order, family, genus)
ebirdorders
## Source: local data frame [2,226 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Struthioniformes Struthionidae Struthio
## 2 Rheiformes Rheidae Rhea
## 3 Tinamiformes Tinamidae Nothocercus
## 4 Tinamiformes Tinamidae Tinamus
## 5 Tinamiformes Tinamidae Crypturellus
## 6 Tinamiformes Tinamidae Rhynchotus
## 7 Tinamiformes Tinamidae Nothoprocta
## 8 Tinamiformes Tinamidae Nothura
## 9 Tinamiformes Tinamidae Taoniscus
## 10 Tinamiformes Tinamidae Eudromia
## .. ... ... ...
We have merged all genera, with their order and family data, from the BirdLife and IOC checklists and now successfully merged it with the eBird checklist. Even though the BirdLife and IOC checklists are up-to-date, it is likely that some genera included in eBird aren’t included in either of the previous lists.
ebirdmissing <- ebirdorders %>% filter(is.na(order))
ebirdmissing
## Source: local data frame [39 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 NA NA Oressochen
## 2 NA NA Ichthyophaga
## 3 NA NA Crecopsis
## 4 NA NA Mustelirallus
## 5 NA NA Rhamphomantis
## 6 NA NA Damophila
## 7 NA NA Calorhamphus
## 8 NA NA Guarouba
## 9 NA NA Euchrepomis
## 10 NA NA Cercomacroides
## .. ... ... ...
Filling in the gaps
There are still 39 genera for which we do not have order or family info. Given that the eBird checklist is ordered taxonomically, one way around this is by checking the order and family values around each NA
value: if they are the same, then we can use this value to fill in the blanks:
NAs <- which(ebirdorders$genus %in% ebirdmissing$genus)
NAs
## [1] 30 304 342 351 526 721 806 946 966 997 1234 1241 1248
## [14] 1303 1317 1403 1405 1406 1407 1408 1427 1428 1540 1733 1737 1741
## [27] 1745 1754 1758 1853 1878 1909 1976 2162 2163 2186 2202 2209 2221
print(ebirdorders[(NAs[1]-1):(NAs[1]+1),])
## Source: local data frame [3 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Anseriformes Anatidae Pteronetta
## 2 NA NA Oressochen
## 3 Anseriformes Anatidae Chloephaga
There are some cases where there are more than one NA
in a row, and even sometimes these gaps are in between different families:
print(ebirdorders[2161:2164,])
## Source: local data frame [4 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Passeriformes Fringillidae Linurgus
## 2 NA NA Pseudochloroptila
## 3 NA NA Alario
## 4 Passeriformes Fringillidae Serinus
print(ebirdorders[1402:1409,])
## Source: local data frame [8 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Passeriformes Campephagidae Campochaera
## 2 NA NA Malindangia
## 3 Passeriformes Campephagidae Lalage
## 4 NA NA Celebesia
## 5 NA NA Cyanograucalus
## 6 NA NA Analisoma
## 7 NA NA Edolisoma
## 8 Passeriformes Neosittidae Daphoenositta
We can make an ugly for
loop to deal with this situation, in which it checks the adjacent non-NA
values, and if they are identical, uses them to replace the NA
s in between:
for (i in 1:nrow(ebirdorders)) {
if (is.na(ebirdorders$family[i])) {
nextval <- which(!is.na(ebirdorders$family[i+1:(i+9)]))[1]
nextval2 <-which(!is.na(ebirdorders$order[i+1:(i+9)]))[1]
if (identical(ebirdorders$family[i-1], ebirdorders$family[i+nextval])) {
ebirdorders$family[i] <- ebirdorders$family[i-1]
}
if (identical(ebirdorders$order[i-1], ebirdorders$order[i+nextval2])) {
ebirdorders$order[i] <- ebirdorders$order[i-1]
}
}
}
ebirdorders %>% filter(is.na(order))
## Source: local data frame [0 x 3]
##
## Variables not shown: order (chr), family (chr), genus (chr)
ebirdorders %>% filter(is.na(family))
## Source: local data frame [8 x 3]
##
## order family genus
## (chr) (chr) (chr)
## 1 Piciformes NA Calorhamphus
## 2 Passeriformes NA Euchrepomis
## 3 Passeriformes NA Ceratopipra
## 4 Passeriformes NA Celebesia
## 5 Passeriformes NA Cyanograucalus
## 6 Passeriformes NA Analisoma
## 7 Passeriformes NA Edolisoma
## 8 Passeriformes NA Grammatoptila
There are now only 8 genera with missing family values, but none that are missing their respective order as we managed to fill in all of those. The remaining gaps are from NA
s between different families, hence we can’t infer the family based on their position in the checklist. We’ll have to add these by hand:
for (i in 1:nrow(ebirdorders)) {
if (is.na(ebirdorders$family[i])) {
if (ebirdorders$genus[i] %in% c("Calorhamphus")) {
ebirdorders$family[i] <- "Megalamidae"
}
if (ebirdorders$genus[i] %in% c("Ceratopipra")) {
ebirdorders$family[i] <- "Pipridae"
}
if (ebirdorders$genus[i] %in% c("Celebesia","Cyanograucalus",
"Analisoma", "Edolisoma")) {
ebirdorders$family[i] <- "Campephagidae"
}
if (ebirdorders$genus[i] %in% c("Grammatoptila")) {
ebirdorders$family[i] <- "Leiothrichidae"
}
if (ebirdorders$genus[i] %in% c("Euchrepomis")) {
ebirdorders$family[i] <- "Thamnophilidae"
}
}
}
anyNA(ebirdorders$family)
## [1] FALSE
Voilà! No more missing values! We can join this list of genera back to the full taxonomic checklist with all species, or to your own checklist of sightings downloaded from eBird’s website:
ebirdfull <- left_join(ebirdtax, ebirdorders, by = "genus")
anyNA(ebirdfull$family)
## [1] FALSE
So, there it is! A slightly complicated way to add order and family information to the eBird taxonomic checklist. If there’s and eBird developers reading this, please feel free to add it to the eBird-1.1-SpeciesReference results fields! ;)