4. Extending coconatfly with external data sources
Source: vignettes/extending-coconatfly.Rmd
This vignette explains how and why developers and end users might want to extend coconatfly to support an
tldr
Register a new external dataset (or override an existing one) by doing:
coconat::register_dataset(
  name = 'fcns', shortname = 'fc', namespace = 'coconatfly',
  metafun = function(ids, ...) {
    # first column must be the within-dataset id
    # additional metadata can include columns like type and side
    data.frame(id=1234, type="DNp25", side="L", class='descending_neuron')
  },
  partnerfun = function(ids, partners=c("inputs", "outputs"), threshold = 1, ...) {
    # data frame should contain 3 columns
    data.frame(query=1234, partner=4567, weight=10)
  }
)

Read on for more details, including additional functions that you can provide.
introduction
coconatfly has built-in support for a number of datasets
corresponding to named arguments of the cf_ids()
function.
library(coconatfly)
cf_ids(hemibrain='LAL008', flywire='LAL008', expand = TRUE)
#> flywire [2 ids]: 720575940639138382 720575940643829704
#> hemibrain [2 ids]: 1170352367 1605181883

While it is likely that additional datasets will be added over time, there are reasons why adding more and more built-in datasets could cause trouble in the long term. These include:
- introducing additional dependencies that make coconatfly package installation increasingly fragile
- slower development by relying on lead developer(s) for integration.
Conversely, giving users a mechanism to add external datasets brings benefits including:
- rapid support for new datasets even when they are evolving
- support for private/pre-release data
- support for new kinds of metadata source
- ability for end users to modify behaviour of existing built-in datasets e.g. fix a few cell types for a public dataset.
To achieve this, you must tell coconatfly about your new dataset: what it is called and which functions are available to fetch the relevant data.
registering a dataset
The process of telling coconatfly about a new dataset is called
registration and relies on a function in the base coconat
package called coconat::register_dataset().
You need to supply both information about the dataset and functions that interact with it. Some of this is essential; other parts are optional. Starting with the information, something like this would be standard:
coconat::register_dataset(
  # crucial arguments
  name = 'fcns', shortname = 'fc', namespace = 'coconatfly',
  # optional information
  species = 'Drosophila melanogaster', sex = 'F', age = 'adult 7d',
  description = 'Complete female CNS dataset',
  # [functions omitted]
)

The name argument gives the dataset name that will be
used in the cf_ids() function. Note that
cf_ids will not have an argument named fcns
but after dataset registration it will correctly process specifications
like:
cf_ids(fcns='/DNa02', expand = TRUE)

which will give

fcns [2 ids]: 1234 5678
Besides the name argument the only other completely essential
argument is namespace = 'coconatfly'. This is what ensures
that the coconatfly package will know about this
dataset.
All the optional information is nice to have but not really used at the moment.
dataset functions
It’s great to have told coconatfly the name of your dataset, but that
doesn’t provide any real functionality. In addition you’ll want to
supply functions that do metadata or connectivity queries. There are
three named arguments in coconat::register_dataset() each
of which expects to be passed a function:
- idfun: process queries or ids
- metafun: return metadata for given ids or query
- partnerfun: return connectivity information
idfun
idfun is the most basic: it turns ids or queries into a
character vector of ids.
Typical queries might look like: ids='/type:DNa02' to
find descending neurons with cell type DNa02.
myidfun <- function(ids, ...) {
  # return value should be character vector (integer64 is also accepted)
}

Although idfun is the most basic function, specifying one is not
essential, since by default the supplied metadata function is used to
handle queries.
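As a rough illustration of what an idfun might do, the sketch below resolves queries against a small in-memory table. Everything here is an assumption made for the example: `toy_meta` is a hypothetical stand-in for a real data source, and the query handling (a leading `/` marks a regular expression, an optional `field:` prefix selects the column, defaulting to `type`) is only one plausible reading of the query strings shown above, not necessarily how coconat parses them.

```r
# Hypothetical in-memory metadata table standing in for a real data source
toy_meta <- data.frame(
  id = c("1234", "5678", "9012"),
  type = c("DNa02", "DNa02", "DNp25"),
  side = c("L", "R", "L"),
  stringsAsFactors = FALSE
)

myidfun <- function(ids, ...) {
  ids <- as.character(ids)
  # Assumption: a single string starting with "/" is a regex query,
  # optionally written as "/field:pattern"; the field defaults to "type"
  if (length(ids) == 1 && startsWith(ids, "/")) {
    query <- substring(ids, 2)
    field <- "type"
    m <- regmatches(query, regexec("^([A-Za-z_]+):(.*)$", query))[[1]]
    if (length(m) == 3) {
      field <- m[2]
      query <- m[3]
    }
    if (!field %in% names(toy_meta)) stop("Unknown field: ", field)
    return(toy_meta$id[grepl(query, toy_meta[[field]])])
  }
  # Otherwise treat the input as literal ids and return them as character
  ids
}

myidfun("/type:DNa02")
#> [1] "1234" "5678"
myidfun(c(1234, 5678))
#> [1] "1234" "5678"
```

Note that converting very large numeric ids with `as.character()` can lose precision, which is one reason to keep ids as character (or integer64) from the start.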
metafun
The function to return metadata is a little more complicated. The
input ids specification will be the same as for the
previous function. But this time we return a data frame.
mymetafun <- function(ids, ...) {
  data.frame(
    id=as.character(c(123, 456)), side=c("L", "R"),
    class='descending_neuron', subclass='DNa', type='DNp42'
  )
}

The return value should be a data.frame whose first
column is the within-dataset id; although ids are normally numeric,
we recommend encoding them as character. Additional columns can include
side, class, subclass,
subsubclass, type, lineage,
instance. Note that side should be encoded as
L, R, M (for midline) or
NA for unknown. Additional columns can be returned but may
be dropped by downstream functions. Although a character vector is the
recommended id type, you can also use the integer64 type provided by
the bit64 package for some speed/size savings, but these are likely to
be minor.
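The conventions above (id first and as character, side encoded as L, R, M or NA) can be sketched as a metafun that reshapes a raw table into the expected form. The table `raw_annotations`, its column names and the `side_map` lookup are all hypothetical, chosen only for illustration. For brevity this sketch handles literal ids only, whereas a real metafun would also need to handle query strings.

```r
# Hypothetical raw annotation table, as it might arrive from an external source
raw_annotations <- data.frame(
  root_id = c(123, 456, 789),
  hemisphere = c("left", "right", "unknown"),
  cell_type = c("DNp42", "DNp42", "DNa02"),
  stringsAsFactors = FALSE
)

mymetafun <- function(ids, ...) {
  # Keep only the requested ids (literal ids only in this sketch)
  keep <- as.character(raw_annotations$root_id) %in% as.character(ids)
  df <- raw_annotations[keep, , drop = FALSE]
  # Map source-specific labels onto L / R / M; anything else becomes NA
  side_map <- c(left = "L", right = "R", center = "M")
  data.frame(
    id = as.character(df$root_id),  # within-dataset id first, as character
    side = unname(side_map[df$hemisphere]),
    type = df$cell_type,
    class = rep("descending_neuron", nrow(df)),
    stringsAsFactors = FALSE
  )
}

mymetafun(c(123, 789))
```

Rows come back in the order of the underlying table rather than the order of the requested ids; whether that matters will depend on how downstream functions join the results.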