Most scientific databases consist of datasets (or sources) which in turn include samples (or files) with an identical structure (or schema). In many cases, samples are associated with rich metadata, describing the process that leads to building them (e.g.: the experimental conditions used during sample generation). Metadata are typically used in scientific computations just for the initial data selection; at most, metadata about query results is recovered after executing the query, and associated with its results by post-processing. In this way, a large body of information that could be relevant for interpreting query results goes unused during query processing.
In this paper, we present ScQL, a new algebraic relational language, whose operations apply to objects consisting of data–metadatapairs, by preserving such one-to-one correspondence throughout the computation. We formally define each operation and we describe an optimization, called meta-first, that may significantly reduce the query processing overhead by anticipating the use of metadata for selectively loading into the execution environment only those input samples that contribute to the result samples.
In ScQL, metadata have the same relevance as data, and contribute to building query results; in this way, the resulting samples are systematically associated with metadata about either the specific input samples involved or about query processing, thereby yielding a new form of metadata provenance. We present many examples of use of ScQL, relative to several application domains, and we demonstrate the effectiveness of the meta-first optimization.
Introduction and motivation
The organizations of scientific databases are very different. In many scientific fields, such as biology and astronomy, big consortia produce large, well-organized data repositories for public use. In other contexts, such as public administrations, data are open but much less organized and much more dispersed. Other big data players, such as Internet companies or mobile phone operators, produce information mostly for internal use, but often support third parties in research studies (e.g., about consumers’ interests) by providing them with services for data retrieval. We abstract a scientific data source as a container of several datasets, that in turn consists of thousands of samples, one for each experimental condition, often stored as files and not within a database; typically, samples are described by metadata, i.e., descriptive information about the content and production process of each sample. In meteorology, typical metadata describe ‘‘the WDM station, the sources of meteorological data, and the period of record for which the data is available’’; then the samples describe millions of records registered at the station. In genomics, typical metadata describe ‘‘the technology used for DNA sequencing, the process of DNA preparation, the genotype and phenotype of the donor’’; then, samples describe millions of genomic regions collected during the experiment. Metadata support the selection of the relevant experimental data by means of user interfaces (e.g. see genomic repositories such as ENCODE (the Encyclopedia of Genomic Elements, ) or TCGA (The Cancer Genome Atlas, ). When a source exposes APIs or WEB interfaces, metadata associated to each sample (such as Twitter’s hashtags or timestamps) support data retrieval.