How to extract the Gaia ancillary data using datalink - Gaia Users
DataLink is an IVOA protocol that specifies a service container able to represent or accommodate a variety of services, including free, self-described protocol services.
DataLink support in the Gaia archive is an independent service. At the moment it provides serialization for time series, but it will be extended in the future to offer additional services, such as spectra serialization. A dedicated tutorial is available here.
See the Datalink specification.
You may use the HTTP protocol to execute requests at https://geadata.esac.esa.int/data-server/.
To access DataLink products via the command line, please see this Section.
Author: Alcione Mora
This tutorial explains how to retrieve DR2 light curves. The data are retrieved using two new services within the Gaia Archive: DataLink and Massive Data. The full reference documentation is available here.
This is an intermediate level tutorial that assumes a basic knowledge of the general interface and workflow. The introductory tutorials White dwarfs exploration and Cluster analysis are recommended in case of difficulties following this exercise.
Light curves in DR2 are located in a different place than regular tables like gaia_source, in preparation for the large amounts of ancillary data to be served from DR3 onwards: spectra, epoch astrometry, astrophysical parameter Markov chain Monte Carlo samples, etc.
Epoch photometry is thus indexed using the DataLink Virtual Observatory protocol and served via a dedicated massive data service. Its basic usage in the GUI is relatively simple. It is also supported by the Gaia Python Astropy module.
First, the Archive is queried for IDs (designation or source_id) of objects with light curves, which in DR2 are variable stars. For example, imagine we are interested in bright G < 17 RR Lyrae stars in a cone of 1 degree around the globular cluster Omega Centauri.
select designation, source_id
from gaiadr2.gaia_source as gaia
join gaiadr2.vari_rrlyrae using (source_id)
where 1 = contains(
    point('', 201.697, -47.47947222),
    circle('', ra, dec, 1)
)
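The same query can also be assembled programmatically, which helps when the cone centre or radius changes. The following is a minimal sketch and not part of the original tutorial: the helper name rr_lyrae_cone_query is hypothetical, and only the table names, columns and coordinates are taken from the query above.

```python
def rr_lyrae_cone_query(ra_deg, dec_deg, radius_deg):
    """Return the ADQL cone query shown above for an arbitrary centre and radius.

    Hypothetical helper, for illustration only.
    """
    return (
        "select designation, source_id "
        "from gaiadr2.gaia_source as gaia "
        "join gaiadr2.vari_rrlyrae using (source_id) "
        "where 1 = contains("
        f"point('', {ra_deg}, {dec_deg}), "
        f"circle('', ra, dec, {radius_deg}))"
    )


# Cone of 1 degree around Omega Centauri, as in the text.
query = rr_lyrae_cone_query(201.697, -47.47947222, 1)
print(query)
```

The resulting string can be pasted into the Advanced (ADQL) tab. With the astroquery package installed and network access available, it could also be submitted with Gaia.launch_job(query) from astroquery.gaia.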
The query returns 38 entries. In the job description, the last icon on the right shows two interlocked links, representing ancillary data indexed by the DataLink protocol.
The following pop-up appears after clicking on it
It shows that 38 light curves have been found associated with the IDs in the previous query, as expected. Clicking on the download icon next to the data count produces a zip file with the light curves. By default, it contains one file per source (INDIVIDUAL). The old COMBINED format, with all sources in a single archive, can still be requested. RAW produces a compact single table, with one object per row, making extensive use of array fields.
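Once downloaded, an INDIVIDUAL zip file can be unpacked with the Python standard library. The sketch below is only illustrative: it builds a tiny stand-in archive in memory, and the member names (source_1.xml, source_2.xml) and their contents are placeholders, not the names the Archive actually uses.

```python
import io
import zipfile


def read_individual_archive(zip_bytes):
    """Return {member name: raw bytes} for every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Stand-in for a downloaded zip file: one placeholder member per source.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("source_1.xml", "<VOTABLE/>")
    archive.writestr("source_2.xml", "<VOTABLE/>")

files = read_individual_archive(buffer.getvalue())
print(sorted(files))
```

For a real download, the bytes would be read from the saved zip file instead of the in-memory buffer.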
The main contents of Gaia DR1 are included in table gaia_source, providing astrometry and photometry for 1.1 billion sources. For DR2, a similar catalogue is provided, having grown in size to 1.7 billion and in complexity to include e.g. colours, radial velocities and astrophysical parameters.
Apart from the catalogues, the releases will progressively include much larger data sets. For example, both DR1 and DR2 contain a collection of light curves comprising about 3000 and 550000 variable stars, respectively.
The contents for DR3 are not fully settled, but they could comprise up to several million ancillary products, including light curves, BP, RP and RVS spectra and astrophysical parameter Markov chain Monte Carlo samples. See the Data Release scenario for up-to-date information.
DR4 contents are still to be defined, but they will include epoch astrometry and photometry for the whole sample, plus many billions of XP spectra (the total amount depends on the quantity and quality of epoch spectroscopy to be released). In addition, there are lots of auxiliary data of uncertain status (e.g. raw data, GBOT ground-based support observations, BAM, ...).
It is unlikely that those additional contents can be stored as plain tables in a monolithic relational database. For example, DR4 could include around two trillion astrometry and photometry epochs and 20 trillion spectrum samples.
Two new services have thus been introduced for DR2: DataLink and Massive Data, with the intent to serve the data and demonstrate feasibility for future releases.
The Virtual Observatory has already identified the need to associate ancillary data to catalogues and defined the DataLink protocol accordingly. The full specification can be found here.
A simplistic way of describing DataLink would be as a web service providing for each object (identified with its designation or source_id) the list of additional resources available outside the main catalogues. For example, following this URL:
provides the following VOTable xml response for input designation Gaia DR2 6680733225618222592. source_id can also be used instead of designations. The system assumes they correspond to the latest data release.
This file is not intended for humans, but for other VO services. It can be opened with Topcat, revealing there is only one additional resource for this source: the light curve.
For DR2, the light curves are the only additional resource that might or might not be available, depending on the source. A valid DataLink response will be produced for every source, whether or not it has associated epoch photometry. That is, the list will always be provided, even if it is empty most of the time.
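Because the list may be empty, client code should not assume that a resource is present. The following sketch parses a DataLink-style VOTable with the standard library and returns one dictionary per row. The sample document is hypothetical and heavily trimmed: the field names ID, access_url and semantics come from the IVOA DataLink standard, while the URL and row values are placeholders. It also assumes the TABLEDATA serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed DataLink response; URL and values are placeholders.
SAMPLE = """<?xml version="1.0"?>
<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE type="results">
    <TABLE>
      <FIELD name="ID" datatype="char" arraysize="*"/>
      <FIELD name="access_url" datatype="char" arraysize="*"/>
      <FIELD name="semantics" datatype="char" arraysize="*"/>
      <DATA><TABLEDATA>
        <TR><TD>Gaia DR2 6680733225618222592</TD><TD>https://example.invalid/lightcurve</TD><TD>#this</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def local(tag):
    """Strip the XML namespace so any VOTable version is accepted."""
    return tag.rsplit("}", 1)[-1]


def datalink_rows(votable_xml):
    """Return one dict per TABLEDATA row, keyed by FIELD name."""
    root = ET.fromstring(votable_xml)
    names = [el.get("name") for el in root.iter() if local(el.tag) == "FIELD"]
    rows = []
    for tr in (el for el in root.iter() if local(el.tag) == "TR"):
        cells = [td.text or "" for td in tr if local(td.tag) == "TD"]
        rows.append(dict(zip(names, cells)))
    return rows


rows = datalink_rows(SAMPLE)
if rows:
    print(rows[0]["access_url"])
else:
    print("no ancillary data for this source")
```

An empty list from datalink_rows corresponds to a source without epoch photometry.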
The light curves are provided using another dedicated service, designed to handle Massive Data requests for DR3+. The light curve for designation Gaia DR2 6680733225618222592 can be retrieved from the following URL, which can be constructed from the associated DataLink response:
The default output is a VOTable with base64-encoded binary data. It can be opened using e.g. Topcat, to reveal the basic structure of a light curve, including the source_id, band, time, flux, error and magnitude (additional details and columns are explained in the DR2 data model):
The format follows the current VO conventions on time series. Note a stable standard was not available before DR2, so it has been adopted on a best effort basis. That is, it could evolve for DR3+.
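Since all bands share one table, a common first step is to split the rows by band. The sketch below assumes the light curve was saved as CSV rather than the default VOTable, and uses a reduced set of columns. The three rows are made up for illustration and are not real measurements.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; real files contain more columns (see the DR2 data model).
SAMPLE_CSV = """source_id,band,time,mag,flux,flux_error
1,G,1700.5,15.2,1500.0,3.0
1,BP,1700.5,15.6,700.0,9.0
1,G,1712.3,15.4,1250.0,3.1
"""


def split_by_band(csv_text):
    """Group (time, mag) pairs by photometric band."""
    curves = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        curves[row["band"]].append((float(row["time"]), float(row["mag"])))
    return dict(curves)


curves = split_by_band(SAMPLE_CSV)
for band, points in curves.items():
    print(band, len(points), "epochs")
```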
The service is described in detail in the GUI Help. A number of options can be specified to tune the basic light curve request. They include filtering by G|BP|RP band, specifying the output format, and including invalid data for debugging purposes. These optional parameters can be added to the URL, although they are much better suited for the Python Astropy module.
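Adding optional parameters by hand is error-prone, so a small helper can build the request URL. This is a sketch under stated assumptions: the base address is the one quoted earlier in this document, but the endpoint name (data) and the parameter names (ID, RETRIEVAL_TYPE, BAND, FORMAT) are assumptions that should be checked against the GUI Help before use.

```python
from urllib.parse import urlencode

# Base address quoted earlier in this document.
BASE = "https://geadata.esac.esa.int/data-server/"


def light_curve_url(designation, band=None, fmt=None, endpoint="data"):
    """Build a Massive Data request URL.

    The endpoint and all parameter names are assumptions, not confirmed API.
    """
    params = {"ID": designation, "RETRIEVAL_TYPE": "EPOCH_PHOTOMETRY"}
    if band is not None:
        if band not in ("G", "BP", "RP"):
            raise ValueError(f"unknown band: {band}")
        params["BAND"] = band
    if fmt is not None:
        params["FORMAT"] = fmt
    # urlencode escapes the spaces in the designation as '+'.
    return f"{BASE}{endpoint}?{urlencode(params)}"


url = light_curve_url("Gaia DR2 6680733225618222592", band="G")
print(url)
```

Validating the band locally gives an immediate error for a mistyped value instead of an unexpected server response.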
Multiple sources can be combined in a given epoch photometry request to the Massive Data service. For this, the different source_id need to be provided consecutively and separated by commas or in ranges separated by hyphens. For example:
retrieves the light curves for three different objects. The output data contain the concatenation of the results for each source. Note that column source_id allows the end user to identify the object to which each data point belongs. This mechanism has been used to generate the bulk download files containing all DR2 epoch photometry.
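The two steps described here, building the multi-source identifier list and splitting the concatenated output, can be sketched as follows. Both helper names are hypothetical, and the rows passed to split_by_source are placeholders rather than real epoch photometry.

```python
from collections import defaultdict


def id_parameter(source_ids=(), ranges=()):
    """Join single IDs with commas and (first, last) ranges with hyphens."""
    parts = [str(s) for s in source_ids]
    parts += [f"{first}-{last}" for first, last in ranges]
    return ",".join(parts)


def split_by_source(rows):
    """Split concatenated epoch rows into one list per source_id."""
    per_source = defaultdict(list)
    for row in rows:
        per_source[row["source_id"]].append(row)
    return dict(per_source)


# Identifier list for three of the sources used in the queries in this tutorial.
ids = id_parameter([4053206924184153472, 2944217403015932032, 6680733225618222592])
print(ids)

# Placeholder rows standing in for a concatenated response.
rows = [
    {"source_id": 1, "band": "G", "time": 1700.5},
    {"source_id": 2, "band": "G", "time": 1701.0},
    {"source_id": 1, "band": "BP", "time": 1700.5},
]
print({sid: len(r) for sid, r in split_by_source(rows).items()})
```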
The new DataLink and Massive Data services are separate from the TAP service, of which the Archive GUI is the front-end. However, some integration has been provided to improve usability and interoperability.
The first level is the DataLink pop-up described earlier, which can be invoked to search for sources listed in jobs or user tables. The second level is the inclusion of the DataLink and default light curve URL requests as synthetic columns in gaia_source. That is, these fields are not part of the data model because they are not DPAC products, but usability aids. Let's see an example TAP query, to be entered in the Advanced (ADQL) tab:
select source_id, datalink_url
from gaiadr2.gaia_source
where source_id in (
    4053206924184153472,
    2944217403015932032,
    6680733225618222592,
    4464190764106439936,
    5836226622499714048
)
If the results are downloaded as e.g. csv or plain VOTable, the file contains the actual URLs pointing to the DataLink and Massive Data services. If they are inspected within the GUI:
we find the actual values for these fields are hidden and replaced by a hyperlink. This serves two purposes: to have a compact representation (these URLs are large) and to allow a richer interaction. If the third datalink_url link is clicked, the following pop-up appears:
It provides a human-readable version of the DataLink VOTable xml response. For this particular source (source_id on top), only one Massive Data resource is available: the light curve. For DR3+, the pop-up will be complemented with additional functionality and data sets.
Clicking on the link within the pop-up will request the light curve from the Massive Data service. For DR2, light curves are only provided for objects identified as variable. For example, the following query:
select source_id, phot_variable_flag, datalink_url
from gaiadr2.gaia_source
where source_id in (
    4053206924184153472,
    2944217403015932032,
    6680733225618222592,
    4464190764106439936,
    5836226622499714048
)
produces the output:
Clicking on the "Open link" hyperlinks in the first two rows will open an empty DataLink pop-up window. For the remaining sources (i.e., those with phot_variable_flag = 'VARIABLE'), the pop-up window will contain a link to download the epoch photometry data.
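When working with the downloaded query results rather than the GUI, the same distinction can be made by filtering on phot_variable_flag before requesting any light curve. In the sketch below the rows are placeholders: the source_id values and flags are invented and do not describe real Gaia sources. The helper name is hypothetical.

```python
# Placeholder rows standing in for the query output above.
rows = [
    {"source_id": 101, "phot_variable_flag": "NOT_AVAILABLE"},
    {"source_id": 102, "phot_variable_flag": "VARIABLE"},
    {"source_id": 103, "phot_variable_flag": "VARIABLE"},
]


def with_light_curves(rows):
    """Return the source_id of every row flagged as VARIABLE."""
    return [r["source_id"] for r in rows if r["phot_variable_flag"] == "VARIABLE"]


variable_ids = with_light_curves(rows)
print(variable_ids)
```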