The following problems appeared as assignments in the Coursera course Data-Driven Astronomy.
One of the most widely used formats for astronomical images is the Flexible Image Transport System. In a FITS file, the image is stored in a numerical array. The FITS files shown below are some of the publicly available ones downloaded from the following sites:
In this assignment, we shall focus on calculating the mean of a stack of FITS files. Each individual file may or may not contain a detected pulsar, but in the final stack we should be able to see a clear detection.
The following figure shows 5 noisy FITS files, which will be used to compute the mean FITS file.
The following figure shows the mean FITS file computed from the above FITS files. Since the mean is an algebraic measure, a running mean can be computed by loading one file into memory at a time.
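The running mean can be sketched as follows. For brevity the snippet uses synthetic numpy arrays in place of real pixel data; for actual FITS files each image would come from `astropy.io.fits.open(filename)[0].data` instead.

```python
import numpy as np

def running_mean_stack(image_iter):
    """Compute the mean image from an iterable yielding one image at a time.

    Only the running sum and a counter stay in memory, so arbitrarily
    many FITS images can be stacked.
    """
    total, count = None, 0
    for image in image_iter:
        # With real data: image = astropy.io.fits.open(fname)[0].data
        total = image.copy() if total is None else total + image
        count += 1
    return total / count

# Synthetic stand-ins for the pixel arrays of 5 noisy FITS files.
rng = np.random.default_rng(0)
images = [rng.normal(size=(200, 200)) for _ in range(5)]
mean_image = running_mean_stack(iter(images))
```

Because only one image is resident at a time, the memory footprint is independent of the number of files in the stack.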
Computing the Median FITS
Now let’s look at a different statistical measure: the median. It is often a more robust measure of the average trend of a dataset than the mean, since the mean is easily skewed by outliers.
However, the median is a holistic measure: a naive implementation requires all the data to be in memory at once for an exact computation. This is an issue in typical astrophysics calculations, which may use hundreds of thousands of FITS files. Even on a machine with lots of RAM, it’s not going to be possible to find the median of more than a few tens of thousands of images. This isn’t an issue for calculating the mean, since the sum only requires one image to be added at a time.
If there were a way to calculate a “running median” we could save space by only having one image loaded at a time. Unfortunately, there’s no way to do an exact running median, but there are ways to do it approximately.
The binapprox algorithm does just this. The idea behind it is to find the median from the data’s histogram.
First of all, it can be proved that the median always lies within one standard deviation of the mean, as shown below:
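The standard argument is a short chain of inequalities (a sketch; here $m$ denotes the median and $\mu$, $\sigma$ the mean and standard deviation of the data, viewed as a random variable $X$):

```latex
|\mu - m| = |\mathbb{E}[X] - m|
          = |\mathbb{E}[X - m]|
          \le \mathbb{E}\,|X - m|            % Jensen's inequality
          \le \mathbb{E}\,|X - \mu|          % the median minimizes \mathbb{E}|X - c| over c
          \le \sqrt{\mathbb{E}\big[(X - \mu)^2\big]} = \sigma .
```

So the median always lies in the interval $[\mu - \sigma, \mu + \sigma]$, which is why a histogram covering only that range is enough to locate it.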
The algorithm to find the running approximate median is shown below:
As soon as the relevant bin is updated the data point being binned can be removed from memory. So if we’re finding the median of a bunch of FITS files we only need to have one loaded at any time. (The mean and standard deviation can both be calculated from running sums so that still applies to the first step).
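A sketch of binapprox in numpy. Here the whole array is binned in one call for brevity; in the streaming setting, the counts would be updated file by file, with the mean and standard deviation coming from running sums in a first pass.

```python
import numpy as np

def binapprox_median(values, B=1000):
    """Approximate the median with the binapprox algorithm.

    Only B bin counts (plus one counter for values below the range)
    need to stay in memory, so the data can be streamed in chunks.
    """
    mu = np.mean(values)
    sigma = np.std(values)
    if sigma == 0:
        return mu
    # The median is guaranteed to lie in [mu - sigma, mu + sigma].
    minval = mu - sigma
    maxval = mu + sigma
    width = 2 * sigma / B

    # Count values below the histogram range, and bin the rest.
    left_count = np.sum(values < minval)
    in_range = values[(values >= minval) & (values < maxval)]
    counts = np.bincount(((in_range - minval) / width).astype(int), minlength=B)

    # Walk the bins until the cumulative count passes the midpoint N/2.
    N = len(values)
    total = left_count
    for b in range(B):
        total += counts[b]
        if total >= (N + 1) / 2:
            # Return the midpoint of the bin that crossed the threshold.
            return minval + width * (b + 0.5)
    return maxval
```

The returned value is the midpoint of the bin containing the median, so the error is at most half a bin width.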
The downside of using binapprox is that we only get an answer accurate to σ/B when using B bins. Scientific data comes with its own uncertainties though, so as long as you keep B large enough this isn’t necessarily a problem.
The following figure shows the histogram of 1 million randomly generated data points. The binapprox algorithm is then used to compute the running median, and the approximation error is measured for different numbers of bins B.
As can be seen above, the error in the approximation of the running median decreases as the number of bins B increases.
Now we can use the binapprox algorithm to efficiently estimate the median of each pixel from a set of astronomy images in FITS files. The following figure shows 10 noisy FITS files, which will be used to compute the median FITS file.
The following figure shows the median FITS file computed from the above FITS files using the binapprox algorithm.
When investigating astronomical objects, like active galactic nuclei (AGN), astronomers compare data about those objects from different telescopes at different wavelengths.
This requires positional cross-matching to find the closest counterpart within a given radius on the sky.
In this activity you’ll cross-match two catalogues: one from a radio survey, the AT20G Bright Source Sample (BSS) catalogue, and one from an optical survey, the SuperCOSMOS all-sky galaxy catalogue.
The BSS catalogue lists the brightest sources from the AT20G radio survey while the SuperCOSMOS catalogue lists galaxies observed by visible light surveys. If we can find an optical match for our radio source, we are one step closer to working out what kind of object it is, e.g. a galaxy in the local Universe or a distant quasar.
We’ve chosen one small catalogue (BSS has only 320 objects) and one large one (SuperCOSMOS has about 240 million) to demonstrate the issues you can encounter when implementing cross-matching algorithms.
The positions of stars, galaxies and other astronomical objects are usually recorded in either equatorial or Galactic coordinates.
Equatorial coordinates are fixed relative to the celestial sphere, so the positions are independent of when or where the observations took place. They are defined relative to the celestial equator (which is in the same plane as the Earth’s equator) and the ecliptic (the path the sun traces throughout the year).
A point on the celestial sphere is given by two coordinates: right ascension, the angle from the vernal equinox to the point, going east along the celestial equator; and declination, the angle from the celestial equator to the point, going north (with negative values going south).
The vernal equinox is the intersection of the celestial equator and the ecliptic where the ecliptic rises above the celestial equator going further east.
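Catalogues usually quote right ascension in hours-minutes-seconds and declination in degrees-arcminutes-arcseconds, so before any distance computation both are converted to decimal degrees. A minimal sketch (the function names are ours):

```python
def hms2dec(hours, minutes, seconds):
    """Right ascension in HMS -> decimal degrees (24 h = 360 deg, so 1 h = 15 deg)."""
    return 15 * (hours + minutes / 60 + seconds / 3600)

def dms2dec(degrees, arcminutes, arcseconds):
    """Declination in DMS -> decimal degrees; the sign of the degrees field
    is applied to the whole value."""
    sign = -1 if degrees < 0 else 1
    return sign * (abs(degrees) + arcminutes / 60 + arcseconds / 3600)
```

Note that a declination written as "-00° 05′ …" would defeat the sign check above once the degrees field is parsed as a number; real parsers keep the sign character separately.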
To crossmatch two catalogs we need to compute the angular distance between objects on the celestial sphere.
People loosely call this a “distance”, but technically it’s an angular distance: the projected angle between objects as seen from Earth.
Angular distances have the same units as angles (degrees). There are other equations for calculating the angular distance but this one, called the haversine formula, is good at avoiding floating point errors when the two points are close together.
If we have an object on the celestial sphere with right ascension and declination (α1,δ1), then the angular distance to another object with coordinates (α2,δ2) is given below:
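The haversine formula (an equation in the original figure) can be sketched in Python as follows, with all angles in decimal degrees:

```python
import numpy as np

def angular_dist(ra1, dec1, ra2, dec2):
    """Haversine angular distance between (ra1, dec1) and (ra2, dec2),
    inputs and output in decimal degrees:
        d = 2 * arcsin( sqrt( sin^2(|d1-d2|/2)
                            + cos(d1)*cos(d2)*sin^2(|r1-r2|/2) ) )
    """
    r1, d1, r2, d2 = np.radians([ra1, dec1, ra2, dec2])
    a = np.sin(np.abs(d1 - d2) / 2) ** 2
    b = np.cos(d1) * np.cos(d2) * np.sin(np.abs(r1 - r2) / 2) ** 2
    return np.degrees(2 * np.arcsin(np.sqrt(a + b)))
```

Working with half-angles inside the square root is what keeps the formula numerically stable when the two points are very close together.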
Before we can crossmatch our two catalogs we first have to import the raw data. Every astronomy catalog tends to have its own unique format so we’ll need to look at how to do this with each one individually.
We’ll look at the AT20G bright source sample survey first. The raw data we’ll be using is the file table2.dat from this page in the VizieR archives, but we’ll use the filename bss.dat from now on.
Every catalogue in VizieR has a detailed README file that gives you the exact format of each table in the catalogue.
The full catalogue of bright radio sources contains 320 objects. The first few rows look like this:
The catalogue is organised in fixed-width columns, with the format of the columns being:
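Loading can then be sketched with `np.loadtxt`, assuming the six whitespace-separated fields after the object number hold RA as (h, m, s) and dec as (d, m, s); check the VizieR ReadMe for the exact byte layout, and note the file path is a placeholder:

```python
import numpy as np

def import_bss(path):
    """Load the BSS catalogue; returns an (N, 2) array of (ra, dec) in degrees."""
    # usecols=range(1, 7) skips the leading object number and any trailing
    # text columns; only the six numeric coordinate fields are parsed.
    data = np.loadtxt(path, usecols=range(1, 7), ndmin=2)
    ras = 15 * (data[:, 0] + data[:, 1] / 60 + data[:, 2] / 3600)
    # Caveat: a "-00" degrees field would lose its sign once parsed as a float.
    sign = np.where(data[:, 3] < 0, -1.0, 1.0)
    decs = sign * (np.abs(data[:, 3]) + data[:, 4] / 60 + data[:, 5] / 3600)
    return np.column_stack([ras, decs])

# bss_cat = import_bss('bss.dat')
```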
The SuperCOSMOS all-sky catalogue is a catalogue of galaxies generated from several visible light surveys.
The original data is available on this page in a package called SCOS_stepWedges_inWISE.csv.gz. The file is extracted to a csv file named super.csv.
The first few lines of super.csv look like this:
The catalog uses a comma-separated value (CSV) format. Aside from the first row, which contains column labels, the format is:
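Reading this one is simpler, since the coordinates are already in decimal degrees. A sketch, assuming RA and dec sit in the first two columns:

```python
import numpy as np

def import_super(path):
    """Load the SuperCOSMOS catalogue; returns an (N, 2) array of (ra, dec)
    in decimal degrees."""
    # skiprows=1 drops the header row of column labels; only the first
    # two columns (assumed RA and dec) are parsed.
    return np.loadtxt(path, delimiter=',', skiprows=1, usecols=[0, 1], ndmin=2)

# super_cat = import_super('super.csv')
```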
Let’s implement a naive crossmatch function that crossmatches two catalogs within a maximum distance and returns a list of matches and non-matches for the first catalog (BSS) against the second (SuperCOSMOS). The maximum distance is given in decimal degrees (e.g., the nearest object within a distance of 5 degrees will be considered a match).
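A sketch of such a naive crossmatch. The haversine distance is inlined so the snippet is self-contained, and catalogues are plain lists of (ra, dec) pairs in decimal degrees:

```python
import numpy as np

def angular_dist(ra1, dec1, ra2, dec2):
    """Haversine angular distance, inputs/output in decimal degrees."""
    r1, d1, r2, d2 = np.radians([ra1, dec1, ra2, dec2])
    a = np.sin(np.abs(d1 - d2) / 2) ** 2
    b = np.cos(d1) * np.cos(d2) * np.sin(np.abs(r1 - r2) / 2) ** 2
    return np.degrees(2 * np.arcsin(np.sqrt(a + b)))

def naive_crossmatch(cat1, cat2, max_dist):
    """For each object in cat1, scan all of cat2 for the closest object.

    Returns (matches, no_matches): matches is a list of
    (id1, id2, distance) triples, no_matches a list of cat1 indices
    with no counterpart within max_dist degrees.
    """
    matches, no_matches = [], []
    for id1, (ra1, dec1) in enumerate(cat1):
        closest_dist, closest_id2 = np.inf, None
        for id2, (ra2, dec2) in enumerate(cat2):  # O(N) scan per object
            d = angular_dist(ra1, dec1, ra2, dec2)
            if d < closest_dist:
                closest_dist, closest_id2 = d, id2
        if closest_dist > max_dist:
            no_matches.append(id1)
        else:
            matches.append((id1, closest_id2, closest_dist))
    return matches, no_matches
```

With M objects in the first catalogue and N in the second, this does M × N distance calculations, which is what makes the timings below grow linearly in N.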
There are 320 objects in the BSS catalog that are compared with the first n objects from the SuperCOSMOS catalog. The value of n is gradually increased from 500 to 125,000, and the impact on the running time and the number of matched objects is noted.
The figures below show the time taken for the naive cross-matching as the number of objects in the second catalog is increased, and also the final matches produced as a bipartite graph.
Cross-matching is a very common task in astrophysics, so it’s natural that optimized implementations have already been written for it. A popular approach uses objects called k-d trees to perform cross-matching incredibly quickly: a k-d tree is constructed out of the second catalogue, which then lets us search it efficiently for a match to each object in the first catalogue. Constructing a k-d tree is similar to binary search: the k-dimensional space is divided into two parts recursively until each division contains only a single object. Creating a k-d tree from an astronomy catalogue works like this:
This creates a binary tree where each object used to split a partition (a node) links to the two objects that then split the partitions it has created (its children), similar to the one shown in the image below.
Once a k-d tree is constructed out of a catalogue, finding a match to an object works like this:
Since each node branches into two children, a catalogue of N objects will have, on average, about log2(N) nodes on the path from the root to any leaf, as shown in the following figure.
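In practice one rarely builds the tree by hand: astropy's `SkyCoord.match_to_catalog_sky` does a k-d tree crossmatch directly. As a lower-level illustration, here is a sketch using scipy's `cKDTree` on 3-D unit vectors, converting the chord length returned by the tree back into an angular separation:

```python
import numpy as np
from scipy.spatial import cKDTree

def to_cartesian(ra, dec):
    """Unit vectors on the sphere from decimal-degree coordinates, shape (N, 3)."""
    ra, dec = np.radians(ra), np.radians(dec)
    return np.column_stack([np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)])

def kdtree_crossmatch(cat1, cat2, max_dist):
    """cat1, cat2: arrays of shape (N, 2) holding (ra, dec) in degrees.

    Returns (matches, no_matches) in the same format as the naive version.
    """
    cat1, cat2 = np.atleast_2d(cat1), np.atleast_2d(cat2)
    # Build the tree once from the (large) second catalogue...
    tree = cKDTree(to_cartesian(cat2[:, 0], cat2[:, 1]))
    # ...then each nearest-neighbour query costs ~log(N) instead of N.
    dists, ids = tree.query(to_cartesian(cat1[:, 0], cat1[:, 1]))
    # A straight-line chord of length c subtends an angle 2*arcsin(c/2).
    ang = np.degrees(2 * np.arcsin(dists / 2))
    matches = [(i, int(ids[i]), float(ang[i]))
               for i in range(len(cat1)) if ang[i] <= max_dist]
    no_matches = [i for i in range(len(cat1)) if ang[i] > max_dist]
    return matches, no_matches
```

Mapping the coordinates onto unit vectors first means the tree can use ordinary Euclidean distance, while the chord-to-angle conversion recovers the exact angular separation.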
As can be seen above, the k-d tree based implementation is much faster than its naive counterpart for crossmatching. While the naive implementation takes more than 20 minutes to match against 125,000 objects in the second catalog, the k-d tree based implementation takes just about 1 second to produce the same set of results, as shown above.