
Building an exoplanet detection model using TensorFlow's prebuilt estimator for gradient boosting trees

In this post we will look at the Kepler labelled time series dataset from Kaggle and use it to build an exoplanet detection model with TensorFlow’s prebuilt estimator for gradient boosted trees, the BoostedTreesClassifier.

Detecting exoplanets in outer space

For the project explained in this post, we use the Kepler labelled time series data from Kaggle. This dataset is derived mainly from the Campaign 3 observations of NASA’s Kepler space telescope mission.

In the dataset, column 1 holds the labels and columns 2 to 3198 hold the flux values over time. The training set has 5087 data points: 37 confirmed exoplanets and 5050 non-exoplanet stars. The test set has 570 data points: 5 confirmed exoplanets and 565 non-exoplanet stars.

We will carry out the following steps to download and then preprocess the data to create the train and test datasets:

  1. Download the dataset using the Kaggle API with the following command:

    ~/datasets/kaggle-kepler$ kaggle datasets download -d keplersmachines/kepler-labelled-time-series-data
    Downloading kepler-labelled-time-series-data.zip to /mnt/disk1tb/datasets/kaggle-kepler
    100%|██████████████████████████████████████| 57.4M/57.4M [00:03<00:00, 18.3MB/s]

    The folder contains the following two files:

    exoTest.csv

    exoTrain.csv

  2. Link the datasets folder to our home folder so that the data can be accessed from the ~/datasets/kaggle-kepler path (a minimal sketch for creating such a link appears after this list). We then define the folder path and list its contents from the Notebook to confirm that we have access to the data files:

    import os

    dsroot = os.path.join(os.path.expanduser('~'),'datasets','kaggle-kepler')

    os.listdir(dsroot)

    We get the following output:

    ['exoTest.csv', 'kepler-labelled-time-series-data.zip', 'exoTrain.csv']

    Note

    The ZIP file is just a leftover of the download process because the Kaggle API begins by downloading the ZIP file and then proceeds to unzip the contents in the same folder.

  3. We will then read the two .csv data files into pandas DataFrames named train and test respectively:

    import pandas as pd

    train = pd.read_csv(os.path.join(dsroot,'exoTrain.csv'))

    test = pd.read_csv(os.path.join(dsroot,'exoTest.csv'))

    print('Training data\n', train.head())

    print('Test data\n', test.head())

    The first five lines of the training and test data look similar to the following:

    Training data

    LABEL FLUX.1 FLUX.2 FLUX.3
    0 2 93.85 83.81 20.10
    1 2 -38.88 -33.83 -58.54
    2 2 532.64 535.92 513.73
    3 2 326.52 347.39 302.35
    4 2 -1107.21 -1112.59 -1118.95
    FLUX.4 FLUX.5 FLUX.6 FLUX.7
    0 -26.98 -39.56 -124.71 -135.18
    1 -40.09 -79.31 -72.81 -86.55
    2 496.92 456.45 466.00 464.50
    3 298.13 317.74 312.70 322.33
    4 -1095.10 -1057.55 -1034.48 -998.34

    FLUX.8 FLUX.9 … FLUX.3188
    0 -96.27 -79.89 … -78.07
    1 -85.33 -83.97 … -3.28
    2 486.39 436.56 … -71.69
    3 311.31 312.42 … 5.71
    4 -1022.71 -989.57 … -594.37

    FLUX.3189 FLUX.3190 FLUX.3191
    0 -102.15 -102.15 25.13
    1 -32.21 -32.21 -24.89
    2 13.31 13.31 -29.89
    3 -3.73 -3.73 30.05
    4 -401.66 -401.66 -357.24

    FLUX.3192 FLUX.3193 FLUX.3194
    0 48.57 92.54 39.32
    1 -4.86 0.76 -11.70
    2 -20.88 5.06 -11.80
    3 20.03 -12.67 -8.77
    4 -443.76 -438.54 -399.71

    FLUX.3195 FLUX.3196 FLUX.3197
    0 61.42 5.08 -39.54
    1 6.46 16.00 19.93
    2 -28.91 -70.02 -96.67
    3 -17.31 -17.35 13.98
    4 -384.65 -411.79 -510.54

    [5 rows x 3198 columns]

    Test data

    LABEL FLUX.1 FLUX.2 FLUX.3
    0 2 119.88 100.21 86.46
    1 2 5736.59 5699.98 5717.16
    2 2 844.48 817.49 770.07
    3 2 -826.00 -827.31 -846.12
    4 2 -39.57 -15.88 -9.16

    FLUX.4 FLUX.5 FLUX.6 FLUX.7
    0 48.68 46.12 39.39 18.57
    1 5692.73 5663.83 5631.16 5626.39
    2 675.01 605.52 499.45 440.77
    3 -836.03 -745.50 -784.69 -791.22
    4 -6.37 -16.13 -24.05 -0.90

    FLUX.8 FLUX.9 … FLUX.3188
    0 6.98 6.63 … 14.52
    1 5569.47 5550.44 … -581.91
    2 362.95 207.27 … 17.82
    3 -746.50 -709.53 … 122.34
    4 -45.20 -5.04 … -37.87
    FLUX.3189 FLUX.3190 FLUX.3191
    0 19.29 14.44 -1.62
    1 -984.09 -1230.89 -1600.45
    2 -51.66 -48.29 -59.99
    3 93.03 93.03 68.81
    4 -61.85 -27.15 -21.18

    FLUX.3192 FLUX.3193 FLUX.3194
    0 13.33 45.50 31.93
    1 -1824.53 -2061.17 -2265.98
    2 -82.10 -174.54 -95.23
    3 9.81 20.75 20.25
    4 -33.76 -85.34 -81.46

    FLUX.3195 FLUX.3196 FLUX.3197
    0 35.78 269.43 57.72
    1 -2366.19 -2294.86 -2034.72
    2 -162.68 -36.79 30.63
    3 -120.81 -257.56 -215.41
    4 -61.98 -69.34 -17.84

    [5 rows x 3198 columns]
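
Earlier, in step 2, we linked the downloaded datasets folder into the home directory so that the data is reachable under ~/datasets/kaggle-kepler. If you need to create that link yourself, a minimal Python sketch could look like the following (the source path is only an example taken from the download output above; adjust it to wherever the Kaggle CLI placed the files):

import os

src = '/mnt/disk1tb/datasets'  # example download location from the Kaggle CLI output
dst = os.path.join(os.path.expanduser('~'), 'datasets')
if not os.path.exists(dst):
    os.symlink(src, dst)  # makes the data reachable as ~/datasets/kaggle-kepler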

The training and test datasets have labels in the first column and 3197 features in the next columns. Now let us split the training and test data into labels and features with the following code:

x_train = train.drop('LABEL', axis=1)

y_train = train.LABEL - 1  # subtract 1 because the TFBT estimator expects labels starting at 0

x_test = test.drop('LABEL', axis=1)

y_test = test.LABEL - 1

In the preceding code, we subtract 1 from the labels because the TFBT estimator expects labels that start at zero, while the labels in the dataset are the numbers 1 and 2.
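
Before building the model, it is worth confirming the class balance of the shifted labels. A quick sanity check (assuming, as in the Kaggle dataset description, that the original label 2 marks a confirmed exoplanet and 1 marks a non-exoplanet star) would be:

print('Training label counts:\n', y_train.value_counts())
print('Test label counts:\n', y_test.value_counts())

Given the counts quoted earlier, this should report roughly 5050 zeros and 37 ones for the training set, and 565 zeros and 5 ones for the test set.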

Now that we have the label and feature vectors for training and test data, let us build the boosted tree models.

In this section, we shall build the gradient boosted trees model for detecting exoplanets using the Kepler dataset. Let us follow these steps in the Jupyter Notebook to build and train the exoplanet finder model:
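
The steps below assume a TensorFlow 1.x environment, since both tf.estimator.BoostedTreesClassifier and tf.estimator.inputs.pandas_input_fn belong to the 1.x estimator API. If you are starting from a fresh Notebook, a minimal import cell would look like this (the version printout is only a quick sanity check):

import tensorflow as tf  # a TensorFlow 1.x release that includes the BoostedTrees estimators
print(tf.__version__)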

  1. We will save the names of all the feature columns in a list with the following code:

    numeric_column_headers = x_train.columns.values.tolist()

  2. We will then bucketize each feature column into two buckets around its mean, since the TFBT estimator only takes bucketized features, using the following code:

    bc_fn = tf.feature_column.bucketized_column
    nc_fn = tf.feature_column.numeric_column
    bucketized_features = [bc_fn(source_column=nc_fn(key=column),
                                 boundaries=[x_train[column].mean()])
                           for column in numeric_column_headers]

  3. Since we only have numeric bucketized features and no other kinds of features, we store them in the all_features variable with the following code:

    all_features = bucketized_features

  4. We will then define the batch size and create an input function that provides batches of features and labels from the training data. To create this function, we use the convenience function tf.estimator.inputs.pandas_input_fn() provided by TensorFlow:

    batch_size = 32
    pi_fn = tf.estimator.inputs.pandas_input_fn
    train_input_fn = pi_fn(x=x_train,
                           y=y_train,
                           batch_size=batch_size,
                           shuffle=True,
                           num_epochs=None)

  5. Similarly, we will create another input function, named eval_input_fn, that will be used to evaluate the model on the test features and labels:

    eval_input_fn = pi_fn(x=x_test,
                          y=y_test,
                          batch_size=batch_size,
                          shuffle=False,
                          num_epochs=1)

  6. We will set the number of trees to be created to 100 and the number of training steps to 100. We also define the BoostedTreesClassifier as the estimator using the following code:

    n_trees = 100
    n_steps = 100
    m_fn = tf.estimator.BoostedTreesClassifier
    model = m_fn(feature_columns=all_features,
                 n_trees=n_trees,
                 n_batches_per_layer=batch_size,
                 model_dir='./tfbtmodel')

    Note

    Since we are doing classification, we use the BoostedTreesClassifier; for regression problems, where a continuous value needs to be predicted, TensorFlow also provides an estimator named BoostedTreesRegressor.

    One of the parameters passed to the estimator is model_dir, which defines where the trained model is stored. Estimators look for an existing model in that folder on subsequent invocations and reuse it for further training, inference, and prediction. Here we name the folder tfbtmodel to save the model.

    Note

    We have used only the minimal set of parameters to define the BoostedTreesClassifier. Please look up the definition of this estimator in the TensorFlow API documentation to find the various other parameters that can be provided to further customize it.

    The following output in the Jupyter Notebook describes the classifier estimator and its various settings:

    INFO:tensorflow:Using default config.
    INFO:tensorflow:Using config: {'_model_dir': './tfbtmodel', '_tf_random_seed': None, '_save_summary_steps': 100,
    '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5,
    '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn':
    None, '_service': None, '_cluster_spec': , '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '',
    '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

  7. Next, we will train the model for 100 steps using the train_input_fn function that provides the exoplanet input data:

    model.train(input_fn=train_input_fn, steps=n_steps)

    The Jupyter Notebook shows the following output to indicate the training in progress:

    INFO:tensorflow:Calling model_fn.
    INFO:tensorflow:Done calling model_fn.
    INFO:tensorflow:Create CheckpointSaverHook.
    WARNING:tensorflow:Issue encountered when serializing resources.
    Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
    '_Resource' object has no attribute 'name'
    INFO:tensorflow:Graph was finalized.
    INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19201
    INFO:tensorflow:Running local_init_op.
    INFO:tensorflow:Done running local_init_op.
    WARNING:tensorflow:Issue encountered when serializing resources.
    Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
    '_Resource' object has no attribute 'name'
    INFO:tensorflow:Saving checkpoints for 19201 into ./tfbtmodel/model.ckpt.
    WARNING:tensorflow:Issue encountered when serializing resources.
    Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
    '_Resource' object has no attribute 'name'
    INFO:tensorflow:loss = 1.0475121e-05, step = 19201
    INFO:tensorflow:Saving checkpoints for 19202 into ./tfbtmodel/model.ckpt.
    WARNING:tensorflow:Issue encountered when serializing resources.
    Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
    '_Resource' object has no attribute 'name'
    INFO:tensorflow:Loss for final step: 1.0475121e-05.

  8. Use the eval_input_fn that provides batches from the test dataset to evaluate the model with the following code:

    results = model.evaluate(input_fn=eval_input_fn)

    The Jupyter Notebook shows the following output as the progress of the evaluation:

    INFO:tensorflow:Calling model_fn.
    WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
    WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
    INFO:tensorflow:Done calling model_fn.
    INFO:tensorflow:Starting evaluation at 2018-09-07-04:23:31
    INFO:tensorflow:Graph was finalized.
    INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203
    INFO:tensorflow:Running local_init_op.
    INFO:tensorflow:Done running local_init_op.
    INFO:tensorflow:Finished evaluation at 2018-09-07-04:23:50
    INFO:tensorflow:Saving dict for global step 19203: accuracy = 0.99122804, accuracy_baseline = 0.99122804, auc =
    0.49911517, auc_precision_recall = 0.004386465, average_loss = 0.09851996, global_step = 19203, label/mean = 0.00877193, loss = 0.09749381, precision = 0.0, prediction/mean = 4.402521e-05, recall = 0.0
    WARNING:tensorflow:Issue encountered when serializing resources.
    Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
    '_Resource' object has no attribute 'name'
    INFO:tensorflow:Saving 'checkpoint_path' summary for global step 19203: ./tfbtmodel/model.ckpt-19203

    Note that during the evaluation the estimator loads the parameters saved in the checkpoint file: 

    INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203

  9. The results of the evaluation are stored in the results dictionary. Let us print each item in it using a for loop:

    for key, value in sorted(results.items()):
        print('{}: {}'.format(key, value))

    The Notebook shows the following results:

    accuracy: 0.9912280440330505
    accuracy_baseline: 0.9912280440330505
    auc: 0.4991151690483093
    auc_precision_recall: 0.004386465065181255
    average_loss: 0.0985199585556984
    global_step: 19203
    label/mean: 0.008771929889917374
    loss: 0.09749381244182587
    precision: 0.0
    prediction/mean: 4.4025211536791176e-05
    recall: 0.0

    We achieve an accuracy of almost 99% with this very first model. Note, however, that accuracy matches accuracy_baseline and both precision and recall are 0.0: with only 37 exoplanets among 5087 training stars, a model that always predicts the majority class already scores about 99%, so accuracy alone is not a meaningful metric for this dataset (see the sketch after this list for a closer look at the predictions). The estimators come prewritten with several optimizations, so we did not need to set the various hyperparameter values ourselves. For some datasets, the default hyperparameter values work out of the box, but for others you will have to experiment with the inputs to the estimators.
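
To see what the model actually predicts on the test set, we can pull the per-example class predictions out of the estimator and compare them with the true labels. The following is a minimal sketch; it reuses the model, eval_input_fn, and y_test objects defined above and relies on the class_ids field that the canned classifier estimators expose in their prediction dictionaries:

predictions = list(model.predict(input_fn=eval_input_fn))
predicted_classes = [int(p['class_ids'][0]) for p in predictions]

print('Actual exoplanets in the test set:   ', int(sum(y_test)))
print('Predicted exoplanets in the test set:', sum(predicted_classes))

Given the precision and recall of 0.0 reported above, we would expect the predicted count to be 0: with default settings on such a heavily imbalanced dataset, the model simply learns to predict the majority class.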

In this post we learned about the Kepler labelled time series dataset from Kaggle and used it to build an exoplanet detection model with TensorFlow’s prebuilt estimator for gradient boosted trees, the BoostedTreesClassifier.

Original Post can be viewed Here