
Deep Learning Datasets for Every Data Scientist

Machine Learning has seen a tremendous rise in the last decade, and one of its sub-fields that has contributed largely to this growth is Deep Learning. The large volumes of data and the huge computational power that modern systems possess have enabled Data Scientists, Machine Learning Engineers, and others to achieve ground-breaking results in Deep Learning and to keep bringing new developments to the field.

In this blog post, we will cover deep learning datasets that you can work with as a Data Scientist, but before that, we will build some intuition about the concept of Deep Learning.

Understanding Deep Learning

Deep Learning is a sub-field of Machine Learning whose working structure, the Artificial Neural Network, is loosely inspired by the brain. It resembles our nervous system, in which each neuron is connected to the others. From image classification to language translation, Deep Learning has penetrated every sphere of the analytics ecosystem across multiple industries.

The foundation of Deep Learning is the Neural Network. To understand neural networks, consider house price prediction, where house size is the input variable and price is the target. We could find the price with Linear Regression, but with Deep Learning the input is fed to a neuron, which generates the output after applying an activation function such as the Rectified Linear Unit (ReLU). ReLU takes a real number as input and outputs that number if it is positive, and zero otherwise.
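
Here is a minimal sketch of that single-neuron idea; the weight, bias, and house sizes below are made-up illustrative values, not learned parameters:

```python
import numpy as np

def relu(x):
    # ReLU: returns x where x > 0, and 0 otherwise
    return np.maximum(0, x)

# Hypothetical parameters for price = relu(w * size + b)
w, b = 150.0, -5000.0
sizes = np.array([100.0, 200.0, 300.0])  # house sizes in square feet
prices = relu(w * sizes + b)             # predicted prices
print(prices)                            # [10000. 25000. 40000.]
```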

The task of mapping an input to a corresponding output after applying some function is known as Supervised Learning. Different types of neural networks serve different purposes: for predicting house prices or ad revenue, a Standard Neural Network is used, whereas for image classification we would use a Convolutional Neural Network. A Recurrent Neural Network is used for Speech Recognition, Machine Translation, and so on.

Despite increases in the amount of data, traditional Machine Learning algorithms like Logistic Regression, Support Vector Machines, and Linear Regression fail to improve by a whole lot. With Deep Learning, however, the performance of the model keeps increasing as the data grows. Data, computation time, and algorithms are the three scales of a Deep Learning process.

Previously, the sigmoid function was widely used, but its gradient is close to zero for large positive or negative inputs, which slows learning and increases computation time. The ReLU activation function has solved this problem: its gradient does not saturate for positive inputs, so the parameters are updated faster and the computation time is reduced. There are certain other concepts in Deep Learning which are important as well, such as forward propagation and backpropagation, whose explanation is beyond the scope of this article.
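
A small sketch of that saturation effect, comparing the two gradients at a few input values (plain NumPy, no framework assumed):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # derivative of sigmoid, at most 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # derivative of ReLU: 1 for x > 0, else 0

xs = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])
print(sigmoid_grad(xs))  # tiny at +/-10: gradients vanish, learning stalls
print(relu_grad(xs))     # stays 1 for every positive input
```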

Now, in Deep Neural Networks, the hidden layers carry a lot of significance. Given an image, the first hidden layer would try to identify its edges, and as you go deeper, more complex features like whole faces are composed. Below is the working process by which deep neural networks identify a face and recognize audio.

Image -> Edges -> Face parts -> Faces -> Desired face

Audio -> Low-level sound features (sss, bb) -> Phonemes -> Words -> Sentences

In a Deep Neural Network, each layer takes the activations of the previous layer as input, and its own activations are its output. We don't need to write a huge amount of code, and more data tends to give better results. Moreover, the right parameter choices are crucial to the efficiency of the model, and the parameters are updated during the backpropagation step. The hyperparameters of a Deep Neural Network are the learning rate, the number of iterations, the number of hidden layers, the number of units in each hidden layer, and the activation function.
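
A minimal sketch, assuming TensorFlow/Keras is installed, of where each of those hyperparameters shows up when defining and training a network; the layer sizes and learning rate here are arbitrary example values:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(1,)),  # units per hidden layer, activation
    tf.keras.layers.Dense(64, activation="relu"),                    # number of hidden layers
    tf.keras.layers.Dense(1),                                        # single output, e.g. a house price
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate
    loss="mse",
)
# model.fit(x_train, y_train, epochs=100)  # number of iterations (epochs)
```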

Datasets for Deep Learning

1. MNIST – One of the most popular deep learning datasets, it consists of handwritten digits, with a training set of sixty thousand examples and a test set of ten thousand examples. The time spent on data pre-processing is minimal, so you can try different deep recognition patterns and learning techniques on real-world data. The size of the dataset is nearly 50 MB.
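
A loading sketch, assuming TensorFlow/Keras is installed (the loader downloads the files on first use):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)  # (60000, 28, 28): 60,000 training images of 28x28 pixels
print(x_test.shape)   # (10000, 28, 28): 10,000 test images
```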

2. MS-COCO – A dataset for segmentation, object detection, and similar tasks. The features of the COCO dataset are object segmentation, recognition in context, stuff segmentation, three hundred thirty thousand images, 1.5 million object instances, eighty object categories, ninety-one stuff categories, five captions per image, and 250,000 people with keypoints. The size of the dataset is 25 GB.

3. ImageNet – An image dataset organized according to the WordNet hierarchy. There are roughly one hundred thousand phrases (synsets) in WordNet, and each phrase is illustrated by about 1,000 images on average. It is a huge dataset, around 150 GB in size.

4. VisualQA – This dataset contains open-ended questions about images, which require both vision and language understanding to answer. Its features are 265,016 COCO and abstract-scene images, three questions per image, ten ground-truth answers per question, three plausible (but likely incorrect) answers per question, and an automatic evaluation metric. The size is 25 GB.

5. CIFAR-10 – An image classification dataset consisting of sixty thousand images across ten classes. The dataset is split into five training batches and one test batch, each with 10,000 images. The size is 170 MB.
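
Those batches ship as Python pickle files in the official download; a minimal sketch of reading one (the file path is illustrative and assumes the archive has been extracted):

```python
import pickle

def unpickle(path):
    # Each CIFAR-10 batch file is a pickled dict with b'data' and b'labels' keys
    with open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")

batch = unpickle("cifar-10-batches-py/data_batch_1")
print(batch[b"data"].shape)   # (10000, 3072): 10,000 images, 32*32*3 values each
print(len(batch[b"labels"]))  # 10000 labels, one per image
```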

6. Fashion-MNIST – There are sixty thousand training images and ten thousand test images in the dataset. It was created as a direct drop-in replacement for the MNIST dataset. The size is 30 MB.
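
Because it mirrors MNIST's format exactly, the Keras loader from the MNIST sketch above works as a one-line swap (again assuming TensorFlow/Keras):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```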

7. Street View House Numbers – A dataset for object detection problems. It is similar to the MNIST dataset, with minimal data pre-processing required, but contains more labeled data, collected from house numbers viewed in Google Street View. The size is 2.5 GB.

8. Sentiment140 – A Natural Language Processing dataset for sentiment analysis. There are six features in the final dataset, with emoticons stripped from the text: the polarity of the tweet, the id of the tweet, the tweet date, the query, the username, and the text of the tweet.
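
A loading sketch with pandas; the file name is the standard Sentiment140 training CSV, and the column names below are conventional ones supplied by hand, since the file ships without a header row:

```python
import pandas as pd

cols = ["polarity", "tweet_id", "date", "query", "user", "text"]
df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    names=cols,          # the CSV has no header row of its own
    encoding="latin-1",  # the distributed file is not UTF-8
)
print(df["polarity"].unique())  # 0 = negative, 4 = positive
```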

9. WordNet – A large database of English synsets: groups of synonyms that each describe a distinct concept. The size is nearly 10 MB.
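
A small sketch of browsing synsets with NLTK, assuming the package is installed and the WordNet corpus has been downloaded:

```python
import nltk
nltk.download("wordnet")  # one-time download of the corpus
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:
    # Each synset is one concept, with its own definition and synonyms
    print(synset.name(), "-", synset.definition())
```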

10. Wikipedia Corpus – It consists of 1.9 billion words from more than four million articles. You can search it by word or phrase.

11. Free Spoken Digit – Inspired by the MNIST dataset, it was created to identify spoken digits in audio samples. It is open, so the more people contribute to it, the more it grows. The dataset currently offers three speakers, fifteen hundred recordings, and English pronunciations. Its size is nearly 10 MB.

12. Free Music Archive – A music analysis dataset with high-quality (HQ) audio, pre-computed features, and track- and user-level metadata. The size is almost 1,000 GB.

13. Ballroom – A dataset of ballroom dance audio files, with excerpts of many dance styles provided in real audio format. It consists of six hundred ninety-eight instances of about thirty seconds each, for a total duration of 20,940 seconds.

14. Million Song – This dataset contains the audio features and metadata for a million music tracks. It is meant as an alternative to creating a large dataset yourself. There are only derived features, no audio, in this dataset. The size is nearly 280 GB.

15. LibriSpeech – It consists of a thousand hours of English speech. The dataset is properly segmented, and acoustic models trained on this data are available as well.

16. VoxCeleb – A speaker identification dataset extracted from YouTube videos, consisting of one hundred thousand utterances by 1,251 celebrities. It has a balanced gender distribution and spans a wide range of professions, accents, and so on. The intriguing task is to identify which celebrity a voice belongs to.

17. Urban Sound Classification – This dataset consists of eight thousand urban sound excerpts from ten classes. The training set is 3 GB and the test set is 2 GB.

18. IMDB Reviews – For any movie junkie, this is an ideal dataset. It is used for binary sentiment classification and, apart from the train and test review examples, contains unlabelled data as well. The size is 80 MB.

19. Twenty Newsgroups – This dataset contains newsgroup posts: 1,000 Usenet articles were taken from each of twenty different newsgroups. Subject lines, signatures, etc., are some of the features. The size of the dataset is nearly 20 MB.

20. Yelp Reviews – Released by Yelp for learning purposes, this dataset consists of user reviews and more than twenty thousand pictures. The JSON file is 2.66 GB, the SQL file is 2.9 GB, and the photos are 7.5 GB, all compressed together.

Conclusion 

Deep Learning and Artificial Intelligence have revolutionized our world and solved numerous real-life problems. Data Scientists have deeply nurtured the unprecedented power of data and will continue to do so to make the world an improved and safer place to live in. New developments come through every now and then, and with the ever-increasing capacity of modern computers, the field of exploration is expanding exponentially as well.

As more professionals try to enter this field, it is necessary that they learn to program first, and Python is an ideal language to start their programming journey with.

If you want to read more about data science, you can read our Data Science Blogs.