A dataset is a collection of data used for analysis or for training a machine learning model. Datasets come in different types, depending on the kind of data they contain and how they are organized.
Types of Datasets
While training a machine learning model, we often run into the problems of overfitting and underfitting.
To detect and limit these problems, we divide the dataset into three parts:
- Training Dataset
- Validation Dataset
- Test Dataset
A commonly used split for these three parts is 60:20:20 (training : validation : test), although the exact ratio depends on the size of the dataset.
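The 60:20:20 split can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `split_dataset`, its `ratios` and `seed` parameters, and the toy data are assumptions made for this example, not part of any particular framework.

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the samples and split them into train, validation and test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test set takes whatever remains, so no sample is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test

# Toy data: 100 integers standing in for 100 samples (hypothetical).
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by class), because otherwise the three parts would not be representative of the whole dataset.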
1. Training Dataset
- This dataset is used to train the model, i.e. it is the data used to update the model's weights.
2. Validation Dataset
- This dataset is used to detect and reduce overfitting. It checks whether an increase in accuracy on the training dataset also holds when the model is tested on data that was not used in training.
- If the accuracy on the training dataset increases while the accuracy on the validation dataset decreases, the model has high variance, i.e. it is overfitting.
3. Test Dataset
- When we repeatedly change the model based on its results on the validation set, we unintentionally let the model peek at the validation set, and as a result the model may overfit to the validation set as well.
- To overcome this issue, we keep a test dataset that is used only to evaluate the final model and confirm its accuracy.
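The overfitting check described under the validation dataset (training accuracy going up while validation accuracy goes down) can be sketched as follows. The helper `overfitting_epochs` and the accuracy values are made up for illustration and do not come from a real training run.

```python
def overfitting_epochs(train_acc, val_acc):
    """Return the epochs where training accuracy rose but validation accuracy fell."""
    flagged = []
    for epoch in range(1, len(train_acc)):
        train_up = train_acc[epoch] > train_acc[epoch - 1]
        val_down = val_acc[epoch] < val_acc[epoch - 1]
        if train_up and val_down:
            flagged.append(epoch)
    return flagged

# Made-up accuracy curves, one value per epoch (illustration only).
train_acc = [0.70, 0.80, 0.88, 0.93, 0.97]
val_acc = [0.68, 0.76, 0.79, 0.77, 0.74]

flagged = overfitting_epochs(train_acc, val_acc)
# Epoch with the highest validation accuracy: a reasonable point to stop training.
best_epoch = max(range(len(val_acc)), key=lambda e: val_acc[e])
print(flagged, best_epoch)
```

Note that only the training and validation curves are used to make this decision. The test dataset stays untouched until the model is final, so that its accuracy remains an unbiased estimate.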
Various Sources of Dataset
It is often hard to find a suitable dataset for a machine learning application.
The following is a short list of dataset sources, with descriptions, that can be used for experimentation.
1. Google Dataset Search Engine
Link: https://datasetsearch.research.google.com/
Google has its own search engine for datasets. Its objective is to unify almost all the available dataset repositories and make them discoverable. You can search for a dataset based on the application of your learning model.
2. Microsoft Dataset
Link: https://msropendata.com/
Microsoft offers Microsoft Research Open Data, a data repository that makes datasets created by researchers at Microsoft available to data scientists. It provides a collection of curated datasets.
3. Computer Vision Dataset
Link: https://visualdata.io/
This source provides image datasets. If you plan to work on image processing, deep learning or computer vision, you can use it to find visual datasets for building computer vision models.
4. Kaggle Dataset
Link: https://www.kaggle.com/datasets
It contains a large number of datasets of different shapes and sizes. Most of the available datasets have kernels (notebooks) associated with them, in which many data scientists have shared their analyses of the dataset.
5. Amazon Dataset
Link: https://registry.opendata.aws/
It contains datasets from fields such as public transport and satellite imagery. These datasets are hosted on Amazon Web Services resources such as Amazon S3, which is handy if you plan to use AWS for machine learning experimentation and development.