Dataset (or data set) is a collection of various data types stored in a digital format. Dataset is the main component of any Machine Learning system. Dataset provide the system with the data on which it will learn. Therefore, it should consist of a generalized representation of the real data. In this case data quality is very important. The main features of a good dataset are :
Accuracy is a key aspect of machine learning data. Inaccurately labeled data significantly affects the quality and accuracy of the model. Completeness means that the data set has all the data that is required to perform a specific task. The data that will be used for learning cannot be contradictory, that is what reliability refers to. In some machine learning models up to date data is very important. The timeliness and relevance of information is an important data quality characteristic, because in the real world, the data on which the model was learned may not appear anymore, which makes the model useless in a given application.
Preparing a custom dataset for a given application is a difficult task. As can be seen from the information presented above the first step is to identify what data is needed and why. The data must be properly collected and I must represent the problem under consideration. The next step is labeling the data appropriately. Labeling is an important step as improperly labeled data reduces the quality of the data. After the dataset is created, the model needs to be trained and tested.
R. L. Sarfin, „5 Characteristics of Data Quality,” 07 05 2021. [Online]. Available: https://www.precisely.com/blog/data-quality/5-characteristics-of-data-quality. [Date of access: 22 07 2022].