Data Management - solarfresh/solartf GitHub Wiki

Introduction

Data management is an essential component of the machine learning lifecycle. It spans several aspects: data ingestion, data modeling, search and query, curation, security, and scalability. After designing a data strategy to solve a business problem, people start collecting data to verify their assumptions through experiments. Experimental results must be reproducible, so the data has to be stored persistently. Once material has been gathered from several portals, the information is extracted and distilled, and data models are designed to suit the algorithms or the required search and query performance. Security also needs attention because the data may be private or confidential. Finally, the ability to scale up should be considered, since information is updated continuously and algorithms are adapted and improved under varying circumstances.

Curation

In the machine learning domain, much of the data management effort goes into curation, whose activities include contextualization, de-identification, validation, and visualization.
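
As a rough illustration of two of these activities, the following minimal sketch shows de-identification and validation applied to a single record. The field names (`name`, `email`, `label`, `image_path`) and both helper functions are hypothetical and are not part of solartf. Salted hashing is used here only as a simple stand-in; it pseudonymizes values and may not be sufficient for real privacy requirements.

```python
import hashlib

# Hypothetical field names, chosen only for illustration.
IDENTIFYING_FIELDS = ("name", "email")
REQUIRED_FIELDS = ("label", "image_path")


def de_identify(record, salt):
    """Return a copy of the record with identifying fields replaced by salted SHA-256 digests."""
    cleaned = dict(record)  # copy so the original record is left untouched
    for field in IDENTIFYING_FIELDS:
        if field in cleaned:
            raw = (salt + str(cleaned[field])).encode("utf-8")
            cleaned[field] = hashlib.sha256(raw).hexdigest()
    return cleaned


def validate(record):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):  # missing key or empty value
            problems.append(f"missing field: {field}")
    return problems
```

A record would typically be de-identified before it is stored, and validated before it is used in an experiment, so that identifying values never reach persistent storage and incomplete records are caught early.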