The Data Science Landscape

Metadata

Author: Dr. Stefan Karenfort
Full Title: The Data Science Landscape
Category: #Type/Highlight/Article
URL: https://medium.com/p/f6f7842c9865

Highlights

Fourth, both overfitting and underfitting of the model should be avoided: underfitting leads to generally poor performance and high prediction error, while overfitting leads to poor generalization and high model complexity. Lastly, the result of the data science project must be communicated in a way that non-technical people can understand. A suitable way to communicate data is to use visualization techniques. In the business context, a good reference for presenting data is the International Business Communication Standards (IBCS).
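The contrast between underfitting and overfitting can be illustrated with a small sketch. It is not taken from the article: the k-nearest-neighbour regressor, the sine-wave ground truth, the noise level and the helper names (`knn_predict`, `mse`, `make_data`) are all assumptions chosen for illustration.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN predictor, fitted on `train`, over `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def make_data(xs, rng):
    # Hypothetical ground truth: a sine wave plus Gaussian noise.
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]


rng = random.Random(0)
train = make_data([i * 0.2 for i in range(30)], rng)
test = make_data([i * 0.2 + 0.1 for i in range(30)], rng)

# k=1 memorises the training set (the overfitting extreme);
# k=len(train) always predicts the global mean (the underfitting extreme);
# k=5 sits in between.
errors = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 5, len(train))}
for k, (train_err, test_err) in errors.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k=1 the training error is exactly zero because every training point is its own nearest neighbour, yet the error on unseen points is not. That gap between training and test error is the usual symptom of overfitting. A model that is too rigid, such as the global mean, shows a high error on both sets.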
Deployment — How do stakeholders access the results?
Evaluation — Which model best meets the business objectives?
Modeling — What modeling techniques should we apply?
Data preparation — How do we organize the data for modeling?
Data understanding — What data do we have / need? Is it clean?
Business understanding — What does the business need?
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. It is a framework to plan, organize and implement a data science project.
Analytics generates insights from data using simple presentation, manipulation, calculation or visualization of data. In the context of data science, it is also sometimes referred to as exploratory data analytics. It often serves to familiarize the analyst with the subject matter and to obtain initial hints for further analysis. To this end, analytics is often used to formulate appropriate questions for a data science project.
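As a rough illustration of the kind of "simple calculation" meant here, the following sketch groups a handful of records and computes summary statistics per group. The sales records and region names are hypothetical and only serve as an example; the article itself does not prescribe any particular tooling.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sales records: (region, revenue).
records = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 95.0),
    ("west", 60.0),
]

# Group revenues by region.
by_region = defaultdict(list)
for region, revenue in records:
    by_region[region].append(revenue)

# One summary row per region: count, mean and median revenue.
summary = {
    region: {"count": len(values), "mean": mean(values), "median": median(values)}
    for region, values in by_region.items()
}

for region, stats in summary.items():
    print(region, stats)
```

A summary like this often raises the first concrete questions for a project, for example why one region has fewer records or a lower average than the others.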
Data science is part of the computer sciences [1]. It comprises the disciplines of i) analytics, ii) statistics and iii) machine learning.
Principles of Success
The Data Science Process
The Data Science Toolkit
The Data Science Landscape
Analytics
First, at the initial stage, it is paramount that the underlying business problem is clear to all stakeholders involved. Second, sufficient time has to be allocated for the data preparation stage, which typically accounts for the majority of time spent on most projects. Third, the right variables have to be selected by the data scientist. A model should ideally comprise the smallest possible number of variables that carry relevant explanatory power. The process of feature selection is therefore important in order to maximize performance while reducing the noise in a model.
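One simple way to approach feature selection is a correlation filter: score each candidate variable by its absolute Pearson correlation with the target and keep only those above a threshold. The sketch below assumes this filter approach, a toy dataset and a threshold of 0.3. None of these come from the article, and other selection methods may suit a given project better.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: "signal" drives the target, "noise" does not.
features = {
    "signal": [0, 1, 2, 3, 4, 5, 6, 7],
    "noise": [1, -1, -1, 1, 1, -1, -1, 1],
}
wobble = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
target = [2 * s + w for s, w in zip(features["signal"], wobble)]

THRESHOLD = 0.3  # assumed cut-off, to be tuned per project

# Score every feature and keep those whose |correlation| exceeds the threshold.
scores = {name: pearson(values, target) for name, values in features.items()}
selected = [name for name, r in scores.items() if abs(r) > THRESHOLD]

print(scores)
print(selected)
```

A filter like this only captures linear, one-variable-at-a-time relationships, so it is best read as a first screening step rather than a complete selection procedure.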
The most popular languages for machine learning are Python, C/C++, Java, R and JavaScript.
The R programming language, for example, was built primarily for statistical applications. Therefore, it is highly suitable for statistical tasks as well as visualization using the popular R package ggplot2.