# Theory & Practice

Choose one of the real-life data projects offered by our industry partners, and work on it throughout the year. You will receive guidance through our weekly project seminars and periodic meetings with the project’s data owner.

The list of our industry partners will be published close to Y-DATA opening in October 2018.

Study specific topics in data analysis and machine learning in short, dedicated courses (4-7 weeks), ranging from mathematics for ML, statistics and Python to Spark and reinforcement learning. Additional topics can be studied on demand through our online courses on Coursera.

Become familiar with current scientific research through weekly research seminars, where you will discuss and explore the most recent advancements in the field in depth. In addition, you will have a chance to present a scientific paper at one of these seminars. Your presentation will be followed by a discussion of the relevance and value of this research to other students’ industry projects.

# Industry Project

To achieve full understanding of the use and application of ML algorithms, our participants will work on a real-life industry project, translating theoretical knowledge to practical process and overcoming realistic challenges.

Work on the project follows popular industry standards and methodologies, and draws on the students’ growing set of tools to methodically understand and solve a real-world problem.

A customer operates a forum where programmers ask each other questions, provide answers and rate questions by giving them "ups" and "downs". The forum has a core expert community that provides good answers and valuable insights. However, these experts often waste their time handling questions of little to no value: marking questions as duplicates and redirecting them, closing topics with incoherent or irrelevant questions, etc. Because of this, the overall efficiency of the system suffers.

The customer would like us to find a way to automatically recognize low-value questions and flag them.

### Full cycle of a data science project: Q&A forum

The customer wants to improve the system efficiency as measured by the mean time between a question being posted and the first accepted (upvoted) answer given. How do we translate this request into ML terms? Should this be treated as a classification (good question or not) or regression (predicted number of up/down votes) problem? Which metric should be used? Accuracy, precision, recall, AUC, etc.: which is most relevant to the situation?

Is the problem symmetrical? We’re more concerned with losing a valuable question than with missing several low-value ones.
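The asymmetry described above can be made concrete with a small sketch. It assumes the problem is framed as binary classification (label 1 = low-value question) and that a model outputs a score per question; the labels, scores and thresholds below are made-up illustration data, not results from any real forum. Raising the flagging threshold trades recall (low-value questions caught) for precision (fewer valuable questions flagged by mistake).

```python
from typing import List, Tuple


def precision_recall(labels: List[int], scores: List[float],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall for the 'low-value' class (label 1).

    A question is flagged when its score is at or above the threshold.
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(labels, flagged) if f and y == 1)
    fp = sum(1 for y, f in zip(labels, flagged) if f and y == 0)
    fn = sum(1 for y, f in zip(labels, flagged) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: 1 = low-value question, 0 = valuable question.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.40, 0.70, 0.20, 0.10, 0.60, 0.30]

for threshold in (0.5, 0.75):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.75: precision=1.00, recall=0.50
```

In this toy data the stricter threshold flags no valuable questions at all, at the cost of letting half of the low-value ones through, which matches the stated preference.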

What are our resource constraints? We probably need a solution that runs immediately once a question is posted, so we can’t use resource-heavy algorithms.

Downloading the questions and additional information, followed by a decision whether to use date-time stamps, user ID and other technical information. Cleaning the data by removing intrusive elements such as empty or corrupted questions, questions in Chinese asked in an English forum, or questions embedded as text in a picture.

Are more low-value questions asked by first-time users? Maybe we want to construct separate models for first-time and veteran user questions. Are most questions plain text, or do they use embedded code which should be parsed and analyzed separately? Do we have enough questions with up/down votes to construct a metric, or are there many old questions predating the up/down vote system which can’t be used and should be removed?

Recognizing which features make a statistically significant contribution to the problem at hand, and removing those that don’t: Is our users’ geolocation relevant to question quality? Or do the vast majority of users come from the same region?

Extracting useful features for learning and constructing new ones when needed: parsing date-time stamps, text vectorization, one-hot encoding keyword tags, etc.
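As a minimal sketch of the feature extraction just described, the function below parses a date-time stamp and one-hot encodes keyword tags. The field names (`posted_at`, `title`, `tags`), the timestamp format and the tag vocabulary are assumptions made for illustration; a real forum export will have its own schema.

```python
from datetime import datetime
from typing import Dict, List


def build_features(question: Dict, tag_vocabulary: List[str]) -> Dict[str, float]:
    """Turn one raw question record into a flat numeric feature dict."""
    # Assumed timestamp format; adjust to the actual data export.
    posted = datetime.strptime(question["posted_at"], "%Y-%m-%d %H:%M:%S")
    features = {
        "hour": float(posted.hour),
        "is_weekend": float(posted.weekday() >= 5),  # Saturday=5, Sunday=6
        "title_words": float(len(question["title"].split())),
    }
    # One-hot encoding: one 0/1 feature per tag in a fixed vocabulary.
    for tag in tag_vocabulary:
        features[f"tag_{tag}"] = float(tag in question["tags"])
    return features


# Hypothetical record.
question = {
    "posted_at": "2018-10-06 23:15:00",
    "title": "Why does my loop never end",
    "tags": ["python", "loops"],
}
features = build_features(question, ["python", "java", "loops"])
print(features)
```

Keeping the tag vocabulary fixed means every question yields the same set of feature names, which most learning algorithms require.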

If we’re splitting the data into training and test sets, how is the split made? By time? By user ID?

The heart of the learning process is selecting and building the model. We need to choose an algorithm: be it a simple logistic regression, XGBoost or an ensemble of neural networks, it must be chosen based on the resources available and the peculiarities of the problem. Then we train it, making sure we avoid overfitting, and tune the hyperparameters to maximize its performance.
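Returning to the question of how the train/test split is made, here is a sketch of one of the options mentioned, a split by time. It assumes each record carries an ISO-formatted `posted_at` string (a hypothetical field name), so that sorting the strings sorts the questions chronologically. The model is then always evaluated on questions posted after everything it was trained on, which mimics how it would be used in production.

```python
from typing import Dict, List, Tuple


def split_by_time(questions: List[Dict],
                  test_fraction: float = 0.25) -> Tuple[List[Dict], List[Dict]]:
    """Put the oldest questions in the training set and the newest in the test set."""
    ordered = sorted(questions, key=lambda q: q["posted_at"])
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical records, deliberately out of order.
questions = [
    {"id": i, "posted_at": f"2018-0{month}-15"}
    for i, month in enumerate([3, 1, 8, 5, 2, 7, 4, 6])
]
train, test = split_by_time(questions)
print(len(train), len(test))
print(max(q["posted_at"] for q in train), "<", min(q["posted_at"] for q in test))
```

A split by user ID would instead group records by user before dividing them, so that no user appears in both sets; which option is appropriate depends on what kind of leakage is the bigger concern.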

Finally, our model is ready: it marks all low-value questions with astounding precision and leaves all relevant questions intact. But will it help to boost the customer's metric? To check, we should probably design and conduct an A/B test and determine the statistical significance of the results. If A/B testing is impossible or undesirable, we could use some form of causal impact inference.
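One way to assess the statistical significance of such an A/B test is a permutation test on the customer's metric, the mean time to the first accepted answer. The sketch below is only one possible choice of test, and the sample values (hours to first accepted answer in the control and treatment groups) are invented for illustration.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(control: Sequence[float], treatment: Sequence[float],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical hours until the first accepted answer.
control = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.1]    # no flagging
treatment = [3.9, 4.2, 3.5, 4.8, 4.0, 4.4, 3.7, 4.1]  # flagging enabled

p = permutation_p_value(control, treatment)
print(f"estimated p-value: {p:.4f}")
```

A small p-value suggests the observed improvement is unlikely to be due to chance alone; it does not by itself say whether the improvement is large enough to matter to the customer.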

# Coursework

Math for Machine Learning

Fall semester

2 hours

4 weeks

Introduction to the main theoretical concepts and mathematical tools used in machine learning. The course focuses on linear algebra including basic concepts such as vectors and matrices as well as more advanced subjects such as decomposition techniques relevant for ML applications. Additional subjects covered include graph theory, multivariate calculus, optimization theory and signal processing.

Probability Theory and Statistics for Data Science

Fall semester

3 hours

6 weeks

This introductory course teaches the basics of probability theory and statistics. It aims to develop a good intuition for random events and variables, common distributions and their properties, estimators and statistical tests. The emphasis is on tools widely applied in data science, such as maximum likelihood estimation and Bayesian inference.

Python for Data Processing

Fall semester

3 hours

6 weeks

A highly practical course that aims to provide students with the fundamental Python toolkit of a data scientist, including how to set up and configure a working environment for a machine learning project and perform exploratory data analysis (including data cleaning, variable analysis, visualizations and feature construction) both locally and in the cloud.

Introduction to Machine Learning

Fall semester

2 hours

2 weeks

This course provides an outline and basic understanding of the key concepts and skills required of a data scientist. It offers an overview of the field today, its most common techniques and applications, and provides a basis for more advanced ML topics. It covers key ML concepts (models, labels, features, train and test sets, performance measures, validation, under- and over-fitting, etc.).

Supervised Learning

Fall semester

3 hours

6 weeks

The course introduces students to the most common machine learning tasks and tools, providing their first hands-on experience of constructing and evaluating models that use existing pre-labelled input/output data pairs to make accurate predictions on new data points. It covers topics ranging from basic classification and regression models, with their strengths and weaknesses, to complex modern applications including ensemble models and gradient boosting, as well as important general concepts such as performance measures and evaluation metrics.

Deep Learning

Fall semester

4 hours

7 weeks

This course introduces students to deep learning, one of the most popular and fastest-growing fields of machine learning, and gives them an understanding of the underlying principles of modern neural networks, their construction and applications (including NLP and computer vision). It covers common network architectures including convolutional and recurrent networks, backpropagation, sequence modelling, representation learning, autoencoders, information bottleneck theory and more. Students will also become familiar with PyTorch and TensorFlow.

Big Data

Spring semester

3 hours

6 weeks

Students will learn the techniques required to handle massive datasets which necessitate parallel processing by a cluster. The course covers the skills required to parallelize the learning process effectively and handle massive amounts of data: general understanding of the common solutions for distributed data storage, their strengths and limitations, how to manipulate distributed data, and how to scale machine learning algorithms both on a single multi-core processor and on a cluster.

Unsupervised Learning

Spring semester

3 hours

6 weeks

This advanced course aims to provide students with a highly valuable skill that has many real-world applications: deriving insights and constructing models that do not rely on the availability of pre-labeled data. The course covers techniques including pattern recognition, clustering, dimensionality reduction, matrix factorization, anomaly detection and genetic algorithms. It is accompanied by examples of cutting-edge applications in various business-oriented high-tech settings.

Reinforcement Learning

Spring semester

3 hours

6 weeks

The course introduces essential reinforcement learning techniques and concepts, including Q-learning, policy gradient methods, partially observable MDPs and exploration strategies. The skills acquired can be applied to many real-world problems that require an agent to learn to survive in an unknown environment through trial and error (e.g. video game AI, robotics, dialog systems).