Tag Archives: Machine Learning

40 Interview Questions asked at Startups in Machine Learning / Data Science

二月 11, 2018BigData, ML&DLBigdata, Machine Learningdotte

Machine learning and data science are being looked as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists. What could be a better start for your aspiring career!

However, still, getting into these roles is not easy. You obviously need to get excited about the idea, team and the vision of the company. You might also find some real difficult techincal questions on your way. The set of questions asked depend on what does the startup do. Do they provide consulting? Do they build ML products ? You should always find this out prior to beginning your interview preparation.

To help you prepare for your next interview, I’ve prepared a list of 40 plausible & tricky questions which are likely to come across your way in interviews. If you can answer and understand these question, rest assured, you will give a tough fight in your job interview.

Note: A key to answer these questions is to have concrete practical understanding on ML and related statistical concepts.

40 interview questions asked at startups in machine learning

Interview Questions on Machine Learning

Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Answer: Processing a high dimensional data on a limited memory machine is a strenuous task, your interviewer would be fully aware of that. Following are the methods you can use to tackle such situation:

Since we have lower RAM, we should close all other applications in our machine, including the web browser, so that most of the memory can be put to use.
We can randomly sample the data set. This means, we can create a smaller data set, let’s say, having 1000 variables and 300000 rows and do the computations.
To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, we’ll use correlation. For categorical variables, we’ll use chi-square test.
Also, we can use PCA and pick the components which can explain the maximum variance in the data set.
Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.
Building a linear model using Stochastic Gradient Descent is also helpful.
We can also apply our business understanding to estimate which all predictors can impact the response variable. But, this is an intuitive approach, failing to identify useful predictors might result in significant loss of information.

Note: For point 4 & 5, make sure you read about online learning algorithms & Stochastic Gradient Descent. These are advanced methods.

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between variance captured by the component. This makes the components easier to interpret. Not to forget, that’s the motive of doing PCA where, we aim to select fewer components (than features) which can explain the maximum variance in the data set. By doing rotation, the relative location of the components doesn’t change, it only changes the actual coordinates of the points.

If we don’t rotate the components, the effect of PCA will diminish and we’ll have to select more number of components to explain variance in the data set.

Know more: PCA

Q3. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since, the data is spread across median, let’s assume it’s a normal distribution. We know, in a normal distribution, ~68% of the data lies in 1 standard deviation from mean (or mode, median), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.

Q4. You are given a data set on cancer detection. You’ve build a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance because 96% (as given) might only be predicting majority class correctly, but our class of interest is minority class (4%) which is the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), F measure to determine class wise performance of the classifier. If the minority class performance is found to to be poor, we can undertake the following steps:

We can use undersampling, oversampling or SMOTE to make the data balanced.
We can alter the prediction threshold value by doing probability caliberation and finding a optimal threshold using AUC-ROC curve.
We can assign weight to classes such that the minority classes gets larger weight.
We can also use anomaly detection.

Know more: Imbalanced Classification

Q5. Why is naive Bayes so ‘naive’ ?

Answer: naive Bayes is so ‘naive’ because it assumes that all of the features in a data set are equally important and independent. As we know, these assumption are rarely true in real world scenario.

Q6. Explain prior probability, likelihood and marginal likelihood in context of naiveBayes algorithm?

Answer: Prior probability is nothing but, the proportion of dependent (binary) variable in the data set. It is the closest guess you can make about a class, without any further information. For example: In a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there are 70% chances that any new email would be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in presence of some other variable. For example: The probability that the word ‘FREE’ is used in previous spam message is likelihood. Marginal likelihood is, the probability that the word ‘FREE’ is used in any message.

Q7. You are working on a time series data set. You manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than decision tree model. Can this happen? Why?

Answer: Time series data is known to posses linearity. On the other hand, a decision tree algorithm is known to work best to detect non – linear interactions. The reason why decision tree failed to provide robust predictions because it couldn’t map the linear relationship as good as a regression model did. Therefore, we learned that, a linear regression model can provide robust prediction given the data set satisfies its linearity assumptions.

Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, company’s delivery team aren’t able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, they end up delivering food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But, wait! Such questions are asked to test your machine learning fundamentals.

This is not a machine learning problem. This is a route optimization problem. A machine learning problem consist of three things:

There exist a pattern.
You cannot solve it mathematically (even by writing exponential equations).
You have data on it.

Always look for these three factors to decide if machine learning is a tool to solve a particular problem.

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

Answer: Low bias occurs when the model’s predicted values are near to actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While it sounds like great achievement, but not to forget, a flexible model has no generalization capabilities. It means, when this model is tested on an unseen data, it gives disappointing results.

In such situations, we can use bagging algorithm (like random forest) to tackle high variance problem. Bagging algorithms divides a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

Also, to combat high variance, we can:

Use regularization technique, where higher model coefficients get penalized, hence lowering model complexity.
Use top n features from variable importance chart. May be, with all the variable in the data set, the algorithm is having difficulty in finding the meaningful signal.

Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say No, but that would be incorrect. Discarding correlated variables have a substantial effect on PCA because, in presence of correlated variables, the variance explained by a particular component gets inflated.

For example: You have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance than it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variable, which is misleading.

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, neither of models could perform better than benchmark score. Finally, you decided to combine those models. Though, ensembled models are known to return high accuracy, but you are unfortunate. Where did you miss?

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But, these learners provide superior result when the combined models are uncorrelated. Since, we have used 5 GBM models and got no accuracy improvement, suggests that the models are correlated. The problem with correlated models is, all the models provide same information.

For example: If model 1 has classified User1122 as 1, there are high chances model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak uncorrelated models to obtain better predictions.

Q12. How is kNN different from kmeans clustering?

Answer: Don’t get mislead by ‘k’ in their names. You should know that the fundamental difference between both these algorithms is, kmeans is unsupervised in nature and kNN is supervised in nature. kmeans is a clustering algorithm. kNN is a classification (or regression) algorithm.

kmeans algorithm partitions a data set into clusters such that a cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separability between these clusters. Due to unsupervised nature, the clusters have no labels.

kNN algorithm tries to classify an unlabeled observation based on its k (can be any number ) surrounding neighbors. It is also known as lazy learner because it involves minimal training of model. Hence, it doesn’t use training data to make generalization on unseen data set.

Q13. How is True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal having the formula (TP/TP + FN).

Know more: Evaluation Metrics

Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. For improvement, your remove the intercept term, your model R² becomes 0.8 from 0.3. Is it possible? How?

Answer: Yes, it is possible. We need to understand the significance of intercept term in a regression model. The intercept term shows model prediction without any independent variable i.e. mean prediction. The formula of R² = 1 – ∑(y – y´)²/∑(y – ymean)² where y´ is predicted value.

When intercept term is present, R² value evaluates your model wrt. to the mean model. In absence of intercept term (ymean), the model can make no such evaluation, with large denominator, ∑(y - y´)²/∑(y)² equation’s value becomes smaller than actual, resulting in higher R².

Q15. After analyzing the model, your manager has informed that your regression model is suffering from multicollinearity. How would you check if he’s true? Without losing any information, can you still build a better model?

Answer: To check multicollinearity, we can create a correlation matrix to identify & remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can use calculate VIF (variance inflation factor) to check the presence of multicollinearity. VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.

But, removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Also, we can add some random noise in correlated variable so that the variables become different from each other. But, adding noise might affect the prediction accuracy, hence this approach should be carefully used.

Know more: Regression

Q16. When is Ridge regression favorable over Lasso regression?

Answer: You can quote ISLR’s authors Hastie, Tibshirani who asserted that, in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression.

Conceptually, we can say, lasso regression (L1) does both variable selection and parameter shrinkage, whereas Ridge regression only does parameter shrinkage and end up including all the coefficients in the model. In presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.

Know more: Ridge and Lasso Regression

Q17. Rise in global average temperature led to decrease in number of pirates around the world. Does that mean that decrease in number of pirates caused the climate change?

Answer: After reading this question, you should have understood that this is a classic case of “causation and correlation”. No, we can’t conclude that decrease in number of pirates caused the climate change because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and number of pirates, but based on this information we can’t say that pirated died because of rise in global average temperature.

Know more: Causation and Correlation

Q18. While working on a data set, how do you select important variables? Explain your methods.

Answer: Following are the methods of variable selection you can use:

Remove the correlated variables prior to selecting important variables
Use linear regression and select variables based on p values
Use Forward Selection, Backward Selection, Stepwise Selection
Use Random Forest, Xgboost and plot variable importance chart
Use Lasso Regression
Measure information gain for the available set of features and select top n features accordingly.

Q19. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we’ll get different covariances which can’t be compared because of having unequal scales. To combat such situation, we calculate correlation to get a value between -1 and 1, irrespective of their respective scale.

Q20. Is it possible capture the correlation between continuous and categorical variable? If yes, how?

Answer: Yes, we can use ANCOVA (analysis of covariance) technique to capture association between continuous and categorical variables.

Q21. Both being tree based algorithm, how is random forest different from Gradient boosting algorithm (GBM)?

Answer: The fundamental difference is, random forest uses bagging technique to make predictions. GBM uses boosting techniques to make predictions.

In bagging technique, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm a model is build on all samples. Later, the resultant predictions are combined using voting or averaging. Bagging is done is parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, such that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continue until a stopping criterion is reached.

Random forest improves model accuracy by reducing variance (mainly). The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy my reducing both bias and variance in a model.

Know more: Tree based modeling

Q22. Running a binary classification tree algorithm is the easy part. Do you know how does a tree splitting takes place i.e. how does the tree decide which variable to split at the root node and succeeding nodes?

Answer: A classification trees makes decision based on Gini Index and Node Entropy. In simple words, the tree algorithm find the best possible feature which can divide the data set into purest possible children nodes.

Gini index says, if we select two items from a population at random then they must be of same class and probability for this is 1 if population is pure. We can calculate Gini as following:

Calculate Gini for sub-nodes, using formula sum of square of probability for success and failure (p^2+q^2).
Calculate Gini for split using weighted Gini score of each node of that split

Entropy is the measure of impurity as given by (for binary class):

Here p and q is probability of success and failure respectively in that node. Entropy is zero when a node is homogeneous. It is maximum when a both the classes are present in a node at 50% – 50%. Lower entropy is desirable.

Q23. You’ve built a random forest model with 10000 trees. You got delighted after getting training error as 0.00. But, the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?

Answer: The model has overfitted. Training error 0.00 means the classifier has mimiced the training data patterns to an extent, that they are not available in the unseen data. Hence, when this classifier was run on unseen sample, it couldn’t find those patterns and returned prediction with higher error. In random forest, it happens when we use larger number of trees than necessary. Hence, to avoid these situation, we should tune number of trees using cross validation.

Q24. You’ve got a data set to work having p (no. of variable) > n (no. of observation). Why is OLS as bad option to work with? Which techniques would be best to use? Why?

Answer: In such high dimensional data sets, we can’t use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least square coefficient estimate, the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS, ridge which can shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least square estimates have higher variance.

Among other methods include subset regression, forward stepwise regression.

Q25. What is convex hull ? (Hint: Think SVM)

Answer: In case of linearly separable data, convex hull represents the outer boundaries of the two group of data points. Once convex hull is created, we get maximum margin hyperplane (MMH) as a perpendicular bisector between two convex hulls. MMH is the line which attempts to create greatest separation between two groups.

Q26. We know that one hot encoding increasing the dimensionality of a data set. But, label encoding doesn’t. How ?

Answer: Don’t get baffled at this question. It’s a simple question asking the difference between the two.

Using one hot encoding, the dimensionality (a.k.a features) in a data set get increased because it creates a new variable for each level present in categorical variables. For example: let’s say we have a variable ‘color’. The variable has 3 levels namely Red, Blue and Green. One hot encoding ‘color’ variable will generate three new variables as Color.Red, Color.Blue and Color.Green containing 0 and 1 value.

In label encoding, the levels of a categorical variables gets encoded as 0 and 1, so no new variable is created. Label encoding is majorly used for binary variables.

Q27. What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?

Answer: Neither.

In time series problem, k fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the data set will separate these trends, and we might end up validation on past years, which is incorrect. Instead, we can use forward chaining strategy with 5 fold as shown below:

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]

where 1,2,3,4,5,6 represents “year”.

Q28. You are given a data set consisting of variables having more than 30% missing values? Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

Answer: We can deal with them in the following ways:

Assign a unique category to missing values, who knows the missing values might decipher some trend
We can remove them blatantly.
Or, we can sensibly check their distribution with the target variable, and if found any pattern we’ll keep those missing values and assign them a new category while removing others.

29. ‘People who bought this, also bought…’ recommendations seen on amazon is a result of which algorithm?

Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.

Collaborative Filtering algorithm considers “User Behavior” for recommending items. They exploit behavior of other users and items in terms of transaction history, ratings, selection and purchase information. Other users behaviour and preferences over the items are used to recommend items to the new users. In this case, features of the items are not known.

Know more: Recommender System

Q30. What do you understand by Type I vs Type II error ?

Answer: Type I error is committed when the null hypothesis is true and we reject it, also known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and we accept it, also known as ‘False Negative’.

In the context of confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0). Type II error occurs when we classify a value as negative (0) when it is actually positive(1).

Q31. You are working on a classification problem. For validation purposes, you’ve randomly sampled the training data set into train and validation. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?

Answer: In case of classification problem, we should always use stratified sampling instead of random sampling. A random sampling doesn’t takes into consideration the proportion of target classes. On the contrary, stratified sampling helps to maintain the distribution of target variable in the resultant distributed samples also.

Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It is an indicator of percent of variance in a predictor which cannot be accounted by other predictors. Large values of tolerance is desirable.

We will consider adjusted R² as opposed to R² to evaluate model fit because R² increases irrespective of improvement in prediction accuracy as we add more variables. But, adjusted R² would only increase if an additional variable improves the accuracy of model, otherwise stays same. It is difficult to commit a general threshold value for adjusted R² because it varies between data sets. For example: a gene mutation data set might result in lower adjusted R² and still provide fairly good predictions, as compared to a stock market data where lower adjusted R² implies that model is not good.

Q33. In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance ?

Answer: We don’t use manhattan distance because it calculates distance horizontally or vertically only. It has dimension restrictions. On the other hand, euclidean metric can be used in any space to calculate distance. Since, the data points can be present in any dimension, euclidean distance is a more viable option.

Example: Think of a chess board, the movement made by a bishop or a rook is calculated by manhattan distance because of their respective vertical & horizontal movements.

Q34. Explain machine learning to me like a 5 year old.

Answer: It’s simple. It’s just like how babies learn to walk. Every time they fall down, they learn (unconsciously) & realize that their legs should be straight and not in a bend position. The next time they fall down, they feel pain. They cry. But, they learn ‘not to stand like that again’. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or wall or anything near them, which helps them stand firm.

This is how a machine works & develops intuition from its environment.

Note: The interview is only trying to test if have the ability of explain complex concepts in simple terms.

Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

Answer: We can use the following methods:

Since logistic regression is used to predict probabilities, we can use AUC-ROC curve along with confusion matrix to determine its performance.
Also, the analogous metric of adjusted R² in logistic regression is AIC. AIC is the measure of fit which penalizes model for the number of model coefficients. Therefore, we always prefer model with minimum AIC value.
Null Deviance indicates the response predicted by a model with nothing but an intercept. Lower the value, better the model. Residual deviance indicates the response predicted by a model on adding independent variables. Lower the value, better the model.

Know more: Logistic Regression

Q36. Considering the long list of machine learning algorithm, given a data set, how do you decide which one to use?

Answer: You should say, the choice of machine learning algorithm solely depends of the type of data. If you are given a data set which is exhibits linearity, then linear regression would be the best algorithm to use. If you given to work on images, audios, then neural network would help you to build a robust model.

If the data comprises of non linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is to build a model which can be deployed, then we’ll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

In short, there is no one master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.

Q37. Do you suggest that treating a categorical variable as continuous variable would result in a better predictive model?

Answer: For better predictions, categorical variable can be considered as a continuous variable only when the variable is ordinal in nature.

Q38. When does regularization becomes necessary in Machine Learning?

Answer: Regularization becomes necessary when the model begins to ovefit / underfit. This technique introduces a cost term for bringing in more features with the objective function. Hence, it tries to push the coefficients for many variables to zero and hence reduce cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

Q39. What do you understand by Bias Variance trade off?

Answer: The error emerging from any model can be broken down into three components mathematically. Following are these component :

Bias error is useful to quantify how much on an average are the predicted values different from the actual value. A high bias error means we have a under-performing model which keeps on missing important trends. Variance on the other side quantifies how are the prediction made on same observation different from each other. A high variance model will over-fit on your training population and perform badly on any observation beyond training.

Q40. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.

Answer: OLS and Maximum likelihood are the methods used by the respective regression methods to approximate the unknown parameter (coefficient) value. In simple words,

Ordinary least square(OLS) is a method used in linear regression which approximates the parameters resulting in minimum distance between actual and predicted values. Maximum Likelihood helps in choosing the the values of parameters which maximizes the likelihood that the parameters are most likely to produce observed data.

End Notes

You might have been able to answer all the questions, but the real value is in understanding them and generalizing your knowledge on similar questions. If you have struggled at these questions, no worries, now is the time to learn and not perform. You should right now focus on learning these topics scrupulously.

These questions are meant to give you a wide exposure on the types of questions asked at startups in machine learning. I’m sure these questions would leave you curious enough to do deeper topic research at your end. If you are planning for it, that’s a good sign.

Did you like reading this article? Have you appeared in any startup interview recently for data scientist profile? Do share your experience in comments below. I’d love to know your experience.

from:https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/

2017年度盘点：15个最流行的GitHub机器学习项目

十二月 26, 2017ML&DLMachine Learningdotte

在本文中，作者列出了 2017 年 GitHub 平台上最为热门的知识库，囊括了数据科学、机器学习、深度学习中的各种项目，希望能对大家学习、使用有所帮助。另，小编恬不知耻地把机器之心的 Github 项目也加了进来，求 star，求 pull requests。

GitHub 是计算机科学领域最为活跃的社区，在 GitHub 上，来自不同背景的人们分享越来越多的软件工具和资源库。在其中，你不仅可以获取自己所需的工具，还可以观看代码是如何写成并实现的。

作为一名机器学习爱好者，作者在本文中列出了 2017 年 GitHub 平台上最为热门的知识库，其中包含了学习资料与工具。希望对你的学习和研究有所帮助。

1. 学习资源

1.1 Awesome Data Science

项目地址： https://github.com/bulutyazilim/awesome-datascience

该 repo 是数据科学的基本资源。多年来的无数贡献构建了此 repo 里面的各种资源，从入门指导、信息图，到社交网络上你需要 follow 的账号。无论你是初学者还是业内老兵，里面都有大量的资源需要学习。

从该 repo 的目录可以看出其深度。

1.2 Machine Learning / Deep Learning Cheat Sheet

项目地址：https://github.com/kailashahirwar/cheatsheets-ai

该项目以 cheatsheet 的形式介绍了机器学习/深度学习中常用的工具与技术，从 pandas 这样的简单工具到深度学习技术都涵盖其中。在收藏或者 fork 该项目之后，你就不用再费事搜索常用的技巧和注意事项了。

简单介绍下，cheatsheets 类型包括 pandas、numpy、scikit learn、matplotlib、ggplot、dplyr、tidyr、pySpark 和神经网络。

1.3 Oxford Deep Natural Language Processing Course Lectures

项目地址：https://github.com/oxford-cs-deepnlp-2017/lectures

斯坦福的 NLP 课程一直是自然语言处理领域的金牌教程。但是近期随着深度学习的发展，在 RNN 和 LSTM 等深度学习架构的帮助下，NLP 出现了大量进步。

该 repo 基于牛津大学的 NLP 课程，涵盖先进技术和术语，如使用 RNN 进行语言建模、语音识别、文本转语音（TTS）等。该 repo 包含该课程从课程材料到实践联系的所有内容。

1.4 PyTorch – Tutorial

项目地址：https://github.com/yunjey/pytorch-tutorial

截至今天，PyTorch 仍是 TensorFlow 的唯一竞争对手，它的功能和声誉使其成为了颇具竞争力的深度学习框架。因其 Pythonic 风格的编程、动态计算图和更快的原型开发，Pytorch 已经获得了深度学习社区的广泛关注。

该知识库包含 PyTorch 上大量的深度学习任务代码，包括 RNN、GAN 和神经风格迁移。其中的大多数模型在实现上仅需 30 余行代码。这充分说明了 PyTorch 的抽象能力，它让研究人员可以专注于找到正确的模型，而无需纠缠于编程语言和工具选择等细节。

1.5 Resources of NIPS 2017

项目地址：https://github.com/hindupuravinash/nips2017

该 repo 包含 NIPS 2017 的资源和所有受邀演讲、教程和研讨会的幻灯片。NIPS 是一年一度的机器学习和计算神经科学会议。

过去几年中，数据科学领域内的大部分突破性研究都曾作为研究结果出现在 NIPS 大会上。如果你想站在领域前沿，那这就是很好的资源！

2. 开源软件库

2.1 TensorFlow

项目地址：https://github.com/tensorflow/tensorflow

TensorFlow 是一种采用数据流图（data flow graph）进行数值计算的开源软件库。其中 Tensor 代表传递的数据为张量（多维数组），Flow 代表使用计算图进行运算。数据流图用「结点」（node）和「边」（edge）组成的有向图来描述数学运算。「结点」一般用来表示施加的数学操作，但也可以表示数据输入的起点和输出的终点，或者是读取/写入持久变量（persistent variable）的终点。边表示结点之间的输入/输出关系。这些数据边可以传送维度可动态调整的多维数据数组，即张量（tensor）。

TensorFlow 自正式发布以来，一直保持着「深度学习/机器学习」顶尖库的位置。谷歌大脑团队和机器学习社区也一直在积极地贡献并保持最新的进展，尤其是在深度学习领域。

TensorFlow 最初是使用数据流图进行数值计算的开源软件库，但从目前来看，它已经成为构建深度学习模型的完整框架。它目前主要支持 TensorFlow，但也支持 C、C++ 和 Java 等语言。此外，今年 11 月谷歌终于发布了新工具的开发者预览版本，这是一款 TensorFlow 用于移动设备和嵌入式设备的轻量级解决方案。

2.2 TuriCreate：一个简化的机器学习库

项目地址：https://github.com/apple/turicreate

TuriCreate 是苹果最近贡献的一个开源项目，它为机器学习模型提供易于使用的创建方法和部署方法，这些机器学习模型包括目标检测、人体姿势识别和推荐系统等复杂任务。

可能我们作为机器学习爱好者会比较熟悉 GraphLab Create，一个非常简便高效的机器学习库，而当初创建该库的公司 TuriCreate 被苹果收购时，造成了很大反响。

TuriCreate 是针对 Python 开发的，且它最强的的特征是将机器学习模型部署到 Core ML 中，用于开发 iOS、macOS、watchOS 和 tvOS 等应用程序。

2.3 OpenPose

项目地址： https://github.com/CMU-Perceptual-Computing-Lab/openpose

OpenPose 是一个多人关键点检测库，它可以帮助我们实时地检测图像或视频中某个人的位置。

OpenPose 软件库由 CMU 的感知计算实验室开发并维护，对于说明开源研究如何快速应用于部署到工业中，它是非常好的一个案例。

OpenPose 的一个使用案例是帮助解决活动检测问题，即演员完成的动作或活动能被实时捕捉到。然后这些关键点和它们的动作可用来制作动画片。OpenPose 不仅有 C++的 API 以使开发者能快速地访问它，同时它还有简单的命令行界面用来处理图像或视频。

2.4 DeepSpeech

项目地址： https://github.com/mozilla/DeepSpeech

DeepSpeech 是百度开发的开源实现库，它提供了当前顶尖的语音转文本合成技术。它基于 TensorFlow 和 Python，但也可以绑定到 NodeJS 或使用命令行运行。

Mozilla 一直是构建 DeepSpeech 和开源软件库的主要研究力量，Mozilla 技术战略副总裁 Sean White 在一篇博文中写道：「目前只有少数商用质量的语音识别引擎是开源的，它们大多数由大型公司主宰。这样就减少了初创公司、研究人员和传统企业为它们的用户定制特定的产品与服务。但我们与机器学习社区的众多开发者和研究者共同完善了该开源库，因此目前 DeepSpeech 已经使用了复杂和前沿的机器学习技术创建语音到文本的引擎。」

2.5 Mobile Deep Learning

项目地址：https://github.com/baidu/mobile-deep-learning

该 repo 将数据科学中的当前最佳技术移植到了移动平台上。该 repo 由百度研究院开发，目的是将深度学习模型以低复杂性和高速度部署到移动设备（例如 Android 和 IOS）上。

该 repo 解释了一个简单的用例，即目标检测。它可以识别目标（例如一张图像中的手机）的准确位置，很棒不是吗？

2.6 Visdom

项目地址：https://github.com/facebookresearch/visdom

Visdom 支持图表、图像和文本在协作者之间进行传播。你可以用编程的方式组织可视化空间，或者通过 UI 为实时数据创建仪表盘，检查实验结果，或者调试实验代码。

绘图函数中的输入会发生改变，尽管大部分输入是数据的张量 X（而非数据本身）和（可选）张量 Y（包含可选数据变量，如标签或时间戳）。它支持所有基本图表类型，以创建 Plotly 支持的可视化。

Visdom 支持使用 PyTorch 和 Numpy。

2.7 Deep Photo Style Transfer

项目地址：https://github.com/luanfujun/deep-photo-styletransfer

该 repo 基于近期论文《Deep Photo Style Transfer》，该论文介绍了一种用于摄影风格迁移的深度学习方法，可处理大量图像内容，同时有效迁移参考风格。该方法成功克服了失真，满足了大量场景中的摄影风格迁移需求，包括时间、天气、季节、艺术编辑等场景。

2.8 CycleGAN

项目地址：https://github.com/junyanz/CycleGAN

CycleGAN 是一个有趣且强大的库，展现了该顶尖技术的潜力。举例来说，下图大致展示了该库的能力：调整图像景深。这里有趣的点在于你事先并没有告诉算法需要注意图像的哪一部分。算法完全依靠自己做到了！

目前该库用 Lua 编写，但是它也可以在命令行中使用。

2.9 Seq2seq

项目地址：https://github.com/google/seq2seq

Seq2seq 最初是为机器翻译而建立的，但已经被开发用于多种其它任务，包括摘要生成、对话建模和图像捕捉。只要一个问题的结构是将输入数据编码为一种格式，并将其解码为另一种格式，就可以使用 Seq2seq 框架。它使用了所有流行的基于 Python 的 TensorFlow 库进行编程。

2.10 Pix2code

项目地址：https://github.com/tonybeltramelli/pix2code

这个深度学习项目非常令人振奋，它尝试为给定的 GUI 自动生成代码。当建立网站或移动设备界面的时候，通常前端工程师必须编写大量枯燥的代码，这很费时和低效。这阻碍了开发者将主要的时间用于实现真正的功能和软件逻辑。Pix2code 的目的是通过将过程自动化来克服这一困难。它基于一种新方法，允许以单个 GUI 截图作为输入来生成计算机 token。

Pix2code 使用 Python 编写，可将移动设备和网站界面的捕捉图像转换成代码。

3. 机器之心项目

机器之心目前在 GitHub 上也有三个项目，分别是评估人工智能各领域优秀公司的 AI00、人工智能领域中英术语集和模型试验与解释项目。

3.1 AI00——机器之心百家影响人工智能未来的公司榜单

项目地址：https://github.com/jiqizhixin/AI00

人工智能是一个复杂庞大的体系，涉及众多学科，也关乎技术、产品、行业和资本等众多要素，本报告的写作团队只代表他们的专业观点，有自己的局限性，需要更多行业专家参与进来加以修正和完善。

我们深刻地理解在没有专业用户反馈的情况下所做出报告的质量局限性，所以希望用工程界「Agile Development」的理念来对待我们的报告，不断收集专业反馈来持续提升报告质量。

为此，我们将邀请人工智能领域的科学家、技术专家、产业专家、专业投资人和读者加入进来，共同完成这项人工智能的长期研究。我们将对参与者提供的信息进行汇总和整理，以月度为单位更新此份报告。

3.2 Artificial-Intelligence-Terminology

项目地址：https://github.com/jiqizhixin/Artificial-Intelligence-Terminology

我们将机器之心在编译技术文章和论文过程中所遇到的专业术语记录下来，希望有助于大家查阅和翻译（第二版）。

本词汇库目前拥有的专业词汇共计 760 个，主要为机器学习基础概念和术语，同时也是该项目的基本词汇。机器之心将继续完善术语的收录和扩展阅读的构建。

词汇更新主要分为两个阶段，第一阶段机器之心将继续完善基础词汇的构建，即通过权威教科书或其它有公信力的资料抽取常见术语。第二阶段机器之心将持续性地把编译论文或其他资料所出现的非常见术语更新到词汇表中。

读者的反馈意见和更新建议将贯穿整个阶段，并且我们将在项目致谢页中展示对该项目起积极作用的读者。因为我们希望术语的更新更具准确度和置信度，所以我们希望读者能附上该术语的来源地址与扩展地址。因此，我们能更客观地更新词汇，并附上可信的来源与扩展。

3.3 ML-Tutorial-Experiment

项目地址：https://github.com/jiqizhixin/ML-Tutorial-Experiment

该项目主要是展示我们在实验机器学习模型中所获得的经验与解释，目前我们解释并实现了卷积神经网络、生成对抗网络和 CapsNet。这些实现都有非常详细的文章以说明模型的结构与实现代码。如下所示为这三个实现项目的说明：

原文链接：https://www.analyticsvidhya.com/blog/2017/12/15-data-science-repositories-github-2017/

声明：本文由机器之心编译出品，原文来自Analytics Vidhya，转载请查看要求，机器之心对于违规侵权者保有法律追诉权。

from:https://www.jiqizhixin.com/articles/2017-12-21-10

机器学习驱动编程：新世界的新编程

九月 13, 2017ML&DLMachine Learning, Programmingdotte

如果今天要重新建立Google，这家公司的大部分系统将会是通过学习得来的，而非编写代码而来。Google的25,000名开发者中原本有大概10%精通机器学习，现在这一比例可能会达到100% — Jeff Dean

和天气一样，大家都会抱怨编程，但谁也没有为此做点什么。这种情况正在发生变化，就像突如其来的暴风雨一样，这些变化也来自一个出乎意料的方向：机器学习/深度学习。

我知道很多人对深度学习都听厌了。谁不是呢？但是编程技术很长时间来都处在一成不变的情况下，是时候做点什么改变这一情况了。

各种有关编程的“战斗”还在继续，但解决不了任何问题。函数还是对象，这种语言还是那种语言，这朵公有云还是那朵公有云或者这朵私有云亦或那朵“填补了空白”的云，REST还是Unrest，这种字节级编码还是另一种编码，这个框架还是那个框架，这种方法论还是那种方法论，裸机还是容器或者虚拟机亦或Unikernel，单层还是微服务或者NanoService，最终一致还是事务型，可变（Mutable）还是不可变（Immutable），DevOps还是NoOps或者SysOps，横向扩展还是纵向扩展，中心化还是去中心化，单线程还是大规模并行，同步还是异步。诸如此类永无止境。

年复一年，天天如此。我们只是创造了调用函数的不同方法，但最终的代码还需要我们人类来编写。更强大的方法应该是让机器编写我们需要的函数，这就是机器学习大展拳脚的领域了，可以为我们编写函数。这种年复一年天天如此的无聊活动还是交给机器学习吧。

机器学习驱动的编程

我最初是和Jeff Dean聊过之后开始接触用深度学习方式编程的。当时我针对聊天内容写了篇文章：《Jeff Dean针对Google大规模深度学习的看法》。强烈建议阅读本文了解深度学习技术。围绕本文的主题，本次讨论的重点在于深度学习技术如何有效地取代人工编写的程序代码：

Google的经验是对于大量子系统，其中一些甚至是通过机器学习方式获得的，使用更为通用的端到端机器学习系统取代它们。通常当你有大量复杂的子系统时，必须通过大量复杂代码将它们连接在一起。Google的目标是让大家使用数据和非常简单的算法取代这一切。

机器学习将成为软件工程的敏捷工具

Google研究总监Peter Norvig针对这个话题进行过详细的讨论：用深度学习与可理解性对抗软件工程和验证。他的大致想法如下：

软件。你可以将软件看作为函数构建的规范，而实现相应的函数就可以满足规范的要求。
机器学习。以（x，y）对为例，你会猜测某一函数会使用x作为参数并生成y作为结果，这样就能很好地概括出x这个新值。
深度学习。依然是以（x，y）对为例，可以通过学习知道你所组建的表征（Representation）具备不同级别的抽象，而非简单直接的输入和输出。这样也可以很好地概括出x这个新值。

软件工程师并不需要介入这个循环，只需要将数据流入机器学习组件并流出所需的表征。

表征看起来并不像代码，而是类似这样：

这种全新类型的程序并非人类可以理解的函数分解（Functional decomposition），看起来更像是一堆参数。

在机器学习驱动的编程（MLDP）世界里，依然可以有人类的介入，不过这些人再也不叫“程序员”了，他们更像是数据科学家。

可以通过范例习得程序的部分内容吗？可以。

有一个包含超过 2000 行代码的开源拼写检查软件的例子，这个软件的效果始终不怎么好。但只要通过包含 17 行代码的朴素贝叶斯分类器就能以同样性能实现这个软件的功能，但代码量减少了100倍，效果也变得更好。
AlphaGo是另一个例子，该程序的结构是手工编写的，所有参数则是通过学习得到的。

只通过范例可以习得完整的程序吗？简单的程序可以，但大型传统程序目前还不行。

在Jeff Dean的讲话中，他介绍了一些有关使用神经网络进行序列学习的顺序的细节。借助这种方法可以从零开始构建最尖端的机器翻译程序，这将是一种端到端学习获得的完整系统，无需手工编写大量代码或机器学习的模型即可解决一系列细节问题。
另一个例子是学习如何玩Atari出品的游戏，对于大部分游戏机器可以玩得和人一样好，甚至比人玩得更好。如果按照传统方式对不同组件进行长期的规划，效果肯定不会像现在这么好。
神经图灵机（Neural Turing Machine）也是一次学习如何用程序编写更复杂程序的尝试，但Peter认为这种技术不会走的太远。

正如你所期待的，Norvig先生的这次讲话非常棒，很有远见，值得大家一看。

Google正在以开源代码的方式从GitHub收集数据

深度学习需要数据，如果你想要创建一个能通过范例学习如何编程的AI，肯定需要准备大量程序/范例作为学习素材。

目前Google和GitHub已经开始合作让开源数据更可用。他们已经从GitHub Archive将超过3TB的数据集上传至BigQuery，其中包含超过280万个开源GitHub代码库中包含的活动数据，以及超过1.45亿次提交，和超过20亿不同的文件路径。

这些都可以当作很棒的训练数据。

就真的完美无缺吗？

当然不是。神经网络如同“黑匣子”般的本质使其难以与其他技术配合使用，这就像试图将人类潜意识行为与意识行为背后的原因结合在一起一样。

Jeff Dean还提到了Google搜索评级团队在搜索评级研究工作中运用神经网络技术时的犹豫。对于搜索评级，他们想要了解整个模型，了解做出某一决策的原因。当系统出现错误后他们还想了解为什么会出现这样的错误。

为了解决这样的问题，需要创建配套的调试工具，而相关工具必须具备足够的可理解性。

这样的做法会有效的。针对搜索结果提供的机器学习技术RankBrain发布于2015年，现在已成为第三大最重要的搜索评级指标（指标共有100项）。详情可访问：Google将利润丰厚的网页搜索技术交给AI计算机处理。

Peter还通过一篇论文《机器学习：技术债的高利率信用卡》进一步介绍了相关问题的更多细节。缺乏明确的抽象层，就算有Bug你也不知道到底在哪。更改任何细节都要改变全局，很难进行纠正，纠正后的结果更是难以预测。反馈环路，机器学习系统生成数据后会将这些数据重新送入系统，导致反馈环路。诱惑性损害（Attractive Nuisance），一旦一个系统使得所有人都想使用，这样的系统很可能根本无法在不同上下文中使用。非稳态（Non-stationarity），随着时间流逝数据会产生变化，因此需要指出到底需要使用哪些数据，但这个问题根本没有明确答案。配置依赖性，数据来自哪里？数据准确吗？其他系统中出现的改动是否会导致这些数据产生变化，以至于今天看到的结果和昨天不一样？训练用数据和生产用数据是否不同？系统加载时是否丢弃了某些数据？缺乏工具，标准化软件开发有很多优秀的工具可以使用，但机器学习编程是全新的，暂时还没有工具可用。

MLDP是未来吗？

这并不是一篇“天呐，世界末日到了”这样的文章，而是一篇类似于“有些事你可能没听说过，世界又将为此产生变化，这多酷啊”的文章。

这种技术依然处于非常早期的阶段。对程序的部分内容进行学习目前已经是可行的，而对大规模复杂程序进行学习目前还不实用。但是当我们看到标题类似于Google如何将自己重塑为一家“机器学习为先”的公司这样的文章，可能还不明白这到底真正意味着什么。这种技术不仅仅是为了开发类似AlphaGo这样的系统，最终结果远比这个还要深刻。

查看英文原文：Machine Learning Driven Programming: A New Programming for a New World

from:http://www.infoq.com/cn/articles/machine-learning-programing

机器学习算法 Python&R 速查表

十二月 27, 2016ML&DL, Python, RMachine Learning, python, Rdotte

原文出处: Cheatsheet – Python & R codes for common Machine Learning Algorithms
在拿破仑•希尔的名著《思考与致富》中讲述了达比的故事：达比经过几年的时间快要挖掘到了金矿，却在离它三英尺的地方离开了！

现在,我不知道这个故事是否真实。但是,我肯定在我的周围有一些跟达比一样的人，这些人认为，不管遇到什么问题, 机器学习的目的就是执行以及使用2 – 3组算法。他们不去尝试更好的算法和技术，因为他们觉得太困难或耗费时间。

像达比一样,他们无疑是在到达最后一步的时候突然消失了!最后,他们放弃机器学习,说计算量非常大、非常困难或者认为自己的模型已经到达优化的临界点——真的是这样吗?

下面这些速查表能让这些“达比”成为机器学习的支持者。这是10个最常用的机器学习算法，这些算法使用了Python和R代码。考虑到机器学习在构建模型中的应用，这些速查表可以很好作为编码指南帮助你学好这些机器学习算法。Good Luck!

PDF版本

from:http://colobu.com/2015/11/05/full-cheatsheet-machine-learning-algorithms

python机器学习深度学习总结

十一月 25, 2016BigData, ML&DL, PythonBigdata, DeepLearning, Machine Learning, pythondotte

1、Python环境搭建（Windows）

开发工具：PyCharm Community Edition（free）

Python环境：WinPython 3.5.2.3Qt5
–此环境集成了机器学习和深度学习用到的主要包：
numpy,scipy,matplotlib,pandas,scikit-learn,theano,keras

IPython notebook :

2、示例代码：

scikit-learn sample

keras sample

3、数据集Datasets

GeoHey公共数据

4、kaggle平台

Kaggle是一个数据建模和数据分析竞赛平台。企业和研究者可在其上发布数据，统计学者和数据挖掘专家可在其上进行竞赛以产生最好的模型。这一众包模式依赖于这一事实，即有众多策略可以用于解决几乎所有预测建模的问题，而研究者不可能在一开始就了解什么方法对于特定问题是最为有效的。Kaggle的目标则是试图通过众包的形式来解决这一难题，进而使数据科学成为一场运动。(wiki)

5、常见问题处理

Approaching (Almost) Any Machine Learning Problem

Dotte博客

大数据、云计算、架构、语言的本质、计算的未来