All posts by dotte

uniflow

Hi Mingli,
Thank you for the request.
Extending the REST service is possible, but I do have a couple of questions.
In all the new calls you have specified {id} and {pwd}; how will these be linked to uniFLOW identities?
  • REST Call: CheckUser – No issues; I just need to know which uniFLOW identities it should check.
  • REST Call: RegisterUser – From what I can tell, this call is to be used to update or assign the openId to an existing user; again, I just need to know which identity type will hold this value.
  • REST Call: BindWeChatUser – I am not sure of the purpose of this call; could you please explain what should happen?
In the PCL you talk about creating a user in uniFLOW, but I don't see a call that would request this. Should the creation of the account be tied to one of the other calls?

When creating an account, should it be static, or should it link to AD or another user data management source? If static, what additional information / configuration should be applied to new accounts, i.e. Groups / CC / Budgets / email, etc.? Should this information come from the 3rd party app or be configured via uniFLOW?

Programmer Résumé Templates

Description

Interviews

A personal guide to software engineering technical interviews.

Maintainer – Kevin Naughton Jr.

Other Language Versions

Table of Contents

Online Practice

Online Interview Programming

Data Structures

Linked List

  • A linked list is a linear collection of nodes, where each node points to other nodes via pointers; together the nodes form a data structure that represents a sequence (see the sketch below).
  • Singly linked list: each node points only to the next node, and the last node points to null.
  • Doubly linked list: each node has two pointers p and n, where p points to the previous node and n points to the next node; the n pointer of the last node points to null.
  • Circular linked list: each node points to the next node, and the last node points back to the first node.
  • Time complexity:
    • Indexing: O(n)
    • Search: O(n)
    • Insertion: O(1)
    • Removal: O(1)
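A minimal singly linked list sketch in Python (illustrative only; the class and method names are our own, not from the repository below):

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next            # pointer to the next node; None marks the tail

    class SinglyLinkedList:
        def __init__(self):
            self.head = None

        def push_front(self, value):    # O(1) insertion at the head
            self.head = Node(value, self.head)

        def search(self, value):        # O(n) linear search
            node = self.head
            while node is not None:
                if node.value == value:
                    return node
                node = node.next
            return None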

Stack

  • A stack is a collection of elements with two basic operations: push, which pushes an element onto the stack, and pop, which removes the top element (see the sketch below).
  • Follows the last-in, first-out (LIFO) principle.
  • Time complexity:
    • Indexing: O(n)
    • Search: O(n)
    • Insertion: O(1)
    • Removal: O(1)
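As a minimal illustration, a plain Python list already provides LIFO stack behavior:

    stack = []
    stack.append(1)      # push: O(1) amortized
    stack.append(2)
    top = stack[-1]      # peek at the top element -> 2
    item = stack.pop()   # pop the top element: O(1) -> 2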

Queue

  • A queue is a collection of elements with two basic operations: enqueue, which inserts an element into the queue, and dequeue, which removes an element from it (see the sketch below).
  • Follows the first-in, first-out (FIFO) principle.
  • Time complexity:
    • Indexing: O(n)
    • Search: O(n)
    • Insertion: O(1)
    • Removal: O(1)
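A minimal sketch using collections.deque, which gives O(1) operations at both ends:

    from collections import deque

    queue = deque()
    queue.append("a")        # enqueue at the back
    queue.append("b")
    first = queue.popleft()  # dequeue from the front -> "a" (FIFO)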

Tree

  • A tree is an undirected, connected, acyclic graph.

Binary Tree

  • A binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child.
  • Full binary tree: every node in the tree has either 0 or 2 children.
  • Perfect binary tree: every internal node has exactly two children, and all leaves are at the same height.
  • Complete binary tree: every level except the last is completely filled, and on the last level only some nodes on the right may be missing.

Binary Search Tree

  • A binary search tree (BST) is a special binary tree in which the value of any node is greater than or equal to every value stored in its left subtree and less than or equal to every value stored in its right subtree (see the sketch below).
  • Time complexity:
    • Indexing: O(log(n))
    • Search: O(log(n))
    • Insertion: O(log(n))
    • Deletion: O(log(n))
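A sketch of BST insertion and search in Python (illustrative; duplicates are sent to the right subtree here, one of several common conventions):

    class BSTNode:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Insert key and return the subtree root: O(log n) on average."""
        if root is None:
            return BSTNode(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def search(root, key):
        """Return the node holding key, or None: O(log n) on average."""
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root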


Trie

  • A trie, also called a radix tree or prefix tree, is a search tree used to store dynamic sets or associative arrays whose keys are strings. A node does not store its key directly; instead, its position in the tree determines the key it is associated with. All descendants of a node share a common prefix, and the root corresponds to the empty string. A sketch follows below.
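A minimal trie sketch (illustrative; each node's position encodes its prefix):

    class Trie:
        def __init__(self):
            self.children = {}    # maps a character to a child node
            self.is_word = False  # True if the path to this node spells a stored word

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def starts_with(self, prefix):
            node = self
            for ch in prefix:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return True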


Fenwick Tree

  • A Fenwick tree, also known as a Binary Indexed Tree, is presented as a tree but is actually implemented as an array. The indices of the array represent the vertices of the tree, and the index of each vertex's parent or child can be obtained through bit operations. Each element of the array holds a precomputed sum over a range, and these precomputed values are updated whenever the tree is updated. A sketch follows below.
  • Time complexity:
    • Range sum: O(log(n))
    • Update: O(log(n))
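A Fenwick tree sketch; note that the parent/child index arithmetic is pure bit manipulation (i & -i extracts the lowest set bit, as in the bit manipulation section further below):

    class FenwickTree:
        def __init__(self, n):
            self.tree = [0] * (n + 1)   # 1-based indexing

        def update(self, i, delta):
            """Add delta at position i: O(log n)."""
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & (-i)           # jump to the next responsible node

        def query(self, i):
            """Prefix sum of positions 1..i: O(log n)."""
            total = 0
            while i > 0:
                total += self.tree[i]
                i -= i & (-i)           # strip the lowest set bit
            return total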


Segment Tree

  • A segment tree is a tree data structure for storing intervals or segments; it allows fast queries of how many of the stored segments contain a given point.
  • Time complexity:
    • Range query: O(log(n))
    • Update: O(log(n))


Heap

  • A heap is a specialized tree-based data structure satisfying the heap property: every parent-child pair in the heap is ordered by the same comparison. Heaps are divided into max-heaps and min-heaps. In a max-heap, a parent's key is always greater than or equal to its children's keys, and the maximum of the heap is stored at the root; in a min-heap, a parent's key is always less than or equal to its children's keys, and the minimum is stored at the root (see the sketch below).
  • Time complexity:
    • Access the max / min: O(1)
    • Insertion: O(log(n))
    • Remove the max / min: O(log(n))
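A quick sketch with Python's heapq module, which maintains a binary min-heap on a plain list (for a max-heap, a common trick is to push negated values):

    import heapq

    heap = []
    for x in [5, 1, 4]:
        heapq.heappush(heap, x)      # insert: O(log n)

    smallest = heap[0]               # access the minimum: O(1) -> 1
    smallest = heapq.heappop(heap)   # remove the minimum: O(log n) -> 1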


Hashing

  • Hashing maps data of arbitrary length to data of fixed length. The value returned by a hash function is called a hash value; when two different keys produce the same hash value, the phenomenon is called a collision.
  • Hash Map: a hash map is a data structure that associates keys with values. It uses a hash function to convert a key into an index into an array of buckets or slots, which speeds up the search for the target value.
  • Collision resolution
    • Separate chaining: in separate chaining, each bucket is independent and holds a list of entries. The time for a search is the time to find the bucket (constant) plus the time to traverse the list.
    • Open addressing: in open addressing, when a new value is inserted, we check whether its hash bucket is occupied; if it is, we probe successive candidate positions according to some algorithm until an unoccupied slot is found. "Open addressing" refers to the fact that an element's location is not always determined by its hash value.


Graph

  • A graph is a data structure modeling many-to-many relationships between elements, which together with a set of basic operations forms an abstract data type.
    • Undirected graph: an undirected graph has a symmetric adjacency matrix, so if an edge exists from node u to node v, an edge from v to u also exists.
    • Directed graph: the adjacency matrix of a directed graph is asymmetric; the existence of an edge from u to v does not imply an edge from v to u.


Algorithms

Sorting

Quicksort

  • Stable: no
  • Time complexity:
    • Best: O(n log(n))
    • Worst: O(n^2)
    • Average: O(n log(n))
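A quicksort sketch (the non-in-place variant, chosen for clarity; production implementations usually partition in place):

    def quicksort(arr):
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]    # elements below the pivot
        mid = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]   # elements above the pivot
        return quicksort(left) + mid + quicksort(right)

    assert quicksort([3, 1, 2]) == [1, 2, 3]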


Merge Sort

  • Merge sort is a classic divide-and-conquer algorithm: it repeatedly splits an array into two halves, sorts the left and right subarrays separately, and then merges the two sorted arrays into a new sorted array (see the sketch below).
  • Stable: yes
  • Time complexity:
    • Best: O(n log(n))
    • Worst: O(n log(n))
    • Average: O(n log(n))
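A merge sort sketch; the <= comparison in the merge step is what makes the sort stable:

    def merge_sort(arr):
        if len(arr) <= 1:
            return arr
        mid = len(arr) // 2
        left = merge_sort(arr[:mid])
        right = merge_sort(arr[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:           # <= keeps equal elements in order
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]  # append whichever half remains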


Bucket Sort

  • Bucket sort distributes the elements of an array into a finite number of buckets. Each bucket is then sorted individually (possibly using a different sorting algorithm, or by recursively applying bucket sort).
  • Time complexity:
    • Best: Ω(n + k)
    • Worst: O(n^2)
    • Average: Θ(n + k)


Radix Sort

  • Radix sort is similar to bucket sort in that it distributes the elements of an array into a finite number of buckets; however, instead of sorting each bucket individually, it merges the buckets directly after distributing.
  • Time complexity:
    • Best: Ω(nk)
    • Worst: O(nk)
    • Average: Θ(nk)

Graph Algorithms

Depth-First Search

  • Depth-first search is a graph traversal algorithm that explores as far as possible along each branch before backtracking (a sketch covering both DFS and BFS follows the next section).
  • Time complexity: O(|V| + |E|)


Breadth-First Search

  • Breadth-first search is a graph traversal algorithm that explores all neighbor nodes before moving on to nodes at the next depth level (see the sketch below).
  • Time complexity: O(|V| + |E|)
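Sketches of both traversals over an adjacency-list graph; the only structural difference is the worklist (LIFO stack for DFS, FIFO queue for BFS):

    from collections import deque

    def dfs(graph, start):
        seen, stack, order = set(), [start], []
        while stack:
            node = stack.pop()              # LIFO -> go deep first
            if node not in seen:
                seen.add(node)
                order.append(node)
                stack.extend(graph[node])
        return order

    def bfs(graph, start):
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()          # FIFO -> go wide first
            order.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order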


Topological Sort

  • Topological sort is a linear ordering of the nodes of a directed graph such that if there is an edge from u to v, then u appears before v in the ordering.
  • Time complexity: O(|V| + |E|)

Dijkstra's Algorithm

  • Dijkstra's algorithm computes single-source shortest paths in a directed graph with non-negative edge weights (see the sketch below).
  • Time complexity: O(|V|^2)
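A Dijkstra sketch using a binary heap, which runs in O(|E| log |V|) rather than the O(|V|^2) of the array-based version described above:

    import heapq

    def dijkstra(graph, source):
        """graph: {node: [(neighbor, weight), ...]}, weights non-negative."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                        # stale queue entry, skip it
            for v, w in graph[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w             # found a shorter path to v
                    heapq.heappush(heap, (dist[v], v))
        return dist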


Bellman-Ford Algorithm

  • The Bellman-Ford algorithm computes shortest paths from a single source node to all other nodes in a weighted graph.
  • Although it has a higher complexity than Dijkstra's algorithm, it also works on graphs containing negative-weight edges.
  • Time complexity:
    • Best: O(|E|)
    • Worst: O(|V||E|)


Floyd-Warshall Algorithm

  • The Floyd-Warshall algorithm finds shortest paths between all pairs of nodes in a weighted graph with no negative cycles.
  • Time complexity:
    • Best: O(|V|^3)
    • Worst: O(|V|^3)
    • Average: O(|V|^3)

Prim's Algorithm

  • Prim's algorithm is a greedy algorithm that computes a minimum spanning tree of a weighted undirected graph. In other words, it extracts the minimum-cost subset of edges that connects all the nodes in the graph.
  • Time complexity: O(|V|^2)


Kruskal's Algorithm

  • Kruskal's algorithm also computes the minimum spanning tree of a graph; the difference from Prim's is that it does not require the graph to be connected.
  • Time complexity: O(|E|log|V|)


Bit Manipulation

  • Bit manipulation is the technique of operating on data at the level of individual bits; used appropriately, bit operations can yield faster computation and lower memory usage (see the demonstration below).
  • Test the kth bit: s & (1 << k)
  • Set the kth bit: s |= (1 << k)
  • Clear the kth bit: s &= ~(1 << k)
  • Toggle the kth bit: s ^= (1 << k)
  • Multiply by 2^n: s << n
  • Divide by 2^n: s >> n
  • Intersection: s & t
  • Union: s | t
  • Set difference: s & ~t
  • Swap in a single expression (e.g. in Java): x = x ^ y ^ (y = x)
  • Extract lowest set bit: s & (-s)
  • Extract lowest unset bit: ~s & (s + 1)
  • Swap values: x ^= y; y ^= x; x ^= y;
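A quick Python demonstration of the identities above (Python integers are arbitrary-width, so ~ behaves as expected on these non-negative values):

    s = 0b10100
    assert s & (1 << 2)               # test: bit 2 is set
    s |= (1 << 1)                     # set bit 1    -> 0b10110
    s ^= (1 << 1)                     # toggle bit 1 -> 0b10100
    s &= ~(1 << 4)                    # clear bit 4  -> 0b00100
    assert s & (-s) == 0b100          # lowest set bit
    assert ~s & (s + 1) == 0b001      # lowest unset bit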

Algorithm Complexity Analysis

Big-O Notation

  • Big-O notation describes an asymptotic upper bound on an algorithm's running time, and is typically used to characterize the worst case.


Little-o Notation

  • Little-o notation also describes an asymptotic upper bound on an algorithm, but the bound is strict, i.e. not asymptotically tight.

Big-Ω Notation

  • Big-Ω notation describes an asymptotic lower bound on an algorithm.


Little-ω Notation

  • Little-ω notation describes a lower bound on an algorithm that is not asymptotically tight.

Theta (Θ) Notation

  • Theta notation describes a tight bound on an algorithm: the running time is bounded both above and below by the given function.


Video Tutorials

Interview Books

  • Competitive Programming 3 – Steven Halim & Felix Halim
  • Cracking The Coding Interview – Gayle Laakmann McDowell
  • Cracking The PM Interview – Gayle Laakmann McDowell & Jackie Bavaro

Computer Science News

File Structure

.
├── Array
│   ├── bestTimeToBuyAndSellStock.java
│   ├── findTheCelebrity.java
│   ├── gameOfLife.java
│   ├── increasingTripletSubsequence.java
│   ├── insertInterval.java
│   ├── longestConsecutiveSequence.java
│   ├── maximumProductSubarray.java
│   ├── maximumSubarray.java
│   ├── mergeIntervals.java
│   ├── missingRanges.java
│   ├── productOfArrayExceptSelf.java
│   ├── rotateImage.java
│   ├── searchInRotatedSortedArray.java
│   ├── spiralMatrixII.java
│   ├── subsetsII.java
│   ├── subsets.java
│   ├── summaryRanges.java
│   ├── wiggleSort.java
│   └── wordSearch.java
├── Backtracking
│   ├── androidUnlockPatterns.java
│   ├── generalizedAbbreviation.java
│   └── letterCombinationsOfAPhoneNumber.java
├── BinarySearch
│   ├── closestBinarySearchTreeValue.java
│   ├── firstBadVersion.java
│   ├── guessNumberHigherOrLower.java
│   ├── pow(x,n).java
│   └── sqrt(x).java
├── BitManipulation
│   ├── binaryWatch.java
│   ├── countingBits.java
│   ├── hammingDistance.java
│   ├── maximumProductOfWordLengths.java
│   ├── numberOf1Bits.java
│   ├── sumOfTwoIntegers.java
│   └── utf-8Validation.java
├── BreadthFirstSearch
│   ├── binaryTreeLevelOrderTraversal.java
│   ├── cloneGraph.java
│   ├── pacificAtlanticWaterFlow.java
│   ├── removeInvalidParentheses.java
│   ├── shortestDistanceFromAllBuildings.java
│   ├── symmetricTree.java
│   └── wallsAndGates.java
├── DepthFirstSearch
│   ├── balancedBinaryTree.java
│   ├── battleshipsInABoard.java
│   ├── convertSortedArrayToBinarySearchTree.java
│   ├── maximumDepthOfABinaryTree.java
│   ├── numberOfIslands.java
│   ├── populatingNextRightPointersInEachNode.java
│   └── sameTree.java
├── Design
│   └── zigzagIterator.java
├── DivideAndConquer
│   ├── expressionAddOperators.java
│   └── kthLargestElementInAnArray.java
├── DynamicProgramming
│   ├── bombEnemy.java
│   ├── climbingStairs.java
│   ├── combinationSumIV.java
│   ├── countingBits.java
│   ├── editDistance.java
│   ├── houseRobber.java
│   ├── paintFence.java
│   ├── paintHouseII.java
│   ├── regularExpressionMatching.java
│   ├── sentenceScreenFitting.java
│   ├── uniqueBinarySearchTrees.java
│   └── wordBreak.java
├── HashTable
│   ├── binaryTreeVerticalOrderTraversal.java
│   ├── findTheDifference.java
│   ├── groupAnagrams.java
│   ├── groupShiftedStrings.java
│   ├── islandPerimeter.java
│   ├── loggerRateLimiter.java
│   ├── maximumSizeSubarraySumEqualsK.java
│   ├── minimumWindowSubstring.java
│   ├── sparseMatrixMultiplication.java
│   ├── strobogrammaticNumber.java
│   ├── twoSum.java
│   └── uniqueWordAbbreviation.java
├── LinkedList
│   ├── addTwoNumbers.java
│   ├── deleteNodeInALinkedList.java
│   ├── mergeKSortedLists.java
│   ├── palindromeLinkedList.java
│   ├── plusOneLinkedList.java
│   ├── README.md
│   └── reverseLinkedList.java
├── Queue
│   └── movingAverageFromDataStream.java
├── README.md
├── Sort
│   ├── meetingRoomsII.java
│   └── meetingRooms.java
├── Stack
│   ├── binarySearchTreeIterator.java
│   ├── decodeString.java
│   ├── flattenNestedListIterator.java
│   └── trappingRainWater.java
├── String
│   ├── addBinary.java
│   ├── countAndSay.java
│   ├── decodeWays.java
│   ├── editDistance.java
│   ├── integerToEnglishWords.java
│   ├── longestPalindrome.java
│   ├── longestSubstringWithAtMostKDistinctCharacters.java
│   ├── minimumWindowSubstring.java
│   ├── multiplyString.java
│   ├── oneEditDistance.java
│   ├── palindromePermutation.java
│   ├── README.md
│   ├── reverseVowelsOfAString.java
│   ├── romanToInteger.java
│   ├── validPalindrome.java
│   └── validParentheses.java
├── Tree
│   ├── binaryTreeMaximumPathSum.java
│   ├── binaryTreePaths.java
│   ├── inorderSuccessorInBST.java
│   ├── invertBinaryTree.java
│   ├── lowestCommonAncestorOfABinaryTree.java
│   ├── sumOfLeftLeaves.java
│   └── validateBinarySearchTree.java
├── Trie
│   ├── addAndSearchWordDataStructureDesign.java
│   ├── implementTrie.java
│   └── wordSquares.java
└── TwoPointers
    ├── 3Sum.java
    ├── 3SumSmaller.java
    ├── mergeSortedArray.java
    ├── minimumSizeSubarraySum.java
    ├── moveZeros.java
    ├── removeDuplicatesFromSortedArray.java
    ├── reverseString.java
    └── sortColors.java

18 directories, 124 files

40 Interview Questions asked at Startups in Machine Learning / Data Science

Machine learning and data science are seen as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists. What better start could there be for your aspiring career!

However, getting into these roles is still not easy. You obviously need to be excited about the company's idea, team and vision. You might also face some really difficult technical questions along the way. The set of questions asked depends on what the startup does. Do they provide consulting? Do they build ML products? You should always find this out prior to beginning your interview preparation.

To help you prepare for your next interview, I've prepared a list of 40 plausible & tricky questions which are likely to come your way in interviews. If you can answer and understand these questions, rest assured, you will put up a tough fight in your job interview.

Note: The key to answering these questions is to have a concrete practical understanding of ML and related statistical concepts.

40 interview questions asked at startups in machine learning

 

Interview Questions on Machine Learning

Q1. You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

Answer: Processing high dimensional data on a limited memory machine is a strenuous task, and your interviewer would be fully aware of that. Following are the methods you can use to tackle such a situation:

  1. Since we have limited RAM, we should close all other applications on our machine, including the web browser, so that most of the memory can be put to use.
  2. We can randomly sample the data set. This means we can create a smaller data set, let's say, having 1000 variables and 300000 rows, and do the computations on it.
  3. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated ones. For numerical variables, we'll use correlation; for categorical variables, the chi-square test.
  4. Also, we can use PCA and pick the components which explain the maximum variance in the data set.
  5. Using online learning algorithms like Vowpal Wabbit (available in Python) is a possible option.
  6. Building a linear model using Stochastic Gradient Descent is also helpful.
  7. We can also apply our business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach; failing to identify useful predictors might result in a significant loss of information.

Note: For points 4 & 5, make sure you read about online learning algorithms & Stochastic Gradient Descent. These are advanced methods.
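As a rough illustration of points 4 and 6, here is a hedged scikit-learn sketch (X and y are placeholders for a possibly subsampled feature matrix and label vector; the hyperparameter values are arbitrary):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(
        PCA(n_components=100),           # keep the top-variance components
        SGDClassifier(loss="log_loss"),  # linear model fit by stochastic gradient descent
    )
    # model.fit(X, y)  # uncomment once X and y are loaded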

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Not to forget, that's the motive of doing PCA, where we aim to select fewer components (than features) which can explain the maximum variance in the data set. Rotation doesn't change the relative locations of the components; it only changes the actual coordinates of the points.

If we don't rotate the components, the effect of PCA will diminish and we'll have to select a larger number of components to explain the variance in the data set.

Know more: PCA

 

Q3. You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let's assume it's a normal distribution. We know that in a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (or mode, or median, which coincide here), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.

 

Q4. You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might only be predicting the majority class correctly, while our class of interest is the minority class (4%): the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps:

  1. We can use undersampling, oversampling or SMOTE to make the data balanced.
  2. We can alter the prediction threshold by doing probability calibration and finding an optimal threshold using the AUC-ROC curve.
  3. We can assign weights to the classes such that the minority classes get larger weights.
  4. We can also use anomaly detection.

Know more: Imbalanced Classification

 

Q5. Why is naive Bayes so 'naive'?

Answer: naive Bayes is so 'naive' because it assumes that all of the features in a data set are equally important and independent. As we know, these assumptions are rarely true in real-world scenarios.

 

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.

Answer: Prior probability is nothing but the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and of 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be classified as spam.

Likelihood is the probability of classifying a given observation as 1 in the presence of some other variable. For example: the probability that the word 'FREE' is used in a previous spam message is a likelihood. Marginal likelihood is the probability that the word 'FREE' is used in any message.

 

Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

Answer: Time series data is known to possess linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The reason the decision tree failed to provide robust predictions is that it couldn't map the linear relationship as well as the regression model did. Therefore, we learned that a linear regression model can provide robust predictions, given that the data set satisfies its linearity assumptions.

 

Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company's delivery team isn't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?

Answer: You might have started hopping through the list of ML algorithms in your mind. But, wait! Such questions are asked to test your machine learning fundamentals.

This is not a machine learning problem. This is a route optimization problem. A machine learning problem consists of three things:

  1. There exists a pattern.
  2. You cannot solve it mathematically (even by writing exponential equations).
  3. You have data on it.

Always look for these three factors to decide if machine learning is the tool to solve a particular problem.

 

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

Answer: Low bias occurs when the model's predicted values are near the actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While it sounds like a great achievement, not to forget, a flexible model has no generalization capability. It means that when this model is tested on unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

Also, to combat high variance, we can:

  1. Use a regularization technique, where larger model coefficients get penalized, lowering model complexity.
  2. Use the top n features from a variable importance chart. Maybe, with all the variables in the data set, the algorithm is having difficulty finding the meaningful signal.

 

Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

Answer: Chances are, you might be tempted to say no, but that would be incorrect. Correlated variables should be dealt with first, because their presence has a substantial effect on PCA: in the presence of correlated variables, the variance explained by a particular component gets inflated.

For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.

 

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decided to combine those models. Though ensemble models are known to return high accuracy, you were unlucky. Where did you go wrong?

Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But these learners provide superior results only when the combined models are uncorrelated. Since we used 5 GBM models and got no accuracy improvement, this suggests that the models are correlated. The problem with correlated models is that all the models provide the same information.

For example: if model 1 has classified User1122 as 1, there is a high chance that model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak uncorrelated models to obtain better predictions.

 

Q12. How is kNN different from kmeans clustering?

Answer: Don't get misled by the 'k' in their names. You should know that the fundamental difference between these two algorithms is that kmeans is unsupervised in nature and kNN is supervised. kmeans is a clustering algorithm; kNN is a classification (or regression) algorithm.

The kmeans algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points in each cluster are close to each other. The algorithm tries to maintain enough separation between the clusters. Due to its unsupervised nature, the clusters have no labels.

The kNN algorithm tries to classify an unlabeled observation based on its k (which can be any number) surrounding neighbors. It is also known as a lazy learner because it involves minimal training. Hence, it doesn't build a generalized model from the training data ahead of time.

 

Q13. How are True Positive Rate and Recall related? Write the equation.

Answer: True Positive Rate = Recall. Yes, they are equal, both having the formula TP / (TP + FN).

Know more: Evaluation Metrics

 

Q14. You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term; your model R² jumps from 0.3 to 0.8. Is it possible? How?

Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term shows the model prediction without any independent variable, i.e. the mean prediction. The formula is R² = 1 − ∑(y − y′)² / ∑(y − ȳ)², where y′ is the predicted value and ȳ is the mean of y.

When the intercept term is present, the R² value evaluates your model with respect to the mean model. In the absence of the intercept term, ȳ drops out of the formula and the denominator becomes ∑y², which is larger; the ratio ∑(y − y′)² / ∑y² therefore becomes smaller, resulting in a higher R².

 

Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check if he's right? Without losing any information, can you still build a better model?

Answer: To check multicollinearity, we can create a correlation matrix to identify and remove variables having a correlation above 75% (deciding on a threshold is subjective). In addition, we can calculate VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of multicollinearity.

But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. We can also add some random noise to the correlated variables so that they become different from each other. However, adding noise might affect the prediction accuracy, so this approach should be used carefully.

Know more: Regression

 

Q16. When is Ridge regression favorable over Lasso regression?

Answer: You can quote ISLR's authors Hastie and Tibshirani, who asserted that in the presence of a few variables with medium / large sized effects, use lasso regression; in the presence of many variables with small / medium sized effects, use ridge regression.

Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least square estimates have higher variance. Therefore, it depends on our model objective.

Know more: Ridge and Lasso Regression

 

Q17. A rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that the decrease in the number of pirates caused the climate change?

Answer: After reading this question, you should have understood that this is a classic case of 'causation vs. correlation'. No, we can't conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon.

Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can't say that pirates died out because of the rise in global average temperature.

Know more: Causation and Correlation

 

Q18. While working on a data set, how do you select important variables? Explain your methods.

Answer: Following are the methods of variable selection you can use:

  1. Remove the correlated variables prior to selecting important variables
  2. Use linear regression and select variables based on p values
  3. Use Forward Selection, Backward Selection, Stepwise Selection
  4. Use Random Forest, Xgboost and plot variable importance chart
  5. Use Lasso Regression
  6. Measure information gain for the available set of features and select top n features accordingly.

 

Q19. What is the difference between covariance and correlation?

Answer: Correlation is the standardized form of covariance.

Covariances are difficult to compare. For example: if we calculate the covariances of salary ($) and age (years), we'll get different covariances which can't be compared because of their unequal scales. To combat such a situation, we calculate correlation, which gives a value between -1 and 1 irrespective of the variables' scales.

 

Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

Answer: Yes, we can use the ANCOVA (analysis of covariance) technique to capture the association between continuous and categorical variables.

 

Q21. Both being tree based algorithms, how is random forest different from the Gradient Boosting Machine (GBM)?

Answer: The fundamental difference is that random forest uses the bagging technique to make predictions, while GBM uses boosting.

In bagging, a data set is divided into n samples using randomized sampling. Then, using a single learning algorithm, a model is built on each sample. The resulting predictions are then combined using voting or averaging. Bagging is done in parallel. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.

Random forest improves model accuracy mainly by reducing variance. The trees grown are uncorrelated to maximize the decrease in variance. On the other hand, GBM improves accuracy by reducing both bias and variance in a model.

Know more: Tree based modeling

 

Q22. Running a binary classification tree algorithm is the easy part. Do you know how a tree decides which variable to split on at the root node and succeeding nodes?

Answer: A classification tree makes decisions based on the Gini index and node entropy. In simple words, the tree algorithm finds the best possible feature which can divide the data set into the purest possible child nodes.

The Gini index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. We can calculate Gini as follows:

  1. Calculate Gini for the sub-nodes using the formula: sum of the squares of the probabilities of success and failure (p^2 + q^2).
  2. Calculate Gini for the split using the weighted Gini score of each node of that split.

Entropy is the measure of impurity, given by (for a binary class):

Entropy = -p·log₂(p) - q·log₂(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is zero when a node is homogeneous, and maximal when both classes are present in a node at 50% – 50%. Lower entropy is desirable. A sketch of both measures follows below.
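A small Python sketch of both node-impurity measures as defined above (p is the probability of success in the node, q = 1 − p):

    import math

    def gini_score(p):
        """p^2 + q^2 as above; the Gini impurity is 1 minus this value."""
        q = 1 - p
        return p ** 2 + q ** 2

    def entropy(p):
        """Binary entropy; taken as 0 at the pure nodes p = 0 or p = 1."""
        if p in (0, 1):
            return 0.0
        q = 1 - p
        return -p * math.log2(p) - q * math.log2(q)

    print(gini_score(0.5), entropy(0.5))  # most impure node: 0.5, 1.0
    print(gini_score(1.0), entropy(1.0))  # pure node:        1.0, 0.0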

 

Q23. You've built a random forest model with 10000 trees. You got delighted after getting a training error of 0.00. But the validation error is 34.23. What is going on? Haven't you trained your model perfectly?

Answer: The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns to such an extent that they are not available in the unseen data. Hence, when this classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with higher error. In random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross validation.

 

Q24. You've got a data set to work with having p (number of variables) > n (number of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?

Answer: In such high dimensional data sets, we can't use classical regression techniques, since their assumptions tend to fail. When p > n, we can no longer calculate a unique least squares coefficient estimate: the variances become infinite, so OLS cannot be used at all.

To combat this situation, we can use penalized regression methods like lasso, LARS or ridge, which can shrink the coefficients to reduce variance. Precisely, ridge regression works best in situations where the least squares estimates have higher variance.

Other methods include subset regression and forward stepwise regression.

 

Q25. What is a convex hull? (Hint: Think SVM)

Answer: In the case of linearly separable data, the convex hulls represent the outer boundaries of the two groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH) as a perpendicular bisector between the two convex hulls. The MMH is the line which attempts to create the greatest separation between the two groups.

 

Q26. We know that one hot encoding increases the dimensionality of a data set, but label encoding doesn't. How?

Answer: Don't get baffled by this question. It's a simple question asking for the difference between the two.

With one hot encoding, the dimensionality (i.e. the number of features) of a data set increases because it creates a new variable for each level present in the categorical variables. For example: let's say we have a variable 'color'. The variable has 3 levels, namely Red, Blue and Green. One hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue and Color.Green, containing 0 and 1 values (see the sketch below).

In label encoding, the levels of a categorical variable get encoded as 0 and 1, so no new variable is created. Label encoding is mostly used for binary variables.
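A small pandas sketch contrasting the two encodings (illustrative; the column names are our own):

    import pandas as pd

    df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

    one_hot = pd.get_dummies(df["color"], prefix="Color")  # three new 0/1 columns
    label = df["color"].astype("category").cat.codes       # still one integer column

    print(list(one_hot.columns))  # ['Color_Blue', 'Color_Green', 'Color_Red']
    print(label.tolist())         # [2, 0, 1, 2]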

 

Q27. What cross validation technique would you use on a time series data set? Is it k-fold or LOOCV?

Answer: Neither.

In time series problems, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not present in year 3. Resampling the data set would separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use the forward chaining strategy with 5 folds, as shown below:

  • fold 1 : training [1], test [2]
  • fold 2 : training [1 2], test [3]
  • fold 3 : training [1 2 3], test [4]
  • fold 4 : training [1 2 3 4], test [5]
  • fold 5 : training [1 2 3 4 5], test [6]

where 1, 2, 3, 4, 5, 6 represent years (see the scikit-learn sketch below).
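For reference, scikit-learn's TimeSeriesSplit implements exactly this forward-chaining scheme (shown here on six dummy yearly observations):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(6).reshape(-1, 1)   # stand-ins for years 1..6
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        print("training", (train_idx + 1).tolist(), "test", (test_idx + 1).tolist())
    # training [1] test [2] ... training [1, 2, 3, 4, 5] test [6]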

 

Q28. You are given a data set consisting of variables with more than 30% missing values. Let's say, out of 50 variables, 8 variables have more than 30% of their values missing. How will you deal with them?

Answer: We can deal with them in the following ways:

  1. Assign a unique category to the missing values; who knows, the missing values might decipher some trend.
  2. We can remove them outright.
  3. Or we can sensibly check their distribution against the target variable, and if we find any pattern, we'll keep those missing values and assign them a new category, while removing the others.

 

Q29. 'People who bought this, also bought…' recommendations seen on Amazon are the result of which algorithm?

Answer: The basic idea for this kind of recommendation engine comes from collaborative filtering.

Collaborative filtering algorithms consider "user behavior" when recommending items. They exploit the behavior of other users and items in terms of transaction history, ratings, and selection and purchase information. Other users' behavior and preferences over the items are used to recommend items to new users. In this case, the features of the items are not known.

Know more: Recommender System

 

Q30. What do you understand by Type I vs Type II error?

Answer: A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a 'False Positive'. A Type II error is committed when the null hypothesis is false and we accept it; it is also known as a 'False Negative'.

In the context of the confusion matrix, we can say a Type I error occurs when we classify a value as positive (1) when it is actually negative (0), and a Type II error occurs when we classify a value as negative (0) when it is actually positive (1).

 

Q31. You are working on a classification problem. For validation purposes, you've randomly sampled the training data set into train and validation sets. You are confident that your model will work incredibly well on unseen data, since your validation accuracy is high. However, you get shocked after getting poor test accuracy. What went wrong?

Answer: In classification problems, we should always use stratified sampling instead of random sampling. Random sampling doesn't take into consideration the proportion of target classes. On the contrary, stratified sampling helps to maintain the distribution of the target variable in the resulting samples as well.

 

Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

Answer: Tolerance (1 / VIF) is used as an indicator of multicollinearity. It indicates the percentage of variance in a predictor which cannot be accounted for by the other predictors. Large values of tolerance are desirable.

We will consider adjusted R², as opposed to R², to evaluate model fit, because R² increases as we add more variables irrespective of any improvement in prediction accuracy. But adjusted R² only increases if an additional variable improves the accuracy of the model; otherwise it stays the same. It is difficult to commit to a general threshold value for adjusted R² because it varies between data sets. For example: a gene mutation data set might result in a low adjusted R² and still provide fairly good predictions, whereas in stock market data a low adjusted R² implies that the model is not good.

 

Q33. In k-means or kNN, we use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance?

Answer: We don't use manhattan distance because it measures distance horizontally or vertically only; it has dimension restrictions. On the other hand, the euclidean metric can be used in any space to calculate distance. Since the data points can be present in any dimension, euclidean distance is the more viable option.

Example: think of a chess board; the moves made by a rook are measured by manhattan distance because of its strictly vertical & horizontal movements.

 

Q34. Explain machine learning to me like a 5 year old.

Answer: It's simple. It's just like how babies learn to walk. Every time they fall down, they learn (unconsciously) and realize that their legs should be straight and not bent. The next time they fall down, they feel pain. They cry. But they learn 'not to stand like that again'. In order to avoid that pain, they try harder. To succeed, they even seek support from the door or the wall or anything near them, which helps them stand firm.

This is how a machine works & develops intuition from its environment.

Note: The interviewer is only trying to test whether you have the ability to explain complex concepts in simple terms.

 

Q35. I know that a linear regression model is generally evaluated using adjusted R² or the F value. How would you evaluate a logistic regression model?

Answer: We can use the following methods:

  1. Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with the confusion matrix to determine its performance.
  2. Also, the metric analogous to adjusted R² in logistic regression is AIC. AIC is a measure of fit which penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
  3. Null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates the response predicted by a model after adding independent variables; again, the lower the value, the better the model.

Know more: Logistic Regression

 

Q36. Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?

Answer: You should say that the choice of machine learning algorithm depends solely on the type of data. If you are given a data set which exhibits linearity, then linear regression would be the best algorithm to use. If you are given images or audio, then a neural network would help you to build a robust model.

If the data comprises non-linear interactions, then a boosting or bagging algorithm should be the choice. If the business requirement is a model which can be deployed, then we'll use regression or a decision tree model (easy to interpret and explain) instead of black box algorithms like SVM, GBM etc.

In short, there is no single master algorithm for all situations. We must be scrupulous enough to understand which algorithm to use.

 

Q37. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?

Answer: For better predictions, a categorical variable can be treated as a continuous variable only when the variable is ordinal in nature.

 

Q38. When does regularization become necessary in machine learning?

Answer: Regularization becomes necessary when the model begins to overfit. This technique introduces a cost term for bringing in more features with the objective function. Hence, it tries to push the coefficients of many variables to zero and thereby reduce the cost term. This helps to reduce model complexity so that the model can become better at predicting (generalizing).

 

Q39. What do you understand by the bias-variance trade off?

Answer: The error emerging from any model can be broken down into three components mathematically, as follows:

Error = Bias² + Variance + Irreducible Error

Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends. Variance, on the other side, quantifies how much the predictions made on the same observation differ from each other. A high variance model will overfit your training population and perform badly on any observation beyond it.

 

Q40. OLS is to linear regression. Maximum likelihood is to logistic regression. Explain the statement.

Answer: OLS and maximum likelihood are the methods used by the respective regression methods to approximate the unknown parameter (coefficient) values. In simple words:

Ordinary least squares (OLS) is a method used in linear regression which approximates the parameters so as to minimize the distance between the actual and predicted values. Maximum likelihood chooses the values of the parameters which maximize the likelihood of producing the observed data.

 

End Notes

You might have been able to answer all the questions, but the real value is in understanding them and generalizing your knowledge to similar questions. If you have struggled with these questions, no worries: now is the time to learn, not to perform. You should focus right now on learning these topics scrupulously.

These questions are meant to give you wide exposure to the types of questions asked at startups in machine learning. I'm sure these questions will leave you curious enough to do deeper research on these topics on your own. If you are planning for it, that's a good sign.

Did you like reading this article? Have you appeared in any startup interview recently for a data scientist role? Do share your experience in the comments below. I'd love to know about your experience.

from:https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/

Android+TensorFlow+CNN+MNIST: Implementing Handwritten Digit Recognition

Catalogue

  1. Overview
  2. Practice
    2.1. Environment
    2.2. Train & Evaluate (Python + TensorFlow)
    2.3. Test (Android + TensorFlow)
  3. Theory
    3.1. MNIST
    3.2. CNN (Convolutional Neural Network)
      3.2.1. CNN Keys
      3.2.2. CNN Architecture
    3.3. Regression + Softmax
      3.3.1. Linear Regression
      3.3.2. Softmax Regression
  4. References & Recommends

Overview

This article is the first in the "SkySeraph AI: From Practice to Theory" series. Taking MNIST, the classic "Hello World" data set of AI, as the basis, it implements CNN-based handwritten digit recognition on the Android platform using TensorFlow.
Code~


Practice

Environment

  • TensorFlow: 1.2.0
  • Python: 3.6
  • Python IDE: PyCharm 2017.2
  • Android IDE: Android Studio 3.0

Train & Evaluate (Python + TensorFlow)

The main purpose of the train & evaluate step is to generate the pb file used for testing; it stores the topology and the parameters of the trained network built with the TensorFlow Python API. There are many ways to implement this part; besides a CNN, one could also use an RNN, a fully connected network, etc.
There are also two families of CNN functions, tf.layers.conv2d and tf.nn.conv2d. tf.layers.conv2d uses tf.nn.conv2d as its back end; among the parameters, filters is an integer while filter is a 4-D tensor. The prototypes are as follows:

From convolutional.py:
def conv2d(inputs, filters, kernel_size, strides=(1, 1), padding='valid', data_format='channels_last',
dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer=None,
bias_initializer=init_ops.zeros_initializer(), kernel_regularizer=None, bias_regularizer=None,
activity_regularizer=None, kernel_constraint=None, bias_constraint=None, trainable=True, name=None,
reuse=None)

From gen_nn_ops.py:

def conv2d(input, filter, strides, padding, use_cudnn_on_gpu=True, data_format="NHWC", name=None)

The official demo uses the layers module, with the following structure:

  • Convolutional Layer #1: 32 5×5 filters, with ReLU activation
  • Pooling Layer #1: max pooling with a 2×2 filter and a stride of 2
  • Convolutional Layer #2: 64 5×5 filters, with ReLU activation
  • Pooling Layer #2: max pooling with a 2×2 filter and a stride of 2
  • Dense Layer #1: 1024 neurons, with ReLU activation and a dropout rate of 0.4 (to avoid overfitting, 40% of the neurons are randomly dropped during training)
  • Dense Layer #2 (Logits Layer): 10 neurons, one for each class (digits 0–9)

The core code lives in the cnn_model_fn(features, labels, mode) function, which defines the complete convolutional structure; a sketch follows below.
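The original post shows this code as screenshots; below is a hedged reconstruction of the model-building part of cnn_model_fn, following the layer list above (TensorFlow 1.x layers API, as in the official tutorial; the loss and EstimatorSpec wiring are omitted, and details may differ from the original screenshots):

    import tensorflow as tf

    def cnn_model_fn(features, labels, mode):
        input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
        conv1 = tf.layers.conv2d(input_layer, filters=32, kernel_size=[5, 5],
                                 padding="same", activation=tf.nn.relu)
        pool1 = tf.layers.max_pooling2d(conv1, pool_size=[2, 2], strides=2)
        conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=[5, 5],
                                 padding="same", activation=tf.nn.relu)
        pool2 = tf.layers.max_pooling2d(conv2, pool_size=[2, 2], strides=2)
        flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
        dense = tf.layers.dense(flat, units=1024, activation=tf.nn.relu)
        dropout = tf.layers.dropout(dense, rate=0.4,
                                    training=(mode == tf.estimator.ModeKeys.TRAIN))
        logits = tf.layers.dense(dropout, units=10)  # one logit per digit 0-9
        return logits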

Alternatively, the traditional tf.nn.conv2d function can be used; the core code is similar.

Test (Android + TensorFlow)

  • The core is the API class TensorFlowInferenceInterface.java.
  • Configure gradle, or build the TensorFlow source yourself and import the jar and so files:
    compile 'org.tensorflow:tensorflow-android:1.2.0'
  • Import the pb file: place the .pb file in the assets directory, then read it:

    String actualFilename = labelFilename.split("file:///android_asset/")[1];
    Log.i(TAG, "Reading labels from: " + actualFilename);
    BufferedReader br = null;
    br = new BufferedReader(new InputStreamReader(assetManager.open(actualFilename)));
    String line;
    while ((line = br.readLine()) != null) {
        c.labels.add(line);
    }
    br.close();

  • Use the TensorFlow interface.
  • Final result:

Theory

MNIST

MNIST is one of the most classic machine learning data sets. It contains the digits 0–9 as 28×28 single-channel grayscale images of handwritten digits, with 60,000 training examples and 10,000 test examples.
The file layout is as follows: there are 4 binary files, holding the training and test images and their labels.

The binary structure of the training image file is shown below: before the real data (pixels) there are several descriptive fields (a magic number, the number of images, and the number of rows and columns per image). The real data is stored using the big-endian convention.
(Big-endian means the high-order byte of a value is stored at the lower memory address and the low-order byte at the higher memory address.)

To use the data in experiments, the real data must be extracted. This can be done with the unpack_from method of struct, Python's library for handling bytes; the core call is (a fuller sketch follows below):
struct.unpack_from(self._fourBytes2, buf, index)

As the "Hello World" data set of AI, MNIST comes with wrapper functions in TensorFlow and can be used directly:
mnist = input_data.read_data_sets('MNIST', one_hot=True)
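A hedged sketch of parsing the image-file header with unpack_from; ">" selects big-endian and "I" a 4-byte unsigned integer (the file name is the standard MNIST one and is assumed to be present locally):

    import struct

    with open("train-images-idx3-ubyte", "rb") as f:
        buf = f.read()

    magic, n_images, n_rows, n_cols = struct.unpack_from(">IIII", buf, 0)
    print(magic, n_images, n_rows, n_cols)   # expect 2051 60000 28 28
    first_image = struct.unpack_from(">" + "B" * (n_rows * n_cols), buf, 16)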

CNN(Convolutional Neural Network)

CNN Keys

  • CNN stands for Convolutional Neural Network, also known as a ConvNet.
  • Convolution is arguably the most important concept in modern deep learning. It is a mathematical operation; the mathematics behind it, including derivations of the convolution definition from the Fourier transform and from the Dirac delta function, can be found in link [23]. Loosely and informally, it can be pictured as flipping one factor, multiplying, and "rolling up" the results.
  • Convolution animation: demonstrated in the figure below [26]; more animated demos can be found in [27].
  • Neural network: a system composed of a large number of neurons, as shown in the figure below [21],
    where x denotes the input vector, w the weights, b the bias, and f the activation function.
  • Activation function: commonly used non-linear activation functions include Sigmoid, tanh and ReLU, with the formulas shown below.
    • Sigmoid drawbacks
      • Saturation kills gradients (neurons saturate near output values of 0 or 1, where the gradient is almost 0).
      • The sigmoid function is not zero-centered.
    • tanh: also has the saturation problem, but its output is zero-centered, so in practice tanh is preferred over sigmoid.
    • ReLU
      • Advantage 1: ReLU greatly accelerates the convergence of SGD.
      • Advantage 2: the activation is obtained with a simple threshold, without computing a pile of expensive (exponential) operations.
      • Drawback: the learning rate must be set carefully to prevent units from "dying" during training; Leaky ReLU / PReLU / Maxout can be used as alternatives.
  • Pooling: generally divided into mean pooling and max pooling; the figure below [21] shows max pooling. There are also overlapping pooling [24] and spatial pyramid pooling [25].
    • Mean pooling: the average value of an image region is taken as the pooled value of that region.
    • Max pooling: the maximum value of an image region is taken as the pooled value of that region.

CNN Architecture

  • Three-layer neural network: consists of an input layer, an output layer and a hidden layer, as shown in the figure below [21].
  • CNN layer hierarchy: Stanford's cs231n describes the pattern [INPUT-CONV-RELU-POOL-FC], as shown below [21]: input layer, convolutional layer, activation layer, pooling layer and fully connected layer.
  • The generic CNN architecture consists of the following three kinds of layers:
    • Convolutional layers
    • Pooling layers
    • Dense (fully connected) layers
  • Animated demo: see [22].

Regression + Softmax

The two major algorithm families in supervised learning are classification and regression: classification algorithms are used to predict discrete outputs, while regression algorithms are used to predict continuous outputs.
The goal of regression is to build a regression equation for predicting a target value; solving a regression means solving for the regression coefficients of that equation.
Regression algorithms include Linear Regression, Logistic Regression and others. Softmax Regression is a generalization of logistic regression for multi-class classification problems; the classic example is its application to MNIST handwritten digit classification.

Linear Regression

Linear Regression is the most basic model in machine learning; its goal is to fit the target labels as closely as possible with its predictions.

  • Definition of the multivariate linear regression model
  • Solving multivariate linear regression
  • Mean Square Error (MSE)
    • Gradient Descent
    • Normal Equation (ordinary least squares)
    • Locally Weighted Linear Regression (LWLR): to address underfitting in linear regression, some bias is introduced into the estimate in order to lower the mean squared error of the predictions.
    • Ridge regression and other shrinkage methods
  • Choice: compared with Gradient Descent, the Normal Equation is computationally expensive (it requires computing the transpose and inverse of X) and is only suitable when the number of features is below 100,000; when the number of features exceeds 100,000, use gradient descent. When X is not invertible, ridge regression can be used instead. LWLR increases the computational cost, because it must use the entire data set for every prediction instead of computing the regression coefficients once and plugging new points into the resulting equation, so it is generally not chosen.
  • Tuning: balance prediction bias against model variance (high bias means underfitting; high variance means overfitting):
    • Get more training examples – fixes high variance
    • Try a smaller set of features – fixes high variance
    • Try obtaining additional features – fixes high bias
    • Try adding polynomial combination features – fixes high bias
    • Try decreasing λ – fixes high bias
    • Try increasing λ – fixes high variance

Softmax Regression

  • The Softmax Regression hypothesis function
  • The Softmax Regression cost function
  • Intuition:
  • Softmax Regression & Logistic Regression:
    • Multi-class vs. binary classification: Logistic Regression is Softmax Regression with K = 2.
    • For a K-class problem: when the classes are mutually exclusive, Softmax Regression can be used; when they are not, K independent Logistic Regressions can be used.
  • Summary: Softmax Regression is suited to classification with more than 2 classes; in this example it is used to estimate the probability that each image represents each digit (a sketch follows below).
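A minimal NumPy sketch of the softmax function used to turn the 10 logits into per-digit probabilities (the max is subtracted first for numerical stability; softmax is invariant to such shifts):

    import numpy as np

    def softmax(logits):
        z = logits - np.max(logits)   # shift for numerical stability
        exp = np.exp(z)
        return exp / exp.sum()

    probs = softmax(np.array([2.0, 1.0, 0.1]))
    print(probs, probs.sum())         # probabilities that sum to 1.0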

References & Recommends

MNIST

Softmax

CNN

TensorFlow+CNN / TensorFlow+Android



By SkySeraph-2018

SkySeraph cnBlogs
SkySeraph CSDN

This article was first published on skyseraph.com as "Android+TensorFlow+CNN+MNIST: Implementing Handwritten Digit Recognition".