All posts by dotte

What Is a Metaclass? (Python)

Metaclasses do not mean "deep, dark black magic". When you execute any class statement,
Python performs the following steps:

1. Remember the class name as a string, say n, and the class bases as a tuple, say b.
2. Execute the body of the class, recording all names that the body binds as keys in
   a new dictionary d, each with its associated value (e.g., each statement such as
   def f(self) just sets d['f'] to the function object the def statement builds).
3. Determine the appropriate metaclass, say M, by inheritance or by looking for
   name __metaclass__ in d and in the globals:

   if '__metaclass__' in d: M = d['__metaclass__']
   elif b: M = type(b[0])
   elif '__metaclass__' in globals(): M = globals()['__metaclass__']
   else: M = types.ClassType

   types.ClassType is the metaclass of old-style classes, so this code implies that a
   class without bases is old style if the name '__metaclass__' is not set in the class
   body nor among the global variables of the current module.
4. Call M(n, b, d) and record the result as a variable with name n in whatever
   scope the class statement executed.
So, some metaclass M is always involved in the execution of any class statement. The
metaclass is normally type for new-style classes, types.ClassType for old-style classes.
You can set it up to use your own custom metaclass (normally a subclass of type), and
that is where you may reasonably feel that things are getting a bit too advanced. However,
understanding that a class statement, such as:

class Someclass(Somebase):
    __metaclass__ = type
    x = 23

is exactly equivalent to the assignment statement:

Someclass = type('Someclass', (Somebase,), {'x': 23})

does help a lot in understanding the exact semantics of the class statement.

from: Python Cookbook

Consistent Hashing (Java Implementation)

The Consistent Hashing Algorithm

I have mentioned consistent hashing several times in my earlier posts. The "Consistent Hashing" section of my in-depth MemCache article explains in detail why consistent hashing is needed and how the algorithm works.

Here is the algorithm's principle once more:

First, construct an integer ring of length 2^32 (called the consistent hash ring). Place each server node on the ring according to the hash of its node name (the hashes are distributed over [0, 2^32-1]). Then, for a piece of data, compute the hash of its key (also distributed over [0, 2^32-1]) and search clockwise around the ring for the server node whose hash is nearest to the key's hash; that completes the key-to-server lookup.

This algorithm fixes the poor scalability of plain modulo-based hashing: when servers are brought online or taken offline, it ensures that as many requests as possible still hit the server they were originally routed to.

Of course, nothing is perfect. Consistent hashing scales better than modulo hashing, but its implementation is also more complex. This article explores how to implement consistent hashing in Java. Before we start, let's look into a few core questions of the algorithm.

 

Choosing a Data Structure

The first question for consistent hashing is: how do we construct an integer ring of length 2^32 and place the server nodes on it according to the hashes of their names?

Which data structure should represent the ring so that the runtime time complexity is as low as possible? First, a note on time complexity: a common rule of thumb for how complexity classes relate to running time is:

O(1) < O(log2N) < O(N) < O(N * log2N) < O(N^2) < O(N^3) < O(2^N) < O(3^N) < O(N!)

Generally speaking, the first four are quite efficient, the middle two are barely acceptable, and the last three are poor (once N gets large, such an algorithm simply cannot run). OK, back to the question of which data structure to pick; I see several viable solutions.

1. Solution 1: Sort + List

My first idea: compute the hashes of all node names to be added, put them into an array, sort it in ascending order with some sorting algorithm, and finally store the sorted values in a List. A List rather than an array is used so that nodes can be added later.

To route a key, we then only need to find the first server node in the List whose hash is larger than the key's hash. For example, if the server hashes are [0, 2, 4, 6, 8, 10] and the key to be routed hashes to 7, we look for the first value larger than 7, which is 8; that is the server node the key is finally routed to.

Ignoring the up-front sorting for a moment, the time complexity of this lookup is:

(1) Best case: found on the first comparison, O(1)

(2) Worst case: found on the last comparison, O(N)

On average that is O(0.5N + 0.5); dropping the coefficient and the constant, the complexity is O(N).

But if the earlier sorting step is taken into account as well, consider the time complexities of the common sorting algorithms (I found a chart online that lists them):

As the chart shows, sorting algorithms are either stable but with high time complexity, or low in complexity but unstable. Even merge sort, which looks like the best choice, still costs O(N * logN), which is somewhat expensive.
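As a rough sketch of how Solution 1 might look (the class and method names here are mine, for illustration only, and not the article's final code): sort the node hashes once, then scan for the first hash greater than the key's hash, wrapping around to the first node if none is larger.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedListRouting
{
    public static int route(List<Integer> nodeHashes, int keyHash)
    {
        List<Integer> sorted = new ArrayList<Integer>(nodeHashes);
        Collections.sort(sorted);                 // O(N * logN) up-front cost
        for (int nodeHash : sorted)               // O(N) lookup per key
        {
            if (nodeHash > keyHash)
                return nodeHash;
        }
        return sorted.get(0);                     // nothing larger: wrap around the ring
    }

    public static void main(String[] args)
    {
        List<Integer> nodeHashes = Arrays.asList(0, 2, 4, 6, 8, 10);
        System.out.println(route(nodeHashes, 7)); // prints 8, as in the example above
    }
}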

2. Solution 2: Traversal + List

Since sorting is relatively expensive, can we do without it? Yes, which leads to the second solution.

This solution still uses a List, but relies on traversal instead:

(1) The server nodes are not sorted; their hash values are simply put into a List

(2) For a key to be routed, compute its hash. Since the lookup is defined as "clockwise", traverse the List: for every server hash larger than the key's hash, compute and record the difference; ignore server hashes smaller than the key's hash

(3) Once all the differences have been computed, the node with the smallest difference is the one the key is finally routed to

Let's look at the time complexity of this algorithm:

1. Best case: only one server hash is larger than the key's hash, giving O(N)+O(1)=O(N+1); dropping the constant, O(N)

2. Worst case: every server hash is larger than the key's hash, giving O(N)+O(N)=O(2N); dropping the coefficient, O(N)

So the overall time complexity is O(N). The algorithm can actually be improved a little: keep a position variable X, and whenever a new difference is smaller than the current one, replace X with the new position, otherwise leave X unchanged. This removes one of the passes, but the improved algorithm is still O(N).

All in all, compared with Solution 1, this solution looks somewhat better overall.
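Solution 2 can be sketched as a single pass over the unsorted hashes (again, the names below are illustrative only, not the article's final code):

import java.util.List;

public class TraversalRouting
{
    public static int route(List<Integer> nodeHashes, int keyHash)
    {
        int best = -1;
        long bestDiff = Long.MAX_VALUE;
        for (int nodeHash : nodeHashes)            // single O(N) pass, no sorting
        {
            if (nodeHash <= keyHash)               // not clockwise from the key: ignore
                continue;
            long diff = (long) nodeHash - keyHash; // clockwise distance
            if (diff < bestDiff)
            {
                bestDiff = diff;
                best = nodeHash;
            }
        }
        return best;                               // -1 if no node lies clockwise from the key
    }
}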

3. Solution 3: Binary Search Tree

Setting the List aside, another option is a binary search tree. If you are not familiar with trees, you can skim this short article on tree structures.

Of course we cannot simply use a plain binary search tree, because it may become unbalanced. Self-balancing binary search trees include the AVL tree, the red-black tree and others; a red-black tree is used here, for two reasons:

1. A red-black tree's main purpose is storing ordered data, which matches the idea behind Solution 1, but with far higher efficiency

2. The JDK already provides red-black tree implementations: TreeMap and TreeSet

Moreover, taking TreeMap as an example, TreeMap provides a tailMap(K fromKey) method that returns the set of entries whose keys are greater than or equal to fromKey, without traversing the whole structure.
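As a quick illustration of what tailMap gives us (the hash values and node names below are made up for the example):

import java.util.SortedMap;
import java.util.TreeMap;

public class TailMapDemo
{
    public static void main(String[] args)
    {
        TreeMap<Integer, String> ring = new TreeMap<Integer, String>();
        ring.put(2, "node-A");
        ring.put(6, "node-B");
        ring.put(10, "node-C");

        // All entries with key >= 7, as an ordered view backed by the original map
        SortedMap<Integer, String> tail = ring.tailMap(7);
        System.out.println(tail.firstKey());            // 10
        System.out.println(tail.get(tail.firstKey()));  // node-C
    }
}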

Using a red-black tree brings the lookup complexity down to O(logN), a big improvement over the two solutions above.

To verify this claim, I ran a test: from a large data set, find the first element greater than its midpoint value (for example, with 10000 items, find the first value greater than 5000), which simulates the average case. Here is a comparison of the O(N) and O(logN) approaches:

Data size    50000   100000   500000   1000000   4000000
ArrayList    1ms     1ms      4ms      4ms       5ms
LinkedList   4ms     7ms      11ms     13ms      17ms
TreeMap      0ms     0ms      0ms      0ms       0ms

Anything larger ran out of memory, so I only tested up to 4,000,000 items. As you can see, TreeMap wins the lookup comparison hands down, and the same would hold for even larger data sets: the red-black tree's structure guarantees that finding the smallest entry greater than any given value takes only a handful to a few dozen comparisons.

Of course, to be clear, every benefit has a cost. Another test of mine showed that, because the red-black tree has to be kept balanced, TreeMap has the worst insertion performance of the three data structures, roughly 5 to 10 times slower.
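As a rough idea of how such a comparison can be run, here is a simplified sketch of the lookup test described above (the class name and sizes are illustrative, it only compares ArrayList and TreeMap, and exact timings will differ by machine):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LookupBenchmark
{
    public static void main(String[] args)
    {
        int n = 1000000;
        int target = n / 2;

        List<Integer> list = new ArrayList<Integer>();
        TreeMap<Integer, Integer> tree = new TreeMap<Integer, Integer>();
        for (int i = 0; i < n; i++)
        {
            list.add(i);
            tree.put(i, i);
        }

        // O(N): linear scan for the first value greater than the midpoint
        long start = System.currentTimeMillis();
        for (int v : list)
            if (v > target)
                break;
        System.out.println("List scan: " + (System.currentTimeMillis() - start) + "ms");

        // O(logN): red-black tree lookup via tailMap
        start = System.currentTimeMillis();
        tree.tailMap(target + 1).firstKey();
        System.out.println("TreeMap lookup: " + (System.currentTimeMillis() - start) + "ms");
    }
}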

 

Recomputing the Hash Value

Server nodes will naturally be represented as strings, e.g. "192.168.1.1", "192.168.1.2", and the hash is derived from the string. The next important issue is that the hash has to be recomputed with a different algorithm; I noticed this while testing String's hashCode() method. Let's see why recomputation is necessary:

/**
 * A look at the output of String's hashCode() method
 * @author 五月的仓颉 http://www.cnblogs.com/xrq730/
 *
 */
public class StringHashCodeTest
{
    public static void main(String[] args)
    {
        System.out.println("hashCode of 192.168.0.0:111: " + "192.168.0.0:1111".hashCode());
        System.out.println("hashCode of 192.168.0.1:111: " + "192.168.0.1:1111".hashCode());
        System.out.println("hashCode of 192.168.0.2:111: " + "192.168.0.2:1111".hashCode());
        System.out.println("hashCode of 192.168.0.3:111: " + "192.168.0.3:1111".hashCode());
        System.out.println("hashCode of 192.168.0.4:111: " + "192.168.0.4:1111".hashCode());
    }
}

When building a cluster, it is perfectly normal for the node IPs to be consecutive like this. The output is:

hashCode of 192.168.0.0:111: 1845870087
hashCode of 192.168.0.1:111: 1874499238
hashCode of 192.168.0.2:111: 1903128389
hashCode of 192.168.0.3:111: 1931757540
hashCode of 192.168.0.4:111: 1960386691

This is a big problem. Within the range [0, 2^32-1], these five hash codes fall into one tiny interval. What does that mean? [0, 2^32-1] contains 4294967296 values, while our five nodes span an interval of only 114516604, roughly 3% of the ring. Probabilistically, about 97% of the keys to be routed will land on the "192.168.0.0" node (every key that hashes outside that narrow interval routes, directly or by wrapping around the ring, to the lowest node), which is simply terrible!

There is another drawback: the ring is defined over non-negative integers, but String's hashCode() can return negative values (try "192.168.1.0:1111" if you don't believe it). That problem is easy to work around, though; taking the absolute value is one way to solve it.

In summary, the hashCode() method that String overrides has no practical value for consistent hashing, so we need another algorithm to recompute the hash. There are many such algorithms, for example CRC32_HASH, FNV1_32_HASH, KETAMA_HASH and so on. KETAMA_HASH is the consistent-hashing algorithm recommended by MemCache by default, but other hash algorithms work as well; FNV1_32_HASH, for instance, is more efficient to compute.

Consistent Hashing, Version 1: Without Virtual Nodes

Although consistent hashing improves the system's scalability, it can also lead to an uneven load distribution. The remedy is to use virtual nodes in place of real nodes, but for this first version of the code let's keep things simple and leave virtual nodes out.

Here is the Java implementation of consistent hashing without virtual nodes:

import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Consistent hashing without virtual nodes
 * @author 五月的仓颉 http://www.cnblogs.com/xrq730/
 *
 */
public class ConsistentHashingWithoutVirtualNode
{
    /**
     * The list of servers to be added to the hash ring
     */
    private static String[] servers = {"192.168.0.0:111", "192.168.0.1:111", "192.168.0.2:111",
            "192.168.0.3:111", "192.168.0.4:111"};

    /**
     * Key is the server's hash value, value is the server's name
     */
    private static SortedMap<Integer, String> sortedMap =
            new TreeMap<Integer, String>();

    /**
     * Initialization: put all servers into sortedMap
     */
    static
    {
        for (int i = 0; i < servers.length; i++)
        {
            int hash = getHash(servers[i]);
            System.out.println("[" + servers[i] + "] added to the ring, hash value: " + hash);
            sortedMap.put(hash, servers[i]);
        }
        System.out.println();
    }

    /**
     * Compute the server's hash with the FNV1_32_HASH algorithm, rather than an
     * overridden hashCode(); the end result is the same either way
     */
    private static int getHash(String str)
    {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        // If the result is negative, take its absolute value
        if (hash < 0)
            hash = Math.abs(hash);
        return hash;
    }

    /**
     * Get the node a key should be routed to
     */
    private static String getServer(String node)
    {
        // Compute the hash of the key to be routed
        int hash = getHash(node);
        // Get the sub-map of all entries with a hash greater than or equal to it
        SortedMap<Integer, String> subMap =
                sortedMap.tailMap(hash);
        // If no node hash is >= the key's hash, wrap around to the start of the ring
        if (subMap.isEmpty())
            subMap = sortedMap;
        // The first key is the node nearest to it going clockwise
        Integer i = subMap.firstKey();
        // Return the corresponding server name
        return subMap.get(i);
    }

    public static void main(String[] args)
    {
        String[] nodes = {"127.0.0.1:1111", "221.226.0.1:2222", "10.211.0.1:3333"};
        for (int i = 0; i < nodes.length; i++)
            System.out.println("hash of [" + nodes[i] + "]: " +
                    getHash(nodes[i]) + ", routed to node [" + getServer(nodes[i]) + "]");
    }
}

Run it and take a look at the result:

[192.168.0.0:111] added to the ring, hash value: 575774686
[192.168.0.1:111] added to the ring, hash value: 8518713
[192.168.0.2:111] added to the ring, hash value: 1361847097
[192.168.0.3:111] added to the ring, hash value: 1171828661
[192.168.0.4:111] added to the ring, hash value: 1764547046

hash of [127.0.0.1:1111]: 380278925, routed to node [192.168.0.0:111]
hash of [221.226.0.1:2222]: 1493545632, routed to node [192.168.0.4:111]
hash of [10.211.0.1:3333]: 1393836017, routed to node [192.168.0.4:111]

The hash values recomputed with the FNV1_32_HASH algorithm are distributed far better than the ones produced by String's hashCode(). The routing results also look right: each of the three keys is routed to the server whose hash is nearest to it going clockwise.

Improving Consistent Hashing with Virtual Nodes

The implementation above goes a long way toward solving the poor scalability that bad routing algorithms cause in distributed environments, but it brings another problem: uneven load.

Suppose the hash ring has three server nodes A, B and C, and 100 requests are routed to each of them. Now a node D is added between A and B, so some of the requests that used to be routed to B are now routed to D. As a result, A and C clearly receive more requests than B and D, and the previously balanced load across the three servers is broken. In a sense this defeats the purpose of load balancing, since the whole point of load balancing is to spread all requests evenly across the target servers.

The way to solve this is to introduce virtual nodes. The idea: split each physical node into several virtual nodes, and spread the virtual nodes of a single physical node as evenly as possible around the hash ring. This effectively solves the load imbalance caused by adding or removing nodes.

As for how many virtual nodes a physical node should be split into, first consider the following chart:

The horizontal axis is the number of virtual nodes added per physical server, and the vertical axis is the number of actual physical servers. The chart shows that with few physical servers, more virtual nodes are needed per server; conversely, with many physical servers, fewer virtual nodes suffice. For example, with 10 physical servers, each needs roughly 100 to 200 virtual nodes to achieve genuinely balanced load.

Consistent Hashing, Version 2: With Virtual Nodes

Having understood the theory behind improving consistent hashing with virtual nodes, we can try to write the code. There are two programming questions to consider:

1. How does one real node become multiple virtual nodes?

2. Once a virtual node is found, how do we map it back to the real node?

Both questions have many possible answers. I use a simple approach here: append a per-virtual-node suffix to each real node's name before hashing. For example, "192.168.0.0:111" becomes "192.168.0.0:111&&VN0" through "192.168.0.0:111&&VN4", where VN stands for Virtual Node. To map back to the real node, simply take the substring from the start of the name up to "&&".

Here is the Java implementation of consistent hashing with virtual nodes:

import java.util.LinkedList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Consistent hashing with virtual nodes
 * @author 五月的仓颉 http://www.cnblogs.com/xrq730/
 */
public class ConsistentHashingWithVirtualNode
{
    /**
     * The list of servers to be added to the hash ring
     */
    private static String[] servers = {"192.168.0.0:111", "192.168.0.1:111", "192.168.0.2:111",
            "192.168.0.3:111", "192.168.0.4:111"};

    /**
     * The list of real nodes. Servers going online and offline (i.e., additions and
     * removals) will be fairly frequent, so a LinkedList is the better choice here
     */
    private static List<String> realNodes = new LinkedList<String>();

    /**
     * Virtual nodes: key is the virtual node's hash value, value is its name
     */
    private static SortedMap<Integer, String> virtualNodes =
            new TreeMap<Integer, String>();

    /**
     * The number of virtual nodes, hard-coded here for the demo: each real node
     * corresponds to 5 virtual nodes
     */
    private static final int VIRTUAL_NODES = 5;

    static
    {
        // First add the original servers to the list of real nodes
        for (int i = 0; i < servers.length; i++)
            realNodes.add(servers[i]);

        // Then add the virtual nodes; iterating a LinkedList with a foreach loop is more efficient
        for (String str : realNodes)
        {
            for (int i = 0; i < VIRTUAL_NODES; i++)
            {
                String virtualNodeName = str + "&&VN" + String.valueOf(i);
                int hash = getHash(virtualNodeName);
                System.out.println("virtual node [" + virtualNodeName + "] added, hash value: " + hash);
                virtualNodes.put(hash, virtualNodeName);
            }
        }
        System.out.println();
    }

    /**
     * Compute the server's hash with the FNV1_32_HASH algorithm, rather than an
     * overridden hashCode(); the end result is the same either way
     */
    private static int getHash(String str)
    {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        // If the result is negative, take its absolute value
        if (hash < 0)
            hash = Math.abs(hash);
        return hash;
    }

    /**
     * Get the node a key should be routed to
     */
    private static String getServer(String node)
    {
        // Compute the hash of the key to be routed
        int hash = getHash(node);
        // Get the sub-map of all entries with a hash greater than or equal to it
        SortedMap<Integer, String> subMap =
                virtualNodes.tailMap(hash);
        // If no virtual node hash is >= the key's hash, wrap around to the start of the ring
        if (subMap.isEmpty())
            subMap = virtualNodes;
        // The first key is the virtual node nearest to it going clockwise
        Integer i = subMap.firstKey();
        // Return the corresponding real node: trim the virtual-node suffix off the name
        String virtualNode = subMap.get(i);
        return virtualNode.substring(0, virtualNode.indexOf("&&"));
    }

    public static void main(String[] args)
    {
        String[] nodes = {"127.0.0.1:1111", "221.226.0.1:2222", "10.211.0.1:3333"};
        for (int i = 0; i < nodes.length; i++)
            System.out.println("hash of [" + nodes[i] + "]: " +
                    getHash(nodes[i]) + ", routed to node [" + getServer(nodes[i]) + "]");
    }
}

Take a look at the output:

virtual node [192.168.0.0:111&&VN0] added, hash value: 1686427075
virtual node [192.168.0.0:111&&VN1] added, hash value: 354859081
virtual node [192.168.0.0:111&&VN2] added, hash value: 1306497370
virtual node [192.168.0.0:111&&VN3] added, hash value: 817889914
virtual node [192.168.0.0:111&&VN4] added, hash value: 396663629
virtual node [192.168.0.1:111&&VN0] added, hash value: 1032739288
virtual node [192.168.0.1:111&&VN1] added, hash value: 707592309
virtual node [192.168.0.1:111&&VN2] added, hash value: 302114528
virtual node [192.168.0.1:111&&VN3] added, hash value: 36526861
virtual node [192.168.0.1:111&&VN4] added, hash value: 848442551
virtual node [192.168.0.2:111&&VN0] added, hash value: 1452694222
virtual node [192.168.0.2:111&&VN1] added, hash value: 2023612840
virtual node [192.168.0.2:111&&VN2] added, hash value: 697907480
virtual node [192.168.0.2:111&&VN3] added, hash value: 790847074
virtual node [192.168.0.2:111&&VN4] added, hash value: 2010506136
virtual node [192.168.0.3:111&&VN0] added, hash value: 891084251
virtual node [192.168.0.3:111&&VN1] added, hash value: 1725031739
virtual node [192.168.0.3:111&&VN2] added, hash value: 1127720370
virtual node [192.168.0.3:111&&VN3] added, hash value: 676720500
virtual node [192.168.0.3:111&&VN4] added, hash value: 2050578780
virtual node [192.168.0.4:111&&VN0] added, hash value: 586921010
virtual node [192.168.0.4:111&&VN1] added, hash value: 184078390
virtual node [192.168.0.4:111&&VN2] added, hash value: 1331645117
virtual node [192.168.0.4:111&&VN3] added, hash value: 918790803
virtual node [192.168.0.4:111&&VN4] added, hash value: 1232193678

hash of [127.0.0.1:1111]: 380278925, routed to node [192.168.0.0:111]
hash of [221.226.0.1:2222]: 1493545632, routed to node [192.168.0.0:111]
hash of [10.211.0.1:3333]: 1393836017, routed to node [192.168.0.2:111]

From the output, every key is routed to the server whose virtual-node hash is nearest to it going clockwise on the ring; everything checks out.

By adopting virtual nodes, a real node is no longer pinned to a single point on the hash ring but is spread widely across the whole ring, so even when servers are brought online or taken offline the overall load does not become unbalanced.

Afterword

Much of what is in this article I learned while writing it, so there are bound to be places that are poorly written or not thoroughly understood, and the code as a whole is fairly rough and does not handle every possible case. Consider it a starting point: on the one hand, I hope readers will point out whatever is wrong; on the other, I will keep improving the code through my own work and study.

from:http://www.cnblogs.com/xrq730/p/5186728.html

A Beginner’s Guide to Big Data Terminology

Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.

Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.

Algorithms: Mathematical formulas or statistical processes used to analyze data. These are used in software to process and analyze any input data.

Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:

Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.

Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.

Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.

Cloud: It’s available any and everywhere. Cloud computing simply means storing or accessing data (programs, files, data) over the internet instead of a hard drive.

DaaS: Data-as-a-service treats data as a product. DaaS providers use the cloud to give on-demand access of data to customers. This allows companies to get high quality data quickly. DaaS has been a popular word in 2015, and is playing a major role in marketing.

Data Mining: Data miners explore large sets of data to find patterns and insight. This is a highly analytical process that emphasizes making use of large datasets. This process could likely involve artificial intelligence, machine learning or statistics.

Dark Data: This is information that is gathered and processed by a business, but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data lying around without even realizing it.

Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), a software that allows data to be explored and analyzed.

Hadoop (Apache Hadoop): An open source software framework, Hadoop works largely by storing files and processing data. It is also known for its large processing power, which makes it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormous amounts of data. Apache is also in charge of other, related programs you may run into: Pig, Hive, and now Spark (more on Spark later).

IoT: The Internet of Things is generally described as the way products are able to “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are perfect examples. They are always pulling information from the cloud and their sensors are relaying information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:

IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.

Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.

MapReduce: MapReduce is a programming model for processing and generating large data sets. The model does two distinct things. First, the “Map” step turns one dataset into another, more useful, broken-down dataset made up of key-value pairs called tuples. Second, the “Reduce” step takes all of those tuples and combines them into a smaller, aggregated set of results. The result is a practical breakdown of information.
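To make the two steps concrete, here is the canonical word-count example written against the Hadoop MapReduce Java API (the standard introductory program from the Hadoop documentation; input and output paths are supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: turn each line of input into (word, 1) tuples
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: receive all tuples for one word and sum their counts
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper emits a (word, 1) tuple per word it sees, and each reducer receives every tuple for a given word and adds the counts up.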

Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not use relational tables with rows and columns. It is used to manage and stream processing of data. NoSQL includes a number of different databases and models that run horizontally, meaning across servers. This might make it more cost-effective than vertical scaling (as used in SQL).

Petabyte: Yes, it’s big. It’s 1,000,000,000,000,000 bytes. To visualize, Gizmodo described one petabyte as 20 million 4-drawer filing cabinets filled with text. 20 petabytes would be all the written works of mankind from the beginning of time translated into every language.

SQL: Also known as Structured Query Language, this is used for the managing and stream processing of data. It is used to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in a relational table with rows and columns.

R: R is a horribly named programming language that works with statistical computing. It is considered one of the more important and most popular languages in data science.

SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet. Yes, that’s cloud servicing. SaaS providers provide services over the cloud rather than hard copies.

Spark (Apache Spark): An open-source computing framework originally developed at the University of California, Berkeley, Spark was later donated to the Apache Software Foundation. Spark is mostly used for machine learning and interactive analytics.

from:http://dataconomy.com/a-beginners-guide-to-big-data-terminology/

A Summary of Python Machine Learning and Deep Learning

1. Setting up the Python environment (Windows)

IDE: PyCharm Community Edition (free)

Python environment: WinPython 3.5.2.3Qt5
– this distribution bundles the main packages used for machine learning and deep learning:
numpy, scipy, matplotlib, pandas, scikit-learn, theano, keras

IPython notebook:

2. Sample code:

scikit-learn sample

keras sample

3. Datasets

GeoHey public data

4. The Kaggle platform

Kaggle is a platform for data modeling and data analysis competitions. Companies and researchers can publish data on it, and statisticians and data mining experts compete to produce the best models. This crowdsourcing approach rests on the fact that countless strategies can be applied to almost any predictive modeling problem, and no researcher can know at the outset which method will work best for a given problem. Kaggle's goal is to tackle this difficulty through crowdsourcing and, in doing so, turn data science into a sport. (wiki)

5. Handling common problems

Approaching (Almost) Any Machine Learning Problem

 

A Collection of Everyday Tools

1. Video processing (trimming, merging, format conversion)

Format Factory (格式工厂)

2. Screen recording
EVCapture screen recorder (formerly LiveView)

Bandicam

3. Data processing

SQLiteStudio

4. File comparison

KDiff3

5. JPEGsnoop: checks whether an image has been edited in Photoshop.

6. Screenshots

Greenshot: a simple, handy screenshot tool that is also free and open source.

7. File recovery

EaseUS Data Recovery Wizard (易我数据恢复向导)

8. Video downloading

ViDown (维棠), FLVCD (硕鼠)