Category Archives: BigData

Data science Python notebooks






IPython Notebook(s) demonstrating deep learning functionality.



Additional TensorFlow tutorials:

Notebook Description
tsf-basics Learn basic operations in TensorFlow, a library for various kinds of perceptual and language understanding tasks from Google.
tsf-linear Implement linear regression in TensorFlow.
tsf-logistic Implement logistic regression in TensorFlow.
tsf-nn Implement nearest neighboars in TensorFlow.
tsf-alex Implement AlexNet in TensorFlow.
tsf-cnn Implement convolutional neural networks in TensorFlow.
tsf-mlp Implement multilayer perceptrons in TensorFlow.
tsf-rnn Implement recurrent neural networks in TensorFlow.
tsf-gpu Learn about basic multi-GPU computation in TensorFlow.
tsf-gviz Learn about graph visualization in TensorFlow.
tsf-lviz Learn about loss visualization in TensorFlow.


Notebook Description
tsf-not-mnist Learn simple data curation by creating a pickle with formatted datasets for training, development and testing in TensorFlow.
tsf-fully-connected Progressively train deeper and more accurate models using logistic regression and neural networks in TensorFlow.
tsf-regularization Explore regularization techniques by training fully connected networks to classify notMNIST characters in TensorFlow.
tsf-convolutions Create convolutional neural networks in TensorFlow.
tsf-word2vec Train a skip-gram model over Text8 data in TensorFlow.
tsf-lstm Train a LSTM character model over Text8 data in TensorFlow.



Notebook Description
theano-intro Intro to Theano, which allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
theano-scan Learn scans, a mechanism to perform loops in a Theano graph.
theano-logistic Implement logistic regression in Theano.
theano-rnn Implement recurrent neural networks in Theano.
theano-mlp Implement multilayer perceptrons in Theano.



Notebook Description
keras Keras is an open source neural network library written in Python. It is capable of running on top of either Tensorflow or Theano.
setup Learn about the tutorial goals and how to set up your Keras environment.
intro-deep-learning-ann Get an intro to deep learning with Keras and Artificial Neural Networks (ANN).
theano Learn about Theano by working with weights matrices and gradients.
keras-otto Learn about Keras by looking at the Kaggle Otto challenge.
ann-mnist Review a simple implementation of ANN for MNIST using Keras.
conv-nets Learn about Convolutional Neural Networks (CNNs) with Keras.
conv-net-1 Recognize handwritten digits from MNIST using Keras – Part 1.
conv-net-2 Recognize handwritten digits from MNIST using Keras – Part 2.
keras-models Use pre-trained models such as VGG16, VGG19, ResNet50, and Inception v3 with Keras.
auto-encoders Learn about Autoencoders with Keras.
rnn-lstm Learn about Recurrent Neural Networks (RNNs) with Keras.
lstm-sentence-gen Learn about RNNs using Long Short Term Memory (LSTM) networks with Keras.


Notebook Description
deep-dream Caffe-based computer vision program which uses a convolutional neural network to find and enhance patterns in images.



IPython Notebook(s) demonstrating scikit-learn functionality.

Notebook Description
intro Intro notebook to scikit-learn. Scikit-learn adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
knn Implement k-nearest neighbors in scikit-learn.
linear-reg Implement linear regression in scikit-learn.
svm Implement support vector machine classifiers with and without kernels in scikit-learn.
random-forest Implement random forest classifiers and regressors in scikit-learn.
k-means Implement k-means clustering in scikit-learn.
pca Implement principal component analysis in scikit-learn.
gmm Implement Gaussian mixture models in scikit-learn.
validation Implement validation and model selection in scikit-learn.



IPython Notebook(s) demonstrating statistical inference with SciPy functionality.

Notebook Description
scipy SciPy is a collection of mathematical algorithms and convenience functions built on the Numpy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data.
effect-size Explore statistics that quantify effect size by analyzing the difference in height between men and women. Uses data from the Behavioral Risk Factor Surveillance System (BRFSS) to estimate the mean and standard deviation of height for adult women and men in the United States.
sampling Explore random sampling by analyzing the average weight of men and women in the United States using BRFSS data.
hypothesis Explore hypothesis testing by analyzing the difference of first-born babies compared with others.



IPython Notebook(s) demonstrating pandas functionality.

Notebook Description
pandas Software library written for data manipulation and analysis in Python. Offers data structures and operations for manipulating numerical tables and time series.
github-data-wrangling Learn how to load, clean, merge, and feature engineer by analyzing GitHub data from the Viz repo.
Introduction-to-Pandas Introduction to Pandas.
Introducing-Pandas-Objects Learn about Pandas objects.
Data Indexing and Selection Learn about data indexing and selection in Pandas.
Operations-in-Pandas Learn about operating on data in Pandas.
Missing-Values Learn about handling missing data in Pandas.
Hierarchical-Indexing Learn about hierarchical indexing in Pandas.
Concat-And-Append Learn about combining datasets: concat and append in Pandas.
Merge-and-Join Learn about combining datasets: merge and join in Pandas.
Aggregation-and-Grouping Learn about aggregation and grouping in Pandas.
Pivot-Tables Learn about pivot tables in Pandas.
Working-With-Strings Learn about vectorized string operations in Pandas.
Working-with-Time-Series Learn about working with time series in pandas.
Performance-Eval-and-Query Learn about high-performance Pandas: eval() and query() in Pandas.



IPython Notebook(s) demonstrating matplotlib functionality.

Notebook Description
matplotlib Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
matplotlib-applied Apply matplotlib visualizations to Kaggle competitions for exploratory data analysis. Learn how to create bar plots, histograms, subplot2grid, normalized plots, scatter plots, subplots, and kernel density estimation plots.
Introduction-To-Matplotlib Introduction to Matplotlib.
Simple-Line-Plots Learn about simple line plots in Matplotlib.
Simple-Scatter-Plots Learn about simple scatter plots in Matplotlib.
Errorbars.ipynb Learn about visualizing errors in Matplotlib.
Density-and-Contour-Plots Learn about density and contour plots in Matplotlib.
Histograms-and-Binnings Learn about histograms, binnings, and density in Matplotlib.
Customizing-Legends Learn about customizing plot legends in Matplotlib.
Customizing-Colorbars Learn about customizing colorbars in Matplotlib.
Multiple-Subplots Learn about multiple subplots in Matplotlib.
Text-and-Annotation Learn about text and annotation in Matplotlib.
Customizing-Ticks Learn about customizing ticks in Matplotlib.
Settings-and-Stylesheets Learn about customizing Matplotlib: configurations and stylesheets.
Three-Dimensional-Plotting Learn about three-dimensional plotting in Matplotlib.
Geographic-Data-With-Basemap Learn about geographic data with basemap in Matplotlib.
Visualization-With-Seaborn Learn about visualization with Seaborn.



IPython Notebook(s) demonstrating NumPy functionality.

Notebook Description
numpy Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Introduction-to-NumPy Introduction to NumPy.
Understanding-Data-Types Learn about data types in Python.
The-Basics-Of-NumPy-Arrays Learn about the basics of NumPy arrays.
Computation-on-arrays-ufuncs Learn about computations on NumPy arrays: universal functions.
Computation-on-arrays-aggregates Learn about aggregations: min, max, and everything in between in NumPy.
Computation-on-arrays-broadcasting Learn about computation on arrays: broadcasting in NumPy.
Boolean-Arrays-and-Masks Learn about comparisons, masks, and boolean logic in NumPy.
Fancy-Indexing Learn about fancy indexing in NumPy.
Sorting Learn about sorting arrays in NumPy.
Structured-Data-NumPy Learn about structured data: NumPy’s structured arrays.



IPython Notebook(s) demonstrating Python functionality geared towards data analysis.

Notebook Description
data structures Learn Python basics with tuples, lists, dicts, sets.
data structure utilities Learn Python operations such as slice, range, xrange, bisect, sort, sorted, reversed, enumerate, zip, list comprehensions.
functions Learn about more advanced Python features: Functions as objects, lambda functions, closures, *args, **kwargs currying, generators, generator expressions, itertools.
datetime Learn how to work with Python dates and times: datetime, strftime, strptime, timedelta.
logging Learn about Python logging with RotatingFileHandler and TimedRotatingFileHandler.
pdb Learn how to debug in Python with the interactive source code debugger.
unit tests Learn how to test in Python with Nose unit tests.



IPython Notebook(s) used in kaggle competitions and business analyses.

Notebook Description
titanic Predict survival on the Titanic. Learn data cleaning, exploratory data analysis, and machine learning.
churn-analysis Predict customer churn. Exercise logistic regression, gradient boosting classifers, support vector machines, random forests, and k-nearest-neighbors. Includes discussions of confusion matrices, ROC plots, feature importances, prediction probabilities, and calibration/descrimination.



IPython Notebook(s) demonstrating spark and HDFS functionality.

Notebook Description
spark In-memory cluster computing framework, up to 100 times faster for certain applications and is well suited for machine learning algorithms.
hdfs Reliably stores very large files across machines in a large cluster.



IPython Notebook(s) demonstrating Hadoop MapReduce with mrjob functionality.

Notebook Description
mapreduce-python Runs MapReduce jobs in Python, executing jobs locally or on Hadoop clusters. Demonstrates Hadoop Streaming in Python code with unit test and mrjob config file to analyze Amazon S3 bucket logs on Elastic MapReduce. Disco is another python-based alternative.



IPython Notebook(s) demonstrating Amazon Web Services (AWS) and AWS tools functionality.

Also check out:

  • SAWS: A Supercharged AWS command line interface (CLI).
  • Awesome AWS: A curated list of libraries, open source repos, guides, blogs, and other resources.
Notebook Description
boto Official AWS SDK for Python.
s3cmd Interacts with S3 through the command line.
s3distcp Combines smaller files and aggregates them together by taking in a pattern and target file. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster.
s3-parallel-put Uploads multiple files to S3 in parallel.
redshift Acts as a fast data warehouse built on top of technology from massive parallel processing (MPP).
kinesis Streams data in real time with the ability to process thousands of data streams per second.
lambda Runs code in response to events, automatically managing compute resources.



IPython Notebook(s) demonstrating various command lines for Linux, Git, etc.

Notebook Description
linux Unix-like and mostly POSIX-compliant computer operating system. Disk usage, splitting files, grep, sed, curl, viewing running processes, terminal syntax highlighting, and Vim.
anaconda Distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing, that aims to simplify package management and deployment.
ipython notebook Web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document.
git Distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
ruby Used to interact with the AWS command line and for Jekyll, a blog framework that can be hosted on GitHub Pages.
jekyll Simple, blog-aware, static site generator for personal, project, or organization sites. Renders Markdown or Textile and Liquid templates, and produces a complete, static website ready to be served by Apache HTTP Server, Nginx or another web server.
pelican Python-based alternative to Jekyll.
django High-level Python Web framework that encourages rapid development and clean, pragmatic design. It can be useful to share reports/analyses and for blogging. Lighter-weight alternatives include Pyramid, Flask, Tornado, and Bottle.


IPython Notebook(s) demonstrating miscellaneous functionality.

Notebook Description
regex Regular expression cheat sheet useful in data wrangling.
algorithmia Algorithmia is a marketplace for algorithms. This notebook showcases 4 different algorithms: Face Detection, Content Summarizer, Latent Dirichlet Allocation and Optical Character Recognition.



Anaconda is a free distribution of the Python programming language for large-scale data processing, predictive analytics, and scientific computing that aims to simplify package management and deployment.

Follow instructions to install Anaconda or the more lightweight miniconda.


For detailed instructions, scripts, and tools to set up your development environment for data analysis, check out the dev-setup repo.


To view interactive content or to modify elements within the IPython notebooks, you must first clone or download the repository then run the notebook. More information on IPython Notebooks can be found here.

$ git clone
$ cd data-science-ipython-notebooks
$ jupyter notebook

Notebooks tested with Python 2.7.x.



Contributions are welcome! For bug reports or requests please submit an issue.


Feel free to contact me to discuss any issues, questions, or comments.


This repository contains a variety of content; some developed by Donne Martin, and some from third-parties. The third-party content is distributed under the license provided by those parties.

The content developed by Donne Martin is distributed under the following license:

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer (Facebook).

Copyright 2015 Donne Martin

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License



相关话题:大数据 数据分析 



  • 在线练习
  • 在线编程面试
  • 数据结构
  • 算法
  • 贪心算法
  • 位运算
  • 复杂度分析
  • 视频教程
  • 面试宝典
  • 计算机科学资讯
  • 文件结构





  • 链表是一种由节点(Node)组成的线性数据集合,每个节点通过指针指向下一个节点。它是一种由节点组成,并能用于表示序列的数据结构。
  • 单链表 :每个节点仅指向下一个节点,最后一个节点指向空(null)。
  • 双链表 :每个节点有两个指针p,n。p指向前一个节点,n指向下一个节点;最后一个节点指向空。
  • 循环链表 :每个节点指向下一个节点,最后一个节点指向第一个节点。
  • 时间复杂度:
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)

  • 栈是一个元素集合,支持两个基本操作:push用于将元素压入栈,pop用于删除栈顶元素。
  • 后进先出的数据结构(Last In First Out, LIFO)
  • 时间复杂度
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)


  • 队列是一个元素集合,支持两种基本操作:enqueue 用于添加一个元素到队列,dequeue 用于删除队列中的一个元素。
  • 先进先出的数据结构(First In First Out, FIFO)。
  • 时间复杂度
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)

  • 树是无向、联通的无环图。


  • 二叉树是一个树形数据结构,每个节点最多可以有两个子节点,称为左子节点和右子节点。
  • 满二叉树(Full Tree) :二叉树中的每个节点有 0 或者 2 个子节点。


  • 完美二叉树(Perfect Binary) :二叉树中的每个节点有两个子节点,并且所有的叶子节点的深度是一样的。
  • 完全二叉树 :二叉树中除最后一层外其他各层的节点数均达到最大值,最后一层的节点都连续集中在最左边。


  • 二叉查找树(BST)是一种二叉树。其任何节点的值都大于等于左子树中的值,小于等于右子树中的值。
  • 时间复杂度
    • 索引:O(log(n))
    • 查找:O(log(n))
    • 插入:O(log(n))
    • 删除:O(log(n))



  • 字典树,又称为基数树或前缀树,是一种用于存储键值为字符串的动态集合或关联数组的查找树。树中的节点并不直接存储关联键值,而是该节点在树中的位置决定了其关联键值。一个节点的所有子节点都有相同的前缀,根节点则是空字符串。



  • 树状数组,又称为二进制索引树(Binary Indexed Tree,BIT),其概念上是树,但以数组实现。数组中的下标代表树中的节点,每个节点的父节点或子节点的下标可以通过位运算获得。数组中的每个元素都包含了预计算的区间值之和,在整个树更新的过程中,这些计算的值也同样会被更新。
  • 时间复杂度
    • 区间求和:O(log(n))
    • 更新:O(log(n))



  • 线段树是用于存储区间和线段的树形数据结构。它允许查找一个节点在若干条线段中出现的次数。
  • 时间复杂度
    • 区间查找:O(log(n))
    • 更新:O(log(n))


  • 堆是一种基于树的满足某些特性的数据结构:整个堆中的所有父子节点的键值都满足相同的排序条件。堆分为最大堆和最小堆。在最大堆中,父节点的键值永远大于等于所有子节点的键值,根节点的键值是最大的。最小堆中,父节点的键值永远小于等于所有子节点的键值,根节点的键值是最小的。
  • 时间复杂度
    • 索引:O(log(n))
    • 查找:O(log(n))
    • 插入:O(log(n))
    • 删除:O(log(n))
    • 删除最大值/最小值:O(1)



  • 哈希用于将任意长度的数据映射到固定长度的数据。哈希函数的返回值被称为哈希值、哈希码或者哈希。如果不同的主键得到相同的哈希值,则发生了冲突。
  • Hash Maphash map  是一个存储键值间关系的数据结构。HashMap 通过哈希函数将键转化为桶或者槽中的下标,从而便于指定值的查找。
  • 冲突解决
    • 链地址法( Separate Chaining :在链地址法中,每个桶(bucket)是相互独立的,每一个索引对应一个元素列表。处理HashMap 的时间就是查找桶的时间(常量)与遍历列表元素的时间之和。
    • 开放地址法( Open Addressing :在开放地址方法中,当插入新值时,会判断该值对应的哈希桶是否存在,如果存在则根据某种算法依次选择下一个可能的位置,直到找到一个未被占用的地址。开放地址即某个元素的位置并不永远由其哈希值决定。


  • 图是G =(V,E)的有序对,其包括顶点或节点的集合 V 以及边或弧的集合E,其中E包括了两个来自V的元素(即边与两个顶点相关联 ,并且该关联为这两个顶点的无序对)。
  • 无向图 :图的邻接矩阵是对称的,因此如果存在节点 u 到节点 v 的边,那节点 v 到节点 u 的边也一定存在。
  • 有向图 :图的邻接矩阵不是对称的。因此如果存在节点 u 到节点 v 的边并不意味着一定存在节点 v 到节点 u 的边。





  • 稳定:否
  • 时间复杂度
    • 最优:O(nlog(n))
    • 最差:O(n^2)
    • 平均:O(nlog(n))



  • 合并排序是一种分治算法。这个算法不断地将一个数组分为两部分,分别对左子数组和右子数组排序,然后将两个数组合并为新的有序数组。
  • 稳定:是
  • 时间复杂度:
    • 最优:O(nlog(n))
    • 最差:O(nlog(n))
    • 平均:O(nlog(n))


  • 桶排序是一种将元素分到一定数量的桶中的排序算法。每个桶内部采用其他算法排序,或递归调用桶排序。
  • 时间复杂度
    • 最优:Ω(n + k)
    • 最差: O(n^2)
    • 平均:Θ(n + k)



  • 基数排序类似于桶排序,将元素分发到一定数目的桶中。不同的是,基数排序在分割元素之后没有让每个桶单独进行排序,而是直接做了合并操作。
  • 时间复杂度
    • 最优:Ω(nk)
    • 最差: O(nk)
    • 平均:Θ(nk)


深度优先 搜索

  • 深度优先搜索是一种先遍历子节点而不回溯的图遍历算法。
  • 时间复杂度:O(|V| + |E|)


广度优先 搜索

  • 广度优先搜索是一种先遍历邻居节点而不是子节点的图遍历算法。
  • 时间复杂度:O(|V| + |E|)



  • 拓扑排序是有向图节点的线性排序。对于任何一条节点 u 到节点 v 的边,u 的下标先于 v。
  • 时间复杂度:O(|V| + |E|)


  • Dijkstra 算法是一种在有向图中查找单源最短路径的算法。
  • 时间复杂度:O(|V|^2)



  • Bellman-Ford  是一种在带权图中查找单一源点到其他节点最短路径的算法。
  • 虽然时间复杂度大于 Dijkstra 算法,但它可以处理包含了负值边的图。
  • 时间复杂度:
    • 最优:O(|E|)
    • 最差:O(|V||E|)


Floyd-Warshall 算法

  • Floyd-Warshall  算法是一种在无环带权图中寻找任意节点间最短路径的算法。
  • 该算法执行一次即可找到所有节点间的最短路径(路径权重和)。
  • 时间复杂度:
    • 最优:O(|V|^3)
    • 最差:O(|V|^3)
    • 平均:O(|V|^3)


  • 最小生成树算法是一种在无向带权图中查找最小生成树的贪心算法。换言之,最小生成树算法能在一个图中找到连接所有节点的边的最小子集。
  • 时间复杂度:O(|V|^2)

Kruskal 算法

  • Kruskal  算法也是一个计算最小生成树的贪心算法,但在 Kruskal 算法中,图不一定是连通的。
  • 时间复杂度:O(|E|log|V|)


  • 贪心算法总是做出在当前看来最优的选择,并希望最后整体也是最优的。
  • 使用贪心算法可以解决的问题必须具有如下两种特性:
    • 最优子结构
      • 问题的最优解包含其子问题的最优解。
    • 贪心选择
      • 每一步的贪心选择可以得到问题的整体最优解。
  • 实例-硬币选择问题
  • 给定期望的硬币总和为 V 分,以及 n 种硬币,即类型是 i 的硬币共有 coinValue[i] 分,i的范围是 [0…n – 1]。假设每种类型的硬币都有无限个,求解为使和为 V 分最少需要多少硬币?
  • 硬币:便士(1美分),镍(5美分),一角(10美分),四分之一(25美分)。
  • 假设总和 V 为41,。我们可以使用贪心算法查找小于或者等于 V 的面值最大的硬币,然后从 V 中减掉该硬币的值,如此重复进行。
    • V = 41 | 使用了0个硬币
    • V = 16 | 使用了1个硬币(41 – 25 = 16)
    • V = 6 | 使用了2个硬币(16 – 10 = 6)
    • V = 1 | 使用了3个硬币(6 – 5 = 1)
    • V = 0 | 使用了4个硬币(1 – 1 = 0)


  • 位运算即在比特级别进行操作的技术。使用位运算技术可以带来更快的运行速度与更小的内存使用。
  • 测试第 k 位:s & (1 << k);
  • 设置第k位:s |= (1 << k);
  • 关闭第k位:s &= ~(1 << k);
  • 切换第k位:s ^= (1 << k);
  • 乘以2n:s << n;
  • 除以2n:s >> n;
  • 交集:s & t;
  • 并集:s | t;
  • 减法:s & ~t;
  • 提取最小非0位:s & (-s);
  • 提取最小0位:~s & (s + 1);
  • 交换值:x ^= y; y ^= x; x ^= y;


大 O 表示

  • 大 O 表示用于表示某个算法的上界,用于描述最坏的情况。


小 O 表示

  • 小 O 表示用于描述某个算法的渐进上界,二者逐渐趋近。

大 Ω 表示

  • 大 Ω 表示用于描述某个算法的渐进下界。


小 ω 表示

  • 小 ω 表示用于描述某个算法的渐进下界,二者逐渐趋近。

Theta Θ 表示

  • Theta Θ 表示用于描述某个算法的确界,包括最小上界和最大下界。






A Beginner’s Guide to Big Data Terminology

Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.

Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.

Algorithms: Mathematical formulas or statistical processes used to analyze data. These are used in software to process and analyze any input data.

Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:

Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.

Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.

Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.

Cloud: It’s available any and everywhere. Cloud computing simply means storing or accessing data (programs, files, data) over the internet instead of a hard drive.

DaaS: Data-as-a-service treats data as a product. DaaS providers use the cloud to give on-demand access of data to customers. This allows companies to get high quality data quickly. DaaS has been a popular word in 2015, and is playing a major role in marketing.

Data Mining: Data miners explore large sets of data to find patterns and insight. This is a highly analytical process that emphasizes making use of large datasets. This process could likely involve artificial intelligence, machine learning or statistics.

Dark Data: This is information that is gathered and processed by a business, but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data laying around without even realizing it.

Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), a software that allows data to be explored and analyzed.

Hadoop (Apache Hadoop): An open source software framework, Hadoop works largely by storing files and processing data. It is also known for large processing power, making it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormously big amounts of data. Apache is also in charge of other, related programs you may run into: Pig, Hive, and now Spark (more on Spark later).

IoT: The Internet of Things is generally described as the way products are able “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are perfect examples. They are always pulling information from the cloud and their sensors are relaying information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:

IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.

Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.

MapReduce: MapReduce is a programming model for processing and generating large data sets. This model actually does two distinct things. First, the “Map” includes turning one dataset into another, more useful and broken down dataset made of bits called tuples. Second, “Reduce” takes all of the broken down tuples and breaks them down even further. The result is a practical breakdown of information.

Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not use relational tables with rows and columns. It is used to manage and stream processing of data. NoSQL includes a number of different databases and models that run horizontally, meaning across servers. This might make it more cost-effective than vertical scaling (as used in SQL).

Petabyte: Yes, it’s big. It’s 1,000,000,000,000,000 bytes. To visualize, Gizmodo described one petabyte as 20 million 4-drawer filing cabinets filled with texts. 20 Petabytes would be all the written works of mankind from the beginning of time translated in every language.

SQL: Also known as Structured Query Language, this is used for the managing and stream processing of data. It is used to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in a relational table with rows and columns.

R: R is a horribly named programming language that works with statistical computing. It is considered one of the more important and most popular languages in data science.

SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet. Yes, that’s cloud servicing. SaaS providers provide services over the cloud rather than hard copies.

Spark (Apache Spark): An open-source computing framework originally developed at University of California, Berkely, Spark was later donated to Apache Software. Spark is mostly used for machine learning and interactive analytics.




开发工具:PyCharm Community Edition(free)


IPython notebook :


scikit-learn sample

keras sample






Approaching (Almost) Any Machine Learning Problem