Tag Archives: Bigdata

A Beginner’s Guide to Big Data Terminology

Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.

Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.

Algorithms: Mathematical formulas or statistical processes used to analyze data. These are used in software to process and analyze any input data.

Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:

Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.

Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.

Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.

Cloud: It’s available any and everywhere. Cloud computing simply means storing or accessing data (programs, files, data) over the internet instead of a hard drive.

DaaS: Data-as-a-service treats data as a product. DaaS providers use the cloud to give on-demand access of data to customers. This allows companies to get high quality data quickly. DaaS has been a popular word in 2015, and is playing a major role in marketing.

Data Mining: Data miners explore large sets of data to find patterns and insight. This is a highly analytical process that emphasizes making use of large datasets. This process could likely involve artificial intelligence, machine learning or statistics.

Dark Data: This is information that is gathered and processed by a business, but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data laying around without even realizing it.

Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), a software that allows data to be explored and analyzed.

Hadoop (Apache Hadoop): An open source software framework, Hadoop works largely by storing files and processing data. It is also known for large processing power, making it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormously big amounts of data. Apache is also in charge of other, related programs you may run into: Pig, Hive, and now Spark (more on Spark later).

IoT: The Internet of Things is generally described as the way products are able “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are perfect examples. They are always pulling information from the cloud and their sensors are relaying information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:

IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.

Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.

MapReduce: MapReduce is a programming model for processing and generating large data sets. This model actually does two distinct things. First, the “Map” includes turning one dataset into another, more useful and broken down dataset made of bits called tuples. Second, “Reduce” takes all of the broken down tuples and breaks them down even further. The result is a practical breakdown of information.

Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not use relational tables with rows and columns. It is used to manage and stream processing of data. NoSQL includes a number of different databases and models that run horizontally, meaning across servers. This might make it more cost-effective than vertical scaling (as used in SQL).

Petabyte: Yes, it’s big. It’s 1,000,000,000,000,000 bytes. To visualize, Gizmodo described one petabyte as 20 million 4-drawer filing cabinets filled with texts. 20 Petabytes would be all the written works of mankind from the beginning of time translated in every language.

SQL: Also known as Structured Query Language, this is used for the managing and stream processing of data. It is used to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in a relational table with rows and columns.

R: R is a horribly named programming language that works with statistical computing. It is considered one of the more important and most popular languages in data science.

SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet. Yes, that’s cloud servicing. SaaS providers provide services over the cloud rather than hard copies.

Spark (Apache Spark): An open-source computing framework originally developed at University of California, Berkely, Spark was later donated to Apache Software. Spark is mostly used for machine learning and interactive analytics.




开发工具:PyCharm Community Edition(free)


IPython notebook :


scikit-learn sample

keras sample






Approaching (Almost) Any Machine Learning Problem



Hadoop 发行版的选择

大数据应用, Hadoop 仅仅是一个基础, 要用起来还需要安装很多组件, 比如Hive, Mahout, Sqoop, ZooKeeper 等等, 不得不需要考虑兼容性的问题: 版本是否兼容,组件是否有冲突,编译能否通过等, 一大堆事情. 真正要在企业中要用Hadoop, 我一般不推荐直接使用apache hadoop, 使用第三方发行包最稳定/最省事了.
第三方发行商, 有 Cloudera, Hortonworks, MapR, Cloudera 用户数最多, 另外 Hadoop之父目前也供职于Cloudera, 选它基本上没错.

我推荐: Cloudera 发行版

CDH 和 Cloudera Manager 是什么

CDH (Cloudera’s Distribution, including Apache Hadoop), 是Cloudera发行的Hadoop发行版,基于稳定的Hadoop版, 并集成了许多补丁, 可以直接在生产环境中使用.

Cloudera Manager 是 Cloudera 推出的大数据解决方案, 已经在安装/配置/监控方面做了大量的工作.它不仅包含CDH, 而且集成了很多常用的组件, 比如 HBASE, Hue, Impala, Kudu, Oozie, Kafka, Sentry, Solr, Spark, YARN, ZooKeeper 等, 它分为两个版本Cloudera Express 和 Cloudera Enterprise . Cloudera Express免费使用, Cloudera Enterprise 需要支付费用. Express版和Enterprise版差异不算大, 而且可以商用, 缺的只有非常高级的功能以及官方支持.

Cloudera Express和Enterprise的差异: Express版本最高支持50个节点, 足够大多数商业应用使用. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_feature_differences.html

我推荐: Cloudera Express版

Cloudera 产品下载和安装

考虑到网速和墙的因素, 建议离线的方式安装, 即Manual Installation Using Cloudera Manager Tarballs安装方式.
离线安装Cloudera Manager 5和CDH5(最新版5.1.3) 完全教程
Cloudera Manager 5 和 CDH5 本地(离线)安装指南
CDH5 集群中 Spark 集群模式的安装过程配置过程


使用VM是最快的体验环境搭建方式了, Cloudera 提供 QuickStart VM, 我们还有另一个选择, 即 Oracle Big Data Lite VM.
VirtualBox 以及extension pack下载
Cloudera quickstart VM 下载页面 或直接下载链接
Oracle Big data lite VM下载页面:
quickstart VM 配置教程

Cloudera quickstart VM 下载介质较小, 不到5GB, Oracle Big data lite VM大多了, 要30GB. 我推荐Cloudera quickstart VM.
Cloudera quickstart VM中的几个Accounts,
username: cloudera ,password: cloudera
username: root ,password: cloudera
username: root ,password: cloudera
username: other accounts ,password: cloudera
Hue and Cloudera Manager等服务:
username: cloudera ,password: cloudera

在Oracle VM中, 最重要的东西有:

  • Oracle Enterprise Linux 6.7, 基本上可以等同于CentOS 6.7
  • Oracle Database 12.1, 包括一些大数据方面的增强
  • CDH 5.4.7, 挺新的
  • Cloudera Manager 5.4.7

Oracle VM 推荐的最低配置:

  • Host OS 必须是64 bit
  • 分配 2 core
  • 最少 4 GB 内存
  • 初始分配50GB硬盘空间, 需打开自动扩展

VirtualBox虚拟机网络默认采用NAT(网络地址转换模式)模式, 在该模式下, 虚拟机可以通过主机来连接上internet网络, 非常简单, 我也一直使用这种模式.
只能单向访问, 虚拟机可以通过网络访问到主机, 主机无法通过网络访问到虚拟机.
只能单向访问, 虚拟机访问到网络上的其他主机, 但这些主机无法访问到虚拟机.
办法是有的, 通过端口转发即可, 其实quickstart VM已经给我们将VM上常用的大数据服务端口作了映射.比如 VM hue 端口 8888, 映射到host的同一端口上了.
为了防止guest OS和host OS的ssh 22端口冲突, 我将VM的22端口映射到2022, 将VM的Oracle 1521端口映射成主机的2521端口.


hdfs client: 我推荐使用 snakebite 这个pure python 版hdfs client 目前还不支持python 3. https://github.com/spotify/snakebite
Anaconda, 因为snakebite 的缘故, 我还是使用 Anaconda Python2.7版本