Category Archives: R

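Machine Learning Algorithms: Python & R Cheat Sheet

Original source: Cheatsheet – Python & R codes for common Machine Learning Algorithms
Napoleon Hill's classic Think and Grow Rich tells the story of Darby: after years of digging he was about to strike gold, yet he walked away when he was just three feet from it!

Now, I don't know whether this story is true. But I am sure there are people around me just like Darby, who believe that whatever the problem, machine learning comes down to running and reusing two or three algorithms. They never try better algorithms and techniques because they find them too difficult or too time-consuming.

Like Darby, they vanish just before the final step! In the end they give up on machine learning, saying it is computationally expensive, too hard, or that their models have already reached the limit of optimization. Is that really so?

The cheat sheets below can turn these "Darbys" into machine learning advocates. They cover the 10 most commonly used machine learning algorithms, with code in both Python and R. Because machine learning is applied by building models, these cheat sheets work well as a coding guide to help you learn the algorithms. Good luck!

PDF version 1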

from:http://colobu.com/2015/11/05/full-cheatsheet-machine-learning-algorithms

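Quickly Setting Up a Big Data Analytics Environment

Choosing a Hadoop Distribution

For big data applications, Hadoop is only the foundation. To put it to use you also need to install many components, such as Hive, Mahout, Sqoop and ZooKeeper, and you cannot avoid compatibility questions: whether the versions match, whether components conflict, whether everything compiles, and a pile of other issues. For real enterprise use of Hadoop, I generally do not recommend using Apache Hadoop directly; a third-party distribution is the most stable and least troublesome option.
Third-party vendors include Cloudera, Hortonworks and MapR. Cloudera has the most users, and the creator of Hadoop currently works at Cloudera as well, so choosing it is basically a safe bet.

My recommendation: the Cloudera distribution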

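What Are CDH and Cloudera Manager

CDH (Cloudera's Distribution, including Apache Hadoop) is the Hadoop distribution published by Cloudera. It is based on a stable Hadoop release, integrates many patches, and can be used directly in production.

Cloudera Manager is Cloudera's big data solution, and a great deal of work has gone into its installation, configuration and monitoring features. It not only includes CDH but also integrates many commonly used components, such as HBase, Hue, Impala, Kudu, Oozie, Kafka, Sentry, Solr, Spark, YARN and ZooKeeper. It comes in two editions, Cloudera Express and Cloudera Enterprise. Cloudera Express is free to use, while Cloudera Enterprise is paid. The differences between the two are not large, and Express may be used commercially; it lacks only the most advanced features and official support.

Differences between Cloudera Express and Enterprise: the Express edition supports up to 50 nodes, which is enough for most commercial applications. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_feature_differences.html

My recommendation: Cloudera Express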

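Downloading and Installing Cloudera Products

Given network speed and the Great Firewall, I recommend installing offline, that is, using the Manual Installation Using Cloudera Manager Tarballs method.
Some reference articles:
  • Complete tutorial: offline installation of Cloudera Manager 5 and CDH5 (latest version 5.1.3)
  • Cloudera Manager 5 and CDH5 local (offline) installation guide
  • Installing and configuring Spark cluster mode in a CDH5 cluster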


Trying Out a Big Data Platform in a Virtual Machine

Using a VM is the fastest way to set up an environment for trying things out. Cloudera provides the QuickStart VM, and there is another option, the Oracle Big Data Lite VM.
  • VirtualBox and extension pack download
  • Cloudera QuickStart VM download page, or direct download link
  • Oracle Big Data Lite VM download page
  • QuickStart VM configuration tutorial

The Cloudera QuickStart VM download is fairly small, under 5 GB; the Oracle Big Data Lite VM is much larger, about 30 GB. I recommend the Cloudera QuickStart VM.
Accounts in the Cloudera QuickStart VM:
OS:
username: cloudera, password: cloudera
username: root, password: cloudera
MySQL:
username: root, password: cloudera
username: other accounts, password: cloudera
Hue, Cloudera Manager and other services:
username: cloudera, password: cloudera

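The most important things in the Oracle VM are:

  • Oracle Enterprise Linux 6.7, essentially equivalent to CentOS 6.7
  • Oracle Database 12.1, including some big data enhancements
  • CDH 5.4.7, fairly recent
  • Cloudera Manager 5.4.7

Recommended minimum configuration for the Oracle VM:

  • The host OS must be 64-bit
  • Allocate 2 cores
  • At least 4 GB of memory
  • Initially allocate 50 GB of disk space, with auto-expansion enabled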

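Notes on network settings for VirtualBox VMs:
VirtualBox VMs use NAT (network address translation) mode by default. In this mode the VM reaches the internet through the host, which is very simple, and it is the mode I always use.
  • VM and host: access is one-way only. The VM can reach the host over the network, but the host cannot reach the VM.
  • VM and other hosts on the network: access is one-way only. The VM can reach other hosts on the network, but those hosts cannot reach the VM.
  • VM and other VMs: they cannot reach each other.
Is there a way for the host to access the VM?
Yes, through port forwarding. In fact, the QuickStart VM already maps the commonly used big data service ports on the VM for us. For example, the VM's Hue port 8888 is mapped to the same port on the host.
To keep the guest OS and host OS from conflicting on SSH port 22, I mapped the VM's port 22 to 2022, and the VM's Oracle port 1521 to port 2521 on the host.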

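Installing the Python Environment

hdfs client: I recommend snakebite, a pure-Python HDFS client. It does not yet support Python 3. https://github.com/spotify/snakebite
Anaconda: because of snakebite, I still use the Anaconda Python 2.7 distribution.

Several Datasets Suitable for Big Data Analysis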

from:http://www.cnblogs.com/harrychinese/p/big_data_platform_quickstart.html

R – Quick Guide

R – Overview

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

R is free software distributed under a GNU-style copyleft, and is an official part of the GNU project called GNU S.

Evolution of R

R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.

  • A large group of individuals has contributed to R by sending code and bug reports.
  • Since mid-1997 there has been a core group (the “R Core Team”) who can modify the R source code archive.

Features of R

As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R −

  • R is a well-developed, simple and effective programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
  • R has an effective data handling and storage facility.
  • R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
  • R provides a large, coherent and integrated collection of tools for data analysis.
  • R provides graphical facilities for data analysis and display, either directly on screen or printed on paper.

In short, R is one of the most widely used languages for statistical programming. It is a popular choice among data scientists and is supported by a vibrant and talented community of contributors. R is taught in universities and deployed in mission-critical business applications. This tutorial teaches R programming with suitable examples in simple and easy steps.

R – Environment Setup


Local Environment Setup

To set up a local environment for R, follow the steps given below.

Windows Installation

You can download the Windows installer version of R from R-3.2.2 for Windows (32/64 bit) and save it in a local directory.

The download is a Windows installer (.exe) with a name of the form “R-version-win.exe”. Double-click it and run the installer, accepting the default settings. On 32-bit Windows it installs the 32-bit version; on 64-bit Windows it installs both the 32-bit and 64-bit versions.

After installation you can locate the icon to run the Program in a directory structure “R\R3.2.2\bin\i386\Rgui.exe” under the Windows Program Files. Clicking this icon brings up the R-GUI which is the R console to do R Programming.

Linux Installation

R is available as a binary for many versions of Linux at the location R Binaries.

The instructions for installing R on Linux vary from flavor to flavor. The steps for each type of Linux are given at the link mentioned above. However, if you are in a hurry and your distribution uses yum, you can install R with a command such as yum install R.

This installs the core functionality of R along with the standard packages. If you need an additional package, launch the R prompt by typing R.

At the R prompt you can then install the required package. For example, install.packages("plotrix") installs the plotrix package, which is required for 3D charts.

R – Basic Syntax

As a convention, we will start learning R programming by writing a “Hello, World!” program. Depending on the needs, you can program either at R command prompt or you can use an R script file to write your program. Let’s check both one by one.

R Command Prompt

Once you have the R environment set up, you can start the R command prompt by typing R at your command prompt.

This launches the R interpreter and gives you a prompt > where you can start typing your program.

A minimal program has two statements: the first defines a string variable myString and assigns it the string “Hello, World!”, and the second calls print() to display the value stored in myString.
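The two statements described above can be sketched as follows; they can be typed one at a time at the > prompt.

```r
# Assign a string to the variable myString, then print its value.
myString <- "Hello, World!"
print(myString)   # [1] "Hello, World!"
```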

R Script File

Usually, you will write your programs in script files and then execute those scripts at your command prompt with the R script runner, Rscript. For example, put the same two statements (the assignment to myString and the print() call) in a text file called test.R.

Save the file and execute it at the Linux command prompt by running Rscript test.R. The syntax is the same on Windows and other systems.

The program prints [1] "Hello, World!".

Comments

Comments are explanatory text in your R program; the interpreter ignores them while executing the program. A single-line comment starts with # and runs to the end of the line.

R does not support multi-line comments, but a common trick is to put the text in a quoted string inside an if(FALSE) block.

Although the interpreter still parses such a block, it never evaluates it, so the text does not interfere with your program. The text must be enclosed in either single or double quotes.
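A minimal sketch of both comment styles follows. The variable name greeting is only illustrative, and the if (FALSE) block is one common form of the trick, not necessarily the exact form the original listing used.

```r
# A single-line comment: R ignores everything after the # sign.
greeting <- "Hello"   # a comment can also follow a statement

# Multi-line "comment" trick: the quoted text sits in a block that never runs.
if (FALSE) {
  "This text spans
   several lines. It is parsed but never evaluated."
}

print(greeting)
```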

R – Data Types

Generally, while doing programming in any programming language, you need to use various variables to store various information. Variables are nothing but reserved memory locations to store values. This means that, when you create a variable you reserve some space in memory.

You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean etc. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.

In contrast to other programming languages like C and Java, variables in R are not declared with a data type. Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are −

  • Vectors
  • Lists
  • Matrices
  • Arrays
  • Factors
  • Data Frames

The simplest of these objects is the vector object and there are six data types of these atomic vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic vectors.

Data Type − Example
  • Logical − TRUE, FALSE
  • Numeric − 12.3, 5, 999
  • Integer − 2L, 34L, 0L
  • Complex − 3 + 2i
  • Character − 'a', "good", "TRUE", '23.4'
  • Raw − "Hello" is stored as 48 65 6c 6c 6f
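The six atomic classes listed above can be checked with class(). This is a minimal sketch; the variable names are illustrative.

```r
v_logical   <- TRUE
v_numeric   <- 23.5
v_integer   <- 2L
v_complex   <- 2+5i
v_character <- "TRUE"
v_raw       <- charToRaw("Hello")

print(class(v_logical))    # "logical"
print(class(v_numeric))    # "numeric"
print(class(v_integer))    # "integer"
print(class(v_complex))    # "complex"
print(class(v_character))  # "character"
print(class(v_raw))        # "raw"
print(v_raw)               # 48 65 6c 6c 6f
```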

In R programming, the very basic data types are the R-objects called vectors, which hold elements of the different classes shown above. Note that in R the number of classes is not confined to these six types. For example, we can combine many atomic vectors into an array, whose class is array.

Vectors

When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
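A minimal sketch of building a vector with c(); the variable name apple is illustrative.

```r
# Combine three strings into one character vector.
apple <- c("red", "green", "yellow")
print(apple)
print(class(apple))    # "character"
print(length(apple))   # 3
```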


Lists

A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
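As a small illustration (names are hypothetical), the list below mixes a numeric vector, a single number, a function and a nested list.

```r
list1 <- list(c(2, 5, 3), 21.3, sin, list("a", TRUE))
print(length(list1))        # 4
print(class(list1[[1]]))    # "numeric"
print(list1[[3]](0))        # calls sin(0), which is 0
```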


Matrices

A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
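A minimal sketch of the matrix() call described above, filling a 2×3 matrix row by row; the data values are illustrative.

```r
M <- matrix(c("a", "a", "b", "c", "b", "a"), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
print(dim(M))     # 2 3
print(M[2, 1])    # "c": first element of the second row
```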


Arrays

While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that sets the size of each dimension. For example, dim = c(3, 3, 2) creates an array made up of two 3×3 matrices.
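A minimal sketch of such an array; the two input values are illustrative and are recycled to fill all 18 cells.

```r
a <- array(c("green", "yellow"), dim = c(3, 3, 2))
print(dim(a))       # 3 3 2
print(a[, , 1])     # the first 3x3 matrix
print(length(a))    # 18
```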


Factors

Factors are R-objects created from a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character, Boolean, etc. Factors are useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the count of levels.
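A minimal sketch of factor() and nlevels(); the colour data are illustrative.

```r
apple_colors <- c("green", "green", "yellow", "red", "red", "red", "green")
factor_apple <- factor(apple_colors)

print(factor_apple)
print(levels(factor_apple))    # the distinct values, stored as character labels
print(nlevels(factor_apple))   # 3
```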


Data Frames

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character, and the third logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.
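A minimal sketch of data.frame() with columns of different modes; the column names and values are illustrative.

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  age    = c(42L, 38L, 26L)
)
print(BMI)
print(sapply(BMI, class))   # one class per column
```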


R – Variables

A variable provides us with named storage that our programs can manipulate. A variable in R can store an atomic vector, a group of atomic vectors, or a combination of many R-objects. A valid variable name consists of letters, numbers, and the dot or underscore characters. It must start with a letter, or with a dot that is not followed by a number.

Variable Name − Validity − Reason
  • var_name2. − valid − Has letters, numbers, dot and underscore
  • var_name% − invalid − Has the character '%'. Only dot (.) and underscore are allowed.
  • 2var_name − invalid − Starts with a number
  • .var_name, var.name − valid − Can start with a dot (.), but the dot (.) should not be followed by a number.
  • .2var_name − invalid − The starting dot is followed by a number, making it invalid.
  • _var_name − invalid − Starts with _, which is not valid

Variable Assignment

Variables can be assigned values using the leftward (<-), rightward (->) and equal-to (=) operators. The values of variables can be printed using the print() or cat() function. The cat() function combines multiple items into one continuous print output.
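A minimal sketch of the three assignment forms and of cat(); the variable names are illustrative.

```r
var.1 = c(0, 1, 2, 3)        # assignment using the equal-to operator
var.2 <- c("learn", "R")     # assignment using the leftward operator
c(TRUE, 1) -> var.3          # assignment using the rightward operator

print(var.1)
cat("var.1 is", var.1, "\n")
cat("var.2 is", var.2, "\n")
cat("var.3 is", var.3, "\n")
```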


Note − The vector c(TRUE,1) has a mix of logical and numeric class. So logical class is coerced to numeric class making TRUE as 1.

Data Type of a Variable

In R, a variable itself is not declared with a data type; rather, it takes the data type of the R-object assigned to it. R is therefore called a dynamically typed language, which means the data type of the same variable can change again and again over the course of a program.
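A minimal sketch of dynamic typing: the same variable (var_x, an illustrative name) takes on three classes in turn.

```r
var_x <- "Hello"
cat("The class of var_x is", class(var_x), "\n")   # character
first_class <- class(var_x)

var_x <- 34.5
cat("Now the class of var_x is", class(var_x), "\n")   # numeric

var_x <- 27L
cat("Next the class of var_x becomes", class(var_x), "\n")   # integer
```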


Finding Variables

To list all the variables currently available in the workspace, we use the ls() function.


The ls() function can use patterns to match the variable names.


The variables starting with dot(.) are hidden, they can be listed using “all.names = TRUE” argument to ls() function.
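The sketch below exercises the three uses of ls() described above. It works in a scratch environment (ws, a hypothetical name) rather than the real workspace, so the listing does not depend on what else is defined in your session.

```r
ws <- new.env()
assign("var.1", 1, envir = ws)
assign("var_name", "a", envir = ws)
assign("my_data", 1:3, envir = ws)
assign(".hidden", TRUE, envir = ws)

print(ls(envir = ws))                      # visible names only
print(ls(envir = ws, pattern = "var"))     # names that match the pattern "var"
print(ls(envir = ws, all.names = TRUE))    # also lists names starting with a dot
```

Calling ls() with no envir argument at the prompt lists the workspace itself.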


Deleting Variables

Variables can be deleted by using the rm() function. For example, after rm(var.3) deletes the variable var.3, trying to print its value throws an error.


All the variables can be deleted by using the rm() and ls() function together.
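A minimal sketch of both forms of deletion. It uses a scratch environment (a hypothetical name) so that nothing in your real workspace is removed; at the prompt, the equivalent calls are rm(var.3) and rm(list = ls()).

```r
scratch <- new.env()
assign("var.1", 1:3, envir = scratch)
assign("var.3", c(TRUE, 1), envir = scratch)

# Delete a single variable.
rm("var.3", envir = scratch)
print(exists("var.3", envir = scratch, inherits = FALSE))   # FALSE

# Delete everything that ls() reports.
rm(list = ls(envir = scratch), envir = scratch)
print(ls(envir = scratch))   # character(0)
```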


R – Operators

An operator is a symbol that tells the interpreter to perform specific mathematical or logical manipulations. The R language is rich in built-in operators and provides the following types of operators.

Types of Operators

We have the following types of operators in R programming −

  • Arithmetic Operators
  • Relational Operators
  • Logical Operators
  • Assignment Operators
  • Miscellaneous Operators

Arithmetic Operators

Following table shows the arithmetic operators supported by R language. The operators act on each element of the vector.

Operator − Description
  • + − Adds two vectors
  • - (minus sign) − Subtracts the second vector from the first
  • * − Multiplies both vectors
  • / − Divides the first vector by the second
  • %% − Gives the remainder of dividing the first vector by the second
  • %/% − Gives the result of dividing the first vector by the second (quotient)
  • ^ − Raises the first vector to the exponent of the second vector
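A minimal sketch applying each arithmetic operator element by element; the vectors v and w are illustrative.

```r
v <- c(2, 5.5, 6)
w <- c(8, 3, 4)

print(v + w)     # 10, 8.5, 10
print(v - w)     # -6, 2.5, 2
print(v * w)     # 16, 16.5, 24
print(v / w)     # 0.25, 1.8333..., 1.5
print(v %% w)    # 2, 2.5, 2
print(v %/% w)   # 0, 1, 1
print(v ^ w)     # 256, 166.375, 1296
```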

Relational Operators

Following table shows the relational operators supported by R language. Each element of the first vector is compared with the corresponding element of the second vector. The result of comparison is a Boolean value.

Operator − Description
  • > − Checks if each element of the first vector is greater than the corresponding element of the second vector.
  • < − Checks if each element of the first vector is less than the corresponding element of the second vector.
  • == − Checks if each element of the first vector is equal to the corresponding element of the second vector.
  • <= − Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
  • >= − Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
  • != − Checks if each element of the first vector is unequal to the corresponding element of the second vector.
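A minimal sketch of the element-wise comparisons; the vectors are illustrative.

```r
v <- c(2, 5.5, 6, 9)
w <- c(8, 2.5, 14, 9)

print(v > w)    # FALSE TRUE FALSE FALSE
print(v < w)    # TRUE FALSE TRUE FALSE
print(v == w)   # FALSE FALSE FALSE TRUE
print(v <= w)   # TRUE FALSE TRUE TRUE
print(v >= w)   # FALSE TRUE FALSE TRUE
print(v != w)   # TRUE TRUE TRUE FALSE
```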

Logical Operators

The following table shows the logical operators supported by the R language. They apply only to vectors of type logical, numeric or complex. All nonzero numbers are treated as the logical value TRUE, and zero as FALSE.

Each element of the first vector is combined with the corresponding element of the second vector. The result is a Boolean value.

Operator − Description
  • & − Element-wise Logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE.
  • | − Element-wise Logical OR operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
  • ! − Logical NOT operator. Takes each element of the vector and gives the opposite logical value.

The logical operators && and || consider only the first element of each vector and give a single-element result. In R 4.3.0 and later, passing an operand of length greater than one to && or || is an error, so they should be given single values.

Operator − Description
  • && − Logical AND operator. Takes the first element of both vectors and gives TRUE only if both are TRUE.
  • || − Logical OR operator. Takes the first element of both vectors and gives TRUE if at least one of them is TRUE.
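A minimal sketch of the logical operators, with illustrative vectors. The && and || calls are given single values on purpose, because recent R versions reject longer operands.

```r
v <- c(3, 0, TRUE, 2)    # stored as 3, 0, 1, 2; nonzero values count as TRUE
w <- c(4, 0, FALSE, 5)   # stored as 4, 0, 0, 5

print(v & w)   # TRUE FALSE FALSE TRUE
print(v | w)   # TRUE FALSE TRUE TRUE
print(!v)      # FALSE TRUE FALSE FALSE

print(v[1] && w[1])   # TRUE: both 3 and 4 are nonzero
print(v[2] || w[2])   # FALSE: both are zero
```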

Assignment Operators

These operators are used to assign values to vectors.

Operator − Description
  • <- or = or <<- − Called Left Assignment
  • -> or ->> − Called Right Assignment
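A minimal sketch of the assignment operators. The counter and bump names are hypothetical; bump() is only there to show that <<- assigns to a variable found outside the function, which is the usual reason to use it.

```r
v1 <- c(3, 1, TRUE)     # left assignment
v2 = c(3, 1, TRUE)      # left assignment with the equal sign
c(3, 1, TRUE) -> v3     # right assignment

counter <- 0
bump <- function() {
  # <<- updates the existing 'counter' outside this function
  counter <<- counter + 1
  invisible(counter)
}
bump()
bump()
print(counter)   # 2
```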

Miscellaneous Operators

These operators are used for specific purposes rather than general mathematical or logical computation.

Operator − Description
  • : − Colon operator. It creates a series of numbers in sequence for a vector.
  • %in% − Identifies whether an element belongs to a vector.
  • %*% − Matrix multiplication. For example, it can multiply a matrix by its transpose.
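A minimal sketch of the three operators; the values are illustrative.

```r
v <- 2:8              # the sequence 2, 3, 4, 5, 6, 7, 8
print(v)
print(5 %in% v)       # TRUE
print(12 %in% v)      # FALSE

M <- matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
P <- M %*% t(M)       # a 2x3 matrix times its 3x2 transpose gives a 2x2 matrix
print(P)              # rows: 65 82 and 82 117
```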

R – Decision making

Decision making structures require the programmer to specify one or more conditions to be evaluated or tested by the program, along with a statement or statements to be executed if the condition is determined to be true, and optionally, other statements to be executed if the condition is determined to be false.

[Figure: Decision Making, the general form of a typical decision making structure found in most programming languages]

R provides the following types of decision making statements.

Sr.No. − Statement − Description
  • 1 − if statement − An if statement consists of a Boolean expression followed by one or more statements.
  • 2 − if…else statement − An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
  • 3 − switch statement − A switch statement allows a variable to be tested for equality against a list of values.
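A minimal sketch of the three statements; the variable names and messages are illustrative.

```r
x <- 30L

# if statement
if (is.integer(x)) {
  print("x is an integer")
}

# if...else statement
if (x > 100) {
  size <- "large"
} else {
  size <- "small"
}
print(size)     # "small"

# switch statement: picks the value whose name matches the first argument
choice <- switch("b", a = "first", b = "second", c = "third")
print(choice)   # "second"
```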

R – Loops

There may be situations when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on.

Programming languages provide various control structures that allow for more complicated execution paths.

A loop statement allows us to execute a statement or group of statements multiple times.

[Figure: Loop Architecture, the general form of a loop statement in most programming languages]

The R programming language provides the following kinds of loop to handle looping requirements.

Sr.No. − Loop Type − Description
  • 1 − repeat loop − Executes a sequence of statements again and again until a break statement stops it.
  • 2 − while loop − Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
  • 3 − for loop − Executes a statement or group of statements once for each element of a vector or list.
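A minimal sketch of the three loop types; the counters and values are illustrative.

```r
# repeat loop: runs until break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
print(count)   # 3

# while loop: the condition is tested before each pass
n <- 0
while (n < 5) {
  n <- n + 1
}
print(n)       # 5

# for loop: one pass per element of the vector
total <- 0
for (value in c(10, 20, 30)) {
  total <- total + value
}
print(total)   # 60
```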

Loop Control Statements

Loop control statements change execution from its normal sequence.

R supports the following control statements.

Sr.No. − Control Statement − Description
  • 1 − break statement − Terminates the loop statement and transfers execution to the statement immediately following the loop.
  • 2 − next statement − Skips the rest of the current iteration and continues with the next iteration of the loop.
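A minimal sketch of break and next inside a for loop; the name evens is illustrative.

```r
evens <- c()
for (i in 1:10) {
  if (i %% 2 == 1) next   # odd number: skip to the next iteration
  if (i > 6) break        # stop the loop entirely once past 6
  evens <- c(evens, i)
}
print(evens)   # 2 4 6
```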

R – Functions

A function is a set of statements organized together to perform a specific task. R has a large number of in-built functions and the user can create their own functions.

In R, a function is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions.

The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects.

Function Definition

An R function is created by using the keyword function. The basic form of a definition is function_name <- function(arg_1, arg_2, ...) { function body }.

Function Components

The different parts of a function are −

  • Function Name − This is the actual name of the function. It is stored in R environment as an object with this name.
  • Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Also arguments can have default values.
  • Function Body − The function body contains a collection of statements that defines what the function does.
  • Return Value − The return value of a function is the last expression in the function body to be evaluated.
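The four components listed above can be seen in a small example. The function name rectangle_area and its arguments are hypothetical.

```r
# Function name: rectangle_area
# Arguments: width (required) and height (optional, default 1)
rectangle_area <- function(width, height = 1) {
  # Function body
  area <- width * height
  # Return value: the last expression evaluated
  area
}

print(rectangle_area(3, 4))   # 12
print(rectangle_area(5))      # 5
```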

R has many in-built functions which can be directly called in the program without defining them first. We can also create and use our own functions referred as user defined functions.

Built-in Function

Simple examples of built-in functions are seq(), mean(), max(), sum(x) and paste(...). They are called directly by user-written programs. For more, refer to the list of most widely used R functions.
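A minimal sketch calling each of these built-in functions; the input ranges are illustrative.

```r
print(seq(32, 44))           # the sequence from 32 to 44
print(mean(25:82))           # 53.5
print(max(c(3, 9, 4)))       # 9
print(sum(41:68))            # 1526
print(paste("Hello", "R"))   # "Hello R"
```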


User-defined Function

We can create user-defined functions in R. They are specific to what a user wants and once created they can be used like the built-in functions. Below is an example of how a function is created and used.

Calling a Function

A function is called by writing its name followed by the argument values in parentheses.
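A minimal sketch of defining a function and then calling it; squares_up_to is a hypothetical name.

```r
# Return the squares of the numbers 1..a.
squares_up_to <- function(a) {
  result <- c()
  for (i in 1:a) {
    result <- c(result, i^2)
  }
  result
}

print(squares_up_to(4))   # 1 4 9 16
```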

Calling a Function without an Argument

A function defined without arguments is called with empty parentheses.
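A minimal sketch with a hypothetical function name:

```r
# No arguments: the function always returns the first five squares.
first_squares <- function() {
  (1:5)^2
}

print(first_squares())   # 1 4 9 16 25
```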

Calling a Function with Argument Values (by position and by name)

The arguments to a function call can be supplied in the same sequence as defined in the function or they can be supplied in a different sequence but assigned to the names of the arguments.
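A minimal sketch of both calling styles; the function combine is hypothetical.

```r
combine <- function(a, b, c) {
  a * b + c
}

print(combine(5, 3, 11))               # by position: 5 * 3 + 11 = 26
print(combine(a = 11, b = 5, c = 3))   # by name: 11 * 5 + 3 = 58
print(combine(c = 3, a = 11, b = 5))   # by name, different order: also 58
```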


Calling a Function with Default Argument

We can define the value of the arguments in the function definition and call the function without supplying any argument to get the default result. But we can also call such functions by supplying new values of the argument and get non default result.
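A minimal sketch of default arguments; the function multiply is hypothetical.

```r
multiply <- function(a = 3, b = 6) {
  a * b
}

print(multiply())        # uses the defaults: 18
print(multiply(9, 5))    # overrides both: 45
print(multiply(b = 2))   # overrides only b: 6
```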


Lazy Evaluation of Function

Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
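A minimal sketch of lazy evaluation, using two hypothetical functions. The first never touches b, so omitting it is harmless; the second uses b, so the missing argument only causes an error at the point of use.

```r
lazy_demo <- function(a, b) {
  a^2            # b is never used, so it is never evaluated
}
print(lazy_demo(6))   # 36, even though b was not supplied

strict_demo <- function(a, b) {
  a^2 + b        # b is needed here
}
result <- tryCatch(
  strict_demo(6),
  error = function(e) "error: b is missing"
)
print(result)    # "error: b is missing"
```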


R – Strings

Any value written within a pair of single quote or double quotes in R is treated as a string. Internally R stores every string within double quotes, even when you create them with single quote.

Rules Applied in String Construction

  • The quotes at the beginning and end of a string must be both double quotes or both single quotes. They cannot be mixed.
  • Double quotes can be inserted into a string starting and ending with single quotes.
  • A single quote can be inserted into a string starting and ending with double quotes.
  • Double quotes cannot be inserted into a string starting and ending with double quotes unless they are escaped with a backslash.
  • A single quote cannot be inserted into a string starting and ending with single quotes unless it is escaped with a backslash.

Examples of Valid Strings

Following examples clarify the rules about creating a string in R.
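A minimal sketch of strings that satisfy the rules; the variable names are illustrative. The last one shows a backslash escape, which lets a double quote appear inside a double-quoted string.

```r
s1 <- 'Start and end with single quote'
s2 <- "Start and end with double quotes"
s3 <- "single quote ' inside double quotes"
s4 <- 'double quotes " inside single quotes'
s5 <- "escaped \" double quote inside double quotes"

for (s in c(s1, s2, s3, s4, s5)) {
  print(s)
}
```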


Examples of Invalid Strings