GitHub API: https://developer.github.com/v3/
APIX: a leading Chinese cloud data service and API platform
API data interfaces: custom data for developers
API Store: comprehensive API services for developers
91cha.com: free API data calls
Category Archives: BigData
Percolator, Dremel and Pregel: Alternatives to Hadoop
Hadoop (a MapReduce framework: code is expressed as map and reduce jobs, and Hadoop runs those jobs) is great at crunching data yet inefficient for analyzing it, because every time you add, change or otherwise manipulate data you must stream over the entire dataset again.
In most organizations data is constantly growing, changing and being manipulated, so the time required to analyze it keeps increasing.
As a result, for processing large and diverse data sets, ad-hoc analytics, or graph data structures, there need to be better alternatives to Hadoop / MapReduce.
Google (whose MapReduce paper Hadoop was modeled on) thought so, and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel and Pregel. Google remains one of the key innovators in large-scale architecture.
Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with one based on incremental processing using Percolator, you significantly speed up the process and reduce the time to analyze data.
Percolator’s architecture provides horizontal scalability and resilience. It reduces the latency (the time between a page being crawled and its appearance in the index) by a factor of 100 and simplifies the indexing algorithm. Percolator’s big advantage is that indexing time becomes proportional to the size of the page being indexed, no longer to the size of the whole existing index.
See: http://research.google.com/pubs/pub36726.html
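To make the idea concrete, here is a toy Python sketch (not Percolator's actual transactional API) contrasting incremental updates to an inverted index with batch rebuilding: the incremental path only touches the newly crawled page, so its cost does not grow with the size of the index.

# Toy contrast between incremental and batch indexing (illustrative only).
from collections import defaultdict

inverted_index = defaultdict(set)  # word -> set of page ids

def index_page(page_id, text):
    # Incremental update: work is proportional to this one page.
    for word in text.lower().split():
        inverted_index[word].add(page_id)

def rebuild_index(all_pages):
    # Batch alternative: every run re-scans the entire corpus.
    inverted_index.clear()
    for page_id, text in all_pages.items():
        index_page(page_id, text)

index_page("p1", "big data systems")
index_page("p2", "graph data processing")
print(sorted(inverted_index["data"]))  # ['p1', 'p2']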
Dremel is for ad-hoc analytics: a scalable, interactive query system for the analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it can run aggregation queries over trillion-row tables in seconds, roughly 100 times faster than MapReduce. The system scales to thousands of CPUs and petabytes of data, allowing analysts to answer queries over petabytes of data in seconds.
Dremel’s architecture is similar to that of Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
See: http://research.google.com/pubs/pub36632.html
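A tiny Python sketch (an illustration of the columnar idea only, not Dremel's nested column format) shows why a columnar layout suits aggregation queries: the query reads just the column it needs instead of walking every full record.

# Row layout vs. column layout for a simple aggregation (illustrative only).
rows = [("alice", 34, "US"), ("bob", 28, "DE"), ("carol", 41, "US")]
avg_age_row_layout = sum(r[1] for r in rows) / len(rows)  # must touch whole rows

columns = {
    "name": ["alice", "bob", "carol"],
    "age": [34, 28, 41],
    "country": ["US", "DE", "US"],
}
avg_age_column_layout = sum(columns["age"]) / len(columns["age"])  # touches one column

print(avg_age_row_layout, avg_age_column_layout)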
Pregel is a system for large-scale graph processing and graph data analysis. It is designed to execute graph algorithms quickly with simple code: it computes over large graphs much faster than the alternatives, and its application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
See: http://kowshik.github.com/JPregel/pregel_paper.pdf
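The vertex-centric model can be sketched on a single machine in a few lines of Python (a toy illustration, not Pregel's real C++ API): in each superstep every vertex combines the messages sent to it, updates its own value, and sends new messages along its out-edges. PageRank is the canonical example.

# Toy superstep loop in the Pregel style: PageRank on a three-vertex graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-neighbors
rank = {v: 1.0 / len(graph) for v in graph}

for superstep in range(30):
    # Messages "sent" along out-edges during the previous superstep.
    incoming = {v: 0.0 for v in graph}
    for v, out_neighbors in graph.items():
        for n in out_neighbors:
            incoming[n] += rank[v] / len(out_neighbors)
    # Each vertex updates its own value from its incoming messages.
    rank = {v: 0.15 / len(graph) + 0.85 * incoming[v] for v in graph}

print(rank)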
from:http://www.analyticbridge.com/profiles/blogs/percolator-dremel-and-pregel-alternatives-to-hadoop
Analysis of Lianjia's 2016 Second-Hand Housing Transaction Data
Top 20 neighborhoods in Haidian by number of transactions in 2016:
Neighborhood | Sub-district | Avg. unit price (CNY/m²) | Avg. total price (10k CNY) | Avg. area (m²) | Transactions |
保利西山林语 | 海淀北部新区 | 43637 | 375.5 | 84 | 179 |
育新花园 | 西三旗 | 66225 | 462.1 | 71 | 164 |
小南庄社区 | 苏州桥 | 82135 | 523.3 | 65 | 134 |
永泰东里 | 清河 | 52026 | 308.5 | 60 | 128 |
旗胜家园 | 西三旗 | 30422 | 268.3 | 88 | 126 |
上地东里 | 上地 | 93510 | 727.6 | 84 | 124 |
SOCO公社 | 西三旗 | 27626 | 136.4 | 50 | 115 |
八里庄北里 | 定慧寺 | 70046 | 411.9 | 59 | 114 |
东南小区 | 中关村 | 104427 | 685.4 | 66 | 99 |
Top 10 neighborhoods in Haidian by highest average unit price in 2016:
Neighborhood | Sub-district | Avg. unit price (CNY/m²) | Avg. total price (10k CNY) | Avg. area (m²) | Transactions |
万城华府 | 万柳 | 126415 | 2746.7 | 221 | 3 |
蜂鸟家园 | 万柳 | 115048 | 572.1 | 50 | 65 |
玉渊潭南路9号院 | 军博 | 112900 | 1279.6 | 115 | 5 |
保利海德公园 | 知春路 | 110739 | 935.3 | 82 | 4 |
科育小区 | 中关村 | 106045 | 639.8 | 61 | 60 |
康桥水郡 | 万柳 | 104725 | 820.2 | 81 | 6 |
东南小区 | 中关村 | 104427 | 685.4 | 66 | 99 |
碧水云天 | 万柳 | 103362 | 1382.1 | 134 | 29 |
航天社区 | 中关村 | 102935 | 700.7 | 68 | 16 |
The 10 neighborhoods in Haidian with the lowest average unit price in 2016:
Neighborhood | Sub-district | Avg. unit price (CNY/m²) | Avg. total price (10k CNY) | Avg. area (m²) | Transactions |
枫丹2号 | 西三旗 | 18012 | 73.3 | 40 | 8 |
颐丰庄园 | 海淀北部新区 | 23084 | 129.5 | 56 | 1 |
车公庄西路35号院 | 紫竹桥 | 23729 | 140.0 | 59 | 1 |
SOCO公社 | 西三旗 | 27626 | 136.4 | 50 | 115 |
宜品上层 | 上地 | 27679 | 153.4 | 56 | 21 |
专家国际公馆 | 清河 | 27941 | 197.0 | 71 | 9 |
辉煌国际 | 上地 | 29528 | 179.5 | 61 | 11 |
华杰大厦 | 皂君庙 | 29573 | 398.0 | 135 | 1 |
前沙涧路7号院 | 海淀其它 | 30255 | 265.0 | 88 | 1 |
For more detailed data, see: Lianjia 2016 transaction data analysis
SQL used for the analysis:
1. Transaction data for a single neighborhood, aggregated by year
select name, regionb, strftime('%Y', sign_time) as sign_year,
       avg(unit_price) as unit_price, avg(total_price), avg(area)
from chengjiao
where unit_price <> '' and unit_price > 6000
group by name, regionb, sign_year
having name = '荣丰2008'
order by name, sign_year asc, unit_price desc
2. Beijing transaction data, aggregated by district
select name, regionb, regions, avg(unit_price) as unit_price,
       avg(total_price) as total_price, avg(area) as area, count(1) as num
from chengjiao
where unit_price <> '' and regionb = '海淀' and sign_time > '2016-01-01 00:00:00'
group by name, regionb, regions
-- having name = '当代城市家园'
order by unit_price desc
-- where name = '小南庄社区' and unit_price <> '' order by unit_price desc
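The strftime() call suggests these queries were run against SQLite. Assuming the transactions are stored in a SQLite table named chengjiao (the file name lianjia.db below is a made-up placeholder), query 2 can be reproduced from Python roughly like this:

import sqlite3

conn = sqlite3.connect("lianjia.db")  # hypothetical database file
cur = conn.cursor()

# Per-neighborhood averages in Haidian for 2016, as in query 2 above.
cur.execute("""
    select name, regionb, regions,
           avg(unit_price), avg(total_price), avg(area), count(1) as num
    from chengjiao
    where unit_price <> '' and regionb = '海淀'
      and sign_time > '2016-01-01 00:00:00'
    group by name, regionb, regions
    order by avg(unit_price) desc
""")
for row in cur.fetchmany(10):
    print(row)
conn.close()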
Data source: Lianjia second-hand housing transaction records (2016-01-01 to 2016-11-23)
A Beginner’s Guide to Big Data Terminology
Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.
Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.
Algorithms: Mathematical formulas or statistical processes used to analyze data. These are used in software to process and analyze any input data.
Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:
Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.
Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.
Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.
Cloud: It’s available anywhere and everywhere. Cloud computing simply means storing or accessing data (programs, files, data) over the internet instead of on a local hard drive.
DaaS: Data-as-a-Service treats data as a product. DaaS providers use the cloud to give customers on-demand access to data. This allows companies to get high-quality data quickly. DaaS was a popular term in 2015 and is playing a major role in marketing.
Data Mining: Data miners explore large sets of data to find patterns and insights. This is a highly analytical process that emphasizes making use of large datasets, and it often involves artificial intelligence, machine learning or statistics.
Dark Data: This is information that is gathered and processed by a business but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data lying around without even realizing it.
Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), software that allows data to be explored and analyzed.
Hadoop (Apache Hadoop): An open-source software framework, Hadoop works largely by storing files and processing data. It is also known for its large processing power, making it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormous amounts of data. Apache also maintains other related projects you may run into: Pig, Hive, and now Spark (more on Spark later).
IoT: The Internet of Things is generally described as the way products are able to “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are a perfect example: they constantly pull information from the cloud while their sensors relay information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:
IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.
Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.
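As a tiny concrete illustration (a minimal sketch, not a realistic workflow), the scikit-learn library mentioned later in this post fits a model to labeled examples and then predicts labels for new data in a few lines:

from sklearn.linear_model import LogisticRegression

# Toy labeled data: two features per example, binary labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X, y)                    # the "learning" step
print(model.predict([[1, 0.5]]))   # label for an unseen example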
MapReduce: MapReduce is a programming model for processing and generating large data sets. The model actually does two distinct things. First, “Map” turns one dataset into another, more useful dataset broken down into key-value pairs called tuples. Second, “Reduce” takes all of those tuples, groups them by key, and aggregates each group into a smaller, summarized result. The output is a practical breakdown of the information.
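The classic word-count example makes the two phases concrete. Below is a local Python sketch of the model itself (not Hadoop's Java API): Map emits (word, 1) tuples, the tuples are grouped by key, and Reduce aggregates each group.

from collections import defaultdict

documents = ["big data is big", "data about data"]

# Map: turn each document into (word, 1) tuples.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the tuples by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into a single (word, total) result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}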
Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.
NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not store data in relational tables with rows and columns. It is used to manage large, often unstructured or streaming data. NoSQL covers a number of different databases and models that scale horizontally, meaning across servers, which can make it more cost-effective than the vertical scaling typical of SQL systems.
Petabyte: Yes, it’s big: 1,000,000,000,000,000 bytes. To visualize it, Gizmodo described one petabyte as 20 million four-drawer filing cabinets filled with text, and 20 petabytes as all the written works of mankind from the beginning of time translated into every language.
SQL: Short for Structured Query Language, SQL is used to manage data and to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in relational tables with rows and columns.
R: R is a horribly named programming language designed for statistical computing. It is considered one of the more important and most popular languages in data science.
SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet; yes, that’s cloud servicing. SaaS providers deliver services over the cloud rather than as locally installed copies.
Spark (Apache Spark): An open-source computing framework originally developed at the University of California, Berkeley, Spark was later donated to the Apache Software Foundation. Spark is mostly used for machine learning and interactive analytics.
from:http://dataconomy.com/a-beginners-guide-to-big-data-terminology/
Summary of Python Machine Learning and Deep Learning
1. Setting up a Python environment (Windows)
Development tool: PyCharm Community Edition (free)
Python environment: WinPython 3.5.2.3Qt5
-- This distribution already bundles the main packages used for machine learning and deep learning:
numpy,scipy,matplotlib,pandas,scikit-learn,theano,keras
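One quick way to confirm that the bundled packages are importable (a small sketch; note that scikit-learn is imported as sklearn) is to print each package's version:

# Print the version of each bundled package to verify the environment.
import numpy, scipy, matplotlib, pandas, sklearn, theano, keras

for pkg in (numpy, scipy, matplotlib, pandas, sklearn, theano, keras):
    print(pkg.__name__, getattr(pkg, "__version__", "unknown"))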
IPython notebook:
2. Sample code:
3. Datasets
4. The Kaggle platform
Kaggle is a platform for data modeling and data analysis competitions. Companies and researchers can publish data on it, and statisticians and data-mining experts compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies for tackling almost any predictive-modeling problem, and no researcher can know in advance which approach will work best for a particular problem. Kaggle aims to solve this difficulty through crowdsourcing, turning data science into a sport. (Wikipedia)
5. Handling common problems
Approaching (Almost) Any Machine Learning Problem