Category Archives: BigData

Lianjia Second-Hand Housing: 2016 Transaction Data Analysis

Top 20 communities in Haidian by 2016 transaction volume:

Community / Sub-district / Avg unit price (yuan/m²) / Avg total price (10k yuan) / Avg area (m²) / Transactions
保利西山林语 海淀北部新区 43637 375.5 84 179
育新花园 西三旗 66225 462.1 71 164
小南庄社区 苏州桥 82135 523.3 65 134
永泰东里 清河 52026 308.5 60 128
旗胜家园 西三旗 30422 268.3 88 126
上地东里 上地 93510 727.6 84 124
SOCO公社 西三旗 27626 136.4 50 115
八里庄北里 定慧寺 70046 411.9 59 114
东南小区 中关村 104427 685.4 66 99

Top 10 communities in Haidian by highest average unit price in 2016:

Community / Sub-district / Avg unit price (yuan/m²) / Avg total price (10k yuan) / Avg area (m²) / Transactions
万城华府 万柳 126415 2746.7 221 3
蜂鸟家园 万柳 115048 572.1 50 65
玉渊潭南路9号院 军博 112900 1279.6 115 5
保利海德公园 知春路 110739 935.3 82 4
科育小区 中关村 106045 639.8 61 60
康桥水郡 万柳 104725 820.2 81 6
东南小区 中关村 104427 685.4 66 99
碧水云天 万柳 103362 1382.1 134 29
航天社区 中关村 102935 700.7 68 16

Top 10 communities in Haidian by lowest average unit price in 2016:

Community / Sub-district / Avg unit price (yuan/m²) / Avg total price (10k yuan) / Avg area (m²) / Transactions
枫丹2号 西三旗 18012 73.3 40 8
颐丰庄园 海淀北部新区 23084 129.5 56 1
车公庄西路35号院 紫竹桥 23729 140.0 59 1
SOCO公社 西三旗 27626 136.4 50 115
宜品上层 上地 27679 153.4 56 21
专家国际公馆 清河 27941 197.0 71 9
辉煌国际 上地 29528 179.5 61 11
华杰大厦 皂君庙 29573 398.0 135 1
前沙涧路7号院 海淀其它 30255 265.0 88 1

For more detailed data, see: Lianjia 2016 Transaction Data Analysis

SQL used for the analysis:

1. Yearly transaction statistics for a given community

select name, regionb, strftime('%Y', sign_time) as sign_year, avg(unit_price) as unit_price, avg(total_price), avg(area) from chengjiao where unit_price <> '' and unit_price > 6000 group by name, regionb, sign_year having name = '荣丰2008' order by name, sign_year asc, unit_price desc

2. Beijing transaction statistics by district and community

select name, regionb, regions, avg(unit_price) as unit_price, avg(total_price) as total_price, avg(area) as area, count(1) as num from chengjiao where unit_price <> '' and regionb = '海淀' and sign_time > '2016-01-01 00:00:00' group by name, regionb, regions
--having name = '当代城市家园'
order by unit_price desc
--where name = '小南庄社区' and unit_price <> '' order by unit_price desc
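
These queries can also be run from Python. Below is a minimal sketch using the standard sqlite3 module; it assumes the crawled transactions were loaded into a SQLite database file (the file name lianjia.db is a placeholder) with the chengjiao table and columns used above.

import sqlite3

# Placeholder file name; assumes the crawled transactions were loaded into SQLite.
conn = sqlite3.connect('lianjia.db')

query = """
select name, regionb, regions,
       avg(unit_price) as unit_price,
       avg(total_price) as total_price,
       avg(area) as area,
       count(1) as num
from chengjiao
where unit_price <> '' and regionb = '海淀' and sign_time > '2016-01-01 00:00:00'
group by name, regionb, regions
order by unit_price desc
"""

# Print one aggregated row per community.
for row in conn.execute(query):
    print(row)

conn.close()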

Data source: Lianjia second-hand housing transaction records (2016-01-01 to 2016-11-23)

A Beginner’s Guide to Big Data Terminology

Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.

Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.

Algorithms: Mathematical formulas or statistical procedures used to analyze data. In software, they define the steps for processing and analyzing input data.

Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:

Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.

Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.

Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.

Cloud: It’s available anywhere and everywhere. Cloud computing simply means storing or accessing data and programs over the internet instead of on a local hard drive.

DaaS: Data-as-a-Service treats data as a product. DaaS providers use the cloud to give customers on-demand access to data. This allows companies to get high-quality data quickly. DaaS was a popular word in 2015 and is playing a major role in marketing.

Data Mining: Data miners explore large sets of data to find patterns and insight. This is a highly analytical process that emphasizes making use of large datasets, and it often involves artificial intelligence, machine learning or statistics.

Dark Data: This is information that is gathered and processed by a business but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data lying around without even realizing it.

Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), a software that allows data to be explored and analyzed.

Hadoop (Apache Hadoop): An open-source software framework, Hadoop works largely by storing files and processing data. It is also known for its processing power, making it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormous amounts of data. The Apache Software Foundation also maintains related projects you may run into: Pig, Hive, and now Spark (more on Spark later).

IoT: The Internet of Things is generally described as the way products are able to “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are a perfect example: they constantly pull information from the cloud while their sensors relay information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:

IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.

Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.
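
For instance, a minimal sketch of this idea with scikit-learn: SGDClassifier exposes partial_fit, so the model is updated each time a new batch of data arrives rather than being retrained from scratch. The data below is synthetic, purely for illustration.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)

# Feed the model data in batches; each partial_fit call updates the existing
# model instead of rebuilding it from scratch.
for batch in range(5):
    X = rng.rand(200, 2)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy rule the model should pick up
    clf.partial_fit(X, y, classes=[0, 1])

# Evaluate on fresh synthetic data after the incremental updates.
X_test = rng.rand(1000, 2)
y_test = (X_test[:, 0] + X_test[:, 1] > 1.0).astype(int)
print("accuracy after 5 batches:", clf.score(X_test, y_test))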

MapReduce: MapReduce is a programming model for processing and generating large data sets, and it actually does two distinct things. First, the “Map” step transforms the input dataset into a more useful dataset broken down into pieces called tuples (key/value pairs). Second, the “Reduce” step combines the tuples that share a key into a smaller, aggregated set of results. The outcome is a practical breakdown of very large amounts of information.
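
To make the two steps concrete, here is a minimal word-count sketch in plain Python (it illustrates the programming model only, not Hadoop's actual API): the map step emits (word, 1) tuples, and the reduce step combines the tuples that share a word into a single count.

from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data science uses data"]

# Map: turn each input record into intermediate (key, value) tuples.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the tuples by key so each reduce step sees one word at a time.
mapped.sort(key=itemgetter(0))

# Reduce: combine each group of tuples into a single (word, count) result.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # e.g. {'big': 2, 'data': 3, ...}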

Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not use relational tables with rows and columns. It is used to manage data and for stream processing. NoSQL covers a number of different databases and data models that scale horizontally, meaning across servers, which can make it more cost-effective than the vertical scaling typical of SQL databases.

Petabyte: Yes, it’s big. It’s 1,000,000,000,000,000 bytes. To visualize, Gizmodo described one petabyte as 20 million 4-drawer filing cabinets filled with texts, and 20 petabytes as all the written works of mankind from the beginning of time translated into every language.

SQL: Also known as Structured Query Language, this is the standard language for managing and querying data in relational databases. It is used to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in relational tables with rows and columns.

R: R is a horribly named programming language designed for statistical computing. It is considered one of the more important and most popular languages in data science.

SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet; yes, that’s cloud servicing. SaaS providers deliver software over the cloud rather than as locally installed copies.

Spark (Apache Spark): An open-source computing framework originally developed at the University of California, Berkeley, Spark was later donated to the Apache Software Foundation. Spark is mostly used for machine learning and interactive analytics.

From: http://dataconomy.com/a-beginners-guide-to-big-data-terminology/

Python Machine Learning and Deep Learning: A Summary

1. Python environment setup (Windows)

Development tool: PyCharm Community Edition (free)

Python environment: WinPython 3.5.2.3Qt5
This distribution bundles the main packages used for machine learning and deep learning:
numpy, scipy, matplotlib, pandas, scikit-learn, theano, keras

IPython notebook

2. Sample code:

scikit-learn sample

keras sample
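
The sample links above point to separate code. As a minimal, self-contained illustration of the scikit-learn workflow (split, fit, evaluate) on the bundled iris dataset, assuming scikit-learn 0.18 or later for model_selection; this is a sketch, not the referenced sample code itself:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load a small built-in dataset and hold out 30% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit a classifier on the training split and report accuracy on the held-out split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))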

3. Datasets

GeoHey public data

4. The Kaggle platform

Kaggle is a platform for data-modelling and data-analysis competitions. Companies and researchers publish data on it, and statisticians and data-mining experts compete to produce the best models. This crowdsourcing approach relies on the fact that countless strategies can be applied to almost any predictive-modelling problem, and that no researcher can know at the outset which approach will be most effective for a particular problem. Kaggle aims to solve this difficulty through crowdsourcing, and in doing so to turn data science into a sport. (Wikipedia)

5. Handling common problems

Approaching (Almost) Any Machine Learning Problem

 

Big Data in Practice: Lianjia House Price Transaction Analysis

1. Data acquisition

Crawl all transaction records from Lianjia

2. Data preprocessing

Data cleaning

3. Data visualization and preliminary analysis

4. Data mining (a minimal sketch follows this outline)

Build a model

Training

Fine-tuning

Prediction
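
A minimal sketch of the modelling and prediction steps using linear regression (in the spirit of the linear-regression article linked below). The features and prices here are made up for illustration and are not taken from the Lianjia data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [area in m2, number of rooms] -> total price in 10k yuan.
X = np.array([[60, 2], [75, 2], [90, 3], [110, 3], [130, 4]])
y = np.array([350, 420, 510, 640, 780])

# Fit a simple linear model on the toy data.
model = LinearRegression()
model.fit(X, y)

# Predict the price of an unseen 100 m2, 3-room flat.
print(model.predict(np.array([[100, 3]])))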

Using machine learning to analyze house price trends

Predicting data with linear regression in Python

Letting the data speak: the data behind Beijing housing prices

How to use GeoHey public data to predict house prices

 

 

Open dataset

■ 1. http://archive.ics.uci.edu/ml/ - The best-known source of datasets for machine learning is the University of California at Irvine. We used fewer than 10 datasets in this book, but there are more than 200 datasets in this repository. Many of these datasets are used to compare the performance of algorithms so that researchers can have an objective comparison of performance.

■ 2. http://aws.amazon.com/publicdatasets/ - If you're a big data cowboy, then this is the link for you. Amazon has some really big datasets, including the U.S. census data, the annotated human genome data, a 150 GB log of Wikipedia's page traffic, and a 500 GB database of Wikipedia's link data.

■ 3. http://www.data.gov - Data.gov is a website launched in 2009 to increase the public's access to government datasets. The site was intended to make all government data public as long as the data was not private or restricted for security reasons. In 2010, the site had over 250,000 datasets. It's uncertain how long the site will remain active. In 2011, the federal government reduced funding for the Electronic Government Fund, which pays for Data.gov. The datasets range from products recalled to a list of failed banks.

■ 4. http://www.data.gov/opendatasites - Data.gov has a list of U.S. states, cities, and countries that hold similar open data sites.

■ 5. http://www.infochimps.com/ - Infochimps is a company that aims to give everyone access to every dataset in the world. Currently, they have more than 14,000 datasets available to download. Unlike the other sites listed here, some of the datasets on Infochimps are for sale. You can sell your own datasets here as well.

Reference: Machine Learning in Action