Category Archives: BigData

技术面试要了解的算法和数据结构知识

目录

  • 在线练习
  • 在线编程面试
  • 数据结构
  • 算法
  • 贪心算法
  • 位运算
  • 复杂度分析
  • 视频教程
  • 面试宝典
  • 计算机科学资讯
  • 文件结构

在线练习

在线编程面试

数据结构

链表

  • 链表是一种由节点(Node)组成的线性数据集合,每个节点通过指针指向下一个节点。它是一种由节点组成,并能用于表示序列的数据结构。
  • 单链表 :每个节点仅指向下一个节点,最后一个节点指向空(null)。
  • 双链表 :每个节点有两个指针p,n。p指向前一个节点,n指向下一个节点;最后一个节点指向空。
  • 循环链表 :每个节点指向下一个节点,最后一个节点指向第一个节点。
  • 时间复杂度:
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)

  • 栈是一个元素集合,支持两个基本操作:push用于将元素压入栈,pop用于删除栈顶元素。
  • 后进先出的数据结构(Last In First Out, LIFO)
  • 时间复杂度
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)

队列

  • 队列是一个元素集合,支持两种基本操作:enqueue 用于添加一个元素到队列,dequeue 用于删除队列中的一个元素。
  • 先进先出的数据结构(First In First Out, FIFO)。
  • 时间复杂度
    • 索引:O(n)
    • 查找:O(n)
    • 插入:O(1)
    • 删除:O(1)

  • 树是无向、联通的无环图。

二叉树

  • 二叉树是一个树形数据结构,每个节点最多可以有两个子节点,称为左子节点和右子节点。
  • 满二叉树(Full Tree) :二叉树中的每个节点有 0 或者 2 个子节点。

大数据

  • 完美二叉树(Perfect Binary) :二叉树中的每个节点有两个子节点,并且所有的叶子节点的深度是一样的。
  • 完全二叉树 :二叉树中除最后一层外其他各层的节点数均达到最大值,最后一层的节点都连续集中在最左边。

二叉查找树

  • 二叉查找树(BST)是一种二叉树。其任何节点的值都大于等于左子树中的值,小于等于右子树中的值。
  • 时间复杂度
    • 索引:O(log(n))
    • 查找:O(log(n))
    • 插入:O(log(n))
    • 删除:O(log(n))

大数据

字典树

  • 字典树,又称为基数树或前缀树,是一种用于存储键值为字符串的动态集合或关联数组的查找树。树中的节点并不直接存储关联键值,而是该节点在树中的位置决定了其关联键值。一个节点的所有子节点都有相同的前缀,根节点则是空字符串。

大数据

树状数组

  • 树状数组,又称为二进制索引树(Binary Indexed Tree,BIT),其概念上是树,但以数组实现。数组中的下标代表树中的节点,每个节点的父节点或子节点的下标可以通过位运算获得。数组中的每个元素都包含了预计算的区间值之和,在整个树更新的过程中,这些计算的值也同样会被更新。
  • 时间复杂度
    • 区间求和:O(log(n))
    • 更新:O(log(n))

大数据

线段树

  • 线段树是用于存储区间和线段的树形数据结构。它允许查找一个节点在若干条线段中出现的次数。
  • 时间复杂度
    • 区间查找:O(log(n))
    • 更新:O(log(n))

大数据

  • 堆是一种基于树的满足某些特性的数据结构:整个堆中的所有父子节点的键值都满足相同的排序条件。堆分为最大堆和最小堆。在最大堆中,父节点的键值永远大于等于所有子节点的键值,根节点的键值是最大的。最小堆中,父节点的键值永远小于等于所有子节点的键值,根节点的键值是最小的。
  • 时间复杂度
    • 索引:O(log(n))
    • 查找:O(log(n))
    • 插入:O(log(n))
    • 删除:O(log(n))
    • 删除最大值/最小值:O(1)

大数据

哈希

  • 哈希用于将任意长度的数据映射到固定长度的数据。哈希函数的返回值被称为哈希值、哈希码或者哈希。如果不同的主键得到相同的哈希值,则发生了冲突。
  • Hash Maphash map  是一个存储键值间关系的数据结构。HashMap 通过哈希函数将键转化为桶或者槽中的下标,从而便于指定值的查找。
  • 冲突解决
    • 链地址法( Separate Chaining :在链地址法中,每个桶(bucket)是相互独立的,每一个索引对应一个元素列表。处理HashMap 的时间就是查找桶的时间(常量)与遍历列表元素的时间之和。
    • 开放地址法( Open Addressing :在开放地址方法中,当插入新值时,会判断该值对应的哈希桶是否存在,如果存在则根据某种算法依次选择下一个可能的位置,直到找到一个未被占用的地址。开放地址即某个元素的位置并不永远由其哈希值决定。

大数据

  • 图是G =(V,E)的有序对,其包括顶点或节点的集合 V 以及边或弧的集合E,其中E包括了两个来自V的元素(即边与两个顶点相关联 ,并且该关联为这两个顶点的无序对)。
  • 无向图 :图的邻接矩阵是对称的,因此如果存在节点 u 到节点 v 的边,那节点 v 到节点 u 的边也一定存在。
  • 有向图 :图的邻接矩阵不是对称的。因此如果存在节点 u 到节点 v 的边并不意味着一定存在节点 v 到节点 u 的边。

大数据

算法

排序

快速排序

  • 稳定:否
  • 时间复杂度
    • 最优:O(nlog(n))
    • 最差:O(n^2)
    • 平均:O(nlog(n))

大数据

合并排序

  • 合并排序是一种分治算法。这个算法不断地将一个数组分为两部分,分别对左子数组和右子数组排序,然后将两个数组合并为新的有序数组。
  • 稳定:是
  • 时间复杂度:
    • 最优:O(nlog(n))
    • 最差:O(nlog(n))
    • 平均:O(nlog(n))

桶排序

  • 桶排序是一种将元素分到一定数量的桶中的排序算法。每个桶内部采用其他算法排序,或递归调用桶排序。
  • 时间复杂度
    • 最优:Ω(n + k)
    • 最差: O(n^2)
    • 平均:Θ(n + k)

大数据

基数排序

  • 基数排序类似于桶排序,将元素分发到一定数目的桶中。不同的是,基数排序在分割元素之后没有让每个桶单独进行排序,而是直接做了合并操作。
  • 时间复杂度
    • 最优:Ω(nk)
    • 最差: O(nk)
    • 平均:Θ(nk)

图算法

深度优先 搜索

  • 深度优先搜索是一种先遍历子节点而不回溯的图遍历算法。
  • 时间复杂度:O(|V| + |E|)

大数据

广度优先 搜索

  • 广度优先搜索是一种先遍历邻居节点而不是子节点的图遍历算法。
  • 时间复杂度:O(|V| + |E|)

大数据

拓扑排序

  • 拓扑排序是有向图节点的线性排序。对于任何一条节点 u 到节点 v 的边,u 的下标先于 v。
  • 时间复杂度:O(|V| + |E|)

Dijkstra算法

  • Dijkstra 算法是一种在有向图中查找单源最短路径的算法。
  • 时间复杂度:O(|V|^2)

大数据

Bellman-Ford算法

  • Bellman-Ford  是一种在带权图中查找单一源点到其他节点最短路径的算法。
  • 虽然时间复杂度大于 Dijkstra 算法,但它可以处理包含了负值边的图。
  • 时间复杂度:
    • 最优:O(|E|)
    • 最差:O(|V||E|)

大数据

Floyd-Warshall 算法

  • Floyd-Warshall  算法是一种在无环带权图中寻找任意节点间最短路径的算法。
  • 该算法执行一次即可找到所有节点间的最短路径(路径权重和)。
  • 时间复杂度:
    • 最优:O(|V|^3)
    • 最差:O(|V|^3)
    • 平均:O(|V|^3)

最小生成树算法

  • 最小生成树算法是一种在无向带权图中查找最小生成树的贪心算法。换言之,最小生成树算法能在一个图中找到连接所有节点的边的最小子集。
  • 时间复杂度:O(|V|^2)

Kruskal 算法

  • Kruskal  算法也是一个计算最小生成树的贪心算法,但在 Kruskal 算法中,图不一定是连通的。
  • 时间复杂度:O(|E|log|V|)

贪心算法

  • 贪心算法总是做出在当前看来最优的选择,并希望最后整体也是最优的。
  • 使用贪心算法可以解决的问题必须具有如下两种特性:
    • 最优子结构
      • 问题的最优解包含其子问题的最优解。
    • 贪心选择
      • 每一步的贪心选择可以得到问题的整体最优解。
  • 实例-硬币选择问题
  • 给定期望的硬币总和为 V 分,以及 n 种硬币,即类型是 i 的硬币共有 coinValue[i] 分,i的范围是 [0…n – 1]。假设每种类型的硬币都有无限个,求解为使和为 V 分最少需要多少硬币?
  • 硬币:便士(1美分),镍(5美分),一角(10美分),四分之一(25美分)。
  • 假设总和 V 为41,。我们可以使用贪心算法查找小于或者等于 V 的面值最大的硬币,然后从 V 中减掉该硬币的值,如此重复进行。
    • V = 41 | 使用了0个硬币
    • V = 16 | 使用了1个硬币(41 – 25 = 16)
    • V = 6 | 使用了2个硬币(16 – 10 = 6)
    • V = 1 | 使用了3个硬币(6 – 5 = 1)
    • V = 0 | 使用了4个硬币(1 – 1 = 0)

运算

  • 位运算即在比特级别进行操作的技术。使用位运算技术可以带来更快的运行速度与更小的内存使用。
  • 测试第 k 位:s & (1 << k);
  • 设置第k位:s |= (1 << k);
  • 关闭第k位:s &= ~(1 << k);
  • 切换第k位:s ^= (1 << k);
  • 乘以2n:s << n;
  • 除以2n:s >> n;
  • 交集:s & t;
  • 并集:s | t;
  • 减法:s & ~t;
  • 提取最小非0位:s & (-s);
  • 提取最小0位:~s & (s + 1);
  • 交换值:x ^= y; y ^= x; x ^= y;

运行时分析

大 O 表示

  • 大 O 表示用于表示某个算法的上界,用于描述最坏的情况。

大数据

小 O 表示

  • 小 O 表示用于描述某个算法的渐进上界,二者逐渐趋近。

大 Ω 表示

  • 大 Ω 表示用于描述某个算法的渐进下界。

大数据

小 ω 表示

  • 小 ω 表示用于描述某个算法的渐进下界,二者逐渐趋近。

Theta Θ 表示

  • Theta Θ 表示用于描述某个算法的确界,包括最小上界和最大下界。

大数据

视频教程

面试宝典

 

技术面试要了解的算法和数据结构知识

A Beginner’s Guide to Big Data Terminology

Big Data includes so many specialized terms that it’s hard to know where to begin. Make sure you can talk the talk before you try to walk the walk.

Data science can be confusing enough without all of the complicated lingo and jargon. For many, the terms NoSQL, DaaS and Neural Networking instill nothing more than the hesitant thought, “this sounds data-related.” It can be difficult to tell a mathematical term from a proper programming language or a dystopian sci-fi world. The first step to getting the most out of data science is understanding the most basic of terminology. That’s why we compiled a list of terms from all across the big data spectrum.

Algorithms: Mathematical formulas or statistical processes used to analyze data. These are used in software to process and analyze any input data.

Analytics: The process of drawing conclusions based on raw information. Through analysis, otherwise meaningless data and numbers can be transformed into something useful. The focus here is on inference rather than big software systems. Perhaps that’s why data analysts are often well-versed in the art of story-telling. There are three main types of analytics in data, and they appear in the following order:

Descriptive Analytics: Condensing big numbers into smaller pieces of information. This is similar to summarizing the data story. Rather than listing every single number and detail, there is a general thrust and narrative.

Predictive Analytics: Studying recent and historical data, analysts are now able to make predictions about the future. It is hardly 100% accurate, but it provides insight as to what will most likely happen next. This process often involves data mining, machine learning and statistics.

Prescriptive Analytics: Finally, having a solid prediction for the future, analysts can prescribe a course of action. This turns data into action and leads to real-world decisions.

Cloud: It’s available any and everywhere. Cloud computing simply means storing or accessing data (programs, files, data) over the internet instead of a hard drive.

DaaS: Data-as-a-service treats data as a product. DaaS providers use the cloud to give on-demand access of data to customers. This allows companies to get high quality data quickly. DaaS has been a popular word in 2015, and is playing a major role in marketing.

Data Mining: Data miners explore large sets of data to find patterns and insight. This is a highly analytical process that emphasizes making use of large datasets. This process could likely involve artificial intelligence, machine learning or statistics.

Dark Data: This is information that is gathered and processed by a business, but never put to real use. Instead, it sits in the dark waiting to be analyzed. Companies tend to have a lot of this data laying around without even realizing it.

Database: A database is an organized collection of data. It may include charts, schemas or tables. It may also be integrated into a Database Management System (DBMS), a software that allows data to be explored and analyzed.

Hadoop (Apache Hadoop): An open source software framework, Hadoop works largely by storing files and processing data. It is also known for large processing power, making it easy to run a multitude of tasks concurrently. It allows businesses to save, access and analyze enormously big amounts of data. Apache is also in charge of other, related programs you may run into: Pig, Hive, and now Spark (more on Spark later).

IoT: The Internet of Things is generally described as the way products are able “talk” to each other. It is a network of objects (for example, your phone, wearable or car) embedded with network connectivity. Driverless cars are perfect examples. They are always pulling information from the cloud and their sensors are relaying information back. The IoT generates huge amounts of data, making it both important and popular for data science. There is also:

IoE (Internet of Everything): This combines products, people and processes to generate even more connectivity.

Machine Learning: An incredibly cool method of data analysis, machine learning automates analytical model building and relies on a machine’s ability to adapt. Using algorithms, models actively learn and better themselves each time they process new data. Though machine learning is not new, it is gaining massive traction as a modern data analysis tool. It enables machines to adapt and grow without needing hours of extra work on the part of scientists.

MapReduce: MapReduce is a programming model for processing and generating large data sets. This model actually does two distinct things. First, the “Map” includes turning one dataset into another, more useful and broken down dataset made of bits called tuples. Second, “Reduce” takes all of the broken down tuples and breaks them down even further. The result is a practical breakdown of information.

Neural Network: Artificial Neural Networks are models inspired by the real-life biology of the brain. These are used to estimate mathematical functions and facilitate different kinds of learning algorithms. Deep Learning is a similar term, and is generally seen as a modern buzzword, rebranding the Neural Network paradigm for the modern day.

NoSQL: “Non-relational SQL” or “Not only SQL” is much like SQL (discussed below) but does not use relational tables with rows and columns. It is used to manage and stream processing of data. NoSQL includes a number of different databases and models that run horizontally, meaning across servers. This might make it more cost-effective than vertical scaling (as used in SQL).

Petabyte: Yes, it’s big. It’s 1,000,000,000,000,000 bytes. To visualize, Gizmodo described one petabyte as 20 million 4-drawer filing cabinets filled with texts. 20 Petabytes would be all the written works of mankind from the beginning of time translated in every language.

SQL: Also known as Structured Query Language, this is used for the managing and stream processing of data. It is used to communicate with and perform tasks on a database. Standard commands include “Insert,” “Update,” “Delete,” “Create,” and “Drop.” Data appears in a relational table with rows and columns.

R: R is a horribly named programming language that works with statistical computing. It is considered one of the more important and most popular languages in data science.

SaaS: Software-as-a-Service enables vendors to host an application and make it available via the internet. Yes, that’s cloud servicing. SaaS providers provide services over the cloud rather than hard copies.

Spark (Apache Spark): An open-source computing framework originally developed at University of California, Berkely, Spark was later donated to Apache Software. Spark is mostly used for machine learning and interactive analytics.

from:http://dataconomy.com/a-beginners-guide-to-big-data-terminology/

python机器学习深度学习总结

1、Python环境搭建(Windows)

开发工具:PyCharm Community Edition(free)

Python环境:WinPython 3.5.2.3Qt5
–此环境集成了机器学习和深度学习用到的主要包:
numpy,scipy,matplotlib,pandas,scikit-learn,theano,keras

IPython notebook :

2、示例代码:

scikit-learn sample

keras sample

3、数据集Datasets

GeoHey公共数据

4、kaggle平台

Kaggle是一个数据建模数据分析竞赛平台。企业和研究者可在其上发布数据,统计学者和数据挖掘专家可在其上进行竞赛以产生最好的模型。这一众包模式依赖于这一事实,即有众多策略可以用于解决几乎所有预测建模的问题,而研究者不可能在一开始就了解什么方法对于特定问题是最为有效的。Kaggle的目标则是试图通过众包的形式来解决这一难题,进而使数据科学成为一场运动。(wiki)

5、常见问题处理

Approaching (Almost) Any Machine Learning Problem

 

Open dataset

Open dataset:
■ 1.http://archive.ics.uci.edu/ml/
—The best-known source of datasets for
machine learning is the University of California at Irvine. We used fewer
than 10 data sets in this book, but there are more than 200 datasets in this repository.
Many of these datasets are used to compare the performance of algorithms
so that researchers can have an objective comparison of performance.
■ 2.http://aws.amazon.com/publicdatasets/
—If you’re a big data cowboy, then
this is the link for you. Amazon has some really big datasets, including the
U.S. census data, the annotated human genome data, a 150 GB log of Wikipedia’s
page traffic, and a 500 GB database of Wikipedia’s link data.
■ 3.http://www.data.gov
—Data.gov is a website launched in 2009 to increase the
public’s access to government datasets. The site was intended to make all
government data public as long as the data was not private or restricted for
security reasons. In 2010, the site had over 250,000 datasets. It’s uncertain
how long the site will remain active. In 2011, the federal government
reduced funding for the Electronic Government Fund, which pays for
Data.gov. The datasets range from products recalled to a list of failed banks.
■4. http://www.data.gov/opendatasites
—Data.gov has a list of U.S. states, cities,
and countries that hold similar open data sites.
■5. http://www.infochimps.com/
—Infochimps is a company that aims to give
everyone access to every dataset in the world. Currently, they have more
than 14,000 datasets available to download. Unlike other listed sites, some
of the datasets on Infochimps are for sale. You can sell your own datasets
here as well.

refer:《Machine Learning in Action.pdf》

数据爬取和数据分析案例

数据爬取:

*如何入门 Python 爬虫?

专栏:Python爬虫入门教程

Python爬虫学习系列教程

模拟登录一些知名的网站,为了方便爬取需要登录的网站

Python 爬虫-模拟登录知乎-爬取拉勾网职位信息

Python写的链家爬虫 代码+数据

数据爬取工具或框架:

scrapy

Hawk 【重磅开源】Hawk-数据抓取工具:简明教程

pyspider

使用Wget下载整个网站
you-get(Releases · soimort/you-get · GitHub这里面有各种发布版本)。

刚开始写爬虫用的是urllib2,后来知道了requests,惊为天人。
刚开始解析网页用的是re,后来知道了BeautifulSoup,解析页面不能再轻松。
再后来看别人的爬虫,知道了scrapy,被这个框架惊艳到了。
之后遇到了一些有验证码的网站,于是知道了PIL。但后来知道了opencv,pybrain。当在爬虫中用上人工神经网络识别出验证码,兴奋得守在爬虫旁边看他爬完全站。
再后来知道了threading,知道了celery。(知乎)

使用Python进行验证码识别

数据分析案例:

有哪些网站用爬虫爬取能得到很有价值的数据?

2016豆瓣电影可视化分析报告

京东百万记录分析中国人罩杯分布 | 上150万数据 密码:guvy)

用Python侦测比特币交易的网络可视化分析

如何通过房屋租售比来判断房产的价值或泡沫?
你用 Python 做过什么有趣的数据挖掘/分析项目

知乎问题爬虫

知乎数据 API 接口 (node.js)

拉勾职位信息爬取

赶集租房信息

链家爬虫 (数据:链家数据

使用Python进行验证码识别

个人博客:

沙漠之鹰