Stack Overflow Architecture Update – Now at 95 Million Page Views a Month

A lot has happened since my first article on the Stack Overflow Architecture. Contrary to the theme of that last article, which lavished attention on Stack Overflow’s dedication to a scale-up strategy, Stack Overflow has both grown up and out in the last few years.

Stack Overflow has grown up by more than doubling in size to over 16 million users and multiplying its number of page views nearly 6 times to 95 million page views a month.

Stack Overflow has grown out by expanding into the Stack Exchange Network, which includes Stack Overflow, Server Fault, and Super User for a grand total of 43 different sites. That’s a lot of fruitful multiplying going on.

What hasn’t changed is Stack Overflow’s openness about what they are doing. And that’s what prompted this update. A recent series of posts talks a lot about how they’ve been handling their growth: Stack Exchange’s Architecture in Bullet Points, Stack Overflow’s New York Data Center, Designing For Scalability of Management and Fault Tolerance, Stack Overflow Search — Now 81% Less, Stack Overflow Network Configuration, Does StackOverflow use caching and if so, how?, Which tools and technologies build the Stack Exchange Network?.

Some of the more obvious differences across time are:

  • Just More. More users, more page views, more datacenters, more sites, more developers, more operating systems, more databases, more machines. Just a lot more of more.
  • Linux. Stack Overflow was known for their Windows stack, but now they are using a lot more Linux machines for HAProxy, Redis, Bacula, Nagios, logs, and routers. All support functions seem to be handled by Linux, which has required the development of parallel release processes.
  • Fault Tolerance. Stack Overflow is now being served by two different switches on two different internet connections, they’ve added redundant machines, and some functions have moved to a second datacenter.
  • NoSQL. Redis is now used as a caching layer for the entire network. There wasn’t a separate caching tier before, so this is a big change, as is using a NoSQL database on Linux.

Unfortunately, I couldn’t find any coverage of some of the open questions I had last time, like how they were going to deal with multi-tenancy across so many different properties, but there’s still plenty to learn from. Here’s a roll-up of a few different sources:

The Stats

  • 95 Million Page Views a Month
  • 800 HTTP requests a second
  • 180 DNS requests a second
  • 55 Megabits per second
  • 16 Million Users  – Traffic to Stack Overflow grew 131% in 2010, to 16.6 million global monthly uniques.
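
To put these figures on a common scale, here is a rough back-of-the-envelope conversion. It assumes a 30-day month, and it treats the per-second figures as if they were sustained averages, which the source does not state (they may well be peaks), so read the results as orders of magnitude only.

```python
# Back-of-the-envelope conversions of the published stats.
# Assumptions (not from the source): a 30-day month, and the
# per-second figures treated as sustained averages.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

page_views_per_month = 95_000_000
http_requests_per_second = 800
megabits_per_second = 55

# Average page views per second over the whole month.
avg_page_views_per_second = page_views_per_month / SECONDS_PER_MONTH

# HTTP requests per page view, if 800 req/s were the average rate.
requests_per_page_view = http_requests_per_second / avg_page_views_per_second

# Monthly transfer if 55 Mbit/s were sustained (decimal terabytes).
bytes_per_second = megabits_per_second * 1_000_000 / 8
terabytes_per_month = bytes_per_second * SECONDS_PER_MONTH / 1e12

print(f"{avg_page_views_per_second:.1f} page views per second on average")
print(f"{requests_per_page_view:.1f} HTTP requests per page view")
print(f"{terabytes_per_month:.2f} TB transferred per month")
```

Under those assumptions the site averages roughly 37 page views a second, which is why the peak HTTP request rate, not the monthly total, is the number worth sizing for.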

Data Centers

  • 1 Rack with Peak Internet in OR (Hosts our chat and Data Explorer)
  • 2 Racks with Peer 1 in NY (Hosts the rest of the Stack Exchange Network)

Hardware

  • 10 Dell R610 IIS web servers (3 dedicated to Stack Overflow):
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz Quad Core with 8 threads
    • 16 GB RAM
    • Windows Server 2008 R2
  • 2 Dell R710 database servers:
    • 2x Intel Xeon Processor X5680 @ 3.33 GHz
    • 64 GB RAM
    • 8 spindles
    • SQL Server 2008 R2
  • 2 Dell R610  HAProxy servers:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 4 GB RAM
    • Ubuntu Server
  • 2 Dell R610 Redis servers:
    • 2x Intel Xeon Processor E5640 @ 2.66 GHz
    • 16 GB RAM
    • CentOS
  • 1 Dell R610 Linux backup server running Bacula:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 32 GB RAM
  • 1 Dell R610 Linux management server for Nagios and logs:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 32 GB RAM
  • 2 Dell R610 VMWare ESXi domain controllers:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 16 GB RAM
  • 2 Linux routers
  • 5 Dell Power Connect switches

Dev Tools

  • C#: Language
  • Visual Studio 2010 Team Suite: IDE
  • Microsoft ASP.NET (version 4.0): Framework
  • ASP.NET MVC 3: Web Framework
  • Razor: View Engine
  • jQuery 1.4.2: Browser Framework
  • LINQ to SQL, some raw SQL: Data Access Layer
  • Mercurial and Kiln: Source Control
  • Beyond Compare 3: Compare Tool

Software and Technologies Used

  • Stack Overflow uses a WISC stack via BizSpark
  • Windows Server 2008 R2 x64: Operating System
  • SQL Server 2008 R2 running on Microsoft Windows Server 2008 Enterprise Edition x64: Database
  • Ubuntu Server
  • CentOS
  • IIS 7.0: Web Server
  • HAProxy: for load balancing
  • Redis: used as the distributed caching layer.
  • CruiseControl.NET: for builds and automated deployment
  • Lucene.NET:  for search
  • Bacula: for backups
  • Nagios: (with n2rrd and drraw plugins) for monitoring
  • Splunk: for logs
  • SQL Monitor: from Red Gate – for SQL Server monitoring
  • Bind: for DNS
  • Rovio:  a little robot (a real robot) allowing remote developers to visit the office “virtually.”
  • Pingdom:  an external monitor and alert service.

External Bits

Code that is not included as part of the development tools:

  • reCAPTCHA
  • DotNetOpenId
  • WMD – Now developed as open source. See the GitHub network graph
  • Prettify
  • Google Analytics
  • Cruise Control .NET
  • HAProxy
  • Cacti
  • MarkdownSharp
  • Flot
  • Nginx
  • Kiln
  • CDN: none. All static content is served from sstatic.net, a fast, cookieless domain intended for static content delivered to the Stack Exchange family of websites.

Developers and System Administrators

  • 14 Developers
  • 2 System Administrators

Content

  • License: Creative Commons Attribution-Share Alike 2.5 Generic
  • Standards: OpenSearch, Atom
  • Host: PEAK Internet

More Architecture and Lessons Learned

  • HAProxy is used instead of Windows NLB because HAProxy is cheap, easy, and free, and it works great as a 512MB VM “device” on a network via Hyper-V. It also works in front of the boxes, so it’s completely transparent to them, and it’s easier to troubleshoot as a different networking layer instead of being intermixed with all your Windows configuration.
  • A CDN is not used because even “cheap” CDNs like Amazon’s are very expensive relative to the bandwidth that comes bundled into their existing host’s plan. The least they could pay is $1k/month based on Amazon’s CDN rates and their bandwidth usage.
  • Backup is to disk for fast retrieval and to tape for historical archiving.
  • Full Text Search in SQL Server is very badly integrated, buggy, and deeply incompetent, so they went to Lucene.
  • They are mostly interested in peak HTTP request figures, as this is the load they need to make sure they can handle.
  • All properties now run on the same Stack Exchange platform. That means Stack Overflow, Super User, Server Fault, Meta, WebApps, and Meta Web Apps are all running on the same software.
  • There are separate StackExchange sites because people have different sets of expertise that shouldn’t cross over to different topic sites. You can be the greatest chef in the world, but that doesn’t qualify you for fixing a server.
  • They aggressively cache everything.
  • All pages accessed by (and subsequently served to) anonymous users are cached via Output Caching.
  • Each site has 3 distinct caches: local, site, global.
  • local cache: can only be accessed from 1 server/site pair
    • To limit network latency they use a local “L1” cache, basically HttpRuntime.Cache, of recently set/read values on a server. This would reduce the cache lookup overhead to 0 bytes on the network.
    • Contains things like user sessions, and pending view count updates.
    • This resides purely in memory, no network or DB access.
  • site cache:  can be accessed by any instance (on any server) of a single site
    • Most cached values go here, things like hot question id lists and user acceptance rates are good examples
    • This resides in Redis (in a distinct DB, purely for easier debugging)
    • Redis is so fast that the slowest part of a cache lookup is the time spent reading and writing bytes to the network.
    • Values are compressed before sending them to Redis. They have plenty of CPU and most of their data are strings so they get a great compression ratio.
    • The CPU usage on their Redis machines is 0%.
  • global cache: which is shared amongst all sites and servers
    • Inboxes, API usage quotas, and a few other truly global things live here
    • This resides in Redis (in DB 0, likewise for easier debugging)
  • Most items in the cache expire after a timeout period (a few minutes usually) and are never explicitly removed. When a specific cache invalidation is required they use Redis messaging to publish removal notices to the “L1” caches.
  • Joel Spolsky is not a Microsoft loyalist, he doesn’t make the technical decisions for Stack Overflow, and he considers Microsoft licensing a rounding error. Consider yourself corrected, Hacker News commenter.
  • For their IO system they selected a RAID 10 array of Intel X25 solid state drives. The RAID array eased any concerns about reliability, and the SSDs performed really well in comparison to FusionIO at a much cheaper price.
  • The full boat cost for their Microsoft licenses would be approximately $242K. Since Stack Overflow is using Bizspark they are not paying near the full sticker price, but that’s the max they could pay.
  • Intel NICs are replacing Broadcom NICs on their primary production servers. This solved problems they were having with connectivity loss, packet loss, and corrupted ARP tables.
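
The caching bullets above describe a local “L1” cache in front of Redis, compression before values go over the network, and removal notices published to the L1 caches. The following is a minimal sketch of that shape, not Stack Overflow’s actual code. `FakeRedis` and `TieredCache` are hypothetical names, `FakeRedis` is an in-memory stand-in so the example runs without a Redis server, and timeouts are left out for brevity.

```python
import zlib
from typing import Callable, Dict, List, Optional


class FakeRedis:
    """Hypothetical in-memory stand-in for one Redis database plus pub/sub."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.subscribers: List[Callable[[str], None]] = []

    def get(self, key: str) -> Optional[bytes]:
        return self.data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    def publish(self, key: str) -> None:
        # Deliver a removal notice to every subscribed web server.
        for callback in self.subscribers:
            callback(key)


class TieredCache:
    """One web server's view: a local "L1" dict in front of a shared store."""

    def __init__(self, shared: FakeRedis) -> None:
        self.local: Dict[str, str] = {}
        self.shared = shared
        shared.subscribers.append(self._on_removal)

    def _on_removal(self, key: str) -> None:
        self.local.pop(key, None)

    def get(self, key: str) -> Optional[str]:
        if key in self.local:        # L1 hit: nothing crosses the network
            return self.local[key]
        raw = self.shared.get(key)   # L1 miss: ask the shared store
        if raw is None:
            return None
        value = zlib.decompress(raw).decode("utf-8")
        self.local[key] = value      # remember it locally for next time
        return value

    def set(self, key: str, value: str) -> None:
        self.local[key] = value
        # Compress before "sending": strings usually compress well.
        self.shared.set(key, zlib.compress(value.encode("utf-8")))

    def invalidate(self, key: str) -> None:
        self.shared.data.pop(key, None)
        self.shared.publish(key)     # tell every L1 cache to drop the key


# Two web servers sharing one site cache.
site_cache = FakeRedis()
web1, web2 = TieredCache(site_cache), TieredCache(site_cache)

web1.set("hot-question-ids", "11,42,7")
first_read = web2.get("hot-question-ids")        # shared lookup, then kept in L1
web1.invalidate("hot-question-ids")
after_invalidate = web2.get("hot-question-ids")  # gone from both tiers
```

A global cache would look the same, with a second store shared by every site instead of one per site.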

Related Articles

Stack Overflow Architecture

Update 2: Stack Overflow Architecture Update – Now At 95 Million Page Views A Month

Update: Startup – ASP.NET MVC, Cloud Scale & Deployment shows an interesting alternative approach for a Windows stack using ServerPath/GoGrid for a dedicated database machine, elastic VMs for the front end, and a free load balancer.

Stack Overflow is a much-loved programmer question and answer site written by two guys nobody has ever heard of before. Well, not exactly. The site was created by top programmer and blog stars Jeff Atwood and Joel Spolsky. In that sense Stack Overflow is like a celebrity-owned restaurant, only it should be around for a while. Joel estimates 1/3 of all the programmers in the world have used the site, so they must be serving up something good.

I fell in deep like with Stack Overflow for purely selfish reasons: it helped me solve a few difficult problems that were jabbing my eyes out with pain. I also appreciate their no-apologies, anthropologically based design philosophy. Use design to engineer in the behaviours you want to encourage and minimize the responses you want to discourage. It’s the conscious awareness of the mechanisms that creates such a satisfying synergy.

What is key about the Stack Overflow story for me is the strong case they make for scale up as a viable solution for a certain, potentially large, class of problems. The publicity these days is all going to scale out using NoSQL databases.

If you need to Google scale then you really have no choice but to go the NoSQL direction. But Stack Overflow is not Google and neither are most sites. When thinking about your design options keep Stack Overflow in mind. In this era of multi-core, large RAM machines and advances in parallel programming techniques, scale up is still a viable strategy and shouldn’t be tossed aside just because it’s not cool anymore. Maybe someday we’ll have the best of both worlds, but for now there’s a big painful choice to be made and that choice decides your fate.

Joel boasts that for 1/10 the hardware they have performance comparable to similarly sized sites. He wonders if these other sites have good programmers. Let’s see how they did it and you be the judge.

Site: http://stackoverflow.com

The Stats

  • 16 million page views a month
  • 3 million unique visitors a month (Facebook reaches 77 million unique visitors a month)
  • 6 million visits a month
  • 86% of traffic comes from Google
  • 9 million active programmers in the world and 30% have used Stack Overflow.
  • Cheaper licensing was attained through Microsoft’s BizSpark program. My impression is they pay about $11K for OS and SQL licensing.
  • Monetization strategy: unobtrusive ads, job placement ads, DevDays conferences, extending the software to target other related niches (Server Fault, Super User), developing StackExchange as a white label and self-hosted version of Stack Overflow, and perhaps developing some sort of programmer rating system.

    Platform

  • Microsoft ASP.NET MVC
  • SQL Server 2008
  • C#
  • Visual Studio 2008 Team Suite
  • JQuery
  • LINQ to SQL
  • Subversion
  • Beyond Compare 3
  • VisualSVN 1.5
  • Web Tier – 2 x Lenovo ThinkServer RS110 1U – 4 cores, 2.83 Ghz, 12 MB L2 cache – 500 GB datacenter hard drives, mirrored – 8 GB RAM – 500 GB RAID 1 mirror array
  • Database Tier – 1 x Lenovo ThinkServer RD120 2U – 8 cores, 2.5 Ghz, 24 MB L2 cache – 48 GB RAM
  • A fourth server was added to run superuser.com. All together the servers also run Stack Overflow, Server Fault, and Super User.
  • QNAP TS-409U NAS for backups. Decided not to use a cloud solution because the bandwidth costs of transferring 5 GB of data per day become prohibitive.
  • Hosting at http://www.peakinternet.com/. Impressed with their detailed technical responses and reasonable hosting rates.
  • SQL Server’s full text search is used extensively for the site search and detecting if a question has already been asked. Lucene.net is considered an attractive alternative.

    Lessons Learned

    This is a mix of lessons taken from Jeff and Joel and comments from their posts.

  • If you’re comfortable managing servers then buy them. The two biggest problems with renting costs were: 1) the insane cost of memory and disk upgrades 2) the fact that they [hosting providers] really couldn’t manage anything.
  • Make larger one time up front investments to avoid recurring monthly costs which are more expensive in the long term.
  • Update all network drivers. Performance went from 2x slower to 2x faster.
  • Upgrading to 48GB RAM required upgrading to MS Enterprise edition.
  • Memory is incredibly cheap. Max it out for almost free performance. At Dell, for example, upgrading from 4G memory to 128G is $4378.
  • Stack Overflow copied a key part of the Wikipedia database design. This turned out to be a mistake which will need massive and painful database refactoring to fix. The refactorings will be to avoid excessive joins in a lot of key queries. This is the key lesson from giant multi-terabyte table schemas (like Google’s BigTable) which are completely join-free. This is significant because Stack Overflow’s database is almost completely in RAM and the joins still exact too high a cost.
  • CPU speed is surprisingly important to the database server. Going from 1.86 GHz, to 2.5 GHz, to 3.5 GHz CPUs causes an almost linear improvement in typical query times. The exception is queries which don’t fit in memory.
  • When renting hardware nobody pays list price for RAM upgrades unless you are on a month-to-month contract.
  • The bottleneck is the database 90% of the time.
  • At low server volume, the key cost driver is not rackspace, power, bandwidth, servers, or software; it is NETWORKING EQUIPMENT. You need a gigabit network between your DB and Web tiers. Between the cloud and your web server, you need firewall, routing, and VPN devices. The moment you add a second web server, you also need a load balancing appliance. The upfront cost of these devices can easily be 2x the cost of a handful of servers.
  • EC2 is for scaling horizontally, that is you can split up your work across many machines (a good idea if you want to be able to scale). It makes even more sense if you need to be able to scale on demand (add and remove machines as load increases / decreases).
  • Scaling out is only frictionless when you use open source software. Otherwise scaling up means paying less for licenses and a lot more for hardware, while scaling out means paying less for the hardware, and a whole lot more for licenses.
  • RAID-10 is awesome in a heavy read/write database workload.
  • Separate application and database duties so each can scale independently of the other. Databases scale up and the applications scale out.
  • Applications should keep state in the database so they scale horizontally by adding more servers.
  • The problem with a scale up strategy is a lack of redundancy. A cluster adds more reliability, but is very expensive when the individual machines are expensive.
  • Few applications can scale linearly with the number of processors. Locks will be taken which serializes processing and ends up reducing the effectiveness of your Big Iron.
  • With larger form factors like 7U power and cooling become critical issues. Using something between 1U and 7U might be easier to make work in your data center.
  • As you add more and more database servers the SQL Server license costs can be outrageous. So by starting scale up and gradually going scale out with non-open source software you can be in a world of financial hurt.
    It’s true there’s not much about their architecture here. We know about their machines, their tool chain, and that they use a two-tier architecture where they access the database directly from the web server code. We don’t know how they implement tags, etc. If interested you’ll be able to glean some of this information from an explanation of their schema.

    Discussion

    As an architecture profile candidate Stack Overflow has earned two important HighScalability badges: the Microsoft Stack Badge and the Scale Up Badge. Both are controversial and interesting topics of discussion.

    Microsoft Stack Badge

    The Microsoft Stack Badge was earned because Stack Overflow uses the entire Microsoft Stack: OS, database, C#, Visual Studio, and ASP.NET. People are always interested in how MS compares to LAMP, but I don’t have many case studies to show them.
    Markus Frind of Plenty of Fish fame is often used as a Microsoft stack poster child, but since he explicitly uses as little of the stack as possible he’s not really a good example. Stack Overflow on the other hand is brash in proclaiming their love for MS, even when that love is occasionally spurned.
    It’s hard to separate out the Microsoft stack and the scale up approach because for licensing reasons they tend to go together. If you find yourself in the position of transitioning from scale up to scale out by adding dozens of cores, MS licensing will bite you.
    Licensing aside I personally find C#, Visual Studio, and .Net a very productive environment. C#/.Net is at least as good as Java/JVM. ASP .NET has always been a confusing mess to me. The knock against SQL Server is you have to pay for it and if that doesn’t bother you then it’s a solid choice. The Windows OS may not be as solid as other alternatives but it works well enough.
    So for a scale up solution a Microsoft stack works, especially if you are already Windows centric.

    Scale Up Badge

    This won’t be a reenactment of the scale out vs scale up vs rent vs buy wars. For a thorough discussion of these issues please take a look at  Scaling Up vs. Scaling Out and Server Hosting — Rent vs. Buy?. If you aren’t confused and if your head doesn’t hurt after reading all that then you haven’t properly understood the material 🙂
    The Scale Up Badge was awarded because Stack Overflow uses a scale up strategy to meet their scaling requirements. When they reach a limit they scale vertically by buying a bigger machine and adding more memory.
    Stack Overflow is in the sweet spot for scale up. It’s not too large, but with an Alexa ranking of 1,666 and 16 million page views a month it’s still a substantial site. Not Google scale, and probably will never have to be, but those are numbers many sites would be thrilled to have. Yet they aren’t uploading large amounts of media. They aren’t dealing with billions of tweets across complex social networks with millions of users. Their number of users is self limiting. And there are still directions they can take if they need to scale (caching, more web servers, faster disks, more denormalization, more memory, some partitioning, etc). All-in-all it’s a well done and very useful two-tier CRUD application.

    NoSQL is Hard

    So should Stack Overflow have scaled out instead of up, just in case?
    What some don’t realize is NoSQL is hard. Relational databases have many many faults, but they make a lot of common tasks simple while hiding both the cost and complexity. If you want to know how many black Prius cars are in inventory, for example, then that’s pretty easy to do.
    Not so with most NoSQL databases (I’ll speak generally here, some NoSQL databases have more features than others). You would have to program a counter of black Prius cars yourself, up front, in code. There are no aggregate operators. You must maintain secondary indexes. There’s no searching. There are no distributed queries across partitions. There’s no Group By or Order By. There are no cursors for easy paging through result sets. Returning even 100 large records at a time may time out. There may be quotas that are very restrictive because they must limit the amount of IO for any one operation. Query languages may lack expressive power.
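
    The black Prius example can be made concrete with a small sketch. The relational half uses Python’s built-in sqlite3 module. The key-value half is a hypothetical store, modeled here as plain dictionaries that only support lookups by key, which is an assumption made for illustration and not any particular NoSQL product.

```python
import sqlite3
from collections import Counter

cars = [("Prius", "black"), ("Prius", "red"), ("Prius", "black"), ("Civic", "black")]

# Relational: the question can be asked after the fact with a declarative query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (model TEXT, color TEXT)")
db.executemany("INSERT INTO inventory VALUES (?, ?)", cars)
sql_count = db.execute(
    "SELECT COUNT(*) FROM inventory WHERE model = ? AND color = ?",
    ("Prius", "black"),
).fetchone()[0]

# Key-value (hypothetical store with get/put by key only): the counter has to
# be designed up front and updated by application code on every write.
kv_store = {}
counters = Counter()


def add_car(car_id, model, color):
    kv_store[f"car:{car_id}"] = {"model": model, "color": color}
    counters[f"count:{model}:{color}"] += 1  # manual bookkeeping


for car_id, (model, color) in enumerate(cars):
    add_car(car_id, model, color)

kv_count = counters["count:Prius:black"]

print(sql_count, kv_count)
```

    Both halves arrive at the same count, but only the second one breaks if a code path forgets to update the counter, or if a failure lands between the two writes.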
    The biggest problem of all is that transactions can not span arbitrary boundaries. There are no ACID guarantees beyond a single record or small entity group. Once you wrap your head around what this means for the programmer it’s not a pleasant prospect at all. References must be manually maintained. Relationships must be manually maintained. There are no cascading deletes that act correctly during a failure. Every copy of denormalized data must be manually tracked and updated taking into account the possibility of partial failures and externally visible inconsistency.
    All this functionality must be written manually by you in your code. While flexibility to write your own code is great in an OLAP/map-reduce situation, declarative approaches still cover a lot of ground and make for much less brittle code.
    What you gain is the ability to write huge quantities of data. What you lose is complacency. The programmer must be very aware at all times that they are dealing with a system where it costs a lot to perform distributed operations and failure can occur at any time.
    All this may be the price of building a truly scalable and distributed system, but is this really the price you want to pay?

    The Multitenancy Problem

    With StackExchange, Stack Overflow has gone into the multi-tenancy business. They are offering StackExchange either self-hosted or as a hosted white label application.
    It will be interesting to see if their architecture can scale to handle a large number of sites. Salesforce is the king of multitenancy and although it’s true they use Oracle as their database, they basically use very little of Oracle and have written their own table structure, indexing and query processor on top of Oracle. All in order to support multitenancy.
    Salesforce went extreme because supporting a lot of different customers is way more difficult than it seems, especially once you allow customization and support versioning.
    Clearly all customers can’t run in one server for security, customization, and scaling reasons.
    You may think just create a database for each customer, share a server for a certain number of customers, and then add more servers as needed. As long as a customer doesn’t need more than one server you are golden.
    This doesn’t seem to work well in practice. Oddly, database managers aren’t optimized for adding or updating databases. Creating databases is a heavyweight operation and can degrade performance for existing customers as system locks are taken. Upgrade issues are also problematic. Adding columns locks tables, which causes problems in high traffic situations. Adding new indexes can also take a very long time and degrade performance. Plus each customer will likely have specializations that make upgrading even more complicated.
    To get around these problems Salesforce’s Craig Weissman, Chief Architect, created an innovative approach where tables are not created for each customer. All data from all customers is mapped into the same data table, including indexes. The schema for that table looks something like orgid, oid, value0, value1…value500. “orgid” is the organization ID and is how data is never mixed up. It’s a very wide and sparse table, which Oracle seems to handle well. Hundreds and hundreds of “tables” and custom fields are mapped into the data table.
    With this approach Salesforce has no option other than to build their own infrastructure to interpret what’s in that table. Oracle is left to handle transactions, concurrency, and deadlock detection. The advantage is that, because there’s an interpreted layer, handling versions and upgrades is relatively simple: the handling logic can be baked into that layer. Strange but true.
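
    As a rough illustration of the shared-table idea, here is a toy version using sqlite3. It is a sketch under stated assumptions and not Salesforce’s schema: the `fields` metadata table, the `objtype` column, and the helper functions are hypothetical, and the data table has 3 value slots instead of 500.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One shared, wide data table for every tenant (3 value slots here, not 500).
# The objtype column is an assumption added so one org can hold several "tables".
db.execute(
    "CREATE TABLE data (orgid TEXT, oid TEXT, objtype TEXT, "
    "value0 TEXT, value1 TEXT, value2 TEXT)"
)
# Hypothetical metadata: which logical field lives in which value slot.
db.execute("CREATE TABLE fields (orgid TEXT, objtype TEXT, name TEXT, slot INTEGER)")


def define_field(orgid, objtype, name, slot):
    # Adding a custom field is a metadata insert, not an ALTER TABLE.
    db.execute("INSERT INTO fields VALUES (?, ?, ?, ?)", (orgid, objtype, name, slot))


def insert_object(orgid, oid, objtype, **values):
    slots = [None, None, None]
    for name, value in values.items():
        (slot,) = db.execute(
            "SELECT slot FROM fields WHERE orgid = ? AND objtype = ? AND name = ?",
            (orgid, objtype, name),
        ).fetchone()
        slots[slot] = value
    db.execute(
        "INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)", (orgid, oid, objtype, *slots)
    )


def read_object(orgid, oid):
    # Every query is scoped by orgid, which is what keeps tenants apart.
    objtype, *slots = db.execute(
        "SELECT objtype, value0, value1, value2 FROM data WHERE orgid = ? AND oid = ?",
        (orgid, oid),
    ).fetchone()
    mapping = db.execute(
        "SELECT name, slot FROM fields WHERE orgid = ? AND objtype = ?",
        (orgid, objtype),
    ).fetchall()
    return {name: slots[slot] for name, slot in mapping}


define_field("acme", "invoice", "amount", 0)
define_field("acme", "invoice", "due", 1)
define_field("globex", "ticket", "title", 0)

insert_object("acme", "a1", "invoice", amount="100", due="2011-03-01")
insert_object("globex", "a1", "ticket", title="Printer on fire")

acme_obj = read_object("acme", "a1")
globex_obj = read_object("globex", "a1")
```

    The two tenants reuse the same object id and the same physical columns, yet each reads back only its own logical fields, because the interpreting layer, not the database schema, decides what a slot means.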

    Related Articles

    This list includes a number of posts by Jeff as he chronicles their journey with Stack Overflow. Jeff is wonderful about being open about what they are doing and why. The comment threads are often tremendous. There’s a lot to learn.

  • Learning from StackOverflow.com by Joel Spolsky
  • Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood
  • What Was Stack Overflow Built With?
  • New Stack Overflow Server Glamour Shots
  • New Stack Overflow Servers Ready
  • Server Hosting — Rent vs. Buy? – this is a very informative discussion of the pros and cons of renting vs buying.
  • Rent vs. Buy (or EC2 vs. building your own iron) by  Michael Friis
  • Oh, You Wanted “Awesome” Edition – We recently upgraded our database server to 48 GB of memory — because hardware is cheap, and programmers are expensive.
  • Our Backup Strategy – Inexpensive NAS
  • The Economics of Bandwidth
  • Understanding the StackOverflow Database Schema by  Brent Ozar
  • Server Speed Tests – new hardware 2x slower – it was the network.
  • ASP.NET MVC: A New Framework for Building Web Applications
  • Three key things to know about moving MySQL into the cloud by morgan
  • NoSQL Conference
  • Decline of the Enterprise Data Warehouse by Bradford Stephens
  • Webinar: Multitenant Magic – Under the Covers of the Force.com Data Architecture by Craig Weissman, Chief Architect, salesforce.com.

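Stack Overflow Architecture

Stack Overflow has grown considerably: it has more than doubled in size, to over 16 million unique visitors a month, and its monthly page views have grown nearly sixfold, to 95 million.

Stack Overflow has expanded into the Stack Exchange Network, which includes Stack Overflow, Server Fault, Super User, and others, for a total of 43 sites, and it is still growing well.

What has not changed is Stack Overflow’s openness about what it is doing, and that is what prompted this article. A recent series of posts describes how Stack Overflow has been handling its growth: Stack Exchange’s Architecture in Bullet Points, Stack Overflow’s New York Data Center, Designing For Scalability of Management and Fault Tolerance, Stack Overflow Search – Now 81% Less, Stack Overflow Network Configuration, Does StackOverflow use caching and if so, how?, and Which tools and technologies build the Stack Exchange Network?. (51CTO editor’s note: all of the posts above are in English.)

Some of the more obvious changes over the past few years are:

◆More of everything: more users, more page views, more data centers, more sites, more developers, more operating systems, more databases, more machines.

◆Linux: Stack Overflow was known for its Windows stack, but it now uses more and more Linux machines, for HAProxy, Redis, Bacula, Nagios, logs, and routers. All support functions seem to be handled by Linux, which has required developing parallel release processes.

◆Fault tolerance: Stack Overflow is now served by two different switches on two different internet connections, redundant machines have been added, and some functions have moved to a second data center.

◆NoSQL: Redis is now used as the caching layer for the entire network. There was no separate caching tier before, so this is a big change, as is using a NoSQL database on Linux.

Unfortunately, I could not find any posts covering some of the open questions I raised last time, such as how Stack Overflow would handle multi-tenancy across so many different properties, but there is still plenty to learn. Here is a collection of what I found: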

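The Stats

◆95 million page views a month

◆800 HTTP requests a second

◆180 DNS requests a second

◆55 megabits per second

◆16 million users – traffic to Stack Overflow grew 131% in 2010, to 16.6 million global monthly unique visitors.

Data Centers

 Stack Overflow network architecture

◆1 rack with Peak Internet in Oregon (hosts chat and Data Explorer)

◆2 racks with Peer 1 in New York (host the rest of the Stack Exchange Network)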

Hardware

◆10 Dell R610 IIS web servers (3 dedicated to Stack Overflow):

◆1x Intel Xeon Processor E5640, 2.66 GHz quad core, 8 threads

◆16 GB RAM

◆Windows Server 2008 R2

◆2 Dell R710 database servers:

◆2x Intel Xeon Processor X5680, 3.33 GHz

◆64 GB RAM

◆8 disks

◆SQL Server 2008 R2

◆2 Dell R610 HAProxy servers:

◆1x Intel Xeon Processor E5640, 2.66 GHz

◆4 GB RAM

◆Ubuntu Server

◆2 Dell R610 Redis servers:

◆2x Intel Xeon Processor E5640, 2.66 GHz

◆16 GB RAM

◆CentOS

◆1 Dell R610 Linux backup server running Bacula:

◆1x Intel Xeon Processor E5640, 2.66 GHz

◆32 GB RAM

◆1 Dell R610 Linux management server for Nagios and logs:

◆1x Intel Xeon Processor E5640, 2.66 GHz

◆32 GB RAM

◆2 Dell R610 VMWare ESXi domain controllers:

◆1x Intel Xeon Processor E5640, 2.66 GHz

◆16 GB RAM

◆2 Linux routers

◆5 Dell Power Connect switches

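Dev Tools

◆Language: C#

◆IDE: Visual Studio 2010 Team Suite

◆Framework: Microsoft ASP.NET (version 4.0)

◆Web framework: ASP.NET MVC 3

◆View engine: Razor

◆Browser framework: jQuery 1.4.2

◆Data access layer: LINQ to SQL, some raw SQL

◆Source control: Mercurial and Kiln

◆Compare tool: Beyond Compare 3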

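Software and Technologies Used

◆Stack Overflow uses a WISC stack via BizSpark

◆Operating system: Windows Server 2008 R2 x64

◆Database: SQL Server 2008 R2 running on Microsoft Windows Server 2008 Enterprise Edition x64

◆Ubuntu Server

◆CentOS

◆Web server: IIS 7.0

◆HAProxy: for load balancing

◆Redis: used as the distributed caching layer

◆CruiseControl.NET: for builds and automated deployment

◆Lucene.NET: for search

◆Bacula: for backups

◆Nagios: (with the n2rrd and drraw plugins) for monitoring

◆Splunk: for logs

◆SQL Monitor: from Red Gate, for SQL Server monitoring

◆Bind: for DNS

◆Rovio: a small robot (a real robot) that lets remote developers visit the office “virtually.”

◆Pingdom: an external monitoring and alerting service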

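External Bits

Code that is not included as part of the development tools:

◆reCAPTCHA

◆DotNetOpenId

◆WMD – now developed as open source. See the GitHub network graph

◆Prettify

◆Google Analytics

◆Cruise Control .NET

◆HAProxy

◆Cacti

◆MarkdownSharp

◆Flot

◆Nginx

◆Kiln

◆Content delivery network (CDN): none. All static content is served from sstatic.net, a fast, cookieless domain used to deliver static content to the Stack Exchange family of websites.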

Developers and System Administrators

◆14 developers

◆2 system administrators

Content

◆License: Creative Commons Attribution-Share Alike 2.5 Generic

◆Standards: OpenSearch, Atom

◆Host: PEAK Internet

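More Architecture and Lessons Learned

◆HAProxy is used rather than Windows Network Load Balancing (NLB) because HAProxy is cheap, easy to use, and free, and it works well as a 512 MB virtual machine “device” on the network via Hyper-V. It also sits in front of the servers, so it is completely transparent to them, and as a separate networking layer it is easier to troubleshoot than something intermixed with all of your Windows configuration.

◆A CDN is not used because even “cheap” CDNs such as Amazon’s are very expensive compared with the bandwidth bundled into the existing hosting plan. Based on Amazon’s CDN rates and Stack Overflow’s bandwidth usage, the cost would be at least $1,000 a month.

◆Backups go to disk for fast recovery and to tape for historical archiving.

◆Full text search in SQL Server is very poorly integrated, buggy, and weak, so Stack Overflow switched to Lucene.

◆The figure they care most about is peak HTTP requests, because that is what they need to be sure they can handle.

◆All properties now run on the same Stack Exchange platform. That means Stack Overflow, Super User, Server Fault, Meta, WebApps, and Meta Web Apps all run on the same software.

◆There are separate StackExchange sites because people have different areas of expertise that do not carry over to other topic sites. You may be the best chef in the world, but that does not mean you can fix a server.

◆Stack Overflow caches as much as it possibly can.

◆All pages accessed by anonymous users are cached via Output Caching and then served to anonymous users.

◆Each site has three distinct caches: a local cache, a site cache, and a global cache.

◆Local cache: can only be accessed from one server/site pair.

◆To limit network latency, Stack Overflow uses a local “L1” cache (basically HttpRuntime.Cache) of the values most recently set or read on a server. This reduces the cache lookup overhead to 0 bytes on the network.

◆It contains things like user sessions and pending view count updates.

◆It resides purely in memory, with no network or database access.

◆Site cache: can be accessed by any instance (on any server) of a single site.

◆Most cached values go here; hot question ID lists and user acceptance rates are two typical examples.

◆It resides in Redis (in a distinct database, purely for easier debugging).

◆Redis is so fast that the slowest part of a cache lookup is reading and writing bytes on the network.

◆Values are compressed before being sent to Redis. Stack Overflow has plenty of CPU and most of its data is strings, so the compression ratio is high.

◆CPU usage on the Redis machines is 0%.

◆Global cache: shared by all sites and servers.

◆It holds inboxes, API usage quotas, and a few other truly global things.

◆It resides in Redis (in database 0, likewise for easier debugging).

◆Most items in the cache expire after a timeout (usually a few minutes) and are never explicitly removed. When a specific cache item needs to be invalidated, they use Redis messaging to publish removal notices to the “L1” caches.

◆Joel Spolsky, the well-known software engineer and CEO of Fog Creek Software, is not a Microsoft loyalist. He does not make the technical decisions for Stack Overflow, and he considers Microsoft licensing a rounding error.

◆For its I/O system, Stack Overflow chose a RAID 10 array of Intel X25 solid state drives. The RAID array removed any reliability concerns, and compared with FusionIO the SSDs performed really well at a much lower price.

◆The full list price of the Microsoft licenses would be about $242K. Because Stack Overflow uses BizSpark, it is not paying the full list price, but that is the most it could pay.

◆Intel NICs replaced Broadcom NICs on the primary production servers. This solved the problems they had been having: connectivity loss, packet loss, and corrupted ARP (Address Resolution Protocol) tables.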