PASS Summit: Day 5

All right friends, it's Friday, it's the last day of the 2011 PASS Summit, and I'm sitting at the official bloggers table.  I also present my Baselines session today at 2:30 PM in room 4C1-2 (yes, that was a shameless plug).  Sitting at the bloggers table means that I get to try to make written sense of everything said by Dr. David DeWitt today.  This is no small feat; the man is a Microsoft Fellow and runs the Jim Gray lab in Madison, WI.  I'll do my best to keep up, and here we go with the keynote!

Rick Heiges took the stage first, and quickly introduced Rob Farley and Buck Woody, who sang Rob’s song, “I Should Have Looked the Other Way.”  It.  Was. Awesome.  Rob sent me a demo of it months ago, and I love that it gained so much momentum that it kicked off today’s keynote.  Congrats Rob!

Rick and PASS paid tribute to Wayne Snyder next, as he is retiring from the PASS Board this year.  Wayne is the immediate Past President and has been a key member of the SQL Server community for many years.  Wayne's a good speaker, and it's moving to listen to him say farewell.  Thank you for all you have done, Wayne!

The PASS Summit marks the end of the 2011 cycle, and the 2012 Executive Committee is made up of Bill Graziano, Douglas McDowell, Thomas LaRock and Rushabh Mehta.  There are three at-large positions up for election this year…are you running?

Post-summit events include SQLRally Nordic (and that event is now sold out), 24 Hours of PASS, SQL Rally (in Dallas in the spring) and a host of SQL Saturdays throughout the entire year.  Summit 2012 is November 6-9 in Seattle, WA.  The 2013 Summit will be in Charlotte, NC.

David DeWitt took the stage next and first thanked Rimma Nehme, who helped prepare the keynote.  Dr. DeWitt loves giving keynotes, but he hates preparing for them.  It's about expectations, he says (I feel his pain), and 2010 was the last year he was planning to present.  However, Dr. DeWitt's wife attended, and she provided some feedback.  She said that Dr. DeWitt needed new glasses, that the presentation wasn't as good as the tweets claimed, and that he had a math error on a slide.  I'd love to meet Mrs. DeWitt.  And I love Dr. DeWitt for asking us to vote on his glasses (hashtag #dewittoldspecs for the old glasses and #dewittnewspecs for the new ones…I like the new ones!).

Today's talk is about big data: massive collections of records (petabytes).  In the Facebook data center there are 2,700 NoSQL nodes with 60 petabytes of storage.  Wow.

In 2009, people decided there was about a zettabyte of big data out there.  A zettabyte is a million petabytes.  They (whoever "they" are) predict something like 35 ZB by 2020.  Why is there an explosion in big data?  A lot of data is not entered by hand any more; it comes from a huge variety of sources.

How do you manage big data?  The old guard believes in using a parallel database system.  eBay manages about 10 PB of raw data on 256 nodes; Facebook uses a NoSQL system that manages twice as much data.

NoSQL means "not only SQL."  Some data lands outside an RDBMS and may not be worth storing in an RDBMS at all.  Why do people love NoSQL?  Data model flexibility, relaxed consistency (they're willing to trade consistency for availability), low upfront software costs, developers who only learned C/Java in school (wow!) and faster time to insight.

Two major types of NoSQL: Key/Value Stores and Hadoop.  Key/Value Stores are really intended for single-value inserts/updates.  Hadoop is more for large-scale analytics with no data model, and records are stored in a distributed file system.  Think of Key/Value Stores as NoSQL OLTP, and Hadoop as NoSQL Data Warehousing.
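
To make that contrast concrete for myself, here's a tiny Python sketch of my own (not anything Dr. DeWitt showed): the key/value style serves single-record lookups by key, while the Hadoop style scans everything and aggregates.

```python
# A toy contrast between the two NoSQL styles. Names and data are made up.

# Key/value style ("NoSQL OLTP"): single-record put/get by key.
kv_store = {}
kv_store["user:42"] = {"name": "Ada", "zip": "98101"}   # put
record = kv_store.get("user:42")                        # point lookup

# Hadoop style ("NoSQL data warehousing"): scan everything and aggregate.
events = [
    {"user": "user:42", "zip": "98101", "revenue": 3.50},
    {"user": "user:7",  "zip": "53703", "revenue": 1.25},
    {"user": "user:42", "zip": "98101", "revenue": 0.75},
]
revenue_by_zip = {}
for e in events:                 # full scan, no index, no point access
    revenue_by_zip[e["zip"]] = revenue_by_zip.get(e["zip"], 0) + e["revenue"]

print(record)
print(revenue_by_zip)
```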

Unstructured data is a misnomer, because if you look at the data, there is some sort of a schema (unless you want to run string matching against the data all the time).  Even if it's called unstructured, there is some underlying structure.  These systems are not big on ACID properties (no transactions, no ETL), in exchange for faster time to insight and flexibility.  Realize that a relational database system provides maturity, stability and efficiency.  The NoSQL systems give flexibility.

Dr. DeWitt believes that the world has truly changed.  Relational Database Management Systems are no longer the only game in town.  But they are not going away; they will still dominate transactional processing.  This is NOT a paradigm shift.  But many businesses will end up with data in both universes.  Today Dr. DeWitt wants to focus on what Hadoop is, and its ecosystem.

The big data movement all started at Google.  They had all that click-stream data to store and analyze, and it needed to be scalable, fault tolerant and simple to program against.  They built a distributed file system called GFS (the open-source counterpart in Hadoop is HDFS), and then a new programming model called MapReduce to process it.

The Hadoop Ecosystem includes HDFS, MapReduce, Hive & Pig and Sqoop (a way to move data), among others.  HDFS is the Hadoop distributed file system, and it is the core of the entire Hadoop ecosystem.  A couple of design goals: it's scalable to thousands of nodes, and it assumes that failures (hardware and software) are common.  Files are broken into large blocks, typically 64 MB in size, and each block is stored as a separate file in the local file system (NTFS here, since this is Hadoop on Windows).  The Hadoop file system does not replace the Windows file system; it just sits on top of it.  The blocks are spread around the different nodes of the cluster: block 1 is stored on nodes 1, 2 and 4 (triple replication is used to survive two failures), block 2 on nodes 2, 4 and 5, etc.  The writer places the first copy of a block on the node creating the file, so the first copy is stored locally.  The second copy of the block is stored on another node under the same switch, to minimize cross-rack traffic.  Finally, to deal with rack failures, the third copy is placed on a different rack, maybe even in a different data center.  The goal is to balance the blocks around the system and be highly fault tolerant.
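
Here's a minimal Python sketch of that placement policy as I understood it.  The node and rack names are made up, and real HDFS has more rules (and picks nodes with some randomness), but it shows the local / same-rack / different-rack pattern.

```python
# Toy HDFS-style block placement: copy 1 on the writer's node, copy 2 on
# another node in the same rack, copy 3 on a node in a different rack.

nodes = {  # node name -> rack name (all invented for illustration)
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB", "node5": "rackB",
}

def place_block(writer_node):
    rack = nodes[writer_node]
    same_rack = [n for n in nodes if nodes[n] == rack and n != writer_node]
    other_rack = [n for n in nodes if nodes[n] != rack]
    # Real HDFS chooses among candidates with some randomness and balancing;
    # here we just take the first candidate to keep the sketch short.
    return [writer_node, same_rack[0], other_rack[0]]

# A 192 MB file becomes three 64 MB blocks, each stored three times.
for block_id in range(3):
    print("block", block_id, "->", place_block("node1"))
```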

There is something called the NameNode in a cluster, and it is a single point of failure (there is a backup of the NameNode).  The third type of node is the DataNode, which is responsible for storing data in the local file system and serving up data to clients.  The NameNode is always checking the state of the DataNodes; that is one of its primary jobs (who's alive and who has failed).  Every time the client software wants to write data, it sends a message to the NameNode, and the NameNode figures out where the data should go.  Then the client writes the data directly to the DataNodes (the NameNode doesn't handle all the read and write traffic).  The reverse happens when a client application needs to read data: the client asks the NameNode where it can find a block, and then reads the data directly from the DataNode.
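
A toy version of that protocol, again my own illustration rather than the real RPC-based API: the NameNode only hands out metadata (block locations), and the actual bytes flow directly between the client and the DataNodes.

```python
# Toy NameNode/DataNode interaction. Class and block names are invented.

class NameNode:
    def __init__(self):
        self.block_locations = {}        # block id -> list of DataNodes

    def add_block(self, block_id, datanodes):
        self.block_locations[block_id] = datanodes

    def get_locations(self, block_id):   # metadata only, never block data
        return self.block_locations[block_id]

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                 # block id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

namenode = NameNode()
dn1, dn2 = DataNode("dn1"), DataNode("dn2")

# Write: ask the NameNode for placement, then push data to the DataNodes.
namenode.add_block("blk_0001", [dn1, dn2])
for dn in namenode.get_locations("blk_0001"):
    dn.write("blk_0001", b"64 MB of data would go here")

# Read: get locations from the NameNode, then read straight from a DataNode.
first_replica = namenode.get_locations("blk_0001")[0]
print(first_replica.read("blk_0001"))
```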

This is designed from the beginning to run on low-cost hardware (and software) that will fail.  When a DataNode fails, the NameNode (always keeping track of DataNodes) will detect that and will see what blocks were stored on that DataNode.  Those blocks are automatically replicated from one of the other two copies to create a new third copy.  If the NameNode fails, the BackupNode may require manual intervention to switch over to it (though automatic failover is in the works).  Under the covers, there is also balancing going on: the NameNode will see a new DataNode when it's added and will start rebalancing the file system on its own.
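
And here's the recovery step in a few lines of Python (the data structures are invented for illustration): when a DataNode drops out, every block it held gets a new copy made from one of the surviving replicas.

```python
# Toy re-replication after a DataNode failure.

block_map = {          # block id -> set of DataNodes holding a replica
    "blk_1": {"dn1", "dn2", "dn4"},
    "blk_2": {"dn2", "dn4", "dn5"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_failure(dead_node):
    live_nodes.discard(dead_node)
    for block_id, replicas in block_map.items():
        if dead_node in replicas:
            replicas.discard(dead_node)
            # Pick any live node that doesn't already hold this block and
            # copy the block there from one of the surviving replicas.
            target = next(n for n in live_nodes if n not in replicas)
            replicas.add(target)
            print(f"{block_id}: re-replicated to {target}")

handle_failure("dn4")
print(block_map)
```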

HDFS is highly scalable, handles massive files (petabytes) and uses very large 64 MB blocks, which are the unit of transfer (they go from DataNode to client and back).  64 MB is big!  It's not designed for OLTP; it's designed for scanning large amounts of data.  It does not require special hardware; remember, it's inexpensive SATA drives that are bound to fail.  This was to reduce cost, and one mechanism (block-level replication) handles failure.  A negative is that you have no clue where your data really is, which makes the kind of optimization a parallel database system does essentially impossible.

MapReduce is a programming framework that engineers can use to analyze massive amounts of data in place.  Users write a map function whose job is to divide and conquer: take a large problem, divide it into a bunch of small problems, perform the same function on all the small pieces, then combine the output from all the small problems.  There is a JobTracker, where jobs are submitted; it coordinates all MapReduce jobs, controls the TaskTrackers (little processes running on each node) and keeps track of failures.  How does this fit with HDFS?  On the main node are the JobTracker and the NameNode; on each worker node are a TaskTracker and a DataNode.
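
If you've never seen the model, here's a single-process Python sketch of the map/shuffle/reduce shape.  Nothing here is Hadoop's actual API; it's just my own toy to show the flow.

```python
# Minimal map/shuffle/reduce in one process. In real Hadoop the map and
# reduce tasks run on different nodes; here everything runs in one loop.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: emit (key, value) pairs from each input record.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Shuffle/group happens implicitly via the dictionary above.
    # Reduce: combine all values that share a key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Example: count customers per zip code (the keynote's group-by example).
records = [("alice", "98101"), ("bob", "53703"), ("carol", "98101")]
counts = run_mapreduce(
    records,
    map_fn=lambda rec: [(rec[1], 1)],              # emit (zip, 1)
    reduce_fn=lambda zip_code, ones: sum(ones),    # sum per zip
)
print(counts)   # {'98101': 2, '53703': 1}
```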

Imagine three data nodes, where each data node has two blocks, each block has two tuples, and this is zip code data.  Start two Map Tasks, and each one starts reading blocks from the file.  As they process the records in the blocks (the goal is a group by on zip code), they apply a hash function to the zip code so the reduce step can route and combine the values.  Mapper 1 saw some rows for a specific zip code and Mapper 2 saw others for that same zip code; data for the same zip code can be distributed across multiple Mappers.  Now this needs to get reduced.  The Reducers have to pull the data from where the Mappers left it (each Reducer must pull from each Mapper to get its data).  Then the Reducer sorts, and the final function is applied.  In general, the number of Map Tasks started is greater than the number of nodes (this helps with failures, stragglers and data skew).  If one of the workers fails, the work can be redistributed to another worker.  If the master fails, the whole thing starts over again.
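
The piece that's easy to miss is how the same zip code ends up at the same Reducer no matter which Mapper saw it: it's hash partitioning of the map output.  A quick sketch, with invented mapper outputs:

```python
# Toy shuffle step: each mapper hashes the zip code so that every occurrence
# of the same zip lands at the same reducer.

NUM_REDUCERS = 2

def reducer_for(zip_code):
    # A stable hash (sum of character codes) so every run routes "98101"
    # to the same reducer; real frameworks use a proper partitioner.
    return sum(ord(c) for c in zip_code) % NUM_REDUCERS

mapper_outputs = [
    [("98101", 1), ("53703", 1)],     # what mapper 1 emitted
    [("98101", 1), ("98101", 1)],     # what mapper 2 emitted
]

# Each reducer pulls its partition from every mapper, then sorts and sums.
reducer_inputs = {r: [] for r in range(NUM_REDUCERS)}
for output in mapper_outputs:
    for zip_code, count in output:
        reducer_inputs[reducer_for(zip_code)].append((zip_code, count))

for r, pairs in reducer_inputs.items():
    totals = {}
    for zip_code, count in sorted(pairs):
        totals[zip_code] = totals.get(zip_code, 0) + count
    print("reducer", r, "->", totals)
```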

Pros of MapReduce: it's highly fault tolerant, it's relatively easy to write, and it removes the burden of dealing with failures from the programmer.  The con is that the schema is embedded in the application code, meaning it's hard to share data between applications AND you cannot benefit from DBMS goodies like indexes, constraints, views, etc.

Hive and Pig: used by Facebook and Yahoo.  Facebook produced Hive and Yahoo produced Pig (slightly more declarative).  Both use Hadoop MapReduce as the target language for execution.  The example query finds the sourceIP address that generated the most adRevenue, along with the average pageRank.  The query takes 3 MapReduce jobs, since joins in the MR framework are not easy to write (4 pages of 5 point font).  In HiveQL it's only half a page in 20 point font.  Facebook concluded that MapReduce was not easy for end users; users were spending days writing programs for simple analysis.  Every day Facebook runs about 150,000 jobs, and of those, only 500 are MapReduce.  All the others are HiveQL jobs (which is basically SQL).  Only 500 out of 150,000: proof that declarative query languages are the way to go.  Now even the Google peeps have determined that MapReduce is not efficient.

HiveQL takes the best features of SQL and extends them with user-defined functions.  Tables in Hive do have data stored in them (like a relational DBMS), but the data types of the columns are richer than those in SQL and include complex types like associative arrays, lists and structs in addition to the traditional ints, floats, strings and dates.  Like parallel database systems, data can be partitioned.  When data is partitioned it gets put into an HDFS directory, and each part of the partition becomes its own HDFS file.  When you partition a Hive table, the partitioning attribute becomes part of the file name, and that attribute value is stored only once (Dr. DeWitt says that's cute, because they're minimizing how much data they have to store).

Example of HiveQL: Sales by zip code.  There's an HDFS directory for Sales, and then the HDFS files are by state (Michigan, Iowa, Washington, etc.).  If you query for data for Michigan, the query will only be executed against that file.  As for query optimization, there are limited statistics (file sizes only), so cost-based query processing is essentially impossible.  Query execution is handled by the standard MapReduce scheduler.
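
Here's a little Python sketch of that pruning idea, with invented paths: the partition value lives in the file name, so a query for Michigan only has to open one file.

```python
# Toy partition pruning: the partition column (state) is encoded in the
# path, so a query for one state only reads that state's file.

sales_files = {
    "/sales/state=Michigan/part-0000":   [("48201", 120.0), ("49503", 80.0)],
    "/sales/state=Iowa/part-0000":       [("50309", 45.0)],
    "/sales/state=Washington/part-0000": [("98101", 300.0)],
}

def sales_by_zip(state):
    # Prune: only open files whose path carries the requested partition value.
    relevant = [path for path in sales_files if f"state={state}/" in path]
    totals = {}
    for path in relevant:
        for zip_code, amount in sales_files[path]:
            totals[zip_code] = totals.get(zip_code, 0) + amount
    return totals

print(sales_by_zip("Michigan"))   # touches one file, not all three
```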

Let's compare PDW vs. Hive.  Dr. DeWitt had a student do some benchmarking: they ran SQL Server PDW version "next" and the current release of Hadoop on Windows, with 612 GB spread across 8 data nodes (80 GB per node or so).  They ran a basic select count query, then an aggregate with a group by; Hive was slower than PDW by a factor of 4.  Next query: a join between two tables (getting a max) where the data is partitioned.  PDW was again faster than Hive, regardless of whether the data was partitioned on the join attribute or not, by a factor of about 10.  There is a huge difference in what PDW can do in terms of analyzing large amounts of data really efficiently.  It's impossible to get the same level of performance from Hive, regardless of how you partition.

Connecting the universes of RDBMS (structured) and Hadoop (unstructured) is needed because data is landing in the unstructured environment first a lot more frequently, and because MapReduce is an excellent big data ETL tool; Sqoop is the tool that moves the data between the two.  The other reason is that some analysis is hard to do in SQL; it's easier in a procedural language, or in a language like HiveQL with MapReduce constructs.  How is this done?  You can unload the entire table, or run a query to pull out just the data you need.  Going forward, we will need data from both universes to answer a question.  The limitation is that data can end up being scanned multiple times to get the necessary information; when trying to connect the universes, this is the problem.
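
As a toy stand-in for the kind of hand-off Sqoop automates (this is not Sqoop syntax, just an illustration that uses sqlite3 so it's self-contained): run a query against the relational side and land the result as a flat file that a Hadoop job could scan.

```python
# Toy "unload" from a relational table to a delimited file. Sqoop itself
# speaks JDBC to the real database and writes into HDFS; table and file
# names here are invented.

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (zip TEXT, revenue REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [("98101", 300.0), ("48201", 120.0)])

# Pull only the columns/rows the analysis needs, not the whole table.
rows = conn.execute("SELECT zip, revenue FROM Sales WHERE revenue > 100")

with open("sales_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["zip", "revenue"])
    writer.writerows(rows)
```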

Dr. DeWitt says there has to be a better way to bridge the gap between the two universes.  Why not build a database system that understands both universes?!  It could execute queries across both without moving data unnecessarily, and it would have the expressive power of a language like HiveQL.  Dr. DeWitt and Rimma are going to try to do this in their lab, so stay tuned.  Dr. DeWitt asserts that PDW is the bunny, and he just needs to teach it how to understand unstructured data.  They will need to improve scalability and fault tolerance a bit, and Dr. DeWitt thinks that doing that is much better than going to a Hadoop-based system.

Are you confused?  I’m pretty good.  Dr. DeWitt is a great teacher.

What's the future of RDBMS vs. Hadoop?  They are designed to solve different problems and meet different requirements.  RDBMS-only or Hadoop-only is NOT going to be the default; we need to make sure the two technologies work together as best as they can.  PDW has a great future, and it will be complemented by great technologies like Hadoop.  Dr. DeWitt's team is 100% behind PDW.

And at the end…standing ovation (at least from the bloggers) for Dr. DeWitt’s presentation.  Awesome.

Lastly, Dr. DeWitt requested that we send ideas for next year’s talk to dewitt@microsoft.com (assuming we want him back, which of course we do).  If you want to download the slide deck from today, go here.

 
