SQL University – Internals by the Numbers

Hello everyone, and welcome to Internals week of the 2011 Spring semester of SQL University.  I am Erin Stellato and I will be your professor this week.

When you think Internals, what comes to mind?  It could be any number of topics…  The Storage Engine, the Optimizer, SQLOS, Buffer Cache – I could go on, the topic of Internals is very wide and very deep.  For me, when I start thinking about Internals, I picture this book first:

SQL Server 2008 Internals

SQL Server 2008 Internals

And then this one:

SQL Server 2008 Internals and Troubleshooting

SQL Server 2008 Internals and Troubleshooting

And then I start thinking of things like pages, data types, indexes, the transaction log, etc.  There’s an extensive list of topics that are included when you cover Internals, because you’re talking about how SQL Server works.

It’s a vast product, and there’s a lot I could write about this week.  So you think it would be easy for me to find a few things to write posts about, right?  Wrong.  So, so wrong.  Where do I start?  How much should I cover?  Do I just take one topic and start at 100 level and go to 400 level?  Deciding what to write was harder than deciding between watching Apollo 13 and The Right Stuff.

Suffice to say, I initially felt a bit lost (note to self, next time be more specific tossing out SQLU topics to Jorge), but that was a good lesson.  I was reminded of not only the depth to which you can go with Internals (and boy you better know your stuff when you start going down the rabbit hole), but also the breadth of topics that fall into Internals.  To me, this demonstrates that there is value to everyone knowing something about Internals.  To what level?  Well, that depends on how much you enjoy it, and also what you do every day.  For me, I just like knowing how things work.

Where to Begin

The first time I read the Internals book, which was the SQL Server 2000 version, I knew very little about SQL Server.  I understood tables and indexes to some degree, and I could write some TSQL, but I did not have a grasp of the big picture.  I decided to start this series with a high level overview of database structures, because I believe you have to understand what a database is and why you have one before you can understand how SQL Server works.

To make life more interesting, I notated the hell out of this first post.  There isn’t much about Internals that hasn’t already been covered.  What I wanted to do is give you a list of really great posts and articles that I rely on when I want to know the details (yes, there are MSDN links).  I developed this reading list for you: the person who wants to dig into internals.  If you just want the SpaceCamp tour, reading through this post should suffice.

I am a visual learner, so with apologies to Kendra Little ( blog | @kendra_little ) I created pictures to try and illustrate each topic.  Man do I love Word and Paint.

Finally, I named this post “Internals by the Numbers” because I am including information about maximum values as I go along.  Remember, these are the absolute limits and are not goals you should try to achieve.  As I have often heard Kimberly Tripp say, “Just because you can, doesn’t mean you should.”

Structures

Databases

A database is a logical container.  The logic for this container can be based on any number of things, but it is usually an application.  A company has a need for an application to fill certain business requirements (like tracking every movie I’ve ever seen), and that data gets stored in a database.  At a minimum, there are five databases for an instance (master, model, msdb, tempdb and the Resource database), but you can have up to 32,767 databases on a SQL Server instance.  By default, when you create a database the size is about 3 MB, but the maximum size for a database is 524,272 terabytes.

The lovely cylinder we know as a database:

Filegroups and Files

Within a database, you have filegroups, which are also logical containers.  Any database you create in SQL Server has two filegroups initally: PRIMARY and log.  Typically, additional filegroups are created in which data is stored, with only system tables in PRIMARY; but this is not required.  You can only ever have one log filegroup, but you can have 32,767 filegroups total.

Initial filegroups in a database:

Additional filegroups added to a database:

Within a filegroup, you can have one to many files, with a maximum of 32,767 files per database (not per filegroup, note that is per database).  The files are physical structures.  Files within the same filegroup can exist on different drives or LUNs, and often do for performance benefits.  When you initially create a file, its default size is 2 MB.  The maximum file size for a data file is 16 terabytes.  You can create multiple log files, but there is no advantage to doing so as the log file is used in a circular manner.  Having multiple log files does not mean that the files will get written to in parallel, or will follow the proportional fill model, which data files follow.  The default size for a log file is 1 MB, but this should be pre-sized to a larger value; that value is dependent upon your workload, recovery model and maintenance strategies.  The maximum file size for the log file is 2 terabytes, and I recommend reviewing two posts by Kimberly Tripp that discuss sizing the log file to get an appropriate number of Virtual Log Files.

Filegroups in a database, with files:

Pages and Extents

Data files can be broken down into pages, which are 8K (comprised of 16 512 byte disk segments).  These pages are organized into extents (logical), with each extent comprised of eight 8K pages, thereby totaling 64K.  Pages will only hold data for a specific object (e.g. table, index), but an extent can hold pages for multiple objects.  This is referred to as a mixed extent.  However, only the first few pages of an object can exist in a mixed extent (anywhere from one to seven pages), subsequent pages will all exist in a dedicated extent, where only pages for one specific object reside.  Note: you cannot change the size of pages or extents in SQL Server.  This is different than Oracle, where you can specify block size (block is analogous to page) and set it to 2K, 4K, 8K, 16K or 32K.

One sad, lonely page…

Eight happy pages, or one extent:

Records

Finally, the data that is stored in a database is stored as records, or rows (more on that in a minute).  This data is written to pages.  In earlier versions of SQL Server, there was a limit of 8060 bytes for the row size.  However, this was changed in SQL Server 2005; variable length data can now be pushed off row.

One row written to each page in an extent (not to scale):

Schemas, Tables, Columns and Indexes

All right, so now that we know what a database is and how it is organized logically and physically, we can talk about the data that we want to store in a database.  The data can often be categorized at a high level in some manner, such as by Country (the United States isn’t the only country that produces great movies!).  You can logically group data within a database by using schemas.  Every new database includes the default dbo schema, but you should create schemas to organize your data.

Within each grouping of movies there may be sets of data to track.  If I wanted to store data about all the movies made in the United States, I would probably keep information about actors, best quotes, reviews, sales, production and of course the movies themselves.  You would not want to store all this data in one location; you would group it in some manner.  These distinct data sets can be stored in a database in tables.  I can create one table that stores information about movies, one table that stores information about actors – basically one table for each set of data I want to store and search.  There is a limit of 2,147,483,647 objects in a database, where an object is a table, view, stored procedure, UDF, trigger, rule, default or constraint.

A table can be organized in one of two ways: as a clustered index or as a heap (aka a table without a clustered index).  In a clustered index, data is stored in a particular order, as defined by a clustering key.  This is very different than a heap, where data is not stored in any particular order.

If I created a table for movie information, I would want to track values such as movie name, release date, rating and genre.  These translate into columns in my table (where you could have 1024 columns in a nonwide table, and 30,000 in a wide table).  Every column has a data type, which determines what kind of data it can store; make sure to choose your data types wisely.  Once I have a table defined, I add rows to the table.  For my movie information table, I would add one row for each movie.

One way you can think of a table…realize this is nothing like how it’s stored on disk:

Finally, when I want to find that data, I don’t want to read through the entire table every time, I want to use an index to just find the rows I need.  A nonclustered index is an ordered list of a sub-set of columns from the table.  Any index that is not a clustered index is a nonclustered index.  A table can only have one clustered index, because that index is the data, but you can have a maximum of 999 nonclustered indexes on a table.  However, remember that every time you add a row to the table, delete a row or update a row, the same must occur for any indexes on the table.  “Just because you can, doesn’t mean you should.”

A nonclustered index (in this case, the ReleaseDate and Movie Name columns, ordered by ReleaseDate):

Side note: I recommend naming tables and columns intuitively, without being too excessive, too abbreviated, or including reserved words.   If you continue with our movies example, I would first create a UnitedStates schema to which I would assign all my tables that hold data related to movies released in the US.  For the table that holds movie reviews, I would name it Reviews, rather than something like ReviewTable or Revs or RevData.  For my columns, Name and Date as column names should be avoided.  Better options are MovieName and DateofReview, respectively.  Another note…case does not matter here; I am just using it for readability.

What’s Next?

At this point I have explained the structures of a database from the 10,000 foot level.  I included the maximum values where relevant to not just tell you what they are, but to get you to start thinking about what limits are truly reasonable.  Those limits are not where performance drops off; performance will suffer long before 999 nonclustered indexes.  The challenge of any person who works with data (developer, DBA, BI analyst) is understanding what factors affect performance and why.  SQL Server Internals play an enormous part in that understanding.  To that end, at the bottom of this post I have included links related to the aforementioned data structures and performance, most notably indexes.

Within this post I provided enough material to start your own reference library, but my links are not all-inclusive.  In fact, I am certain I have missed a number of great posts.  Feel free to leave a comment or send me an email if you have a go-to article on one of these topics.  I have started an Internals Reference page here, which includes all the links from this post, and I am happy to add to it.

Additionally, if you do not have either of the Internals books I referenced initially, I recommend buying one or both.  I still reference both on a regular basis.  Your knowledge of Internals may not have an immediate impact in your daily job.  But as you start to really understand how SQL Server works, you will become better equipped to troubleshoot and solve problems as they arise, because you will understand what is going on behind the scenes.

My other posts this week will be related to the Storage Engine, and will dive deeper into topics than I did today.  Your homework?  Make sure you’re familiar with DBCC Page 🙂

Additional Links

Clustered Index Design Guidelines (MSDN)

SQL Server Best Practices Article (MSDN)

Indexes in SQL Server 2005/2008 – Best Practices, Part 1 – Kimberly Tripp

Indexes in SQL Server 2005/2008 – Best Practices, Part 2 – Internals – Kimberly Tripp

The Clustered Index Debate Continues… – Kimberly Tripp

GUIDs as PRIMARY KEYs and/or the clustering key – Kimberly Tripp

Disk space is cheap… – Kimberly Tripp

How much does that key cost (plus sp_helpindex9) – Kimberly Tripp

More considerations for the clustering key – the clustered index debate continues! – Kimberly Tripp

2 Responses to SQL University – Internals by the Numbers
  1. Bradley Ball @SQLBalls
    May 17, 2011 | 8:55 pm

    Great Start on the Internals Week, I’m looking forward to your next post!

  2. Ponnu
    December 6, 2012 | 1:30 pm

    Thanks for this post, It is really much appriciated your help to SQL people.

Leave a Reply

Wanting to leave an <em>phasis on your comment?

Trackback URL http://erinstellato.com/2011/05/sql-university-internals-by-numbers/trackback/