Me: Hey there.
Me: How was your day?
You: Pretty good, how about yours?
Me: Oh, it was interesting. Grab a beer, let me tell you about it…
So, it started out with a call from a customer because they had received the error:
Could not continue scan with NOLOCK due to data movement.
When they tried to run DBCC CHECKDB they ran into this message:
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
I knew I had seen the error before, and it is mentioned in this post of Paul Randal’s ( @PaulRandal ). Even though I wasn’t hitting the exact same set of errors, I tried the recommendation to run it with TABLOCK. In previous cases, I could then get CHECKDB to run but for this customer I still received the error. I then ran the CHECKDB steps individually. I first tried CHECKALLOC and it threw an error. I ran CHECKTABLE for each table, and found that some tables did have corruption, but the core tables in the application did not. This was a good thing. Then I ran CHECKCATALOG and that came back without error. On the suggestion of Amit Banerjee ( blog | @banerjeeamit ), I verified I could select from sys.tables. Because the CHECKALLOC was throwing unusual errors, I couldn’t determine the true extent of the corruption. However, Amit pointed out that I should be able to extract most of the data from the tables and push it over to a new database.
We explained the CHECKDB output to the customer and recommended they restore to their last known good backup. Guess when that was?
You: Oh no…
Me: Yeah…it was over 6 months old.
Me: I know, it was a bummer. So, we discussed the options and eventually the customer decided they wanted us to try and resolve the corruption errors, even if that meant they lost some data.
You: So did you work your magic?
Me: Ha! I wish it was that easy.
Before I connected to the customer environment, I asked them to back up the corrupt database, which they did, and then I asked them to restore that database for me to work against. I wanted the original copy of the database untouched. I wanted to see if CHECKDB with the ALLOW_DATA_LOSS_OPTION would even run. When I got connected to the customer environment, there were two databases: PROD and PROD2. It was explained to me that PROD2 was the restored database. I asked if PROD was the Production database and they said yes. My plan was try and run a repair against PROD2, as that was the restored copy. I ran the CHECKDB against PROD2 to make sure it gave the same errors, which it did, and then I tried CHECKDB with the REPAIR_ALLOW_DATA_LOSS option. It failed, which I thought it might based on my conversation with Amit. So I create a new blank database and did an import into it from PROD, and all the data came over!
You: It did?
Me: Yeah, I was surprised, too…but I got all the indexes recreated and rights granted and then had the customer connect via the application to make sure everything looked ok. So the customer was trying to retrieve some data and there was nothing in there from May. I did a query through SSMS and found data from the last few days. In fact, I discovered that while I was migrating data from the PROD database, users were connected to it! I explained this to the customer and the customer said that was ok.
You: I’m confused.
Me: Yeah, that’s when it started to go downhill. Because you know what? That PROD database that I had migrated data from?
Me: It was a restored copy of the database from six months ago.
Me: That’s exactly what I said. It turns out that the customer had restored the database from six months ago, and named it PROD, and users were connected to that database. And then when I asked the customer to restore the backup of the corrupt database, that was restored as PROD2.
You: So you migrated a database that had no corruption?
You: Weren’t you upset?
You: Wow. Did you blame the customer?
Me: No. At least, not completely. I take 50% of the responsibility here. And this is why I’m telling you about my bummer of an afternoon. I made an assumption based on the name of the database. And I didn’t validate the databases as well as I could have. I could have done better. The customer might not have understood what I was asking. I really don’t know; there was definitely some miscommunication there.
You: Yeah, but you wasted your whole afternoon!
Me: True. And when I got off the phone I was really frustrated. I started walking away from my desk and wanted to yell and vent and just be angry at somebody. But then about 10 feet from my desk I stopped and I said out loud, “I don’t want to spend any energy being upset about this.” I turned around, went back to my desk, and figured out what I would do next. It was time for me to go to a soccer game, which I did, and it was a good mental break. As I was walking to the car afterwards, I realized that I learned two lessons today. First, the motto of “trust but verify” is a good one, but you also have to pay attention to the details. I missed two: the database size and the fact that I migrated the data with no errors. I didn’t mention the database size to you, but they had said the database was in the neighborhood of 30 GB. When I was looking at the PROD database, it was only 13 GB. I just dismissed it as user error. I never checked the PROD2 database. Also, when the data migrated without any errors I should have stopped and run the CHECKDB against PROD, and I didn’t.
The second lesson I learned today is that no matter who makes the mistakes, being angry about is a waste of energy. This is common sense. I see that you’re nodding your head as I say this. I know this. I’ve always known this. But today, I actually lived it. What a difference. I’m not perfect, I’m sure I will still get mad sometimes. But I cannot get over the power that came with saying out loud, “I don’t want to spend any energy being mad about this.” I need to try it more often.
You: Wow. What a day.
Me: No kidding. Want another beer?