Wednesday, March 28, 2012

Is this SQL Test too hard?

I have gotten some criticism from coworkers regarding this test and just wanted to see what you guys think. I realize the wording could use improvement and any criticism towards making it easier to understand is much appreciated.

FWIW - I had to solve this problem on the job so I feel it is a real-world test that helps me understand how people think and if they try to find alternate solutions.

Thanks!

~~~~~~~~~~~~~~~~~~~~

Given a table that has over 100,000 records…

SUBSIDIARY
=========
PARENT_ID INT
CHILD_ID INT
ULTIMATE_PARENT_ID INT
CLEANUP_IND BIT

…where each PARENT_ID can have multiple CHILD_ID values, but the PARENT_ID should not equal the CHILD_ID. After an initial data load, the ULTIMATE_PARENT_ID and CLEANUP_IND columns contain NULL values (see page 2 for sample data).

ULTIMATE_PARENT_ID is defined as the topmost parent in the chain for the particular CHILD_ID record, so if the chain was only 2-level’s deep the ULTIMATE_PARENT_ID is the CHILD_ID’s PARENT_ID’s PARENT_ID.

Please write an answer for all three questions below:

A) Which of the following queries should you run first?
B) Write an optimized query to identify the ULTIMATE_PARENT_ID for each CHILD_ID and set its value into the ULTIMATE_PARENT_ID column.
C) Write a query to identify ALL of the circular references and mark each record that is a circular reference by updating the CLEANUP_IND column to 1.

~~~~~~~~~ Page 2 ~~~~~~~~~

Sample Data, remember though this table has over 100,000 records and the parent-child chain can go n-levels deep – where n is not known.

PARENT_ID CHILD_ID ULTIMATE_PARENT_ID CLEANUP_IND
1024 512 NULL NULL
36 2300 NULL NULL
887 541 NULL NULL
1022 1024 NULL NULL
546 887 NULL NULL
512 2305 NULL NULL
112 967 NULL NULL
697 123 NULL NULL
901 452 NULL NULL
2300 666 NULL NULL
334 445 NULL NULL
512 903 NULL NULL
884 554 NULL NULL
313 313 NULL NULL
554 884 NULL NULL
112 119 NULL NULL
967 555 NULL NULL
2305 333 NULL NULL
333 36 NULL NULL
541 546 NULL NULL
1030 1020 NULL NULL
112 999 NULL NULLI wouldn't even know where to start ;)
Oh and your question A appears to have a large part of the question missing...|||A is referring to B & C - which do you try to solve first and how do you solve them? If you have questions please ask and I can answer.

FWIW - nobody I have interviewed has managed to correctly answer these questions, including an author of a few SQL books.|||Run C then B would be my guess.
Can one parent have more than one child?
EDIT: After looking back at the sample data I can see that they can.
This looks like an aweful lot like tying to work out a department tree diagram (employees and their managers etc) - in which case I *might* be able to do it if I had more time - I may attempt it later|||A hint to question A is to try to solve B & C first.

Yes a parent can have more then one child, think of a Dun & Bradstreet file that lists companies and subsidiaries where each company can be the parent of multiple subsidiaries.|||tests like this are dumb

to find the ultimate parent, question B, in an "optimized" way, well, how often are you gonna want to do that? why does it have to be "optimized" and what does this mean?

define "circular" reference|||tests like this are dumb

to find the ultimate parent, question B, in an "optimized" way, well, how often are you gonna want to do that? why does it have to be "optimized" and what does this mean?

define "circular" reference

You fail.

Seriously - this is a real world example. Say you inherit a database that has no referential integrity set up and you need to analyze it and identify the top most nodes - are you going to throw your hands up in the air and say this problem is dumb?

As to why does it have to be optimized - let's say you manage a nightly job that imports millions of records where you need to ETL the data - optimization here is key.

http://en.wikipedia.org/wiki/Circular_reference|||Say you inherit a database that has no referential integrity set up
...
are you guy going to throw your hands up in the air and say this problem is dumb?

Yes - it should have been set up properly in the first place ;)|||your column names are confusing unless I am missing something. which id refers to "self"? ParentID I assume?

that is, for the first row in your example, is 512 a child of 1024? if so, I would name the columns NodeID, ChildID in that case. ParentID implies that that's the id of my parent, not of myself. know what I mean?|||That is a good point and I will take it into consideration, my only reservation is it specifically drives people into the direction of the correct solution. So far every single person that has taken this test has tried to work from the bottom-up due in part to the naming of the columns. One of my colleagues asked me if this was a "trick test" and I told him in a way it was since there are multiple ways to solve this problem and you will probably not know right off the bat the correct solution unless you have had to tackle a problem similar to this before. I want candidates to explore multiple solutions to come up with a solution that performs optimally.|||You have reservations that naming columns clearly drives people in the direction of a correct solution? are you kidding?

Why not name the columns a,b,c,d then? then it would be even more muddled...

hopefully you start the interview with some easier questions than this one.

I can imagine people getting bogged down on this rather quickly. If they aren't able to make much progress, it would be hard to judge how much they really know.|||I would say this are both moderately difficult SQL problems, in that the require multiple steps rather than a single UPDATE statement, as well as looping and the presumption of an understanding of binary trees.
An SQL Expert should be able to handle them easily, and anybody who wrote a book on SQL should be able to solve them.
If given the test myself, my first response would be that the inclusion of ULTIMATE_PARENT_ID in the schema is poor design, so award bonus points to anybody who catches that.|||I would say this are both moderately difficult SQL problems, in that the require multiple steps rather than a single UPDATE statement, as well as looping and the presumption of an understanding of binary trees.
An SQL Expert should be able to handle them easily, and anybody who wrote a book on SQL should be able to solve them.
If given the test myself, my first response would be that the inclusion of ULTIMATE_PARENT_ID in the schema is poor design, so award bonus points to anybody who catches that.

I agree, I don't give this test over a phone screen or first round, this is for second (last) round candidates.

Denormalization is not always poor design, it depends on the application - if you have a listing stored procedure where you will always be showing the stock ticker (or other hierarchical attribute) for a particular company even if the company is a subsidiary of a company that has a stock ticker then it would be smart to include it in the child records.|||You have reservations that naming columns clearly drives people in the direction of a correct solution? are you kidding?

Why not name the columns a,b,c,d then? then it would be even more muddled...

hopefully you start the interview with some easier questions than this one.

I can imagine people getting bogged down on this rather quickly. If they aren't able to make much progress, it would be hard to judge how much they really know.

A very good side effect of structuring a test like this is seeing how individuals do under pressure, how they behave and how they react towards others that are trying to help them. It is very difficult to evoke this type of pressure in an interview and receive an accurate indication of the candidate's personality. I always help the candidate come to the correct solution and their willingness or stubbornness to learn is an excellent indication of how they will perform on the job.

Keep in mind the solution to this problem does not require knowledge of some advanced t-sql expertise or special SQL 2005 syntax or tricks. You merely have to know how to either write a cursor (for the inefficient solution) or a while loop which is pretty basic SQL 101. The difficulty lies in arriving at the solution and coming up with a solid algorithm. I have started to tell candidates just give me the pseudocode or algorithm you would use to arrive at the solution rather then get bogged down with sql syntax.|||For anyone that likes puzzles like this, I can solve both B & C with 4 update statements (and 1 while loop) - using standard SQL 2000 t-sql (no new SQL 2005 techniques).|||i would drop the table and forget about it.|||i would drop the table and forget about it.
quitter...........|||For anyone that likes puzzles like this, I can solve both B & C with 4 update statements (and 1 while loop) - using standard SQL 2000 t-sql (no new SQL 2005 techniques).
I thought I needed 4 update statements when reality it can be done with 3.|||Keep in mind the solution to this problem does not require knowledge of some advanced t-sql expertise or special SQL 2005 syntax or tricks. You merely have to know how to either write a cursor (for the inefficient solution) or a while loop which is pretty basic SQL 101.I think a good test would be to ask the candidate to recite from memory the syntax required for creating, opening, using, dropping, and deallocating a cursor. If they can do this, then they obviously overuse cursors and fail the interview.sql

No comments:

Post a Comment