Classic Machine Learning Example In SQL Server Analysis Services

By:   |   Updated: 2013-06-04   |   Comments (5)   |   Related: > Analysis Services Development


Problem

Many Data Mining or Machine Learning students have trouble making the transition from a Data Mining tool such as WEKA [1] to the data mining functionality in SQL Server Analysis Services.

Solution

The solution presented here takes a classic example from Data Mining and Machine Learning seen in differing variations in textbooks by Quinlan [2], Mitchell [3], Han, Kamber and Pei [4], and the WEKA application.

The first step is to create a table and load it with data using the T-SQL sample below.

IF EXISTS (SELECT name FROM sys.tables WHERE name = N'tblDecisionTreeExample')
    DROP TABLE dbo.tblDecisionTreeExample;
create table dbo.tblDecisionTreeExample 
(
  PKey integer identity(1,1) Primary Key,
  Outlook varchar(8) not null,
  Temperature varchar(4) not null,
  Humidity varchar(6) not null,
  Windy varchar(5) not null,
  Play varchar(3) not null
)
insert into dbo.tblDecisionTreeExample values ('sunny','hot','high','FALSE','no')
insert into dbo.tblDecisionTreeExample values ('sunny','hot','high','TRUE','no')
insert into dbo.tblDecisionTreeExample values ('overcast','hot','high','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('rainy','mild','high','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('rainy','cool','normal','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('rainy','cool','normal','TRUE','no')
insert into dbo.tblDecisionTreeExample values ('overcast','cool','normal','TRUE','yes')
insert into dbo.tblDecisionTreeExample values ('sunny','mild','high','FALSE','no')
insert into dbo.tblDecisionTreeExample values ('sunny','cool','normal','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('rainy','mild','normal','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('sunny','mild','normal','TRUE','yes')
insert into dbo.tblDecisionTreeExample values ('overcast','mild','high','TRUE','yes')
insert into dbo.tblDecisionTreeExample values ('overcast','hot','normal','FALSE','yes')
insert into dbo.tblDecisionTreeExample values ('rainy','mild','high','TRUE','no')
select * from dbo.tblDecisionTreeExample

The output from the T-SQL code is shown below. This data will be used as the input for the data mining algorithm.

 
The output from the T-SQL code

In Visual Studio, create a new Analysis Services Multidimensional and Data Mining Project. In this tip, we will name the project DecisionTreeExample. Click on OK when finished with the New Project window.

In Visual Studio, create a new Analysis Services Multidimensional and Data Mining Project

In the Solution Explorer window, right-click on the Data Sources folder and choose "New Data Source..." to initiate the Data Source Wizard. Click on "Next >".

right-click on the Data Sources folder and choose "New Data Source..."

Choose your data connection, if one exists. If a Data connection does not exist, click on "New..." to create a new data connection. In this example, we are using a connection to the Tips database on the localhost. Click on "Next >".

If a Data connection does not exist, click on "New..."

On the Impersonation Information screen, click on "Use a specific Windows user name and password." Enter your username and password. Click on "Next >".

click on "Use a specific Windows user name and password."

On the Completing the Wizard screen, the data source name can be changed if desired. Click on "Finish".

the data source name can be changed

In the Solution Explorer window, right-click on the Data Source Views folder and choose "New Data Source View..." to launch the Data Source View Wizard. Click on "Next >".

"New Data Source View..."

On the Select a Data Source page in the Relational data sources window, select the data source we created in the above step. Click on "Next >".

the Select a Data Source page in the Relational data sources window

On the Name Matching page, check the box for "Create logical relationships by matching columns." In the Foreign key matches box, press the "Same name as primary key" button. Click on "Next >".

check the box for "Create logical relationships by matching columns."

On the Select Tables and Views page, move tblDecisionTreeExample from the Available Objects box to the Included object box by selecting tblDecisionTreeExample in the Available objects box and then clicking on the ">" box. Click on "Next >".

move tblDecisionTreeExample from the Available Objects box to the Included object box by selecting tblDecisionTreeExample

On the Completing the Wizard page, give the Data Source View a name and click on "Finish".

give the Data Source View a name and click on "Finish"

The Solution Explorer should appear as it does below with one Data Source and one Data Source View defined.

The Solution Explorer should appear as it does below with one Data Source and one Data Source View defined.

In the Solution Explorer window, right-click on the Mining Structures folder and choose "New Mining Structure..." to launch the Data Mining Wizard. Click on "Next >".

right-click on the Mining Structures folder and choose "New Mining Structure..."

On the Select the Definition Method page, press the radio button labeled "From existing relational database or data warehouse". Click on "Next >".

press the radio button labeled "From existing relational database or data warehouse"

On the Create the Data Mining Structure page, press the radio button labeled "Create mining structure with a mining model". Choose the "Microsoft Decision Trees" data mining technique from the drop-down box.

press the radio button labeled "Create mining structure with a mining model"

On the Select Data Source View page, choose "Tips" from the Available data source views. Please note this is the data source view we created earlier. Click on "Next >".

choose "Tips" from the Available data source views

On the Specify Table Types page, make sure the Case box is checked and the Nested box is unchecked for the table named tblDecisionTreeExample.

make sure the Case box is checked and the Nested box is unchecked

On the Specify the Training Data page, in the Key column check the box that corresponds to the PKey column. In the Input column, check the boxes for the Humidity, Outlook, Temperature, and Windy columns. These input columns will be used as input to the decision tree algorithm. In the Predictable column, check the box for the Play column which is the column the decision tree algorithm will attempt to predict. Click on "Next >".

check the box that corresponds to the PKey column

On the Specify Columns' Content and Data Type page, we see the columns to be used in the mining model structure, along with their content and data types. Click on "Next >".

Specify Columns' Content and Data Type page

On the Create Testing Set page, we will set the "Percentage of data for testing" and "Maximum number of cases in testing data set" to zero for this example. Please note that there needs to be a set of data reserved for testing or use 10-fold cross validation to prevent overfitting the data mining model to the training data. Click on "Next >".

set the "Percentage of data for testing" and "Maximum number of cases in testing data set" to zero for this example

On the Completing the Wizard page, the name of the mining structure and model can be changed. Click on "Finish".

the name of the mining structure and model can be changed

The image below shows the Visual Studio Solution Explorer window and the Mining Structure tab of the data mining model that was created in the steps above.

the Visual Studio Solution Explorer window

The next image from Visual Studio shows the Mining Model tab of the data mining model that was created in the steps above.

The next image from Visual Studio shows the Mining Model tab of the data mining model

Under the Mining Model tab, right click on the box labeled "Microsoft_Decision_Trees" to bring up the menu shown below and select "Set Algorithm Parameters..." to display the Algorithm Parameters window.

Under the Mining Model tab, right click on the box labeled "Microsoft_Decision_Trees"

In the Algorithm Parameters window, remove the COMPLEXITY_PENALTY and FORCE_REGRESSOR parameters. Set MINIMUM_SUPPORT to 1, SCORE_METHOD to 1, and SPLIT_METHOD to 2. The description box explains the meanings the parameters and their values.

In the Algorithm Parameters window, remove the COMPLEXITY_PENALTY and FORCE_REGRESSOR parameters

The parameters and their values should now appear as shown below. Click on "OK".

The parameters and their values should now appear as shown below

The next few images show the process of building, deploying, and processing the data mining model on the SQL Server Analysis Services server from the Visual Studio project. In the Solution Explorer window in the Mining Structures folder, right-click on the DecisionTreeExample.dmm mining structure and choose "Process..."

processing the data mining model on the SQL Server Analysis Services server from the Visual Studio project

When asked "Would you like to build and deploy the project first?", choose "Yes".

"Would you like to build and deploy the project first?"

Next, the user might be given a warning about the amount of time it could possibly take to process the model. When asked "Do you wish to continue?", choose "Yes". There are only 14 records so it should only take a few seconds.

the user might be given a warning about the amount of time it could possibly take to process the model

The Process Mining Model window is displayed next. In this example, we will "Process Full" and leave all other settings to their default values. Click on "Run...".

The Process Mining Model window is displayed next

The Process Progress window is displayed while processing is occurring. Upon successful completion, click "Close".

Upon successful completion, click "Close".

Click on the Mining Model Viewer tab. The decision tree model will display on the screen.

Click on the Mining Model Viewer tab

Hovering the mouse pointer over the rectangular objects will pop-up an informational box showing the population of the split.

Hovering the mouse pointer over the rectangular objects will pop-up an informational box showing the population of the split

The image below is taken from the Object Explorer window for Analysis Services in SQL Server Management Studio after deployment and processing. Notice how the objects displayed here match the objects seen in the Solution Explorer window within Visual Studio.

the Object Explorer window for Analysis Services in SQL Server Management Studio
Next Steps

References

  • [1] Witten, I.H., E. Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann Publishers, 2005.
  • [2] Quinlan, R., C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
  • [3] Mitchell, T., Machine Learning, McGraw Hill, 1997.
  • [4] Han, J., M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2012.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Dallas Snider Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2013-06-04

Comments For This Article




Monday, September 28, 2015 - 7:16:13 AM - Dallas Snider Back To Top (38768)

Try to eliminate this error first:  "A connection could not be made to the data source with the DataSourceID of 'DataMiningTest', Name of 'DataMiningTest'." 

Go to your Data Source.  Make sure your connection string is correct by using "Test Connection".  Also, click on the Impersonation Information tab and make sure your User Name and Password are correct under "Use a specific Windows user name and password."


Friday, September 25, 2015 - 12:25:34 PM - Zynp Back To Top (38753)
I'have been trying to succeed in mining structure process step and I'm taking these errors.
 
-Internal error: The operation terminated unsuccessfully.
-Errors in the high-level relational engine. A connection could not be made to the data source with the DataSourceID of 'DataMiningTest', Name of 'DataMiningTest'.
-Errors in the OLAP storage engine: An error occurred while the dimension, with the ID of 'DataMiningTest ~MC-Id', Name of 'DataMiningTest ~MC-Id' was being processed.
-Errors in the OLAP storage engine: An error occurred while the 'Borc' attribute of the 'DataMiningTest ~MC-Id' dimension from the 'MultidimensionalProject3' database was being processed.
-Errors in the OLAP storage engine: An error occurred while the dimension, with the ID of 'DataMiningTest ~MC-Id', Name of 'DataMiningTest ~MC-Id' was being processed.
-Errors in the OLAP storage engine: An error occurred while the 'Calismasekli' attribute of the 'DataMiningTest ~MC-Id' dimension from the 'MultidimensionalProject3' database was being processed.
-Errors in the high-level relational engine. A connection could not be made to the data source with the DataSourceID of 'DataMiningTest', Name of 'DataMiningTest'.
-Errors in the OLAP storage engine: An error occurred while the dimension, with the ID of 'DataMiningTest ~MC-Id', Name of 'DataMiningTest ~MC-Id' was being processed.
-Errors in the OLAP storage engine: An error occurred while the 'Calismasuresi' attribute of the 'DataMiningTest ~MC-Id' dimension from the 'MultidimensionalProject3' database was being processed.
 

Friday, March 13, 2015 - 12:06:52 PM - Oscar Back To Top (36521)

In the Connection Manager, Who type Provider is?

SQL Server Native Client 10.0

SQl Server Native Client 11.0

SQLClient Data Propvider

 

I´m use SQL Server 2008 DB

 

Saludos


Wednesday, October 9, 2013 - 2:48:24 AM - Rosana Back To Top (27086)

Too mechanical. The author must not assume that everybody knows the examples in the referred books. The article would add value to the reader if there is a short description of the purpose and results of the model. A comment about the selection of the values for the algorithm parameters would help to connect all pieces of information together.


Wednesday, August 28, 2013 - 11:36:48 AM - Jeannine Back To Top (26510)

Nice idea to use the data sets from Hans and Kamber and other texts. I was hoping you would comment on the choice of parameters or discuss the results, but perhaps in a follow up post?















get free sql tips
agree to terms