Write a 500-750 word paper describing the model deployment and model life cycle aspects of your model. It will include the following:

• What are model deployment costs? Be specific.
• What is a proposed task and timeline for deploying your model?
• What specific training will be required for those who will be using the model on a regular basis?
• Can this model be used on a repetitive basis? Explain.
• How will model quality be tracked over time?
• How will the model be re-calibrated and maintained over time?
• What specific benefits to the organization will be realized over time as a result of using the model?

***Please keep in mind the model we chose was the Logistic Regression model, so you would be talking about that model for this assignment. I also attached the previous papers that should help you understand more of the modeling and problem we are looking to solve. Please let me know if you have any other questions.***

I also attached an EXAMPLE of what this assignment could possibly look like or how it might be laid out. This example was directly from the instructor.

team_fire___model_validation_rough_draft_needs_external_validation.docx

week_7___example___model_deployment___life_cycle___j_debruyn__1_.docx

week_6___model_building.docx

Unformatted Attachment Preview

Running head: MODEL VALIDATION

1

Model Validation

Darren Mans

Renee Taillon

Shelbea Rainbolt

Thomas Salmons

Grand Canyon University: MIS 690

March 10, 2019


Model Validation

Four predictive models (Logistic Regression, Neural Net, CHAID, and Discriminant) were built in IBM SPSS to predict burn patient status at time of hospital discharge, with the requirements that (1) model accuracy be at least 70% and (2) model overfitting be negated. The datasets were segmented to investigate the concentration of burn patients under 5 years old. The decision was made post-analysis, during a stakeholder meeting, that the focus for the project moving forward would be to consider only the entire, unsegmented dataset. The models that focused on the young age group were suspect and highly sensitive due to the very small amount of historical data for the dependent variable in the training dataset (e.g., 5 records with the status of dead). In addition, the input variable TBSA, by its definition, accounts for a discrepancy between the young patients and the older patients: TBSA is based on the Wallace Rule of Nines, which has a different set of criteria for adults and young children (Radiation Emergency Medical Management, n.d.). The final models under consideration included only variables with significance values of 0.05 or less and had accuracies ranging from 91% to 94%; all highly accurate. These model accuracies met the 70% accuracy criterion in the business problem statement. Due to the closeness and apparent robustness of all the models, it was a challenge to choose just one. The Logistic Regression model was ultimately chosen because it was the most conservative with respect to false positives, which matter greatly in a dead-or-alive medical model. We highlight the results of the internal validation by investigating the accuracy, sensitivity, and specificity, and we also discuss external validation.

Internal Model Validation

“Scholars have defined a series of methods through which to validate the results obtained from a logistic regression model” (Giancristofaro & Salmaso, 2007). In data science it is not acceptable to evaluate the performance of a model with the same data used to train the model, because “it can easily generate over-optimistic and overfitted models” (Bulriss, 2018). Of the two internal validation approaches (hold-out or cross-validation), we chose the hold-out method for the model build in IBM SPSS because it is applicable to logistic regression models and is a validation feature built into the SPSS tool (IBM SPSS, 2019). The 1,000-record dataset was partitioned into 70% for training, 20% for testing, and 10% for internal model validation. The SPSS model stream includes the Partition node following the Type node, with the parameters shown in Figure 1.

Figure 1

Figure 2 shows the Case Processing Summary, which indicates the number of cases included in our analysis. The eighth row tells us that zero records are missing data in the variables included in our analysis.

Figure 2. Case Processing Summary for Revised Model – Full Dataset

Figure 3 shows the accuracy of each of the partitioned datasets for training, testing, and validation: 93%, 90.5%, and 92%, respectively. Note that the testing accuracy is less than the training accuracy, which is another good indicator of not having an overfit model. The validation accuracy of 92% meets the business problem criterion of a minimum of 70%. Using a separate validation dataset satisfies the second criterion, not having an overfitted model, and is one of the main reasons for using a hold-out dataset.

Figure 3. Comparison between Predicted and Actual Dependent Variable (Status) and Accuracy

Figure 3 illustrates the Confusion Table, or Coincidence Matrix, for this revised model: the Logistic Regression excluding the Race and Gender predictors.


Figure 4. Confusion Matrix – Full Dataset/Revised Model – Logistic Regression


The True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts are used to determine model sensitivity, specificity, and accuracy by the following equations:

Sensitivity = TP / (TP + FN) = 85 / (85 + 7) = 0.924

Specificity = TN / (TN + FP) = 7 / (7 + 1) = 0.875
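These calculations can be double-checked in a few lines. The counts below are the ones reported for the validation partition in the confusion matrix (Figure 4: TP = 85, FN = 7, TN = 7, FP = 1); the function itself is a generic illustration rather than part of the SPSS stream:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of actual positives identified
    specificity = tn / (tn + fp)  # proportion of actual negatives identified
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Counts reported for the validation partition.
sens, spec, acc = confusion_metrics(tp=85, tn=7, fp=1, fn=7)
print(round(sens, 3), round(spec, 3), round(acc, 2))  # 0.924 0.875 0.92
```

Note that the resulting accuracy, (85 + 7) / 100 = 0.92, matches the 92% validation accuracy reported above.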

The validation dataset has a sensitivity of 92.4%, which is the proportion of actual positive cases

correctly identified. A specificity of 87.5% is the proportion of actual negative cases correctly

identified.

Figure 4. Model Fitting for Full Dataset Revised Model

The model chi-square, the maximum-likelihood test statistic comparing the fitted model to the intercept-only model, is 340.375. This is a high value, and since its p-value is less than our chosen significance level of α = 0.05, we can reject the null hypothesis and conclude that there is an association between the independent variables and the patient’s status at hospital discharge.

The “-2 Log Likelihood” is the log-likelihood multiplied by -2 and is commonly used to explore how well a logistic regression model fits the data; the lower the value, the better the model is at predicting the binary outcome variable (Strand et al., n.d.). Our model’s value of 233.788 is roughly 50% lower than that of the intercept-only model, 574.13. Figure 5 lists the pseudo R-square values; the Nagelkerke value, usually the most informative of these indicators, is 0.69.

Figure 5. Pseudo R-Square

Figure 6, the model parameter estimates, lists another important statistical consideration: the Wald statistic, which tests the statistical significance of each predictor. The Wald statistics for TBSA and Age are very high, followed by Inhalation and Flame, which is roughly the order of predictor importance found in many of the other models, such as CHAID and Neural Net.
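The model chi-square can be recovered by hand from the two -2 Log Likelihood values quoted above, since it is simply their difference; this is a sanity check on the reported figures, not part of the SPSS output:

```python
# Hand check: the model chi-square is the drop in -2 Log Likelihood
# between the intercept-only model and the fitted model.
neg2ll_null = 574.13    # intercept-only model, from the text
neg2ll_model = 233.788  # fitted logistic regression, from the text
model_chi_square = neg2ll_null - neg2ll_model
print(round(model_chi_square, 3))  # 340.342, agreeing with the reported 340.375 up to rounding
```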


Figure 6. Parameter Estimate List of Revised Model


Figures 7 and 8 are the cumulative Gains and Lift charts for the model. “For a good model, the gains chart will rise steeply toward 100% and then level off” (IBM SPSS, 2019), which is shown in Figure 7 and validates our model as a good model. “For a good model, lift should start well above 1.0 on the left, remain on a high plateau as you move to the right, and then trail off sharply toward 1.0 on the right side of the chart. For a model that provides no information, the line will hover around 1.0 for the entire graph” (IBM SPSS, 2019). Figure 8 also reflects a good model choice.

Figure 7. Cumulative Gains Chart

Figure 8. Lift Graph for Revised Model for Full Dataset
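The way a cumulative gains chart is built can be sketched concretely: rank the cases by predicted probability and track the cumulative share of actual positives captured. The data below is a toy example, not the burn dataset:

```python
def cumulative_gains(y_true, y_score):
    """Cumulative share of actual positives captured, ranked by predicted score."""
    # Rank cases from highest to lowest predicted probability.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    total_positives = sum(y_true)
    gains, hits = [], 0
    for i in order:
        hits += y_true[i]
        gains.append(hits / total_positives)
    return gains

# Toy example: a good model ranks both positives first, so the
# curve rises steeply and then levels off at 100%.
print(cumulative_gains([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # [0.5, 1.0, 1.0, 1.0]
```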

Figure 9 shows the AUC (area under the curve) and the GINI score, which are used to evaluate how well the model describes the data. The GINI score, in the range of 0.923 – 0.095 for all the partitioned datasets, is good and is another confirmation of a good model.

Figure 9. AUC and GINI Score for Revised Model – Full Dataset
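The GINI score is a simple rescaling of the AUC (GINI = 2 × AUC − 1), so either can be recovered from the other; a quick sketch with an assumed AUC value for illustration:

```python
def gini_from_auc(auc):
    """The GINI coefficient is a linear rescaling of the area under the ROC curve."""
    return 2 * auc - 1

# An AUC of 0.96 (assumed value) corresponds to a GINI of about 0.92.
print(round(gini_from_auc(0.96), 2))  # 0.92
```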



Conclusion

The dataset was partitioned for training, testing, and validation, resulting in accuracies of 93%, 90.5%, and 92%, respectively. The Nagelkerke pseudo R-square is 0.69. The Wald statistics were reviewed along with the variable p-values. The null hypothesis is that the model is a ‘good enough’ fit to the data, and we will only reject this null hypothesis (i.e., decide it is a ‘poor’ fit) if there are sufficiently strong grounds to do so (conventionally if p < .05). For the revised Logistic Regression model:

log-odds(Status) = 0.07944 × TBSA + 0.08321 × Age − 1.228 × [Inhalation: yes = 1, no = 0] − 0.7403 × [Flame: yes = 1, no = 0] − 6.301

with significance values ranging from 0.000 to 0.008, each a p-value < 0.05. The external model validation….
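The fitted equation can be applied to score new records by converting the log-odds to a probability with the inverse logit. The variable names, the yes = 1 coding, and the sample record below are assumptions for illustration only, not output from the SPSS stream:

```python
import math

def predicted_probability(tbsa, age, inhalation, flame):
    """Score a burn record with the revised logistic regression coefficients.

    inhalation and flame are assumed coded yes = 1, no = 0, per the text.
    """
    log_odds = (0.07944 * tbsa + 0.08321 * age
                - 1.228 * inhalation - 0.7403 * flame - 6.301)
    return 1 / (1 + math.exp(-log_odds))  # inverse logit

# Hypothetical record: 40% TBSA, age 60, no inhalation injury, no flame burn.
p = predicted_probability(tbsa=40, age=60, inhalation=0, flame=0)
print(round(p, 3))
```

As expected for a logistic model, the predicted probability rises monotonically with TBSA and Age for a fixed injury profile.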
References

Bulriss, R. (2018, January 11). Topic 6: Model validation [PowerPoint slides]. Grand Canyon University, MIS 690 Capstone. Retrieved February 14, 2019, from https://lms-grad.gcu.edu/learningPlatform/

Ferrell, K. (2018, June 18). Burn data. Grand Canyon University, MIS 655. Retrieved from https://lc-grad3.gcu.edu/learningPlatform/

Giancristofaro, R. A., & Salmaso, L. (2007). Model performance analysis and model validation in logistic regression. Statistica, 63(2), 375-396. Retrieved March 10, 2019, from https://rivista-statistica.unibo.it/article/view/358

IBM SPSS. (2019). Modeling nodes. IBM SPSS Manual. Retrieved March 1, 2019, from https://www.ibm.com/support/knowledgecenter/en/SS3RA7_17.0.0/clementine/modeling_nodes.html

IBM SPSS. (2019). Reading the results of a model evaluation. IBM SPSS Manual. Retrieved March 10, 2019, from https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/graphs_evaluation_appearancetab.htm

Radiation Emergency Medical Management. (n.d.). Burn triage and treatment: Thermal injuries. U.S. Department of Health and Human Services. Retrieved March 9, 2019, from https://www.remm.nlm.gov/burns.htm

Strand, S., Cadwallader, S., & Firth, D. (n.d.). 4.12 SPSS logistic regression output. National Center for Research. Retrieved March 9, 2019, from http://www.restore.ac.uk/srme/www/fac/soc/wie/researchnew/srme/modules/mod4/12/index.html

Strand, S., Cadwallader, S., & Firth, D. (n.d.). Using statistical regression methods in education research. National Center for Research. Retrieved March 9, 2019, from http://www.restore.ac.uk/srme/www/fac/soc/wie/researchnew/srme/glossary/index1695.html?selectedLetter=D#deviance-2ll
Running head: MODEL DEPLOYMENT
Model Deployment and Model Life Cycle
Jacobus De Bruyn
Grand Canyon University: MIS690
December 13, 2017
1
Model Deployment and Model Life Cycle
Looking at the model life cycle, once the model is built, tested, and signed off, it can be deployed to the production environment. Using the right metrics and dashboards, the model needs to be monitored on a continuous basis, and modifications made to the model as needed (Lorica, 2013).
Source: Lorica (2013).
Model Deployment Cost
The following components will be considered in determining the cost of deploying the model in the production environment:

• Hardware and software costs needed to deploy the model in the production landscape.
• License fees for the users who will be using the model.
• Training costs to train all the project stakeholders.
• Labor costs for the IT employees involved in preparing for go-live, the cut-over activities, and post-go-live support.
• Costs related to the change management activities needed during the project, during deployment, and for post-deployment activities.
Deployment Timeline and Proposed Tasks
The estimated deployment time for the project is three weeks. During the first week, the focus will be on setting up the production landscape. During week two, the data will be loaded into the production environment and the model will be set up. The main tasks during week three will be to create the users and have them log on to the system and perform simulations before go-live at the end of the week. During the week following go-live, activities will be focused on user support. The diagram below provides more detail on the tasks that will be performed during this period.
Training Needs
The stakeholders for my analytics project include the following teams:

• IT: responsible for model deployment and future maintenance.
• Super user: responsible for monitoring the model accuracy.
• Sales team: responsible for using the model for pricing.
The model that will be deployed is based on a multiple linear regression model. For training, the teams need to be educated on the multiple linear regression model’s functionality. The sales team will also need a deeper understanding of the inputs and outputs of the model and how to use it to determine the price of a new product. The IT team needs to understand how to support the model, while the Super user will need to understand how to review the model accuracy.
Repetitive Use of Model
Daily, new sales contracts are negotiated with customers. Determining the unit sales price is a time-consuming task and can delay negotiations or result in the loss of a sale. To achieve a faster turnaround time, statistics will be used to predict the product unit pricing for new sales and when marketing new products. The outcome of the model and the final negotiated unit price will directly impact the revenue of the company. The model will therefore be used daily.
Model Quality
The model will be used daily to assist with determining the unit sales price for new sales negotiations. To track the model quality over time, procedures will be established to document the outcome of the model for each sales request, the actual price negotiated with the customer, and the reasons whenever the model-suggested unit price was not used. The consolidated output will be reviewed monthly to determine whether any modifications to the model are needed.

Customer profitability will also be used to determine the effectiveness of the model. It needs to be determined whether the model unit price was used for sales transactions with low margins. If that is the case, the model will need to be recalibrated.
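The monthly review described above could be automated along these lines; the record layout, the field names, and the 5% deviation threshold are illustrative assumptions, not part of the paper’s procedure:

```python
def review_model_quality(records, max_mean_deviation=0.05):
    """Monthly model-quality review.

    Each record is (model_price, negotiated_price, override_reason).
    Returns the mean relative deviation between model and negotiated
    prices, and a flag suggesting whether recalibration is warranted.
    """
    deviations = [abs(model - actual) / actual
                  for model, actual, _ in records]
    mean_dev = sum(deviations) / len(deviations)
    return mean_dev, mean_dev > max_mean_deviation

# Hypothetical month of logged requests.
month = [(100.0, 102.0, None), (55.0, 50.0, "competitor match")]
mean_dev, recalibrate = review_model_quality(month)
```

In this hypothetical month the mean deviation (about 6%) exceeds the threshold, so the review would flag the model for a closer look.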
Model Recalibration and Maintenance
Underlying systems change over time, and these changes should be reflected in the input parameters of the underlying model. Failure to apply such changes will result in the decay of the model over time and in underperformance of the model (EvolvedAnalytics, n.d.).

To ensure the accuracy of my model, the output of the model will be tracked and compared to historical trends and competitor prices. Any decay will indicate that model maintenance is needed, and the model will be recalibrated on the new data.
Organizational Benefits for Using the Model
Using the revenue from the previous year as a base, the goal for the company is to increase revenue by 3% in the following year. With this profitable growth, the company will be on track to become a $2 billion company by 2020. To achieve this growth in revenue and gain a competitive advantage, dynamic pricing models need to be in place to support quick turnaround on requests from customers. With the multiple regression analysis model, the sales teams will be able to react quickly to requests and have a model that guides them toward data-driven decisions.
References
EvolvedAnalytics. (n.d.). Model Building Lifecycle. Retrieved from http://evolvedanalytics.com/?q=support/learning/modeling/modelLifeCycle
Lorica, B. (2013, July 14). Data scientists tackle the analytic lifecycle. Retrieved from
https://www.oreilly.com/ideas/data-scientists-and-the-analytic-lifecycle
Running head: MODEL BUILDING
1
Model Building
Darren Mans
Renee Taillon
Shelbea Rainbolt
Thomas Salmons
Grand Canyon University: MIS 690
March 3, 2019
Model Building Summary
Four predictive models were developed and analyzed to predict burn patient status at time of hospital discharge, with the requirement of at least 70% model accuracy. CHAID, Discriminant, Logistic Regression, and Neural Net were the four classification models investigated in IBM SPSS. The dataset was partitioned into a 70% training dataset, a 20% test dataset, and a 10% validation dataset for most models. Of the 1,000 burn patient records, 198 records were segmented by age for patients under five years of age and 802 records for patients five years of age and older, to compare against the non-segmented dataset. These models were chosen for analysis based on the type of input and target variables, their industry-wide known accuracy, and their potential for deployment in the field.
Model Configuration and Descriptions
Figure 1 illustrates the IBM SPSS model stream configuration used for all three datasets for each of the models described below, with the exception of swapping out the modeling node for each specific model: Logistic Regression, Neural Net, CHAID, and Discriminant.

Figure 1. Logistic Regression – SPSS
The Type node was used to read in the values of the clean Excel worksheets in IBM SPSS Modeler. The role of each variable was indicated as shown in Figure 2: the “Status” variable was designated the target variable, ID and Facility were set to none, and the remaining variables were set as input variables. The type of each variable was checked to determine whether a status of “continuous” or “flag” was appropriate. The data was then read into the model stream.
Figure 2. Type Node and Variable Descriptions
The Partition node was set up to partition the data into a 70/20/10 split for training/testing/validation, as shown in Figure 3, for all models except the Under 5 Years of Age segmented dataset for the Neural Net model, which had a 90/5/5 split due to the small number of records with Status equal to dead. Otherwise, the Neural Net model generated an error and could not complete execution.
MODEL BUILDING
4
Figure 3. Partition Node Showing 70/20/10 Data Partition
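The hold-out partition configured in the Partition node can be mimicked outside of SPSS in a few lines; the pure-Python approach and the fixed seed below are illustrative assumptions, not the Modeler implementation:

```python
import random

def partition(records, train=0.70, test=0.20, seed=42):
    """Shuffle and split records into training/testing/validation subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])  # remainder is the validation hold-out

# 1,000 records split 70/20/10, as in the burn dataset.
train_set, test_set, valid_set = partition(list(range(1000)))
print(len(train_set), len(test_set), len(valid_set))  # 700 200 100
```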
Logistic Regression

Logistic Regression is a statistical method for classifying a categorical target variable using continuous and categorical input variables. It is analogous to linear regression, but for a categorical target (IBM, Modeling Nodes, para. 12).
Neural Net
“A neural network can approximate a wide range of predictive models with minimal demands on model structure and assumption. The form of the relationships is determined during the learning process.” (IBM, Neural Net, para. 1). The results of the neural net should mimic the classic models; e.g., if the relationship is linear, the model would mimic linear regression, and so on. These models are highly accurate, but it is not easy to relay the relationships between variables because the model itself is not easily understood. This model was chosen because of its accuracy and because the person in the field using the deployed model would not need to understand the model, just execute it and comprehend the results.
CHAID
“CHAID, or Chi-squared Automatic Interaction Detection, is a classification technique that uses Chi-Square to build decision trees to identify optimal splits. CHAID is one of the models of choice because it can generate nonbinary trees, where some splits have more than two branches, and is easily understood. Variables for the target and inputs can be continuous or categorical.” (IBM SPSS, CHAID Node, 2019).
Discriminant
“Discriminant analysis builds a predictive model for group membership. The model is
composed of a discriminant function (or, for more than two groups, a set of discriminant
functions) based on linear combinations of the predictor variables that provide the best
discrimination between the groups. The functions are generated from a sample of cases for which
group membership is known; the functions can then be applied to new cases that have
measurements for the predictor variables but have unknown group membership.” (IBM,
Discriminant Node, para. 15).
Predictive Modeling for Logistic Regression
Logistic Regression analysis was used to develop 3 models for each of the three burn
patient datasets. (Ferrell ...