  

Write a 500-750 word paper describing the model deployment and model life cycle aspects of your model. It will include the following:
• What are model deployment costs? Be specific.
• What is a proposed task and timeline for deploying your model?
• What specific training will be required for those who will be using the model on a regular basis?
• Can this model be used on a repetitive basis? Explain.
• How will model quality be tracked over time?
• How will the model be re-calibrated and maintained over time?
• What specific benefits to the organization will be realized over time as a result of using the model?

***Please keep in mind the model we chose was the Logistic Regression model, so you would be talking about that model for this assignment. I also attached the previous papers, which should help you understand more about the modeling and the problem we are looking to solve. Please let me know if you have any other questions.***

I also attached an EXAMPLE of what this assignment could possibly look like or how it might be laid out. This example came directly from the instructor.
team_fire___model_validation_rough_draft_needs_external_validation.docx

week_7___example___model_deployment___life_cycle___j_debruyn__1_.docx


week_6___model_building.docx

Unformatted Attachment Preview

Model Validation
Darren Mans
Renee Taillon
Shelbea Rainbolt
Thomas Salmons
Grand Canyon University: MIS 690
March 10, 2019
Model Validation
Four predictive models, Logistic Regression, Neural Net, CHAID, and Discriminant, were built in IBM SPSS to predict burn patient status at the time of hospital discharge, subject to two requirements: (1) a minimum model accuracy of 70%, and (2) no model overfitting. The datasets were segmented to investigate the concentration of burn patients under 5 years old. During a post-analysis stakeholder meeting, the decision was made that the project moving forward would consider only the entire, unsegmented dataset. The models that focused on the young age group were suspect and highly sensitive because the training dataset contained very little historical data for the dependent variable; e.g., only 5 records with the status of dead. In addition, the input variable TBSA accounts by definition for a discrepancy between young patients and older patients: TBSA is based on the Wallace Rule of Nines, which applies different criteria to adults and to young children (Radiation Emergency Medical Management, n.d.). The final models under consideration included only variables with significance values of 0.05 or less and had accuracies ranging from 91% to 94%, all highly accurate and well above the 70% accuracy criterion in the business problem statement. Because all the models were close in performance and seemingly robust, choosing just one was a challenge. The Logistic Regression model was ultimately chosen because it was the most conservative with respect to false positives, a critical consideration in a medical model that predicts a status of dead or alive. Below we highlight the results of the internal validation by investigating accuracy, sensitivity, and specificity, and we then discuss external validation.
Internal Model Validation
“Scholars have defined a series of methods through which to validate the results obtained from a logistic regression model” (Giancristofaro & Salmaso, 2007). In data science it is not acceptable to evaluate the performance of a model with the same data that was used to train it, because doing so “can easily generate over-optimistic and overfitted models” (Bulriss, 2018). Of the two common approaches, hold-out and cross-validation, we chose the hold-out method for internal model validation during the model build in IBM SPSS, because it is applicable to logistic regression models and it is a validation feature built into the SPSS tool (IBM SPSS, 2019). The 1,000-record dataset was partitioned into 70% for training, 20% for testing, and 10% for internal model validation.
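Outside of SPSS, the same 70/20/10 hold-out partition can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS Partition node itself; the file name burn_data.csv and the Status column are assumed stand-ins for the burn dataset.

```python
# Minimal sketch of a 70/20/10 hold-out partition, mirroring the SPSS
# Partition node. Assumes a pandas DataFrame with a "Status" target column.
import pandas as pd
from sklearn.model_selection import train_test_split

burns = pd.read_csv("burn_data.csv")  # hypothetical file name

# Split off 70% for training; the remaining 30% is split again into
# 20% testing and 10% validation (i.e., 2/3 and 1/3 of the remainder).
train, rest = train_test_split(burns, train_size=0.70, random_state=42,
                               stratify=burns["Status"])
test, validate = train_test_split(rest, train_size=2 / 3, random_state=42,
                                  stratify=rest["Status"])

print(len(train), len(test), len(validate))  # roughly 700 / 200 / 100
```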
The SPSS model stream includes the Partition node following the Type node, with the parameters shown in Figure 1.

Figure 1. [Partition node parameters in the SPSS model stream]

Figure 2 shows the Case Processing Summary, which indicates the number of cases included in our analysis; the eighth row tells us that zero records are missing data in the variables included in our analysis.

Figure 2. Case Processing Summary for Revised Model – Full Dataset

Figure 3 shows the accuracy of each of the partitioned datasets. The training, testing, and validation accuracies are 93%, 90.5%, and 92%, respectively. Note that the testing accuracy is less than the training accuracy, which is another good indicator that the model is not overfit. The validation accuracy of 92% meets the business problem criterion of a minimum of 70%, and using a separate validation dataset satisfies the second criterion of avoiding an overfitted model, which is one of the main reasons for using a hold-out dataset.

Figure 3. Comparison between Predicted and Actual Dependent Variable (Status) and Accuracy

Figure 4 illustrates the Confusion Table, or Coincidence Matrix, for this revised model: Logistic Regression excluding the Race and Gender predictors.
Figure 4. Confusion Matrix – Full Dataset/Revised Model – Logistic Regression
The True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts are used to determine model sensitivity, specificity, and accuracy. For the validation dataset, the sensitivity and specificity are:

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{85}{85 + 7} = 0.924$$

$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{7}{7 + 1} = 0.875$$
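These two ratios are simple to verify programmatically; a minimal sketch using the validation counts read from Figure 4:

```python
# Sensitivity and specificity from the validation confusion-matrix counts
# (values read from Figure 4: TP = 85, FN = 7, TN = 7, FP = 1).
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of actual positive cases correctly identified."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of actual negative cases correctly identified."""
    return tn / (tn + fp)

print(round(sensitivity(85, 7), 3))  # 0.924
print(round(specificity(7, 1), 3))   # 0.875
```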
The validation dataset has a sensitivity of 92.4%, which is the proportion of actual positive cases correctly identified, and a specificity of 87.5%, which is the proportion of actual negative cases correctly identified.

The chi-squared statistic, for which the maximum likelihood estimates of the parameters are compared to the predicted values, is 340.375. This is a high value, and since the p-value is less than our chosen significance level of α = 0.05, we can reject the null hypothesis and conclude that there is an association between the independent variables and the patient’s status at hospital discharge.

Figure 4. Model Fitting for Full Dataset Revised Model

The “-2 Log Likelihood” is the log-likelihood multiplied by -2 and is commonly used to explore how well a logistic regression model fits the data; the lower the value, the better the model is at predicting the binary outcome variable (Strand et al., n.d.). Our model’s value of 233.788 reflects this, at less than half the value of the intercept-only model (574.13).

Figure 5 lists the pseudo R-square values; the Nagelkerke value, usually the most important of the pseudo R-square indicators, is 0.69.

Figure 5. Pseudo R-Square

Figure 6, the list of model parameter estimates, shows another important statistical consideration, the Wald statistic, which tests the statistical significance of each predictor. The Wald statistics for TBSA and Age are very high, followed by Inhalation and Flame, which is roughly the order of predictor importance found in many of the other models, such as CHAID and Neural Net.
Figure 6. Parameter Estimate List of Revised Model
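For readers without SPSS, the same fit statistics (the -2 Log Likelihood, the Nagelkerke pseudo R-square, and per-predictor Wald statistics) can be reproduced with statsmodels. This is a minimal sketch under assumed column names; statsmodels reports McFadden’s pseudo R-square directly, so the Nagelkerke value is derived here from the log-likelihoods.

```python
# Minimal sketch reproducing the SPSS fit statistics with statsmodels.
# Column names ("TBSA", "Age", "Inhalation", "Flame", "Status") are assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm

burns = pd.read_csv("burn_data.csv")           # hypothetical file name
X = sm.add_constant(burns[["TBSA", "Age", "Inhalation", "Flame"]])
y = burns["Status"]                            # 1 = dead, 0 = alive (assumed coding)

res = sm.Logit(y, X).fit()

n = len(y)
neg2ll = -2 * res.llf                          # "-2 Log Likelihood" (233.788 in SPSS)
cox_snell = 1 - np.exp(2 * (res.llnull - res.llf) / n)
nagelkerke = cox_snell / (1 - np.exp(2 * res.llnull / n))  # 0.69 in Figure 5
wald = (res.params / res.bse) ** 2             # Wald chi-square per predictor

print(neg2ll, nagelkerke)
print(wald.sort_values(ascending=False))       # TBSA and Age should rank highest
```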
Figures 7 and 8 are the cumulative Gains and Lift graphs for the model. “For a good model, the gains chart will rise steeply toward 100% and then level off” (IBM SPSS, 2019), which is the behavior shown in Figure 7 and validates our model as a good one. “For a good model, lift should start well above 1.0 on the left, remain on a high plateau as you move to the right, and then trail off sharply toward 1.0 on the right side of the chart. For a model that provides no information, the line will hover around 1.0 for the entire graph” (IBM SPSS, 2019). Figure 8 also reflects a good model choice.

Figure 7. Cumulative Gains Chart

Figure 8. Lift Graph for Revised Model for Full Dataset
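Although SPSS draws these charts directly, the quantities behind them are easy to compute. Below is a minimal sketch of cumulative gains and lift from scored records, using toy data in place of the model’s actual scores.

```python
# Minimal sketch of cumulative gains and lift computed from scored records.
# `probs` are predicted probabilities of the positive class ("dead") and
# `actual` the observed 0/1 outcomes; toy data stands in for real scores.
import numpy as np

def gains_and_lift(probs, actual, n_bins=10):
    order = np.argsort(-probs)                  # highest-risk records first
    hits = np.asarray(actual)[order]
    bins = np.array_split(hits, n_bins)         # deciles of the ranked records
    cum_pos = np.cumsum([b.sum() for b in bins])
    cum_frac = np.arange(1, n_bins + 1) / n_bins
    gains = cum_pos / hits.sum()                # cumulative share of positives found
    lift = gains / cum_frac                     # gains relative to random targeting
    return gains, lift

rng = np.random.default_rng(0)
probs = rng.random(1000)
actual = (rng.random(1000) < probs).astype(int)  # outcomes correlated with scores
gains, lift = gains_and_lift(probs, actual)
print(gains.round(2))                            # should rise steeply, then level off
print(lift.round(2))                             # should start well above 1.0
```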
Figure 9 shows the AUC (area under the curve) and the GINI score, which are used to score how well the model describes the data. The GINI score across all the partitioned datasets, in the range of 0.923 – 0.095, is good and is another confirmation of a good model.

Figure 9. AUC and GINI Score for Revised Model – Full Dataset
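The two scores in Figure 9 are directly related: the Gini coefficient equals 2 * AUC - 1. A minimal sketch with scikit-learn, again using toy stand-in scores:

```python
# Minimal sketch: AUC from scored validation records, and the Gini score
# derived from it via Gini = 2 * AUC - 1. Toy data stands in for real scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
probs = rng.random(200)
actual = (rng.random(200) < probs).astype(int)

auc = roc_auc_score(actual, probs)
gini = 2 * auc - 1
print(round(auc, 3), round(gini, 3))
```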
Conclusion
The dataset was partitioned for training, testing, and validation, resulting in accuracies of 93%, 90.5%, and 92%, respectively. The Nagelkerke pseudo R-square is 0.69. The Wald statistics were reviewed along with the variable p-values. The null hypothesis is that the model is a ‘good enough’ fit to the data, and we reject this null hypothesis (i.e., decide it is a ‘poor’ fit) only if there are sufficiently strong grounds to do so, conventionally if p < .05. For the Logistic Regression Revised Model:

$$\operatorname{logit}[P(\text{Status}=\text{dead})] = 0.07944\cdot\text{TBSA} + 0.08321\cdot\text{Age} - 1.228\cdot\text{Inhalation} - 0.7403\cdot\text{Flame} - 6.301$$

where Inhalation and Flame are coded yes = 1 and no = 0. The significance values of the predictors range from 0.000 to 0.008, all p-values < 0.05.
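As a worked illustration of the fitted equation, the sketch below scores one hypothetical patient. The mapping of coefficients to predictors follows the equation above and should be verified against the parameter estimates in Figure 6 before being relied upon.

```python
# Worked example of the revised Logistic Regression equation. The
# coefficient-to-predictor mapping should be checked against Figure 6.
import math

def p_dead(tbsa: float, age: float, inhalation: int, flame: int) -> float:
    """Predicted probability of Status = dead at hospital discharge."""
    z = (0.07944 * tbsa + 0.08321 * age
         - 1.228 * inhalation - 0.7403 * flame - 6.301)
    return 1 / (1 + math.exp(-z))

# Hypothetical patient: 40% TBSA, age 55, inhalation injury, flame burn.
print(round(p_dead(40.0, 55.0, 1, 1), 3))  # roughly 0.37
```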
The external model validation….

References

Bulriss, R. (2018, January 11). Topic 6: Model validation [PowerPoint slides]. Grand Canyon University, MIS 690 Capstone. Retrieved February 14, 2019, from https://lms-grad.gcu.edu/learningPlatform/

Ferrell, K. (2018, June 18). Burn data. Grand Canyon University, MIS 655. Retrieved from https://lc-grad3.gcu.edu/learningPlatform/

Giancristofaro, R. A., & Salmaso, L. (2007, October). Model performance analysis and model validation in logistic regression. Statistica, 63(2), 375-396. Retrieved March 10, 2019, from https://rivista-statistica.unibo.it/article/view/358

IBM SPSS. (2019). Modeling nodes. IBM SPSS Manual [e-Handbook]. Retrieved March 1, 2019, from https://www.ibm.com/support/knowledgecenter/en/SS3RA7_17.0.0/clementine/modeling_nodes.html

IBM SPSS. (2019). Reading the results of a model evaluation. IBM SPSS Manual [e-Handbook]. Retrieved March 10, 2019, from https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/graphs_evaluation_appearancetab.htm

Radiation Emergency Medical Management. (n.d.). Burn triage and treatment: Thermal injuries. U.S. Department of Health and Human Services [Website]. Retrieved March 9, 2019, from https://www.remm.nlm.gov/burns.htm

Strand, S., Cadwallader, S., & Firth, D. (n.d.). 4.12 SPSS logistic regression output. National Center for Research [e-Handbook]. Retrieved March 9, 2019, from http://www.restore.ac.uk/srme/www/fac/soc/wie/researchnew/srme/modules/mod4/12/index.html

Strand, S., Cadwallader, S., & Firth, D. (n.d.). Using statistical regression methods in education research. National Center for Research [e-Handbook]. Retrieved March 9, 2019, from http://www.restore.ac.uk/srme/www/fac/soc/wie/researchnew/srme/glossary/index1695.html?selectedLetter=D#deviance-2ll

Model Deployment and Model Life Cycle
Jacobus De Bruyn
Grand Canyon University: MIS690
December 13, 2017

Model Deployment and Model Life Cycle

Looking at the model life cycle: once the model is built, tested, and signed off, it can be deployed to the production environment. Using the right metrics and dashboards, the model then needs to be monitored on a continuous basis, with modifications made to the model as needed (Lorica, 2013).

[Model life cycle diagram. Source: Lorica (2013).]

Model Deployment Cost

The following components will be considered in determining the cost of deploying the model in the production environment:
• Hardware and software costs needed to deploy the model in the production landscape.
• License fees for the users who will be using the model.
• Training costs to train all the project stakeholders.
• Labor costs of the IT employees involved in the preparation for go-live, the cutover activities, and support of the post-go-live activities.
• Costs related to the change management activities needed during the project, during deployment, and for post-deployment activities.

Deployment Timeline and Proposed Tasks

The estimated deployment time for the project is three weeks. During the first week, the focus will be on setting up the production landscape.
During week two, the data will be loaded into the production environment and the model will be set up. The main tasks during week three will be to create the users, have them log on to the system, and perform simulations before go-live at the end of week three. During the week following go-live, activities will be focused on user support. The diagram below provides more detail on the tasks that will be performed during this period.

Training Needs

The stakeholders for my analytics project include the following teams:
• IT: responsible for model deployment and future maintenance.
• Super user: responsible for monitoring the model accuracy.
• Sales team: using the model for pricing.

The model that will be deployed is based on a multiple linear regression model. For training, the teams need to be educated on the multiple linear regression model functionality. The sales team will also need a deeper understanding of the inputs and outputs of the model and of how to use the model to determine the price of a new product. The IT team needs to understand how to support the model, while the super user will also need to understand how to review the model accuracy.

Repetitive Use of Model

Daily, new sales contracts are negotiated with customers. Determining the unit sales price is a time-consuming task and can delay negotiations or result in the loss of the sale. To achieve a faster turnaround time, statistics will be used to predict the product unit pricing that should be used for new sales and for the marketing of new products. The outcome of the model and the final negotiated unit price will directly impact the revenue of the company. The model will therefore be used daily.

Model Quality

The model will be used daily to assist with determining the unit sales price for new sales negotiations. To track model quality over time, procedures will be established to document the outcome of the model for each sales request, the actual price that was negotiated with the customer, and the reasons when the model-suggested unit price was not used. The consolidated output will be reviewed monthly to determine whether any modifications to the model are needed. Customer profitability will also be used to determine the effectiveness of the model: it needs to be determined whether the model unit price was used for sales transactions with low margins. If that is the case, the model will need to be recalibrated.
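To make the monthly review concrete, here is a minimal sketch of how the logged outcomes could be consolidated. The log schema (date, model_price, negotiated_price, override_reason, margin) and the 10% margin threshold are assumed illustrations, not part of the deployment plan itself.

```python
# Minimal sketch of the monthly model-quality review described above.
# The log schema and the low-margin threshold are assumed illustrations.
import pandas as pd

log = pd.read_csv("pricing_log.csv", parse_dates=["date"])  # hypothetical log

log["abs_pct_error"] = (log["negotiated_price"]
                        - log["model_price"]).abs() / log["negotiated_price"]
log["overridden"] = log["override_reason"].notna()
log["low_margin"] = log["margin"] < 0.10        # assumed low-margin threshold

monthly = log.groupby(pd.Grouper(key="date", freq="MS")).agg(
    mean_abs_pct_error=("abs_pct_error", "mean"),
    override_rate=("overridden", "mean"),
    low_margin_share=("low_margin", "mean"),
)
print(monthly)  # review table for the monthly quality meeting
```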
Model Recalibration and Maintenance

Underlying systems change over time, and these changes should be reflected in the input parameters of the underlying model. Failure to apply such changes will result in the decay of the model over time and in underperformance of the model (EvolvedAnalytics, n.d.). To ensure the accuracy of my model, the output of the model will be tracked and compared to historical trends and competitor prices. Any decay will indicate that model maintenance is needed, and the model will be recalibrated on the new data.

Organizational Benefits for Using the Model

Using the revenue from the previous year as a base, the goal for the company is to increase revenue by 3% in the following year. With this profitable growth, the company will be on track to become a $2 billion company by 2020. To achieve this growth in revenue and gain a competitive advantage, dynamic pricing models need to be in place to support quick turnaround on requests from customers. With the multiple regression model, the sales teams will be able to react quickly to requests and will have a model that guides them toward making data-driven decisions.

References

EvolvedAnalytics. (n.d.). Model building lifecycle. Retrieved from http://evolvedanalytics.com/?q=support/learning/modeling/modelLifeCycle

Lorica, B. (2013, July 14). Data scientists tackle the analytic lifecycle. Retrieved from https://www.oreilly.com/ideas/data-scientists-and-the-analytic-lifecycle

Model Building
Darren Mans
Renee Taillon
Shelbea Rainbolt
Thomas Salmons
Grand Canyon University: MIS 690
March 3, 2019

Model Building Summary

Four predictive models were developed and analyzed to predict the burn patient status at the time of hospital discharge, with a requirement of at least 70% model accuracy. CHAID, Discriminant, Logistic Regression, and Neural Net were the four classification models investigated in IBM SPSS. The dataset was partitioned into a 70% training dataset, a 20% test dataset, and a 10% validation dataset for most models. Of the 1,000 burn patient records, 198 records were segmented by age for patients under five years of age, and 802 records were segmented for patients five years of age and older, to compare against the non-segmented dataset. These models were chosen for analysis based on the type of input and target variables, their industry-recognized accuracy, and their potential for deployment in the field.

Model Configuration and Descriptions

Figure 1 illustrates the IBM SPSS model stream configuration used for all three datasets for each of the models described below, with the only change being the swapping of the modeling node for each specific model: Logistic Regression, Neural Net, CHAID, and Discriminant.

Figure 1. Logistic Regression - SPSS

The Type node was used to read in the values of the clean Excel worksheets in IBM SPSS Modeler. The roles of the variables were set as shown in Figure 2: the Status variable was set as the target, ID and Facility were set to none, and the remaining variables were set as inputs. The type of each variable was checked to determine whether a status of continuous or flag was appropriate. The data was then read into the model stream.

Figure 2. Type Node and Variable Descriptions

The Partition node was set up to partition the data into a 70/20/10 training/testing/validation split, shown in Figure 3, for all models except the Neural Net model on the under-five-years-of-age segmented dataset, which had a 90/5/5 split due to the small number of actual records with Status equal to dead. Otherwise, the Neural Net model generated an error and could not complete execution.

Figure 3. Partition Node Showing the 70/20/10 Data Partition

Logistic Regression

Logistic Regression is a statistical method for classifying a categorical target variable by using continuous and categorical input variables. It is analogous to linear regression, but for a categorical target (IBM, Modeling Nodes, para. 12).
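As an aside, here is a minimal scikit-learn sketch of this kind of classifier; the file name and column names are assumed stand-ins for the burn dataset fields.

```python
# Minimal sketch of a logistic regression classifier of the kind described
# above. Column names are assumed stand-ins for the burn dataset fields.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

burns = pd.read_csv("burn_data.csv")             # hypothetical file name
X = burns[["TBSA", "Age", "Inhalation", "Flame"]]
y = burns["Status"]                               # categorical target: dead/alive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))                  # hold-out classification accuracy
```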
Neural Net

“A neural network can approximate a wide range of predictive models with minimal demands on model structure and assumption. The form of the relationships is determined during the learning process” (IBM, Neural Net, para. 1). The results of a neural net should mimic the classic models; e.g., if the relationship is linear, the model will mimic linear regression, and so on. These are highly accurate models, but it is not easy to relay the relationships between variables because the model is not easily understood. This model was chosen for its accuracy and because the person in the field using the model, once deployed, would not need to understand the model, only execute it and comprehend the results.

CHAID

“CHAID, or Chi-squared Automatic Interaction Detection, is a classification technique that uses Chi-Square to build decision trees to identify optimal splits. CHAID is one of the models of choice because it can generate nonbinary trees, where some splits have more than two branches, and is easily understood. Variables for the target and inputs can be continuous or categorical” (IBM SPSS, CHAID Node, 2019).

Discriminant

“Discriminant analysis builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership” (IBM, Discriminant Node, para. 15).

Predictive Modeling for Logistic Regression

Logistic Regression analysis was used to develop three models, one for each of the three burn patient datasets. (Ferrell …
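To make the CHAID description above concrete, the sketch below runs the chi-square independence test that CHAID applies when evaluating a candidate split; the contingency counts are purely illustrative, not drawn from the burn dataset.

```python
# Minimal sketch of the chi-square test CHAID applies to candidate splits.
# The contingency table is a toy example: rows are branches of a candidate
# split, columns are target classes (alive, dead).
from scipy.stats import chi2_contingency

split_counts = [
    [120, 10],   # branch 1: e.g., records below some candidate cut point
    [60, 40],    # branch 2: records above the cut point
]
chi2, p_value, dof, _ = chi2_contingency(split_counts)
print(chi2, p_value)  # a small p-value favors keeping this split
```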
