Predictive Analytics: Automobile Industry


When buying a used car, the biggest risk is choosing the wrong one at too high a price. Such purchases are often referred to as “kicks”. Kicked cars often result from tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to the buyer because of throw-away repair work and market losses when reselling the vehicle. I have therefore collected a car dataset with multiple variables that influence the criterion variable (Y).

Our objective is to classify cars from different makers that were bought at different prices over certain periods. After classifying the data into groups, we use the trained dataset as a benchmark to classify whether a purchased car was a good deal or a bad deal.

In this model we deploy both classification and regression: classification to identify which car models are profitable, and regression to predict which variable contributes the highest proportion to the criterion variable (Y).

For training, we have a dataset containing 34 variables (columns) and 72984 observations. For testing, we have a dataset containing 33 variables (columns) and 48708 observations.

This dataset contains values collected in real time, with more than 20 predictor variables and more than half a million observations. The main purpose of building a predictive model on this dataset is to figure out which cars have a higher risk of being a kick, and so provide the best value for buyers.

In our dataset, “IsBadBuy” is the criterion variable (Y) and the remaining variables are predictor variables. The criterion variable is binary, with only two values: 0 (No) and 1 (Yes). Multiple predictor variables influence the criterion variable, such as Car Maker, Making Year, Mileage, Vehicle Type, Vehicle Model, Number of KPG, etc.

We will split the dataset into 60% for training and 40% for testing.


Automobile dealers face one of their greatest difficulties when purchasing used vehicles at an auto auction, where they may end up buying the wrong car at too high a price. The auto community calls these unfortunate purchases “kicks”. Kicked cars frequently result from tampered odometers, performance and mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars have a strong negative impact on auto dealers, who spend large sums on vehicle transportation, throw-away repairs, and reselling the vehicle. In this project we will figure out which cars have a higher risk of being a kick, in order to provide real value to vehicle dealers trying to offer the best possible inventory selection to their customers. Our goal is to classify cars from different makers purchased at different prices over specific periods. After classifying the data into groups, we use the trained dataset as a benchmark to classify whether a purchased car was a good deal or a bad deal.


In this predictive model we have deployed classification, regression, and memory-based reasoning techniques to identify which car models are profitable and to predict which variable contributes the highest proportion to the target variable. For training and validation purposes we gathered 34 variables containing almost 400,000 data points. Considering performance issues in our model-building platform, SAS Enterprise Miner, we imported relatively fewer data points, with no compromise in data quality. The main purpose of building a predictive model on this dataset is to figure out which cars have a higher risk of being a kick, and so provide the best value to buyers. In our dataset, “#IsBadBuy” is the target variable (Y) and the remaining variables are predictor variables (X). The target is a binary variable with only two values: 0 (No) and 1 (Yes). Various predictor variables (X's) influence the target variable, for example #Car Maker, #Making Year, #Mileage, #Vehicle Type, #Vehicle Model, #Number of KPG, #Wheel Type, and so on.

Predictions can be made using models, which are mathematical formulations of the relationships among observed quantities in the dataset. In predicting whether a car was purchased at a good price or a bad price, a model may use variables such as the make year of the vehicle, vehicle age, vehicle type, mileage, type of fuel, manufacturer, and so on. The variables fed to the designed model are referred to as “input variables”.

Building a predictive model follows a few stages, from data extraction through to visual representation. We will follow the process as laid out in the models designed in SAS Enterprise Miner. We have a dataset containing more than 400,000 data points and about 34 variables, including the target variable. Keeping execution time and computing resources in mind, we reduced the dataset to a workable size. Our main aim is to develop a fully working predictive model without unnecessary complexity.

In the initial phase of our model we start with data extraction, importing the data into SAS Enterprise Miner using the “File Import” node. After importing the data into the Enterprise Miner workspace, we changed the role and level of the target variable #IsBadBuy to “Target” and “Binary”. Then, depending on each remaining variable's level of measurement, we set it to Nominal or Interval. Once the extraction is done, we explore the dataset to find skewed data or missing values.

Here we can see the variable worth after running the “Graph Explore” node. Variables such as #WheelType, #WheelTypeID, #RefId, and several others show a significant effect on the target variable. By analyzing the dataset we can confirm that 56.27% of the target values are 0 and 43.72% are 1. Although this is not an ideal situation for deploying predictive models, the model can still be finely tuned with the given dataset.
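The class frequencies reported above come from the SAS Enterprise Miner exploration step. As a rough illustration of the same check outside that tool, the following Python sketch counts the share of each target value. The toy label list and the function name are hypothetical and are not taken from the real dataset.

```python
from collections import Counter

def class_frequencies(labels):
    """Return {class_value: percentage of observations} for a target column."""
    counts = Counter(labels)
    total = len(labels)
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical toy target column (not the real data): 0 = good buy, 1 = kick.
is_bad_buy = [0, 0, 1, 0, 1, 0, 1, 0]
freqs = class_frequencies(is_bad_buy)
print(freqs)
```

A split close to 50/50, as reported for this dataset, means a model that always predicts the majority class would already be right a little over half of the time, which is a useful baseline when reading misclassification rates.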

We have to remove unnecessary variables that show minimal effect on the target variable. After observing variable worth, we need to eliminate the variables with the least worth. To do this, we connected a “Drop” node to the File Import node and dropped three variables: #RefId, which is only a serial number assigned to each vehicle, #BRYNO, and #VNZIP1. After eliminating the unwanted variables, we need to filter the data for missing values before further processing.
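The drop-and-filter step can be pictured with a small Python sketch. This is not how the Enterprise Miner nodes are implemented; it only mirrors the idea on hypothetical toy rows. The column names follow the text, while the row values and the set of missing-value markers (None, empty string, "NULL") are assumptions made for illustration.

```python
# Variables dropped in the text; spelled as they appear there.
DROPPED = {"RefId", "BRYNO", "VNZIP1"}

# Assumed markers for a missing value (illustrative only).
MISSING = (None, "", "NULL")

def drop_variables(rows, dropped=DROPPED):
    """Remove the unwanted columns from every row."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]

def filter_missing(rows):
    """Keep only rows in which no value is a missing-value marker."""
    return [row for row in rows if all(v not in MISSING for v in row.values())]

# Hypothetical toy rows, not real observations.
rows = [
    {"RefId": 1, "BRYNO": 100, "VNZIP1": 33619, "WheelType": "Alloy", "IsBadBuy": 0},
    {"RefId": 2, "BRYNO": 101, "VNZIP1": 33619, "WheelType": None, "IsBadBuy": 1},
    {"RefId": 3, "BRYNO": 102, "VNZIP1": 32824, "WheelType": "Covers", "IsBadBuy": 0},
]
cleaned = filter_missing(drop_variables(rows))
print(cleaned)
```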

After data pre-processing we have to split the data for training and validation. Ideally the data is split in a 6:2:2 ratio, i.e. 60% for training, 20% for validation, and 20% for testing. In this model, however, we use 70% for training and 30% for validation. We drag a “Data Partition” node into the model workspace and connect it to the Filter node. Once the data is partitioned according to the training and validation splits, we continue the process by connecting the Data Partition node to various classification and regression models for analysis.
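A 70/30 partition can be sketched in a few lines of Python. The sketch below uses a simple seeded random shuffle, which is an assumption: it does not reproduce the partitioning method or seed used by the Data Partition node, and the function name and toy observations are hypothetical.

```python
import random

def partition(rows, train_fraction=0.7, seed=12345):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical observation IDs stand in for the real rows.
observations = list(range(10))
train, validation = partition(observations)
print(len(train), len(validation))
```

Fixing the seed matters when several models are compared, since every model should see exactly the same training and validation rows.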

We connected the Data Partition node to a Decision Tree node with a maximum depth of 6 and the subtree assessment measure set to “Decision”. After running the node we can inspect the fit statistics of the Decision Tree node: the average squared error (ASE) was reported as 0.226333 for the training data and 0.227025 for the validation data, and the misclassification rate was 0.364469 for the training data and 0.370291 for the validation data. The Decision Tree used only the top variables contributing to the prediction; #WheelType and #Auction are the top contributing variables in the tree. We repeat the same process for a Probability Tree, this time setting the subtree assessment measure to “Average Square Error”. After examining the results we can note better performance, with a lower ASE of 0.223159 for the training data and 0.220259 for the validation data. The Probability Tree did an excellent job of mapping the independent variables to the target variable.
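The two fit statistics quoted above can be illustrated with a short Python sketch. It assumes the usual definitions for a binary target: average squared error as the mean of (actual minus predicted probability) squared, and misclassification rate as the share of observations whose predicted class differs from the actual class. The 0.5 cutoff and the toy values are assumptions for illustration.

```python
def average_squared_error(y_true, p_pred):
    """Mean squared difference between the 0/1 target and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def misclassification_rate(y_true, p_pred, cutoff=0.5):
    """Share of observations whose predicted class (p >= cutoff) is wrong."""
    wrong = sum(1 for y, p in zip(y_true, p_pred) if (1 if p >= cutoff else 0) != y)
    return wrong / len(y_true)

# Hypothetical targets and predicted probabilities of IsBadBuy = 1.
y = [0, 1, 1, 0]
p = [0.2, 0.7, 0.4, 0.1]
ase = average_squared_error(y, p)
mis = misclassification_rate(y, p)
print(ase, mis)
```

ASE rewards well-calibrated probabilities, while the misclassification rate only looks at which side of the cutoff a prediction falls on, so the two measures can rank models differently.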

The Decision Tree picked the top contributing variables for the model. It suggests that #WheelType was the most important variable in predicting whether a car is a “kick”. The next most important variable is #Auction (the auto dealer). The assessment plot shows that performance on the training and validation data declined as the number of leaves increased. Our model works best with the smallest number of leaves.

We also make use of neural networks in our predictive model. Before running the Neural Network node we use a “Variable Selection” node to pass on only the variables that are important for the neural network, since unnecessary variables may have a negative effect on the results.

After running the Neural Network node we can observe a slight decrease in average squared error and misclassification rate compared with the Decision Tree and Probability Tree: the ASE is 0.215817 for the training data and 0.22089 for the validation data.

The Training Iterations plot suggests that the predictive model was unbiased after the sixth iteration. The model performed well with 6 iterations and 6 hidden units.

We add an AutoNeural node to let the model evaluate the data and tune itself without human intervention. AutoNeural adjusts the number of hidden units and iterations for the best model performance and output. The results suggest that the model took 5 iterations, and that the average squared error and misclassification rate were relatively low compared with the Neural Network node.

For nominal or binary target variables we use the Memory-Based Reasoning (MBR) node, which applies k-nearest neighbors to each observation. It classifies the data points based on the nearest observations and scores every data point based on its nearest neighbors. For this model we prefer RD-Tree as the method rather than Scan, because RD-Tree uses a tree structure to organize observations in dimension space, unlike Scan, which picks the nearest neighbors by a linear search over the squared distance between the observation and every candidate neighbor.
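To make the nearest-neighbor idea concrete, here is a minimal Python sketch of a brute-force search in the spirit of the Scan method: every training point is compared with the query by squared distance. It does not implement RD-Tree, and the function names, the choice of k = 3, and the toy points are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two numeric points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_posterior(query, train_points, train_labels, k=3):
    """Estimate P(label = 1) as the share of 1s among the k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: squared_distance(query, pair[0]),
    )
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / k

# Hypothetical two-dimensional inputs; labels 0 = good buy, 1 = kick.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]
p_low = knn_posterior((0.5, 0.5), points, labels)
p_high = knn_posterior((5.5, 5.5), points, labels)
print(p_low, p_high)
```

The brute-force search touches every training row for every scored row, which is why a tree-based index becomes attractive on a dataset of this size.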

To derive a prediction function (formula) we make use of Regression nodes. In our model we deploy logistic regression and stepwise regression models to formulate a prediction function for our dataset, using the target variable #IsBadBuy as “Y”. Before using regression models we need to make sure there are no missing values in the dataset, which could distort the results and lead to an incorrect regression function. We therefore added an Impute node after the Data Partition node to make sure the missing values are handled. Then we transform the variables to best fit the model using the Transform Variables node. For our regression model we have an average squared error of 0.217742 for the training data and 0.221452 for the validation data. Next we add another Regression node with the selection model set to Stepwise and the selection criterion set to Profit/Loss.
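The shape of the prediction function produced by a logistic regression can be sketched as follows, together with a simple mean imputation of missing values. The coefficient, intercept, and vehicle ages below are hypothetical and are not fitted to the real data; mean imputation is assumed here purely for illustration and may differ from the settings used in the Impute node.

```python
import math

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def logistic_probability(features, coefficients, intercept):
    """P(Y = 1) = 1 / (1 + exp(-(intercept + sum of coefficient * feature)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical vehicle ages with one missing value.
vehicle_age = impute_mean([2, None, 6])

# Hypothetical (not fitted) model: a single coefficient for vehicle age.
probabilities = [logistic_probability([age], [0.3], -1.2) for age in vehicle_age]
print(vehicle_age, probabilities)
```

With a positive coefficient, the predicted probability of a bad buy rises with vehicle age; in a fitted model, the sign and size of each coefficient indicate how an input shifts the odds of a kick.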

The iteration plot suggests that the model reached its lowest average squared error at the second step for both the training and validation data. Comparing the Regression and stepwise Regression nodes, the lowest ASE and misclassification rate were reported for the stepwise regression. We also used an Ensemble node, which combines two or more models to enable robust prediction and classification. We connected the Ensemble node to the two regression models to build a better model with higher prediction accuracy and reduced bias and variance. Compared with both regression models, the ensemble performed better at predicting and classifying outcomes.
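One common way to combine models, assumed here for illustration, is to average their predicted probabilities for each observation. The Python sketch below shows that idea for two models; the probability lists are hypothetical and the sketch does not claim to reproduce the exact combination settings used in the Ensemble node.

```python
def ensemble_average(*model_probabilities):
    """Average the predicted probabilities of several models, observation by observation."""
    return [sum(ps) / len(ps) for ps in zip(*model_probabilities)]

# Hypothetical predicted probabilities of IsBadBuy = 1 from two regression models.
regression_p = [0.2, 0.6, 0.9]
stepwise_p = [0.4, 0.4, 0.7]
combined = ensemble_average(regression_p, stepwise_p)
print(combined)
```

Averaging tends to smooth out the individual errors of the component models, which is the intuition behind the reduction in variance mentioned in the text.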


Model Comparison:

Of all the models, “AutoNeural” performed best in classifying the dataset and predicting the target variable “#IsBadBuy”. The Probability Tree did the next best job after AutoNeural in predicting whether the bought car is a bad purchase or not. We can now deploy the fully built model to predict and classify kicks for auto dealers. From the predictive model we can conclude that #WheelType, the auction dealer/company, and the vehicle state play important roles in classifying whether or not the car was a kick at the quoted price.
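As a small illustration of how such a comparison can be made, the Python sketch below ranks models by validation average squared error, lowest first. It uses only the validation ASE values stated earlier in this report; the Auto Neural, stepwise regression, ensemble, and Memory Based Reasoning models are left out because no numeric values are given for them in the text. Ranking by validation ASE is an assumed criterion for this sketch.

```python
# Validation ASE values as reported in the text above.
validation_ase = {
    "Decision Tree": 0.227025,
    "Probability Tree": 0.220259,
    "Neural Network": 0.22089,
    "Regression": 0.221452,
}

# Rank the models from lowest to highest validation ASE.
ranking = sorted(validation_ase, key=validation_ase.get)
for name in ranking:
    print(f"{name}: {validation_ase[name]:.6f}")
```

Among the models with reported figures, the Probability Tree has the lowest validation ASE, which agrees with its second-place standing behind Auto Neural in the comparison above.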

Predictive Model Diagram