Categories
R

“Exploring the Relationship Between Income and Education Level: A Data Analysis using R Programming”

This project requires you to strictly follow the requirements (I have uploaded the requirements as an attachment). The project requires both the R programming work and a report written in APA format based on the conclusions drawn from the R analysis.


Exploring Risk Factors for Cardiovascular Disease in the Framingham Heart Study Data: “Exploring the Relationship Between Categorical Variables and Cardiovascular Health Outcomes”

For this assignment, you will be using the Framingham Heart Study data. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of subjects in the community of Framingham, Massachusetts. It was a landmark study in epidemiology: the first prospective study of cardiovascular disease, and the study that introduced the concept of risk factors and their joint effects. We will be using this original data.

As you look over the Framingham Heart Study data and data dictionary to familiarize yourself with the data, you will notice that the study had a longitudinal design, meaning there were multiple observations on the same individuals at different points in time. You will see variables with the same name but with 1, 2, or 3 at the end of the name; these numbers indicate the data collection time points. For this assignment, we will only use the primary variables and the variables at time point 1. Because of this, we can create an analysis file by retaining only the variables we want and removing the variables we do not need, which will make the data file easier to work with.

To reduce the dataset to a more manageable size, open the Framingham Heart Study data in Excel and remove all variables whose names end in a ‘2’ or ‘3’ (sex2, sex3, age2, age3, etc.); you can simply highlight those columns and delete them. Next, remove all variables whose names start with “TIME” (TIMEAP, TIMEMI, etc.). Save your reduced data file to your computer under a different filename, such as FHS_assign7.xlsx.

Check the records for missing values, delete any records with missing values, and re-save your dataset. Read your new analysis file into R. You are good to go.
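The Excel clean-up above can also be done directly in R; a minimal sketch, shown here on a toy stand-in data frame since the real file layout may differ:

```r
# Toy stand-in for the Framingham file (column names follow the data dictionary)
fhs <- data.frame(sex1 = c(1, 2, 1), sex2 = c(1, 2, 1), age3 = c(60, 61, NA),
                  TIMEMI = c(100, 200, 300), stroke = c(0, 1, NA))

# Drop follow-up variables whose names end in 2 or 3 (sex2, age3, ...)
fhs <- fhs[, !grepl("[23]$", names(fhs)), drop = FALSE]

# Drop the time-to-event variables whose names start with "TIME"
fhs <- fhs[, !grepl("^TIME", names(fhs)), drop = FALSE]

# Keep only complete records
fhs <- na.omit(fhs)
```

The same four lines applied to the full file (read in with read.csv or a package such as readxl) produce the reduced analysis data set without any manual Excel editing.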
ASSIGNMENT TASKS
Part A – Mechanics (25 points)

For this analysis, the variable “stroke” should be considered the response variable (Y) and the “diabetes1” variable should be considered the explanatory variable (X). Complete the following:

1) Construct a side-by-side bar graph to compare these two categorical variables. Describe what you see in this graph. Be sure to label the axes and give the graph a title.

2) Construct a contingency table, complete with marginal row and column totals, for these two variables, then answer the following:

a) What is the conditional probability of having a stroke given diabetes is present at time 1? What is the conditional probability of having a stroke given diabetes is NOT present at time 1?

b) What are the odds of having a stroke if diabetes is present at time 1? What are the odds of having a stroke if diabetes is not present at time 1?

c) Calculate the odds ratio of having a stroke when diabetes is present relative to when it is not. Interpret this result.

d) Specify the null and alternative hypotheses, then conduct a hypothesis test to see if diabetes is related to having a stroke. Interpret the results.
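The table, probability, odds, and test mechanics in tasks 1-2 can be sketched as follows; the counts below are hypothetical placeholders for illustration, so substitute your own stroke and diabetes1 columns:

```r
# Hypothetical 0/1 vectors standing in for stroke (Y) and diabetes1 (X)
stroke   <- c(rep(1, 40), rep(0, 160), rep(1, 60), rep(0, 740))
diabetes <- c(rep(1, 200), rep(0, 800))

# Side-by-side bar graph of the two categorical variables
barplot(table(stroke, diabetes), beside = TRUE,
        main = "Stroke by Diabetes Status at Time 1",
        xlab = "Diabetes (0 = absent, 1 = present)", ylab = "Count",
        legend.text = c("No stroke", "Stroke"))

# Contingency table with marginal row/column totals
tab <- table(diabetes, stroke)
addmargins(tab)

# Conditional probabilities, odds, and the odds ratio
p_d  <- tab["1", "1"] / sum(tab["1", ])   # P(stroke | diabetes present)
p_nd <- tab["0", "1"] / sum(tab["0", ])   # P(stroke | diabetes absent)
odds_d  <- p_d / (1 - p_d)
odds_nd <- p_nd / (1 - p_nd)
odds_ratio <- odds_d / odds_nd

# H0: stroke and diabetes are independent; H1: they are related
chisq.test(tab)
```

With these made-up counts the odds ratio comes out above 3, i.e. the odds of stroke are roughly three times higher when diabetes is present; your real data will of course give different numbers.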
Part B – Open-Ended Analysis (75 points)

3) In professional practice, when you have an observational dataset like the Framingham Heart Study data, you are typically looking for risk factors: explanatory variables that are related to specific response variables of interest. For this task, you will identify and work with only categorical explanatory variables. The response variables of interest are ANYCHD, STROKE, and DEATH. Which categorical explanatory variables seem to indicate elevated risk of coronary heart disease, stroke, or death? Conduct an analysis, then report and interpret your results.

4) Which of the continuous explanatory variables do you think is most likely indicative of elevated risk of coronary heart disease (ANYCHD), stroke, or death? Pick one such variable. Create a new variable that maps the continuous variable’s values into a categorical variable with at least 3 levels. Conduct contingency table analyses relating this newly created categorical variable to ANYCHD, STROKE, and DEATH. These analyses should be done separately; in other words, you will have at least 3 separate contingency tables. Do NOT attempt multi-dimensional contingency tables! Report on the results of your analysis and discuss the results.

5) Reflect on your experiences here. What are your recommendations for future analysis?

Congratulations! You have completed Assignment 8. Please save your R code, because you can re-use or cannibalize it in future assignments. Your write-up should address each task.
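For task 4, base R’s cut() is one convenient way to map a continuous variable into 3 or more levels; the variable choice and cut points below are hypothetical placeholders:

```r
# Hypothetical continuous predictor and binary response (substitute your own)
age1   <- c(39, 45, 52, 58, 61, 67, 70, 48, 55, 63)
anychd <- c(0, 0, 1, 0, 1, 1, 1, 0, 0, 1)

# Map the continuous values into a 3-level categorical variable
age_cat <- cut(age1, breaks = c(-Inf, 49, 59, Inf),
               labels = c("under 50", "50-59", "60 plus"))

table(age_cat)                 # counts per level
tab <- table(age_cat, anychd)  # one two-way table per response variable
chisq.test(tab)                # repeat separately for STROKE and DEATH
```

On a toy sample this small, chisq.test() warns about low expected cell counts; on the full analysis file that warning should disappear.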


Title: Comparison of Multiple Groups via ANOVA: A Study on Nutrition, Age, and Alcohol Consumption

Comparison of Multiple Groups via ANOVA

1) Download the Nutrition study data and read it into R-Studio. We will work with the entire data set for this assignment. Use the IFELSE( ) function to create 2 new categorical variables, defined as:

Age_Cat = 1 if Age <= 19; 2 if 20 <= Age <= 29; 3 if 30 <= Age <= 39; 4 if 40 <= Age <= 49; 5 if 50 <= Age <= 59; 6 if 60 <= Age <= 69; 7 if Age >= 70

and

Alcohol_Cat = 0 if Alcohol = 0; 1 if 0 < Alcohol < 10; 2 if Alcohol >= 10
If you have trouble using the IFELSE( ) function in R, you could create these new categorical variables in EXCEL, and then just read them into R with the dataset. It works either way.
Report the counts for each value of these 2 new categorical variables.
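One way to build the two variables with nested IFELSE( ) calls, shown on toy Age and Alcohol vectors (run the same lines on the Nutrition data columns):

```r
# Toy values for illustration; replace with the Nutrition data columns
Age     <- c(18, 25, 34, 47, 55, 63, 72)
Alcohol <- c(0, 2, 0, 12, 5, 0, 30)

# Nested ifelse() builds the 7-level age category
Age_Cat <- ifelse(Age <= 19, 1,
           ifelse(Age <= 29, 2,
           ifelse(Age <= 39, 3,
           ifelse(Age <= 49, 4,
           ifelse(Age <= 59, 5,
           ifelse(Age <= 69, 6, 7))))))

# 3-level alcohol category, using the cut points defined above
Alcohol_Cat <- ifelse(Alcohol == 0, 0,
               ifelse(Alcohol < 10, 1, 2))

table(Age_Cat)      # counts for each value
table(Alcohol_Cat)
```

table() on each new variable gives exactly the counts the task asks you to report.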
2) Using the variable Quetelet as the dependent response variable (Y), specify the null and alternative hypotheses and conduct a one-way ANOVA F-test to check for mean differences across the levels of the Age_Cat variable, and a separate ANOVA for the Alcohol_Cat variable. Interpret the two hypothesis tests. What do you conclude? If you have a statistically significant result at the alpha = 0.05 level, you must follow up the significant ANOVA with a post hoc analysis. At this point, use 95% confidence intervals for each group to determine if there are group mean differences and where they occur. Discuss your findings.

3) Now, using the Calories variable as the dependent response variable (Y), conduct similar ANOVA hypothesis tests and obtain confidence intervals for each group to determine if there are group mean differences relative to Age_Cat and Alcohol_Cat. You will need to clearly set up the null and alternative hypotheses, conduct the test with appropriate statistics, and interpret the individual group confidence intervals.

4) For the FAT, FIBER, and CHOLESTEROL variables, use a 95% confidence interval approach to compare groups, on average, for Age_Cat and Alcohol_Cat. Interpret the confidence intervals. Use whatever outside information you can obtain to help interpret the results.

5) With the results from this additional analysis, how has the story description from Modeling Assignment #3 changed? You are welcome to bring in information from your prior knowledge and experience to embellish this story. Is the analysis sufficient so far for your story, or is something missing? What should be done next? Write up your synthesis of what this data set seems to be saying (up to this point) and where we should go from here.
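The ANOVA-plus-confidence-interval pattern in tasks 2-4 can be sketched as follows, using simulated stand-ins for Quetelet and Age_Cat:

```r
# Simulated response and 3-level group factor (stand-ins for Quetelet and Age_Cat)
set.seed(42)
grp <- factor(rep(1:3, each = 20))
y   <- rnorm(60, mean = c(22, 24, 27)[grp], sd = 2)

# One-way ANOVA: H0 all group means are equal vs H1 at least one differs
fit <- aov(y ~ grp)
summary(fit)

# Post hoc: 95% confidence interval for each group mean (mean +/- t * se)
ci <- sapply(levels(grp), function(g) {
  x <- y[grp == g]
  mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
})
ci  # columns are groups; non-overlapping intervals suggest where differences lie
```

The same two steps (aov() then per-group intervals) repeat for each response variable and each categorical factor the tasks name.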


Title: Predicting Ticket Sales Patterns and Top Purchasers for 2016 Using Regression Analysis

Ticket sales patterns (15 points)
i. Create a model to predict ticket revenues or ticket revenue groups
for 2014 using the previous five years of data.
ii. Test your model on 2015 data. Comment.
iii. Make predictions for ticket purchases in 2016 (Like the Moneyball
example, the data of 2016 is missing. Assume that the coefficients
and the intercept values for the model created in point (ii) will be
the same for predicting 2016).
iv. Based on your model, who should be the top 10 ticket purchasers
for 2016?
I need the answers to the questions above; I need the top 10 in an Excel file, and for the rest, screenshots of the work done in R.
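Since the ticket data itself is not included here, the modeling steps i-iv can be sketched on a synthetic stand-in; the variable names (rev2009, …) and the output file name top10_2016.csv are assumptions:

```r
# Synthetic stand-in for the ticket data: one row per purchaser, yearly revenues
set.seed(1)
n <- 50
hist <- data.frame(rev2009 = runif(n, 100, 1000))
for (yr in 2010:2015) {
  hist[[paste0("rev", yr)]] <- hist$rev2009 * (1 + 0.05 * (yr - 2009)) + rnorm(n, 0, 20)
}

# (i) Predict 2014 revenue from the previous five years (2009-2013)
fit <- lm(rev2014 ~ rev2009 + rev2010 + rev2011 + rev2012 + rev2013, data = hist)

# (ii) Test on 2015: shift the predictor window one year forward
test <- setNames(hist[, paste0("rev", 2010:2014)], paste0("rev", 2009:2013))
pred2015 <- predict(fit, newdata = test)
rmse2015 <- sqrt(mean((pred2015 - hist$rev2015)^2))

# (iii) Reuse the same coefficients/intercept to predict 2016 from 2011-2015
new <- setNames(hist[, paste0("rev", 2011:2015)], paste0("rev", 2009:2013))
pred2016 <- predict(fit, newdata = new)

# (iv) Top 10 predicted purchasers for 2016, written to an Excel-readable CSV
top10 <- order(pred2016, decreasing = TRUE)[1:10]
write.csv(data.frame(purchaser = top10, predicted = pred2016[top10]),
          "top10_2016.csv", row.names = FALSE)
```

Renaming the shifted columns with setNames() is what lets predict() reuse the fitted coefficients unchanged, exactly as step (iii) asks.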


“Predicting Ticket Sales Patterns: Analysis and Forecasting for 2014-2016”

I was given this answer, but I need the answers to the questions with visualizations, not just the code written out.
Ticket sales patterns (15 points)
i. Create a model to predict ticket revenues or ticket revenue groups
for 2014 using the previous five years of data.
ii. Test your model on 2015 data. Comment.
iii. Make predictions for ticket purchases in 2016 (Like the Moneyball
example, the data of 2016 is missing. Assume that the coefficients
and the intercept values for the model created in point (ii) will be
the same for predicting 2016).
iv. Based on your model, who should be the top 10 ticket purchasers
for 2016?


Title: Analysis and Evaluation of the State of Texas Regional Water Plans: Coastal Bend (N) Area

Write a detailed report on the Analysis and Evaluation of the State of Texas Regional Water Plans: Coastal Bend (N) area. (The most important requirement is to do the analysis correctly using R code.)
Read the Word file carefully; the Excel data is attached in the link.
Data Analysis: Analyze the collected data in R using optimization methods (goal programming and genetic algorithms) and uncertainty methods (Monte Carlo simulation), and make recommendations on alternatives for the management plans based on the enumerated risks.
Evaluation: Evaluate the effectiveness of the current water plans in meeting the water needs of the assigned region. Identify any potential areas of improvement. What are some of the deficiencies? How well does the water plan address uncertainty?

Recommendations: Based on the analysis and evaluation, provide recommendations for improving the water plans. This could include suggestions for new water management strategies or modifications to existing ones.
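As an illustration of the Monte Carlo side of the analysis, here is a minimal sketch estimating the risk of a supply shortfall; every distribution and number below is a hypothetical placeholder to be replaced with values from the regional plan’s Excel data:

```r
# Monte Carlo sketch: probability and severity of a water supply shortfall.
# Means/sds are hypothetical placeholders, not values from the actual plan.
set.seed(2030)
n_sim  <- 10000
demand <- rnorm(n_sim, mean = 220, sd = 25)  # projected annual demand
supply <- rnorm(n_sim, mean = 240, sd = 30)  # firm supply under drought of record

shortfall <- pmax(demand - supply, 0)
p_short   <- mean(shortfall > 0)             # risk of any shortage
q95       <- quantile(shortfall, 0.95)       # shortfall severity, 95th percentile

c(probability_of_shortage = p_short, shortfall_95th = unname(q95))
```

Running such a simulation for each management alternative gives a comparable risk figure per alternative, which is the basis for the recommendations section.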


Exploring Relationships in the Nutrition Study Data

1) Download the Nutrition study data and read it into R-Studio. We will work with the entire data set for this assignment. Use the IFELSE( ) function to create 2 new categorical variables. The variables are:

Alcohol_Use: 1 (yes) if Alcohol > 0; 0 (no) if Alcohol = 0
Age_retired: 1 if Age >= 65; 0 if Age < 65

If you have trouble using the IFELSE( ) function in R, you could create these new categorical variables in Excel and then just read them into R with the dataset. It works either way. Report the counts for each value of these 2 new categorical variables.

2) For this problem, we are going to see if smoking (SMOKE) is related to body mass (QUETELET). Here, Quetelet is the continuous dependent response variable (Y) and Smoke (X) is the categorical explanatory variable. Please complete the following:

a) Obtain descriptive statistics on Y for each group. In a table, report each group's sample size, mean, standard deviation, and variance.
b) Clearly state the null and alternative hypotheses in words and symbols.
c) Use R to obtain the test statistic and p-value for the classic pooled-variance two-sample t-test. Report the test statistic and p-value, and then state the decision to be made.
d) Report the formula for the test statistic in part c) and verify the computer's computations using the descriptive statistics from part a).
e) Calculate and report confidence intervals for both groups. Discuss the interpretation of the result based on the confidence intervals. Is it consistent with the hypothesis test result? If they are different, which should you believe?

3) Moving into a more data-analytic framework, the next question would be: are there any 2-group categorical variables that exhibit differences relative to the Quetelet variable? Reframing this as a direction for the assignment: using the variable Quetelet as the dependent response variable (Y), conduct hypothesis tests and obtain confidence intervals (for each group) to determine if there are group mean differences relative to the categorical variables Gender (male vs. female), Age_retired, and Alcohol_Use. You will need to clearly set up the null and alternative hypotheses, conduct the test with appropriate statistics, and interpret the individual group confidence intervals.
Please use tables to summarize your findings. What decisions do you make from these results? How would you summarize the "story" that emerges from these analyses of the body mass (Quetelet) variable?

4) Using the CHOLESTEROL variable as the dependent response variable (Y), conduct hypothesis tests and obtain confidence intervals (for each group) to determine if there are group mean differences relative to Gender (male vs. female), Smoke, Age_retired, and Alcohol_Use. You will need to clearly set up the null and alternative hypotheses, conduct the test with appropriate statistics, and interpret the individual group confidence intervals. How would you summarize the "story" that emerges from these analyses of the CHOLESTEROL variable?

5) Typically, in an open-ended data-analytic project, the analyst would look to see whether any of the potential response variables are related to the explanatory categorical variables of interest. To limit the amount of analytical work, for the FAT, FIBER, and ALCOHOL variables, use a 95% confidence interval approach to compare groups, on average, for Gender (male vs. female), Smoke, Age_retired, and Alcohol_Use. Do NOT conduct or report on formal hypothesis tests! How would you summarize the "story" that emerges from these analyses?

6) Given what you've found so far comparing groups, what is surprising to you? What turned up that you did not expect, if anything? What would explain these results? What do you think should be the next steps for any analysis of this Nutrition data? Your write-up should address each task.
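The descriptive statistics, pooled t-test, and group confidence intervals in task 2 can be sketched like this, on simulated stand-ins for QUETELET and SMOKE:

```r
# Simulated stand-ins; replace with the Nutrition data columns
set.seed(7)
quetelet <- c(rnorm(40, mean = 24, sd = 3), rnorm(25, mean = 26, sd = 3))
smoke    <- factor(c(rep("No", 40), rep("Yes", 25)))

# (a) Descriptive statistics per group
by(quetelet, smoke, function(x) c(n = length(x), mean = mean(x),
                                  sd = sd(x), var = var(x)))

# (c) Classic pooled-variance two-sample t-test; H0: mu_No = mu_Yes
tt <- t.test(quetelet ~ smoke, var.equal = TRUE)
tt$statistic
tt$p.value

# (e) 95% confidence interval for each group mean
ci <- sapply(levels(smoke), function(g) {
  x <- quetelet[smoke == g]
  mean(x) + c(-1, 1) * qt(0.975, length(x) - 1) * sd(x) / sqrt(length(x))
})
ci
```

Swapping in Gender, Age_retired, or Alcohol_Use for smoke, and CHOLESTEROL for quetelet, gives the analyses asked for in tasks 3-5.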


Title: Predicting Used Car Prices with Neural Networks: A Study on the Impact of Hidden Layers and Nodes

Car Sales. Consider the data on used cars (mlba::ToyotaCorolla) with 1436 records and details on 38 variables, including Price, Age, KM, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.
Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar.
To ensure everyone gets the same results, use the following code to convert categorical predictors to dummies, create training and holdout data sets, and normalize the training set and holdout set. Note the holdout set is normalized by using the training set.
# load the data and preprocess
library(dplyr)
library(caret)
toyota.df <- mlba::ToyotaCorolla %>%
  mutate(
    Fuel_Type_CNG = ifelse(Fuel_Type == "CNG", 1, 0),
    Fuel_Type_Diesel = ifelse(Fuel_Type == "Diesel", 1, 0)
  )

# partition
set.seed(1)
idx <- createDataPartition(toyota.df$Price, p=0.6, list=FALSE)
train.df <- toyota.df[idx, ]
holdout.df <- toyota.df[-idx, ]

# Normalize the data sets. Use the training set to determine the normalization.
normalizer <- preProcess(train.df, method="range")
train.norm.df <- predict(normalizer, train.df)
holdout.norm.df <- predict(normalizer, holdout.df)

Fit a neural network model to the data. Use a single hidden layer with two nodes. Record the RMS error for the training data and the holdout data. Repeat the process, changing the number of hidden layers and nodes: a single layer with 5 nodes, and two layers with 5 nodes in each layer. What happens to the RMS error for the training data as the number of layers and nodes increases? What happens to the RMS error for the holdout data? Comment on the appropriate number of layers and nodes for this application.
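One possible way to fit and compare the three architectures, assuming the neuralnet package is installed (shown on a small simulated data set so it runs stand-alone; on the assignment, substitute train.norm.df, holdout.norm.df, and the full predictor list):

```r
library(neuralnet)  # assumption: neuralnet is installed; it supports multiple hidden layers

# Small simulated normalized data standing in for train.norm.df / holdout.norm.df
set.seed(1)
d <- data.frame(Age_08_04 = runif(120), KM = runif(120))
d$Price <- 1 - 0.6 * d$Age_08_04 - 0.3 * d$KM + rnorm(120, 0, 0.05)
train <- d[1:80, ]
hold  <- d[81:120, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# hidden = 2: one layer, two nodes; hidden = 5: one layer, five nodes;
# hidden = c(5, 5): two layers, five nodes each
for (h in list(2, 5, c(5, 5))) {
  nn <- neuralnet(Price ~ Age_08_04 + KM, data = train, hidden = h,
                  linear.output = TRUE, stepmax = 1e6)
  cat("hidden =", paste(h, collapse = ","),
      "| train RMSE:", rmse(train$Price, predict(nn, train)),
      "| holdout RMSE:", rmse(hold$Price, predict(nn, hold)), "\n")
}
```

Typically the training RMSE keeps falling as capacity grows while the holdout RMSE levels off or rises, which is the overfitting pattern the questions are probing for.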


“Email Message Analysis: Identifying SPAM and HAM using Derived Variables”

Process a collection of email messages and create an R data frame of “derived” variables that give various measures of the email messages, e.g., the number of recipients to whom the mail was sent, the percentage of capital words in the body of the text, and whether the message is a reply to another message. See below for a list of all the variables, and also consider other variables you think might help classify a message as SPAM versus HAM. The messages are in 5 different directories/folders; the name of the directory indicates whether the messages it contains are HAM or SPAM. There are 6,541 messages in total. This is a large amount of data.
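A sketch of one derived-variables function; the header parsing and the variable names below are assumptions about the message format, so adapt them to the actual files:

```r
# Compute a few derived variables from one raw message file.
# is_spam is taken from the folder name (HAM vs SPAM directories).
derive_vars <- function(path, is_spam) {
  lines <- readLines(path, warn = FALSE)
  blank <- which(lines == "")[1]               # first blank line splits header/body
  if (is.na(blank)) blank <- length(lines) + 1
  header <- lines[seq_len(blank - 1)]
  body   <- if (blank < length(lines)) lines[(blank + 1):length(lines)] else character(0)
  words  <- unlist(strsplit(paste(body, collapse = " "), "[[:space:]]+"))
  words  <- words[nzchar(words)]
  data.frame(
    isSpam   = is_spam,
    # number of comma-separated recipients on the To: header line(s)
    numRecip = length(unlist(strsplit(grep("^To:", header, value = TRUE), ","))),
    # is the message a reply (Subject starts with "Re:")?
    isRe     = any(grepl("^Subject:\\s*Re:", header, ignore.case = TRUE)),
    # fraction of body words written entirely in capitals
    pctCapital = if (length(words) > 0)
      mean(words == toupper(words) & grepl("[A-Za-z]", words)) else 0
  )
}

# One row per message; the folder name "spam_2" below is a placeholder:
# msgs <- list.files("spam_2", full.names = TRUE)
# df   <- do.call(rbind, lapply(msgs, derive_vars, is_spam = TRUE))
```

Running lapply() plus rbind over all five folders, with is_spam set from each folder’s name, yields the single data frame of derived variables the task asks for.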
