For this assignment, I chose a dataset of student exam scores that includes each student's ethnicity, parental education, lunch type, whether or not they completed test preparation, and their scores on three standardized tests. I found the data at https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams?resource=download.
These are the questions I wanted to answer:
Overall, did more students pass or fail? Did the students who completed the test preparation gain an advantage? Was there a significant relationship between students’ average test scores and their lunch type? Did students whose parents had more education have an advantage on the math exam? Does gender affect writing scores?
I started with my first two questions: did more students pass or fail, and did the students who completed the test preparation gain an advantage? I averaged each student's three test scores and counted an average of 60% or higher as a pass and anything below as a fail. This gave 713 passes and 287 fails. I then compared the pass/fail results with whether or not each student completed the test preparation course. Here is that graph.
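The pass/fail labelling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four rows are made-up examples (not rows from the dataset), the column names are assumed to match the Kaggle CSV header, and the helper names average_score and label are hypothetical.

```python
from collections import Counter

# Made-up example rows for illustration; the real dataset has 1000 students.
# Column names are assumed to match the CSV header.
students = [
    {"test preparation course": "completed", "math score": 72, "reading score": 80, "writing score": 78},
    {"test preparation course": "none", "math score": 45, "reading score": 58, "writing score": 52},
    {"test preparation course": "none", "math score": 60, "reading score": 60, "writing score": 60},
    {"test preparation course": "completed", "math score": 50, "reading score": 62, "writing score": 65},
]

PASS_MARK = 60  # an average of 60% or higher counts as a pass


def average_score(row):
    """Mean of the three test scores for one student."""
    return (row["math score"] + row["reading score"] + row["writing score"]) / 3


def label(row):
    """Return 'pass' if the student's average reaches the pass mark, else 'fail'."""
    return "pass" if average_score(row) >= PASS_MARK else "fail"


# Overall pass/fail frequency
overall = Counter(label(row) for row in students)

# Pass/fail frequency split by test preparation status
by_prep = Counter((row["test preparation course"], label(row)) for row in students)

print(overall)
print(by_prep)
```

Counting the (preparation, result) pairs gives the same numbers that a grouped bar chart of pass/fail by test preparation would display.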

My next question was whether there was a significant relationship between students’ average test scores and their lunch type. To answer it, I used the column I had created with each student’s average test percentage and compared it across the two lunch options, free/reduced versus standard. Before running a test, I created a notched box plot to see whether the difference was even worth testing.
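As a rough guide to reading the notches: under a common convention, the notch spans the median plus or minus 1.57 × IQR / sqrt(n), and two groups whose notches do not overlap are likely to have different medians. The sketch below computes that interval with the standard library. The scores are made-up numbers, the function name notch_interval is hypothetical, and the quartile method is an assumption, so a plotting library may draw slightly different notches.

```python
import math
from statistics import quantiles


def notch_interval(scores):
    """Approximate notch range around the median: median +/- 1.57 * IQR / sqrt(n)."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(scores))
    return median - half_width, median + half_width


# Made-up average scores for illustration only
standard = [70, 75, 80, 85, 90]
free_reduced = [55, 60, 65, 70, 75]

low_std, high_std = notch_interval(standard)
low_free, high_free = notch_interval(free_reduced)

# If the notch ranges do not overlap, a formal test is worth running
notches_overlap = low_std <= high_free and low_free <= high_std
print(notches_overlap)
```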

To run the t-test, I split the average test scores into two arrays, one for each lunch type, and compared them, which gave a p-value of 3.18 × 10^-29. Since this is far below the 0.05 threshold, I concluded that there is a statistically significant difference in average test scores between the two lunch groups.
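The two-array comparison described above can be sketched as follows. The report does not say which variant of the t-test was used, so this sketch assumes Welch's version (unequal variances). The scores are made-up numbers, the function names are hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples such as the full dataset; a statistics library would compute it from the t distribution instead.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Made-up average scores for illustration only
standard = [70, 75, 80, 85, 90]
free_reduced = [55, 60, 65, 70, 75]

t_stat, dof = welch_t(standard, free_reduced)
p_value = approx_two_sided_p(t_stat)
print(t_stat, dof, p_value)
```

A p-value below 0.05 is the cutoff used in this report for calling a difference significant.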
Next, I wanted to find out whether there were any big differences in parental education across the students’ race/ethnicity groups. Here is what I found.
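One way to compare the two categorical columns is a cross-tabulation: count the students in each combination of group and education level, then convert the counts to shares within each group. The sketch below is a minimal illustration; the pairs are made-up examples, the category labels are assumed to follow the dataset's wording, and the function name crosstab_shares is hypothetical.

```python
from collections import Counter

# Made-up (race/ethnicity, parental level of education) pairs for illustration only
pairs = [
    ("group A", "some high school"),
    ("group B", "master's degree"),
    ("group B", "some high school"),
    ("group C", "master's degree"),
    ("group B", "some high school"),
]


def crosstab_shares(rows):
    """For each group, the share of its students at each parental education level."""
    counts = Counter(rows)                      # students per (group, education) cell
    totals = Counter(group for group, _ in rows)  # students per group
    return {(group, edu): n / totals[group] for (group, edu), n in counts.items()}


shares = crosstab_shares(pairs)
for (group, edu), share in sorted(shares.items()):
    print(group, edu, round(share, 2))
```

Using shares rather than raw counts keeps the comparison fair when the groups have different sizes.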

Another test I wanted to run compared the math scores of students whose parents have a master’s degree with those of students whose parents have only some high school education. Before doing so, I again made a box plot, this time of math scores for every level of parental education, and got this.

Another relationship I wanted to examine was whether writing and reading test scores followed similar patterns for each gender. Since both are numeric values, I made a scatter plot of one against the other, with the points separated by gender. Here is what I found.

Next, I ran a t-test comparing the writing scores of males and females and got a p-value of 2.96 × 10^-15, which indicates a significant difference, with females tending to score higher. One thing I noticed was that this was the largest p-value of all my tests, the closest to 0.05. My guess is that it is hard to say with confidence that a difference like this is truly driven by gender. In the future, running this test on a bigger dataset would be beneficial.
Overall, this dataset let me analyze and predict several factors that may be linked to students receiving better test scores. I concluded that completing the test preparation gave students a much higher chance of receiving a better score than not completing it. I also saw that lunch type, free/reduced versus standard, had a significant relationship with test scores. I ran a test comparing the math scores of students whose parents had a master’s degree with those of students whose parents had only some high school education, because when a student is struggling in math, the first people they often go to are their parents. I found a significant difference in math scores between the two levels of parental education. I also compared writing and reading scores by gender in a scatter plot and concluded that females had higher scores in both sections, which matched my prediction that women tend to be better writers.
In the future, I would like to test the effect of gender on writing scores with a bigger dataset and compare parental education to each test score individually. I would also like to have more columns, such as whether the student was an athlete, or possibly build a generator that randomly assigns whether each student was an athlete. Adding further columns, such as whether the student packed lunch, would also help confirm or challenge the results of the statistical tests.
