1. Question Formulation (10 points): You need to devise a question that can be answered through data analysis. This question should be of your own creation, and it should reflect your curiosity and interest.

2. Data Collection (15 points): You are responsible for finding an appropriate dataset that aligns with your chosen question. Ensure that the data is clean and organized for analysis. If you don't know where to find a dataset, you can use Kaggle.com. It can also give you inspiration for both question formulation and data collection. You need to state where you got your data in order to receive credit.
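
As a rough illustration of what "clean and organized" can mean in practice, here is a minimal sketch using pandas. The column names and values are made up for the example, and dropping rows is only one possible cleaning choice; your dataset may call for something different.

```python
import io
import pandas as pd

# Hypothetical stand-in for a downloaded file; with real data you would
# call pd.read_csv("your_file.csv") instead.
raw_csv = io.StringIO(
    "age,income,city\n"
    "34,52000,Austin\n"
    "29,,Boston\n"
    "34,52000,Austin\n"
    "41,61000,Denver\n"
)
df = pd.read_csv(raw_csv)

missing_per_column = df.isna().sum()          # missing values in each column
duplicate_rows = int(df.duplicated().sum())   # fully repeated rows

# One simple cleaning choice (an assumption, not a requirement):
# drop repeated rows, then drop rows that still have a missing value.
clean = df.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```

Whatever cleaning you do, describe it in your Data Collection and Preprocessing section so the reader knows which rows were removed or changed and why.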

3. Exploratory Data Analysis (30 points): Conduct an EDA to understand the characteristics of your dataset. This step will help you gain insights and identify patterns in the data (similar to Assignment 2). Here are the key components of EDA I expect in your paper. Each component is worth 6 points if your EDA does not include any categorical variables (five components), or 5 points if it includes the analysis of categorical variables (six components).

1) Summary Statistics: Compute basic statistics for the dataset, such as mean, median, standard deviation, minimum, maximum, and quartiles. These provide an overview of the data's central tendency and spread.

2) Data Visualization: Create various plots and charts to visualize the data's distribution and relationships. Common visualization tools include histograms, box plots, scatter plots, bar graphs, and line graphs.

3) Data Distribution: Examine the distribution of individual variables. This helps in identifying whether the data is normally distributed, skewed, or exhibits other patterns. Understanding the distribution can influence the choice of statistical tests and modeling techniques.

4) Correlation Analysis: Determine the relationships between variables using correlation coefficients or scatter plots. It can reveal potential associations and dependencies between variables.

5) Categorical Variables (only if your data includes this type of variable and you think it is important for answering your question; if it is not, don't worry about it): Explore the distribution of categorical variables using frequency tables, bar charts, or pie charts.

6) Hypothesis Generation: Your exploratory data analysis should eventually lead to hypotheses about relationships or patterns in the data, which you can use to answer your question or guide further analysis.
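
To make components 1), 3), 4), and 5) more concrete, here is a minimal sketch using pandas. The dataset (hours studied, exam score, major) is invented for illustration only, and the specific calls shown are one reasonable way to do each step, not the only acceptable one.

```python
import pandas as pd

# Hypothetical toy dataset; replace with your own DataFrame.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 95],
    "major": ["math", "bio", "math", "cs", "cs", "bio", "cs", "math"],
})
numeric = df[["hours_studied", "exam_score"]]

# 1) Summary statistics: count, mean, std, min, quartiles, max.
summary = numeric.describe()
medians = numeric.median()

# 3) Distribution: skewness near 0 suggests a roughly symmetric variable;
#    a positive value suggests a longer right tail.
skewness = numeric.skew()

# 4) Correlation: Pearson correlation matrix of the numeric columns.
corr = numeric.corr()

# 5) Categorical variable: frequency table of each category.
major_counts = df["major"].value_counts()

print(summary)
print(medians)
print(skewness)
print(corr)
print(major_counts)
```

In your paper, each of these outputs should come with a sentence or two of interpretation, for example whether a strong correlation supports a hypothesis you want to test.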

4. Machine Learning (15 points): Apply a machine learning algorithm to address your question. You are only required to choose one type of algorithm for this mini-project, but you may have to run it multiple times with different variables and decide which run gives the best result. You have the flexibility to choose from the algorithms we've learned in class, but make sure the selected algorithm is appropriate for your data. Alternatively, if you find a specific algorithm outside of the class materials that suits your needs, you are welcome to use it.
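
As one possible way to "run it multiple times with different variables," the sketch below fits the same algorithm on two different input variables and compares the error on held-out rows. Linear regression is used purely as an example and is not a required choice; the data is synthetic, and the names `x1`, `x2`, and `holdout_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (an assumption for this sketch):
# x1 drives the target, x2 is unrelated noise.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 + rng.normal(scale=0.5, size=n)

# Simple split: first 150 rows for training, last 50 for testing.
train = np.arange(n) < 150
test = ~train

def holdout_mse(features):
    """Fit least-squares linear regression on the training rows
    and return the mean squared error on the test rows."""
    X = np.column_stack([np.ones(n)] + features)  # intercept + features
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    residuals = y[test] - X[test] @ coef
    return float(np.mean(residuals ** 2))

mse_x1 = holdout_mse([x1])  # run 1: predict y from x1
mse_x2 = holdout_mse([x2])  # run 2: predict y from x2

print("test MSE using x1:", mse_x1)
print("test MSE using x2:", mse_x2)
```

The variable with the lower test error is the better predictor in this toy setup. With your own data, report the comparison you made and explain why you chose the final set of variables.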

5. Project Structure (20 points): While this is a mini-project, your report should follow a structure similar to a combination of Assignment 2 and Assignment 3. This means it should include the following sections: Introduction, Data Collection and Preprocessing, EDA, Machine Learning, Results and Discussion, and Conclusion.

6. Data Attribution and References (10 points): In the Conclusion section of your report, make sure to include a subsection titled "Data Attribution and References." In this subsection, provide a detailed list of the sources from which you obtained your data, including the dataset name, the organization or website it came from, and any relevant publication or citation information. Additionally, if you consulted external research papers, articles, or other resources during your project, please list these references in the same subsection.

General Requirements

1) You will need to write up your questions, findings, interpretations, and results for this assignment. It is a good idea to include screenshots of your code, results, and graphs so that you can explain your findings alongside them. (It is also easier for me to follow you when I read your paper.) A PDF file is required. There is no page limit, but try to be straightforward with your answers.

2) You will also need to submit the .py file that you used to complete your assignment. (It may duplicate, or partly duplicate, the screenshots that you inserted in your paper, but that is okay. I would like to look over your code.)

