📝 Assignment 1

Due date: Friday, April 12 at 5pm Pacific.

⏳ We recommend attempting each problem ASAP so you can accurately estimate the time needed to complete the assignment.

  • This is not an assignment to start the night before the due date!
  • Remember that MS&E 125 is a 4-unit course. For the median student, this is supposed to translate to 3 weekly hours of lecture and 9 weekly hours of working on assignments and studying.

Unless otherwise stated, assignments are to be done individually. You are welcome to work with others to master the principles and approaches used to solve the homework problems, but the work you turn in should be your own.

This assignment has not been seen by a previous cohort of MS&E 125 students, so there may be some unforeseen hiccups. If anything seems confusing or unclear, please create an Ed post.

We will use this Ed post to track errors and clarifications on HW1.

📮 Submission

Submit your assignment via Gradescope. Make sure to tag your answers properly on Gradescope, or else you may be docked points.

  • For the probability review problems, prepare a photo of your handwritten answers to each problem, and convert the photo to PDF.

  • For the Google Colab submission, first run all of your cells using the Run all command in the Runtime menu. Then, download your completed Google Colab notebook as an .ipynb file. Finally, use this website to convert your .ipynb file to .pdf format. Proofread the PDF to make sure all of your answers and plots are visible and not cut off. If your PDF is longer than 50 pages, it is likely that your code is printing out the entire dataset or a really long vector. Please make sure to comment out any code that prints more information than each question asks you for.

  • Issues converting to .pdf? Make sure there are no error messages in the outputs after you run all cells. Please do not use any special characters in the filename of the .ipynb file that you upload.

  • For the plot presentation, create a text file containing a public link to your screencast. Convert this text file to PDF format. Additionally, submit the public link to your screencast using this Google Form.

  • For the data collection assignment, convert your selfie with the completed data collection form to PDF format. Additionally, fill out this Google Form using your data collection sheet.

Finally, concatenate the four PDFs above using a tool of your choice. For example, you could use this website.

🎲 Probability review (10% of the assignment grade)

Please complete the probability problems below. Each question requires only basic arithmetic (i.e., no calculus is required).

If you have taken an introductory probability course, like MS&E 120 or CS 109, you have likely seen problems like these before.

If you have not taken an introductory probability course, but you have taken an introductory statistics course, you are entirely capable of thriving in MS&E 125! All we ask is that you familiarize yourself with the notation used below, and read up on the basic properties of the expected value and variance of random variables.
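As a starting point for that reading, here are the standard properties of expected value and variance, stated for generic random variables \(X\) and \(Y\) and constants \(a\) and \(b\). These symbols are introduced here for illustration only and are not part of the problems below. The abbreviation "iid" stands for "independent and identically distributed."

\[
\begin{aligned}
\mathbb{E}(aX + b) &= a\,\mathbb{E}(X) + b \\
\mathbb{E}(X + Y) &= \mathbb{E}(X) + \mathbb{E}(Y) \\
\text{Var}(X) &= \mathbb{E}(X^2) - \left(\mathbb{E}(X)\right)^2 \\
\text{Var}(aX + b) &= a^2\,\text{Var}(X) \\
\text{Var}(X + Y) &= \text{Var}(X) + \text{Var}(Y) \quad \text{when } X \text{ and } Y \text{ are independent}
\end{aligned}
\]

Note that the additivity of expected values holds for any pair of random variables, whereas the additivity of variances relies on independence.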

Let \( X_i \overset{\mathrm{iid}}{\sim} N(\mu, \sigma^2) \). State the value of the expressions below.

1. \(\mathbb{E}(X_1)\)

2. \(\text{Var}(X_1)\)

3. \(\mathbb{E}(X_1^2)\)

4. \(\mathbb{E}(3X_1)\)

5. \(\text{Var}(3X_1)\)

6. \(\mathbb{E}(X_1+X_2+X_3)\)

7. \(\text{Var}(X_1+X_2+X_3)\)

Now, suppose \( X_i \overset{\mathrm{iid}}{\sim} \text{Bernoulli}(p) \).

8. What are the new values for expressions 1–7 above?

Problem-specific collaboration policy. For these probability review problems, we are allowing unlimited collaboration among students in the course (i.e., you can look directly at each other’s work). However, please help each other fully understand the probability concepts tested here, as they are critical to the material in Week 2 and beyond. If you didn’t get a chance to fill out the study group form before the April 8th deadline, but you’re interested in being added to a study group, please open a private Ed post.

Why complete this problem? Applied statistics sits on a foundation of probability theory. These problems ensure your understanding of core probability concepts so we can progress to more complex topics in applied statistics.

📈 Data manipulation and plotting in R (40%)

Complete the Week 1 lab notebook that we started in lecture.

⏳ This is the most time-consuming component of the first assignment, so get started ASAP (and, if needed, get help early!).

You will be using techniques from this notebook on future assignments and the final project.

  • The concepts in this notebook will also be tested in the quizzes.
  • The course will become increasingly challenging if you do not take the time to fully understand the concepts in this notebook.
  • You are free to use Google Colab’s built-in AI helper, or any other LLM (e.g., ChatGPT). FYI, the Colab AI feature is disabled by default if you use a Stanford Google Drive account to save your notebook.

Problem-specific collaboration policy. Like the probability review problems above, we are allowing unlimited collaboration on this lab notebook. All we ask is that you fully understand the techniques in the notebook, and help explain them to fellow students who are struggling. If you didn’t get a chance to fill out the study group form before the April 8th deadline, but you’re interested in being added to a study group, please open a private Ed post.

Why complete this problem? The techniques in this notebook are the bread and butter of data science. The bulk of exploratory data analyses conducted by data analysts and data scientists can be distilled down to the techniques taught in this notebook (seriously, we’re not exaggerating!).

🗣️ Plot presentation (40%)

📊 Find an informative plot on a topic you find interesting.

  • The plot must be 2D.
  • The plot must fit on a standard laptop screen at a reasonable resolution. You should not need to zoom in to see any part of the plot.
  • The plot must be non-interactive. In other words, you should be able to print the plot on a piece of paper without losing any information.

🎥 Next, record a 1-2 minute screencast describing your chosen plot to another student.

  • In your screencast, you must explain the necessary background required to understand the plot, along with the key takeaway(s) of the plot.
  • While the plot can have labels, its caption should be removed. Your voice should fully describe the takeaways of the plot.
  • Your name and face should not be in the screencast. The plot should take up the entire screen. Feel free to reach out to the teaching team if you cannot find a way to record a screencast anonymously.
  • You should use your mouse pointer in the screencast to indicate particular points of interest on the plot. Alternatively, you can verbally direct the viewer to the points of interest (e.g., “In the top right corner, you can see that…”).
  • You are welcome to read off a script, but try not to make it too obvious. You should use an engaging tone that sounds as though you are presenting to a live audience.
  • Remember the three key guiding questions you should address before digging into the details of your plot: (1) What’s on the X axis? (2) What’s on the Y axis? (3) What does a specific point/line/feature on your plot mean in context?

👩‍💻 Here are some resources you might find helpful as a starting point:

As part of next week’s assignment, you will provide anonymous feedback on another student’s submission for this problem.

  • Your screencast, along with the feedback you provide to another student, will be graded by the teaching staff. At no point will other students grade your work.

💾 Upload your screencast to your Stanford Google Drive account.

  • You will submit a public link to your video. See submission details at the top of this page.
  • Before submitting, make sure your video is publicly accessible with the link. One way to do this: Open your link in an Incognito window in Google Chrome.
  • If your video is private/unviewable when we try to access it, we will not be able to confirm that you submitted your video before the deadline.

Why complete this problem? Communication is a critical, but often under-appreciated, component of the data science life cycle. This exercise helps develop your data storytelling ability, which (for now) is something that LLMs struggle to do effectively. This exercise also gets you thinking about ideas for the course project.

🚴‍♀️🚴‍♂️🚴‍♀️ Collecting real data (10%)

In preparation for next week’s assignment, you will each devote 10 minutes to collecting real-world data from the Stanford community.

From a quick peek outside, it’s clear that few Stanford community members wear helmets while they bike on campus. The actual percentage of helmet-wearers at Stanford is unknown (at least publicly). As a class, we will estimate this percentage.

🪖 For any 10-minute span between 10am and 6pm on a weekday, stand next to Jane Stanford Way in front of Memorial Court and tally the bicycle riders who pass, counting riders with helmets and riders without helmets separately.

  • To collect the data, you will mark two columns on a sheet of paper. When a rider passes with a helmet, mark one column, and when a rider passes without a helmet, mark the other column.
  • You can find a blank data collection sheet here.
  • Don’t be too concerned if you miss a couple of cyclists here or there. Just do your best to record everyone who passes by. That being said, you may want to take a minute to practice recording data before starting the clock on your official 10 minutes.
  • Make sure to record the start time and stop time of your data-collection period, along with the date.
  • Keep in mind that we will likely have 100+ submissions for this assignment, so it will be pretty easy to spot faked data. High-quality, real data will also make your life easier on the next assignment. This exercise should take you no more than a half hour.
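To make the goal of the tallies concrete, here is a minimal sketch of how two-column counts might be pooled into a single estimate of the helmet-wearing share. It is written in Python purely for illustration (the course itself works in R), and it is not the method next week's assignment will necessarily use. The `TallySheet` structure, the `helmet_share` function, and the counts are all hypothetical and made up for this example.

```python
from dataclasses import dataclass


@dataclass
class TallySheet:
    """One 10-minute observation window (hypothetical structure, not the official form)."""
    helmet: int     # riders counted wearing a helmet
    no_helmet: int  # riders counted without a helmet


def helmet_share(sheets):
    """Return the pooled share of helmet wearers across all sheets, as a fraction in [0, 1]."""
    helmets = sum(s.helmet for s in sheets)
    riders = sum(s.helmet + s.no_helmet for s in sheets)
    if riders == 0:
        raise ValueError("no riders were recorded")
    return helmets / riders


# Made-up counts purely for illustration; these are not real observations.
example_sheets = [
    TallySheet(helmet=12, no_helmet=28),
    TallySheet(helmet=9, no_helmet=31),
]
print(f"Estimated helmet share: {helmet_share(example_sheets):.1%}")
```

Pooling the raw counts, rather than averaging each sheet's percentage, gives busier observation windows proportionally more weight. That is one reasonable design choice among several.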

🤳 Take a selfie with your sheet of paper as soon as you are finished collecting data.

  • Make sure the entire sheet of paper is visible, along with Jane Stanford Way and your face.
  • You will submit this selfie as part of this assignment, along with your data.
  • Selfies will not be shared outside of the course staff.

Additionally, please complete the data collection form.

Why complete this problem? This exercise, along with its follow-up exercise in the next assignment, will serve as a memorable and helpful analogy if you ever have to conduct or interpret a survey or poll. Collecting data is a critical, but often under-appreciated, component of the data science life cycle.