Assignment 5: Causal Discovery

Your task is to verify the findings in [Druzdzel & Glymour 1994] concerning student retention in US colleges. [Marek J. Druzdzel and Clark Glymour. Causal inferences from databases: Why universities lose students. In Clark Glymour and Gregory F. Cooper (eds), Computation, Causation, and Discovery, Chapter 19, pages 521-539, AAAI Press, Menlo Park, CA, 1999. An older version is available at: http://www.pitt.edu/~druzdzel/ftp/kdd94.pdf]

The original study was performed on 1992 US News and World Report data. The data that you will be studying are for the year 1993. There are some small differences in the variables under study between the 1992 and 1993 data sets. Because freshman retention and senior graduation rates differed so little (most students who drop out do so in their freshman year), freshman retention has been eliminated. The variable that used to be "apgra" is called "apret" in the 1993 data set (this is the percentage of student retention over the four years). (We reran the 1992 data set and verified that the conclusion regarding the impact of the quality of incoming students on the retention rate is the same.) Otherwise all variables are the same (but the measurements are for the year 1993).

Do the 1993 data support Druzdzel & Glymour's conclusions? If not, why not? Can you find anything else going on in the data (i.e., is there any structure in the data)? What are the causal graphs suggested by GeNIe? What causes student retention?

Please feel free to help GeNIe using your common sense knowledge of interactions between the variables included in the data set (not necessarily the knowledge entered by Druzdzel & Glymour). For example, you might know something about time precedence. Perhaps there are pairs of variables for which we are reasonably sure that there is or is not an arc between them? Check the sensitivity of your result to the significance level.
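To get a feel for why the significance level matters, here is a minimal Python sketch of one classical independence test that is commonly paired with the PC algorithm under the normality and linearity assumption: the Fisher z test of a (partial) correlation. Whether GeNIe uses exactly this test is an assumption, and the correlation and sample size below are hypothetical illustration values, not numbers taken from the 1993 data.

```python
import math


def fisher_z_pvalue(r, n, k=0):
    """Two-sided p-value for H0: the (partial) correlation is zero.

    r: sample (partial) correlation
    n: sample size
    k: number of variables conditioned on
    """
    z = math.atanh(r)                     # Fisher z-transform of r
    stat = math.sqrt(n - k - 3) * abs(z)  # approximately N(0, 1) under H0
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(stat / math.sqrt(2))


def keeps_edge(r, n, k, alpha):
    """Keep the edge only if independence is rejected at level alpha."""
    return fisher_z_pvalue(r, n, k) < alpha


if __name__ == "__main__":
    # Hypothetical values for illustration only (not from the assignment data)
    r, n, k = 0.14, 170, 1
    print("p-value:", round(fisher_z_pvalue(r, n, k), 4))
    for alpha in (0.01, 0.05, 0.10):
        verdict = "edge kept" if keeps_edge(r, n, k, alpha) else "edge removed"
        print("alpha =", alpha, "->", verdict)
```

A weak dependence like the one in this example is removed at a strict significance level and kept at a more lenient one, which is why the learned graph can change as you vary the setting.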
Druzdzel & Glymour made the assumption that the data are normally distributed and that the dependencies among the variables are linear (this is an implicit assumption used in the PC algorithm in GeNIe). Please feel free to check whether this assumption is indeed valid in the data using the graphing capabilities in GeNIe. You don't need to make this a lot of work, but please run the PC algorithm at least a few times and look critically at the results. Describe your observations and conclusions.

I have included the data file on CourseWeb. The software that you should use in this assignment is GeNIe, available for download at the following location: http://genie.sis.pitt.edu/

The online help for the learning/causal discovery functionality may not be comprehensive, but here is in brief how you can use it:

(1) Open the data file through File menu/Open Data File.
(2) Choose Learn New Network from the Data menu.
(3) Choose the PC algorithm and set the significance level (the PC algorithm uses classical statistical independence tests, and this is where you set the significance level for these tests).
(4) You can enter background knowledge by clicking the Background Knowledge button. Here you can force and forbid causal connections and also order variables in temporal tiers. A variable that is in a later tier cannot cause variables in earlier tiers.
(5) The result of the algorithm is a "pattern," i.e., an equivalence class of causal graphs that are compatible with the data. If the algorithm's assumptions hold, the true graph is one of the graphs represented in the pattern.
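The idea of a "pattern" in step (5) can be illustrated with a small Python sketch. It assumes three standardized, linearly related normal variables X, Y, and Z with hypothetical path coefficients a and b, and compares the population correlations implied by a chain, a fork, and a collider. None of these names or numbers come from the assignment data.

```python
import math


def partial_corr(r_xz, r_xy, r_yz):
    """Partial correlation of X and Z given Y from pairwise correlations."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


# Hypothetical standardized path coefficients (illustration only)
a, b = 0.6, 0.5

# Population correlations (r_xy, r_yz, r_xz) implied by each structure
structures = {
    "chain    X -> Y -> Z": (a, b, a * b),
    "fork     X <- Y -> Z": (a, b, a * b),
    "collider X -> Y <- Z": (a, b, 0.0),
}

if __name__ == "__main__":
    for name, (r_xy, r_yz, r_xz) in structures.items():
        pc = partial_corr(r_xz, r_xy, r_yz)
        print(name, "| corr(X,Z) =", round(r_xz, 3),
              "| corr(X,Z | Y) =", round(pc, 3))
```

The chain and the fork imply identical correlations, so no independence test can tell them apart and both belong to the same pattern. The collider differs: X and Z are uncorrelated marginally but become dependent once Y is conditioned on, which is what lets the algorithm orient those two arcs. Background knowledge such as temporal tiers is one way to resolve the orientations that the data alone cannot.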