Acknowledgments

The great majority of code in this notebook was written by, or with the help of, Dr. Lynn Cherny, Visiting Knight Chair for the Center for Communication, Culture and Change at the University of Miami. This project would not be possible without her help.

Dealing with large files

These should be in your .gitignore:

  • *.csv
  • *.zip

Otherwise GitHub will reject our pushes, since it refuses any file over its 100 MB size limit.
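
For reference, the relevant .gitignore lines are just the two glob patterns (a sketch; add further patterns as other large outputs appear):

# data extracts exceed GitHub's 100 MB file limit
*.csv
*.zip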

If a large file has already been committed, use BFG Repo Cleaner, a faster alternative to git-filter-branch, to purge it from history.

See this thread for examples
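
A typical cleanup looks like the following (a sketch, assuming the repo was cloned with --mirror and the BFG jar has been downloaded; the commands follow the BFG documentation):

java -jar bfg.jar --strip-blobs-bigger-than 100M our-repo.git
cd our-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive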

Note that Jupyter Notebook (.ipynb) files are not tracked by git here; you will have to manage versions of them manually.

Cleaning the data/removing nulls

To run this section you need the raw merged files, the CollegeScorecardDataDictionary csv, and a copy of the vars_in_data.csv file. First, clean the raw files with a sed bash script that blanks out the NULL and PrivacySuppressed placeholders:

#!/bin/bash
unset LANG
for filename in MERG*.csv; do
    echo "$filename"
    # blank out the NULL placeholders, then the PrivacySuppressed ones
    sed -e "s/NULL//g" "$filename" > nonulls.csv
    sed -e "s/PrivacySuppressed//g" nonulls.csv > "clean_$filename"
done
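
To see what the two substitutions do, here is one made-up row pushed through the same expressions:

echo "100654,NULL,PrivacySuppressed,0.53" | sed -e "s/NULL//g" -e "s/PrivacySuppressed//g"
# prints: 100654,,,0.53 -- the placeholders become empty fields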
In [87]:
!mkdir 2_cleaned-data
In [86]:
!bash 1_raw-data/sedscript.bash
merged_1996_PP.csv
merged_1997_PP.csv
merged_1998_PP.csv
merged_1999_PP.csv
merged_2000_PP.csv
merged_2001_PP.csv
merged_2002_PP.csv
merged_2003_PP.csv
merged_2004_PP.csv
merged_2005_PP.csv
merged_2006_PP.csv
merged_2007_PP.csv
merged_2008_PP.csv
merged_2009_PP.csv
merged_2010_PP.csv
merged_2011_PP.csv
merged_2012_PP.csv
merged_2013_PP.csv
In [11]:
import pandas as pd
import numpy as np
In [13]:
codes = pd.read_csv("1_raw-data/CollegeScorecardDataDictionary-09-08-2015.csv", header=0)
In [15]:
nonnull = codes[codes['VALUE'].notnull() & (codes['VALUE'] != ' ')]  # keep only rows that actually define a coded value
In [16]:
categoric = nonnull.fillna(method="pad")  # fills empty cells with cell above
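
The pad fill matters because, in the data dictionary, a variable's name appears only on the first row of its block of values. A toy illustration (the CONTROL codes shown are the real Scorecard ones):

demo = pd.DataFrame({"VARIABLE NAME": ["CONTROL", None, None],
                     "VALUE": [1, 2, 3],
                     "LABEL": ["Public", "Private nonprofit", "Private for-profit"]})
demo.fillna(method="pad")  # "VARIABLE NAME" now reads CONTROL on all three rows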
In [17]:
categoric_vals = categoric.groupby(["VARIABLE NAME"]).groups
In [18]:
categoric_vals.keys()
Out[18]:
['LOCALE',
 'CURROPER',
 'ANNHI',
 'CCSIZSET',
 'RELAFFIL',
 'locale2',
 'CONTROL',
 'CIP01CERT2',
 'CIP01CERT1',
 'PREDDEG',
 'NANTI',
 'HSI',
 'WOMENONLY',
 'HBCU',
 'main',
 'st_fips',
 'AANAPII',
 'HIGHDEG',
 'PBI',
 'CCBASIC',
 'region',
 'CCUGPROF',
 'CIP01ASSOC',
 'MENONLY',
 'DISTANCEONLY',
 'TRIBAL']
In [19]:
def replace(x, rep):
    try:
        return rep[x]  # fails on NaN or unknown codes, which is what we want
    except (KeyError, TypeError):
        return x

# swap each categoric column's numeric codes for their text labels
def replace_categoric_vals(df, categoric_vals):
    for key in categoric_vals:
        replacement = (categoric.groupby(["VARIABLE NAME"])
                                .get_group(key)[["VALUE", "LABEL"]]
                                .set_index("VALUE")
                                .to_dict()['LABEL'])
        rep = dict((int(k), v) for (k, v) in replacement.items())
        df[key] = df[key].map(lambda x: replace(x, rep))  # unknown codes pass through unchanged

    return df
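
A quick sanity check of the mapping, assuming categoric has been built from the dictionary as above (the one-column frame here is hypothetical and only exercises CONTROL):

check = pd.DataFrame({"CONTROL": [1, 2, 3]})
check = replace_categoric_vals(check, {"CONTROL": None})
# check["CONTROL"] should now read Public / Private nonprofit / Private for-profit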
In [7]:
vars = pd.read_csv("vars_in_data.csv")
In [21]:
# the first four columns identify the school; they get prepended to every section
schoolidvars = range(0, 4)

sections = [ ['school', 'root'], ['cost'], ['aid'], ['earnings'], ['repayment'], ['student'], ['completion'] ]

Making a folder for each year of data

In [22]:
years = range(1996, 2014)
path = "2_cleaned-data"
In [304]:
# If you didn't make the directories for the years yet - this will put them in the directory you are in now
!mkdir $path/3_years
for year in years:
    !mkdir $path/3_years/$year

Now our '2_cleaned-data' directory will contain:

  • each of the cleaned files,
  • a '3_years' directory with a folder for each year
In [308]:
for year in years:
    print "doing year: " + str(year)
    
    # path to your output from the sed script, the cleaned one
    big = pd.read_csv(path + "/clean_merged_" + str(year) + "_PP.csv")
    
    big = replace_categoric_vals(big, categoric_vals)    
    big.rename(columns={big.columns[0]: "UNITID"}, inplace=True) # the BOM in the raw files garbles the first column's name

    
    for varlist in sections:
        filenamebase = varlist[0]
        colIds = list(vars[vars['dev-category'].isin(varlist)].colnumber.values)
        section = big.iloc[:, schoolidvars + colIds].copy() # copy so the slice doesn't alias big
        section['Year'] = year
        
        section.to_csv(path + "/3_years/" + str(year) + "/" + filenamebase + "_" + str(year) + ".csv")
doing year: 1996
doing year: 1997
doing year: 1998
doing year: 1999
doing year: 2000
doing year: 2001
doing year: 2002
doing year: 2003
doing year: 2004
doing year: 2005
doing year: 2006
doing year: 2007
doing year: 2008
doing year: 2009
doing year: 2010
doing year: 2011
doing year: 2012
doing year: 2013

A note on warnings

Reading these wide, sparsely populated files will likely make pd.read_csv emit DtypeWarnings about mixed-type columns; they are harmless for this step (passing low_memory=False silences them, at the cost of extra memory).

In [309]:
 # Let's move our years data into our base data directory
!mv $path/3_years/ $path/../3_years

Combining years of data

The full dataset is too wide to fit in a single table (PostgreSQL, for example, caps a table at 1,600 columns), so we create a separate table for each category instead.

In [311]:
!mkdir 4_combined # This is where we store combined files
In [38]:
fileListPath = '3_years/*/school*'
files = !ls $fileListPath
In [40]:
files
Out[40]:
['3_years/1996/school_1996.csv',
 '3_years/1997/school_1997.csv',
 '3_years/1998/school_1998.csv',
 '3_years/1999/school_1999.csv',
 '3_years/2000/school_2000.csv',
 '3_years/2001/school_2001.csv',
 '3_years/2002/school_2002.csv',
 '3_years/2003/school_2003.csv',
 '3_years/2004/school_2004.csv',
 '3_years/2005/school_2005.csv',
 '3_years/2006/school_2006.csv',
 '3_years/2007/school_2007.csv',
 '3_years/2008/school_2008.csv',
 '3_years/2009/school_2009.csv',
 '3_years/2010/school_2010.csv',
 '3_years/2011/school_2011.csv',
 '3_years/2012/school_2012.csv',
 '3_years/2013/school_2013.csv']
In [41]:
for section in sections:

    # Get the string value from the list
    section = section[0]
    
    # Get a list of all files
    fileListPath = '3_years/*/' + section + '*'
    files = !ls $fileListPath
    
    # Build the path strings up front; the shell escapes below
    # interpolate plain Python variables more reliably than expressions
    headerPath = '3_years/1996/' + section + '_1996.csv'
    output = section + '.csv'
    
    print 'doing section ' + section
    
    !head -1 $headerPath > $output # get header row
    
    for file in files:
        !sed '1d' $file >> $output # get the rest of the data

    !mv $output 4_combined/
doing section school
doing section cost
doing section aid
doing section earnings
doing section repayment
doing section student
doing section completion
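
For any single section this concatenation can also be done in pandas, provided the section fits in memory; a minimal sketch for the cost files:

import glob
frames = [pd.read_csv(f) for f in sorted(glob.glob('3_years/*/cost_*.csv'))]
pd.concat(frames).to_csv('4_combined/cost.csv', index=False)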
In [2]:
!ls -al 4_combined/  # Note the file sizes
total 1019512
drwxr-xr-x  10 Nicolas  staff        340 Apr  7 01:12 .
drwxr-xr-x  23 Nicolas  staff        782 Apr  7 12:02 ..
-rw-r--r--@  1 Nicolas  staff       6148 Apr  7 00:46 .DS_Store
-rw-r--r--@  1 Nicolas  staff   21098597 Apr  7 01:27 aid.csv
-rw-r--r--   1 Nicolas  staff  317774510 Apr  7 01:17 completion.csv
-rw-r--r--   1 Nicolas  staff   19653098 Apr  7 01:15 cost.csv
-rw-r--r--   1 Nicolas  staff   28285827 Apr  7 01:15 earnings.csv
-rw-r--r--   1 Nicolas  staff   47876274 Apr  7 01:16 repayment.csv
-rw-r--r--   1 Nicolas  staff   43436859 Apr  7 01:15 school.csv
-rw-r--r--   1 Nicolas  staff   43839837 Apr  7 01:16 student.csv

Next Steps

  • Investigate the individual files in Tableau
  • Load them into a database and use SQL queries (with joins, etc.) to calculate and summarize

You can concatenate each section across all the years with unix at the command line (note that running sed '1d' on a glob only drops the first line of the combined stream, so loop over the files instead):

head -1 1996/school_1996.csv > school.csv  # gets the header row first
for f in */school*.csv; do
    sed '1d' "$f" >> school.csv  # skips this file's header row and appends the rest
done

Then load them into PostgreSQL (or another SQL database) using csvkit's csvsql command.
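
A minimal sketch of that load, assuming a local PostgreSQL database named scorecard already exists:

csvsql --db postgresql:///scorecard --tables school --insert 4_combined/school.csv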