The great majority of code in this notebook was written by, or with the help of, Dr. Lynn Cherny, Visiting Knight Chair for the Center for Communication, Culture and Change at the University of Miami. This project would not be possible without her help.
The large raw and generated data files should be in your .gitignore; otherwise, the push will fail against GitHub's 100 MB file size limit.
If a large file does get committed, use BFG Repo Cleaner, a faster alternative to git-filter-branch. See this thread for examples.
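A typical invocation (the repo name is a placeholder; see the BFG docs for details):

bfg --strip-blobs-bigger-than 100M my-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive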
Note that Jupyter Notebook (.ipynb) files are not tracked by git here; you will have to maintain source control for them manually.
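A minimal .gitignore along these lines, assuming the directory layout used later in this notebook:

*.ipynb
1_raw-data/
2_cleaned-data/
3_years/
4_combined/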
To run this, you need the cleaned files (produced by the sed bash script below), the CollegeScorecardDataDictionary csv, and a copy of the vars_in_data.csv file.
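A quick sanity check first (a sketch; the paths assume the layout used in the cells below):

import os, glob
assert glob.glob("1_raw-data/MERG*.csv"), "raw merged files are missing"
assert os.path.exists("1_raw-data/CollegeScorecardDataDictionary-09-08-2015.csv")
assert os.path.exists("vars_in_data.csv")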
First, clean the raw files with a sed bash script:
#!/bin/bash
unset LANG
for filename in MERG*.csv; do
    echo "$filename"
    # blank out NULL and PrivacySuppressed markers so pandas reads them as NaN
    sed -e "s/NULL//g" "$filename" > nonulls.csv
    sed -e "s/PrivacySuppressed//g" nonulls.csv > "clean_$filename"
done
rm nonulls.csv  # drop the intermediate file
!mkdir 2_cleaned-data
# The script globs MERG*.csv in the current directory; run it where the raw files
# live, then move the clean_* output into 2_cleaned-data/ (later cells read from there)
!bash 1_raw-data/sedscript.bash
import pandas as pd
import numpy as np
codes = pd.read_csv("1_raw-data/CollegeScorecardDataDictionary-09-08-2015.csv", header=0)
# keep only the data-dictionary rows that define a categoric VALUE
nonnull = codes[codes['VALUE'].notnull() & (codes['VALUE'] != ' ')]
# VARIABLE NAME appears only on the first row of each group; forward-fill it down
categoric = nonnull.fillna(method="pad")
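A tiny illustration of the pad fill, on made-up rows shaped like the dictionary:

pd.DataFrame({"VARIABLE NAME": ["CONTROL", np.nan, np.nan], "VALUE": [1, 2, 3]}).fillna(method="pad")
#   VARIABLE NAME  VALUE
# 0       CONTROL      1
# 1       CONTROL      2
# 2       CONTROL      3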
categoric_vals = categoric.groupby(["VARIABLE NAME"]).groups
categoric_vals.keys()
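To inspect one variable's code-to-label mapping (CONTROL is just an illustrative key; any name from categoric_vals.keys() works):

categoric.groupby(["VARIABLE NAME"]).get_group("CONTROL")[["VALUE", "LABEL"]]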
def replace(x, rep):
    # map x through rep, keeping the original value when there is no replacement
    try:
        return rep[x]  # fails on NaN or unmapped values, which is what we want
    except (KeyError, TypeError):
        return x
# get the replacement values for a categoric value
def replace_categoric_vals(df, categoric_vals):
    # swap each categoric column's numeric codes for the dictionary's labels
    for key in categoric_vals:
        replacement = categoric.groupby(["VARIABLE NAME"]).get_group(key)[["VALUE", "LABEL"]].set_index("VALUE").to_dict()['LABEL']
        rep = dict((int(k), v) for (k, v) in replacement.items())
        df[key] = df[key].map(lambda x: replace(x, rep))  # keep values that have no mapping
    return df
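A small sketch of the mapping behavior with toy values (not real Scorecard codes): anything with an entry in rep is replaced, and everything else, including NaN, passes through unchanged.

rep = {1: "Public", 2: "Private nonprofit"}
pd.Series([1, 2, 9, np.nan]).map(lambda x: replace(x, rep))
# 0               Public
# 1    Private nonprofit
# 2                  9.0
# 3                  NaN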
vars = pd.read_csv("vars_in_data.csv")
# the first four columns identify the school; include them in every subset
schoolidvars = list(range(0, 4))
sections = [ ['school', 'root'], ['cost'], ['aid'], ['earnings'], ['repayment'], ['student'], ['completion'] ]
years = range(1996, 2014)  # 1996 through 2013
path = "2_cleaned-data"
# If you didn't make the year directories yet, this creates them relative to the current directory
!mkdir $path/3_years
for year in years:
    !mkdir $path/3_years/$year
for year in years:
    print("doing year: " + str(year))
    # read the cleaned output from the sed script
    big = pd.read_csv(path + "/clean_merged_" + str(year) + "_PP.csv")
    big = replace_categoric_vals(big, categoric_vals)
    big.rename(columns={big.columns[0]: "UNITID"}, inplace=True)  # the first header cell is garbled by a unicode BOM; rename it back to UNITID
    for varlist in sections:
        filenamebase = varlist[0]
        colIds = list(vars[vars['dev-category'].isin(varlist)].colnumber.values)
        section = big.iloc[:, schoolidvars + colIds].copy()  # school id columns plus this section's columns
        section['Year'] = year
        section.to_csv(path + "/3_years/" + str(year) + "/" + filenamebase + "_" + str(year) + ".csv")
# Let's move our years data into our base data directory
!mv $path/3_years/ $path/../3_years
Due to database column limits, we need to create a table for each category; a single table cannot hold the entire dataset (PostgreSQL, for example, caps a table at 1,600 columns, which the merged files exceed).
!mkdir 4_combined # This is where we store combined files
fileListPath = '3_years/*/school*'
files = !ls $fileListPath
files
for section in sections:
    # take the first name in the section's list (e.g. 'school' for ['school', 'root'])
    section = section[0]
    # list every year's file for this section
    fileListPath = '3_years/*/' + section + '*'
    files = !ls $fileListPath
    # Build the path strings first;
    # the 'sed' subprocess call seems to prefer it this way
    headerPath = '3_years/1996/' + section + '_1996.csv'
    output = section + '.csv'
    print('doing section ' + section)
    !head -1 $headerPath > $output  # get the header row
    for f in files:
        !sed '1d' $f >> $output  # append the data rows, skipping each file's header
    !mv $output 4_combined/
!ls -al 4_combined/ # Note the file sizes
Alternatively, you can concatenate all the years' files for each section with Unix tools at the command line:
head -1 1996/school_1996.csv > school.csv # gets the header row first
for f in */school*; do sed '1d' "$f" >> school.csv; done # drops each file's header row and appends the rest (a single sed over the whole glob would drop only the first file's header)
Then load them into PostgreSQL (or another SQL database) using csvkit's csvsql command.
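For example (a sketch; the database URL is a placeholder for your own connection):

csvsql --db postgresql:///scorecard --insert 4_combined/school.csv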