October 7, 2018

Organizing files in train/test directories

For a prototype model I'm writing that involves a multi-class problem with facial images, I've found the need to use some basic Python code to organize my image files and partition them into train and test directories. I thought it might be helpful to post this in case others need this to shuffle around their files.

Here we have three classes - NONE, MODERATE, SEVERE.

Write all the filenames to a text file

import os

none_path = ".../images/dataset/NO_PAIN"
none_txt_path = ".../images/foo.txt"

def write_filenames(img_path, text_path):
    # Write each filename in img_path, minus its extension, to text_path.
    with open(text_path, "w") as f:
        for filename in os.listdir(img_path):
            stem = filename.rsplit(".", 1)[0]  # drop the extension
            print(stem)
            f.write("%s\n" % stem)
    return text_path

write_filenames(none_path, none_txt_path)

You can rerun write_filenames to produce a .txt file for each of the other two class directories, like so:

write_filenames(moderate_path, moderate_txt_path)
write_filenames(severe_path, severe_txt_path)

Notice that in the example above, rsplit removes the file extension, which in this case is .jpg.

The function below preserves the file extension, which will be important when we use shutil later on.

With the file extension

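import os

def write_filenames_wext(img_path, text_path):
    # Write each filename in img_path, extension included, to text_path.
    with open(text_path, "w") as f:
        for filename in os.listdir(img_path):
            f.write("%s\n" % filename)
    return text_path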

write_filenames_wext(none_path, none_txt_path)

As before, rerun the function for the moderate and severe classes.
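
Rather than calling the function once per class by hand, the calls can be driven from a loop. The following is a minimal sketch, not the code used in this post: the helper name `write_class_lists`, the label-to-directory mapping, and the `<label>_fn.txt` naming are assumptions for illustration.

```python
import os


def write_class_lists(class_dirs, out_dir):
    """Write one text file of filenames (with extensions) per class.

    class_dirs: dict mapping a class label to its image directory
                (hypothetical layout, adjust to your own folders).
    out_dir:    existing directory that receives one <label>_fn.txt per class.
    Returns a dict mapping each label to the path of the file written.
    """
    written = {}
    for label, img_dir in class_dirs.items():
        text_path = os.path.join(out_dir, label + "_fn.txt")
        # Keep regular files only, sorted so the output is reproducible.
        filenames = sorted(
            name for name in os.listdir(img_dir)
            if os.path.isfile(os.path.join(img_dir, name))
        )
        with open(text_path, "w") as f:
            for name in filenames:
                f.write(name + "\n")
        written[label] = text_path
    return written
```

Calling it with a dict such as `{"NONE": none_path, ...}` writes one list per class in a single pass, and the returned paths can be handed straight to the next step.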

Put all file names into a dataframe

This part is super easy. Using `pandas`, we use `read_table` to read in all the file names we wrote to each of our three text files. We name the single column "file", then add a new column called "class". Finally, we combine the three dataframes into our final dataframe, `data_wext` (I usually use `len` to count the records and make sure the total is correct).

import pandas as pd

none_wext = pd.read_table(".../face/NON_fn.txt", names=['file'])
mod_wext = pd.read_table(".../face/MOD_fn.txt", names=['file'])
sev_wext = pd.read_table(".../face/SEV_fn.txt", names=['file'])

none_wext['class'] = 'NONE'
mod_wext['class'] = 'MOD'
sev_wext['class'] = 'SEV'

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead.
data_wext = pd.concat([none_wext, mod_wext, sev_wext], ignore_index=True)

data_wext.to_csv(".../face/allfaces_w_ext.csv", index=False)
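
The record-count check mentioned above can be written down explicitly. This is a small sketch on made-up filenames, not the real dataset; it combines the frames with `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-ins for the three per-class dataframes (filenames are made up).
none_df = pd.DataFrame({"file": ["a1.jpg", "a2.jpg"]})
mod_df = pd.DataFrame({"file": ["b1.jpg"]})
sev_df = pd.DataFrame({"file": ["c1.jpg", "c2.jpg", "c3.jpg"]})

none_df["class"] = "NONE"
mod_df["class"] = "MOD"
sev_df["class"] = "SEV"

# Stack the three frames; ignore_index=True gives a fresh 0..n-1 index.
combined = pd.concat([none_df, mod_df, sev_df], ignore_index=True)

# Sanity check: no rows were lost or duplicated while combining.
expected = len(none_df) + len(mod_df) + len(sev_df)
assert len(combined) == expected, "row count mismatch after concat"

# Per-class counts are a quick second check on the labels.
print(combined["class"].value_counts())
```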

Train/Test Split

Because the final dataframe is still ordered by class, we need to shuffle it and split it into a train set and a test set. We'll use `scikit-learn`'s `train_test_split` function for this; it shuffles by default (`shuffle=True`).

from sklearn.model_selection import train_test_split

X = data_wext['file']
y = data_wext['class']

Here's a sample of our X of only the filenames with their extensions:

Here's a sample of our y of labels that are not shuffled, still in the original order:

Now let's split it, holding out 33% of the data as the test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
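
To see what this call does to the sizes of the two sets and to the pairing of filenames and labels, here is a small sketch on made-up data. The filenames, the 100-row size, and the class counts are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 filenames, still sorted by class like the real dataframe.
toy = pd.DataFrame({
    "file": ["img_%03d.jpg" % i for i in range(100)],
    "class": ["NONE"] * 40 + ["MOD"] * 30 + ["SEV"] * 30,
})

X_toy, y_toy = toy["file"], toy["class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# Every row lands in exactly one of the two sets.
assert len(X_tr) + len(X_te) == len(toy)
assert set(X_tr).isdisjoint(set(X_te))

# Filenames and labels are shuffled together, so their indexes still line up.
assert (X_tr.index == y_tr.index).all()

print(len(X_tr), len(X_te))
```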

Here's a sample of X_train:

And, here's a sample of y_train which we can see is now shuffled:

At this point, we need to convert our pandas Series into lists so that we can iterate through them when copying files with shutil.

X_train_l = X_train.tolist()
X_test_l = X_test.tolist()

Copy the training and test set to their own directories

Here, what we need to do is use our X_train and X_test lists to copy over only the files in each of these lists into new directories called `train` and `test`.

So here's our original directory called data_faces:

We want to copy only the files listed in X_train and X_test from data_faces (our source) into two empty directories, train and test.

import os
import shutil

src = '.../face/images/data_faces' #source of all images
dest_test = '.../face/images/test' #test folder
dest_train = '.../face/images/train' #train folder

#copy to train
for file in os.listdir(src):
    if file in X_train_l:
        name = os.path.join(src, file)
        if os.path.isfile(name):
            shutil.copy(name, dest_train)
        else:
            # listed in src but not a regular file (e.g. a subdirectory)
            print('not a regular file, skipped:', name)

#copy to test
for file in os.listdir(src):
    if file in X_test_l:
        name = os.path.join(src, file)
        if os.path.isfile(name):
            shutil.copy(name, dest_test)
        else:
            # listed in src but not a regular file (e.g. a subdirectory)
            print('not a regular file, skipped:', name)
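
The two loops above differ only in the list and the destination, so they can be folded into one helper. This is a sketch rather than the code used for this project: `copy_listed_files` is a hypothetical name, it creates the destination directory if it is absent, and it reports names that are in the list but missing from the source, which the loops above do not check.

```python
import os
import shutil


def copy_listed_files(filenames, src, dest):
    """Copy each name in `filenames` from `src` into `dest`.

    Returns (copied, missing): the names that were copied, and the names
    that were not found as regular files in `src`.
    """
    os.makedirs(dest, exist_ok=True)  # create the target folder if needed
    copied, missing = [], []
    for name in filenames:
        path = os.path.join(src, name)
        if os.path.isfile(path):
            shutil.copy(path, dest)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

With the lists from the split, `copy_listed_files(X_train_l, src, dest_train)` and `copy_listed_files(X_test_l, src, dest_test)` would fill the two folders, and a non-empty `missing` list points at filenames in the dataframe that have no image behind them.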

Now you have two directories - one for train and one for test, which you can use for modeling.