For a prototype model I'm writing that involves a multi-class problem with facial images, I've found the need to use some basic Python code to organize my image files and partition them into train and test directories. I thought it might be helpful to post this in case others need this to shuffle around their files.
Here we have three classes: none, moderate, and severe (labeled `NONE`, `MOD`, and `SEV` later on).
Write all the filenames to a text file
    import os

    none_path = ".../images/dataset/NO_PAIN"
    none_txt_path = ".../images/foo.txt"

    def write_filenames(img_path, text_path):
        # List every file in the image directory
        dir_list = os.listdir(img_path)
        with open(text_path, "w") as f:
            for filename in dir_list:
                # rsplit returns [name, extension]; keep only the name
                stem = filename.rsplit(".", 1)[0]
                print(stem)
                f.write("%s\n" % stem)

    write_filenames(none_path, none_txt_path)
You can rerun `write_filenames` to produce a .txt file for each of the three class image directories, like so:
    # moderate_* and severe_* are set up like none_path / none_txt_path above
    write_filenames(moderate_path, moderate_txt_path)
    write_filenames(severe_path, severe_txt_path)
You'll notice that in the `write_filenames` example above, `rsplit` removes the file extension. The function in the next section preserves the file extension, which will be important for using `shutil` later on.
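As a quick illustration of what `rsplit` does to a filename, here's a minimal sketch. The filename is made up for the example, and `os.path.splitext` is shown only as a standard-library alternative, not something the code in this post relies on.

```python
import os

# Hypothetical filename, used only for illustration
filename = "face.001.jpg"

# rsplit(".", 1) splits once, starting from the right, and returns a list
parts = filename.rsplit(".", 1)
stem = parts[0]

# os.path.splitext returns a (root, extension) tuple instead
root, ext = os.path.splitext(filename)

print(parts)
print(stem)
print(root, ext)
```

Note that `rsplit` hands back a list, so you need to index into it to get just the name without the extension.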
With the file extension
    def write_filenames_wext(img_path, text_path):
        # Same as write_filenames, but keeps the file extension
        dir_list = os.listdir(img_path)
        with open(text_path, "w") as f:
            for filename in dir_list:
                print(filename)
                f.write("%s\n" % filename)

    write_filenames_wext(none_path, none_txt_path)
Again, you can rerun this function for the other two classes, moderate and severe.
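If you'd rather not call the function three times by hand, one option is to loop over the classes. This is only a sketch under some assumptions: the temporary folders and placeholder files stand in for the real image directories, the `class_dirs` and `txt_paths` names are mine, and I sort the listing so the output order is predictable.

```python
import os
import tempfile

def write_filenames_wext(img_path, text_path):
    """Write one filename per line, keeping the extension."""
    with open(text_path, "w") as f:
        for filename in sorted(os.listdir(img_path)):
            f.write("%s\n" % filename)

# Throwaway directories stand in for the real image folders
base = tempfile.mkdtemp()
class_dirs = {}
for label in ("NONE", "MOD", "SEV"):
    class_dir = os.path.join(base, label)
    os.makedirs(class_dir)
    # Create two empty placeholder "images" per class
    for i in range(2):
        open(os.path.join(class_dir, "%s_%d.jpg" % (label, i)), "w").close()
    class_dirs[label] = class_dir

# Write one text file per class, e.g. <base>/NONE_fn.txt
txt_paths = {}
for label, class_dir in class_dirs.items():
    txt_paths[label] = os.path.join(base, label + "_fn.txt")
    write_filenames_wext(class_dir, txt_paths[label])
```

Keeping the paths in a dictionary keyed by class label also makes it easy to attach the label later on.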
Put all file names into a dataframe

This part is super easy. Using `pandas`, we use `read_table` to read in all the filenames we wrote to each of our three text files. We assign the column header "file" to the filenames, then add a new column called "class". Then we combine the three dataframes into our final dataframe, `data_wext` (I usually use `len` to count up the records to make sure the result is correct).
    import pandas as pd

    # Read each class's filenames into a one-column dataframe
    none_wext = pd.read_table(".../face/NON_fn.txt", names=['file'])
    mod_wext = pd.read_table(".../face/MOD_fn.txt", names=['file'])
    sev_wext = pd.read_table(".../face/SEV_fn.txt", names=['file'])

    # Label every row with its class
    none_wext['class'] = 'NONE'
    mod_wext['class'] = 'MOD'
    sev_wext['class'] = 'SEV'

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data_wext = pd.concat([none_wext, mod_wext, sev_wext], ignore_index=True)
    data_wext.to_csv(".../face/allfaces_w_ext.csv", index=False)
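Here's roughly what the `len` sanity check could look like. It's a sketch with tiny made-up dataframes standing in for the real ones (the filenames are invented), and the per-class count via `value_counts` is an extra check I'm adding for illustration.

```python
import pandas as pd

# Tiny made-up frames standing in for none_wext, mod_wext and sev_wext
none_wext = pd.DataFrame({"file": ["n1.jpg", "n2.jpg", "n3.jpg"]})
mod_wext = pd.DataFrame({"file": ["m1.jpg", "m2.jpg"]})
sev_wext = pd.DataFrame({"file": ["s1.jpg"]})
none_wext["class"] = "NONE"
mod_wext["class"] = "MOD"
sev_wext["class"] = "SEV"

data_wext = pd.concat([none_wext, mod_wext, sev_wext], ignore_index=True)

# The combined frame should have exactly as many rows as its parts
expected_rows = len(none_wext) + len(mod_wext) + len(sev_wext)
assert len(data_wext) == expected_rows

# Per-class counts give a second, finer-grained check
class_counts = data_wext["class"].value_counts().to_dict()
print(class_counts)
```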
Train/Test Split

Since the final dataframe is still ordered by class, we need to shuffle it and split it into a train set and a test set. We'll use the `scikit-learn` function `train_test_split` for this; it shuffles by default (`shuffle=True`).
    from sklearn.model_selection import train_test_split

    X = data_wext['file']
    y = data_wext['class']
Here's a sample of our `X`, which holds only the filenames with their extensions:
Here's a sample of our `y`, the labels, which are not yet shuffled and are still in their original order:
Now let's split it, holding out 33% of the data as the test set:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
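To make the idea behind the split concrete, here's a small standard-library sketch of shuffling filenames and labels together and then cutting off a test share. This is not how scikit-learn implements `train_test_split`; the `shuffle_split` helper, the rounding rule, and the filenames are all my own assumptions for illustration.

```python
import random

def shuffle_split(files, labels, test_size=0.33, seed=42):
    """Shuffle (file, label) pairs together, then cut off a test share."""
    pairs = list(zip(files, labels))
    # Shuffling the pairs keeps each filename attached to its label
    random.Random(seed).shuffle(pairs)
    n_test = int(round(len(pairs) * test_size))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test

# Made-up filenames, ordered by class like the unshuffled dataframe
files = (["n%d.jpg" % i for i in range(4)]
         + ["m%d.jpg" % i for i in range(4)]
         + ["s%d.jpg" % i for i in range(4)])
labels = ["NONE"] * 4 + ["MOD"] * 4 + ["SEV"] * 4

train, test = shuffle_split(files, labels)
print(len(train), len(test))
```

The important part is that files and labels are shuffled as pairs, so every filename still lines up with its class after the split.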
Here's a sample of X_train:
And, here's a sample of y_train which we can see is now shuffled:
At this point, we need to convert our pandas Series into lists so that we can iterate through them when copying the files:
    X_train_l = X_train.tolist()
    X_test_l = X_test.tolist()
Copy the training and test set to their own directories

Here, we use our X_train and X_test lists to copy only the files named in each list into new directories called `train` and `test`.
So here's our original directory, called `data_faces`:
We want to copy only the files listed in X_train and X_test from `data_faces` (our source) into two empty directories, `train` and `test`:
    import os
    import shutil

    src = '.../face/images/data_faces'    # source of all images
    dest_test = '.../face/images/test'    # test folder
    dest_train = '.../face/images/train'  # train folder

    # copy to train
    for file in os.listdir(src):
        if file in X_train_l:
            name = os.path.join(src, file)
            if os.path.isfile(name):
                shutil.copy(name, dest_train)
            else:
                print('file does not exist', name)

    # copy to test
    for file in os.listdir(src):
        if file in X_test_l:
            name = os.path.join(src, file)
            if os.path.isfile(name):
                shutil.copy(name, dest_test)
            else:
                print('file does not exist', name)
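The two loops could also be folded into one small helper. Here's a sketch of that idea; the `copy_listed_files` name is hypothetical, and the temporary folders and filenames are placeholders. It iterates over the list of filenames rather than the directory listing, so a file that's in the list but missing from the source folder actually gets reported.

```python
import os
import shutil
import tempfile

def copy_listed_files(src, dest, filenames):
    """Copy each named file from src to dest; return the names not found."""
    os.makedirs(dest, exist_ok=True)
    missing = []
    for filename in filenames:
        path = os.path.join(src, filename)
        if os.path.isfile(path):
            shutil.copy(path, dest)
        else:
            missing.append(filename)
    return missing

# Throwaway source folder with three placeholder "images"
base = tempfile.mkdtemp()
src = os.path.join(base, "data_faces")
os.makedirs(src)
for filename in ("a.jpg", "b.jpg", "c.jpg"):
    open(os.path.join(src, filename), "w").close()

# Made-up split; "d.jpg" is deliberately absent from src
missing_train = copy_listed_files(src, os.path.join(base, "train"), ["a.jpg", "c.jpg"])
missing_test = copy_listed_files(src, os.path.join(base, "test"), ["b.jpg", "d.jpg"])
print(missing_train, missing_test)
```

With the real data, you'd pass in `X_train_l` and `X_test_l` and check that both returned lists are empty.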
Now you have two directories - one for train and one for test, which you can use for modeling.