
Image Recognition of American Sign Language

American Sign Language (ASL) is the language used by the deaf community in the United States and Canada. It uses hand gestures, or series of hand gestures, to communicate information. An estimated 500,000 people use ASL as their primary language [1]. To expand access for the deaf community, we need tools that facilitate communication between deaf and hearing people. For this project, I explored the possibility of using a neural network to recognize hand signs in images.

Findings

This project was conducted using a Kaggle dataset containing approximately 35,000 images of 24 hand signs. Each hand sign represented a different letter of the English alphabet. The letters "J" and "Z" were excluded because their signs require motion and cannot be captured in a single image. The image below shows the hand sign for the letter "C".

 

A Convolutional Neural Network (CNN) was created to extract features from these images and predict the letter being signed. The first step in creating this network was to identify a set of input values. All images in this dataset are 28x28 grayscale images, for a total of 784 pixels each. The shade of each pixel is described by a grayscale value that, after normalization, ranges from 0 (black) to 1 (white). These 784 values were used as the input to the first layer of the CNN.

After defining the input layer, the next step was to design the hidden layers of the network. Below you can see the network's final architecture.

The first hidden layer is depicted as the "Convolution + Relu" cube on the far left of the above image. This layer takes a matrix input of 28x28x1 values (aka the grayscale values of our image). The convolution layer is designed to have 32 5x5 filters, with a stride of 1 and 2 layers of zero-padding. That's a lot of numbers, so we can use the visual below to keep track of everything.

To simplify the visual, we will consider a 5x5 image. As described earlier, each pixel in the image has been assigned a grayscale value ranging from 0 to 1. In preparation for the first convolution layer of our neural network, we zero-pad this image by adding two layers of cells with value 0 around the perimeter of the image. This ensures that the output of our first layer will be the same size as our input.
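To make this concrete, here is a small NumPy sketch (not part of the original walkthrough, using made-up pixel values) of two layers of zero-padding together with the standard output-size check:

import numpy as np

# A toy 5x5 "image" of grayscale values (made up for illustration)
image = np.random.rand(5, 5)

# Add two layers of zeros around the perimeter -> a 9x9 array
padded = np.pad(image, pad_width=2, mode='constant', constant_values=0)
print(padded.shape)  # (9, 9)

# Output size of a convolution: (W - F + 2P) / S + 1
W, F, P, S = 5, 5, 2, 1  # image width, filter size, zero-padding, stride
print((W - F + 2 * P) // S + 1)  # 5 -> the convolved output is again 5x5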

Next, we apply a 5x5 filter to our image. A 5x5 filter with only 1 channel will have 25 weights, as shown below.


When this filter is applied to our zero-padded image, it looks at only a 5x5 block of cells at a time. Each weight of the filter is multiplied by its corresponding grayscale value, and the products are then summed to output a single value. Once this value has been calculated, we move the filter by a stride of 1 (aka we move it over 1 column). This movement is shown below.


For the filter position on the left, we compute our output value as...

z = 0.8*w33 + 0.6*w34 + 0.6*w35 + 0.8*w43 + 0.7*w44 + 0.5*w45 + 0.7*w53 + 0.6*w54 + 0.4*w55

We repeat this computation every time we move the filter until we have a 5x5 matrix of z values. For the last step in the convolution layer, we send these z values through a non-linear activation function. There are many possible choices for an activation function, but we will use a Rectified Linear Unit (ReLu). A ReLu function is shown below.

The plot demonstrates that any negative input to the activation function will result in a 0 output. For positive values, the ReLu will simply output the input value. For example, the ReLu output of the vector [-2, 0, 3, -1, 4] is shown below.

ReLu([-2, 0, 3, -1, 4]) = [0, 0, 3, 0, 4]
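Putting these pieces together, the following is a minimal NumPy sketch (an illustration, not the project's actual implementation) of a single filter pass: zero-pad the image, slide a 5x5 filter with a stride of 1, compute each weighted sum, and apply ReLu.

import numpy as np

def conv2d_single_filter(image, kernel, padding=2, stride=1):
    """Slide one filter over a zero-padded image and apply ReLu to each weighted sum."""
    padded = np.pad(image, pad_width=padding, mode='constant', constant_values=0)
    f = kernel.shape[0]
    out_size = (image.shape[0] - f + 2 * padding) // stride + 1
    out = np.zeros((out_size, out_size))
    for row in range(out_size):
        for col in range(out_size):
            window = padded[row*stride:row*stride + f, col*stride:col*stride + f]
            z = np.sum(window * kernel)  # weighted sum of filter weights and pixel values
            out[row, col] = max(0.0, z)  # ReLu: negative sums become 0
    return out

# Toy 5x5 image and 5x5 filter (values made up for illustration)
image = np.random.rand(5, 5)
kernel = np.random.randn(5, 5)
feature_map = conv2d_single_filter(image, kernel)
print(feature_map.shape)  # (5, 5) -- same size as the input, thanks to the zero-padding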

Earlier, we said that our convolution layer has 32 filters. This means that the above process of placing a filter on our image, moving it around, computing a weighted sum of the inputs, and passing it through the ReLu function happens 32 times, once per filter. Each filter learns its own set of weights, allowing the CNN to extract different features from the same set of input values. For example, I passed the image of the letter "C" through the CNN and visualized the output of the first hidden layer.


The 32 images above show how each filter highlighted different features from the same image. It looks like this first hidden layer was generally trying to understand the shape of the hand and its edges. As we move deeper into the neural network, the outputs of the hidden layers become more meaningful to the computer, but less interpretable to the human eye. Below is the output of the fourth hidden layer. Can you discern the "C" in any of these images?
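These feature-map images can be produced by truncating the trained network at a hidden layer and predicting with it. Below is a hedged Keras sketch of one way to do this, assuming the trained model built in the Analysis section and a single pre-processed image; the exact plotting code behind the figures shown here may differ.

from keras.models import Model
import matplotlib.pyplot as plt

def show_feature_maps(model, image, layer_index=0):
    """Plot each filter's activation at one hidden layer for a single 28x28x1 image."""
    # Build a model that stops at the chosen hidden layer of the trained network
    activation_model = Model(inputs=model.input,
                             outputs=model.layers[layer_index].output)
    maps = activation_model.predict(image.reshape(1, 28, 28, 1))
    n_filters = maps.shape[-1]
    cols = (n_filters + 3) // 4  # arrange the feature maps in 4 rows
    for i in range(n_filters):
        plt.subplot(4, cols, i + 1)
        plt.imshow(maps[0, :, :, i], cmap='gray')
        plt.axis('off')
    plt.show()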


After the features have been extracted from the images, the data is sent to the Flatten and Dense layers. These layers reduce the data down to one dimension and classify the image as one of the 24 possible letters.
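To make this concrete, here is a rough shape trace through the network (a sketch based on the architecture detailed in the Analysis section, where each convolution uses 'same' padding and is followed by 2x2 max pooling):

# Shape trace (a sketch; see the Modeling section for the actual layers)
# 28x28x1  input image
# 28x28x32 after Conv2D(32, 5x5, padding='same')
# 14x14x32 after 2x2 max pooling
# 14x14x64 after Conv2D(64, 3x3, padding='same')
#  7x7x64  after 2x2 max pooling
print(7 * 7 * 64)  # Flatten -> 3136 values feed the Dense layers
# Dense(256, relu) -> Dense(24, softmax): one probability per letter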

An important design parameter for CNNs is the number of epochs used to train the network. An epoch is a single pass through all of the training data. Earlier I showed a sample 5x5 filter with 25 weights. How does the network decide which weights to use? On the first epoch, the network starts from an initial guess at the weights. With each subsequent epoch, the CNN updates the weights to reduce the overall loss. More epochs therefore generally reduce the loss on the training data, but they also let the model start fitting noise in that data, so we face the classic bias-variance trade-off. Increasing the number of epochs also has diminishing returns: the first few epochs produce dramatic reductions in loss, but the improvements taper off over time. We therefore use a validation dataset to identify the number of epochs that balances accuracy against overfitting.
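One common way to automate this search, shown here only as a sketch of a standard Keras pattern (not necessarily what was used in this project), is to monitor the validation loss and stop training once it stops improving:

from keras.callbacks import EarlyStopping

# Stop training once validation loss has not improved for 3 consecutive epochs,
# then roll back to the best weights seen so far
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Example usage with the training/validation split defined in the Analysis section:
# model.fit(X_tr, Y_tr, epochs=50, batch_size=64,
#           validation_data=(X_val, Y_val), callbacks=[early_stopping])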

Using the test dataset, the model successfully classified the hand sign 94.28% of the time! We can now use a confusion matrix to better understand where the model works well and where it can be improved.


From the confusion matrix, we can see that the letters most commonly confused were "E", "M", and "S". Let's see what these signs look like. 

    

These three letters appear to have very similar signs. Additionally, the low-resolution images make it difficult to detect the features that distinguish each sign from the others. In the future, it would be beneficial to use higher-resolution images to improve the classification accuracy of the CNN. Unfortunately, this model is also limited by its inability to recognize hand signs that involve movement. If you are interested in learning more about how that problem can be tackled, check out this blog post, where I worked with other students to use motion data to classify 60 dynamic hand gestures with 85% accuracy!

Analysis

The following sections contain the Python code used to build and evaluate the CNN.

Visualizing the Data
Before I could do anything, I needed to load the data into Python. The data was obtained from Kaggle, and the train/test split had already been performed for me.


import pandas as pd

# Load files and assign variables to X and Y
train = pd.read_csv("sign-language-mnist/sign_mnist_train.csv")
test = pd.read_csv('sign-language-mnist/sign_mnist_test.csv')

# Define train and test set
X_train = train.drop(labels=['label'], axis=1)
Y_train = train['label']
X_test = test.drop(labels=['label'], axis=1)
Y_test = test['label']

To help me understand the data I was working with, I used the following function to visualize each grayscale image. This function was used to produce the image of the letter "C" shown earlier.

import matplotlib.pyplot as plt

# Show image from dataset
def gen_image(image):
    """Display a 28x28 image given its 784 grayscale pixel values."""
    pixels = image.reshape((28,28))
    plt.imshow(pixels, cmap='gray')
    plt.show()
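For example, the following call displays the first image in the training set (any other row index works the same way):

# Visualize the first training example
gen_image(X_train.iloc[0].values)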

Data Pre-Processing
This classification problem has 24 possible classes (26 letters minus "J" and "Z"). The original dataset used the numbers 0 through 24 to label the classes; the number 9 never appears because it represents the letter "J". To avoid problems processing the data, I re-labeled the classes using the numbers 0 through 23. These labels were then one-hot encoded using the to_categorical function.


from keras.utils import to_categorical

# Define new labels
label_dict = {0:'A',1:'B',2:'C',3:'D',4:'E',5:'F',6:'G',7:'H',8:'I',10:'K',11:'L',12:'M',13:'N',14:'O',15:'P',16:'Q',17:'R',\
             18:'S',19:'T',20:'U',21:'V',22:'W',23:'X',24:'Y'}
label_dict_rev = {'A':0,'B':1,'C':2,'D':3,'E':4,'F':5,'G':6,'H':7,'I':8,'K':9,'L':10,'M':11,'N':12,'O':13,'P':14,'Q':15,'R':16,\
                 'S':17,'T':18,'U':19,'V':20,'W':21,'X':22,'Y':23}

# Assign numbers to corresponding label in test set
Y_test1 = []
for i in Y_test:
    Y_test1.append(label_dict_rev.get(label_dict.get(i)))

# Convert numerical classes to categorical variable
Y_test2 = to_categorical(Y_test1, num_classes = 24)

# Assign numbers to corresponding label in training set
Y_train = train['label']
Y_train1 = []
for i in Y_train:
    Y_train1.append(label_dict_rev.get(label_dict.get(i)))
    
# Convert remapped classes (0-23) to categorical variable
Y_train2 = to_categorical(Y_train1, num_classes = 24)

# Define independent variables for training set
X_train = train.drop(labels=['label'], axis=1)

The pixel values provided in the data set range from 0 to 255. These values were normalized to obtain values ranging from 0 to 1.

# Normalize pixels
X_train1 = X_train/255
X_test1 = X_test/255

Finally, each flattened row of 784 values was reshaped into a 28x28x1 array representing the height, width, and single channel of the original image. Then, the training data was divided into training and validation sets using a 70/30 split. I used the validation set to determine the optimal number of epochs.

from sklearn.model_selection import train_test_split

# Reshape flat pixel vectors to 28x28x1 (height x width x channel)
X_train2 = X_train1.values.reshape(-1,28,28,1)
X_test2 = X_test1.values.reshape(-1,28,28,1)

# Split training data into 70% training and 30% validation
X_tr, X_val, Y_tr, Y_val = train_test_split(X_train2, Y_train2, test_size = 0.3, random_state=2, stratify=Y_train2)

Modeling
Next, I built the CNN model. Here, each convolutional layer is followed by a max pooling layer and a dropout layer to downsample and reduce overfitting. Then, I selected model parameters such as the learning rate, loss function, and number of epochs.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense
from keras.optimizers import RMSprop
from keras.callbacks import ReduceLROnPlateau

# Hidden layers
model = Sequential()
# Layer 1
model.add(Conv2D(filters=32, kernel_size=(5,5), padding='same', activation='relu', input_shape=(28,28,1)))
model.add(MaxPool2D(pool_size=(2,2))) # downsampling
model.add(Dropout(0.25)) # Dropout reduces overfitting
# Layer 2
model.add(Conv2D(filters=64, kernel_size=(3,3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Dropout(0.25))
# Fully connected layers
model.add(Flatten())
model.add(Dense(256,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(24, activation='softmax'))

optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# use categorical crossentropy as loss function
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Define model parameters
epochs = 10
batch_size = 64
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=0.00001)

The model was then fit on the 70% training split, with accuracy monitored on the validation split after each epoch.

# Train on the training split and monitor the validation split each epoch
model.fit(X_tr, Y_tr, batch_size=batch_size, epochs=epochs, validation_data=(X_val,Y_val), callbacks=[learning_rate_reduction])
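If the call above is captured in a variable (e.g. history = model.fit(...)), the returned History object can be used to compare training and validation accuracy across epochs. A small sketch:

import matplotlib.pyplot as plt

def plot_accuracy(history):
    """Plot training vs. validation accuracy for each epoch."""
    plt.plot(history.history['acc'], label='train')  # the key is 'accuracy' in newer Keras
    plt.plot(history.history['val_acc'], label='validation')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()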

Results
The following call was used to determine the overall loss and accuracy of the model on the test set.

# Get CNN loss and accuracy on the test set
model.evaluate(X_test2, Y_test2)

As shown earlier, these results can also be visualized using a confusion matrix.


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

results = model.predict(X_test2)             # predicted probabilities for each test image
Y_pred_classes = np.argmax(results, axis=1)  # convert probabilities to predicted class labels
Y_true = np.argmax(Y_test2, axis=1)          # convert one-hot test labels back to class labels

# Create confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 

# Plot confusion matrix
fig = plt.figure(figsize=(10,8))
ax1 = fig.add_subplot()
sns.heatmap(confusion_mtx, annot=True, fmt="d");
labels=['A','B','C','D','E','F','G','H','I','K','L','M','N','O','P','Q','R',\
             'S','T','U','V','W','X','Y']
ax1.xaxis.set_ticklabels(labels); ax1.yaxis.set_ticklabels(labels);

The full code for this project can be found on GitHub.

Disclaimer: This work was completed as part of a course project for MIS 382N at The University of Texas at Austin.

References
1. Jay, Michelle. “American Sign Language.” Start ASL, 1 Oct. 2010, www.startasl.com/american-sign-language/.