
Predicting Engagement on Instagram

If you follow National Geographic on Instagram (@natgeo), you are well aware that they post many pictures every day, depicting everything from animals to people to nature and more. But what do their followers really want to see? After analyzing user engagement, I learned that National Geographic is most successful when posting about animals and not so successful when sharing images of people. The following analysis describes how to measure and predict user engagement, and it offers insights into how brands can use these results to increase engagement on social media.

Findings

The 500 most recent posts to @natgeo's Instagram were analyzed to determine what their followers are most interested in seeing. Each image was run through visual recognition software to obtain a list of labels describing its contents. Below is an example post from @natgeo's Instagram with its associated caption and image labels.

Caption: Photo by Trevor Frost @tbfrost | It's World Animal Day! As soon as Keanu the ocelot was steady on his feet, he was climbing anything he could find. Today, at a little over a year old, he can easily scale any tree and will sometimes climb hundreds of feet into the tropical rainforest canopy in seconds. Harry Turner, an ex-soldier who is rehabilitating Keanu with Peruvian non-profit @hojanueva, tells me Keanu even climbs out of curiosity, sometimes just to get a closer look at monkeys. Though ocelots are very good climbers, they are not as graceful in the trees as margays, a related cat that is smaller than the ocelot but found in the same areas. To see video of Keanu climbing, head over to @tbfrost. #WorldAnimalDay
Image Labels: Margay cat, wildcat, cat, feline, carnivore, mammal, animal, panther cat, greenishness color, light brown color

Each post was assigned an engagement score based on the number of comments and likes it received. The posts were then labeled as either high engagement or low engagement based on their score. For this analysis, I was interested to see if I could predict how engaging a post would be based on its contents. After a little bit of feature extraction and model building, I got the following results.

Features                   Model Accuracy
Image Labels               65.00%
Captions                   70.00%
Image Labels + Captions    71.43%

The best model used both image labels and captions to successfully predict if a post would be engaging 71.43% of the time. While this does provide us with some insight into our problem, it seems like there is still room for improvement. I will discuss these improvements shortly.

To better understand the nature of successful posts, I used Latent Dirichlet Allocation (LDA) to identify the underlying topics in @natgeo's posts. LDA looks at a collection of text (in this case, the image labels and captions) and identifies a set of topics, where each topic is described by a distribution of words; its output is a set of word groups, each representing one topic. The table below shows the five topics this method extracted, along with the name I assigned to each topic based on my interpretation of the words it contains.

Topic 0: People         Topic 1: Scenery        Topic 2: Animals        Topic 3: Time/Events    Topic 4: Marine Life
woman                   nature                  animal                  year                    sea
person                  mountain                mammal                  time                    whale
people                  sky                     lion                    day                     water
girl                    land                    bird                    first                   fish
indian                  water                   wildlife                one                     shark

Here, we can see that @natgeo primarily posts about people, scenery, animals, current events, and marine life. Makes sense, right?

Next, I looked at how these topics relate to engagement. I split the posts into quartiles based on their engagement scores. The distribution of topics in the most and least engaging quartiles is shown in the graphs below.

[Bar charts: distribution of topics among the most engaging posts vs. the least engaging posts]
It looks like the most engaging quartile contains a disproportionate share of animal pictures, while almost no animal photos appear in the least engaging quartile. On the other hand, photos of people don't appear to capture the interest of Instagram users very well: this topic appears in only ~10% of high-engagement posts, despite appearing in ~23% of all posts. Additionally, over 40% of low-engagement posts contain people. These results suggest that National Geographic could increase user engagement by posting more animal pictures and fewer images of people.

As I mentioned previously, there are likely other variables that have not yet been considered. For example, the post with the highest engagement score in this dataset was a photo of a woman underwater.

Photo by @paulnicklen.

How could a person, our least engaging subject, be the star of the most engaging post? And why does our classification model only reach 71% accuracy? The answers to these questions likely lie in variables we haven't yet evaluated. For example, the presence of a compelling storyline or a certain sentiment may increase activity around a post. These factors should be explored in future analyses to better understand the role they play in user engagement; a quick sketch of what a sentiment feature could look like is shown below.
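As a rough illustration only (sentiment was not part of the analysis above, and VADER is simply one convenient off-the-shelf scorer), a caption-sentiment feature could be added to the image_data dataframe built in the Analysis section like so:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon') # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# hypothetical feature: compound sentiment score (-1 to 1) for each caption
image_data['caption_sentiment'] = image_data['captions'].apply(
    lambda text: sia.polarity_scores(text)['compound'])

In the meantime, check out the following section for more details on how the above analysis was performed.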

Analysis

Data Pre-Processing
I began this project by scraping the 500 most recent posts from the @natgeo profile using the instagram-scraper tool from rarcega's GitHub repository.
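The package is published on PyPI under the same name, so installing it should just be:

pip install instagram-scraper

After installing the package, the following command was entered at the command prompt to download the posts and their metadata.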

instagram-scraper natgeo -m 500 -t image --media-metadata

The metadata from the posts was saved as a JSON file, so the following code was used in a Jupyter Notebook to read the UTF-8 encoded file.

import json

# read json file with UTF-8 encoding
with open("natgeo.json", "r", encoding='utf8') as read_file:
    data = json.load(read_file)

After loading the file, I iterated through the metadata to obtain the image type (picture or video), number of likes, number of comments, caption, and url for each post.

image_name = []
likes = []
comments = []
captions = []
urls = []

for p in data['GraphImages']: # iterate through all posts
    image_name.append(p['__typename']) # media type (e.g. GraphImage or GraphVideo)
    likes.append(p['edge_media_preview_like']['count']) # number of likes
    comments.append(p['edge_media_to_comment']['count']) # number of comments
    captions.append(p['edge_media_to_caption']['edges'][0]['node']['text']) # caption
    urls.append(p['urls'][0]) # image url (if a post has multiple images, gets url of the first image only)
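One caveat worth flagging: if a post has no caption at all, the edges list above is empty and the caption lookup raises an IndexError. A defensive variant (my addition, assuming an empty string is an acceptable fallback) would replace that line with:

# fall back to an empty caption if a post has none
caption_edges = p['edge_media_to_caption']['edges']
captions.append(caption_edges[0]['node']['text'] if caption_edges else '')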

All of this data was then added to a single dataframe for further analysis.

import pandas as pd

all_data = pd.DataFrame({'image_name': image_name, 'likes': likes, 'comments': comments,
                         'captions': captions, 'url': urls})

To avoid problems down the road with image processing, I removed all posts from the dataframe that contained videos.

# Find all posts that contain a video
index_videos = all_data[all_data['image_name'] == 'GraphVideo'].index
# Drop these rows from the dataframe and renumber the remaining rows
image_data = all_data.drop(index_videos).reset_index(drop=True)

Image Labeling Using IBM Watson Visual Recognition
To build a classification model to predict engagement, I needed to identify independent variables that provided information on the content of each post. This information was obtained by running each image through the IBM Watson Visual Recognition software. (If you're running this on your own computer, you will need to replace {apikey}, {version}, and {url} with the values associated with your personal IBM Watson account.)

from ibm_watson import VisualRecognitionV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('{apikey}') # enter authentication key
visual_recognition = VisualRecognitionV3(
    version='{version}', # enter software version
    authenticator=authenticator
)
visual_recognition.set_service_url('{url}') # enter service url

labels = []
for url in image_data['url']: # iterate through all images in dataframe
    classes_result = visual_recognition.classify(url=url).get_result() # run image through visual recognition

    classes = []
    for dic in classes_result['images'][0]['classifiers'][0]['classes']:
        classes.append(dic['class']) # get image labels
    labels.append(classes) # add labels to list

Once I obtained the image labels, I added them to the existing dataframe.

image_data['labels'] = labels # create column in dataframe for image labels

These past few steps are a little computationally intensive, and the results will change as @natgeo continues to post to their page. Therefore, I saved the results to a csv file so I could easily access them later.

image_data.to_csv('images_labeled.csv', index=False) # Save image data and labels to csv file
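To pick the work back up later, the saved file can be read straight back in. One detail not covered above: pandas writes the list-valued labels column to csv as plain strings, so it needs to be converted back into lists, for example with ast.literal_eval:

import ast
import pandas as pd

image_data = pd.read_csv('images_labeled.csv') # reload saved image data and labels
# csv stores lists as strings like "['cat', 'animal']"; convert back to Python lists
image_data['labels'] = image_data['labels'].apply(ast.literal_eval)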

Quantifying Engagement
The number of likes and the number of comments for each post were used to measure engagement. However, both metrics span a huge range of potential values, so I normalized each by dividing by its maximum value in the dataset.

# Create engagement score
max_likes = max(image_data['likes']) # max number of likes
max_comments = max(image_data['comments']) # max number of comments
image_data['norm_likes'] = image_data['likes'] / max_likes # scale likes to a value between 0-1
image_data['norm_comments'] = image_data['comments'] / max_comments # scale comments to a value between 0-1

An engagement score was then calculated as a weighted average of the normalized likes and comments. The median engagement score was found, and all posts with scores above the median were defined as high-engagement posts; the rest were considered low-engagement. Splitting the data at the median creates balanced classes for classification.

import numpy as np

# Define posts as either high engagement or low engagement
image_data['engagement_score'] = 0.4*image_data['norm_likes'] + 0.6*image_data['norm_comments'] # take weighted average
median_engagement = np.median(image_data['engagement_score']) # find median value
image_data['binary_engagement'] = np.where(image_data['engagement_score'] > median_engagement, 1, 0) # define top 50% of posts as high engagement

Predicting Engagement
After obtaining independent features (image labels) and a target variable (high/low engagement), I built a classification model. First, I turned each list of labels into a single document.

def convert_list_to_string(label_list):
    """Convert a list of strings into a single space-separated string"""
    return ' '.join(label_list)

string_labels = image_data['labels'].map(convert_list_to_string) # get a single 'document' for each image

Then, I made a 70/30 split of the data into train and test sets. Next, I counted the number of times each word appeared in each document and computed term frequency-inverse document frequency (TF-IDF) scores, which highlight the words most informative for classification. Finally, I fit a logistic regression model to the training data, predicted on the test set, and computed the model's accuracy.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tfidf_classification(X, y):
    """Obtain TF-IDF scores and run logistic regression. Returns accuracy score."""
    x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=2) # 70/30 train-test split
    # count occurrence of each word, convert to TF-IDF representation, and perform logistic regression
    tfidf_pipeline = Pipeline([('count', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('classifier', LogisticRegression())])
    tfidf_pipeline.fit(x_train, y_train) # fit model to training data
    y_pred = tfidf_pipeline.predict(x_test) # predict on test data
    return accuracy_score(y_test, y_pred) # obtain accuracy score

Using the image labels as our input variables, the model achieved 65% accuracy.

# predict using image labels
tfidf_classification(string_labels, image_data['binary_engagement'])

What happens if post captions are used to classify engagement?

# predict using post caption
tfidf_classification(image_data['captions'], image_data['binary_engagement'])

Turns out, the model's accuracy increases to 70%. What if captions and labels are combined into a single document?

# predict using captions and labels
image_data['captions_and_labels'] = image_data['captions'] + ' ' + string_labels # join caption and labels into one document
tfidf_classification(image_data['captions_and_labels'], image_data['binary_engagement'])

Once again accuracy improves, this time reaching 71.43%.

Topic Modeling
To better understand which topics Instagram users are most interested in, I performed topic modeling using Latent Dirichlet Allocation (LDA). The following lines of code create a document-term matrix from the captions and labels; LDA is then run on this matrix to extract a set of five topics.

import lda
import numpy as np

# captions_labels_preprocessed is assumed to hold the combined captions and labels
# after text preprocessing (lowercasing, stop word removal, etc., not shown here)
lda_count = CountVectorizer() # instantiate count vectorizer
X = lda_count.fit_transform(captions_labels_preprocessed) # get document-term matrix for captions and labels
vocab = lda_count.get_feature_names() # vocabulary includes all words that appeared in the captions and labels

lda_model = lda.LDA(n_topics=5, n_iter=1500, random_state=2) # run LDA with 5 topics and 1500 iterations
lda_model.fit(X) # fit the model to the corpus containing all captions and labels
topic_word = lda_model.topic_word_ # topic-word distributions
n_top_words = 20 # number of words to display per topic

# Print top words for each topic
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1] # top 20 words by probability
    print('Topic {}: {}'.format(i, '/'.join(topic_words)))

Each post was then assigned to a topic based on the probabilities produced by the LDA model.

# create dataframe of topic probabilities
topics_df = pd.DataFrame(lda_model.doc_topic_, columns=['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4'])

topics_df['Topic'] = topics_df.idxmax(axis=1) # new column with the most probable topic for each image

Finally, I split the dataframe containing the image data into quartiles based on engagement scores.

# Split dataframe into quartiles based on engagement score (quartile 3 has highest engagement, 0 has lowest)
quantiles = pd.qcut(image_data['engagement_score'], q=4, labels=False)
high_engagement = topics_df[quantiles == 3] # get high engagement posts
low_engagement = topics_df[quantiles == 0] # get low engagement posts

This allowed me to see which topics typically have the highest and lowest engagement.

# Get topic distributions
topic_dist = (topics_df['Topic'].value_counts()*100.0/len(topics_df)).sort_index() # full data set
high_eng_dist = (high_engagement['Topic'].value_counts()*100.0/len(high_engagement)).sort_index() # high engagement posts only
low_eng_dist = (low_engagement['Topic'].value_counts()*100.0/len(low_engagement)).sort_index() # low engagement posts only
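The bar charts in the Findings section can be reproduced from these distributions. Here is a minimal matplotlib sketch (my reconstruction, since the original plotting code isn't shown):

import matplotlib.pyplot as plt

# side-by-side bar charts of topic share in the top and bottom quartiles
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
high_eng_dist.plot.bar(ax=axes[0], title='Most Engaging Posts')
low_eng_dist.plot.bar(ax=axes[1], title='Least Engaging Posts')
axes[0].set_ylabel('% of Posts')
plt.tight_layout()
plt.show()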

The full code for this project can be found on GitHub.