If you follow National Geographic on Instagram (@natgeo), you know they post many pictures every day, depicting everything from animals to people to nature and more. But what do their followers really want to see? After analyzing user engagement, I learned that National Geographic is most successful when they post about animals and not so successful when they share images of people. The following analysis describes how to measure and predict user engagement, and provides insights into how brands can use these results to increase engagement on social media.
Findings
The 500 most recent posts to @natgeo's Instagram were analyzed to determine what their followers are most interested in seeing. Each image was run through visual recognition software to obtain a list of labels describing its contents. Below is an example post from @natgeo's Instagram with its caption and image labels.
Caption: Photo by Trevor Frost @tbfrost | It's World Animal Day! As soon as Keanu the ocelot was steady on his feet, he was climbing anything he could find. Today, at a little over a year old, he can easily scale any tree and will sometimes climb hundreds of feet into the tropical rainforest canopy in seconds. Harry Turner, an ex-soldier who is rehabilitating Keanu with Peruvian non-profit @hojanueva, tells me Keanu even climbs out of curiosity, sometimes just to get a closer look at monkeys. Though ocelots are very good climbers, they are not as graceful in the trees as margays, a related cat that is smaller than the ocelot but found in the same areas. To see video of Keanu climbing, head over to @tbfrost. #WorldAnimal Day
Image Labels: Margay cat, wildcat, cat, feline, carnivore, mammal, animal, panther cat, greenishness color, light brown color
Each post was assigned an engagement score based on the number of comments and likes it received. The posts were then labeled as either high engagement or low engagement based on their score. For this analysis, I was interested to see if I could predict how engaging a post would be based on its contents. After a little bit of feature extraction and model building, I got the following results.

| Features | Model Accuracy |
| --- | --- |
| Image Labels | 65.00% |
| Captions | 70.00% |
| Image Labels + Captions | 71.43% |

The best model used both image labels and captions to successfully predict if a post would be engaging 71.43% of the time. While this does provide us with some insight into our problem, it seems like there is still room for improvement. I will discuss these improvements shortly.
To better understand the nature of successful posts, I used Latent Dirichlet Allocation (LDA) to identify the underlying topics in @natgeo's posts. LDA looks at the collection of text (in this case, the image labels and captions) and identifies a set of topics, where each topic can be described by a distribution of words. The output of LDA is several word groups that each represent a topic. The table below shows the five topics this method extracted. I have gone ahead and assigned names to each topic based on my interpretation of the words contained in that topic.

| Topic 0: People | Topic 1: Scenery | Topic 2: Animals | Topic 3: Time/Events | Topic 4: Marine Life |
| --- | --- | --- | --- | --- |
| woman | nature | animal | year | sea |
| person | mountain | mammal | time | whale |
| people | sky | lion | day | water |
| girl | land | bird | first | fish |
| indian | water | wildlife | one | shark |

Here, we can see that @natgeo primarily posts about people, scenery, animals, current events, and marine life. Makes sense, right?
Next, I looked at how these topics relate to engagement. I split the posts into four quantiles based on their engagement scores. The distribution of topics in the most and least engaging quantiles is described below.
It looks like the most engaging quantile has a disproportionate number of animal pictures, and almost no animal photos appear in the least engaging quantile. On the other hand, photos of people don't appear to capture the interest of Instagram users very well. This topic only appears in ~10% of high engagement posts, despite appearing in ~23% of all posts. Additionally, over 40% of low engagement posts contain people. These results suggest that National Geographic should post more animal pictures and fewer images of people to increase user engagement.
As I mentioned previously, there are likely other variables that have not yet been considered. For example, the post with the highest engagement score in this dataset was a photo of a woman underwater.
Photo by @paulnicklen.
How could a person, our least engaging subject, be the most engaging post? And why are we only getting 71% accuracy on our classification model? It is likely that the answers to these questions could be found by evaluating other variables. For example, the presence of a compelling story line or a certain sentiment may increase activity related to the post. These factors should be explored in future analyses to better understand the role they play in user engagement. In the meantime, check out the following section for more details on how the above analysis was performed.
Analysis
Data Pre-Processing
I began this project by scraping the 500 most recent posts from the @natgeo profile using a scraper found in rarcega's GitHub repository. After installing the package, I ran the following command at the command prompt.
```shell
instagram-scraper natgeo -m 500 -t image --media-metadata
```
The metadata from the posts was saved as a JSON file, so the following code was used in Jupyter Notebook to read the UTF-8 encoded file.
```python
import json

# Read the JSON metadata file with UTF-8 encoding
with open("natgeo.json", "r", encoding="utf8") as read_file:
    data = json.load(read_file)
```
After loading the file, I iterated through the metadata to obtain the image type (picture or video), number of likes, number of comments, caption, and url for each post.
```python
image_name = []
likes = []
comments = []
captions = []
urls = []
for p in data['GraphImages']:  # iterate through all posts
    image_name.append(p['__typename'])  # media type (e.g. 'GraphVideo' for videos)
    likes.append(p['edge_media_preview_like']['count'])  # number of likes
    comments.append(p['edge_media_to_comment']['count'])  # number of comments
    captions.append(p['edge_media_to_caption']['edges'][0]['node']['text'])  # caption
    urls.append(p['urls'][0])  # image url (for multi-image posts, the first image only)
```
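A post without a caption would make the `['edges'][0]` lookup above raise an `IndexError`. Below is a minimal sketch of a more defensive lookup, assuming the metadata keeps the layout shown above. `get_caption` is a hypothetical helper for illustration, not part of the scraper or the original analysis.

```python
def get_caption(post):
    """Return the caption text of a post, or '' if the post has no caption.

    Hypothetical helper; assumes the metadata layout used in the loop above.
    """
    edges = post.get('edge_media_to_caption', {}).get('edges', [])
    if not edges:
        return ''
    return edges[0]['node']['text']


# Toy posts that mimic the metadata layout (not real @natgeo data)
with_caption = {'edge_media_to_caption': {'edges': [{'node': {'text': 'A lion at dusk'}}]}}
without_caption = {'edge_media_to_caption': {'edges': []}}

print(get_caption(with_caption))
print(repr(get_caption(without_caption)))
```

Inside the loop, `captions.append(get_caption(p))` would then keep the lists the same length even when a caption is missing.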
All of this data was then added to a single dataframe for further analysis.
```python
import pandas as pd

all_data = pd.DataFrame({'image_name': image_name, 'likes': likes, 'comments': comments,
                         'captions': captions, 'url': urls})
```
To avoid problems down the road with image processing, I removed all posts from the dataframe that contained videos.
```python
# Find all posts that contain a video
index_videos = all_data[all_data['image_name'] == 'GraphVideo'].index
# Delete these rows from the dataframe
image_data = all_data.drop(index_videos).reset_index()
```
Image Labeling Using IBM Watson Visual Recognition
To build a classification model to predict engagement, I needed to identify independent variables that provided information on the content of each post. This information was obtained by running each image through the IBM Watson Visual Recognition software. (If you're running this on your own computer, you will need to replace {apikey}, {version}, and {url} with the values associated with your personal IBM Watson account.)
```python
from ibm_watson import VisualRecognitionV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('{apikey}')  # enter authentication key
# Create the client once and reuse it for every image
visual_recognition = VisualRecognitionV3(
    version='{version}',  # enter software version
    authenticator=authenticator
)
visual_recognition.set_service_url('{url}')  # enter service url

labels = []
for url in image_data['url']:  # iterate through all images in dataframe
    classes_result = visual_recognition.classify(url=url).get_result()  # run image through visual recognition
    classes = []
    for dic in classes_result['images'][0]['classifiers'][0]['classes']:
        classes.append(dic['class'])  # get image labels
    labels.append(classes)  # add labels to list
```
Once I obtained the image labels, I added them to our existing dataframe.
```python
image_data['labels'] = labels  # create column in dataframe for image labels
```
These past few steps have been a little computationally intensive, and the results will change if @natgeo continues to post to their page. Therefore, I saved the results to a CSV file so I could easily access them later.
```python
image_data.to_csv('images_labeled.csv', index=False)  # save image data and labels to csv file
```
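One caveat when reloading this file: a CSV stores each list of labels as text, so the `labels` column comes back as strings rather than Python lists. Below is a minimal sketch of restoring the lists with `ast.literal_eval`. It uses a small in-memory CSV with made-up rows in place of the real file.

```python
import ast
import io

import pandas as pd

# Toy stand-in for images_labeled.csv (made-up rows, not real data)
toy = pd.DataFrame({'likes': [10, 20],
                    'labels': [['cat', 'feline'], ['sea', 'whale']]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(type(reloaded['labels'][0]))  # the lists were written out as text

# Parse each string back into a Python list
reloaded['labels'] = reloaded['labels'].map(ast.literal_eval)
print(reloaded['labels'][0])
```

The same `map(ast.literal_eval)` step would apply after `pd.read_csv('images_labeled.csv')`, before the labels are turned into documents.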
Quantifying Engagement
The number of likes and the number of comments for each post were used to measure engagement. However, there is a huge range of potential values for both of these metrics, so I normalized the values by dividing by the max value found in the dataset.
```python
# Create engagement score
max_likes = max(image_data['likes'])  # max number of likes
max_comments = max(image_data['comments'])  # max number of comments
image_data['norm_likes'] = image_data['likes'] * 1.0 / max_likes  # scale likes to a value between 0 and 1
image_data['norm_comments'] = image_data['comments'] * 1.0 / max_comments  # scale comments to a value between 0 and 1
```
An engagement score was then calculated as a weighted average of the normalized likes (weight 0.4) and normalized comments (weight 0.6). The median engagement score was then found, and all posts with engagement scores greater than the median were defined as high-engagement posts; all other posts were considered low-engagement. Splitting the data at the median creates roughly balanced classes for classification.
```python
import numpy as np

# Define posts as either high engagement or low engagement
image_data['engagement_score'] = 0.4 * image_data['norm_likes'] + 0.6 * image_data['norm_comments']  # weighted average
median_engagement = np.median(image_data['engagement_score'])  # find median value
image_data['binary_engagement'] = np.where(image_data['engagement_score'] > median_engagement, 1, 0)  # posts above the median are high engagement
```
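To make the scoring concrete, here is a small worked example on four made-up posts (the numbers are illustrative, not taken from the @natgeo data). It applies the same normalization, 0.4/0.6 weighting, and median split as the code above.

```python
import numpy as np
import pandas as pd

# Four made-up posts (illustrative numbers only)
toy = pd.DataFrame({'likes': [100000, 400000, 200000, 300000],
                    'comments': [500, 1000, 2000, 250]})

# Scale each metric by its maximum so both lie between 0 and 1
toy['norm_likes'] = toy['likes'] / toy['likes'].max()
toy['norm_comments'] = toy['comments'] / toy['comments'].max()

# Same weighting as above: 40% likes, 60% comments
toy['engagement_score'] = 0.4 * toy['norm_likes'] + 0.6 * toy['norm_comments']

# Posts strictly above the median are labeled high engagement (1)
median_engagement = toy['engagement_score'].median()
toy['binary_engagement'] = np.where(toy['engagement_score'] > median_engagement, 1, 0)

print(toy[['engagement_score', 'binary_engagement']])
```

In this toy data the third post has only half as many likes as the second but the most comments, so it receives the highest score. That is a direct consequence of weighting comments more heavily than likes.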
Predicting Engagement
After obtaining independent features (image labels) and a target variable (high/low engagement), I built a classification model. First, I turned each list of labels into a document.
```python
def convert_list_to_string(label_list):
    """Convert list of strings into single string"""
    label_string = ''  # initialize empty string
    for label in label_list:
        label_string = label_string + ' ' + label  # add next word to string
    return label_string

string_labels = image_data['labels'].map(convert_list_to_string)  # get single 'document' for each image
```
Then, I made a 70/30 split of the data for the train and test sets. Next, I counted the number of times each word appeared in each document and computed a Term Frequency-Inverse Document Frequency score. This allowed me to identify the most important words for classification. Finally, I fit a logistic regression model to our training data, predicted on the test set, and computed the model's accuracy.
```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tfidf_classification(X, y):
    """Obtain TF-IDF scores and run logistic regression. Outputs accuracy score."""
    x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=2)  # 70/30 train-test split
    # count occurrence of each word, convert to TF-IDF representation, and perform logistic regression
    tfidf_pipeline = Pipeline([('count', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('classifier', LogisticRegression())])
    tfidf_pipeline.fit(x_train, y_train)  # fit model to training data
    y_pred = tfidf_pipeline.predict(x_test)  # predict on test data
    return accuracy_score(y_test, y_pred)  # obtain accuracy score
```
Using the image labels as our input variables, the model achieved 65% accuracy.
```python
# predict using image labels
tfidf_classification(string_labels, image_data['binary_engagement'])
```
What happens if post captions are used to classify engagement?
```python
# predict using post caption
tfidf_classification(image_data['captions'], image_data['binary_engagement'])
```
Turns out, the model's accuracy increases to 70%. What if captions and labels are combined into a single document?
```python
# predict using captions and labels
image_data['captions_and_labels'] = image_data['captions'] + string_labels  # create one document with caption and label
tfidf_classification(image_data['captions_and_labels'], image_data['binary_engagement'])
```
Again, our accuracy increases and we obtain a value of 71.43%.
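These accuracies come from a single 70/30 split, so some of the difference between them may depend on which posts landed in the test set. One way to get a steadier estimate is k-fold cross-validation. Below is a minimal sketch that assumes the same pipeline as above. The `cv_accuracy` helper and the toy documents are illustrative, and no results for the @natgeo data are implied.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def cv_accuracy(X, y, folds=5):
    """Mean and standard deviation of accuracy across cross-validation folds."""
    pipeline = Pipeline([('count', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('classifier', LogisticRegression())])
    scores = cross_val_score(pipeline, X, y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()


# Toy documents standing in for captions and labels (not real data)
docs = ['lion animal mammal', 'woman person people'] * 10
labels = [1, 0] * 10

mean_acc, std_acc = cv_accuracy(docs, labels)
print('accuracy: {:.2f} +/- {:.2f}'.format(mean_acc, std_acc))
```

On the real data, the call would be `cv_accuracy(image_data['captions_and_labels'], image_data['binary_engagement'])`. The spread across folds gives a sense of how much weight a gap of a point or two between feature sets can bear.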
Topic Modeling
To better understand what topics Instagram users are most interested in, topic modeling using Latent Dirichlet Allocation (LDA) was performed. The following lines of code create a document term matrix from the captions and labels. LDA is then performed on this matrix to obtain a set of five topics.
```python
import lda
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Note: captions_labels_preprocessed (the preprocessed caption + label documents)
# is not defined in this write-up
lda_count = CountVectorizer()  # instantiate count vectorizer
X = lda_count.fit_transform(captions_labels_preprocessed)  # get document-term matrix for captions and labels
vocab = lda_count.get_feature_names_out()  # all words in the captions and labels (get_feature_names() before scikit-learn 1.0)
lda_model = lda.LDA(n_topics=5, n_iter=1500, random_state=2)  # run LDA with 5 topics and 1500 iterations
lda_model.fit(X)  # fit the model to the corpus containing all captions and labels
topic_word = lda_model.topic_word_
n_top_words = 20  # number of words to display per topic

# Print top words for each topic
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, '/'.join(topic_words)))
```
Each post was then assigned to a topic based on the probabilities produced by the LDA model.
```python
# create dataframe of topic probabilities
topics_df = pd.DataFrame(lda_model.doc_topic_, columns=['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4'])
topics_df['Topic'] = topics_df.idxmax(axis=1)  # new column with the most probable topic for each image
```
Finally, I split the dataframe containing the image data into four quantiles based on engagement scores.
```python
# Split dataframe into four quantiles based on engagement score (quantile 3 has highest engagement, 0 has lowest)
quantiles = pd.qcut(image_data['engagement_score'], q=4, labels=False)
high_engagement = topics_df[quantiles == 3]  # get high engagement posts
low_engagement = topics_df[quantiles == 0]  # get low engagement posts
```
This allowed me to see which topics typically have the highest and lowest engagement.
```python
# Get topic distributions
topic_dist = (topics_df['Topic'].value_counts()*100.0/len(topics_df)).sort_index()  # full data set
high_eng_dist = (high_engagement['Topic'].value_counts()*100.0/len(high_engagement)).sort_index()  # high engagement posts only
low_eng_dist = (low_engagement['Topic'].value_counts()*100.0/len(low_engagement)).sort_index()  # low engagement posts only
```
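The three distributions can be placed side by side to make the comparison easier to read. Below is a minimal sketch on made-up topic assignments. The `topic_share` helper is hypothetical, and `reindex` is used so that a topic missing from one group shows up as 0% instead of being dropped.

```python
import pandas as pd

topic_names = ['Topic 0', 'Topic 1', 'Topic 2', 'Topic 3', 'Topic 4']


def topic_share(topics):
    """Percentage of posts assigned to each topic (0 for topics that are absent)."""
    share = topics.value_counts() * 100.0 / len(topics)
    return share.reindex(topic_names, fill_value=0.0)


# Made-up topic assignments for illustration only
all_posts = pd.Series(['Topic 0', 'Topic 0', 'Topic 1', 'Topic 2',
                       'Topic 2', 'Topic 3', 'Topic 4', 'Topic 2'])
high = pd.Series(['Topic 2', 'Topic 2', 'Topic 4', 'Topic 2'])
low = pd.Series(['Topic 0', 'Topic 0', 'Topic 1', 'Topic 3'])

comparison = pd.DataFrame({'all posts': topic_share(all_posts),
                           'high engagement': topic_share(high),
                           'low engagement': topic_share(low)})
print(comparison.round(1))
```

With the real data, `topic_share` would be applied to `topics_df['Topic']`, `high_engagement['Topic']`, and `low_engagement['Topic']`. Calling `comparison.plot(kind='bar')` is one way to turn the table into a grouped bar chart.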
The full code for this project can be found on GitHub.