Introduction To Synthetic Control Using Propensity Score Matching


This article was published as a part of the Data Science Blogathon.

Here’s a secret: synthetic control methods can solve this problem with ease and without loss of revenue or customers. If this were an experiment, the hypothesis would be that removing cards as a payment method will not impact revenue and sales. The test base would comprise customers who do not have cards as a payment option, and the control base would comprise customers who do.

Synthetic Control Method: Case Study

Synthetic control methods (SCM), in simple words, choose for every test customer a similar control customer, using a pre-defined set of features or covariates: the control customer's pre-treatment characteristics are similar, but they have not undergone treatment. In SCM, customers are called units, interventions are called treatments, and features are called covariates. Companies use SCM for many practical use cases. Uber, for example, used SCM to test whether providing driver contact details before a ride increases customer satisfaction.

Sometimes hypotheses cannot be tested in an experimental setup due to legal, business, or platform reasons. For example, in Bangalore, Meghana’s biryani is one of the best, and online delivery platforms (Swiggy or Zomato) noticed that high-spending customers are very likely to order from Meghana’s as well. The hypothesis is that Meghana’s customers, on average, spend more than the rest. But Swiggy or Zomato cannot remove Meghana’s from the platform for 50% of the customers just for the sake of the experiment. There would be customer as well as restaurant backlash. Using SCM, a synthetic control can be created where test and control customers have similar pre-treatment covariates, and the hypothesis can be validated.

Solving selection bias and using the right business heuristics while choosing the control is very important and can even reverse the outcome of an experiment. For example, the test group will be a set of customers who ordered from Meghana’s. For control, vegetarian customers could be chosen. But test and control would then be intrinsically different, because the control has never tried non-veg cuisine. Another candidate control is customers who never ordered biryani, but this would not be appropriate either, as Meghana’s is predominantly a biryani-serving restaurant. A third option is to use geographies. As Meghana’s isn’t present in Delhi, Delhi customers could be used as control. But the food habits of Delhi users differ from those of south Indian users, and this will cause bias if the experiment is broken down into dimensions like age group, gender, and cuisine. The right business heuristic thus becomes an essential part of building a synthetic control.

Table of Contents

Practical Steps in Propensity Score Matching

Problem statement

Data Pre-Processing

Propensity Modelling

Matching using NearestNeighbors

Before and After Matching

Statistical Test to measure the impact of treatment

Another Way to Match using NearestNeighbors

psmpy – Python Propensity Score Matching Library

Synthetic controls (SC) can be created using matching. Propensity score matching (PSM) is the most common method used to create SC because it is easy, less time-consuming, inexpensive, and can be scaled to a large user base. The process can be repeated N times until the most similar test and control cohorts are matched.

Steps involved in propensity score matching:

Select a large group of customers along with their covariates (age, sex, sales, units, etc.). These are the covariates that can cause bias.

The main goal is to match the covariates before intervention. For each test customer (one who paid using a credit card), there needs to be a control customer whose pre-treatment covariates are similar but who did not pay using a card and used UPI or internet banking instead.

Using the propensity score (the predicted probability of receiving the treatment), units with close scores, say 0.60 and 0.61, can be matched using k-nearest neighbours. Matching can be 1:1 or 1:many, the latter reusing the same control units for several test units.

Now investigate how the treatment has affected the outcome. The treatment is binary, 1 or 0: ordered from Meghana’s or not, used a card or not. But the outcome can be either continuous (future revenue) or binary (churned or not, etc.).
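The matching step above can be sketched in plain Python. This is a minimal illustration, not the code used later in the article: the function name greedy_match, the unit ids, and the scores are all hypothetical, and it assumes greedy 1:1 matching without replacement inside a caliper.

```python
def greedy_match(treated, control, caliper):
    """Greedy 1:1 matching without replacement on propensity scores.

    treated, control: dicts mapping a unit id to its propensity score.
    Returns {treated_id: control_id}; treated units with no unused
    control inside the caliper are left unmatched.
    """
    available = dict(control)  # controls that have not been used yet
    matches = {}
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # closest remaining control by absolute score difference
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]  # without replacement
    return matches


# hypothetical propensity scores
treated = {"t1": 0.60, "t2": 0.35, "t3": 0.90}
control = {"c1": 0.61, "c2": 0.33, "c3": 0.10, "c4": 0.58}
pairs = greedy_match(treated, control, caliper=0.05)
print(pairs)
```

Here t3 stays unmatched because no control lies within 0.05 of its score, which is the trade-off a caliper introduces: fewer pairs, but closer ones.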

Problem Statement

The dataset used here is based on customer payment history. Let’s break it down by taking a cue from the debit card problem statement.

Null hypothesis, H0: customers paying with cards and customers paying by other methods have the same mean post-period revenue. Alternate hypothesis, H1: the mean post-period revenue of the two groups differs. If p < 0.05, we reject the null hypothesis and conclude that card payment is associated with a difference in post revenue. Otherwise, there is not enough evidence that card payers bring in higher post revenue, which supports the experiment to remove card payments altogether from the platform.

There are three time periods here: the pre-period, on which the features/covariates are based; the treatment period, where customers pay using cards; and the post-period, where the outcome (in this case, post revenue) is measured. Choosing these periods carefully is important to avoid event bias, ensuring no promotional event was planned and executed during this time. For example, during October and November, because of festive offers and cashback from banking partners, users prefer credit cards as they offer discounts of 500 to 5000. Considering this period for the analysis would clearly bias the results.

For the sake of this analysis, let’s assume the pre-period was Jan’21 to June’21, and the covariates used to train the model are derived from this period. The treatment is paying with a debit/credit card during the treatment period, and it is the model’s dependent variable. Let’s assume the treatment period was the first week of July, and the outcome post-period was the next 30 days.
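To make the three windows concrete, here is a small sketch that labels a transaction date with its period. The exact boundaries are an assumption: the first week of July is taken as 1 to 7 July 2021, and the 30-day post-period as starting the day after. The function name period_of is hypothetical.

```python
from datetime import date, timedelta

PRE_START, PRE_END = date(2021, 1, 1), date(2021, 6, 30)
TREAT_START, TREAT_END = date(2021, 7, 1), date(2021, 7, 7)
POST_START = TREAT_END + timedelta(days=1)
POST_END = POST_START + timedelta(days=29)  # 30 days, both ends inclusive


def period_of(d):
    """Return which analysis window a date falls into."""
    if PRE_START <= d <= PRE_END:
        return "pre"
    if TREAT_START <= d <= TREAT_END:
        return "treatment"
    if POST_START <= d <= POST_END:
        return "post"
    return "outside"


print(period_of(date(2021, 3, 15)))  # pre
print(period_of(date(2021, 7, 3)))   # treatment
print(period_of(date(2021, 7, 20)))  # post
```

Covariates would then be aggregated only over rows labelled "pre", and the outcome only over rows labelled "post".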

Exploratory Data Analysis of this dataset has been updated here.

Data Pre-Processing

Read the dataset and drop irrelevant features. card_payment is the dependent variable:

# df_fintec is assumed to be already loaded as a pandas DataFrame
cols_basic_model = ['number_of_cards', 'payments_initiated', 'payments_failed', 'payments_completed',
                    'payments_completed_amount_first_7days', 'reward_purchase_count_first_7days',
                    'coins_redeemed_first_7days', 'is_referral', 'visits_feature_1', 'visits_feature_2',
                    'given_permission_1', 'given_permission_2']
df_fintec_LR = df_fintec[cols_basic_model].copy()  # copy so later edits do not touch df_fintec
df_fintec_LR["is_referral"] = df_fintec_LR["is_referral"].astype(int)
df_fintec_LR = df_fintec_LR.fillna(0)
# drop the flag columns so that only numeric covariates remain
oh_cols = ["is_referral", "given_permission_1", "given_permission_2"]
df_fintec_LR = df_fintec_LR.drop(columns=oh_cols)


Prepare the train/test split and scale the features (CatBoost works without feature standardization as well):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df_fintec = df_fintec.rename(columns={'is_churned': 'card_payment'})
X_train, X_test, y_train, y_test = train_test_split(df_fintec_LR, df_fintec[["card_payment"]], random_state=70, test_size=0.30)
scaler = StandardScaler()
X_train_Scaled = scaler.fit_transform(X_train)  # fit the scaler on the training split only
X_test_Scaled = scaler.transform(X_test)
X, y = X_train_Scaled, y_train

Propensity Modelling

Use a simple model, as we are more interested in the propensity score:

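import numpy as np
import pandas as pd
import seaborn as sns
from catboost import CatBoostClassifier

clf = CatBoostClassifier(iterations=5, learning_rate=0.1)  # loss_function='CrossEntropy' is left commented out
# fit on the unscaled covariates (CatBoost does not need standardization), matching the predict call below
clf.fit(X_train, y_train, verbose=False)
y_pred = clf.predict(X_test, prediction_type='Probability')
y_pred_df = pd.DataFrame(y_pred, columns=["zero", "one"])  # column "one" is the propensity score
y_pred_df.head()
df_test = pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)
df_test = pd.concat([df_test.reset_index(drop=True), y_pred_df[["one"]].reset_index(drop=True)], axis=1)
df_test['revenue'] = np.random.randint(0, 50000, size=len(df_test))  # post-period revenue is simulated with random values
display(df_test.head())
sns.histplot(data=df_test, x='one', hue='card_payment')  # optionally pass multiple="dodge"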

As observed in the histogram, there are clear areas of overlap; thus, similar units can be obtained using PSM. If there were no overlap between the two groups, it would not be possible to find matches.

Matching using NearestNeighbors


import numpy as np
from sklearn.neighbors import NearestNeighbors

caliper = np.std(df_test['one']) * 0.25
print(f'caliper (radius) is: {caliper:.4f}')
n_neighbors = 10
# setup knn
knn = NearestNeighbors(n_neighbors=n_neighbors, radius=caliper)
ps = df_test[['one']]  # double brackets as a dataframe
knn.fit(ps)
# distances and indexes
distances, neighbor_indexes = knn.kneighbors(ps)
print(neighbor_indexes.shape)
# the 10 closest points to the first point
print(distances[0])
print(neighbor_indexes[0])
# for each point in treatment, we find a matching point in control without replacement
# note the 10 neighbors may include both points in treatment and control
matched_control = []  # keep track of the matched observations in control
for current_index, row in df_test.iterrows():  # iterate over the dataframe
    if row.card_payment == 0:  # the current row is in the control group
        df_test.loc[current_index, 'matched'] = np.nan  # set matched to nan
    else:
        for idx in neighbor_indexes[current_index, :]:  # for each row in treatment, find the k neighbors
            # make sure the current row is not the idx - don't match to itself
            # and the neighbor is in the control
            if (current_index != idx) and (df_test.loc[idx].card_payment == 0):
                if idx not in matched_control:  # this control has not been matched yet
                    df_test.loc[current_index, 'matched'] = idx  # record the matching
                    matched_control.append(idx)  # add the matched to the list
                    break
# try to increase the number of neighbors and/or caliper to get more matches
print('total observations in treatment:', len(df_test[df_test.card_payment==1]))
print('total matched observations in control:', len(matched_control))

total observations in treatment: 8988

total matched observations in control: 1296

# control have no match
treatment_matched = df_test.dropna(subset=['matched'])  # drop not matched
# matched control observation indexes
control_matched_idx = treatment_matched.matched
control_matched_idx = control_matched_idx.astype(int)  # change to int
control_matched = df_test.loc[control_matched_idx, :]  # select matched control observations
# combine the matched treatment and control
df_matched = pd.concat([treatment_matched, control_matched])
df_matched.card_payment.value_counts()

1 1296

0 1296

Name: card_payment, dtype: int64

import matplotlib.pyplot as plt

sns.histplot(data=df_matched, x='number_of_cards', hue='card_payment')
plt.show()  # draw each covariate on its own figure
sns.histplot(data=df_matched, x='payments_completed_amount_first_7days', hue='card_payment')
plt.show()

From the 2 histograms, it’s clear that post propensity matching, the distribution of covariates is nearly the same for 2 features – the number of cards and payments completed amount first seven days. The same can be plotted for other features as well.

Before and After

Before matching, the covariate distributions of test and control should differ noticeably; after matching, the standardized difference in means should shrink towards zero. Only then can it be concluded that matching has been effective.

from math import sqrt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from numpy import mean, var
from scipy.stats import ttest_ind

# function to calculate Cohen's d for independent samples
def cohen_d(d1, d2):
    # calculate the size of samples
    n1, n2 = len(d1), len(d2)
    # calculate the variance of the samples
    s1, s2 = var(d1, ddof=1), var(d2, ddof=1)
    # calculate the pooled standard deviation
    s = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # calculate the means of the samples
    u1, u2 = mean(d1), mean(d2)
    # calculate the effect size
    return (u1 - u2) / s

effect_sizes = []
cols = list(X_train.columns)
# separate control and treatment for t-test, before and after matching
df_control = df_fintec[df_fintec.card_payment==0]
df_treatment = df_fintec[df_fintec.card_payment==1]
df_matched_control = df_matched[df_matched.card_payment==0]
df_matched_treatment = df_matched[df_matched.card_payment==1]
for cl in cols:
    _, p_before = ttest_ind(df_control[cl], df_treatment[cl])
    _, p_after = ttest_ind(df_matched_control[cl], df_matched_treatment[cl])
    cohen_d_before = cohen_d(df_treatment[cl], df_control[cl])
    cohen_d_after = cohen_d(df_matched_treatment[cl], df_matched_control[cl])
    effect_sizes.append([cl, 'before', cohen_d_before, p_before])
    effect_sizes.append([cl, 'after', cohen_d_after, p_after])
df_effect_sizes = pd.DataFrame(effect_sizes, columns=['feature', 'matching', 'effect_size', 'p-value'])
fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(data=df_effect_sizes, x='effect_size', y='feature', hue='matching', orient='h')

Cohen’s d, or standardized mean difference, is one of the most common ways to measure effect size. Before matching, the standardized difference between the test and control is larger (blue bars). After matching, the difference is smaller, meaning the test and control have similar covariate distributions before the treatment period.

Statistical Test to Measure The Impact of Treatment

Student’s t-test is used to compare the mean revenue of the two groups. If the outcome were binary, such as post-period retention or churn, a chi-squared test could be used instead.

# student's t-test for revenue (dependent variable) after matching
# p value is not significant now
from scipy.stats import ttest_ind

# select the matched control and treatment groups
df_matched_control = df_matched[df_matched.card_payment==0]
df_matched_treatment = df_matched[df_matched.card_payment==1]
print(df_matched_control.revenue.mean(), df_matched_treatment.revenue.mean())
# compare samples
_, p = ttest_ind(df_matched_control.revenue, df_matched_treatment.revenue)
print(f'p={p:.3f}')
# interpret
alpha = 0.05  # significance level
if p > alpha:
    print('same distributions/same group mean (fail to reject H0 - we do not have enough evidence to reject H0)')
else:
    print('different distributions/different group mean (reject H0)')

25105.91898148148 24040.162037037036


same distributions/same group mean (fail to reject H0 – we do not have enough evidence to reject H0)
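For a binary outcome such as churn, the chi-squared test mentioned above can be worked through by hand on a 2x2 table. The sketch below is illustrative only: the churn counts are hypothetical (chosen to add up to the 1296 matched units per group), the function name is made up, and it computes the plain Pearson statistic without a continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # expected count of a cell = row total * column total / grand total
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# hypothetical counts: rows = treatment / control, columns = churned / retained
stat = chi_square_2x2(120, 1176, 150, 1146)
CRITICAL_5PCT_DF1 = 3.841  # chi-squared critical value, alpha = 0.05, 1 degree of freedom
print(round(stat, 3))  # 3.721
print("reject H0" if stat > CRITICAL_5PCT_DF1 else "fail to reject H0")
```

With these made-up counts the statistic stays just below the critical value, so the difference in churn rates would not be called significant at the 5% level.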

Another Way to Match Using NearestNeighbors

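Apart from PSM, there are other matching methods as well. One of them is to match directly on the standardized covariates using a nearest-neighbour search; a snippet of code for this is shown here.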

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

def get_matching_pairs(treated_df, non_treated_df, scaler=True):
    treated_x = treated_df.values
    non_treated_x = non_treated_df.values
    if scaler == True:
        scaler = StandardScaler()
    if scaler:
        scaler.fit(treated_x)  # the scaler must be fitted before transform
        treated_x = scaler.transform(treated_x)
        non_treated_x = scaler.transform(non_treated_x)
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(non_treated_x)
    distances, indices = nbrs.kneighbors(treated_x)
    indices = indices.reshape(indices.shape[0])
    matched = non_treated_df.iloc[indices]
    return matched

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

treated_df = pd.DataFrame()
np.random.seed(1)
size_1 = 200
size_2 = 1000
treated_df['x'] = np.random.normal(0, 1, size=size_1)
treated_df['y'] = np.random.normal(50, 20, size=size_1)
treated_df['z'] = np.random.normal(0, 100, size=size_1)
non_treated_df = pd.DataFrame()
# two different populations
non_treated_df['x'] = list(np.random.normal(0, 3, size=size_2)) + list(np.random.normal(-1, 2, size=2*size_2))
non_treated_df['y'] = list(np.random.normal(50, 30, size=size_2)) + list(np.random.normal(-100, 2, size=2*size_2))
non_treated_df['z'] = list(np.random.normal(0, 200, size=size_2)) + list(np.random.normal(13, 200, size=2*size_2))
matched_df = get_matching_pairs(treated_df, non_treated_df)
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(non_treated_df['x'], non_treated_df['y'], alpha=0.3, label='All non-treated')
plt.scatter(treated_df['x'], treated_df['y'], label='Treated')
plt.scatter(matched_df['x'], matched_df['y'], marker='x', label='matched')
plt.legend()
plt.xlim(-1, 2)

Using this technique, only non-treated units in proximity to the treated units are matched (in green), while the rest have been left out (light blue).

psmpy – Python Propensity Score Matching Library

PSMPY simplifies PSM, effectively reducing it to just 10 lines of code. Model building, matching, and scaling are taken care of by the framework. One downside is that it scales poorly for more than 50K units.

from psmpy import PsmPy
from psmpy.plotting import *

df_fintec.fillna(0, inplace=True)
psm = PsmPy(df_fintec, treatment='card_payment', indx='user_id', exclude=['is_referral', 'age', 'city', 'device'])
# same as my code using balance=False
psm.logistic_ps(balance=False)
psm.predicted_data
psm.knn_matched(matcher='propensity_logit', replacement=False, caliper=None)
psm.plot_match(Title='Matching Result', Ylabel='# of obs', Xlabel='propensity logit', names=['treatment', 'control'])
display(psm.effect_size)
psm.effect_size_plot()

Two individuals with identical propensity scores will not necessarily have identical covariate values. PSM balances the average values of covariates across cohorts, not the values within individual pairs. For example, the average number of cards may be the same across test and control, yet two users chosen with the same propensity score can still differ in their covariates.
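A tiny worked example shows why this happens. It assumes a logistic propensity model with two covariates; the coefficients, the bias, and both users are made up for illustration.

```python
import math


def propensity(x, weights, bias):
    """Logistic propensity score: sigmoid of a weighted sum of covariates."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))


# hypothetical coefficients for (number_of_cards, payments_completed)
weights = (0.5, 0.25)
bias = -2.0

user_a = (4, 0)  # four cards, no completed payments
user_b = (0, 8)  # no cards, eight completed payments

print(propensity(user_a, weights, bias))  # 0.5
print(propensity(user_b, weights, bias))  # 0.5
```

Both users land on the same score because the score collapses all covariates into a single number, so very different profiles can share it.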

As the number of covariates grows, the curse of dimensionality affects matching, reducing the chances of finding a unit that matches on every covariate to nearly zero.
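The effect is easy to see in a small simulation. The sketch below is a toy setup, not part of the original analysis: it assumes purely binary covariates drawn at random, 200 units per group, and a hypothetical helper name. It reports the share of treated units that have an exact twin in the control group.

```python
import random


def exact_match_rate(n_covariates, n_treated=200, n_control=200, seed=0):
    """Share of treated units with a control unit identical on every covariate."""
    rng = random.Random(seed)

    def draw():
        # one unit = a tuple of random binary covariates
        return tuple(rng.randint(0, 1) for _ in range(n_covariates))

    controls = {draw() for _ in range(n_control)}
    treated = [draw() for _ in range(n_treated)]
    return sum(t in controls for t in treated) / n_treated


for k in (1, 5, 10, 20):
    print(k, exact_match_rate(k))
```

With k binary covariates there are 2**k possible profiles, so a fixed pool of controls covers a rapidly shrinking fraction of them. This is one reason for matching on a single propensity score rather than on all covariates at once.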


Controlling for inherent bias can lead to good results.

Choose appropriate pre-period covariates based on the problem at hand.

The hypothesis needs to be backed by business acumen and data.

Good luck! Here’s my Linkedin profile if you want to connect with me or want to help improve the article. Check out my other articles on data science and analytics here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.



Introduction To Supabase: Postgres Database Using Python

This article was published as a part of the Data Science Blogathon.


Supabase is an open-source Backend as a Service (BaaS) that has been gaining traction among developers in recent times. Supabase claims to be the open-source Firebase alternative and is backed by investors such as Y Combinator, Mozilla, and Coatue. Just like Google Firebase, Supabase aims to replace the complete backend for modern web and mobile applications by providing important features and functionalities like authentication, cloud storage, database, analytics, and edge functions. These services can automatically scale with network traffic without any manual setup.

The major issue with Google Firebase is vendor lock-in, which is one reason companies refrain from using such platforms even though they reduce development time. In the case of Supabase, there is no fear of vendor lock-in. Another significant difference between Supabase and Firebase is that Supabase is built on top of a Postgres database while Firebase ships a NoSQL database called Firestore. If you are interested in learning about the Firestore database offered by Google Firebase, refer to my previous article here.

In this article, we will learn how to set up the Postgres database on the Supabase platform and perform basic CRUD operations using the python programming language. We will also see some SQL queries to perform some CRUD operations on the Postgres database.

Setting-up Supabase

Navigate to the official Supabase website and sign in with your GitHub account to access Supabase. This is a mandatory step.

Once you sign in with your Github account, you can see all your projects on the dashboard. Please note that the free-tier version lets you create a maximum of up to 2 projects. Let’s go ahead and create a new Supabase project.

Now that we have created a table let’s get our hands dirty and shift our focus to the programming part.

Connecting to Supabase and Performing CRUD Operations

There is a Supabase client for python available that can be installed like any other python module using pip. Let’s go ahead and install it before proceeding further.

$ pip install supabase

Two things are mandatory to connect to the Supabase database: the Project URL and the API Key. To extract this, navigate to the settings tab and select the API section, where you will find the project URL and the API key. Copy the URL, and the anon public key as this will be used to connect to the Supabase database.
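Rather than pasting the URL and key into the notebook, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names SUPABASE_URL and SUPABASE_KEY and the helper name are assumptions, not something the Supabase client requires.

```python
import os


def load_supabase_credentials(env=None):
    """Return (url, key) read from environment variables."""
    env = os.environ if env is None else env
    try:
        return env["SUPABASE_URL"], env["SUPABASE_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# example with an explicit mapping instead of the real environment
url, key = load_supabase_credentials(
    {"SUPABASE_URL": "your_url", "SUPABASE_KEY": "your_key"}
)
print(url, key)
# the pair could then be passed on: create_client(url, key)
```

Keeping the key out of the source makes it harder to commit it to a public repository by accident.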

Now that we have all the credentials with us let’s go ahead and connect to the Supabase database.

from supabase import create_client
import json

API_URL = 'your_url'
API_KEY = 'your_key'
supabase = create_client(API_URL, API_KEY)
supabase

Now that we have successfully connected to our database let’s understand how to perform basic CRUD operations.

Let us first see how we can insert a single record into a table in our database. Each record needs to be a dictionary containing all the information. To insert a record into a table, we use the “insert()” function, which accepts the record data as a dictionary.

data = {
    'id': 1,
    'name': 'Vishnu',
    'age': 22,
    'country': 'India',
    'programming_languages': json.dumps(['C++', 'python', 'Rust'])
}
supabase.table('demo-database').insert(data).execute()  # inserting one record

APIResponse(data=[{'id': 1, 'created_at': '2023-07-17T08:58:24.105377+00:00', 'name': 'Vishnu', 'age': 22, 'country': 'India', 'programming_languages': '["C++", "python", "Rust"]'}], count=None)

Now the same “insert()” function can also be used to insert multiple records. To do so, we must pass in a list of dictionaries containing all the records’ data. Let’s see this in action now.

data = [
    {'id': 2, 'name': 'Prakash', 'age': 37, 'country': 'India', 'programming_languages': json.dumps(['C#', 'web assembly'])},
    {'id': 3, 'name': 'Arjun', 'age': 29, 'country': 'Germany', 'programming_languages': json.dumps(['python', 'nodejs', 'Rust'])},
    {'id': 4, 'name': 'Sanjay', 'age': 19, 'country': 'India', 'programming_languages': json.dumps(['python'])},
    {'id': 5, 'name': 'Ram', 'age': 44, 'country': 'India', 'programming_languages': json.dumps(['python', 'Go'])}
]
supabase.table('demo-database').insert(data).execute()  # inserting multiple records

APIResponse(data=[{'id': 2, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Prakash', 'age': 37, 'country': 'India', 'programming_languages': '["C#", "web assembly"]'}, {'id': 3, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Arjun', 'age': 29, 'country': 'Germany', 'programming_languages': '["python", "nodejs", "Rust"]'}, {'id': 4, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Sanjay', 'age': 19, 'country': 'India', 'programming_languages': '["python"]'}, {'id': 5, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Ram', 'age': 44, 'country': 'India', 'programming_languages': '["python", "Go"]'}], count=None)

We can see that all the records are also visible in our dashboard. Now that we have understood how to perform the insertion operation, let’s see how we can fetch some documents from the table.

To fetch some records from the table, we can use the “select()” function just like the select SQL query. Let’s see this in action to have a better understanding.

supabase.table('demo-database').select('*').execute().data  # fetching documents

[{'id': 1, 'created_at': '2023-07-17T08:58:24.105377+00:00', 'name': 'Vishnu', 'age': 22, 'country': 'India', 'programming_languages': '["C++", "python", "Rust"]'}, {'id': 2, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Prakash', 'age': 37, 'country': 'India', 'programming_languages': '["C#", "web assembly"]'}, {'id': 3, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Arjun', 'age': 29, 'country': 'Germany', 'programming_languages': '["python", "nodejs", "Rust"]'}, {'id': 4, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Sanjay', 'age': 19, 'country': 'India', 'programming_languages': '["python"]'}, {'id': 5, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Ram', 'age': 44, 'country': 'India', 'programming_languages': '["python", "Go"]'}]

The “*” represents that we need all the columns to be returned from the table. This can be done using the SQL editor from our dashboard as well. Let’s go ahead and write a simple SQL query and see how it works.

We can see that we get the same response here as well. Select queries are of limited use without some form of filtering. You can understand this part easily if you are slightly familiar with MongoDB queries. Now let’s see a couple of filtering techniques in action.

Let’s try to fetch all the records where the age exceeds 35 years. To execute this, we can make use of filtering techniques. In this case, we will be using the “gt()” function, which stands for the “greater than” operator. Let’s solve this by running a normal SQL query as well in the SQL editor.

supabase.table('demo-database').select('*').gt('age', 35).execute().data  # fetching documents with filtering

[{'id': 2, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Prakash', 'age': 37, 'country': 'India', 'programming_languages': '["C#", "web assembly"]'}, {'id': 5, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Ram', 'age': 44, 'country': 'India', 'programming_languages': '["python", "Go"]'}]

We can see that the SQL query also returns the same set of records as the response. Now we can add multiple sets of filters as well. To see this in action, let’s try to fetch all the records with ages greater than 35 but less than 40.

supabase.table('demo-database').select('*').gt('age', 35).lt('age', 40).execute().data  # multiple filtering

[{'id': 2, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Prakash', 'age': 37, 'country': 'India', 'programming_languages': '["C#", "web assembly"]'}]
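The SQL counterparts of these calls can be tried locally. The sketch below is a stand-in, not the Supabase SQL editor: it uses Python's built-in sqlite3 module with an in-memory table that mirrors the demo records (the created_at and programming_languages columns are left out for brevity). The same SELECT statements should also be valid in Postgres, where the hyphenated table name likewise needs double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "demo-database" '
    "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, country TEXT)"
)
rows = [
    (1, "Vishnu", 22, "India"),
    (2, "Prakash", 37, "India"),
    (3, "Arjun", 29, "Germany"),
    (4, "Sanjay", 19, "India"),
    (5, "Ram", 44, "India"),
]
conn.executemany('INSERT INTO "demo-database" VALUES (?, ?, ?, ?)', rows)

# select('*')
all_rows = conn.execute('SELECT * FROM "demo-database" ORDER BY id').fetchall()
# gt('age', 35)
over_35 = conn.execute(
    'SELECT name FROM "demo-database" WHERE age > 35 ORDER BY id'
).fetchall()
# gt('age', 35).lt('age', 40)
between = conn.execute(
    'SELECT name FROM "demo-database" WHERE age > 35 AND age < 40 ORDER BY id'
).fetchall()

print(len(all_rows))  # 5
print(over_35)        # [('Prakash',), ('Ram',)]
print(between)        # [('Prakash',)]
```

Each chained filter in the Python client corresponds to one more condition joined with AND in the WHERE clause.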

Now that we have seen how records can be fetched from the table let’s look at how we can update records in the table. For this, we make use of the “update()” function. Let’s try to update the country to France for the record having id as 2. We make use of the “eq()” function that serves as the “equal to” operator for filtering. If we do not include this, all the records in the table will be updated.

supabase.table('demo-database').update({"country": "France"}).eq("id", 2).execute()  # updating a record

APIResponse(data=[{'id': 2, 'created_at': '2023-07-17T09:02:48.193326+00:00', 'name': 'Prakash', 'age': 37, 'country': 'France', 'programming_languages': '["C#", "web assembly"]'}], count=None)

We can see that the country has been successfully updated to France.

Now let’s look at the last operation, which is the delete operation. To delete a record, we use the “delete()” function along with a filter; otherwise, all the records in the table will be deleted. This is similar to the update operation explained in the previous section. Let’s go ahead and delete a record from the table. We will delete the record having id as 1.

supabase.table("demo-database").delete().eq("id", 1).execute()  # deleting a record

APIResponse(data=[{'id': 1, 'created_at': '2023-07-17T08:58:24.105377+00:00', 'name': 'Vishnu', 'age': 22, 'country': 'India', 'programming_languages': '["C++", "python", "Rust"]'}], count=None)

We can see that the corresponding record has been deleted successfully.
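One detail worth noting in the responses above: because programming_languages was serialized with json.dumps before insertion, it comes back as a JSON string rather than a Python list. A minimal sketch for decoding it is shown here; the helper name decode_languages is hypothetical, and the row is a trimmed copy of one of the records returned earlier.

```python
import json


def decode_languages(rows):
    """Return copies of the rows with programming_languages parsed into a list."""
    decoded = []
    for row in rows:
        row = dict(row)  # copy so the original response is not modified
        row["programming_languages"] = json.loads(row["programming_languages"])
        decoded.append(row)
    return decoded


rows = [
    {"id": 2, "name": "Prakash", "programming_languages": '["C#", "web assembly"]'},
]
print(decode_languages(rows)[0]["programming_languages"])  # ['C#', 'web assembly']
```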


A few key takeaways:

Understanding how to set up a Postgres database in Supabase

Learning how to perform basic CRUD operations on the Postgres database using python

Learning how to use the SQL editor to write SQL queries

As mentioned earlier, Supabase provides many other services like authentication, cloud storage, real-time database, and edge functions. The real-time database feature is currently not available for python. I will try to cover the cloud storage service provided by Supabase in another article in the coming weeks, so stay tuned!

That’s it for this article. I hope you enjoyed reading this article and learned something new. Thanks for reading and happy learning!



An Introduction To Python For SEO Pros Using Spreadsheets

2023 far exceeded my expectations in terms of Python adoption within the SEO community.

As we start a new year and I hear more SEO professionals wanting to join in the fun, but frustrated by the initial learning curve, I decided to write this introductory piece with the goal of getting more people involved and contributing.

When you implement the same workflow in Python, you can trivially reproduce the work or even automate the whole workflow.

We are going to learn Python basics while studying code John Mueller recently shared on Twitter that populates Google Sheets. We will modify his code to add a simple visualization.

— 🍌 John 🍌 (@JohnMu) January 3, 2023

Setting up the Python Environment

Similar to working with Excel or Google Sheets, you have two primary options when working with Python.

You can install and run Python on your local computer, or you can run it in the cloud using Google Colab or Jupyter notebooks.

Let’s review each one.

Working with Python on Your Local Computer

I typically choose to work on my Mac when there is software that won’t run in the cloud, for example, when I need to automate a web browser.

You need to download three software packages:


Visual Studio Code.

The Python bindings for Code.

This will take a while to complete.

Once done, search for the Anaconda Navigator and launch it.

You can think of this notebook as similar to a new Excel sheet.

The next step is optional.

I personally use Visual Studio Code when I need to write code in both Python and JavaScript, or when writing JavaScript code alone. You can also use it if you want to convert your notebook code into a command-line script.

It is easier to prototype in Jupyter notebooks and when you get everything to work, you can use Visual Studio Code to put everything together in a script or app that others can use from the command line.

Make sure to install the Python extension for VSC. You can find it here.

Visual Studio Code has built-in support for Jupyter Notebooks.

You can create one by pressing the keyboard combination Command+Shift+P and selecting the option “Python Jupyter Notebook”.

Working with Python in the Cloud

I do most of my Python work on Google Colab notebooks so this is my preferred option.

Learning the basics of Python & Pandas

Mueller shared a Colab notebook that pulls data from Wikipedia and populates a Google Sheet with that data.

Professional programmers need to learn the ins and outs of a programming language, and that can take a lot of time and effort.

For SEO practitioners, I think a simpler approach that involves studying and adapting existing code, could work better. Please share your feedback if you try this and see if I am right.

We are going to review most of the same basics you learn in typical Python programming tutorials, but with a practical context in mind.

Let’s start by saving Mueller’s notebook to your Google Drive.

Here is the example Google sheet with the output of the notebook.

Overall Workflow

Mueller wants to get topic ideas that perform better in mobile compared to desktop.

— 🍌 John 🍌 (@JohnMu) December 30, 2023

He learned that celebrity, entertainment, and medical content does best on mobile.

We have several pieces to the puzzle.

An empty Google sheet with 6 prefilled columns and 7 columns that need to be filled in

The empty Google sheet includes a Pivot table in a separate tab that shows mobile views represent 70.59% of all views in Wikipedia

The helper function receives the names of the columns to update and a function to call that can return the values for the columns.

After all of the columns are populated, we get a final Google sheet that includes an updated Pivot Table with a break down of the topic.
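The helper-function pattern described above, taking column names plus a function that produces their values, can be sketched on plain Python data. This is not Mueller's code: the names update_columns and fake_lookup and the column names are made up, and a list of dictionaries stands in for the Google Sheet.

```python
def update_columns(rows, column_names, fetch_values):
    """Fill the given columns of every row with the values fetch_values returns."""
    for row in rows:
        values = fetch_values(row)  # one value per column name
        for name, value in zip(column_names, values):
            row[name] = value
    return rows


def fake_lookup(row):
    """Stand-in for an API call: derive two values from the article title."""
    title = row["Article"]
    return (len(title), title.upper())


rows = [{"Article": "Avengers: Endgame"}]
update_columns(rows, ["TitleLength", "TitleUpper"], fake_lookup)
print(rows[0]["TitleLength"])  # 17
print(rows[0]["TitleUpper"])   # AVENGERS: ENDGAME
```

Passing the function as an argument is what lets one helper fill many different columns: only the lookup function changes.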

Python Building Blocks

Let’s learn some common Python building blocks while we review how Mueller’s code retrieves values to populate a couple of fields: the PageId and Description.

# Get the Wikipedia page ID -- needed for a bunch of items. Uses "Article" column
def get_PageId(title):
    ...

# Get page description from Wikipedia
def get_description(pageId):
    ...

We have two Python functions to retrieve the fields. Python functions are like functions in Google Sheets but you define their behavior in any way you want. They take input, process it and return an output.

Here is the PageId we get when we call get_PageId("Avengers: Endgame")


Here is the Description we get when we call get_description(pageId)

'2023 superhero film produced by Marvel Studios'

Let’s step through, line by line, the get_PageId function to learn how it gets the ID of the title of the article that we are passing on.

# call the Wikipedia API to get the PageId of the article with the given title.
q = {"action": "query", "format": "json", "prop": "info", "titles": title}

q is a Python dictionary. It holds key-value pairs. If you look up the value of “action”, you get “query” and so on. For example, you’d perform such a lookup using q["action"].

“action” is a Python string. It represents textual information.

“titles”: title maps the “titles” key to the Python variable title that we passed as input to the function. All keys and values are hardcoded and explicit, except for the last one. This is what the dictionary looks like after we execute this function.

q = {"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"}

In the next line we have.

Here we have a Python module function urllib.parse.urlencode. Module functions are just like Google sheet functions that provide standard functionality.

Before we call module or library functions, we need to import the module that contains them.

This line at the top of the notebook does that.

import urllib.parse

Let’s clarify the call and see the output we get.

urllib.parse.urlencode({"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"})

You can find detailed documentation on the urlencode module function here. Its job is to convert a dictionary of URL parameters into a query string. A query string is the part of the URL after the question mark.

This is the output we get after we run it.
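To make the urlencode step concrete, here is a minimal sketch that runs the same call on the example dictionary and prints the resulting query string. It uses only the standard library; the variable name query_string is an illustrative choice, not taken from the notebook.

```python
import urllib.parse

# Query parameters for the Wikipedia API call, as in the dictionary q above.
q = {"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"}

# urlencode percent-encodes reserved characters (":" becomes %3A, a space becomes +)
# and joins the key=value pairs with "&".
query_string = urllib.parse.urlencode(q)
print(query_string)
# action=query&format=json&prop=info&titles=Avengers%3A+Endgame
```

Note how the colon and the space in the title are escaped so that the title can travel safely inside a URL.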


This is what our URL definition line looks like after we add the result of urlencode.

The + sign here concatenates the strings to form one.

This resulting string is the API request the notebook sends to Wikipedia.

In the next line of code, we open the dynamically generated URL.

response = requests.get(url)

requests.get is a Python third-party module function. You need to install third-party libraries using the Python tool pip.

!pip install --upgrade -q requests

You can run command line scripts and tools from a notebook by prepending them with !

The code after ! is not Python code. It is Unix shell code. This article provides a comprehensive list of the most common shell commands.

After you install the third-party module, you need to import it like you do with standard libraries.

import requests

Here is what the translated call looks like.

You can open this request in the browser and see the API response from Wikipedia. The function call allows us to do this without manually opening a web browser.

The result of the requests.get call gets stored in the Python variable response.

This is what the result looks like.

{"batchcomplete": "",
 "query": {"pages": {"44254295": {"contentmodel": "wikitext",
                                  "lastrevid": 933501003,
                                  "length": 177114,
                                  "ns": 0,
                                  "pageid": 44254295,
                                  "pagelanguage": "en",
                                  "pagelanguagedir": "ltr",
                                  "pagelanguagehtmlcode": "en",
                                  "title": "Avengers: Endgame",
                                  "touched": "2023-01-03T17:13:02Z"}}}}

You can think of this complex data structure as a dictionary where some values include other dictionaries and so forth.

The next line of code slices and dices this data structure to extract the PageId.

result = list(response.json()["query"]["pages"].keys())[0]

Let’s step through it to see how it gets it.


When we look up the value for the key “query”, we get a smaller dictionary.

{"pages": {"44254295": {"contentmodel": "wikitext",
                        "lastrevid": 933501003,
                        "length": 177114,
                        "ns": 0,
                        "pageid": 44254295,
                        "pagelanguage": "en",
                        "pagelanguagedir": "ltr",
                        "pagelanguagehtmlcode": "en",
                        "title": "Avengers: Endgame",
                        "touched": "2023-01-03T17:13:02Z"}}}

Then, we look up the value of “pages” in this smaller dictionary.


We get an even smaller one. We are drilling down on the big response data structure.

{"44254295": {"contentmodel": "wikitext",
              "lastrevid": 933501003,
              "length": 177114,
              "ns": 0,
              "pageid": 44254295,
              "pagelanguage": "en",
              "pagelanguagedir": "ltr",
              "pagelanguagehtmlcode": "en",
              "title": "Avengers: Endgame",
              "touched": "2023-01-03T17:13:02Z"}}

The PageId is available in two places in this slice of the data structure. As the only key, or as a value in the nested dictionary.

John made the most sensible choice, which is to use the key to avoid further exploration.


The response from this call is a Python dictionary view of the keys. You can learn more about dictionary view in this article.


We have what we are looking for, but not in the right format.

In the next step, we convert the dictionary view into a Python list.


This is what the conversion looks like.


Python lists are like rows in a Google sheet. They generally contain multiple values separated by commas, but in this case, there is only one.

Finally, we extract the only element that we care about from the list. The first one.


The first element in Python lists starts at index 0.

Here is the final result.


As this is an identifier, it is better to keep it as a string, but if we needed a number to perform arithmetic operations, we would do another transformation.


In this case, we get a Python integer.
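The drill-down steps described above can be replayed offline. The sketch below uses a trimmed, hand-copied version of the response (an assumption made so that no network call is needed) and names each intermediate value; the variable names are illustrative and do not come from the notebook.

```python
# A trimmed copy of the API response shown above (only the fields used here).
data = {
    "batchcomplete": "",
    "query": {"pages": {"44254295": {"pageid": 44254295, "title": "Avengers: Endgame"}}},
}

pages = data["query"]["pages"]  # innermost dictionary, keyed by the page ID
keys_view = pages.keys()        # dictionary view of the keys
keys_list = list(keys_view)     # the same keys as a plain Python list
page_id = keys_list[0]          # first (and only) element, still a string

page_id_number = int(page_id)   # convert to an integer if arithmetic is needed

print(keys_view)       # dict_keys(['44254295'])
print(keys_list)       # ['44254295']
print(page_id)         # 44254295
print(page_id_number)  # 44254295
```

The last two printed values look identical, but the first is a string and the second is an integer.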


The main differences between strings and integers are the types of operations that you can perform with them. As you saw before, we can use the + operator to concatenate two strings, but if we used the same operator on two numbers, it would add them together.

"44254295" + "3" = "442542953"
44254295 + 3 = 44254298

As a side note, I should mention jq, a cool command line tool that allows you to slice and dice JSON responses directly from curl calls (another awesome command line tool). curl allows you to do the equivalent of what we are doing with the requests module here, but with limitations.

So far we’ve learned how to create functions and data types that allow us to extract data and filter data from third-party sites (Wikipedia in our case).

Let’s call the next function in John’s notebook to learn another important building block: flow control structures.


This is what the API URL looks like. You can try it in the browser.

Here is what the response looks like.

{"ns": 0,
 "pageid": 44254295,
 "terms": {"alias": ["Avengers Endgame", "Avengers End Game", "Avengers 4"],
           "description": ["2023 superhero film produced by Marvel Studios"],
           "label": ["Avengers: Endgame"]},
 "title": "Avengers: Endgame"}

This is the code that we will step through to understand control flow in Python.

# some pages don't have descriptions, so we can't blindly grab the value
if "terms" in rs and "description" in rs["terms"]:
    result = rs["terms"]["description"][0]
else:
    result = ""
return result

This part checks if the response structure (above) includes a key named “terms”. It uses the Python If … Else control flow operator. Control flow operators are the algorithmic building blocks of programs in most languages, including Python.

if "terms" in rs

If this check is successful, we look up the value of that key with rs["terms"].

We expect the result to be another dictionary and check it to see if there is a key named “description”.

"description" in rs["terms"]

If both checks are successful, then we extract and store the description value.

result = rs["terms"]["description"][0]

We expect the final value to be a Python list, and we only want the first element as we did before.

The and Python logical operator combines both checks into one where both need to be true for it to be true.
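The two checks combined with and can be tried out on small hand-made dictionaries. The following is a minimal sketch; the function name extract_description and the sample inputs are hypothetical and only mirror the structure of the response shown earlier.

```python
def extract_description(rs):
    """Return the first description in rs, or an empty string if there is none."""
    # Both conditions must hold; "and" stops early if "terms" is missing,
    # so rs["terms"] is never looked up on a dictionary that lacks it.
    if "terms" in rs and "description" in rs["terms"]:
        return rs["terms"]["description"][0]
    return ""


with_description = {"terms": {"description": ["2023 superhero film produced by Marvel Studios"]}}
without_description = {"terms": {"label": ["Avengers: Endgame"]}}
without_terms = {"pageid": 44254295}

print(extract_description(with_description))           # 2023 superhero film produced by Marvel Studios
print(repr(extract_description(without_description)))  # ''
print(repr(extract_description(without_terms)))        # ''
```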

If the check is false, the description is an empty string.

result = ""

Populating Google Sheets from Python

With a solid understanding of Python basic building blocks, now we can focus on the most exciting part of Mueller’s notebook: automatically populating Google Sheets with the values we are pulling from Wikipedia.

# by ‘functionToCall(parameterName)’. Show a progressbar while doing so.
# Only calculate / update rows without values there, unless forceUpdate=True.

Let’s step through some interesting parts of this function.

The functionality to update Google Sheets is covered by a third-party module.

We need to install it and import it before we can use it.

!pip install --upgrade -q gspread
import gspread

At the end of every helper function that fills a column, we have a call like the one above.

We are passing the relevant columns and the function that will get the corresponding values.

columnNr = df.columns.get_loc(fieldName) + 1 # column number of output field

The first thing we want to know is which column we need to update. When we run the code above we get 7, which is the column position of the PageId in the sheet (starting with 1).

for index, row in df.iterrows():

In this line of code, we have another control flow operator, the Python for loop. For loops allow you to iterate over the elements of collections, for example, lists and dictionaries.

In our case above, we are iterating over the rows of a pandas DataFrame, where the index variable will hold the row label, and the row variable will hold the values of that row.

To be more precise, df.iterrows() yields one (index, row) pair at a time, and each row is a pandas Series, which you can look up by column name much like a dictionary.

When you print iterrows, you don’t actually get the values, but a Python iterator object.

Iterators are objects that produce values on demand, so they require less memory than building the whole collection up front.

INDEX: 2
ROW:
Article             César Alonso de las Heras
Views               1,944,569
PartMobile          79.06%
ViewsMobile         1,537,376
ViewsDesktop        407,193
PageId              18247033
Description
WikiInLinks
WikiOutLinks
ExtOutLinks
WikidataId
WikidataInstance
Name: 2, dtype: object

This is an example iteration of the for loop. I printed the index and row values.
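To see what iterrows hands back on each pass, here is a small sketch on a two-row DataFrame. The data is made up for illustration (only the column names Article and PageId follow the sheet), and it assumes pandas is installed.

```python
import pandas as pd

# A tiny stand-in for the DataFrame built from the Google Sheet.
df = pd.DataFrame({
    "Article": ["Avengers: Endgame", "César Alonso de las Heras"],
    "PageId": ["44254295", ""],
})

rows = df.iterrows()
print(type(rows).__name__)  # an iterator-like object, not the values themselves

seen = []
for index, row in df.iterrows():
    # index is the row label; row is a pandas Series addressed by column name.
    seen.append((index, row["Article"]))
    print("INDEX:", index, "ARTICLE:", row["Article"], "PAGEID:", repr(row["PageId"]))
```

Printing row on its own inside the loop gives the column-by-column listing shown in the example iteration above.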

# if we already did it, don't recalculate unless 'forceUpdate' is set.
if forceUpdate or not row[fieldName]:
    result = functionToCall(row[parameterName])

forceUpdate is a Python boolean value which defaults to False. Booleans can only be true or false.

row["PageId"] is empty initially, so not row["PageId"] is true and the next line will execute. The or operator allows the next line to execute for subsequent runs only when the flag forceUpdate is true.

result = get_PageId(row["Article"])

This is the code that calls our custom function to get the page ids.

The result value for the example iteration is 39728003
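The skip-or-recalculate logic can be exercised without touching Google Sheets or Wikipedia. The sketch below is an assumption-laden stand-in for the notebook's helper: fill_column and fake_get_page_id are hypothetical names, the lookup function makes no network call, and results are written only to the local DataFrame.

```python
import pandas as pd


def fill_column(df, field_name, parameter_name, function_to_call, force_update=False):
    """Fill df[field_name] from df[parameter_name]; return how many rows were computed."""
    calls = 0
    for index, row in df.iterrows():
        # An empty string is falsy, so "not row[field_name]" is True for unfilled cells.
        if force_update or not row[field_name]:
            df.at[index, field_name] = function_to_call(row[parameter_name])
            calls += 1
    return calls


def fake_get_page_id(title):
    # Hypothetical stand-in for get_PageId: returns a placeholder instead of calling the API.
    return "id-for-" + title


df = pd.DataFrame({"Article": ["A", "B"], "PageId": ["", "123"]})

first_run = fill_column(df, "PageId", "Article", fake_get_page_id)   # only row "A" is empty
second_run = fill_column(df, "PageId", "Article", fake_get_page_id)  # nothing left to fill
forced_run = fill_column(df, "PageId", "Article", fake_get_page_id, force_update=True)

print(first_run, second_run, forced_run)  # 1 0 2
```

The second run does no work because every cell already has a value, which is why the flag exists for cases where you want to refresh the data anyway.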

When you review the function carefully, you will notice that we use df which is not defined in the function. The code that does that is at the beginning of the notebook.

# Convert to a DataFrame and render.
# (A DataFrame is overkill, but I wanted to play with them more :))
import pandas as pd
df = pd.DataFrame.from_records(worksheetRows)

The code uses the third-party module pandas to create a data frame from the Google Sheet rows. I recommend reading this 10 minutes to pandas article to get familiar. It is a very powerful data manipulation library.

Finally, let’s see how we update the Google Sheet.

row[fieldName] = result  # save locally
worksheet.update_cell(index+1, columnNr, result)  # update sheet too

This code can be translated to.

row["PageId"] = 39728003  # save locally
worksheet.update_cell(3+1, 7, 39728003)  # update sheet too

# (This is always confusing, but it works)
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheetRows = worksheet.get_all_values()

I left this code for last because it is more complicated than the previous code. However, it is the first thing you need to execute in the notebook.

First, we import the third-party module gspread, and complete an OAuth authentication in Chrome to get access to Google Sheets.

worksheet = gc.open("Wikipedia-Views-2023").sheet1
worksheetRows = worksheet.get_all_values()

We manipulate the Google sheet with the worksheet variable and we use the worksheetRows variable to create the pandas DataFrame.

Visualizing from Python

Now we get to your homework.

I wrote code to partially reproduce John’s pivot table and plot a simple bar chart.

Your job is to add this code to your copy of the notebook and add print(variable_name) statements to understand what I am doing. This is how I analyzed John’s code.

Here is the code.

#Visualize from Python
import numpy as np

df.groupby("WikidataInstance").agg({"ViewsMobile": np.sum, "ViewsDesktop": np.sum})

# the aggregation doesn't work because the numbers include commas
# This gives an error ValueError: Unable to parse string "1,038,950,248" at position 0
#pd.to_numeric(df["ViewsMobile"])

# StackOverflow is your friend :)
import locale
from locale import atoi
locale.setlocale(locale.LC_NUMERIC, '')

#df[["ViewsMobile", "ViewsDesktop"]].applymap(atoi)
df["ViewsMobile"] = df["ViewsMobile"].apply(atoi)
df["ViewsDesktop"] = df["ViewsDesktop"].apply(atoi)

# We try again and it works
totals_df = df.groupby("WikidataInstance").agg({"ViewsMobile": np.sum, "ViewsDesktop": np.sum})
totals_df

#Here we plot
totals_df.head(20).plot(kind="bar")
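If the locale-based parsing behaves differently on your machine, one alternative is to strip the thousands separators as plain text before converting. This is a sketch of a swapped-in technique, not the approach used in the code above, and the three rows of data are invented for illustration.

```python
import pandas as pd

# Made-up sample rows; in the notebook these columns come from the Google Sheet.
df = pd.DataFrame({
    "WikidataInstance": ["film", "film", "human"],
    "ViewsMobile": ["1,000", "2,500", "300"],
    "ViewsDesktop": ["400", "1,100", "50"],
})

for column in ["ViewsMobile", "ViewsDesktop"]:
    # Remove the commas as literal text, then convert the strings to integers.
    df[column] = df[column].str.replace(",", "", regex=False).astype(int)

totals_df = df.groupby("WikidataInstance")[["ViewsMobile", "ViewsDesktop"]].sum()
print(totals_df)
```

This only works when the comma is always a thousands separator, which appears to be the case for the view counts in this sheet.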

Resources to Learn More

Formula Challenge #2: Matching Terms

This Formula Challenge originally appeared as part of Google Sheets Tip #52, my weekly newsletter, on 27 May 2023.

Sign up here so you don’t miss out on future Formula Challenges:

Find all the Formula Challenges archived here.

Your Challenge

Start with this small data table in your Google Sheet:

Your challenge is to create a single-cell formula that takes a string of search Terms and returns all the Results that have at least one matching term in the Terms column.

For example, this search (in cell E2 say)

Raspberries, Orange, Apple

would return the results (in cell F2 say):


like this (where the yellow is your formula):

Check out the ready-made Formula Challenge template.

The Solution

Solution One: Using the FILTER function

or even:

These elegant solutions were also the shortest solutions submitted.

There were a lot of similar entries that had an ArrayFormula function inside the Filter, but this is not required since the Filter function will output an array automatically.

How does this formula work?

Let’s begin in the middle and rebuild the formula in steps:

=SPLIT(E2, ", ")


The SPLIT function outputs the three fruits from cell E2 into separate cells:

Raspberries    Orange    Apple

so the output is now:

Then bring the power of regular expression formulas in Google Sheets to the table, to match the data in column B. The pipe character means “OR” in regular expressions, so this formula will match Raspberries OR Orange OR Apple in column B:

On its own, this formula will return a #VALUE! error message. (Wrap this with the ArrayFormula function if you want to see what the array of TRUE and FALSE values looks like.)

However, when we put this inside of a FILTER function, the correct array value is passed in:

and returns the desired output. Kaboom!
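For readers more at home in Python, the same split, join with a pipe, and regular-expression filter idea can be sketched as follows. This is an analogy, not the spreadsheet formula itself: the four rows are hypothetical stand-ins for the Results and Terms columns, and re.escape is an extra safety step added here.

```python
import re

# Hypothetical rows standing in for the Results (column A) and Terms (column B) columns.
rows = [
    ("Result 1", "Apple, Banana"),
    ("Result 2", "Kiwi"),
    ("Result 3", "Orange, Pear"),
    ("Result 4", "Grape"),
]
search = "Raspberries, Orange, Apple"

terms = search.split(", ")                       # split the search string into terms
pattern = "|".join(re.escape(t) for t in terms)  # "|" means OR in a regular expression
matches = [result for result, row_terms in rows if re.search(pattern, row_terms)]

print(pattern)  # Raspberries|Orange|Apple
print(matches)  # ['Result 1', 'Result 3']
```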

Solution Two: Using the QUERY function




"select A where B contains '"


"' or B contains '"




", "




As with solution one, there is no requirement to use an ArrayFormula anywhere. Impressive!

This formula takes a different approach to solution one and uses the QUERY function to filter the rows of data.

The heart of the formula is similar though, splitting out the input terms into an array, then recombining them to use as filter conditions.

=JOIN("' or B contains '", SPLIT(E2, ", "))


which outputs a clause ready to insert into your query function, viz:

Raspberries' or B contains 'Orange' or B contains 'Apple

The QUERY function uses a pseudo-SQL language to parse your data. It returns rows from column A, whenever column B contains Raspberries OR Orange OR Apple.


I hope you enjoyed this challenge and learnt something from it. I really enjoyed reading all the submissions and definitely learnt some new tricks myself.

SPLIT function caveats

There are two dangers with the Split function which are important to keep in mind when using it (thanks to Christopher D. for pointing these out to me).

Caveat 1

The SPLIT function uses all of the characters you provide in the input.

=SPLIT("First sentence, Second sentence", ", ")


will split into FOUR parts, not two, because the comma and the space are used as delimiters. The output will therefore be:

First    sentence    Second    sentence

across four cells.
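The difference between splitting on each delimiter character and splitting on the whole delimiter can be imitated in Python. This is only an emulation of the behaviour described above, written under the assumption that empty pieces are dropped; it is not how Google Sheets is implemented.

```python
import re

text = "First sentence, Second sentence"

# Treat each character of ", " as its own delimiter and drop the empty pieces.
each_character = [part for part in re.split(r"[, ]", text) if part]

# Treat ", " as one two-character delimiter instead.
whole_delimiter = text.split(", ")

print(each_character)   # ['First', 'sentence', 'Second', 'sentence']
print(whole_delimiter)  # ['First sentence', 'Second sentence']
```

In Google Sheets, SPLIT takes an optional third argument, split_by_each, which can be set to FALSE to get the second behaviour.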

Caveat 2

Datatypes may change when they are split, viz:


"Lisa, 01"




gives an output of

Lisa    1

where the string has been converted into a number, namely 1.

See the other Formula Challenges here.

What Is Selenium? Introduction To Selenium Automation Testing

What is Selenium?

Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms. You can use multiple programming languages like Java, C#, Python, etc to create Selenium Test Scripts. Testing done using the Selenium testing tool is usually referred to as Selenium Testing.

Selenium Tool Suite

Selenium Software is not just a single tool but a suite of software, each piece catering to different Selenium QA testing needs of an organization. Here is the list of tools

Selenium Integrated Development Environment (IDE)

Selenium Remote Control (RC)

WebDriver

Selenium Grid

At the moment, Selenium RC and WebDriver are merged into a single framework to form Selenium 2. Selenium 1, by the way, refers to Selenium RC.

Video Tutorial Selenium

Who developed Selenium?

Since Selenium is a collection of different tools, it also had different developers. Below are the key persons who made notable contributions to the Selenium Project

Primarily, Selenium was created by Jason Huggins in 2004. An engineer at ThoughtWorks, he was working on a web application that required frequent testing. Having realized that their application’s repetitious Manual Testing was becoming increasingly inefficient, he created a JavaScript program that would automatically control the browser’s actions. He named this program the “JavaScriptTestRunner.”

Seeing potential in this idea to help automate other web applications, he made JavaScriptTestRunner open-source, and it was later renamed Selenium Core. For those interested in exploring other options for web application testing, take a look at these Selenium alternatives.

The Same Origin Policy Issue

The same origin policy prohibits JavaScript code from accessing elements from a domain that is different from where it was launched. For example, suppose the HTML code on one site uses a JavaScript program “randomScript.js”. The same origin policy will only allow randomScript.js to access pages within that same site. However, it cannot access pages from other sites because they belong to different domains.

This is the reason why, prior to Selenium RC, testers needed to install local copies of both Selenium Core (a JavaScript program) and the web server containing the web application being tested, so they would belong to the same domain.
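The comparison a browser makes can be sketched in a few lines. Browsers treat two URLs as having the same origin when scheme, host, and port all match; the helper names and the example.com and example.org URLs below are illustrative assumptions, and real browsers apply additional rules not modelled here.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies the origin of a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a, url_b):
    return origin(url_a) == origin(url_b)


print(same_origin("https://example.com/mail", "https://example.com/login"))  # True
print(same_origin("https://example.com/", "https://example.org/"))           # False
print(same_origin("http://example.com/", "https://example.com/"))            # False
```

Serving Selenium Core and the application from the same local web server, as described above, is what made the first case apply.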

Birth of Selenium Remote Control (Selenium RC)

Unfortunately, testers using Selenium Core had to install the whole application under test and the web server on their own local computers because of the restrictions imposed by the same origin policy. So another ThoughtWorks engineer, Paul Hammant, decided to create a server that would act as an HTTP proxy to “trick” the browser into believing that Selenium Core and the web application being tested come from the same domain. This system became known as Selenium Remote Control or Selenium 1.

Birth of Selenium Grid

Birth of Selenium IDE

Shinya Kasatani of Japan created Selenium IDE, a Firefox and Chrome extension that can automate the browser through a record-and-playback feature. He came up with this idea to further increase the speed in creating test cases. He donated Selenium IDE to the Selenium Project in 2006.

Birth of WebDriver

Simon Stewart created WebDriver circa 2006 when browsers and web applications were becoming more powerful and more restrictive with JavaScript programs like Selenium Core. It was the first cross-platform testing framework that could control the browser from the OS level.

Birth of Selenium 2

In 2008, the whole Selenium Team decided to merge WebDriver and Selenium RC to form a more powerful tool called Selenium 2, with WebDriver being the core. Currently, Selenium RC is still being developed but only in maintenance mode. Most of the Selenium Project’s efforts are now focused on Selenium 2.

So, Why the Name Selenium?

The name Selenium came from a joke that Jason once cracked to his team. During Selenium’s development, another popular automated testing framework was made by a company called Mercury Interactive (yes, the company that originally made QTP before it was acquired by HP). Since selenium is a well-known antidote for mercury poisoning, Jason suggested that name and his teammates took it. That is how the framework got the name it still carries today.

What is Selenium IDE?

What is Selenium Remote Control (Selenium RC)?

Selenium RC was the flagship testing framework of the whole Selenium project for a long time. This is the first automated web testing tool that allows users to use a programming language they prefer. As of version 2.25.0, RC can support the following programming languages:

What is WebDriver?

The WebDriver proves to be better than Selenium IDE and Selenium RC in many aspects. It implements a more modern and stable approach in automating the browser’s actions. WebDriver, unlike Selenium RC, does not rely on JavaScript for Selenium Automation Testing. It controls the browser by directly communicating with it.

The supported languages are the same as those in Selenium RC.







What is Selenium Grid?

Selenium Grid is a tool used together with Selenium RC to run parallel tests across different machines and different browsers all at the same time. Parallel execution means running multiple tests at once.


Enables simultaneous running of tests in multiple browsers and environments.

Saves time enormously.

Utilizes the hub-and-nodes concept. The hub acts as a central source of Selenium commands to each node connected to it.

Selenium Browser and Environment Support

Because of their architectural differences, Selenium IDE, Selenium RC, and WebDriver support different sets of browsers and operating environments.

Browser Support
Selenium IDE: Mozilla Firefox and Chrome
WebDriver: Google Chrome 12+, Internet Explorer 7+ and Edge, HtmlUnit and PhantomUnit

Operating System
Selenium IDE: Windows, Mac OS X, Linux
WebDriver: All operating systems where the browsers above can run.

Note: Opera Driver no longer works

How to Choose the Right Selenium Tool for Your Need

Tool Why Choose?

Selenium IDE

To learn about concepts on automated testing and Selenium, including:

Locators such as id, name, xpath, css selector, etc.

Executing customized JavaScript code using runScript

Exporting test cases in various formats.

To create tests with little or no prior knowledge in programming.

To create simple test cases and test suites that you can export later to RC or WebDriver.

To test a web application against Firefox and Chrome only.

Selenium RC

To design a test using a more expressive language than Selenese

To run your test against different browsers (except HtmlUnit) on different operating systems.

To deploy your tests across multiple environments using Selenium Grid.

To test your application against a new browser that supports JavaScript.

To test web applications with complex AJAX-based scenarios.

WebDriver

To use a certain programming language in designing your test case.

To test applications that are rich in AJAX-based functionalities.

To execute tests on the HtmlUnit browser.

To create customized test results.

Selenium Grid

To run your Selenium RC scripts in multiple browsers and operating systems simultaneously.

To run a huge test suite, that needs to complete in the soonest time possible.

A Comparison between Selenium and QTP(now UFT)

Advantages and Benefits of Selenium over QTP

Selenium | QTP
Open source, free to use, and free of charge. | Commercial.
Highly extensible | Limited add-ons
Can run tests across different browsers | Can only run tests in Firefox, Internet Explorer and Chrome
Supports various operating systems | Can only be used in Windows
Supports mobile devices | Supports mobile app test automation (iOS & Android) using an HP solution called HP Mobile Center
Can execute tests while the browser is minimized | Needs to have the application under test visible on the desktop
Can execute tests in parallel. | Can execute in parallel only by using Quality Center, which is again a paid product.

Advantages of QTP over Selenium

QTP | Selenium
Can test both web and desktop applications | Can only test web applications
Comes with a built-in object repository | Has no built-in object repository
Automates faster than Selenium because it is a fully featured IDE. | Automates at a slower rate because it does not have a native IDE, and only third-party IDEs can be used for development.
Data-driven testing is easier to perform because it has built-in global and local data tables. | Data-driven testing is more cumbersome since you have to rely on the programming language’s capabilities for setting values for your test data
Can access controls within the browser (such as the Favorites bar, Address bar, Back and Forward buttons, etc.) | Cannot access elements outside of the web application under test
Provides professional customer support | No official user support is being offered.
Has native capability to export test data into external formats | Has no native capability to export runtime data onto external formats
Parameterization support is built in | Parameterization can be done via programming but is difficult to implement.
Test reports are generated automatically | No native support to generate test/bug reports.

Cost(because Selenium is completely free)

Flexibility(because of a number of programming languages, browsers, and platforms it can support)

Parallel testing(something that QTP is capable of but only with use of Quality Center)


The entire Selenium Software Testing Suite is comprised of four components:

Selenium IDE, a Firefox and Chrome add-on that you can only use in creating relatively simple test cases and test suites.

Selenium Remote Control, also known as Selenium 1, is the first Selenium tool that allowed users to use programming languages in creating complex tests.

WebDriver is the newer breakthrough that allows your test scripts to communicate directly with the browser, thereby controlling it from the OS level.

Selenium Grid is also a tool that is used with Selenium RC to execute parallel tests across different browsers and operating systems.

Selenium RC and WebDriver were merged to form Selenium 2.

Brain MRI Segmentation With 0.95 Dice Score

Data preprocessing is the most crucial step here, since this is where we do most of our preprocessing and feature engineering. It turns out to be one of the major features of our case study solution.

First, let’s have a look at all MR images present for a single patient “TCGA_CS_4941”. Here the red circle shows the area where you can identify a tumor.

As we can see, a significant number of images contain tumors. But since we are not trained radiologists or doctors, we need to develop masked images using the masks already provided with the dataset.

From the above images, we can observe that not all colours in an image are equally useful. Whenever there is a tumor, it gets highlighted in green, so a region of high-intensity green may indicate a tumor. The images are also not sharp enough to bring out the high intensity of each colour, so we need an image sharpener as well. We don’t need to perform image augmentation, as these are image outputs from a standard medical machine. On the basis of these observations, we can create a custom data loader class that combines image preprocessing and data loading and follows this flow chart.

Image preprocessing and Data loading

View the code on Gist.

Here, we just need to pass the dataset class a pandas data frame with an image path and a mask path along with the patient name, and it will return a tuple that contains the image and the mask.

This tuple is then passed to the Dataloader, which, based on the batch size provided, transforms it into a data set that can be loaded by the model.

View the code on Gist.

Here you can see that I manually marked the tumor area in red. You can also observe that this area is fairly easy to visualize, as it is already marked with high-intensity green.

Metrics and Losses

As these are image segmentation tasks, their evaluation metrics are non-trivial. We need a pixel-wise comparison between the actual mask and the predicted mask.

Therefore there are 2 proposed metrics for semantic segmentation tasks

Intersection Over Union(Jaccard Index)

IOU is the area of overlap between the predicted segmentation and the ground truth, divided by the area of their union. IOU is defined in the range 0 to 1. Here 0 means that no area overlaps, whereas 1 means that there is no noise and the predicted and actual areas overlap completely.

Dice Score(F1 for Semantic segmentation)

Dice score is the metric that we will use for evaluation in our case study. It was used in the paper, and it has since been used to compare models against one another.

Dice Coefficient = 2 * the Area of Overlap divided by the total number of pixels in both images.
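Both metrics can be checked on a toy example. The following is a minimal NumPy sketch on two small binary masks, not the Keras implementation from the Gist; the function names are illustrative, and returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np


def iou_score(y_true, y_pred):
    """Intersection over union of two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return intersection / union if union else 1.0  # assumed convention for empty masks


def dice_score(y_true, y_pred):
    """2 * overlap divided by the total number of mask pixels in both images."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    total = y_true.sum() + y_pred.sum()
    return 2 * intersection / total if total else 1.0  # assumed convention for empty masks


actual = np.array([[1, 1, 0], [0, 1, 0]])
predicted = np.array([[1, 0, 0], [0, 1, 1]])

# Overlap is 2 pixels, union is 4 pixels, and each mask has 3 pixels.
print(iou_score(actual, predicted))   # 0.5
print(dice_score(actual, predicted))  # 2 * 2 / 6, about 0.667
```

For the same pair of masks the Dice score is never lower than the IOU, which is worth keeping in mind when comparing numbers reported with different metrics.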

Losses and metrics can be obtained in Keras using

View the code on Gist.

Model Selection

After going through various models proposed for biomedical image segmentation, we decided to use Unet and its variants along with transfer learning. Transfer learning helps us reduce training time significantly and obtain better accuracy: in Unet with Resnet50, Resnet50 acts as a backbone that helps to detect features in images and is pretrained on the ImageNet dataset.


This is a Unet architecture with many skip connections. These skip connections help to recover fine-grained features from an image.

This architecture can be implemented in Keras as

This predicts a mask with a Dice score of 0.9 which is a good score and its predicted image can be viewed as

Unet with Resnet as Backbone

This is an architecture with Resnet encoders as the backbone, and the weights of these encoders are frozen.

This shows an exceptional Dice score of 0.946.


Here we can see the results of each model with DICE and IOU metric and we can also conclude unetxresnet is an architecture that fits our needs.

Feature Calculations

Now let’s calculate some important features that will be helpful for doctors to analyze the condition of a patient.

View the code on Gist.

This function returns area, standard deviation and coordinates

Predict Death

Death01 is a feature present in “lgg-mri-segmentation/kaggle_3m/data.csv” which tells us whether a patient is going to die or not. But there are lots of missing values in this sheet that need to be filled in.

To fill in all these unknown values we use an imputer. Here we choose KNNImputer from scikit-learn with n_neighbors=4 and then round the result so as to obtain integer values from the floats.

Join both these data frames using the patient ID as the key and an inner join as the method to create another data frame.

Now these features can be used together with the provided CSV file to predict the death01 feature as y, and it seems that we are able to classify all our points with 100% accuracy.


Now the major task is to feed this with any image in *.tif format; it should be capable of creating a mask for that image and generating the features discussed above. We will not be taking care of the CSV file here. Here we will just generate masks and important features.

For this we will be using Streamlit, and further code can be downloaded from here.


On comparing this with the other solutions available, we found that this approach gave us the best Dice score. Below are some of the key takeaways:

How Deep-learning and transfer learning can be utilized to solve the task of Biomedical image segmentation

What should be the best loss function and evaluation metrics for our task

Generating features from images and utilizing them to predict the death of a patient with 100% accuracy

Use Streamlit to deploy our model for simpler use

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
