Trending March 2024 # Developing New Machine Learning Algorithm Using Openai Gym # Suggested April 2024 # Top 4 Popular

You are reading the article Developing New Machine Learning Algorithm Using Openai Gym updated in March 2024 on the website We hope that the information we have shared is helpful to you. If you find the content interesting and meaningful, please share it with your friends and continue to follow and support us for the latest updates. Suggested April 2024 Developing New Machine Learning Algorithm Using Openai Gym


OpenAI Gym is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D and 3D physical simulations, and so on), so you can train agents, compare them, or develop new Machine Learning algorithms (Reinforcement Learning).

OpenAI is an artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).

Installing OpenAI Gym

In this article, I will be using the OpenAI gym, a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning agents to interact with.

Before installing the toolkit, if you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH # Your ML working directory (e.g., $HOME/ml) $ source my_env/bin/activate # on Linux or MacOS $ .my_envScriptsactivate # on Windows

Next, install OpenAI Gym (if you are not using a virtual environment, you will need to add the –user option, or have administrator rights):

$ python3 -m pip install -U gym

Depending on your system, you may also need to install the Mesa OpenGL Utility (GLU) library (e.g., on Ubuntu 18.04 you need to run apt install libglu1-mesa). This library will be needed to render the first environment.

Next, open up a Python shell or a Jupyter notebook or Google Colab and I will first import all the necessary libraries and then I will create an environment with make():

# Python ≥3.5 is required import sys import sklearn # %tensorflow_version only exists in Colab. %tensorflow_version 2.x !apt update && apt install -y libpq-dev libsdl2-dev swig xorg-dev xvfb !pip install -q -U tf-agents-nightly pyvirtualdisplay gym[atari] IS_COLAB = True except Exception: IS_COLAB = False

# TensorFlow ≥2.0 is required import tensorflow as tf from tensorflow import keras print("No GPU was detected. CNNs can be very slow without a GPU.") if IS_COLAB: import numpy as np import os

# to make this notebook's output stable across runs np.random.seed(42) tf.random.set_seed(42)

# To plot pretty figures %matplotlib inline import matplotlib as mpl import matplotlib.pyplot as plt mpl.rc('axes', labelsize=14) mpl.rc('xtick', labelsize=12) mpl.rc('ytick', labelsize=12)

# To get smooth animations import matplotlib.animation as animation mpl.rc('animation', html='jshtml')

import gym

Let’s list all the available environments:


The Cart-Pole is a very simple environment composed of a cart that can move left or right, and a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.

env = gym.make('CartPole-v1')

Let’s initialize the environment by calling is a reset() method. This returns an observation:

env.seed(42) obs = env.reset()

Observations vary depending on the environment. In this case, it is a 1D NumPy array composed of 4 floats: they represent the cart’s horizontal position, its velocity, the angle of the pole (0 = vertical), and the angular velocity.


array([-0.01258566, -0.00156614, 0.04207708, -0.00180545])

An environment can be visualized by calling its render() method, and you can pick the rendering mode (the rendering options depend on the environment).


In this example, we will set mode=”rgb_array” to get an image of the environment as a NumPy array:

img = env.render(mode="rgb_array") img.shape

(400, 600, 3)

def plot_environment(env, figsize=(5,4)): plt.figure(figsize=figsize) img = env.render(mode="rgb_array") plt.imshow(img) plt.axis("off") return img plot_environment(env)

Let’s see how to interact with the OpenAI Gym environment. Your agent will need to select an action from an “action space” (the set of possible actions). Let’s see what this environment’s action space looks like:



action = 1 # accelerate right obs, reward, done, info = env.step(action) obs

array([-0.01261699, 0.19292789, 0.04204097, -0.28092127])


Looks like it’s doing what we’re telling it to do! The environment also tells the agent how much reward it got during the last step:



When the game is over, the environment returns done=True:



Finally, info is an environment-specific dictionary that can provide some extra information that you may find useful for debugging or for training. For example, in some games, it may indicate how many lives the agent has.



The sequence of steps between the moment the environment is reset until it is done is called an “episode”. At the end of an episode (i.e., when step() returns done=True), you should reset the environment before you continue to use it.

if done: obs = env.reset()

Hardcoding OpenAI Gym using Simple Policy Algorithm

Let’s hardcode a simple policy that accelerates left when the pole is leaning toward the left and accelerates right when the pole is leaning toward the right. We will run this policy to see the average rewards it gets over 500 episodes:


def basic_policy(obs): angle = obs[2] return 0 if angle < 0 else 1

totals = [] for episode in range(500): episode_rewards = 0 obs = env.reset() for step in range(200): action = basic_policy(obs) obs, reward, done, info = env.step(action) episode_rewards += reward if done: break totals.append(episode_rewards)

This code is hopefully self-explanatory. Let’s look at the result:

np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(41.718, 8.858356280936096, 24.0, 68.0)

Well, as expected, this strategy is a bit too basic: the best it did was to keep the poll up for only 68 steps. This environment is considered solved when the agent keeps the poll up for 200 steps.

env.seed(42)frames = []obs = env.reset() for step in range(200): img = env.render(mode="rgb_array") frames.append(img) action = basic_policy(obs) obs, reward, done, info = env.step(action) if done: break

Now show the animation:

def update_scene(num, frames, patch): patch.set_data(frames[num]) return patch,def plot_animation(frames, repeat=False, interval=40): fig = plt.figure() patch = plt.imshow(frames[0]) plt.axis('off') anim = animation.FuncAnimation( fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval) plt.close() return anim plot_animation(frames)

About the Author

Aman Kharwal

I am a programmer from India, and I am here to guide you with Machine Learning for free. I hope you will learn a lot in your journey towards ML and AI with me.

You're reading Developing New Machine Learning Algorithm Using Openai Gym

Clustering Machine Learning Algorithm Using K Means

This article was published as a part of the Data Science Blogathon.


1. Introduction

2. Clustering

3. Types of Clustering

4. K-Means Clustering

5. Finding K value

6. Elbow Method

7. Silhouette Method

8. Implementation

9. Conclusion


In this article, we will learn about K-Means clustering in detail. K-Means is one of the most popular and simplest machine learning algorithms. K-Means is used when we have unlabeled data. So now we will learn everything in this article. Before going into that first we will learn what is clustering.


Clustering: The process of dividing datasets into groups consisting of similar data points is called clustering.

Clustering is an unsupervised learning technique.

Imagine a supermarket where all the items were arranged. All the vegetables were placed in the vegetable section, fruits in the fruits section, and all. So that you can easily shop and get things that you want. They were not mixed up. This is an example of clustering. They were divided into clusters based on their similarity.

Let’s see some examples of clustering.

Whenever you visit Amazon, you will see a recommendation list of some products. This is based on your past purchases. This is done because of clustering.

And even same with Netflix. You will get movie recommendations based on your watch history.

Types of Clustering

1. Exclusive Clustering

2. Overlapping Clustering

3. Hierarchical Clustering

Exclusive Clustering: Exclusive Clustering is the hard clustering in which data point exclusively belongs to one cluster.

Source: Author

Here you can see all similar datapoints are clustered. All the blue-colored data points are clustered into the blue cluster and all the red-colored data points are clustered into the red cluster.

Overlapping Clustering: Overlapping clustering is the soft cluster in which data point belongs to multiple clusters.

For example, C-Means Clustering.

In this, we can see that some of the blue data points and some of the pink data points are overlapped.

Hierarchical Clustering: Hierarchical clustering is grouping similar objects into groups. This forms the set of clusters in which each cluster is distinct from another cluster and the objects within that each cluster is similar to each other.

I know this might be a little confusing. Let’s understand it in detail.


Observe this picture. There are Six different data points namely, A, B, C, D, E, and F.

Coming to case1, A and B are clustered based on some similarities whereas E and D are clustered based on some similarities.

Coming to case2, the combination of A and B is similar to C so the combination of A and B is grouped with C.

Coming to case3, the combination of D and E is similar to F. So the combination of D and E is grouped with F.

Coming to the last case, the combination of A, B, C and combination of D, E, F are quite similar so all these points are grouped into a single cluster.

This is how hierarchical clustering works.

Let’s learn about K-Means Clustering in detail.

K-Means Clustering

K-Means Clustering: The algorithm which groups all the similar data points into a cluster is known as K-Means Clustering. This is an unsupervised machine learning algorithm. This contains no labeled data. K-Means is a centroid-based algorithm in which each group has a centroid.

Here K in K-Means is the number of clusters. In this K-Means algorithm first, we randomly assign some centroids to the dataset. And then clusters are formed by assigning data points to the cluster to which the data point is near to the corresponding cluster. From those clusters, new centroids were formed with the mean data points. This process will continue until the model is optimized which means the final centroids will not change even for the next iteration.

Let us understand it in detail by taking some random data points. For these points, two clusters were formed with random centroids.

This is the first iteration. Centroids were completely random and clusters will look like this.

Source: Author

Now coming to the second iteration, the mean of data points were taken and new centroids were formed. Now the clusters will look like this.

Source: Author

Now coming to the third iteration, again centroids were reassigned based on mean data points. Clusters will be like this.

Source: Author

Here the model is optimized. Centroids were not changing though we iterate again. So final clusters were formed.

Finding K value

In this article, we will see 2 methods to find the K value for K-Means.

1. Elbow Method

2. Silhouette Method

Elbow Method

Elbow Method: This is one of the most popular methods that is used to find K for K-Means.

For this, we have to learn something known as WSS(Within the sum of squares).

WSS: The WSS is defined as the sum of squared distance between each member of the cluster and its centroid.

Source: Author


p(i)=data point

q(i)=closest centroid to the data point

Here in the elbow method, the K value is chosen after the decrease of WSS is almost constant.

In the above picture, you can see the elbow point, which is 3. After that point, WSS is almost constant. So 3 is selected as chúng tôi this way elbow method is used for finding the value of K

Silhouette Method

Silhouette Method: Here in the silhouette method, we will compute the silhouette score for every point.

Silhouette Coefficient for the point= (b-a)/max(a,b)


a=mean intra-cluster distance

b=mean nearest cluster distance

Silhouette coefficient for dataset = Mean Silhouette Coefficient over points.

If we draw a graph for these points, we will get something like this.

So here we can see the highest silhouette coefficient is for K = 3. In this way, the silhouette method is used for finding K.

It is available in scikit learn library.



Create a model by importing KMeans which is an inbuilt model in scikit learn library. So we can directly import it. Using the fit method train the model with data. and finally, use the predict method to predict the outcome of desired data.

And finally, in visualizations, you can see all the data points were clustered into different groups,

 Hope you guys found it useful. Read more articles on machine learning algorithms on AV blog.

Connect with me on Linkedin:

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.


Paragraph Segmentation Using Machine Learning


Natural language processing (NLP) relies heavily on paragraph segmentation, which has various practical applications such as text summarization, sentiment analysis, and topic modeling. Text summarizing algorithms, for example, frequently rely on paragraph segmentation to find the most important areas of a document that must be summarized. Similarly, paragraph segmentation may be required for sentiment analysis algorithms in order to grasp the context and tone of each paragraph independently.

Paragraph Segmentation

The technique of splitting a given text into different paragraphs based on structural and linguistic criteria is known as paragraph segmentation. Paragraph segmentation is used to improve the readability and organization of huge documents such as articles, novels, or reports. Readers can traverse the text more simply, get the information they need more quickly, and absorb the content more effectively using paragraph segmentation.

Depending on the individual properties of the text and the purposes of the segmentation, there are numerous ways to divide it into paragraphs.

1. Text indentation

This book discusses the issue of indentation in written text. Indentation refers to the space at the beginning of a line of text that is commonly used to signify the start of a new paragraph in numerous writing styles. Readers can benefit from indentation to visually differentiate where one paragraph finishes and another begins. Text indentation may also be used as a characteristic for automated paragraph segmentation, which is a natural language processing approach for automatically identifying and separating paragraphs in a body of text. The computer may be trained to recognize where paragraphs begin and end by analyzing indentation patterns, which is valuable in a variety of text analysis applications.

2. Punctuation marks

This book discusses the role of punctuation indicators which include periods, question marks, and exclamation points. These symbols are widely used to indicate the conclusion of a paragraph and the start of a new one. They can also be used to signal the conclusion of one paragraph and the start of another. Punctuation marks in written communication should be utilized appropriately since they help to clarify material and make the text easier to read and understand.

3. Text length

A paragraph seems to be a writing style composed of a sequence of connected phrases that address a particular topic or issue. The text’s length can be utilized to split it into paragraphs. A huge block of content, for example, can be divided into smaller paragraphs depending on sentence length. This means that if multiple sentences in a sequence discuss the same topic, they can be concatenated to form a paragraph. Similarly, if the topic or notion changes, a new paragraph may be added to alert the reader. Ultimately, the objective of paragraphs is to arrange and structure written content in an easy-to-read and understandable manner.

4. Text coherence

Paragraphs are an important part of writing since they assist to organize ideas and thoughts in a clear and logical way. A cohesive paragraph has sentences that are all connected to and contribute to a major concept or thinking. The coherence of the text refers to the flow of ideas and the logical links between phrases that allow the reader to discern between the beginning and finish of a paragraph. When reading a text, look for a shift in the topic or the introduction of a new concept to identify the beginning of a new paragraph. Similarly, a concluding phrase or a transition to a new concept might signal the conclusion of a paragraph. Ultimately, text coherence is an important aspect in distinguishing paragraph borders and interpreting the writer’s intended meaning.

Paragraph segmentation using machine learning

Machine learning algorithms have been employed to automate the job of paragraph segmentation in recent years, attaining remarkable levels of accuracy and speed. Machine learning algorithms are trained on a vast corpus of manually annotated text data with paragraph boundaries. This training data is used to understand the patterns and characteristics that differentiate various paragraphs.

Paragraph segmentation may be accomplished using supervised learning methods. Supervised learning algorithms are machine learning algorithms that learn on labeled data, which has already been labeled with correct answers. The labeled data for paragraph segmentation would consist of text that has been split into paragraphs and each paragraph has been labeled with a unique ID.

Two supervised learning approaches for paragraph segmentation are support vector machines (SVMs) and decision trees. These algorithms employ labeled data to learn patterns and rules that may be used to predict the boundaries of paragraphs in new texts. When given new, unlabeled text, the algorithms may utilize their previously acquired patterns and rules to forecast where one paragraph ends and another begins. This method is especially effective for evaluating vast amounts of text where manual paragraph segmentation would be impossible or time-consuming. Overall, supervised learning algorithms provide a reliable and efficient way for automating paragraph segmentation in a wide range of applications.

For paragraph segmentation, unsupervised learning methods can be employed. Unlike supervised learning algorithms, which require labeled training data, unsupervised learning algorithms may separate paragraphs without any prior knowledge of how the text should be split. Unsupervised learning algorithms use statistical analysis and clustering techniques to detect similar patterns in text. Clustering algorithms, for example, can group together phrases with similar qualities, such as lexicon or grammar, and identify them as belonging to the same paragraph. Topic modeling is another unsupervised learning approach that may be used to discover clusters of linked phrases that may constitute a paragraph. These algorithms do not rely on predetermined rules or patterns, but rather on statistical approaches to find significant patterns and groups in text. Unsupervised learning methods are very beneficial for text segmentation when the structure or formatting of the text is uneven or uncertain. Overall, unsupervised learning algorithms provide a versatile and powerful way for automating paragraph segmentation in a range of applications.

The text file below contains the paragraph above that is starting with ‘For paragraph segmentation, ……’

Python Program import nltk from chúng tôi import SVC from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.pipeline import make_pipeline # Load the data with open('/content/data.txt', 'r') as file: data = # Tokenize the data into sentences sentences = nltk.sent_tokenize(data) # Label the sentences as belonging to the same or a new paragraph # Create a feature matrix using TF-IDF vectorization vectorizer = TfidfVectorizer() X = vectorizer.fit_transform(sentences) # Create a support vector machine classifier clf = make_pipeline(SVC(kernel='linear')) # Train the classifier on the labeled data, labels) # Use the classifier to predict the paragraph boundaries in new text new_sentences = nltk.sent_tokenize(new_text) new_X = vectorizer.transform(new_sentences) new_labels = clf.predict(new_X) # Print the predicted paragraph boundaries for i in range(len(new_sentences)): if new_labels[i] == 1: print(new_sentences[i]) Output This is a new paragraph. It is separate from the previous one. This is the second sentence of the second paragraph. Conclusion

Finally, paragraph segmentation is an important job in natural language processing that may enhance the readability and structure of enormous texts greatly. In this domain, machine learning algorithms have made great progress, enabling precise and efficient segmentation based on structural data and statistical analysis. However, further study is needed to enhance these models’ performance on more complicated and diverse texts, as well as to investigate novel ways to paragraph segmentation based on deep learning and other sophisticated techniques.

Mental Health Prediction Using Machine Learning

Text(0.5, 0, ‘Age’)

Inference: The above plot shows the Age column with respect to density. We can see that density is higher from Age 10 to 20 years in our dataset.

j = sns.FacetGrid(train_df, col='treatment', size=5) j =, "Age")

Inference: Treatment 0 means treatment is not necessary 1 means it is. First Barplot shows that from age 0 to 10-year treatment is not necessary and is needed after 15 years.

plt.figure(figsize=(12,8)) labels = labelDict['label_Gender'] j = sns.countplot(x="treatment", data=train_df) j.set_xticklabels(labels) plt.title('Total Distribution by treated or not')

Text(0.5, 1.0, ‘Total Distribution by treated or not’)

Inference: Here we can see that more males are treated as compared to females in the dataset.

o = labelDict['label_age_range'] j = sns.factorplot(x="age_range", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out = True) j.set_xticklabels(o) plt.title('Probability of mental health condition') plt.ylabel('Probability x 100') plt.xlabel('Age') new_labels = labelDict['label_Gender'] for t, l in zip(j._legend.texts, new_labels): t.set_text(l) j.fig.subplots_adjust(top=0.9,right=0.8)

Inference: This barplot shows the mental health of females, males, and transgender according to different age groups. we can analyze that from the age group of 66 to 100, mental health is very high in females as compared to another gender. And from age 21 to 64, mental health is very high in transgender as compared to males.

o = labelDict['label_family_history'] j = sns.factorplot(x="family_history", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out = True) j.set_xticklabels(o) plt.title('Probability of mental health condition') plt.ylabel('Probability x 100') plt.xlabel('Family History') new_labels = labelDict['label_Gender'] for t, l in zip(g._legend.texts, new_labels): t.set_text(l) j.fig.subplots_adjust(top=0.9,right=0.8) o = labelDict['label_care_options'] j = sns.factorplot(x="care_options", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out = True) j.set_xticklabels(o) plt.title('Probability of mental health condition') plt.ylabel('Probability x 100') plt.xlabel('Care options') new_labels = labelDict['label_Gender'] for t, l in zip(g._legend.texts, new_labels): t.set_text(l) j.fig.subplots_adjust(top=0.9,right=0.8)

Inference: In the dataset, for those who are having a family history of mental health problems, the Probability of mental health will be high. So here we can see that probability of mental health conditions for transgender is almost 90% as they have a family history of medical health conditions.

Inference: This barplot shows health status with respect to care options. In the dataset, for Those who are not having care options, the Probability of mental health situation will be high. So here we can see that the mental health of transgender is very high who have not care options and low for those who are having care options.

o = labelDict['label_benefits'] j = sns.factorplot(x="care_options", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out = True) j.set_xticklabels(o) plt.title('Probability of mental health condition') plt.ylabel('Probability x 100') plt.xlabel('Benefits') new_labels = labelDict['label_Gender'] for t, l in zip(j._legend.texts, new_labels): t.set_text(l) j.fig.subplots_adjust(top=0.9,right=0.8)

Inference: This barplot shows the probability of health conditions with respect to Benefits. In the dataset, for those who are not having any benefits, the Probability of mental health conditions will be high. So here we can see that probability of mental health conditions for transgender is very high who have not getting any benefits. and probability is low for those who are having benefits options.

o = labelDict['label_work_interfere'] j = sns.factorplot(x="work_interfere", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out = True) j.set_xticklabels(o) plt.title('Probability of mental health condition') plt.ylabel('Probability x 100') plt.xlabel('Work interfere') new_labels = labelDict['label_Gender'] for t, l in zip(g._legend.texts, new_labels): t.set_text(l) j.fig.subplots_adjust(top=0.9,right=0.8)

Inference: This barplot shows the probability of health conditions with respect to work interference. For those who are not having any work interference, the Probability of mental health conditions will be very less. and probability is high for those who are having work interference rarely.

Scaling and Fitting # Scaling Age scaler = MinMaxScaler() train_df['Age'] = scaler.fit_transform(train_df[['Age']]) train_df.head() # define X and y feature_cols1 = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'anonymity', 'leave', 'work_interfere'] X = train_df[feature_cols1] y = train_df.treatment X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, Random_state1=0) # Create dictionaries for final graph # Use: methodDict['Stacking'] = accuracy_score methodDict = {} rmseDict = () forest = ExtraTreesClassifier(n_estimators=250, Random_state1=0), y) importances = forest.feature_importances_ std = np.std([tree1.feature_importances_ for tree in forest.estimators_], axis=0) indices = np.argsort(importances)[::-1] labels = [] for f in Range(x.shape[1]): labels.append(feature_cols1[f]) plt.figure(figsize=(12,8)) plt.title("Feature importances")[1]), importances[indices], color="r", yerr=std[indices],) plt.Xticks(range(X.shape[1]), labels, rotation='vertical') plt.xlim([-1, X.shape[1]]) Tuning def evalClassModel(model, y_test1, y_pred_class, plot=False): #Classification accuracy: percentage of correct predictions # calculate accuracy print('Accuracy:', metrics.accuracy_score(y_test1, y_pred_class)) print('Null accuracy:n', y_test1.value_counts()) # calculate the percentage of ones print('Percentage of ones:', y_test1.mean()) # calculate the percentage of zeros print('Percentage of zeros:',1 - y_test1.mean()) print('True:', y_test1.values[0:25]) print('Pred:', y_pred_class[0:25]) #Confusion matrix confusion = metrics.confusion_matrix(y_test1, y_pred_class) #[row, column] TP = confusion[1, 1] TN = confusion[0, 0] FP = confusion[0, 1] FN = confusion[1, 0] # visualize Confusion Matrix sns.heatmap(confusion,annot=True,fmt="d") plt.title('Confusion Matrix') plt.xlabel('Predicted') plt.ylabel('Actual') accuracy = metrics.accuracy_score(y_test1, y_pred_class) print('Classification Accuracy:', accuracy) print('Classification Error:', 1 - metrics.accuracy_score(y_test1, y_pred_class)) fp_rate = FP / float(TN + FP) print('False Positive Rate:', fp_rate) print('Precision:', metrics.precision_score(y_test1, y_pred_class)) print('AUC Score:', metrics.roc_auc_score(y_test1, y_pred_class)) # calculate cross-validated AUC print('Crossvalidated AUC values:', cross_val_score1(model, X, y, cv=10, scoring='roc_auc').mean()) print('First 10 predicted responses:n', model.predict(X_test1)[0:10]) print('First 10 predicted probabilities of class members:n', model.predict_proba(X_test1)[0:10]) model.predict_proba(X_test1)[0:10, 1] y_pred_prob = model.predict_proba(X_test1)[:, 1] if plot == True: # histogram of predicted probabilities plt.rcParams['font.size'] = 12 plt.hist(y_pred_prob, bins=8) plt.xlim(0,1) plt.title('Histogram of predicted probabilities') plt.xlabel('Predicted probability of treatment') plt.ylabel('Frequency') y_pred_prob = y_pred_prob.reshape(-1,1) y_pred_class = binarize(y_pred_prob, 0.3)[0] print('First 10 predicted probabilities:n', y_pred_prob[0:10]) roc_auc = metrics.roc_auc_score(y_test1, y_pred_prob) fpr, tpr, thresholds = metrics.roc_curve(y_test1, y_pred_prob) if plot == True: plt.figure() plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc) plt.plot([0, 1], [0, 1], color='navy', linestyle='--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.0]) plt.rcParams['font.size'] = 12 plt.title('ROC curve for treatment classifier') plt.xlabel('False Positive Rate (1 - Specificity)') plt.ylabel('True Positive Rate (Sensitivity)') plt.legend(loc="lower right") def evaluate_threshold(threshold): confusion = metrics.confusion_matrix(y_test1, predict_mine) print(confusion) return accuracy

Tuning with cross-validation score

def tuningCV(knn): k_Range = list(Range(1, 31)) k_scores = [] for k in k_range: knn = KNeighborsClassifier(n_neighbors=k) scores = cross_val_score1(knn, X, y, cv=10, scoring='accuracy') k_scores.append(scores.mean()) print(k_scores) plt.plot(k_Range, k_scores) plt.xlabel('Value of K for KNN') plt.ylabel('Cross-Validated Accuracy')

Tuning with GridSearchCV

def tuningGridSerach(knn): k_Range = list(range(1, 31)) print(k_Range) param_grid = dict(n_neighbors=k_range) print(param_grid) grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy'), y) grid.grid_scores1_ print(grid.grid_scores_[0].parameters) print(grid.grid_scores_[0].cv_validation_scores) print(grid.grid_scores_[0].mean_validation_score) grid_mean_scores1 = [result.mean_validation_score for result in grid.grid_scores_] print(grid_mean_scores1) # plot the results plt.plot(k_Range, grid_mean_scores1) plt.xlabel('Value of K for KNN') plt.ylabel('Cross-Validated Accuracy') # examine the best model print('GridSearch best score', grid.best_score_) print('GridSearch best params', grid.best_params_) print('GridSearch best estimator', grid.best_estimator_)

Tuning with RandomizedSearchCV

def tuningRandomizedSearchCV(model, param_dist): rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state1=5), y) rand1.cv_results_ print('Rand1. Best Score: ', rand.best_score_) print('Rand1. Best Params: ', rand.best_params_) best_scores = [] for _ in Range(20): rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10), y) best_scores.append(round(rand.best_score_, 3)) print(best_scores)

Tuning by searching multiple parameters simultaneously

def tuningMultParam(knn): k_Range = list(Range(1, 31)) weight_options = ['uniform', 'distance'] param_grid = dict(N_neighbors=k_range, weights=weight_options) print(param_grid) grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy'), y) print(grid.grid_scores_) print('Multiparam. Best Score: ', grid.best_score_) print('Multiparam. Best Params: ', grid.best_params_) Evaluating Models

Logistic Regression

def logisticRegression(): logreg = LogisticRegression(), y_train) y_pred_class = logreg.predict(X_test1) accuracy_score = evalClassModel(logreg, y_test1, y_pred_class, True) #Data for final graph methodDict['Log. Regression'] = accuracy_score * 100 logisticRegression()

True value: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Predicted value: [1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.09193053 0.90806947] [0.95991564 0.04008436] [0.96547467 0.03452533] [0.78757121 0.21242879] [0.38959922 0.61040078] [0.05264207 0.94735793] [0.75035574 0.24964426] [0.19065116 0.80934884] [0.61612081 0.38387919] [0.47699963 0.52300037]] [[0.90806947] [0.04008436] [0.03452533] [0.21242879] [0.61040078] [0.94735793] [0.24964426] [0.80934884] [0.38387919] [0.52300037]]

[[142 49] [ 28 159]]

KNeighbors Classifier

def Knn(): # Calculating the best parameters knn = KNeighborsClassifier(n_neighbors=5) k_Range = list(Range(1, 31)) weight_options = ['uniform', 'distance'] param_dist = dict(N_neighbors=k_range, weights=weight_options) tuningRandomizedSearchCV(knn, param_dist) knn = KNeighborsClassifier(n_neighbors=27, weights='uniform'), y_train1) y_pred_class = knn.predict(X_test1) accuracy_score = evalClassModel(knn, y_test1, y_pred_class, True) #Data for final graph methodDict['K-Neighbors'] = accuracy_score * 100 Knn()

[0.816, 0.812, 0.821, 0.823, 0.823, 0.818, 0.821, 0.821, 0.815, 0.812, 0.819, 0.811, 0.819, 0.818, 0.82, 0.815, 0.803, 0.821, 0.823, 0.815] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.33333333 0.66666667] [1. 0. ] [1. 0. ] [0.66666667 0.33333333] [0.37037037 0.62962963] [0.03703704 0.96296296] [0.59259259 0.40740741] [0.37037037 0.62962963] [0.33333333 0.66666667] [0.33333333 0.66666667]] [[0.66666667] [0. ] [0. ] [0.33333333] [0.62962963] [0.96296296] [0.40740741] [0.62962963] [0.66666667] [0.66666667]]

[[135 56] [ 18 169]]

Decision Tree 

def treeClassifier(): # Calculating the best parameters tree1 = DecisionTreeClassifier() featuresSize = feature_cols1.__len__() param_dist = {"max_depth": [3, None], "max_features": randint(1, featuresSize), "min_samples_split": randint(2, 9), "min_samples_leaf": randint(1, 9), "criterion": ["gini", "entropy"]} tuningRandomizedSearchCV(tree1, param_dist) tree1 = DecisionTreeClassifier(max_depth=3, min_samples_split=8, max_features=6, criterion='entropy', min_samples_leaf=7), y_train1) y_pred_class = tree1.predict(X_test1) accuracy_score = evalClassModel(tree1, y_test1, y_pred_class, True) #Data for final graph methodDict['Decision Tree Classifier'] = accuracy_score * 100 treeClassifier()

[0.83, 0.827, 0.831, 0.829, 0.831, 0.83, 0.783, 0.831, 0.821, 0.831, 0.831, 0.831, 0.8, 0.79, 0.831, 0.831, 0.831, 0.829, 0.831, 0.831] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.18 0.82 ] [0.96534653 0.03465347] [0.96534653 0.03465347] [0.89473684 0.10526316] [0.36097561 0.63902439] [0.18 0.82 ] [0.89473684 0.10526316] [0.11320755 0.88679245] [0.36097561 0.63902439] [0.36097561 0.63902439]] [[0.82 ] [0.03465347] [0.03465347] [0.10526316] [0.63902439] [0.82 ] [0.10526316] [0.88679245] [0.63902439] [0.63902439]]

[[130 61] [ 12 175]]

Random Forests

def randomForest(): # Calculating the best parameters forest1 = RandomForestClassifier(n_estimators = 20) featuresSize = feature_cols1.__len__() param_dist = {"max_depth": [3, None], "max_features": randint(1, featuresSize), "min_samples_split": randint(2, 9), "min_samples_leaf": randint(1, 9), "criterion": ["gini", "entropy"]} tuningRandomizedSearchCV(forest1, param_dist) forest1 = RandomForestClassifier(max_depth = None, min_samples_leaf=8, min_samples_split=2, n_estimators = 20, random_state = 1) my_forest =, y_train1) y_pred_class = my_forest.predict(X_test1) accuracy_score = evalClassModel(my_forest, y_test1, y_pred_class, True) #Data for final graph methodDict['Random Forest'] = accuracy_score * 100 randomForest()

[0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831, 0.837, 0.834, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.2555794 0.7444206 ] [0.95069083 0.04930917] [0.93851009 0.06148991] [0.87096597 0.12903403] [0.40653554 0.59346446] [0.17282958 0.82717042] [0.89450448 0.10549552] [0.4065912 0.5934088 ] [0.20540631 0.79459369] [0.19337644 0.80662356]] [[0.7444206 ] [0.04930917] [0.06148991] [0.12903403] [0.59346446] [0.82717042] [0.10549552] [0.5934088 ] [0.79459369] [0.80662356]]


def boosting(): # Building and fitting clf = DecisionTreeClassifier(criterion='entropy', max_depth=1) boost = AdaBoostClassifier(base_estimator=clf, n_estimators=500), y_train1) y_pred_class = boost.predict(X_test1) accuracy_score = evalClassModel(boost, y_test1, y_pred_class, True) #Data for final graph methodDict['Boosting'] = accuracy_score * 100 boosting()

True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.49924555 0.50075445] [0.50285507 0.49714493] [0.50291786 0.49708214] [0.50127788 0.49872212] [0.50013552 0.49986448] [0.49796157 0.50203843] [0.50046371 0.49953629] [0.49939483 0.50060517] [0.49921757 0.50078243] [0.49897133 0.50102867]] [[0.50075445] [0.49714493] [0.49708214] [0.49872212] [0.49986448] [0.50203843] [0.49953629] [0.50060517] [0.50078243] [0.50102867]]

Predicting with Neural Network

Create input function

%tensorflow_version 1.x import tensorflow as tf import argparse

TensorFlow 1.x selected.

batch_size = 100 train_steps = 1000 X_train1, X_test1, y_train1, y_test1 = train_test1_split(X, y, test_size=0.30, random_state=0) def train_input_fn(features, labels, batch_size): dataset =, labels)) return dataset.shuffle(1000).repeat().batch(batch_size) def eval_input_fn(features, labels, batch_size): features=dict(features) if labels is None: # No labels, use only features. inputs = features else: inputs = (features, labels) dataset = dataset = dataset.batch(batch_size) # Return the dataset. return dataset

Define the feature columns

# Define Tensorflow feature columns age = tf.feature_column.numeric_column("Age") gender = tf.feature_column.numeric_column("Gender") family_history = tf.feature_column.numeric_column("family_history") benefits = tf.feature_column.numeric_column("benefits") care_options = tf.feature_column.numeric_column("care_options") anonymity = tf.feature_column.numeric_column("anonymity") leave = tf.feature_column.numeric_column("leave") work_interfere = tf.feature_column.numeric_column("work_interfere") feature_column = [age, gender, family_history, benefits, care_options, anonymity, leave, work_interfere]

Instantiate an Estimator

model = tf.estimator.DNNClassifier(feature_columns=feature_columns, hidden_units=[10, 10], optimizer=tf.train.ProximalAdagradOptimizer( learning_rate=0.1, l1_regularization_strength=0.001 ))

Train the model

model.train(input_fn=lambda:train_input_fn(X_train1, y_train1, batch_size), steps=train_steps)

Evaluate the model

# Evaluate the model. eval_result = model.evaluate( input_fn=lambda:eval_input_fn(X_test1, y_test1, batch_size)) print('nTest set accuracy: {accuracy:0.2f}n'.format(**eval_result)) #Data for final graph accuracy = eval_result['accuracy'] * 100 methodDict['Neural Network'] = accuracy

The test set accuracy: 0.80

Making predictions (inferring) from the trained model

predictions = list(model.predict(input_fn=lambda:eval_input_fn(X_train1, y_train1, batch_size=batch_size))) # Generate predictions from the model template = ('nIndex: "{}", Prediction is "{}" ({:.1f}%), expected "{}"') # Dictionary for predictions col1 = [] col2 = [] col3 = [] for idx, input, p in zip(X_train1.index, y_train1, predictions): v = p["class_ids"][0] class_id = p['class_ids'][0] probability = p['probabilities'][class_id] # Probability # Adding to dataframe col1.append(idx) # Index col2.append(v) # Prediction col3.append(input) # Expecter #print(template.format(idx, v, 100 * probability, input)) results = pd.DataFrame({'index':col1, 'prediction':col2, 'expected':col3}) results.head() Creating Predictions on the Test Set

# Generate predictions with the best methodology

clf = AdaBoostClassifier(), y) dfTestPredictions = clf.predict(X_test1) # Write predictions to csv file results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions}) # Save to file results.to_csv('results.csv', index=False) results.head() Submission results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions}) results

The final prediction consists of 0 and 1. 0 means the person is not needed any mental health treatment and 1 means the person is needed mental health treatment.


After using all these Employee records, we are able to build various machine learning models. From all the models, ADA–Boost achieved 81.75% accuracy with an AUC of 0.8185 along with that we were able to draw some insights from the data via data analysis and visualization.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Cybercrooks Developing Dangerous New File

A team of malware developers is preparing to sell a new ransomware program that encrypts files on infected computers and asks victims for money to recover them, according to a volunteer group of security researchers who tracked the development of the threat on underground forums in recent weeks.

Like CryptoLocker, PowerLocker allegedly uses strong encryption that cannot be cracked to recover the files without paying, but it’s also more sophisticated and potentially more dangerous because its developers reportedly intend to sell it to other cybercriminals.

Also like CryptoLocker, PowerLocker allegedly uses strong encryption that prevents users from recovering files unless they pay or have backups. However, it’s also more sophisticated and potentially more dangerous because its developers reportedly intend to sell it to other cybercriminals.

Malware Must Die (MMD), a group of security researchers dedicated to fighting cybercrime, spotted a post on an underground forum at the end of November in which a malware writer announced a new ransomware project. That project, initially under the name Prison Locker, later became PowerLocker.

MMD researchers tracked the development of the threat and decided to make the information they gathered public on Friday out of concern that, if completed and released, the new ransomware program could cause a lot of damage. The group published a blog post with screen shots of several underground forum messages describing the malware’s alleged features at various stages of completion, as well as its planned price.

Based on a progress report by the malware’s main developer—a user with the online identity “gyx”—PowerLocker consists of a single file that’s dropped in the Windows temporary folder. Once run on a computer for the first time, it begins encrypting all user files stored on local drives and network shares, except for executable and system files.

This is similar to how CryptoLocker’s encryption scheme is implemented, but PowerLocker goes even further. Once the encryption stage is done, it disables the Windows and Escape keys and prevents a number of other useful utilities like chúng tôi chúng tôi chúng tôi chúng tôi and chúng tôi from being used.

”While CryptoLocker was tailor-made for a select group of individuals, the PowerLocker as they call it is a tool that would be available for purchase, thus making any script-kiddie a potential attacker.”

The malware is also capable of detecting whether it’s run in virtual machines, sandboxes or debugging environments, a feature designed to prevent security researchers from analyzing it using their usual tools.

Another important difference between CryptoLocker and PowerLocker is that the new threat is supposed to be sold as a crimepack to other cybercriminals.

”While CryptoLocker was tailor-made for a select group of individuals, the PowerLocker as they call it is a tool that would be available for purchase, thus making any script-kiddie a potential attacker,” he said. “If it is real, we expect it to hit really hard.”

“Ransomware is easy money and that’s what cybercriminals are after.”

According to the underground forum messages shared by MMD, the PowerLocker author has partnered with another developer to create the malware’s command-and-control panel and the graphical user interface and is very close to completing them. The developers plan to sell the malware for $100 in Bitcoins per initial build and $25 per rebuild, which is a very accessible price for cybercriminals.

Botezatu expects other similar malware programs to be developed and used this year.

”Trojans like GPcode have set the standard for commercial ransomware, while the ROI [return on investment] rates of the FBI Trojan and CryptoLocker have probably incentivized other cybercriminal groups into joining the ransomware pack,” he said. “Ransomware is easy money and that’s what cybercriminals are after.”

Most malware today is distributed through exploits for vulnerabilities in popular software programs like Java, Flash Player and others, so it is very important to keep all applications up-to-date to prevent infection with ransomware and other threats.

Embedded Ai And Machine Learning Adding New Advancements In Tech Space

Embedded AI and machine learning will improve devices and make them brilliant

Throughout the most recent years, as sensor and MCU costs dove and shipped volumes have gone through the roof, an ever-increasing number of organizations have attempted to exploit by adding sensor-driven embedded AI to their products. Automotive is driving the trend– the average non-autonomous vehicle presently has 100 sensors, sending information to 30-50 microcontrollers that run about 1m lines of code and create 1TB of data per vehicle every day. Extravagance vehicles may have twice the same number of, and autonomous vehicles increase the sensor check significantly more drastically. Yet, it’s not simply an automotive trend. Industrial equipment is turning out to be progressively “brilliant” as creators of rotating, reciprocating and other types of equipment rush to add usefulness for condition monitoring and predictive support, and a huge number of new consumer products from toothbrushes, to vacuum cleaners, to fitness monitors add instrumentation and “smarts”. An ever-increasing number of smart devices are being introduced each month. We are now at a point where artificial intelligence and machine learning in its exceptionally essential structure has discovered its way into the core of embedded devices. For example, smart home lighting systems that automatically turn on and off depend on whether anybody is available in the room. By all accounts, the system doesn’t look excessively stylish. Yet, when you consider everything, you understand that the system is really settling on choices all alone. In view of the contribution from the sensor, the microcontroller/SOC concludes if to turn on the light or not. To do all of this simultaneously, defeating variety to achieve troublesome detections in real-time, at the edge, inside the vital limitations isn’t at all simple. In any case, with current tools, integrating new options for machine learning for signals (like Reality AI) it is getting simpler. They can regularly achieve detections that escape traditional engineering models. They do this by making significantly more productive and compelling utilization of data to conquer variation. Where traditional engineering approaches will ordinarily be founded on a physical model, utilizing data to appraise parameters, machine learning approaches can adapt autonomously of those models. They figure out how to recognize signatures straightforwardly from the raw information and utilize the mechanics of machine learning (mathematics) to isolate targets from non-targets without depending on physical science. There are a lot of different regions where the convergence of machine learning and embedded systems will prompt great opportunities. Healthcare, for example, is now receiving the rewards of putting resources into AI technology. The Internet of Things or IoT will likewise profit enormously from the introduction of artificial intelligence. We will have smart automation solutions that will prompt energy savings, cost proficiency as well as the end of human blunder. Forecasting is at the center of so many ML/AI conversations as organizations hope to use neural networks and deep learning to conjecture time series data. The worth is the capacity to ingest information and quickly acknowledge insight into how it changes the long-term outlook. Further, a large part of the circumstance relies upon the global supply chain, which makes improvements significantly harder to precisely project.

Update the detailed information about Developing New Machine Learning Algorithm Using Openai Gym on the website. We hope the article's content will meet your needs, and we will regularly update the information to provide you with the fastest and most accurate information. Have a great day!