Clustering Machine Learning Algorithm Using K Means (updated December 2023)


This article was published as a part of the Data Science Blogathon.


1. Introduction

2. Clustering

3. Types of Clustering

4. K-Means Clustering

5. Finding K value

6. Elbow Method

7. Silhouette Method

8. Implementation

9. Conclusion


In this article, we will learn about K-Means clustering in detail. K-Means is one of the simplest and most popular machine learning algorithms, and it is used when the data is unlabeled. Before going into K-Means itself, we will first look at what clustering is.


Clustering: The process of dividing a dataset into groups of similar data points is called clustering.

Clustering is an unsupervised learning technique.

Imagine a supermarket where all the items are arranged by type: vegetables in the vegetable section, fruits in the fruit section, and so on, so that you can easily find what you want. The items are not mixed up. This is an example of clustering: the items are divided into clusters based on their similarity.

Let’s see some examples of clustering.

Whenever you visit Amazon, you see a list of recommended products based on your past purchases. Clustering is one of the techniques that can be used to build such recommendations, for example by grouping customers with similar purchase histories.

The same goes for Netflix: you get movie recommendations based on your watch history.

Types of Clustering

1. Exclusive Clustering

2. Overlapping Clustering

3. Hierarchical Clustering

Exclusive Clustering: Exclusive clustering is hard clustering, in which each data point belongs to exactly one cluster.

Source: Author

Here you can see that similar data points are clustered together: all the blue data points are in the blue cluster and all the red data points are in the red cluster.

Overlapping Clustering: Overlapping clustering is soft clustering, in which a data point can belong to multiple clusters.

For example, Fuzzy C-Means clustering.

In this figure, we can see that some of the blue data points and some of the pink data points overlap.

Hierarchical Clustering: Hierarchical clustering groups similar objects step by step into a hierarchy of clusters. At each level, the clusters are distinct from one another, and the objects within each cluster are similar to each other.

I know this might be a little confusing. Let’s understand it in detail.


Observe this picture. There are six different data points, namely A, B, C, D, E, and F.

In case 1, A and B are clustered together based on their similarity, and D and E are clustered together in the same way.

In case 2, the combination of A and B is similar to C, so the combination of A and B is grouped with C.

In case 3, the combination of D and E is similar to F, so the combination of D and E is grouped with F.

In the last case, the combination of A, B, C and the combination of D, E, F are quite similar, so all these points are grouped into a single cluster.

This is how hierarchical clustering works.

Let’s learn about K-Means Clustering in detail.

K-Means Clustering

K-Means Clustering: K-Means is an algorithm that groups similar data points into clusters. It is an unsupervised machine learning algorithm, so it works on data that has no labels. K-Means is a centroid-based algorithm: each cluster is represented by a centroid.

Here, the K in K-Means is the number of clusters. The algorithm first places K centroids at random. Each data point is then assigned to the cluster whose centroid is nearest to it. Next, a new centroid is computed for each cluster as the mean of the data points assigned to it. These two steps repeat until the model converges, which means the centroids no longer change from one iteration to the next.
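The assign-then-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the six sample points, and the choice to pick the initial centroids from the data points themselves are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        # Convergence: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two visually separate groups of points (made-up sample data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On this small dataset the loop stops after a few iterations with one centroid in each group. On real data the result depends on the random initialization, which is why K-Means is usually run several times with different starting centroids.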

Let us understand it in detail by taking some random data points. For these points, two clusters are formed with random centroids.

This is the first iteration. The centroids are completely random, and the clusters look like this.

Source: Author

In the second iteration, the mean of the data points in each cluster is taken and new centroids are formed. Now the clusters look like this.

Source: Author

In the third iteration, the centroids are again reassigned based on the mean of the data points. The clusters look like this.

Source: Author

Here the model has converged: the centroids no longer change even if we iterate again, so these are the final clusters.

Finding K value

In this article, we will see 2 methods to find the K value for K-Means.

1. Elbow Method

2. Silhouette Method

Elbow Method

Elbow Method: This is one of the most popular methods used to find K for K-Means.

For this, we need a quantity known as WSS (within-cluster sum of squares).

WSS: The WSS is defined as the sum of the squared distances between each member of a cluster and its centroid.

WSS = sum over all data points i of distance(p(i), q(i))^2

where

p(i) = data point

q(i) = closest centroid to the data point

In the elbow method, we plot WSS against K and choose the value of K after which the decrease in WSS becomes almost constant.

In the above picture, you can see the elbow point, which is 3. After that point, WSS is almost constant, so 3 is selected as K. In this way, the elbow method is used for finding the value of K.
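To make the WSS definition concrete, here is a small worked example in plain Python. The four points and the centroids for each K are hand-picked for illustration (they are not the output of a K-Means run), and the helper names are assumptions of this sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Two pairs of points, far apart horizontally (made-up sample data).
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

# Hand-picked centroids for K = 1, 2, 3, 4 (illustrative, not fitted).
centroids_by_k = {
    1: [(5.0, 2.0)],
    2: [(1.0, 2.0), (9.0, 2.0)],
    3: [(1.0, 1.0), (1.0, 3.0), (9.0, 2.0)],
    4: [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)],
}

wss_by_k = {k: wss(points, c) for k, c in centroids_by_k.items()}
for k, value in wss_by_k.items():
    print(k, value)
```

Here WSS falls sharply from K = 1 to K = 2 (68 to 4) and then barely changes, so the elbow for this toy data is at K = 2. In practice the centroids for each K would come from fitting K-Means, and the resulting WSS values would be plotted against K.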

Silhouette Method

Silhouette Method: In the silhouette method, we compute the silhouette coefficient for every point.

Silhouette coefficient for a point = (b - a) / max(a, b)


a = mean distance from the point to the other points in its own cluster (mean intra-cluster distance)

b = mean distance from the point to the points in the nearest other cluster (mean nearest-cluster distance)

Silhouette coefficient for the dataset = mean of the silhouette coefficients over all points.

If we plot this score for different values of K, we will get something like this.

So here we can see the highest silhouette coefficient is for K = 3. In this way, the silhouette method is used for finding K.
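The formula above can be checked by hand with a short plain-Python sketch. The function name, the sample points, and the convention of scoring a single-point cluster as 0 are assumptions of this example, and it expects at least two clusters.

```python
import math


def silhouette_scores(clusters):
    """Silhouette coefficient (b - a) / max(a, b) for every point.

    `clusters` is a list of clusters, each a list of points (tuples).
    """
    scores = []
    for i, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            others = cluster[:idx] + cluster[idx + 1:]
            if not others:
                # Convention used here: a point alone in its cluster scores 0.
                scores.append(0.0)
                continue
            # a: mean distance to the other points of the same cluster.
            a = sum(math.dist(p, q) for q in others) / len(others)
            # b: smallest mean distance to the points of any other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return scores


# The same four points grouped in a sensible way and in a poor way.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

good_scores = silhouette_scores(good)
bad_scores = silhouette_scores(bad)
good_mean = sum(good_scores) / len(good_scores)
bad_mean = sum(bad_scores) / len(bad_scores)
print(good_mean, bad_mean)
```

The sensible grouping gets a mean coefficient close to 1, while the poor grouping gets a negative one. To choose K, the same mean would be computed for the clustering produced at each candidate K, and the K with the highest value would be kept.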

K-Means is available in the scikit-learn library.



Create a model by importing KMeans, which is a built-in model in the scikit-learn library, so we can import it directly. Use the fit method to train the model on the data, and finally use the predict method to predict the cluster of new data points.
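The fit and predict steps described above might look like the following sketch. The sample array, the number of clusters, and the new points are made up for illustration and are not the author's dataset; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up sample data: two visually separate groups of points.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],
])

# Create the model and train it on the data.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(X)

# Cluster index assigned to each training point, and the learned centroids.
labels = model.labels_
print(labels)
print(model.cluster_centers_)

# Predict the cluster of new, unseen points.
new_points = np.array([[0.5, 1.0], [9.0, 9.0]])
predicted = model.predict(new_points)
print(predicted)
```

The cluster indices themselves (0 or 1) are arbitrary. What matters is which points share an index. The fitted model also exposes `inertia_`, which is the WSS used by the elbow method.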

And finally, in the visualizations, you can see that all the data points are clustered into different groups.

 Hope you guys found it useful. Read more articles on machine learning algorithms on AV blog.


The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.



Developing New Machine Learning Algorithm Using Openai Gym


OpenAI Gym is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D and 3D physical simulations, and so on), so you can train agents, compare them, or develop new Machine Learning algorithms (Reinforcement Learning).

OpenAI is an artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).

Installing OpenAI Gym

In this article, I will be using the OpenAI gym, a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning agents to interact with.

Before installing the toolkit, if you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH # Your ML working directory (e.g., $HOME/ml)
$ source my_env/bin/activate # on Linux or MacOS
$ .\my_env\Scripts\activate # on Windows

Next, install OpenAI Gym (if you are not using a virtual environment, you will need to add the --user option, or have administrator rights):

$ python3 -m pip install -U gym

Depending on your system, you may also need to install the Mesa OpenGL Utility (GLU) library (e.g., on Ubuntu 18.04 you need to run apt install libglu1-mesa). This library will be needed to render the first environment.

Next, open up a Python shell, a Jupyter notebook, or Google Colab. I will first import all the necessary libraries and then create an environment with make():

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

import sklearn

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !apt update && apt install -y libpq-dev libsdl2-dev swig xorg-dev xvfb
    !pip install -q -U tf-agents-nightly pyvirtualdisplay gym[atari]
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To get smooth animations
import matplotlib.animation as animation
mpl.rc('animation', html='jshtml')

import gym

Let’s list all the available environments:


The Cart-Pole is a very simple environment composed of a cart that can move left or right, and a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.

env = gym.make('CartPole-v1')

Let’s initialize the environment by calling its reset() method. This returns an observation:

env.seed(42)
obs = env.reset()

Observations vary depending on the environment. In this case, it is a 1D NumPy array composed of 4 floats: they represent the cart’s horizontal position, its velocity, the angle of the pole (0 = vertical), and the angular velocity.


array([-0.01258566, -0.00156614, 0.04207708, -0.00180545])

An environment can be visualized by calling its render() method, and you can pick the rendering mode (the rendering options depend on the environment).


In this example, we will set mode="rgb_array" to get an image of the environment as a NumPy array:

img = env.render(mode="rgb_array")
img.shape

(400, 600, 3)

def plot_environment(env, figsize=(5,4)):
    plt.figure(figsize=figsize)
    img = env.render(mode="rgb_array")
    plt.imshow(img)
    plt.axis("off")
    return img

plot_environment(env)

Let’s see how to interact with the OpenAI Gym environment. Your agent will need to select an action from an “action space” (the set of possible actions). Let’s see what this environment’s action space looks like:



action = 1 # accelerate right
obs, reward, done, info = env.step(action)
obs

array([-0.01261699, 0.19292789, 0.04204097, -0.28092127])


Looks like it’s doing what we’re telling it to do! The environment also tells the agent how much reward it got during the last step:



When the game is over, the environment returns done=True:



Finally, info is an environment-specific dictionary that can provide some extra information that you may find useful for debugging or for training. For example, in some games, it may indicate how many lives the agent has.



The sequence of steps between the moment the environment is reset until it is done is called an “episode”. At the end of an episode (i.e., when step() returns done=True), you should reset the environment before you continue to use it.
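The reset, step, and done cycle that makes up an episode can be tried out without installing Gym. The sketch below uses a hypothetical toy environment, CoinFlipEnv, that is not part of Gym and exists only for this illustration. It mimics the reset() and four-value step() interface used in this article.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-style reset()/step() interface.

    The observation is a coin (0 or 1). The agent is rewarded for choosing
    the action equal to the coin it was just shown. The episode ends on the
    first wrong action or after max_steps steps.
    """

    def __init__(self, max_steps=10, seed=42):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == self.coin else 0.0
        done = reward == 0.0 or self.steps >= self.max_steps
        self.coin = self.rng.randint(0, 1)  # next observation
        info = {"steps": self.steps}
        return self.coin, reward, done, info


def run_episode(env, policy):
    """Run one episode: reset, then step until the environment reports done."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


toy_env = CoinFlipEnv(max_steps=10)
copy_total = run_episode(toy_env, policy=lambda obs: obs)       # always right
wrong_total = run_episode(toy_env, policy=lambda obs: 1 - obs)  # always wrong
print(copy_total, wrong_total)
```

The same run_episode loop works with any environment that follows this interface, which is the point of having a common API: the agent code does not need to know what is being simulated.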

if done: obs = env.reset()

Hardcoding OpenAI Gym using Simple Policy Algorithm

Let’s hardcode a simple policy that accelerates left when the pole is leaning toward the left and accelerates right when the pole is leaning toward the right. We will run this policy to see the average rewards it gets over 500 episodes:


def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

This code is hopefully self-explanatory. Let’s look at the result:

np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(41.718, 8.858356280936096, 24.0, 68.0)

Well, as expected, this strategy is a bit too basic: the best it did was to keep the pole up for only 68 steps. This environment is considered solved when the agent keeps the pole up for 200 steps.

env.seed(42)
frames = []
obs = env.reset()
for step in range(200):
    img = env.render(mode="rgb_array")
    frames.append(img)
    action = basic_policy(obs)
    obs, reward, done, info = env.step(action)
    if done:
        break

Now show the animation:

def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    anim = animation.FuncAnimation(
        fig, update_scene, fargs=(frames, patch),
        frames=len(frames), repeat=repeat, interval=interval)
    plt.close()
    return anim

plot_animation(frames)

About the Author

Aman Kharwal

I am a programmer from India, and I am here to guide you with Machine Learning for free. I hope you will learn a lot in your journey towards ML and AI with me.

Paragraph Segmentation Using Machine Learning


Natural language processing (NLP) relies heavily on paragraph segmentation, which has various practical applications such as text summarization, sentiment analysis, and topic modeling. Text summarizing algorithms, for example, frequently rely on paragraph segmentation to find the most important areas of a document that must be summarized. Similarly, paragraph segmentation may be required for sentiment analysis algorithms in order to grasp the context and tone of each paragraph independently.

Paragraph Segmentation

The technique of splitting a given text into different paragraphs based on structural and linguistic criteria is known as paragraph segmentation. Paragraph segmentation is used to improve the readability and organization of huge documents such as articles, novels, or reports. Readers can traverse the text more simply, get the information they need more quickly, and absorb the content more effectively using paragraph segmentation.

Depending on the individual properties of the text and the purposes of the segmentation, there are numerous ways to divide it into paragraphs.

1. Text indentation

Indentation is the space at the beginning of a line of text, and in many writing styles it marks the start of a new paragraph. It helps readers see where one paragraph finishes and another begins. Indentation can also be used as a feature for automated paragraph segmentation, the natural language processing task of identifying and separating paragraphs in a body of text. By analyzing indentation patterns, a program can be trained to recognize where paragraphs begin and end, which is valuable in a variety of text analysis applications.
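As a rough illustration of the indentation cue, the sketch below starts a new paragraph at every line that begins with whitespace. It is a simple rule-based heuristic rather than a trained model, and the function name and sample text are made up for this example.

```python
def split_by_indentation(text):
    """Split text into paragraphs, starting a new one at each indented line."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        if line[0].isspace() and current:
            # An indented line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = (
    "    Clustering groups similar points.\n"
    "It needs no labels.\n"
    "    Classification is different.\n"
    "It learns from labeled examples.\n"
)
paragraphs = split_by_indentation(sample)
for p in paragraphs:
    print(p)
```

A learned segmenter would treat the presence of leading whitespace as one input feature among several instead of applying it as a hard rule.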

2. Punctuation marks

Punctuation marks such as periods, question marks, and exclamation points mark the end of a sentence, so they can also be used as cues for where one paragraph ends and the next begins. Punctuation should be used appropriately in written communication, since it helps to clarify the material and makes the text easier to read and understand.

3. Text length

A paragraph is a unit of writing composed of a sequence of connected sentences that address a particular topic or issue. The length of the text can be used to split it into paragraphs. A huge block of content, for example, can be divided into smaller paragraphs based on sentence length. This means that if multiple consecutive sentences discuss the same topic, they can be joined to form a paragraph. Similarly, if the topic or idea changes, a new paragraph may be started to alert the reader. Ultimately, the purpose of paragraphs is to arrange and structure written content so that it is easy to read and understand.

4. Text coherence

Paragraphs are an important part of writing since they help organize ideas and thoughts in a clear and logical way. A cohesive paragraph has sentences that are all connected to, and contribute to, one main idea. The coherence of the text refers to the flow of ideas and the logical links between sentences that allow the reader to tell where a paragraph begins and ends. When reading a text, look for a shift in the topic or the introduction of a new concept to identify the beginning of a new paragraph. Similarly, a concluding sentence or a transition to a new concept might signal the end of a paragraph. Ultimately, text coherence is an important cue for locating paragraph boundaries and interpreting the writer’s intended meaning.
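One crude way to turn the idea of a topic shift into code is to measure how many words two adjacent sentences share and to start a new paragraph when the overlap is low. The sketch below uses the Jaccard similarity of word sets. The function names, the threshold of 0.1, and the sample sentences are assumptions made for illustration, and a real system would at least remove common stop words first.

```python
import re


def words(sentence):
    """Lower-cased set of alphabetic words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segment_by_overlap(sentences, threshold=0.1):
    """Group sentences into paragraphs, breaking where word overlap is low."""
    if not sentences:
        return []
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(words(prev), words(cur)) < threshold:
            paragraphs.append([cur])      # low overlap: treat as a topic shift
        else:
            paragraphs[-1].append(cur)    # enough overlap: same paragraph
    return paragraphs


sentences = [
    "The cart moves left or right on the track.",
    "The agent pushes the cart to keep the pole upright.",
    "Silhouette scores compare clusters.",
    "Higher silhouette scores mean tighter clusters.",
]
paragraphs = segment_by_overlap(sentences)
for p in paragraphs:
    print(" ".join(p))
```

With these sentences the only break falls between the second and third sentence, where the two sentences share no words at all.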

Paragraph segmentation using machine learning

Machine learning algorithms have been employed to automate the job of paragraph segmentation in recent years, attaining remarkable levels of accuracy and speed. Machine learning algorithms are trained on a vast corpus of manually annotated text data with paragraph boundaries. This training data is used to understand the patterns and characteristics that differentiate various paragraphs.

Paragraph segmentation may be accomplished using supervised learning methods. Supervised learning algorithms are machine learning algorithms that learn on labeled data, which has already been labeled with correct answers. The labeled data for paragraph segmentation would consist of text that has been split into paragraphs and each paragraph has been labeled with a unique ID.

Two supervised learning approaches for paragraph segmentation are support vector machines (SVMs) and decision trees. These algorithms employ labeled data to learn patterns and rules that may be used to predict the boundaries of paragraphs in new texts. When given new, unlabeled text, the algorithms may utilize their previously acquired patterns and rules to forecast where one paragraph ends and another begins. This method is especially effective for evaluating vast amounts of text where manual paragraph segmentation would be impossible or time-consuming. Overall, supervised learning algorithms provide a reliable and efficient way for automating paragraph segmentation in a wide range of applications.

For paragraph segmentation, unsupervised learning methods can be employed. Unlike supervised learning algorithms, which require labeled training data, unsupervised learning algorithms may separate paragraphs without any prior knowledge of how the text should be split. Unsupervised learning algorithms use statistical analysis and clustering techniques to detect similar patterns in text. Clustering algorithms, for example, can group together phrases with similar qualities, such as lexicon or grammar, and identify them as belonging to the same paragraph. Topic modeling is another unsupervised learning approach that may be used to discover clusters of linked phrases that may constitute a paragraph. These algorithms do not rely on predetermined rules or patterns, but rather on statistical approaches to find significant patterns and groups in text. Unsupervised learning methods are very beneficial for text segmentation when the structure or formatting of the text is uneven or uncertain. Overall, unsupervised learning algorithms provide a versatile and powerful way for automating paragraph segmentation in a range of applications.

The text file used below (data.txt) contains the paragraph above that starts with ‘For paragraph segmentation, ……’

Python Program

import nltk
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Load the data
with open('/content/data.txt', 'r') as file:
    data = file.read()

# Tokenize the data into sentences (requires the NLTK 'punkt' tokenizer data)
sentences = nltk.sent_tokenize(data)

# Label the sentences as belonging to the same or a new paragraph
# NOTE: the labeling step is missing from the original listing. The line below is a
# placeholder assumption: 1 marks the first sentence of a paragraph, 0 a continuation.
labels = [1 if i == 0 else 0 for i in range(len(sentences))]

# Create a feature matrix using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Create a support vector machine classifier
clf = make_pipeline(SVC(kernel='linear'))

# Train the classifier on the labeled data
clf.fit(X, labels)

# Use the classifier to predict the paragraph boundaries in new text
# NOTE: new_text is not defined in the original listing. The value below is a
# placeholder assumption; replace it with the text to be segmented.
new_text = "This is a new paragraph. It is separate from the previous one. This is the second sentence of the second paragraph."
new_sentences = nltk.sent_tokenize(new_text)
new_X = vectorizer.transform(new_sentences)
new_labels = clf.predict(new_X)

# Print the predicted paragraph boundaries
for i in range(len(new_sentences)):
    if new_labels[i] == 1:
        print(new_sentences[i])

Output

This is a new paragraph. It is separate from the previous one. This is the second sentence of the second paragraph.

Conclusion

In conclusion, paragraph segmentation is an important task in natural language processing that can greatly improve the readability and structure of long texts. Machine learning algorithms have made great progress in this domain, enabling precise and efficient segmentation based on structural cues and statistical analysis. However, further study is needed to improve these models’ performance on more complicated and diverse texts, as well as to investigate novel approaches to paragraph segmentation based on deep learning and other sophisticated techniques.

Mental Health Prediction Using Machine Learning

Text(0.5, 0, ‘Age’)

Inference: The above plot shows the Age column with respect to density. We can see that density is higher from Age 10 to 20 years in our dataset.

# In seaborn 0.9 and later, the size argument is named height
j = sns.FacetGrid(train_df, col='treatment', size=5)
j = j.map(sns.distplot, "Age")

Inference: Treatment 0 means treatment is not necessary, and 1 means it is. The first plot shows that treatment is not necessary from age 0 to 10 and is needed after age 15.

plt.figure(figsize=(12,8))
labels = labelDict['label_Gender']
j = sns.countplot(x="treatment", data=train_df)
j.set_xticklabels(labels)
plt.title('Total Distribution by treated or not')

Text(0.5, 1.0, ‘Total Distribution by treated or not’)

Inference: Here we can see that more males are treated as compared to females in the dataset.

# In seaborn 0.9 and later, factorplot is named catplot and size is named height
o = labelDict['label_age_range']
j = sns.factorplot(x="age_range", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Age')

# Replace the legend labels
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)

# Position the legend
j.fig.subplots_adjust(top=0.9, right=0.8)

Inference: This bar plot shows the probability of a mental health condition for females, males, and transgender people across age groups. In the age group 66 to 100, the probability is much higher for females than for the other genders, and from age 21 to 64 it is much higher for transgender people than for males.

o = labelDict['label_family_history']
j = sns.factorplot(x="family_history", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Family History')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)

o = labelDict['label_care_options']
j = sns.factorplot(x="care_options", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Care options')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)

Inference: In the dataset, those who have a family history of mental health problems have a higher probability of a mental health condition. Here we can see that the probability for transgender people with a family history of such conditions is almost 90%.

Inference: This bar plot shows the probability of a mental health condition with respect to care options. In the dataset, those who do not have care options have a higher probability of a mental health condition. Here we can see that the probability for transgender people is very high for those without care options and low for those who have them.

o = labelDict['label_benefits']
j = sns.factorplot(x="benefits", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Benefits')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)

Inference: This bar plot shows the probability of a mental health condition with respect to benefits. In the dataset, those who do not have any benefits have a higher probability of a mental health condition. Here we can see that the probability for transgender people is very high for those who are not getting any benefits and low for those who have benefit options.

o = labelDict['label_work_interfere']
j = sns.factorplot(x="work_interfere", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Work interfere')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)

Inference: This bar plot shows the probability of a mental health condition with respect to work interference. For those who report no work interference, the probability of a mental health condition is very low, and it is high for those who report that their condition rarely interferes with work.

Scaling and Fitting

# Imports used by the code in the rest of this article
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import randint
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler, binarize
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Scaling Age
scaler = MinMaxScaler()
train_df['Age'] = scaler.fit_transform(train_df[['Age']])
train_df.head()

# define X and y
feature_cols1 = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'anonymity', 'leave', 'work_interfere']
X = train_df[feature_cols1]
y = train_df.treatment
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)

# Create dictionaries for final graph
# Use: methodDict['Stacking'] = accuracy_score
methodDict = {}
rmseDict = ()

# Estimate feature importances with a forest of extremely randomized trees
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Order the labels the same way as the bars (most important feature first)
labels = []
for f in range(X.shape[1]):
    labels.append(feature_cols1[indices[f]])

plt.figure(figsize=(12,8))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), labels, rotation='vertical')
plt.xlim([-1, X.shape[1]])
plt.show()

Tuning

def evalClassModel(model, y_test1, y_pred_class, plot=False):
    # Classification accuracy: percentage of correct predictions
    print('Accuracy:', metrics.accuracy_score(y_test1, y_pred_class))
    # Null accuracy: accuracy achieved by always predicting the most frequent class
    print('Null accuracy:\n', y_test1.value_counts())
    # calculate the percentage of ones
    print('Percentage of ones:', y_test1.mean())
    # calculate the percentage of zeros
    print('Percentage of zeros:', 1 - y_test1.mean())
    print('True:', y_test1.values[0:25])
    print('Pred:', y_pred_class[0:25])

    # Confusion matrix, indexed as [row, column] = [actual, predicted]
    confusion = metrics.confusion_matrix(y_test1, y_pred_class)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]

    # visualize Confusion Matrix
    sns.heatmap(confusion, annot=True, fmt="d")
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

    accuracy = metrics.accuracy_score(y_test1, y_pred_class)
    print('Classification Accuracy:', accuracy)
    print('Classification Error:', 1 - metrics.accuracy_score(y_test1, y_pred_class))
    fp_rate = FP / float(TN + FP)
    print('False Positive Rate:', fp_rate)
    print('Precision:', metrics.precision_score(y_test1, y_pred_class))
    print('AUC Score:', metrics.roc_auc_score(y_test1, y_pred_class))

    # calculate cross-validated AUC
    print('Crossvalidated AUC values:', cross_val_score(model, X, y, cv=10, scoring='roc_auc').mean())

    print('First 10 predicted responses:\n', model.predict(X_test1)[0:10])
    print('First 10 predicted probabilities of class members:\n', model.predict_proba(X_test1)[0:10])
    model.predict_proba(X_test1)[0:10, 1]
    # predicted probability of class 1 for every test sample
    y_pred_prob = model.predict_proba(X_test1)[:, 1]

    if plot == True:
        # histogram of predicted probabilities
        plt.rcParams['font.size'] = 12
        plt.hist(y_pred_prob, bins=8)
        plt.xlim(0, 1)
        plt.title('Histogram of predicted probabilities')
        plt.xlabel('Predicted probability of treatment')
        plt.ylabel('Frequency')

    # predict treatment if the predicted probability is greater than 0.3
    y_pred_prob = y_pred_prob.reshape(-1, 1)
    y_pred_class = binarize(y_pred_prob, threshold=0.3)[0]
    print('First 10 predicted probabilities:\n', y_pred_prob[0:10])

    roc_auc = metrics.roc_auc_score(y_test1, y_pred_prob)
    fpr, tpr, thresholds = metrics.roc_curve(y_test1, y_pred_prob)
    if plot == True:
        plt.figure()
        plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
        plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.0])
        plt.rcParams['font.size'] = 12
        plt.title('ROC curve for treatment classifier')
        plt.xlabel('False Positive Rate (1 - Specificity)')
        plt.ylabel('True Positive Rate (Sensitivity)')
        plt.legend(loc="lower right")
        plt.show()

    def evaluate_threshold(threshold):
        # NOTE: the body of this helper is missing from the original listing.
        # The two lines below are an assumed reconstruction.
        print('Sensitivity for ' + str(threshold) + ' :', tpr[thresholds > threshold][-1])
        print('Specificity for ' + str(threshold) + ' :', 1 - fpr[thresholds > threshold][-1])

    # NOTE: predict_mine is not defined in the original listing. A 0.5 threshold
    # on the predicted probabilities is assumed here.
    predict_mine = np.where(y_pred_prob > 0.50, 1, 0)
    confusion = metrics.confusion_matrix(y_test1, predict_mine)
    print(confusion)

    return accuracy

Tuning with cross-validation score

def tuningCV(knn):
    # try K = 1 to 30 and record the cross-validated accuracy of each
    k_range = list(range(1, 31))
    k_scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
        k_scores.append(scores.mean())
    print(k_scores)
    plt.plot(k_range, k_scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validated Accuracy')
    plt.show()

Tuning with GridSearchCV

def tuningGridSerach(knn):
    k_range = list(range(1, 31))
    print(k_range)
    param_grid = dict(n_neighbors=k_range)
    print(param_grid)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X, y)
    # grid_scores_ was removed from scikit-learn; cv_results_ holds the same information
    print(grid.cv_results_['params'][0])
    print(grid.cv_results_['mean_test_score'][0])
    grid_mean_scores1 = grid.cv_results_['mean_test_score']
    print(grid_mean_scores1)
    # plot the results
    plt.plot(k_range, grid_mean_scores1)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validated Accuracy')
    plt.show()
    # examine the best model
    print('GridSearch best score', grid.best_score_)
    print('GridSearch best params', grid.best_params_)
    print('GridSearch best estimator', grid.best_estimator_)

Tuning with RandomizedSearchCV

def tuningRandomizedSearchCV(model, param_dist):
    # search a random sample of 10 parameter combinations
    rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
    rand1.fit(X, y)
    rand1.cv_results_
    print('Rand1. Best Score: ', rand1.best_score_)
    print('Rand1. Best Params: ', rand1.best_params_)
    # run the randomized search 20 times and collect the best score of each run
    best_scores = []
    for _ in range(20):
        rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10)
        rand1.fit(X, y)
        best_scores.append(round(rand1.best_score_, 3))
    print(best_scores)

Tuning by searching multiple parameters simultaneously

def tuningMultParam(knn):
    # search over the number of neighbors and the weighting scheme together
    k_range = list(range(1, 31))
    weight_options = ['uniform', 'distance']
    param_grid = dict(n_neighbors=k_range, weights=weight_options)
    print(param_grid)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X, y)
    # grid_scores_ was removed from scikit-learn; cv_results_ holds the same information
    print(grid.cv_results_['mean_test_score'])
    print('Multiparam. Best Score: ', grid.best_score_)
    print('Multiparam. Best Params: ', grid.best_params_)

Evaluating Models

Logistic Regression

def logisticRegression():
    # train a logistic regression model on the training set
    logreg = LogisticRegression()
    logreg.fit(X_train1, y_train1)
    # make class predictions for the testing set
    y_pred_class = logreg.predict(X_test1)
    accuracy_score = evalClassModel(logreg, y_test1, y_pred_class, True)
    #Data for final graph
    methodDict['Log. Regression'] = accuracy_score * 100

logisticRegression()

True value: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Predicted value: [1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.09193053 0.90806947] [0.95991564 0.04008436] [0.96547467 0.03452533] [0.78757121 0.21242879] [0.38959922 0.61040078] [0.05264207 0.94735793] [0.75035574 0.24964426] [0.19065116 0.80934884] [0.61612081 0.38387919] [0.47699963 0.52300037]] [[0.90806947] [0.04008436] [0.03452533] [0.21242879] [0.61040078] [0.94735793] [0.24964426] [0.80934884] [0.38387919] [0.52300037]]
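The two arrays above appear to be the first ten rows of the predicted class probabilities, followed by the second column on its own (the probability of class 1). Assuming the usual rule of picking the more probable class, which for two classes is a 0.5 threshold, the class labels can be recovered like this:

```python
# Probability of class 1 for the first ten test rows, copied from the output above.
prob_class_1 = [0.90806947, 0.04008436, 0.03452533, 0.21242879, 0.61040078,
                0.94735793, 0.24964426, 0.80934884, 0.38387919, 0.52300037]

threshold = 0.5  # assumed cut-off for a binary classifier
predicted = [1 if p > threshold else 0 for p in prob_class_1]
print(predicted)  # [1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
```

This agrees with the first ten entries of the predicted values printed earlier.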

[[142 49] [ 28 159]]
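Assuming the matrix above follows scikit-learn's `confusion_matrix` layout (rows are true classes, columns are predicted classes, class 0 first), the usual metrics can be worked out by hand:

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]] (assumed layout).
(tn, fp), (fn, tp) = [[142, 49], [28, 159]]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
false_positive_rate = fp / (fp + tn)

print(f"accuracy  = {accuracy:.4f}")             # 0.7963
print(f"precision = {precision:.4f}")            # 0.7644
print(f"recall    = {recall:.4f}")               # 0.8503
print(f"FPR       = {false_positive_rate:.4f}")  # 0.2565
```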

KNeighbors Classifier

def Knn():
    # Calculating the best parameters
    knn = KNeighborsClassifier(n_neighbors=5)
    k_range = list(range(1, 31))
    weight_options = ['uniform', 'distance']
    param_dist = dict(n_neighbors=k_range, weights=weight_options)
    tuningRandomizedSearchCV(knn, param_dist)
    # train a KNeighborsClassifier model on the training set
    knn = KNeighborsClassifier(n_neighbors=27, weights='uniform')
    knn.fit(X_train1, y_train1)
    # make class predictions for the testing set
    y_pred_class = knn.predict(X_test1)
    accuracy_score = evalClassModel(knn, y_test1, y_pred_class, True)
    #Data for final graph
    methodDict['K-Neighbors'] = accuracy_score * 100

Knn()

[0.816, 0.812, 0.821, 0.823, 0.823, 0.818, 0.821, 0.821, 0.815, 0.812, 0.819, 0.811, 0.819, 0.818, 0.82, 0.815, 0.803, 0.821, 0.823, 0.815] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.33333333 0.66666667] [1. 0. ] [1. 0. ] [0.66666667 0.33333333] [0.37037037 0.62962963] [0.03703704 0.96296296] [0.59259259 0.40740741] [0.37037037 0.62962963] [0.33333333 0.66666667] [0.33333333 0.66666667]] [[0.66666667] [0. ] [0. ] [0.33333333] [0.62962963] [0.96296296] [0.40740741] [0.62962963] [0.66666667] [0.66666667]]

[[135 56] [ 18 169]]

Decision Tree 

def treeClassifier():
    # Calculating the best parameters
    tree1 = DecisionTreeClassifier()
    featuresSize = len(feature_cols1)
    param_dist = {"max_depth": [3, None],
                  "max_features": randint(1, featuresSize),
                  "min_samples_split": randint(2, 9),
                  "min_samples_leaf": randint(1, 9),
                  "criterion": ["gini", "entropy"]}
    tuningRandomizedSearchCV(tree1, param_dist)
    # train a decision tree model on the training set
    tree1 = DecisionTreeClassifier(max_depth=3, min_samples_split=8, max_features=6, criterion='entropy', min_samples_leaf=7)
    tree1.fit(X_train1, y_train1)
    # make class predictions for the testing set
    y_pred_class = tree1.predict(X_test1)
    accuracy_score = evalClassModel(tree1, y_test1, y_pred_class, True)
    #Data for final graph
    methodDict['Decision Tree Classifier'] = accuracy_score * 100

treeClassifier()

[0.83, 0.827, 0.831, 0.829, 0.831, 0.83, 0.783, 0.831, 0.821, 0.831, 0.831, 0.831, 0.8, 0.79, 0.831, 0.831, 0.831, 0.829, 0.831, 0.831] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.18 0.82 ] [0.96534653 0.03465347] [0.96534653 0.03465347] [0.89473684 0.10526316] [0.36097561 0.63902439] [0.18 0.82 ] [0.89473684 0.10526316] [0.11320755 0.88679245] [0.36097561 0.63902439] [0.36097561 0.63902439]] [[0.82 ] [0.03465347] [0.03465347] [0.10526316] [0.63902439] [0.82 ] [0.10526316] [0.88679245] [0.63902439] [0.63902439]]

[[130 61] [ 12 175]]

Random Forests

def randomForest():
    # Calculating the best parameters
    forest1 = RandomForestClassifier(n_estimators = 20)
    featuresSize = len(feature_cols1)
    param_dist = {"max_depth": [3, None],
                  "max_features": randint(1, featuresSize),
                  "min_samples_split": randint(2, 9),
                  "min_samples_leaf": randint(1, 9),
                  "criterion": ["gini", "entropy"]}
    tuningRandomizedSearchCV(forest1, param_dist)
    # Building and fitting my_forest
    forest1 = RandomForestClassifier(max_depth = None, min_samples_leaf=8, min_samples_split=2, n_estimators = 20, random_state = 1)
    my_forest = forest1.fit(X_train1, y_train1)
    # make class predictions for the testing set
    y_pred_class = my_forest.predict(X_test1)
    accuracy_score = evalClassModel(my_forest, y_test1, y_pred_class, True)
    #Data for final graph
    methodDict['Random Forest'] = accuracy_score * 100

randomForest()

[0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831, 0.837, 0.834, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831] True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.2555794 0.7444206 ] [0.95069083 0.04930917] [0.93851009 0.06148991] [0.87096597 0.12903403] [0.40653554 0.59346446] [0.17282958 0.82717042] [0.89450448 0.10549552] [0.4065912 0.5934088 ] [0.20540631 0.79459369] [0.19337644 0.80662356]] [[0.7444206 ] [0.04930917] [0.06148991] [0.12903403] [0.59346446] [0.82717042] [0.10549552] [0.5934088 ] [0.79459369] [0.80662356]]


Boosting

def boosting():
    # Building and fitting
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=1)
    # Note: scikit-learn 1.2 and later name this parameter `estimator` instead of `base_estimator`
    boost = AdaBoostClassifier(base_estimator=clf, n_estimators=500)
    boost.fit(X_train1, y_train1)
    # make class predictions for the testing set
    y_pred_class = boost.predict(X_test1)
    accuracy_score = evalClassModel(boost, y_test1, y_pred_class, True)
    #Data for final graph
    methodDict['Boosting'] = accuracy_score * 100

boosting()

True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0] Pred val: [1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]

[[0.49924555 0.50075445] [0.50285507 0.49714493] [0.50291786 0.49708214] [0.50127788 0.49872212] [0.50013552 0.49986448] [0.49796157 0.50203843] [0.50046371 0.49953629] [0.49939483 0.50060517] [0.49921757 0.50078243] [0.49897133 0.50102867]] [[0.50075445] [0.49714493] [0.49708214] [0.49872212] [0.49986448] [0.50203843] [0.49953629] [0.50060517] [0.50078243] [0.50102867]]

Predicting with Neural Network

Create input function

%tensorflow_version 1.x
import tensorflow as tf
import argparse

TensorFlow 1.x selected.

batch_size = 100
train_steps = 1000

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)

def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset, then shuffle, repeat, and batch the examples.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.shuffle(1000).repeat().batch(batch_size)

def eval_input_fn(features, labels, batch_size):
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)
    # Convert the inputs to a Dataset and batch the examples.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    dataset = dataset.batch(batch_size)
    # Return the dataset.
    return dataset
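To make the shuffle, repeat, batch chain in the training input function concrete without TensorFlow, here is a plain-Python sketch of the same idea. It is an illustration only: the generator `shuffled_batches` is hypothetical, it reshuffles the whole dataset on each pass rather than using a 1000-element shuffle buffer, and it is not how `tf.data` is implemented.

```python
import itertools
import random


def shuffled_batches(rows, batch_size, seed=0):
    """Endlessly yield batches: shuffle each pass, carry leftovers across passes."""
    rng = random.Random(seed)
    pending = []
    while True:  # plays the role of .repeat()
        epoch = list(rows)
        rng.shuffle(epoch)  # plays the role of .shuffle(...)
        pending.extend(epoch)
        while len(pending) >= batch_size:  # plays the role of .batch(batch_size)
            yield pending[:batch_size]
            pending = pending[batch_size:]


rows = list(range(10))
# The stream never ends, so take only the first four batches.
for batch in itertools.islice(shuffled_batches(rows, batch_size=4), 4):
    print(batch)
```

Because the stream is endless, something has to bound training, which is the job of `train_steps` in the article's code.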

Define the feature columns

# Define Tensorflow feature columns
age = tf.feature_column.numeric_column("Age")
gender = tf.feature_column.numeric_column("Gender")
family_history = tf.feature_column.numeric_column("family_history")
benefits = tf.feature_column.numeric_column("benefits")
care_options = tf.feature_column.numeric_column("care_options")
anonymity = tf.feature_column.numeric_column("anonymity")
leave = tf.feature_column.numeric_column("leave")
work_interfere = tf.feature_column.numeric_column("work_interfere")

feature_columns = [age, gender, family_history, benefits, care_options, anonymity, leave, work_interfere]

Instantiate an Estimator

# Two hidden layers of 10 nodes each
model = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                   hidden_units=[10, 10],
                                   optimizer=tf.train.ProximalAdagradOptimizer(
                                       learning_rate=0.1,
                                       l1_regularization_strength=0.001
                                   ))

Train the model

model.train(input_fn=lambda:train_input_fn(X_train1, y_train1, batch_size), steps=train_steps)

Evaluate the model

# Evaluate the model.
eval_result = model.evaluate(
    input_fn=lambda:eval_input_fn(X_test1, y_test1, batch_size))

print('\nTest set accuracy: {accuracy:0.2f}\n'.format(**eval_result))

#Data for final graph
accuracy = eval_result['accuracy'] * 100
methodDict['Neural Network'] = accuracy

The test set accuracy: 0.80

Making predictions (inferring) from the trained model

# Generate predictions from the model
predictions = list(model.predict(input_fn=lambda:eval_input_fn(X_train1, y_train1, batch_size=batch_size)))

template = ('\nIndex: "{}", Prediction is "{}" ({:.1f}%), expected "{}"')

# Dictionary for predictions
col1 = []
col2 = []
col3 = []

for idx, expected, p in zip(X_train1.index, y_train1, predictions):
    v = p["class_ids"][0]
    class_id = p['class_ids'][0]
    probability = p['probabilities'][class_id]  # Probability
    # Adding to dataframe
    col1.append(idx)       # Index
    col2.append(v)         # Prediction
    col3.append(expected)  # Expected
    #print(template.format(idx, v, 100 * probability, expected))

results = pd.DataFrame({'index':col1, 'prediction':col2, 'expected':col3})
results.head()

Creating Predictions on the Test Set

# Generate predictions with the best methodology
clf = AdaBoostClassifier()
clf.fit(X, y)
dfTestPredictions = clf.predict(X_test1)

# Write predictions to csv file
results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions})
# Save to file
results.to_csv('results.csv', index=False)
results.head()

Submission

results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions})
results

The final prediction consists of 0s and 1s: 0 means the person does not need mental health treatment, and 1 means the person does need it.


After working with these employee records, we were able to build various machine learning models. Of all the models, AdaBoost achieved 81.75% accuracy with an AUC of 0.8185. Along the way, we were also able to draw some insights from the data through analysis and visualization.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

5 Best Desktops For Machine Learning & Deep Learning





Machine learning and deep learning require powerful computers. Amazon has a fair amount of choices for people and budgets of all kinds.

This article will help you decide what to buy by showing you a list of different computers with their pros and cons.





Deep learning and machine learning are techniques loosely inspired by the brain's neural networks. They rely on self-learning: the AI is given a task and learns from its own mistakes in order to complete it.

If you want to get into machine learning and deep learning, you may need to take a look at your current computer. Even if your desktop performs everyday tasks with ease, that doesn't mean it has the computing power to run machine learning and deep learning programs.

The GPU and CPU are crucial. You need a graphics card with plenty of memory and a processor with many cores. You also need plenty of RAM, around 8GB or more.

Because these workloads run for long periods of time, the computer needs to sustain them without problems. A powerful cooler is therefore required to stop your components from overheating and thermal throttling.

What are the best desktops for machine & deep learning?

The RTX 2080 Super has 8GB of dedicated memory

The Intel Core i9-9900K has 8 cores and supports turbo boost

Over 32GB of HyperX DDR4 3000 MHz RAM

Liquid cooling keeps your temperatures as low as possible even during intensive use

The product is expensive, with prices starting at $2,000


HP Obelisk Omen is the most powerful item on our list. Equipped with high-end hardware such as the 9th generation Intel Core i9-9900K processor and the NVIDIA GeForce RTX 2080 Super, it is well suited to machine learning and deep learning.

If you want speed, power, customization, and the best quality products out there, this is the choice for you.

The GTX 1660 Ti offers 6GB of dedicated memory

The i5-9400F has 6 cores

16GB of DDR4 memory

The processor does not support overclocking


Our last item on this list is another midrange choice, the HP Pavilion. It is close in performance to the Skytech Shiva while also being a bit cheaper.

The GeForce GTX 1660 Ti is only about 10% weaker than the RTX 2060, but it is less expensive. In addition, the i5-9400F is still capable of handling deep learning and machine learning workloads.

The Ryzen 5 2600 offers 6 cores

The video card has 6GB of GDDR6 memory

3x RGB ring fans for maximum airflow

80 Plus certified 500-watt power supply

It only has 8GB of RAM


Equipped with a Ryzen 5 2600 processor and a GTX 1660 Ti graphics card, it is capable of running data-parsing algorithms. Furthermore, both the GPU and the CPU can be overclocked.

The Intel Core i7-9700K has 8 cores

The NVIDIA RTX 2070 Super comes with 8GB of dedicated memory

Liquid cooling keeps the temperatures low

16GB of DDR4 RAM is enough for deep learning & machine learning

You need to update your firmware if you want to get the proper CPU speed boosts working


CyberpowerPC Gamer Supreme is our next recommendation on the list. Coming close to the HP Omen mentioned above, this desktop trades a bit of power for a lower price.

The Intel Core i7-9700K and GeForce RTX 2070 Super still offer high-end performance at a more affordable price. Moreover, this desktop can also be overclocked with no problem.

The Ryzen 5 2600 processor has 6 cores

It has 16GB of DDR4 RAM

The Graphics Card has 6GB of dedicated memory

Equipped with 3x RGB Ring Fans ensuring good airflow

Lacks a USB Type-C port


Skytech Shiva is our first budget-oriented choice. Significantly cheaper than the other two desktops from above, yet still holding strong in terms of performance, this computer is perfect for those who want some balance between price and power.

This product is equipped with an AMD processor, the Ryzen 5 2600, and a non-Super RTX 2060. The CPU is about 20% slower than an i7-9900k, but it is much cheaper. Moreover, both the CPU and GPU can be easily overclocked.

This list covers all you need to buy a brand new desktop for deep learning & machine learning.


What To Know About Machine Learning

Machine learning is a discipline of computer science that studies the design and analysis of algorithms that can learn from data and make predictions based on it. It is used in many applications, such as self-driving cars, effective web search and speech recognition. Classic programming techniques assume that you understand the problem clearly and, with that understanding, write a series of clear instructions to solve that specific problem or perform a specific task. Examples of such problems are numerous; in fact, most programs are written with very clear expectations of input and output and a well-defined algorithm for the processing: for example, sorting numbers, removing a specific string from a text file, or copying a file.

However, there is a class of problems for which conventional problem-solving or programming methods are of little use. For instance, suppose you have about 50,000 documents that need to be sorted into categories such as sports, business and entertainment, without going through each one. Or consider looking for a specific object in tens of thousands of images. In the latter case, the object could be photographed in a different orientation or under different lighting conditions. How do you tell which images contain the object? Another useful example is building an online payment gateway that needs to stop fraudulent transactions. One method is to identify signs of potentially fraudulent transactions and trigger alerts before the transaction is complete. How would you predict this accurately without creating unnecessary alerts?

As you can readily imagine, it is impossible to write exact algorithms for such problems. What we can do is build systems that work like a human expert. A doctor can tell what disease a patient has by looking at the test reports. While it is not possible for the doctor to make a precise diagnosis 100 percent of the time, he will be right most of the time. Nobody programmed the doctor; he learned these things through study and experience.

Machine learning and mathematics

While the example of the doctor may be somewhat different from real machine learning, the core idea is that machines can learn from large data sets and improve as they gain experience. Because data contains randomness and uncertainty, we have to apply concepts from probability and statistics. In fact, machine learning algorithms depend so heavily on concepts from statistics that many people refer to machine learning as statistical learning. Apart from statistics, another important branch of mathematics in heavy use is linear algebra. Matrices, solutions of systems of equations and optimisation algorithms all play significant roles in machine learning.

Machine learning and Big Data

There are essentially two kinds of machine learning: supervised learning and unsupervised learning. Supervised learning works with data that carries labels, as in the following example.

Suppose we must predict whether it will rain on a given day, using wind and temperature data. Whether it rained or not is stored in a column of the data set, which becomes the label.

The algorithms that learn from such data are known as supervised learning algorithms. Though some data can be extracted or generated automatically, such as from a system log, it often has to be labeled manually, which raises the cost of data acquisition.

We can also classify machine learning algorithms along another axis: regression algorithms and classification algorithms. Regression algorithms predict a number, such as the next day's temperature or the stock market's closing index. Classification algorithms assign an input to a category, such as whether it will rain or not, whether the stock market will close positive or negative, or whether a patient has disease x, disease y or disease z.
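The difference between classification and regression can be seen in a few lines of code. The sketch below is a toy illustration under stated assumptions: the weather readings, labels and next-day temperatures are invented, and the helper `nearest_targets` is a hypothetical, minimal nearest-neighbor lookup rather than a library function.

```python
from collections import Counter
from statistics import mean

# Hypothetical labeled data: (temperature in °C, wind speed in km/h).
features = [(30, 5), (28, 10), (18, 25), (16, 30), (22, 20), (33, 8)]
rained = [0, 0, 1, 1, 1, 0]                           # class labels (classification)
next_day_temp = [31.0, 27.5, 17.0, 15.5, 21.0, 32.0]  # numeric targets (regression)


def nearest_targets(query, targets, k=3):
    """Return the targets of the k training points closest to the query."""
    by_distance = sorted(
        zip(features, targets),
        key=lambda pair: (pair[0][0] - query[0]) ** 2 + (pair[0][1] - query[1]) ** 2,
    )
    return [target for _, target in by_distance[:k]]


query = (17, 28)
# Classification: a majority vote among the neighbors gives a category.
will_rain = Counter(nearest_targets(query, rained)).most_common(1)[0][0]
# Regression: averaging the neighbors' values gives a number.
temp_forecast = mean(nearest_targets(query, next_day_temp))
print(will_rain, round(temp_forecast, 2))  # 1 17.83
```

The same labeled inputs serve both tasks; only the type of the target, a category or a number, changes.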


It is important to understand and appreciate that machine learning algorithms are essentially mathematical algorithms, and we can implement them in almost any language we like. The one I like and use a great deal is the R language. There are many popular machine learning packages and libraries in various languages. Weka is powerful machine learning software written in Java and is very popular. Scikit-learn is a favorite among Python programmers. One may also choose the Orange machine learning toolbox, available in Python. While Weka is very powerful, it has some licence issues for commercial use. Though development seems to have stopped, it is a decent library worth trying, with sufficient documentation and tutorials to get started easily. I would recommend that more advanced readers explore deep learning algorithms.
