Speech Emotion Recognition (SER) Through Machine Learning
Authors: Mohit Wadhwa, Anurag Gupta, Prateek Kumar Pandey
Acknowledgements: Paulami Das, Head of Data Science CoE, and Anish Roychowdhury, Senior Analytics Leader, Brillio
Organizations: Brillio Technologies, Indian Institute of Technology, Kharagpur
1. Background
As human beings, speech is among the most natural ways we express ourselves. We depend on it so much that we recognize its importance when resorting to other forms of communication, such as emails and text messages, where we often use emojis to express the emotions associated with the message. Since emotions play a vital role in communication, their detection and analysis are of vital importance in today's digital world of remote communication. Emotion detection is a challenging task because emotions are subjective; there is no common consensus on how to measure or categorize them. We define a SER system as a collection of methodologies that process and classify speech signals to detect the emotions embedded in them. Such a system can find use in a wide variety of applications, such as interactive voice-based assistants or caller-agent conversation analysis. In this study we attempt to detect the underlying emotions in recorded speech by analysing the acoustic features of the audio recordings.
2. Solution Overview
There are three classes of features in speech: the lexical features (the vocabulary used), the visual features (the expressions the speaker makes) and the acoustic features (sound properties like pitch, tone, jitter, etc.). The problem of speech emotion recognition can be solved by analysing one or more of these features. Following the lexical features would require a transcript of the speech, and hence an additional speech-to-text step if one wants to predict emotions from real-time audio. Similarly, analysing visual features would require access to video of the conversations, which might not be feasible in every case, whereas analysis of the acoustic features can be done in real time while the conversation is taking place, since only the audio data is needed. Hence, we choose to analyse the acoustic features in this work. Furthermore, emotions can be represented in two ways:
Discrete Classification: Classifying emotions in discrete labels like anger, happiness, boredom, etc.
Dimensional Representation: Representing emotions with dimensions such as Valence (on a negative to positive scale), Activation or Energy (on a low to high scale) and Dominance (on an active to passive scale)
Both these approaches have their pros and cons. The dimensional approach is more elaborate and gives more context to the prediction, but it is harder to implement and annotated audio data in a dimensional format is scarce. Discrete classification is more straightforward and easier to implement, but it lacks the context that the dimensional representation provides. We have used the discrete classification approach in the current study owing to the lack of dimensionally annotated data in the public domain.
3. Data Sources
The data used in this project was combined from five different sources, as described below:
TESS (Toronto Emotional Speech Set): 2 female speakers (young and old), 2800 audio files, random words were spoken in 7 different emotions.
SAVEE (Surrey Audio-Visual Expressed Emotion): 4 male speakers, 480 audio files, same sentences were spoken in 7 different emotions.
RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song): 2452 audio files, with 12 male and 12 female speakers; the lexical features (vocabulary) of the utterances are kept constant by having all speakers speak only 2 statements of equal length in 8 different emotions.
CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset): 7442 audio files, 91 different speakers (48 male and 43 female between the ages of 20 and 74) of different races and ethnicities, different statements are spoken in 6 different emotions and 4 emotional levels (low, mid, high and unspecified).
Berlin: 5 male and 5 female speakers, 535 audio files, 10 different sentences were spoken in 6 different emotions.
4. Features used in this study
From the audio data we extracted three key features used in this study: MFCC (Mel Frequency Cepstral Coefficients), Mel Spectrogram and Chroma. The Librosa Python package was used to extract them.
Choice of features
MFCC is by far the most researched and utilized feature in research papers and open-source projects.
Mel spectrogram plots amplitude on frequency vs time graph on a “Mel” scale. As the project is on emotion recognition, a purely subjective item, we found it better to plot the amplitude on Mel scale as Mel scale changes the recorded frequency to “perceived frequency”.
Researchers have also used Chroma in their projects, as per the literature, so we tried basic modelling both with only MFCC and Mel and with all three of MFCC, Mel and Chroma. The model with all three features gave slightly better results, hence we chose to keep all of them.
Details about the features are mentioned below.
MFCC (Mel Frequency Cepstral Coefficients)
In the conventional analysis of time signals, any periodic component (for example, echoes) shows up as sharp peaks in the corresponding frequency spectrum (i.e. the Fourier spectrum, obtained by applying a Fourier transform to the time signal). A cepstral feature is obtained by applying a further Fourier transform to this spectrum. The special characteristic of MFCC is that it is computed on a Mel scale, which relates the perceived frequency of a tone to the actual measured frequency; it rescales frequency to match more closely what the human ear hears. The envelope of the temporal power spectrum of the speech signal is representative of the vocal tract, and MFCC accurately represents this envelope.
Mel Spectrogram
A Fast Fourier Transform is computed on overlapping windowed segments of the signal to obtain the spectrogram. A Mel spectrogram is simply a spectrogram whose amplitudes are mapped onto a Mel scale.
Chroma
A Chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class of the standard chromatic scale is present in the signal.
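As an illustration, all three features can be extracted with Librosa along the following lines (a minimal sketch; the file name, sampling rate and the n_mfcc/n_mels values are illustrative assumptions, not necessarily the exact parameters used in this study):

import librosa
import numpy as np

# Load an audio clip (the path and sampling rate are placeholders)
y, sr = librosa.load("speech_clip.wav", sr=16000)

# MFCC: cepstral coefficients computed on the Mel scale
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)             # shape: (40, n_frames)

# Mel spectrogram: spectrogram amplitudes mapped onto the Mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # shape: (128, n_frames)

# Chroma: energy of each of the 12 pitch classes
chroma = librosa.feature.chroma_stft(y=y, sr=sr)               # shape: (12, n_frames)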
5. Pre-processing
As the typical output of feature extraction was 2D in form, we decided to take a bi-directional approach, using both a 1D form of input and a 2D form of input, as discussed below.
1D Data Format
The features extracted from the audio clips are in matrix form. To model them with traditional ML algorithms like SVM and XGBoost, or with a 1D CNN, we considered converting the matrices into 1D form by taking either row means or column means. In preliminary modelling the array of row means performed better than the array of column means, so we proceeded with the 1D array obtained from the row means of the feature matrices.
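A minimal sketch of this conversion, assuming the feature matrices from the extraction snippet above (with these illustrative parameters the vector has 40 + 128 + 12 = 180 entries, close to the feature count mentioned later):

# Collapse each (n_bands, n_frames) matrix into a vector of row means
features_1d = np.hstack([
    np.mean(mfcc, axis=1),     # 40 values
    np.mean(mel, axis=1),      # 128 values
    np.mean(chroma, axis=1),   # 12 values
])                             # shape: (180,)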
2D Data Format
The 2D features were used in the deep learning model (CNN). The y-axis of the feature matrices depends on the n_mfcc or n_mels parameter chosen during extraction, while the x-axis depends on the audio duration and the sampling rate. Since the audio clips in our datasets varied in length from just under 2 seconds to over 6 seconds, simply choosing one median length, clipping all longer files to it and zero-padding all shorter files would not have been feasible: longer clips would lose information and shorter clips would be silence for the latter half of their length. To address this problem, we used different sampling rates in extraction according to the audio lengths. In our approach, any audio file of 5 seconds or longer was clipped at 5 seconds and sampled at 16000 Hz, and shorter clips were sampled such that the product of audio duration and sampling rate remained 80000. In this way we maintained the matrix dimensions for all audio clips without losing much information.
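The duration-dependent resampling can be sketched as follows (an illustration of the scheme under current Librosa conventions, not the authors' exact code):

def load_fixed_length(path, target_samples=80000, max_sr=16000):
    # Load at the file's native sampling rate to measure its duration
    y, sr = librosa.load(path, sr=None)
    duration = len(y) / sr
    if duration >= 5.0:
        # Clip at 5 s, then sample at 16 kHz, giving exactly 80000 samples
        y = librosa.resample(y[: int(5.0 * sr)], orig_sr=sr, target_sr=max_sr)
    else:
        # Choose the rate so that duration * sampling rate stays 80000
        new_sr = int(target_samples / duration)
        y = librosa.resample(y, orig_sr=sr, target_sr=new_sr)
    # Guard against off-by-one rounding before feature extraction
    return np.pad(y, (0, max(0, target_samples - len(y))))[:target_samples]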
6. Exploratory Data Analysis
The combined data set from the original 5 sources was analysed thoroughly with respect to the following aspects:
Emotion distribution by gender
Variation in energy across emotions
Variation of relative pace and power across emotions
We checked the distribution of labels with respect to emotion and gender and found that while the data is balanced for six emotions, viz. neutral, happy, sad, angry, fear and disgust, the number of labels was slightly lower for surprise and negligible for boredom. While the slightly fewer instances of surprise can be overlooked on account of it being a rarer emotion, the imbalance against boredom was rectified later by clubbing sadness and boredom together, as they are acoustically similar. Boredom could instead have been combined with the neutral emotion, but since both sadness and boredom are negative emotions, it made more sense to combine those two.
Emotion Distribution by Gender
Regarding the distribution of gender, the number of female speakers was found to be slightly higher than that of male speakers, but the imbalance was not large enough to warrant any special attention (refer to Figure 1).
Figure 1: Distribution of emotions with respect to gender
Variation in Energy Across Emotions
Since the audio clips in our dataset were of different lengths, power, which is energy per unit time, was found to be a more accurate measure for studying energy variation uniformly. This metric was plotted against the different emotions. From the graph (see Fig. 2) it is quite evident that the primary way people express anger or fear is a higher-energy delivery. We also observe that disgust and sadness are closer to neutral with regard to energy, although exceptions do exist.
Figure 2: Distribution of power across emotions
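As a sketch, the power of a clip can be computed as its total energy divided by its duration (a common convention; the authors' exact definition is not stated):

def clip_power(y, sr):
    energy = np.sum(y ** 2)      # energy of the discrete signal
    duration = len(y) / sr       # seconds
    return energy / duration     # energy per unit time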
Variation of Relative Pace and Power with Respect to Emotions
A scatter plot of power vs relative pace of the audio clips was analysed, and it was observed that the 'disgust' emotion was skewed towards the low-pace side while the 'surprise' emotion was skewed towards the high-pace side. As mentioned before, anger and fear occupy the high-power space while sadness and neutral occupy the low-power space, with both scattered pace-wise. Only the RAVDESS dataset was used for this plot, because it contains only two sentences of equal length spoken in different emotions, so the lexical features do not vary and the relative pace can be reliably calculated.
Figure 3: Scatter plot of power vs relative pace of audio clips
7. Modelling Solution Pipeline
The solution pipeline for this study is depicted in the schematic in Fig. 4. The raw signal is the input, which is processed as shown. First, the 2D features were extracted from the datasets and converted into 1D form by taking row means. A measure of noise was added to the raw audio of 4 of our 5 datasets (all except CREMA-D, the others being cleaner studio recordings); features were then extracted from those noisy files and the dataset was augmented with them. After feature extraction we applied various ML algorithms, namely SVM, XGBoost, CNN-1D (shallow) and CNN-1D (deep) on the 1D data frame and CNN-2D on the 2D tensor. As some of the models were overfitting the data, and considering the large number of features (181 in 1D), we tried dimensionality reduction to check the overfitting and trained the models again.
Figure 4: Schematic of solution pipeline
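The noise augmentation step described above can be sketched as follows (additive white noise; the noise level is an illustrative assumption):

def add_noise(y, noise_factor=0.005):
    # Add white Gaussian noise scaled to a small fraction of the signal range
    noise = np.random.normal(0.0, 1.0, size=len(y))
    return y + noise_factor * noise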
Selection of Train and Test Data
We chose customized split logic for the various ML models. For the SVM and XGBoost models the data was simply split into train and test sets in the ratio 80:20 and validated using 5-fold cross-validation. For the CNNs, both 1D and 2D, train-test splitting was applied twice: first the data was split in a 90:10 ratio, the 10% forming the test set, and the remaining 90% was again split 80:20 into train and validation sets.
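A sketch of this nested split with scikit-learn (variable names are illustrative):

from sklearn.model_selection import train_test_split

# First split: 90:10, with the 10% held out as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# Second split: 80:20 of the remainder into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.20, random_state=0)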
CNN Model Architectures
CNN-1D (shallow)
This model consisted of 1 convolution layer of 64 channels with same padding, followed by a dense layer and the output layer.
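A Keras sketch consistent with that description (the kernel size, dense-layer width and number of output classes are assumptions not stated in the article):

from tensorflow import keras
from tensorflow.keras import layers

shallow_1d = keras.Sequential([
    layers.Conv1D(64, kernel_size=5, padding="same", activation="relu",
                  input_shape=(180, 1)),    # the 1D feature vector as a single-channel sequence
    layers.Flatten(),
    layers.Dense(64, activation="relu"),    # dense layer (width assumed)
    layers.Dense(7, activation="softmax"),  # one unit per emotion class (count assumed)
])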
CNN-2D (deep)
This model was constructed in a format similar to VGG-16, but the last 2 blocks of 3 convolution layers were removed to reduce complexity. The model had the following architecture (a Keras sketch follows the list):
2 convolution layers of 64 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
2 convolution layers of 128 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
3 convolution layers of 256 channels, 3×3 kernel size and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
Each convolution layer had the ‘relu’ activation function.
After flattening, two dense layers of 512 units each were added and dropout layers of 0.1 and 0.2 were added after each dense layer.
Finally, the output layer was added with a ‘softmax’ activation function.
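A Keras sketch of this truncated-VGG architecture, following the layer list above (the input shape and number of output classes are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

deep_2d = keras.Sequential([
    # Block 1: 2 x Conv(64) + max-pool
    layers.Conv2D(64, (3, 3), padding="same", activation="relu", input_shape=(128, 157, 1)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Block 2: 2 x Conv(128) + max-pool
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Block 3: 3 x Conv(256) + max-pool
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Classifier head: two dense layers of 512 units with dropout after each
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(7, activation="softmax"),  # one unit per emotion class (count assumed)
])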
Model Results Comparison
The results are based on the accuracy metric, which compares the predicted values with the actual values. A confusion matrix is created, consisting of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). From the confusion matrix, accuracy is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
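A sketch of this computation with scikit-learn (the label arrays are illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]   # actual labels (toy values)
y_pred = [1, 0, 0, 1, 0]   # predicted labels (toy values)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary case
accuracy = (tp + tn) / (tp + tn + fp + fn)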
The models were trained on the training data and tested on the test data for different numbers of epochs: 50, 100, 150 and 200. The accuracies were compared among all the models, viz. SVM, XGBoost and Convolutional Neural Networks (shallow and deep), for the 1D and 2D features. Fig. 5 shows the comparative performance across the different models.
Figure 5: Comparative results from different models

We find from Fig. 5 that although the CNN-1D Deep model gave the best accuracy on the test set, the CNN-1D Deep and CNN-2D models are clearly overfitting the dataset, with training accuracies of 99% and 98.38% respectively against much lower test and validation accuracies. On the other hand, CNN-1D Shallow gave much better results on account of being more stable, with its train, validation and test accuracies closer to each other, though its test accuracy was a little lower than that of CNN-1D Deep.
Dimensionality Reduction Approach
To rectify the overfitting of the models we used a dimensionality reduction approach. PCA was employed for dimensionality reduction of the 1D features, and the dimensions were reduced from 180 to 120 with an explained variance of 98.3%. Dimensionality reduction made the models slightly less accurate but reduced the training time; however, it did not do much to reduce overfitting in the deep learning models. From this we deduced that our dataset is simply not big enough for a complex model to perform well, and realised the solution was limited by the lack of a larger data volume. Fig. 6 summarizes the results for the different models after dimensionality reduction.
Figure 6: Comparative results from different models after PCA-based dimensionality reduction
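A scikit-learn sketch of the reduction described above (the component count matches the text; variable names are illustrative):

from sklearn.decomposition import PCA

pca = PCA(n_components=120)
X_train_reduced = pca.fit_transform(X_train)   # fit on training data only
X_test_reduced = pca.transform(X_test)
print(pca.explained_variance_ratio_.sum())     # ~0.983 per the text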
Insights from Testing User Recordings
We tested the developed models on user recordings; from the test results we have the following observations:
An ensemble of CNN-2D and CNN-1D (shallow and deep) based on soft voting gave the best results on user recordings.
The model often got confused between anger and disgust.
The model also got confused among low energy emotions which are sadness, boredom and neutral.
If one or two words are spoken in a higher volume than the others, especially at the start or end of a sentence, the clip is almost always classified as fear or surprise.
The model seldom classifies an emotion as happy.
The model isn’t too noise sensitive, meaning it doesn’t falter as long as background noise is not too high.
Grouping Similar Emotions
Since the model was confusing similar emotions like anger-disgust and sad-bored, we tried combining those labels and training the model on 6 classes: neutral, sadness/boredom, happy, anger/disgust, surprise and fear. The accuracies certainly improved on reducing the number of classes, but this introduced another problem of class imbalance: after combining anger-disgust and sad-boredom, the model developed a high bias towards anger-disgust, probably because the number of anger-disgust instances became disproportionately larger than those of the other labels. So it was decided to stick with the older model.
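The label merging can be sketched with pandas as follows (the data frame and column names are illustrative assumptions):

import pandas as pd

merge_map = {
    "anger": "anger/disgust", "disgust": "anger/disgust",
    "sadness": "sadness/boredom", "boredom": "sadness/boredom",
}
# Replace each original label with its merged class, leaving the rest unchanged
df["emotion"] = df["emotion"].replace(merge_map)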
Final Prediction Pipeline
The final prediction pipeline is depicted schematically in Fig. 7 below.
Figure 7: Final prediction pipeline
8. Conclusions and Future Scope
Through this project, we showed how machine learning can be leveraged to obtain the underlying emotion from speech audio data, along with some insights into how humans express emotion through voice. Such a system can be employed in a variety of settings, such as call centres (for complaints or marketing), voice-based virtual assistants or chatbots, linguistic research, etc. A few possible steps that could make the models more robust and accurate are the following:
An accurate measure of speaking pace could be explored, to check whether it resolves some of the model's deficiencies.
Figuring out a way to remove random silences from the audio clips.
Exploring other acoustic features of sound data to check their applicability in the domain of speech emotion recognition. These features could simply be some proposed extensions of MFCC like RAS-MFCC or they could be other features entirely like LPCC, PLP or Harmonic cepstrum.
Following a lexical-features-based approach towards SER and using an ensemble of the lexical and acoustic models. This should improve the accuracy of the system, because in some cases the expression of emotion is contextual rather than vocal.
Adding more data volume either by other augmentation techniques like time-shifting or speeding up/slowing down the audio or simply finding more annotated audio clips.
9. Acknowledgements
The authors wish to express their gratitude to Paulami Das, Head of Data Science CoE @ Brillio, and Anish Roychowdhury, Senior Analytics Leader @ Brillio, for their mentoring and guidance in shaping this study.
Authors
Prateek Kumar Pandey
Prateek Kumar Pandey is an undergraduate at IIT Kharagpur pursuing a Dual Degree (BTech + MTech) in the Department of Civil Engineering and is currently in his fourth year. He worked as a Data Science Intern at Brillio on a Speech Emotion Recognition project. He is proficient in Python programming and has previously worked on projects in NLP and time series forecasting.

Anurag Gupta
Anurag Gupta is a pre-final-year undergraduate enrolled in the Dual Degree (BTech + MTech) course in the Department of Civil Engineering, IIT Kharagpur. He is interested in Data Science, Natural Language Processing and predictive data modelling, and in solving real-world problems with these technologies, drawing insights from both structured and unstructured data. He previously worked at Brillio as a Data Science Intern on a machine learning project on Speech Emotion Recognition, and at IISc Bangalore on a project forecasting transit ridership of BMTC buses.

Mohit Wadhwa
Mohit Wadhwa is currently a Data Scientist at Brillio with over 3 years of industry experience. Prior to this, he was part of the AI solutions team at Infosys. He has worked on projects in machine learning, deep learning and computer vision, including code optimization with genetic algorithms and an object-detection solution based on YOLO demonstrating how customers in a retail store can be automatically billed for items picked up from shelves; this solution was presented at NRF (National Retail Federation) in New York. He has experience with demand forecasting for SKUs and has delivered other client solutions involving deep learning algorithms such as CNNs, RNNs and LSTMs. He is proficient in Python and holds a BTech in Computer Engineering from Punjabi University, Patiala.
Paragraph Segmentation Using Machine Learning
Introduction
Paragraph segmentation plays an important role in natural language processing (NLP), with practical applications such as text summarization, sentiment analysis, and topic modeling. Text summarization algorithms, for example, frequently rely on paragraph segmentation to find the most important parts of a document to summarize. Similarly, sentiment analysis algorithms may require paragraph segmentation in order to grasp the context and tone of each paragraph independently.
Paragraph Segmentation
The technique of splitting a given text into distinct paragraphs based on structural and linguistic criteria is known as paragraph segmentation. It is used to improve the readability and organization of large documents such as articles, novels, or reports. With paragraph segmentation, readers can traverse the text more easily, find the information they need more quickly, and absorb the content more effectively.
Depending on the individual properties of the text and the purposes of the segmentation, there are numerous ways to divide it into paragraphs.
1. Text indentation
Indentation refers to the space at the beginning of a line of text that is commonly used to signify the start of a new paragraph in many writing styles. Indentation helps readers visually distinguish where one paragraph ends and another begins. It can also be used as a feature for automated paragraph segmentation: a model can be trained to recognize where paragraphs begin and end by analyzing indentation patterns, which is valuable in a variety of text analysis applications.
2. Punctuation marks
Punctuation marks such as periods, question marks, and exclamation points are widely used to signal the end of one paragraph and the start of a new one. Used appropriately in written communication, they help clarify the material and make the text easier to read and understand.
3. Text length
A paragraph is a sequence of connected sentences that address a particular topic or issue, so text length can be used to split a text into paragraphs. A huge block of content, for example, can be divided into smaller paragraphs: if several sentences in a sequence discuss the same topic, they can be grouped into a paragraph, and if the topic or notion changes, a new paragraph can begin to alert the reader. Ultimately, the objective of paragraphs is to arrange and structure written content in an easy-to-read and understandable manner.
4. Text coherence
Paragraphs are an important part of writing, since they help organize ideas and thoughts in a clear and logical way: a cohesive paragraph contains sentences that are all connected to and contribute to a central concept. Text coherence refers to the flow of ideas and the logical links between sentences that allow the reader to discern the beginning and end of a paragraph. When reading a text, a shift in topic or the introduction of a new concept identifies the beginning of a new paragraph, while a concluding phrase or a transition to a new concept can signal its end. Coherence is thus an important cue for locating paragraph boundaries and interpreting the writer's intended meaning.
Paragraph segmentation using machine learning
Machine learning algorithms have been employed in recent years to automate the job of paragraph segmentation, attaining remarkable levels of accuracy and speed. The algorithms are trained on a vast corpus of text data manually annotated with paragraph boundaries; this training data is used to learn the patterns and characteristics that distinguish one paragraph from another.
Paragraph segmentation may be accomplished using supervised learning methods. Supervised learning algorithms learn from labeled data, i.e., data that has already been annotated with the correct answers. For paragraph segmentation, the labeled data would consist of text that has been split into paragraphs, with each paragraph carrying a unique ID.
Two supervised learning approaches for paragraph segmentation are support vector machines (SVMs) and decision trees. These algorithms employ labeled data to learn patterns and rules that may be used to predict the boundaries of paragraphs in new texts. When given new, unlabeled text, the algorithms may utilize their previously acquired patterns and rules to forecast where one paragraph ends and another begins. This method is especially effective for evaluating vast amounts of text where manual paragraph segmentation would be impossible or time-consuming. Overall, supervised learning algorithms provide a reliable and efficient way for automating paragraph segmentation in a wide range of applications.
For paragraph segmentation, unsupervised learning methods can be employed. Unlike supervised learning algorithms, which require labeled training data, unsupervised learning algorithms may separate paragraphs without any prior knowledge of how the text should be split. Unsupervised learning algorithms use statistical analysis and clustering techniques to detect similar patterns in text. Clustering algorithms, for example, can group together phrases with similar qualities, such as lexicon or grammar, and identify them as belonging to the same paragraph. Topic modeling is another unsupervised learning approach that may be used to discover clusters of linked phrases that may constitute a paragraph. These algorithms do not rely on predetermined rules or patterns, but rather on statistical approaches to find significant patterns and groups in text. Unsupervised learning methods are very beneficial for text segmentation when the structure or formatting of the text is uneven or uncertain. Overall, unsupervised learning algorithms provide a versatile and powerful way for automating paragraph segmentation in a range of applications.
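As an illustration of the clustering idea, the sketch below groups sentences by TF-IDF similarity with k-means; since the grouping is by topic similarity rather than position, it is only a rough proxy for true paragraph segmentation:

import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt', quiet=True)

text = ("Cats are popular pets. They sleep for most of the day. "
        "The stock market fell sharply today. Investors were worried about inflation.")
sentences = nltk.sent_tokenize(text)

X = TfidfVectorizer().fit_transform(sentences)
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
for label, sentence in zip(clusters, sentences):
    print(label, sentence)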
The text file used below contains the paragraph above beginning with 'For paragraph segmentation, ……'.
Python Program

import nltk
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Load the data
with open('/content/data.txt', 'r') as file:
    data = file.read()

# Tokenize the data into sentences
sentences = nltk.sent_tokenize(data)

# Label the sentences as belonging to the same or a new paragraph
# (placeholder labels: 1 marks a paragraph-initial sentence; in practice
# these labels come from manually annotated data)
labels = [1] + [0] * (len(sentences) - 1)

# Create a feature matrix using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Create a support vector machine classifier
clf = make_pipeline(SVC(kernel='linear'))

# Train the classifier on the labeled data
clf.fit(X, labels)

# Use the classifier to predict the paragraph boundaries in new text
new_text = ("This is a new paragraph. It is separate from the previous one. "
            "This is the second sentence of the second paragraph.")
new_sentences = nltk.sent_tokenize(new_text)
new_X = vectorizer.transform(new_sentences)
new_labels = clf.predict(new_X)

# Print the predicted paragraph boundaries
for i in range(len(new_sentences)):
    if new_labels[i] == 1:
        print(new_sentences[i])

Output

This is a new paragraph. It is separate from the previous one. This is the second sentence of the second paragraph.

Conclusion
Paragraph segmentation is an important job in natural language processing that can greatly enhance the readability and structure of long texts. Machine learning algorithms have made great progress in this domain, enabling precise and efficient segmentation based on structural features and statistical analysis. However, further study is needed to improve these models' performance on more complex and diverse texts, and to investigate novel approaches to paragraph segmentation based on deep learning and other sophisticated techniques.
5 Best Desktops For Machine Learning & Deep Learning
Machine learning and deep learning require powerful computers. Amazon has a fair amount of choices for people and budgets of all kinds.
This article will help you decide what to buy by showing you a list of different computers with their pros and cons.
Deep learning and machine learning are techniques that try to replicate the brain's neural networks in machines. They introduce self-learning methods that teach an AI to behave under certain conditions: the AI has to perform a certain task and learn from its own mistakes in order to fulfill it.
If you want to get into machine learning and deep learning you might need to take a look at your current computer. In other words, even if your desktop can perform everyday tasks with ease, that doesn’t mean it will have the computing power to run machine learning & deep learning programs.
The GPU and CPU are crucial. You need a graphics card with plenty of memory, and your processor must have many cores. In addition, your RAM needs to be ample as well, around 8 GB or more.
Because these processes run for long periods of time, the computer you're looking for needs to be able to run them for as long as possible without problems. Consequently, a powerful cooler is required to stop your components from overheating and causing thermal throttling.
What are the best desktops for machine & deep learning?
RTX 2080 Super has over 8GB of dedicated memory
The Intel Core i9-9900K has an ideal 8 cores and can be turbo boosted
Over 32GB of HyperX DDR4 3000mhz RAM memory
Liquid cooling keeps your temps as low as possible even during intensive use
The product is expensive, with a price starting from $2,000
HP Obelisk Omen is the most powerful item on our list. Geared with the latest hardware, such as the 9th-generation Intel Core i9-9900K processor and the hyper-realistic NVIDIA GeForce RTX 2080 Super, it is perfect for machine learning and deep learning.
If you want speed, power, customization, and the best quality products out there, this is the choice for you.
The GTX 1660 TI offers 6GB of dedicated memory
The i5-9400f has 6 cores
16Gb of DDR4 memory
The processor does not support overclocking
Another midrange choice is the HP Pavilion. It is close in performance to the Skytech Shiva while also being a bit cheaper.
The GeForce GTX 1660 TI is just about 10% weaker than the aforementioned RTX 2060, but it is less expensive. In addition, the i5-9400f is still capable of deep learning & machine learning processes.
The Ryzen 5 2600 offers 6 cores
The Video Card has 6GB of DDR6 memory
3x RGB Ring fans for maximum airflow
80 Plus certified 500 Watt power supply
It only has 8GB of RAM
Equipped with a Ryzen 5 2600 processor and a GTX 1660 TI graphics card, it is capable enough of running parsing data algorithms. Furthermore, both the GPU and the CPU can be overclocked.
The Intel Core i7-9700K has an ideal 8 cores
The NVIDIA RTX 2070 Super comes with 8GB of dedicated memory
Liquid cooling keeps the temperatures low
16GB of DDR4 RAM is enough for deep learning & machine learning
You need to update your firmware if you want to get the proper CPU speed boosts working
CyberpowerPC Gamer Supreme is our next recommendation on the list. Coming close to the HP Omen mentioned above, this desktop trades a bit of power for a lower price.
The Intel Core i7-9700k and GeForce RTX 2070 Super still offer cutting-edge performance for more affordable prices. Moreover, this desktop can also be overclocked with no problem.
The Ryzen 5 2600 processor has 6 cores
It has 16GB of DDR4 RAM
The Graphics Card has 6GB of dedicated memory
Equipped with 3x RGB Ring Fans ensuring good airflow
Lacks a USB Type-C port
Skytech Shiva is our first budget-oriented choice. Significantly cheaper than the other two desktops from above, yet still holding strong in terms of performance, this computer is perfect for those who want some balance between price and power.
This product is geared with an AMD processor, namely the Ryzen 5 2600, and an RTX 2060 non-Super version. The CPU is about 20% slower than an i9-9900K, but it is much cheaper. Moreover, both the CPU and GPU can be easily overclocked.
This list covers all you need to buy a brand new desktop for deep learning & machine learning.
What To Know About Machine Learning
Machine learning is a discipline of computer science that studies the design and analysis of algorithms that can learn from data and then make predictions. It is used in many applications, such as self-driving cars, web search, and speech recognition. Classic programming techniques assume that you understand the problem clearly and, with that knowledge, can write a set of explicit instructions to solve that specific problem or perform a specific task. Examples of such problems and tasks are numerous; in fact, the majority of programs are written with very clear expectations of input and output and a well-defined algorithm for the procedure, such as sorting numbers, removing a particular string from a text file, or copying a file.
But, there’s a specific class of problems that conventional problem-solving or programming methods won’t be of much use. By way of instance, let us assume you have about 50,000 files which need to be categorized into particular categories like sports, business and entertainment — without going through every one of these. Or consider another example of looking for a specific thing in tens of thousands of pictures. In the latter scenario, the thing could be photographed in another orientation, or under different lighting conditions. How do you tell which pictures include the thing? Another very helpful case in point is of constructing an internet payment gateway and needing to stop fraudulent transactions. 1 method is to identify indications of possibly fraudulent transactions and triggering alarms before the trade is complete. Just how would you call this correctly without creating unnecessary alerts?
As you can readily imagine, it is impossible to write exact algorithms for these problems. What we can do is build systems that work like a human expert. A doctor can tell what disease a particular patient has by looking at the test reports. While it is not feasible for him to make a correct diagnosis 100 percent of the time, he will be right most of the time. Nobody programmed the doctor; he learned these things by study and by experience.
Machine learning and mathematics
While the example of the doctor may be somewhat removed from real machine learning, its core idea holds: machines can learn from big data sets, and they improve as they gain experience. As data contains randomness and uncertainty, we have to employ concepts from probability and statistics. In fact, machine learning algorithms depend so heavily on concepts from statistics that many people refer to machine learning as statistical learning. Apart from statistics, another important branch of mathematics very much in use is linear algebra: concepts of matrices, solutions to systems of equations and optimisation algorithms all play significant roles in machine learning.
Machine learning and Big Data
There are essentially two kinds of machine learning: supervised learning and unsupervised learning. Supervised learning works on labelled data, as in the following example:
Let us presume that we must forecast whether it will rain in the evening, using wind and temperature data. Whether it actually rained is stored in a separate column, which becomes the label.
The algorithms that learn from such data are known as supervised learning algorithms. Though some data can be extracted or generated automatically, such as from a system log, it often has to be labelled manually, which can raise the cost of data acquisition.
We can also classify machine learning algorithms using another logic: regression algorithms and classification algorithms. Regression algorithms are machine learning algorithms that predict a quantity, such as the next day's temperature or the stock market's closing index. Classification algorithms are those that classify an input, such as whether it will rain or not, whether the stock market will close positive or negative, or whether a patient has disease x, disease y, or no disease.
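A minimal scikit-learn sketch of the distinction (toy data; both model choices are illustrative):

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[10, 5], [20, 3], [30, 8], [25, 2]]   # e.g. temperature and wind speed (toy values)

# Regression: predict a quantity, e.g. the next day's temperature
reg = LinearRegression().fit(X, [12, 22, 29, 27])
print(reg.predict([[22, 4]]))

# Classification: predict a category, e.g. rain (1) or no rain (0)
clf = LogisticRegression().fit(X, [0, 1, 1, 0])
print(clf.predict([[22, 4]]))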
It is important to understand and appreciate that machine learning algorithms are essentially mathematical algorithms, and we can implement them in any language we like. The one I like and use a great deal is the R language. There are many popular machine learning modules and packages in various languages. Weka, written in Java, is powerful and extremely popular, though it has some licence issues for commercial use. Scikit-learn is a favourite among Python programmers. One may also choose the Orange machine learning toolbox, accessible from Python; though its development appears to have slowed, it is a decent library worth trying, with enough documentation and tutorials to get started easily. I would recommend that advanced users also explore deep learning algorithms.
Mental Health Prediction Using Machine Learning
[Plot: density distribution of the Age column]
Inference: The above plot shows the density distribution of the Age column. We can see that density is highest between ages 10 and 20 in our dataset.
j = sns.FacetGrid(train_df, col='treatment', size=5)
j = j.map(sns.distplot, "Age")

Inference: Treatment 0 means treatment is not necessary and 1 means it is. The first plot shows that treatment is not necessary from age 0 to 10 and is needed after age 15.
plt.figure(figsize=(12,8))
labels = labelDict['label_Gender']
j = sns.countplot(x="treatment", data=train_df)
j.set_xticklabels(labels)
plt.title('Total Distribution by treated or not')
Inference: Here we can see that more males are treated as compared to females in the dataset.
o = labelDict['label_age_range']
j = sns.factorplot(x="age_range", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Age')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)
plt.show()

Inference: This bar plot shows the mental health condition of females, males and transgender respondents across age groups. We can see that from ages 66 to 100 the probability of a mental health condition is much higher for females than for the other genders, and from ages 21 to 64 it is much higher for transgender respondents than for males.
o = labelDict['label_family_history']
j = sns.factorplot(x="family_history", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Family History')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)
plt.show()

Inference: In the dataset, the probability of a mental health condition is high for those who have a family history of mental health problems. Here we can see that the probability for transgender respondents with such a family history is almost 90%.

o = labelDict['label_care_options']
j = sns.factorplot(x="care_options", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Care options')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)
plt.show()
Inference: This barplot shows treatment probability with respect to care options. In the dataset, the probability of a mental health condition is higher for those who do not have care options. Here we can see that the probability for transgender respondents without care options is very high, and it is lower for those who do have care options.
o = labelDict['label_benefits']
j = sns.factorplot(x="benefits", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Benefits')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)
plt.show()

Inference: This barplot shows the probability of a mental health condition with respect to benefits. In the dataset, the probability is higher for those who do not receive any benefits: for transgender respondents without benefits it is very high, and it is lower for those who do have benefits.
o = labelDict['label_work_interfere']
j = sns.factorplot(x="work_interfere", y="treatment", hue="Gender", data=train_df, kind="bar", ci=None, size=5, aspect=2, legend_out=True)
j.set_xticklabels(o)
plt.title('Probability of mental health condition')
plt.ylabel('Probability x 100')
plt.xlabel('Work interfere')
new_labels = labelDict['label_Gender']
for t, l in zip(j._legend.texts, new_labels):
    t.set_text(l)
j.fig.subplots_adjust(top=0.9, right=0.8)
plt.show()

Inference: This barplot shows the probability of a mental health condition with respect to work interference. The probability is much lower for those who experience no work interference, and higher for those whose work is interfered with, even if only rarely.
Scaling and Fitting

# Scaling Age
scaler = MinMaxScaler()
train_df['Age'] = scaler.fit_transform(train_df[['Age']])
train_df.head()

# define X and y
feature_cols1 = ['Age', 'Gender', 'family_history', 'benefits', 'care_options', 'anonymity', 'leave', 'work_interfere']
X = train_df[feature_cols1]
y = train_df.treatment

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)

# Create dictionaries for final graph
# Use: methodDict['Stacking'] = accuracy_score
methodDict = {}
rmseDict = {}

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

labels = []
for f in range(X.shape[1]):
    labels.append(feature_cols1[f])

plt.figure(figsize=(12,8))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices])
plt.xticks(range(X.shape[1]), labels, rotation='vertical')
plt.xlim([-1, X.shape[1]])
plt.show()

Tuning

def evalClassModel(model, y_test1, y_pred_class, plot=False):
    # Classification accuracy: percentage of correct predictions
    print('Accuracy:', metrics.accuracy_score(y_test1, y_pred_class))
    print('Null accuracy:\n', y_test1.value_counts())
    # calculate the percentage of ones
    print('Percentage of ones:', y_test1.mean())
    # calculate the percentage of zeros
    print('Percentage of zeros:', 1 - y_test1.mean())
    print('True:', y_test1.values[0:25])
    print('Pred:', y_pred_class[0:25])

    # Confusion matrix: [row, column]
    confusion = metrics.confusion_matrix(y_test1, y_pred_class)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]

    # visualize the confusion matrix
    sns.heatmap(confusion, annot=True, fmt="d")
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

    accuracy = metrics.accuracy_score(y_test1, y_pred_class)
    print('Classification Accuracy:', accuracy)
    print('Classification Error:', 1 - metrics.accuracy_score(y_test1, y_pred_class))
    fp_rate = FP / float(TN + FP)
    print('False Positive Rate:', fp_rate)
    print('Precision:', metrics.precision_score(y_test1, y_pred_class))
    print('AUC Score:', metrics.roc_auc_score(y_test1, y_pred_class))

    # calculate cross-validated AUC
    print('Cross-validated AUC:', cross_val_score(model, X, y, cv=10, scoring='roc_auc').mean())

    print('First 10 predicted responses:\n', model.predict(X_test1)[0:10])
    print('First 10 predicted probabilities of class membership:\n', model.predict_proba(X_test1)[0:10])
    y_pred_prob = model.predict_proba(X_test1)[:, 1]
    if plot == True:
        # histogram of predicted probabilities
        plt.rcParams['font.size'] = 12
        plt.hist(y_pred_prob, bins=8)
        plt.xlim(0, 1)
        plt.title('Histogram of predicted probabilities')
        plt.xlabel('Predicted probability of treatment')
        plt.ylabel('Frequency')

    y_pred_prob = y_pred_prob.reshape(-1, 1)
    y_pred_class = binarize(y_pred_prob, threshold=0.3)[0]
    print('First 10 predicted probabilities:\n', y_pred_prob[0:10])

    roc_auc = metrics.roc_auc_score(y_test1, y_pred_prob)
    fpr, tpr, thresholds = metrics.roc_curve(y_test1, y_pred_prob)
    if plot == True:
        plt.figure()
        plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
        plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.0])
        plt.rcParams['font.size'] = 12
        plt.title('ROC curve for treatment classifier')
        plt.xlabel('False Positive Rate (1 - Specificity)')
        plt.ylabel('True Positive Rate (Sensitivity)')
        plt.legend(loc="lower right")
        plt.show()

    def evaluate_threshold(threshold):
        # classify as positive when the predicted probability exceeds the threshold
        predict_mine = np.where(y_pred_prob.ravel() > threshold, 1, 0)
        confusion = metrics.confusion_matrix(y_test1, predict_mine)
        print(confusion)

    return accuracy

Tuning with cross-validation score
def tuningCV(knn):
    k_range = list(range(1, 31))
    k_scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
        k_scores.append(scores.mean())
    print(k_scores)
    plt.plot(k_range, k_scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validated Accuracy')
    plt.show()

Tuning with GridSearchCV
def tuningGridSearch(knn):
    k_range = list(range(1, 31))
    print(k_range)
    param_grid = dict(n_neighbors=k_range)
    print(param_grid)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X, y)

    print(grid.grid_scores_[0].parameters)
    print(grid.grid_scores_[0].cv_validation_scores)
    print(grid.grid_scores_[0].mean_validation_score)
    grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
    print(grid_mean_scores)

    # plot the results
    plt.plot(k_range, grid_mean_scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validated Accuracy')
    plt.show()

    # examine the best model
    print('GridSearch best score', grid.best_score_)
    print('GridSearch best params', grid.best_params_)
    print('GridSearch best estimator', grid.best_estimator_)

Tuning with RandomizedSearchCV
def tuningRandomizedSearchCV(model, param_dist):
    rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
    rand1.fit(X, y)
    print('Rand. Best Score: ', rand1.best_score_)
    print('Rand. Best Params: ', rand1.best_params_)

    best_scores = []
    for _ in range(20):
        rand1 = RandomizedSearchCV(model, param_dist, cv=10, scoring='accuracy', n_iter=10)
        rand1.fit(X, y)
        best_scores.append(round(rand1.best_score_, 3))
    print(best_scores)

Tuning by searching multiple parameters simultaneously
def tuningMultParam(knn):
    k_range = list(range(1, 31))
    weight_options = ['uniform', 'distance']
    param_grid = dict(n_neighbors=k_range, weights=weight_options)
    print(param_grid)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X, y)
    print(grid.grid_scores_)
    print('Multiparam. Best Score: ', grid.best_score_)
    print('Multiparam. Best Params: ', grid.best_params_)

Evaluating Models

Logistic Regression
def logisticRegression():
    logreg = LogisticRegression()
    logreg.fit(X_train1, y_train1)
    y_pred_class = logreg.predict(X_test1)
    accuracy_score = evalClassModel(logreg, y_test1, y_pred_class, True)
    # Data for final graph
    methodDict['Log. Regression'] = accuracy_score * 100

logisticRegression()

True value: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0]
Predicted value: [1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]
[[0.09193053 0.90806947] [0.95991564 0.04008436] [0.96547467 0.03452533] [0.78757121 0.21242879] [0.38959922 0.61040078] [0.05264207 0.94735793] [0.75035574 0.24964426] [0.19065116 0.80934884] [0.61612081 0.38387919] [0.47699963 0.52300037]] [[0.90806947] [0.04008436] [0.03452533] [0.21242879] [0.61040078] [0.94735793] [0.24964426] [0.80934884] [0.38387919] [0.52300037]]
[[142 49] [ 28 159]]
KNeighbors Classifier
def Knn():
    # Calculating the best parameters
    knn = KNeighborsClassifier(n_neighbors=5)
    k_range = list(range(1, 31))
    weight_options = ['uniform', 'distance']
    param_dist = dict(n_neighbors=k_range, weights=weight_options)
    tuningRandomizedSearchCV(knn, param_dist)
    knn = KNeighborsClassifier(n_neighbors=27, weights='uniform')
    knn.fit(X_train1, y_train1)
    y_pred_class = knn.predict(X_test1)
    accuracy_score = evalClassModel(knn, y_test1, y_pred_class, True)
    # Data for final graph
    methodDict['K-Neighbors'] = accuracy_score * 100

Knn()

[0.816, 0.812, 0.821, 0.823, 0.823, 0.818, 0.821, 0.821, 0.815, 0.812, 0.819, 0.811, 0.819, 0.818, 0.82, 0.815, 0.803, 0.821, 0.823, 0.815]
True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0]
Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]
[[0.33333333 0.66666667] [1. 0. ] [1. 0. ] [0.66666667 0.33333333] [0.37037037 0.62962963] [0.03703704 0.96296296] [0.59259259 0.40740741] [0.37037037 0.62962963] [0.33333333 0.66666667] [0.33333333 0.66666667]] [[0.66666667] [0. ] [0. ] [0.33333333] [0.62962963] [0.96296296] [0.40740741] [0.62962963] [0.66666667] [0.66666667]]
[[135 56] [ 18 169]]
Decision Tree
def treeClassifier():
    # Calculating the best parameters
    tree1 = DecisionTreeClassifier()
    featuresSize = feature_cols1.__len__()
    param_dist = {"max_depth": [3, None],
                  "max_features": randint(1, featuresSize),
                  "min_samples_split": randint(2, 9),
                  "min_samples_leaf": randint(1, 9),
                  "criterion": ["gini", "entropy"]}
    tuningRandomizedSearchCV(tree1, param_dist)
    tree1 = DecisionTreeClassifier(max_depth=3, min_samples_split=8, max_features=6, criterion='entropy', min_samples_leaf=7)
    tree1.fit(X_train1, y_train1)
    y_pred_class = tree1.predict(X_test1)
    accuracy_score = evalClassModel(tree1, y_test1, y_pred_class, True)
    # Data for final graph
    methodDict['Decision Tree Classifier'] = accuracy_score * 100

treeClassifier()

[0.83, 0.827, 0.831, 0.829, 0.831, 0.83, 0.783, 0.831, 0.821, 0.831, 0.831, 0.831, 0.8, 0.79, 0.831, 0.831, 0.831, 0.829, 0.831, 0.831]
True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0]
Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]
[[0.18 0.82 ] [0.96534653 0.03465347] [0.96534653 0.03465347] [0.89473684 0.10526316] [0.36097561 0.63902439] [0.18 0.82 ] [0.89473684 0.10526316] [0.11320755 0.88679245] [0.36097561 0.63902439] [0.36097561 0.63902439]] [[0.82 ] [0.03465347] [0.03465347] [0.10526316] [0.63902439] [0.82 ] [0.10526316] [0.88679245] [0.63902439] [0.63902439]]
[[130 61] [ 12 175]]
Random Forests
def randomForest():
    # Calculating the best parameters
    forest1 = RandomForestClassifier(n_estimators=20)
    featuresSize = feature_cols1.__len__()
    param_dist = {"max_depth": [3, None],
                  "max_features": randint(1, featuresSize),
                  "min_samples_split": randint(2, 9),
                  "min_samples_leaf": randint(1, 9),
                  "criterion": ["gini", "entropy"]}
    tuningRandomizedSearchCV(forest1, param_dist)
    forest1 = RandomForestClassifier(max_depth=None, min_samples_leaf=8, min_samples_split=2, n_estimators=20, random_state=1)
    my_forest = forest1.fit(X_train1, y_train1)
    y_pred_class = my_forest.predict(X_test1)
    accuracy_score = evalClassModel(my_forest, y_test1, y_pred_class, True)
    # Data for final graph
    methodDict['Random Forest'] = accuracy_score * 100

randomForest()

[0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831, 0.837, 0.834, 0.831, 0.832, 0.831, 0.831, 0.831, 0.831]
True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0]
Pred val: [1 0 0 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]
[[0.2555794 0.7444206 ] [0.95069083 0.04930917] [0.93851009 0.06148991] [0.87096597 0.12903403] [0.40653554 0.59346446] [0.17282958 0.82717042] [0.89450448 0.10549552] [0.4065912 0.5934088 ] [0.20540631 0.79459369] [0.19337644 0.80662356]] [[0.7444206 ] [0.04930917] [0.06148991] [0.12903403] [0.59346446] [0.82717042] [0.10549552] [0.5934088 ] [0.79459369] [0.80662356]]
Boosting
def boosting():
    # Building and fitting
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=1)
    boost = AdaBoostClassifier(base_estimator=clf, n_estimators=500)
    boost.fit(X_train1, y_train1)
    y_pred_class = boost.predict(X_test1)
    accuracy_score = evalClassModel(boost, y_test1, y_pred_class, True)
    # Data for final graph
    methodDict['Boosting'] = accuracy_score * 100

boosting()

True val: [0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0]
Pred val: [1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0]
[[0.49924555 0.50075445] [0.50285507 0.49714493] [0.50291786 0.49708214] [0.50127788 0.49872212] [0.50013552 0.49986448] [0.49796157 0.50203843] [0.50046371 0.49953629] [0.49939483 0.50060517] [0.49921757 0.50078243] [0.49897133 0.50102867]] [[0.50075445] [0.49714493] [0.49708214] [0.49872212] [0.49986448] [0.50203843] [0.49953629] [0.50060517] [0.50078243] [0.50102867]]
Predicting with Neural Network

Create input function
%tensorflow_version 1.x
import tensorflow as tf
import argparse

TensorFlow 1.x selected.
batch_size = 100
train_steps = 1000

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)

def train_input_fn(features, labels, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.shuffle(1000).repeat().batch(batch_size)

def eval_input_fn(features, labels, batch_size):
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    dataset = dataset.batch(batch_size)
    # Return the dataset.
    return dataset

Define the feature columns
# Define TensorFlow feature columns
age = tf.feature_column.numeric_column("Age")
gender = tf.feature_column.numeric_column("Gender")
family_history = tf.feature_column.numeric_column("family_history")
benefits = tf.feature_column.numeric_column("benefits")
care_options = tf.feature_column.numeric_column("care_options")
anonymity = tf.feature_column.numeric_column("anonymity")
leave = tf.feature_column.numeric_column("leave")
work_interfere = tf.feature_column.numeric_column("work_interfere")
feature_columns = [age, gender, family_history, benefits, care_options, anonymity, leave, work_interfere]

Instantiate an Estimator
model = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                   hidden_units=[10, 10],
                                   optimizer=tf.train.ProximalAdagradOptimizer(
                                       learning_rate=0.1,
                                       l1_regularization_strength=0.001))

Train the model
model.train(input_fn=lambda: train_input_fn(X_train1, y_train1, batch_size), steps=train_steps)

Evaluate the model
# Evaluate the model.
eval_result = model.evaluate(input_fn=lambda: eval_input_fn(X_test1, y_test1, batch_size))
print('\nTest set accuracy: {accuracy:0.2f}\n'.format(**eval_result))

# Data for final graph
accuracy = eval_result['accuracy'] * 100
methodDict['Neural Network'] = accuracy

Test set accuracy: 0.80
Making predictions (inferring) from the trained model
predictions = list(model.predict(input_fn=lambda: eval_input_fn(X_train1, y_train1, batch_size=batch_size)))

# Generate predictions from the model
template = ('\nIndex: "{}", Prediction is "{}" ({:.1f}%), expected "{}"')

# Lists for the predictions dataframe
col1 = []
col2 = []
col3 = []

for idx, input, p in zip(X_train1.index, y_train1, predictions):
    v = p["class_ids"][0]
    class_id = p['class_ids'][0]
    probability = p['probabilities'][class_id]  # Probability
    # Adding to dataframe
    col1.append(idx)    # Index
    col2.append(v)      # Prediction
    col3.append(input)  # Expected
    # print(template.format(idx, v, 100 * probability, input))

results = pd.DataFrame({'index': col1, 'prediction': col2, 'expected': col3})
results.head()

Creating Predictions on the Test Set
# Generate predictions with the best methodology
clf = AdaBoostClassifier()
clf.fit(X, y)
dfTestPredictions = clf.predict(X_test1)

# Write predictions to csv file
results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions})
# Save to file
results.to_csv('results.csv', index=False)
results.head()

Submission

results = pd.DataFrame({'Index': X_test1.index, 'Treatment': dfTestPredictions})
results

The final predictions consist of 0s and 1s: 0 means the person does not need mental health treatment, and 1 means the person does.
Conclusion

After using all these employee records, we were able to build various machine learning models. Of all the models, AdaBoost achieved 81.75% accuracy with an AUC of 0.8185, and along the way we were able to draw some insights from the data via data analysis and visualization.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Unraveling Data Anomalies In Machine Learning
Introduction
In the realm of machine learning, the veracity of data holds utmost significance in the triumph of models. Inadequate data quality can give rise to erroneous predictions, unreliable insights, and degraded overall performance. Grasping the significance of data quality, and becoming familiar with techniques to unearth and tackle data anomalies, is important for constructing robust and reliable machine learning models.
This article presents a comprehensive overview of data anomalies, their impact on machine learning, and the techniques employed to address them. By way of this article, readers will come to understand the pivotal role data quality plays in machine learning and gain practical expertise in detecting and mitigating data anomalies effectively.
This article was published as a part of the Data Science Blogathon.
What Encompasses Data Anomalies?

Data anomalies, otherwise known as data quality issues or irregularities, allude to any unanticipated or aberrant characteristics present within a dataset.
These anomalies may arise due to diverse factors, such as human fallibility, measurement inaccuracies, data corruption, or system malfunctions.
Identifying and rectifying data anomalies is of critical importance, as doing so ensures the reliability and accuracy of machine learning models.
An Assortment of Data Anomalies

Data anomalies can be present in sundry forms. Prominent types of data anomalies include:
Missing Data: Denoting instances where specific data points or attributes remain unrecorded or incomplete.
Duplicate Data: Signifying the existence of identical or highly similar data entries within the dataset.
Outliers: Pertaining to data points that diverge significantly from the expected or normal range.
Noise: Entailing random variations or errors in data that can impede analysis and modeling.
Categorical Variables: Encompassing inconsistent or ambiguous values within categorical data.
Detecting and addressing these anomalies assumes utmost importance in upholding the integrity and reliability of data employed in machine learning models.
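Before tackling each type individually, it can help to profile a dataset for all of them at once. The following is a minimal sketch, assuming a Pandas DataFrame loaded as in the examples below and a hypothetical numeric column "age"; it is illustrative rather than part of the original article's code.

import pandas as pd
import numpy as np

# Illustrative profiling sketch; "dataset.csv" and the "age" column are assumptions.
data = pd.read_csv("dataset.csv")

# Missing data: count of null values per column
print(data.isnull().sum())

# Duplicate data: number of fully duplicated rows
print(data.duplicated().sum())

# Outliers: simple z-score check on a numeric column
z = (data["age"] - data["age"].mean()) / data["age"].std()
print((z.abs() > 3).sum())

# Categorical variables: inspect distinct values for inconsistencies
for col in data.select_dtypes(include="object").columns:
    print(col, data[col].unique())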
Unearthing and Navigating Missing Data

Missing data can exert a notable impact on the accuracy and reliability of machine learning models. Various techniques exist to handle missing data, such as:
import pandas as pd

# Dataset ingestion
data = pd.read_csv("dataset.csv")

# Identifying missing values
missing_values = data.isnull().sum()

# Option 1: eliminating rows with missing values
data = data.dropna()

# Option 2: substituting missing values with the mean/median
data["age"].fillna(data["age"].mean(), inplace=True)

This code example shows the loading of a dataset using Pandas, the detection of missing values with the isnull() function, the elimination of rows containing missing values with the dropna() function, and the substitution of missing values with mean or median values through the fillna() function. Note that dropping rows and imputing values are alternative strategies; in practice you would choose one of them.
Contending with Repetitive Data

Repetitive data has the potential to skew analysis and modeling outcomes. It is pivotal to identify and expunge duplicate entries from the dataset. The following example elucidates the handling of duplicate data:
import pandas as pd

# Dataset ingestion
data = pd.read_csv("dataset.csv")

# Detecting duplicate rows
duplicate_rows = data.duplicated()

# Eliminating duplicate rows
data = data.drop_duplicates()

# Index reset
data = data.reset_index(drop=True)

This code example demonstrates the detection and removal of duplicate rows using Pandas. The duplicated() function identifies duplicate rows, which can subsequently be eliminated using the drop_duplicates() function. Finally, the index is reset using the reset_index() function, resulting in a pristine dataset.
Managing Outliers and Noise

Detecting and managing these anomalies in a suitable manner is crucial. The subsequent example elucidates the management of outliers using the z-score method:
import numpy as np

# Calculating z-scores
z_scores = (data - np.mean(data)) / np.std(data)

# Establishing a threshold for outliers
threshold = 3

# Detecting outliers
outliers = np.abs(z_scores) > threshold

# Eliminating outliers
cleaned_data = data[~outliers]

This code example shows the calculation of z-scores for the data using NumPy, the establishment of a threshold for identifying outliers, the flagging of points whose absolute z-score exceeds that threshold, and the removal of outliers from the dataset. The resultant dataset, cleaned_data, is devoid of outliers.
Resolving the Conundrum of Categorical Variables

Categorical variables bearing inconsistent or ambiguous values can introduce data quality predicaments.
Handling categorical variables entails techniques such as standardization, one-hot encoding, or ordinal encoding. The subsequent example employs one-hot encoding:
import pandas as pd

# Dataset ingestion
data = pd.read_csv("dataset.csv")

# One-hot encoding
encoded_data = pd.get_dummies(data, columns=["category"])

In this code example, the dataset is loaded using Pandas, and one-hot encoding is executed through the get_dummies() function.
The resulting encoded_data will incorporate separate columns for each category, with binary values denoting the presence or absence of each category.
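The text above also mentions ordinal encoding, which suits categories with a natural order. Here is a minimal sketch, assuming a hypothetical "size" column with the values small/medium/large; the column and mapping are illustrative, not part of the original example.

import pandas as pd

# Hypothetical ordered categorical column; the mapping below is an assumption.
data = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
size_order = {"small": 0, "medium": 1, "large": 2}

# Map each category to its rank, preserving the order small < medium < large
data["size_encoded"] = data["size"].map(size_order)
print(data)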
Data Preprocessing for Machine Learning

Preprocessing the data assumes importance in managing data quality predicaments and priming it for machine learning models.
You can execute techniques like scaling, normalization, and feature selection. The ensuing example showcases data preprocessing through Scikit-learn.
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Feature scaling
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Feature selection (target is assumed to be the label vector for the dataset)
selector = SelectKBest(score_func=f_regression, k=10)
selected_features = selector.fit_transform(scaled_data, target)

This code example illustrates the performance of feature scaling using StandardScaler() and feature selection using SelectKBest() from Scikit-learn.
The resultant scaled_data incorporates standardized features, while selected_features comprises the most relevant features based on the F-regression score.
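Normalization, mentioned above but not shown, rescales each feature to a fixed range such as [0, 1]. A minimal sketch using Scikit-learn's MinMaxScaler follows; the small array is illustrative data only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data only; in the article's setting this would be `data`.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Rescale each column to the [0, 1] range
normalized = MinMaxScaler().fit_transform(X)
print(normalized)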
Pioneering Feature Engineering for Enhanced Data Quality

Feature engineering entails the creation of novel features or the transformation of existing ones to bolster data quality and enhance the performance of machine learning models. The subsequent example showcases feature engineering through Pandas.
import pandas as pd
import numpy as np

# Dataset ingestion
data = pd.read_csv("dataset.csv")

# Creation of a novel feature
data["total_income"] = data["salary"] + data["bonus"]

# Transformation of a feature
data["log_income"] = np.log(data["total_income"])

In this code example, a novel feature, total_income, is created by aggregating the "salary" and "bonus" columns. Another feature, log_income, is generated by applying the logarithm to the "total_income" column using the log() function from NumPy. These feature engineering techniques augment data quality and furnish supplementary information to machine learning models.
Conclusion

All in all, data anomalies pose customary challenges in machine learning endeavors. Comprehending the distinct types of data anomalies, and acquiring the proficiency to detect and address them, is imperative for constructing dependable and accurate machine learning models.
By adhering to the techniques and code examples furnished in this article, one can effectively tackle data quality predicaments and enhance the performance of machine learning endeavors.
Key Takeaways
Ensuring the quality of your data is crucial for reliable and accurate machine learning models.
Poor data quality, such as missing values, outliers, and inconsistent categories, can negatively impact model performance.
Use various techniques to clean and handle data quality issues. These include handling missing values, removing duplicates, managing outliers, and addressing categorical variables through techniques like one-hot encoding.
Preprocessing the data prepares it for machine learning models. Techniques like feature scaling, normalization, and feature selection help improve data quality and enhance model performance.
Feature engineering involves creating new features or transforming existing ones to improve data quality and provide additional information to machine learning models. This can lead to significant insights and more accurate predictions.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Frequently Asked Questions

Q1. What are the anomalies in the data?
A. Anomalies in data refer to observations or patterns that deviate significantly from the norm or expected behavior. They can be data points, events, or behaviors that are rare, unexpected, or potentially indicative of errors, outliers, or unusual patterns in the dataset.
Q2. How do you use ML in anomaly detection?
A. Machine learning (ML) is commonly used in anomaly detection to automatically identify anomalies in data. ML models are trained on normal or non-anomalous data, and then they can classify or flag instances that deviate from the learned patterns as potential anomalies.
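As an illustration of this idea, here is a minimal sketch using Scikit-learn's IsolationForest on synthetic data; the model choice, data, and contamination setting are illustrative assumptions, not prescribed by the answer above.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Mostly "normal" points around the origin, plus a few far-away anomalies
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# Fit on the data; points that deviate from the learned structure are labeled -1
clf = IsolationForest(contamination=0.03, random_state=42)
labels = clf.fit_predict(X)
print("Flagged as anomalous:", (labels == -1).sum())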
Q3. What is an anomaly in machine learning?
A. An anomaly in machine learning refers to a data point or pattern that does not conform to the expected behavior or normal distribution of the dataset. Anomalies can indicate unusual events, errors, fraud, system malfunctions, or other irregularities that may be of interest or concern.
Q4. What are the different types of anomaly detection?
A. There are various types of anomaly detection methods used in machine learning, including statistical methods, clustering-based approaches, density estimation techniques, supervised learning methods, and time-series analysis. Each type has its own strengths and is suited for different types of data and anomaly detection scenarios.