Friday, 7 July 2023

Unveiling the Clever Hans Effect in Audio Deepfakes: A Deep Dive by Bhusan Chettri

Deepfakes, computer-generated data produced using advanced deep learning algorithms, have become increasingly difficult to differentiate from real data. In this article, Dr. Bhusan Chettri takes a comprehensive look at the Clever Hans effect in audio deepfake detectors, focusing on two widely […]

Wednesday, 28 December 2022

Data Analytics Revision By Bhusan Chettri

 

In this lecture, Bhusan Chettri summarises the topics covered in the course Data Analytics using Python. He begins with a general description and summary of data analytics, followed by a discussion of key concepts of the Python programming language and how it is used in data analytics. He then covers the basics of NumPy (numerical Python), the main concepts studied under the Pandas data analysis library, and an introduction to machine learning, including overfitting and underfitting, bias and variance, and their trade-offs. After a quick recap of the deep learning introduction session, he closes with the importance of algorithm design and analysis techniques, in particular time complexity analysis. Please check out the video for details.

Subscribe to Bhusan Chettri's YouTube Channel for more informative videos.


Timestamps:
1:15:25 Data analytics a quick recap
1:17:00 Introduction to Data Analytics overview
1:18:33 Python programming basics overview
1:20:36 Numpy- numerical python library overview
1:21:08 Pandas - data analysis library overview
1:22:43 Machine learning basics overview
1:31:49 Deep Learning with Keras overview
1:35:32 Algorithm complexity analysis overview
 

Sunday, 20 November 2022

Different approaches to Biometric authentication and why voice biometrics is regarded as a futuristic approach by Bhusan Chettri

 

 

Biometric authentication

 

Bhusan Chettri explains that every human being possesses unique and distinctive biological attributes that make one person different from another. DNA, fingerprints, the iris, facial patterns, and the vocal apparatus are characteristics that differ from person to person, and they are often used for automatic person identification and verification by computer and AI algorithms.

In this article, Bhusan Chettri takes a deeper look at how different biometric authentication methods are used, their advantages and limitations, how voice biometrics differs from them, and why the future of authentication may be voice-based. Bhusan Chettri says, “Many methods for biometric authentication are based on measuring the similarity of data points in a feature space. In other words, these methods extract relevant features from the data during both the training and testing phases of the system, and basically measure the distance between the features extracted from the test sample and the template created during the training step.”
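As an illustration of the template-matching idea described in the quote above, here is a minimal sketch (not from the article; the feature vectors and the threshold are invented) of verifying a test sample by measuring its similarity to an enrolled template in feature space:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(template, test_features, threshold=0.8):
    """Accept the test sample if it is close enough to the enrolled template."""
    return cosine_similarity(template, test_features) >= threshold

# Toy example: an enrolled template, then a matching and a non-matching probe.
template = np.array([0.9, 0.1, 0.4])
genuine  = np.array([0.88, 0.12, 0.41])   # close to the template
impostor = np.array([0.1, 0.9, 0.2])      # far from the template

print(verify(template, genuine))   # True
print(verify(template, impostor))  # False
```

Real systems differ mainly in how the features are extracted and which distance or score is used, but the enroll-then-compare structure is the same.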

  

DNA for authentication

DNA stands for deoxyribonucleic acid, and it has been used for personal identification and authentication in different applications. One of the main advantages of using DNA for biometric authentication is that a person's DNA does not change, whether he/she is alive or dead. It remains the same as a person transits through the various phases of the life cycle: infancy, childhood, adulthood, and old age. Therefore, it is regarded as one of the most reliable forms of identification and person verification. However, practical, large-scale adoption of this method is not possible due to several shortcomings: (a) the whole process of analyzing DNA samples is time-consuming; (b) ethical concerns: protecting people's privacy is important, as the information extracted in the process can easily reveal the identity of individuals, so safety measures must be in place to prevent misuse; (c) verifying the identity of twins: as identical twins share a near-identical genome, it is hard to distinguish them using DNA; (d) cost: the equipment involved in performing DNA analysis and maintaining the infrastructure is expensive, so not every person or research group can afford DNA-based biometric authentication unless their research is well funded.

 

Fingerprints for authentication

Fingerprint recognition is still one of the most widely adopted methods for personal identification. It is commonly used to verify a person's identity in passport control for border protection, digital identity, and financial services. Usually, all ten fingers are considered when creating the user's fingerprint template. The fingerprint authentication method works in two stages.


Enrollment: during this step, the system records the user's fingerprints from all ten fingers and creates the fingerprint template. For example, consider someone entering a new country who needs to pass through border control. All of the person's fingerprints are recorded there; this can be regarded as the creation of that user's fingerprint template. The next time the person enters the country, his fingerprints are recorded again. This time the system compares the stored fingerprint template with the new prints to verify that the person is the same, and it also pulls up the records/history about the person from the last visit to that country.


Unlike identity cards, which can be spoofed easily, fingerprints are difficult to spoof, and therefore fingerprint authentication is often considered one of the most reliable authentication methods. Furthermore, with such a method a user is not required to remember a long password, which means there is no fear of forgetting one. Thus, this method offers a simplified and convenient means of authentication. It also has some limitations and drawbacks. Cost is one key factor: implementation and maintenance of fingerprint identification systems are costly, especially for an individual or a small organization. Furthermore, like any other electronic device, such systems are affected by power failures, so a constant power backup is needed to keep the system up and running. Another key issue is that a significant number of people with physical disabilities (loss of fingers or hands) may be excluded from using this method.
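The two-stage enrollment/verification flow described above can be sketched as follows. This is purely illustrative: real fingerprint matchers compare minutiae points with tolerance for noisy scans, whereas this toy version hashes the raw scans and demands exact equality, which only works for identical inputs. The user ID and scan data are made up.

```python
import hashlib

# Hypothetical in-memory template store; a real system would use secure
# storage and minutiae-based matching rather than exact hashes.
templates = {}

def enroll(user_id, scans):
    """Enrollment: build and store a template from all ten finger scans."""
    templates[user_id] = [hashlib.sha256(s).hexdigest() for s in scans]

def verify_user(user_id, scans):
    """Verification: compare fresh scans against the stored template."""
    stored = templates.get(user_id)
    if stored is None:
        return False
    return [hashlib.sha256(s).hexdigest() for s in scans] == stored

enroll("traveller-42", [b"finger-%d" % i for i in range(10)])
print(verify_user("traveller-42", [b"finger-%d" % i for i in range(10)]))  # True
print(verify_user("traveller-42", [b"other-%d" % i for i in range(10)]))   # False
```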

 

IRIS for authentication

The iris is the part of the eye behind the cornea that surrounds the pupil (the adjustable circular opening in the center of the eye). Every person has a unique iris pattern that does not change during their lifetime. As in other forms of identification, this methodology works in two stages: first, an iris template for a user is prepared; then, template matching is performed at test time (or deployment). The system works by first locating the pupil in the eye and then identifying the locations of the iris and eyelids. Unnecessary data such as eyelids and eyelashes are removed through cropping, and only the iris region is retained. From the retained data, relevant features are extracted, and a user template is created and stored in the database. At test time (during deployment), features are extracted from a new person's iris using the same methodology as in the training step, and the stored template is compared with the new features to find the degree of match between the two.
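The template-matching step for iris recognition is often described in terms of the Hamming distance between binary iris codes. A hypothetical sketch of that decision rule (the 0.32 threshold is a commonly cited value in the iris-recognition literature; the code length and noise model here are invented):

```python
import numpy as np

def hamming_distance(code_a, code_b):
    """Fraction of bits that differ between two binary iris codes."""
    return float(np.mean(code_a != code_b))

def match(template_code, probe_code, threshold=0.32):
    """Accept if the fractional Hamming distance falls below the threshold.
    Unrelated irises give a distance near 0.5; the same iris gives a
    small distance dominated by sensor noise."""
    return hamming_distance(template_code, probe_code) < threshold

rng = np.random.default_rng(0)
template = rng.integers(0, 2, size=2048)   # enrolled binary iris code

same_eye = template.copy()
flip = rng.choice(2048, size=100, replace=False)   # ~5% sensor noise
same_eye[flip] ^= 1

other_eye = rng.integers(0, 2, size=2048)          # unrelated iris

print(match(template, same_eye))    # True  (distance ≈ 0.05)
print(match(template, other_eye))   # False (distance ≈ 0.5)
```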

Iris recognition provides a fast and accurate means of person authentication and works even when a person is wearing eyeglasses. It is possible to discriminate between twins with this method because iris patterns are unique even across twins. Because it is contact-less, it is also more hygienic than fingerprint authentication, which involves physically touching a device. As the device operates on infrared light, this form of recognition works even in dark or night-time conditions. On the other hand, the method requires high-quality sensors and an infrared light source and cannot operate with a normal camera. It may not work from a distance, and therefore requires the user to stare at the system's scanner/camera from close proximity, which may not be comfortable for every user.

 

Facial pattern for authentication

Every person has a unique face (with the few exceptions of very similar-looking faces or twins). Patterns extracted from an individual's face are widely used for person authentication across different domains such as border control, law enforcement, and personalized user experiences. Like other biometric systems (for example, voice-based biometrics), building a facial recognition system involves two major steps: training and testing (deployment). The training step usually involves extracting relevant features corresponding to an individual's facial pattern and preparing a face template (or faceprint) from the extracted features. During deployment (while the system is up and running), it compares the image (either captured in real time or a previously captured digital image) with the stored user's face template to find the degree of similarity. If the degree of match is very high (as measured by the probability score returned by the authentication system), the system grants the user access to services or passes control to subsequent phases, depending on the application. For example, in the case of border and passport control, if a newly detected face matches one of the templates in a database of fraudsters, the authorities might be alerted to take quick action.
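The border-control scenario above, where a probe face is screened against a database of stored templates, might be sketched like this (the identities, faceprint vectors, and threshold are all invented for illustration; real faceprints are high-dimensional embeddings from a neural network):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two faceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen(probe, watchlist, threshold=0.9):
    """Return the best-matching identity if its similarity exceeds
    the threshold, otherwise None (no alert raised)."""
    best_id, best_score = None, -1.0
    for identity, template in watchlist.items():
        score = cosine(probe, template)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id if best_score >= threshold else None

watchlist = {
    "fraudster-007": np.array([0.2, 0.9, 0.3]),
    "fraudster-099": np.array([0.7, 0.1, 0.7]),
}
probe = np.array([0.21, 0.88, 0.31])   # close to fraudster-007's faceprint
print(screen(probe, watchlist))                       # fraudster-007
print(screen(np.array([0.5, 0.5, 0.5]), watchlist))   # None: no alert
```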


As AI algorithms for facial recognition (dominated by deep learning) have advanced substantially in the past few years, these systems can detect and verify persons with a high degree of accuracy even at night (or even when a user is wearing a mask). For example, commercial phones such as the iPhone X have Face ID, which allows users to protect access using their face pattern. Every time someone tries to access the phone, it prompts for a face match and allows access only to registered faces (in this case, the phone's owner). While such a method offers many benefits in terms of convenience and flexibility, it does have limitations and disadvantages. Privacy and security are growing concerns: how these systems capture people's images, store them in a database (or in the cloud), and use them in their application pipelines. These data could be hacked by scammers to steal someone's identity, or agencies may use them to track people without their consent. Furthermore, one of the major issues with facial recognition technology is the racial bias (or discrimination) that such AI systems learn from their training data. Research on racial discrimination in face recognition technology by Alex Najibi found that AI systems trained to detect a person using their face were highly biased with respect to skin color: they performed poorly in recognizing Black males and females while showing a high degree of accuracy in recognizing white persons. A natural question, then, is how we can trust a system that has learned bias and discrimination on its own; such AI systems cannot be trusted in safety-critical applications. Because of such biases, in several parts of the US, such as Boston and San Francisco, police and local law enforcement agencies have been banned from using facial recognition software.
Alex Najibi is a 5th-year Ph.D. candidate studying bioengineering at Harvard University’s School of Engineering and Applied Sciences. 


Voice for authentication

Voice (or speech) is the primary means of communication among human beings. Let us briefly talk about the human voice production mechanism. Broadly, there are three major parts (or steps) involved in speech production: the breathing mechanism of the lungs; the airflow from the lungs passing through the vocal folds, which vibrate to produce different types of sounds; and finally the vocal tract, a resonating system that helps produce different varieties of sounds. Different sounds, for example vowels and consonants, are produced based on how the air generated by the lungs interacts with the vocal apparatus (vocal tract), including the position of the tongue and the articulation of the lips (closure/opening). Usually, the vocal tract of a female is shorter than that of a male; this is one reason why female voices have a higher pitch than male voices.

Every person has a unique voice. Voice patterns are often used for the purpose of identity verification of an individual. Please check An overview of Voice Biometrics by Bhusan Chettri to learn the basics of voice biometrics. Bhusan also explains the usage and challenges of voice authentication systems. Although a person's voice changes over time, today's voice biometric systems account for such factors during system development. Therefore, their accuracy remains almost the same even when the speaker is suffering from a sore throat or a cold.
Voice or speech conveys several levels of information. These are:


  1. Speech/voice: the message
  2. Emotion: the emotional state of the person
  3. Gender: male or female
  4. Language: which language is being spoken?
  5. Accent (demography): which part of the country/state?
  6. Speaker: who is speaking?


Based on the above, voice/speech technology has several applications.

Automatic Speech Recognition (ASR): this technology is used to extract the spoken words with a high degree of accuracy. In other words, an ASR system outputs the text (transcript) from a given spoken input.

Automatic Emotion Detection: this technology is used to extract the state of emotion of the speaker while he/she was speaking a particular phrase/word. This emotional state could take values such as happy, sad, angry, etc.

Automatic Gender Detection: this technology is used to automatically extract gender information from a given speech/voice signal. It usually takes two values: male or female unless the system is trained to also consider neutral gender or transgender information. In such a case, the system will no longer remain a binary classifier as it has more than two output states representing gender information.

Automatic Language Identification: given a speech/voice signal, this technology is used to identify which language is being spoken by the speaker. Automatic language identification systems are usually trained in a multi-class setting as there are more than two languages that need to be considered. With such automatic systems in place, it becomes easier, for example, for border control and law enforcement agencies to understand the language being spoken by a person (who has been detained for some reason and doesn’t speak English) and therefore arrange for someone who understands the foreign language to deal with the situation more proactively.

Automatic Accent Identification: given a speech/voice signal, this technology is used to automatically detect the accent used by the person during a conversation. With such information, it is often possible to also infer demographic information about the person. For example, the accent of an Englishman from Yorkshire is completely different from that of an Englishman from London.

Automatic Speaker Identification: automatic speaker verification and automatic speaker identification technologies aim at verifying/identifying the identity of a claimed speaker from a given speech signal. This post on speaker recognition and AI and challenges of voice authentication systems provides a good summary of the technology. The use of voice technology is increasing day by day because of its flexibility and simplicity of operation, and because voice/speech is the very basis of how humans communicate. Such technology gives users the freedom to do things by simply giving a command: a voice speaking from a distance to a machine. For example, asking a digital assistant such as Cortana or Alexa to play a song called Last Kiss by Pearl Jam while lying on the bed. This is possible because of the integration of technologies such as ASR, Natural Language Processing, Natural Language Understanding, and Text-to-Speech into one engine (running in the cloud) that drives the functioning of a digital assistant. As evident from these products (Microsoft's Cortana, Apple's Siri, Google's Assistant, and Amazon's Alexa), the tech giants are investing heavily in AI and machine learning for voice/speech technology, with the aim of enhancing how users interact with technology by simply using their voice.
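The enrollment-and-scoring idea behind speaker verification can be sketched as follows; the embedding vectors here are toy two-dimensional values, whereas real systems derive high-dimensional speaker embeddings from neural networks, and the 0.98 threshold is invented:

```python
import numpy as np

def enroll_speaker(embeddings):
    """Average several utterance embeddings into one speaker model,
    a common and simple form of enrollment."""
    return np.mean(embeddings, axis=0)

def score(model, test_embedding):
    """Cosine score between the speaker model and a test utterance."""
    return float(np.dot(model, test_embedding)
                 / (np.linalg.norm(model) * np.linalg.norm(test_embedding)))

# Three enrollment utterances from the same (hypothetical) speaker.
enrol = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([1.1, 0.0])]
model = enroll_speaker(enrol)

print(score(model, np.array([1.0, 0.12])) > 0.98)   # True: same speaker
print(score(model, np.array([0.1, 1.0])) > 0.98)    # False: different speaker
```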

Apart from the top tech giants, voice technology has been adopted across numerous industries and sectors. For example, voice biometrics is widely used in banks and insurance companies for personal authentication; it is used to enhance the personal user experience; and it is used in surveillance and forensic applications. Despite the success brought by recent advancements in AI algorithms for voice authentication, like any other biometric technology it is not 100% secure. Such systems can be manipulated to gain illegitimate access to another registered user's account, tamper with their data, and perform illegal activities. Bhusan Chettri recently explained the security of voice biometrics and voice authentication and spoofing attacks. To counter such spoofing attacks, Dr. Bhusan Chettri, who earned his Ph.D. in voice technology and AI (analysis and design of countermeasures for voice spoofing attacks using AI and machine learning) from Queen Mary University of London, has been making research contributions in the field (see Google Scholar), raising awareness of the problem and of how biases in training data impact the decisions of anti-spoofing systems, and focusing on explaining the predictions of such systems so as to build better, more reliable, and trustworthy systems for spoofing detection. Because of its ease of use, voice biometrics has great potential for user authentication. Combined with other biometrics such as iris and facial recognition, an ensemble system for person authentication could bring more accuracy in discriminating a real person from fraudsters.


Initially posted by Ventsmagazine 

Related links:                        

ISSUU     DEEPAI      ORCID      GITHUB        QUORA
Saturday, 19 November 2022

Applications, advantages, and the risks of synthetic voices created using AI by Bhusan Chettri

 

Artificial Voices and Artificial Intelligence

 

Dr. Bhusan Chettri, a Ph.D. graduate from Queen Mary University of London (QMUL), explains the fundamentals of how today's AI technology and computers are capable of producing human-sounding synthetic voices; he discusses their various applications and their advantages, along with the threats they pose to voice-controlled applications.

Before diving into computer-generated voice using AI and its applications, he considers a simple banking scenario to provide better context. Bhusan Chettri says, “Let’s assume that you are an employee of some reputed international bank. One fine day, while working late to keep up with a deadline, you receive a phone call from an unknown number from abroad. The number seems to be from the UK. You answer the call, and the person introduces himself as your manager. This makes absolute sense, because your manager is currently in London attending business meetings to fix up some international deals. Yes, you are now confident, because of the country code and the familiar voice you hear, that the person you are speaking to is your manager. You decide to engage in the conversation despite the urgent deadline. He asks you to make a wire transfer of a huge sum to a new account. He emphasizes that this money is very important to complete the deal he has come to London for and that it needs to be done ASAP. Then he hangs up after giving clear instructions for making the transfer. This does not sound right. You have so many questions running through your head. You are confused and do not understand what to do. On one hand, it is your manager’s voice from London, the voice confirms it’s him, and the country code says it’s a UK number. But hang on: what he is asking is not in line with the company’s policy. A transfer to a new account needs to go through verification checks. There are procedures to be followed, and he has asked me to ignore everything else and make the wire transfer ASAP. You start to think about all the possible options that you have.
If I stick to the company’s protocol and do not listen to him, he may get very upset; the deal might not go through because of this, and consequently I will not get my long-awaited promotion next month. With all these thoughts running through your mind, you follow your instincts and finally transfer the sum to the account that was provided.”

What do you think could have happened? There can be two likely outcomes.

  1. The voice was really from the manager and the phone call was genuine (and of high emergency). So, you are praised for your prompt action and support. The deal went through, and you got your most awaited promotion. Everybody is happy.

  2. The second outcome does not have a happy ending. The voice was fake. It came from an attacker (scammer) who had been planning to scam the bank for a long time. This means you have just been scammed, and the bank has lost a huge sum of money. In this case, instead of getting that promotion, your job is at high risk. Most likely you will face an internal investigation, and legal action might be taken against you for not following the company's policy. This sounds very ugly!

 

Bhusan Chettri adds, “The second outcome is the bigger concern. With the recent advances AI has made in Text-to-Speech and Voice Conversion technology, computers are now capable of producing fake/synthetic voices that sound as natural as if spoken by a real human. The technology behind this is usually driven by so-called deep learning, which uses a massive amount of speech data to learn the patterns in voices (similarities and differences across different speakers). Such systems can build synthetic voices that sound flawless to human ears. One reason humans are often unable to discriminate between today's computer-generated voices (from advanced AI algorithms) and real voices is that our ears are not designed to detect the tiny artifacts that AI algorithms introduce into synthetic voices; rather, our ears focus on the bigger picture, aiming to understand the content (spoken words, speaking style, etc.) rather than the tiny differences induced by the algorithms.”

Therefore, it is very important to gain a basic understanding of these technologies that can produce artificial voices which humans cannot distinguish from real ones; their pros and cons (we will see by the end of this blog that such technology can be used equally for good and bad, as there are both good and bad actors); and their applications, advantages, and the potential risks they pose to voice-driven applications in various settings (for example, banking and the digital home). Next, Bhusan Chettri provides a basic overview of TTS and VC technology and discusses their applications.


Text to Speech (TTS)

Text-to-Speech technology transforms input text into speech. Given an input text, the goal of TTS is to produce the corresponding speech while maintaining naturalness as closely as possible, such that it becomes quite hard for human ears to distinguish it from real human speech. Figure 1 illustrates a typical TTS system. It consists of two major components: a text analysis module and a speech waveform generation module. The text analysis module is responsible for analysing the input text and producing a sequence of phonemes defining the linguistic specification of the input text. These are then passed to the waveform generation module, which produces the speech waveform from the input phonemes. It is also worth noting that today's advanced AI algorithms (so-called end-to-end deep learning) are capable of producing speech waveforms directly from the input text.

 

                                                Figure 1: A typical Text-to-Speech system
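The two-stage pipeline of Figure 1 can be sketched in code. Everything here is a placeholder: a real front end uses a learned grapheme-to-phoneme model rather than a two-word toy lexicon, and a real back end uses a statistical or neural synthesizer rather than one sine tone per phoneme. The sketch only mirrors the structure: text analysis produces phonemes, and waveform generation turns them into audio samples.

```python
import numpy as np

G2P = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}   # toy lexicon

def text_analysis(text):
    """Front end: map input text to a phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(G2P.get(word, ["?"]))   # "?" for out-of-lexicon words
    return phonemes

def waveform_generation(phonemes, sr=16000, dur=0.1):
    """Back end: emit a placeholder waveform, one short tone per phoneme."""
    t = np.arange(int(sr * dur)) / sr
    chunks = [np.sin(2 * np.pi * (200 + 40 * i) * t)
              for i, _ in enumerate(phonemes)]
    return np.concatenate(chunks)

phones = text_analysis("hi there")
audio = waveform_generation(phones)
print(phones)        # ['HH', 'AY', 'DH', 'EH', 'R']
print(audio.shape)   # (8000,): 5 phonemes x 1600 samples each
```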

Such recent advances, driven by big data and deep learning (a form of AI), have brought significant progress in the TTS field. For example, the Canadian start-up Lyrebird claims that its technology can produce a synthetic voice after listening to just one minute of sample audio from the target speaker. In simple words, their technology can clone any voice (using only 60 seconds of a voice sample) and make it say anything they like. Furthermore, they claim that their technology can incorporate emotions into synthesized speech, meaning that customers can create synthetic voices expressing anger, joy, or stress in the spoken speech. Just as computers can photoshop (edit) images to create fakes, with such commercial speech technology people can also edit and manipulate voices very easily. For example, check out this link to hear the computer-generated voices of famous US politicians Donald Trump, Barack Obama, and Hillary Clinton. Big companies like Google, Microsoft, Apple, and Adobe have also built AI systems that can create human-like voices.

 

Applications of TTS

This technology can be used in a wide range of applications such as automatic text reading on mobile phones, e-book reading using the voices of popular celebrities, synthesizing voices for disabled people, and translating speech from one language to another, to name a few. However, it can also have severe implications when used with the wrong intentions. With such technologies, engineers/programmers can create convincing fake voices of anyone they like. An example of a fake voice and the impact it could have was already discussed in the banking scenario at the start of this article.

Combining such synthetic voices with fake images, one can further create customized fake videos of any person (for example, a celebrity) doing something unethical. For example, imagine the consequences and impact of a fake video (created using AI technologies for voice, images, and video) of President Joe Biden making negative comments about China (and its culture) going viral on social media. What could the consequences be? Such synthetic voices can also trick voice-controlled biometric systems used to verify a person's identity. Please see this post to gain insight into the basics of voice biometrics.


Voice conversion (VC)

This technology aims at producing an artificial voice of a target speaker from a source speaker's voice. Thus, unlike TTS, voice conversion technology typically takes two voices as input during its training step: the source speaker's voice (the voice to be transformed/converted to sound like the target) and the target speaker's voice (the voice to be created). It should be noted that the content, i.e. the spoken words, remains the same during the conversion process; the only thing that changes is the voice identity. Figure 2 illustrates a typical VC system. Usually, the voice conversion algorithm works directly on speech signals of the source and target speakers where both persons are speaking the same utterances. In other words, the system requires a parallel corpus of the two speakers to learn the transformation function that converts the vocal characteristics of one speaker into those of another.


 

Figure 2: A typical Voice Conversion system showing both training and deployment steps.
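The idea of learning a transformation function from a parallel corpus, as in Figure 2, can be illustrated with a deliberately simple linear model: given time-aligned source and target feature frames, estimate a matrix that maps one to the other, then apply it to new source utterances at deployment. Real voice conversion uses far richer, non-linear models; this sketch, with synthetic features, only mirrors the training/deployment structure.

```python
import numpy as np

rng = np.random.default_rng(1)
source = rng.normal(size=(200, 3))          # parallel source-speaker features
true_map = np.array([[1.2, 0.0, 0.1],       # "ground-truth" conversion,
                     [0.0, 0.8, 0.0],       # used only to synthesize the
                     [0.2, 0.0, 1.1]])      # aligned target features below
target = source @ true_map

# Training: estimate the conversion matrix W so that source @ W ~ target.
W, *_ = np.linalg.lstsq(source, target, rcond=None)

# Deployment: convert the features of a new source-speaker utterance.
new_utterance = rng.normal(size=(5, 3))
converted = new_utterance @ W

print(np.allclose(W, true_map, atol=1e-6))   # the mapping was recovered
```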

    

 

Applications of VC

Some of the applications of VC technology include producing natural-sounding voices for people with speech disabilities and voice dubbing in the entertainment industry. Alternatively, this technology can be used to produce fake voices of a targeted speaker with bad intentions, either to defame them or to steal their identity for malicious purposes.

 

Risks on Voice Biometrics

It is well known that images can be faked using photoshop. Often when we see certain images we instantly react by commenting “oh that image is photoshopped”. It’s a fake image. We make such conclusions either because the image was too good to be true or we already had some prior knowledge about that photoshopped picture’s contents. Without such prior knowledge, it is very easy for anyone to get fooled into believing that the image is real. It becomes hard to judge between a real and a photoshopped image.

As discussed above, TTS and VC technology are capable of producing human-like synthetic voices, which an attacker with malicious intentions can use to attack voice-biometric-controlled access systems: banks, personalized systems such as voice-controlled automobile access, or another person's smartphone. What you hear may not be 100% trustworthy, and the person may not be who he/she claims to be. With this article, we aim to raise awareness of the existing AI technology and algorithms that can edit or synthesize voices to make them sound as natural as if spoken by a real human. Such awareness helps us think about how to safeguard ourselves from attacks launched by bad actors. Next, we provide a brief overview of the AI that is deployed to counter fake speech generated using TTS and VC technology.


Protecting Voice Biometrics against synthetic voices

Now we briefly discuss how voice biometrics can be protected from manipulation using computer-generated voices. Figure 3 illustrates the different components of a typical countermeasure, a.k.a. a fake speech/voice detector. A countermeasure is basically an AI system, typically a binary classifier, whose main task is to determine whether the input speech is real, human-spoken speech or a computer-generated voice. To make such judgments, these systems are first trained on a large speech dataset containing both real and computer-generated voices collected from many speakers across the globe. During training, the algorithm learns discriminative patterns between real and fake voices. Later, during deployment (the testing step), the fake speech detector looks for those learned patterns/signatures in the voice being tested. If it finds the pattern of a fake voice, it classifies the new voice as fake; otherwise it regards it as genuine, and the detector allows the voice to pass through the other components of the biometric system to provide further services. Note that one step is common to both training and testing: feature extraction. This step transforms the input speech into a representation that is simpler and more meaningful for the algorithm to process further towards building the desired classifier, and it must be identical during training and testing.



Fig 3: Fake speech detector (countermeasure).


As fake speech detection and prevention has become a hot topic and an emerging research field, the speech community has promoted this research by launching the automatic speaker verification and spoofing countermeasures (ASVspoof) challenges, held since 2015. The main goals of ASVspoof are to raise awareness of voice spoofing techniques and to bring researchers around the globe together to combat voice spoofing attacks. To this end, the organisers also release free spoofing corpora, available from their website.


Summary

In this article, Bhusan Chettri discussed how computers and AI can be used to generate synthetic voices. Much as one can photoshop images using commercial software (e.g., Adobe Photoshop), these technologies let one easily edit and manipulate voices. Text-to-speech and voice conversion are the two most commonly used technologies for producing artificial voices that sound as natural as if spoken by a real human. Bhusan Chettri further discussed their applications in different domains and the dangers of such human-sounding synthetic voices. Finally, he briefly discussed the risks computer-generated voices pose to voice-biometric systems.

Overfitting & underfitting in Machine Learning, the two prime factors that need attention, by Bhusan Chettri
Bhusan Chettri, a PhD graduate in Machine Learning for Voice Technology from QMUL, UK, discusses overfitting and underfitting, the two prime factors that need attention during machine learning.
“Machine learning (ML) is simply the art of teaching computers to solve problems by guiding them to map a given input to an appropriate output by exploiting patterns and relationships within training data,” says Bhusan Chettri, a PhD graduate in Machine Learning for Voice Technology from QMUL, UK. In simple terms, a mapping function from input to output is learned, guided by a loss function (often called the objective function) that measures how well the algorithm is doing its intended job. In this document, Bhusan Chettri summarises two dangers one may fall into during machine learning: overfitting and underfitting. These are two key points that any ML practitioner (or beginner) must account for to ensure that a trained model shows the expected behavior when deployed in real-world applications. Taking deep neural networks as an example, this article summarises the two concepts briefly and outlines key steps to overcome them.

ML models are data-driven, which means they are data hungry. A massive amount of data is therefore often required to train and test them before deploying them in a production pipeline. ML engineers and researchers often believe that throwing as much data as possible into an ML pipeline yields better results. While this is true to some degree, data quality is also key: it ensures that the trained model is neither biased nor exploiting irrelevant cues in the data (which may have crept in accidentally, perhaps during data collection). Usually, having identified an ML problem (for a business or for research), the next crucial step is acquiring data. One may purchase data, download it from the web if freely available, or set up a data collection pipeline if no suitable data can be purchased or downloaded. Having collected the dataset, one important task for the ML practitioner is to partition the data into three disjoint subsets: training, development, and test sets. The training set is used to learn the model parameters; in simple terms, the mapping function from input to output is learned from the training data. The development set is used for model selection: during training, performance is measured on both the training and development sets, and based on how the model performs on the development set, practitioners judge when to stop training and which model configuration to choose for final use. This training (a.k.a. learning) process is iterative; due to computational constraints, the training samples are shown to the learning algorithm in steps.
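The three-way partition described above can be sketched as a small helper function. This is an illustrative sketch with hypothetical fractions; real pipelines often split by speaker or by time to keep the subsets truly disjoint.

```python
import random

def split_dataset(samples, dev_frac=0.1, test_frac=0.1, seed=42):
    # Shuffle once, then carve out three disjoint subsets:
    # training (learn parameters), development (model selection),
    # test (final evaluation, kept aside until the very end).
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(samples) * test_frac)
    n_dev = int(len(samples) * dev_frac)
    test = [samples[i] for i in idx[:n_test]]
    dev = [samples[i] for i in idx[n_test:n_test + n_dev]]
    train = [samples[i] for i in idx[n_test + n_dev:]]
    return train, dev, test
```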

Bhusan Chettri says, “Expecting a model to learn the relevant patterns and solve a problem in a single step (by showing it the whole training data in one go) is not right, and is not best practice.” Usually, deep neural networks, one kind of machine learning model, are trained iteratively by showing them small groups of data samples called mini-batches. This iterative learning process is often referred to as stochastic gradient descent: the term “stochastic” means the mini-batches are picked at random, and “gradient descent” is the method used to learn a suitable mapping function by optimizing the objective function. As the algorithm runs over many such steps, performance on the seen training examples improves with every new iteration.
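The mini-batch stochastic gradient descent loop described above can be sketched for the simplest possible model, a linear map trained with a mean-squared-error objective. This is a toy example, not a deep network, but the loop structure (shuffle, slice into mini-batches, one gradient step per batch) is the same.

```python
import numpy as np

def sgd_linear_fit(X, y, lr=0.05, batch_size=8, epochs=50, seed=0):
    # Mini-batch stochastic gradient descent for a linear model
    # minimising mean squared error (the toy objective function here).
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))        # "stochastic": random order
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            pred = X[batch] @ w
            grad = X[batch].T @ (pred - y[batch]) / len(batch)
            w -= lr * grad                     # one gradient-descent step
    return w
```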

Initially, performance on both the training and development data is poor. After a certain number of iterations, performance on both sets starts to improve. Eventually, after, say, a few hundred steps/iterations, development-set performance stalls while training-set performance may reach 100% accuracy. This means the model has learned to do a good job on the training data, the data it has seen during training, yet it performs poorly on the development set, a small disjoint fragment of the data. The trained model is finally evaluated on the test set, a large disjoint data set kept aside to measure model performance.

Usually, the test set is designed to reflect the real-world deployment use case, and its distribution may therefore differ from that of the training and development data. So what should one expect here? Does the model perform well on the unseen test set? Would it show performance similar to what it achieved on the training set? The answer is no. The over-trained model performs poorly on the test set because it has learned the noise in the training data. In simple terms, the model has memorised almost everything, including irrelevant patterns, in the training data; this is why it already failed on the development set, and it is then no surprise that it also performs poorly on the test set. The performance gap between the training and development sets is what ML practitioners are usually concerned about. This phenomenon, a model performing well on seen data but failing on unseen data, is called overfitting; such a model has poor generalisation capability. The fundamental problem, therefore, is training ML models that generalise well, that is, minimising the gap between training and test performance. Likewise, it often happens that a model fails to perform well even on the training set after many training iterations. This suggests the model is unable to learn the underlying patterns in the training data: it is not powerful or flexible enough to discover them. This phenomenon is called underfitting, and it too must be taken into account when building ML models. A good ML model is therefore one that neither overfits nor underfits the training data.
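One standard way of acting on the training/development gap described above is early stopping: monitor the development-set loss after each iteration and stop once it stalls. A minimal sketch follows; the `train_step` and `eval_dev` callables are hypothetical placeholders standing in for a real training loop and a real development-set evaluation.

```python
def early_stopping(train_step, eval_dev, max_iters=1000, patience=5):
    # Run training iterations, tracking the best development loss seen.
    # Stop once the dev loss has failed to improve `patience` times in a
    # row -- a sign that further training would only overfit.
    best_loss, best_iter, waited = float("inf"), 0, 0
    for it in range(max_iters):
        train_step()                 # one training iteration (mini-batch/epoch)
        dev_loss = eval_dev()        # measure performance on the dev set
        if dev_loss < best_loss:
            best_loss, best_iter, waited = dev_loss, it, 0
        else:
            waited += 1
            if waited >= patience:   # dev loss stalled: stop training
                break
    return best_iter, best_loss
```

The returned `best_iter` is the model checkpoint a practitioner would select for final use; this is the "model selection on the development set" described earlier.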

A simple and often effective way to deal with underfitting is to increase model complexity: adding more hidden layers or more units to a deep neural network often helps. Dealing with overfitting, on the other hand, is less straightforward. Common approaches to combat it include: (1) increasing the amount of training data; (2) reducing model complexity, e.g., reducing the size of the neural network; (3) adding regularization during training, for example L1 and L2 weight decay (penalising the magnitude of the model's weight parameters); and (4) using dropout: randomly removing some units of the network during training to discourage the model from memorising the training patterns and encourage it to learn generalisable ones instead.
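Two of the remedies listed above, L2 weight decay and dropout, can be sketched in a few lines of NumPy. This is an illustrative sketch of the underlying arithmetic, not a framework implementation; in Keras these correspond to a `kernel_regularizer` on a layer and the `Dropout` layer, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_decay_step(w, grad, lr=0.1, weight_decay=0.01):
    # L2 regularization adds weight_decay * w to the gradient, so every
    # update also shrinks the magnitude of the weights.
    return w - lr * (grad + weight_decay * w)

def dropout(activations, rate=0.5, training=True):
    # Inverted dropout: during training, randomly zero a fraction `rate`
    # of the units and rescale the survivors so the expected activation
    # is unchanged; at test time the layer is the identity.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)
```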
