What Do They Do in Machine Learning?

When I first started learning Machine Learning, reading a lot of documents and watching tons of YouTube videos, a question popped up in my mind:

It seems that a relatively small number of Machine Learning algorithms are widely used in the industry.
According to the tutorials I have seen, pretty much everything seems to be automatic: you put a bunch of input data into a neural network, the network trains itself, and it produces the output.
Then the question is: why do they say there is still a lot of demand for engineers in this field?

I think it is natural to have this kind of question whenever you try to get into an area that is surrounded by hype and that everybody is talking about.

One way to get a relatively clear view of (or answer to) this kind of question is to widen the scope of your study a little and look at the whole flow of how the technology is used in the real world, rather than sticking to the software tools or algorithms alone.

Overall Machine Learning Pipeline

In my view, the overall flow of Machine Learning can be illustrated as follows. The first step (1) is to collect data from real-life activity. The type of data to collect depends on which company you work for. In most cases, the data is likely already there in the organization, and the organization has decided to apply Machine Learning in the hope of getting more useful information out of it.

Based on the data and the business requirements, you then need to decide which specific algorithm to apply and what the input and output of that algorithm should be.

Once the algorithm is determined, you need to convert the data into a format that can be fed into the algorithm you want to run.

If I break down the three major processes described above and add a few more steps that come from the business point of view rather than from the technology, the list can be extended as follows.

(1) Determine what outcome you want to achieve.

(2) Determine what the business impact is.

(3) Determine what kind of data you need to collect in real life.

(4) Determine what kind of algorithm you want to choose.

(5) Figure out how to implement the algorithm in a specific tool (e.g., TensorFlow, PyTorch).

(6) Make sure you can justify why you chose the algorithm.

(7) Figure out how to process the raw data so that it fits the input of your algorithm.

(8) In some rare cases, none of the existing algorithms completely fits the purpose of your analysis. In this case, you may need to come up with a new algorithm.

I think most tutorials, tech blogs and videos focus on the algorithm part, that is, mostly items (4) and (5) in the list above. But in reality, many other steps are involved in the overall data flow, as the list shows. I am not speaking for every engineer in Machine Learning, but I think it would be rare to work only on (4) or (5) unless you are inventing new algorithms in academia. Even if your main job at your workplace is (4) and (5), you would need to do at least a few of the other parts as well.

From reviewing many real-life use cases and interview videos with people applying machine learning in various areas, step (7) is one of the largest portions of what many Machine Learning engineers do in their real work, even though not many people really enjoy it. This is the reality of most engineering jobs. Before you jump into the area and do the real job, everything may look fancy. But once you are in the job, you may find that most of the tasks given to you are ones you never expected and do not like much.

Common Data Preprocessing

The following are some common examples of the data processing you may need to perform before you feed the original data (i.e., the data you collected) into the machine learning model you want to use. This does not mean that you always need to do all of this processing for every model. The preprocessing tasks vary depending on the machine learning model and the format of the raw data. I am just trying to list common and frequent forms of preprocessing. If you are not very familiar with programming and wonder where to start with programming for machine learning or data science, I would suggest picking a language that you like and practicing a lot by writing programs for the kinds of tasks listed below. By doing that, I think you can learn programming and machine learning at the same time.

NOTE : I don't have clear information on the source of this figure (I got it from a YouTube video in 2019). The presenter mentioned that around 60% of the cost of a machine learning project goes to preparing data (i.e., data preprocessing). I think the exact cost split between data preparation and the other parts of a project varies with the nature of the project, and it will change as automation technology and tools for data preparation evolve, but at least for now (as of Dec 2019) data preparation is one of the biggest parts of many Machine Learning projects.

Putting data into labeled folders : At least for now, most neural network algorithms are trained on a large set of known examples that we call 'Labeled Data'. So one mandatory step in applying a neural network is to prepare this labeled data. In many cases (e.g., image classification), we gather a lot of images and put them into folders, one folder per category. If you only need to do this for a few images, you can create the category folders and move the corresponding images into each folder by hand. However, it is not practical to do this manually when there are thousands and thousands of images, so you need to come up with a program to automate the process. You should be familiar with how to write code for this kind of thing in whatever language you are using (e.g., Python, Matlab).
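As a rough illustration of automating this, here is a minimal Python sketch using only the standard library. It assumes you already have a mapping from file name to label (for example, loaded from a spreadsheet); the function name sort_into_label_folders and the labels dictionary are my own hypothetical names, not part of any library.

```python
import shutil
from pathlib import Path


def sort_into_label_folders(src_dir, dst_dir, labels):
    """Copy each file in src_dir into dst_dir/<label>/.

    `labels` is an assumed dict mapping a file name to its class name.
    Files without a label are skipped and their names are returned,
    so they can be reviewed by hand.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    unlabeled = []
    for path in sorted(src_dir.iterdir()):
        if not path.is_file():
            continue
        label = labels.get(path.name)
        if label is None:
            unlabeled.append(path.name)
            continue
        target = dst_dir / label
        target.mkdir(parents=True, exist_ok=True)  # create the class folder on demand
        shutil.copy2(path, target / path.name)     # copy, keeping the original intact
    return unlabeled
```

Copying rather than moving keeps the raw data untouched, which is usually the safer choice while you are still checking the labels.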

Resizing or cropping images : Most neural network algorithms for image classification (e.g., CNN) require input images of a specific size. But most of the images you want to process will not be the size required by the network. In this case, you need to resize the image or crop a portion of it that fits the input size of the network. Here are some examples of doing this in Python (here, here)
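To give a feel for what these two operations do, here is a minimal sketch, assuming the image is already loaded as a NumPy array of shape (height, width) or (height, width, channels). The function names are my own, and the resize uses simple nearest-neighbor sampling; in a real project you would more likely use an image library with better interpolation.

```python
import numpy as np


def center_crop(image, out_h, out_w):
    """Cut an out_h x out_w window from the middle of the image."""
    h, w = image.shape[:2]
    if out_h > h or out_w > w:
        raise ValueError("crop size is larger than the image")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]


def resize_nearest(image, out_h, out_w):
    """Resize by picking, for each output pixel, the nearest source pixel."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return image[rows[:, None], cols]
```

Cropping keeps the original pixel values but throws part of the picture away, while resizing keeps the whole picture but changes the pixel grid, so which one fits depends on where the object of interest sits in the image.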

Converting image color : Most neural network algorithms for image classification (e.g., CNN) require input images with a specific number of layers (color channels). The layers of an image file relate to how the file represents the color of each pixel. For example, if the algorithm requires RGB color images, the image has three layers. If the algorithm requires grayscale images, the image has a single layer. If your algorithm requires a single-layer image and you are given an RGB color image, you have to convert the RGB image to grayscale, and vice versa.
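A minimal sketch of both directions, assuming the image is a NumPy array with the color channels on the last axis. The weights are the common ITU-R BT.601 luma coefficients, which is my choice for the illustration; other weightings exist, and the function names are hypothetical.

```python
import numpy as np

# ITU-R BT.601 luma weights for the R, G and B channels (an assumed choice).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def rgb_to_gray(image):
    """(H, W, 3) RGB array -> (H, W) grayscale array (float)."""
    return image[..., :3] @ LUMA_WEIGHTS


def gray_to_rgb(image):
    """(H, W) grayscale array -> (H, W, 3) by repeating the single channel."""
    return np.repeat(image[..., None], 3, axis=-1)
```

Note that going from grayscale back to three layers only satisfies the input shape of the network; it cannot bring back the color information that was lost.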

Replacing a word with another : This type of task is mostly related to algorithms that process data in various text forms (e.g., chatbot logs, email, Q&A databases). Since different people tend to use different words for the same thing, you may want to convert the different words to one common word before you put the data into the network. You can see an example of this case in this video.
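A small sketch of this idea with Python's standard re module. The synonym table below is made up for illustration; in practice it would come from your own domain knowledge or data.

```python
import re

# Hypothetical synonym table: each variant maps to one canonical word.
CANONICAL = {
    "cellphone": "phone",
    "mobile": "phone",
    "handset": "phone",
    "e-mail": "email",
}

# One pattern that matches any variant as a whole word, ignoring case.
# Longer variants are listed first so they win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(w) for w in sorted(CANONICAL, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize_words(text):
    """Replace every known variant in `text` with its canonical word."""
    return _PATTERN.sub(lambda m: CANONICAL[m.group(0).lower()], text)
```

The word boundaries (\b) matter here: without them, "mobile" would also be replaced inside "automobile".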

Converting sequential data into image data : Since Machine Learning algorithms for image classification are so well established, there are many cases where an image classification algorithm (e.g., CNN) is applied to non-image data, as in voice recognition or communication signal/data analysis. In this case, we need to convert the original data into some form of image. However, this conversion is not as straightforward as other types of preprocessing and tends to vary with the specific domain. Converting voice data to a spectrogram and converting a time-domain radio frequency signal to a spectrogram are good examples of this category. This kind of data transformation may require not only programming skills but also specific domain knowledge.
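As one possible illustration, here is a minimal spectrogram sketch in NumPy. It slices a 1-D signal into overlapping frames, applies a Hann window to each frame and takes the magnitude of the FFT. The frame length, hop size and dB scaling are assumptions I picked for the example; the right values depend heavily on the domain.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return a (frame_len // 2 + 1, n_frames) array of magnitudes in dB.

    Rows are frequency bins and columns are time frames, so the result
    can be treated as a single-layer image.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Stack the windowed, overlapping frames: shape (n_frames, frame_len).
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # The small constant avoids log10(0) for silent bins.
    return 20 * np.log10(magnitude + 1e-10).T


# Usage: a 1 kHz tone sampled at 8 kHz shows up as one bright horizontal line.
fs = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
image = spectrogram(tone)
```

With these settings each frequency bin is fs / frame_len = 31.25 Hz wide, so the 1 kHz tone lands in bin 32 of every frame.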

Examples of What they do (Applications / Use Cases)

In this section, I try to list videos showing what the big players in the AI/ML industry have been doing. I mostly list presentations given directly by the companies named, and this is on purpose: I think a presentation coming directly from the company best describes the big picture of what it is doing and what it 'intends to' do. I am also trying to include presentations that show various types of input and output for machine learning systems. Since most of the items are about the big picture or the business model, they do not carry many technical details. More technical topics, big trends in terms of technology (not in terms of business), and various courses are listed on another page here.

Machine Learning at Google

Machine Learning at Facebook

Machine Learning at Microsoft

Machine Learning at Amazon

Machine Learning at Apple

Apple WWDC2017 - Session 703 - Introducing Core ML (2017)
A Guide to CoreML on iOS (2017)
Core ML 3 Framework 2019 (2019)

Machine Learning at Cisco

Machine Learning at AutoDesk

Machine Learning at Ericsson

Machine Learning at Verizon

Machine Learning at Qualcomm

Semantic Segmentation of Image

This explains very intuitively what Semantic Segmentation is, what it is used for, and how to prepare the labeled data for training.

Semantic Segmentation Overview - Train a Semantic Segmentation Network Using Deep Learning.

Predictive Maintenance

This shows a good example of how to define meaningful features from the various sensors in a pump system and how to process the data so that it fits a Machine Learning algorithm.

Natural Language Processing

This use case shows a system that takes in customer support messages written in natural language (e.g., text messages), analyzes them, and suggests possible root causes and treatments. In this presentation, you will learn not only about an application of machine learning but also about how to justify the application in business terms.

Machine Learning at Uber (Natural Language Processing Use Cases)

Churn Prediction

This use case shows how to predict, from a given set of customer history data, whether a customer will switch away from the carrier he/she currently subscribes to.