Understanding Machine Learning by Analogy with a Simple Contour Map – It’s Only Mathematics

Note – This blog post contains links to recent warnings about the dangers of relying on Machine Learning results.

SUMMARY

If you feel that you don’t understand how Machine (or Deep) Learning is typically done, this post should answer many of your questions at a level that anyone who can read a simple contour map can understand.  The idea is to solve a particular minimization problem tied very specifically to what we hope to accomplish.  This is often done using a type of algorithm known as gradient descent, in order to move from a present state of a system to a desired state. We do this in such a way that we minimize some measure of error between where we are at present and where we would like to be, ideally reducing the error to zero (if that is possible).  Problems of this kind arise in many practical applications today and have been adopted in AI research in an attempt to produce possibly useful results under the name “machine learning”.  We can understand machine learning by analogy with a simple contour map.

We emphasize again that what is happening is not learning in the human sense.  The machine has no hidden intelligence to draw on unless we provide it by means of a suitable program that it follows without deviation (since it cannot actually think or make its own decisions).  It is all just an attempt to use some basic and easily implemented mathematics as the basis for what is being called “Machine Learning”.

We’ll use the simple contour map, by analogy, to explore and elucidate the nature of solutions, and this will reveal many aspects of the problem that determine the characteristics of the solution(s) obtained by minimizing an error measure (the common basis for machine learning in practice).  In particular, we will see that the solutions we may find (if we succeed in finding one or more) depend strongly on: the initial conditions (how we begin the problem); the initial state of the system or network architecture (the structure and initial values of the parameters of the system); the data used in the minimization process; the accuracy of the computations to be made; the composition and form of the error measure we choose to use; the type of algorithm chosen to solve the minimization problem; whether we choose to “tamper” with the parameters during the solution process; and how we decide the final solution(s) will be reported (for example, a system output interpreted as “yes” or “no”, “true” or “false”, “0” or “1”, etc., or some rule of interpretation we impose to decide what the final output from the system means, if a meaningful interpretation is even possible).

In short, it all depends on the particular problem we are actually solving.  The categories we mentioned above all play a significant role in determining that problem.  Once the problem is completely determined, then so is the possible space of solutions. Since there is an infinity of possible variations in the categories we named, even if we fully understand the problem we need to solve, we must still find the correct combination across all of these categories to reach the desired state, solution, or system output, and we must be able to correctly interpret what we have found.  Most of these dependencies are often either ignored or simply not understood in practice, leaving the correct interpretation or meaning of a solution unresolved.  Make what you will of any solution found by a standard minimization process when the whole process is not properly understood and implemented.

The simple contour map provides a very convenient way to explain what has just been said, so that it can hopefully be understood and the solution(s) we might obtain can be viewed in the proper context.  This is in no way a matter of the computer being “creative” or developing “intuition”, or any of the other popular, meaningless, and misleading explanations; it is simply mathematics at work and nothing more. It will never do anything different if you repeat the same process using the exact same choices in all of the categories.  In other words, if absolutely nothing is changed, the result (solution) should be reproducible, whether or not we can figure out a correct interpretation of it. However, if anything is changed, as will be explained, then the previously obtained solution may very well not be reproducible. Hopefully our analogy will make this clear.

The Basis for Machine Learning by Analogy, Using a Contour Map

In this post, we will take a closer look at Machine Learning and its nephew, Deep Learning. There is no “learning” (in the human sense) in either Machine Learning or Deep Learning; there are only quite simple and readily available mathematical procedures which allow us to adjust the parameters of many kinds of parameterized systems (or networks), such as a neural network, so that the system, together with the properly adjusted values of its parameters, will satisfy certain goals or objectives, to the extent that this is mathematically possible.  The process of adjusting the system parameters mathematically is what is being called Machine Learning in the AI community.

We alone, as intelligent human beings with a brain and a mind, are free to specify and control what these goals or objectives might be. In many cases, they involve minimizing some measure of error resulting from our initial choice of system parameters, and the choice of an algorithm and data to be used to solve the problem. To the extent that we don’t understand what we, as human beings, are doing, we often fail to achieve our goal, or we do not understand and cannot properly interpret what we have found. Part of understanding what we are doing is understanding how the critical ingredient in what is being called “Machine Learning” works.  Deep Learning is simply Machine Learning using a neural network with more than a few layers (still a neural network), so there is no significant difference in how we approach the problem of determining the desired system parameters in either case.  Our comments apply to any system that fits our needs; it need not be a neural network, but that is what is commonly used in practice today (we will say much more about this in future posts).

Now, it is unfortunately the case that many students of what is being called “AI” are told that they really don’t need to know much, if any, mathematics to be able to do AI using neural networks. They will be shown all they need to know; it’s just a matter of learning some simple steps to follow, and everything will work just fine. I know of few, if any, real cases where such instruction has any validity in the real world, and I’m pretty sure it has no validity in the fantasy world of AI, either. Of the hundreds of thousands of papers that have been written on this subject, it is probably true that the vast majority are simply suggesting variations (useful tricks) that might lead to a better result in one or more of the infinity of possible cases that might be considered.  So much for learning a few simple instructions and all will work out just fine!

Our purpose today is simply to elucidate how the major process involved in “training” neural networks actually works, without teaching a lot of mathematics.  We will do this using a very simple analogy, which hopefully anyone can understand: the contour map, with which most of us are quite familiar. There really isn’t much more that you need to know in order to begin to understand how Machine Learning works. Actually using the mathematics correctly to achieve our goals in practice is another matter, which we will only make reference to in this post.  With that in mind, we hope to answer many often unanswered questions about “Machine Learning”: what is actually involved, how to begin to understand what is really happening, and why it so often fails to meet our (often misunderstood) expectations.

What is a Contour Map?

First, in case we need to explain what a contour map is, let us define what we are talking about. Imagine that we have a map of a small section of the Rocky Mountains. One of the most useful pieces of information we might like to have when using such a map is the elevation above sea level of various locations on the map. We use such maps all the time if we do much hiking or camping in any terrain. Now, it is certainly true that every point represented on a map has a set of corresponding numerical values associated with it. The usual map coordinates tell us where on the earth’s surface we are, rendered as a two-dimensional map printed on a flat sheet of paper. The other useful piece of numerical information, which takes us into the (vertical) third dimension, is the elevation above sea level at each point on the map, expressed in some system of units such as feet or meters.  Here is an example from the Rocky Mountains in the United States:

Example of a contour map from an area in the Rocky Mountains near Bierstadt Lake (elevation 9,416 ft).  The curving brown lines on the map are contours, and every point on a single contour is at the same elevation above sea level.  The contour interval is 40 ft, and the contour elevations are labeled at 200 ft intervals.  Several creeks or rivers are shown flowing on the map (in blue), and you can see how they move from contour to contour as they descend from the elevation of the lake or other water source to increasingly lower elevations, as shown by the contour lines.  Due to size limitations, details may be hard to read.  Where the contours are very close together, they represent a steep incline (as seen along the Bierstadt glacial moraine).

Of course, if we had to print this information as numbers all over the map, it would be possible, but it would clutter the map too much to be practical. Instead, we have the common contour map. On this map, curved lines (called contours) are drawn which connect points at the same elevation above sea level. If we walked along the paths indicated by these lines, we would be moving along the ground while remaining at exactly the same altitude above sea level, neither gaining nor losing elevation.

In order to avoid a profusion of contour lines on the map, they are normally drawn only at fixed intervals of elevation. For example, they may be drawn in steps of 100 feet of elevation above sea level, in which case we will see some of the lines labeled with benchmarks showing their elevations, such as 2,000 feet, 2,100 feet, etc. As long as we know the elevation spacing between the contour lines, we can count how many contour lines lie between two given contour lines and easily determine the difference in elevation between them using the known contour interval. If we are located at a particular point on the map and wish to ascend to a higher elevation, the map shows us which way to go (toward the next contour line at a higher elevation); similarly, if we wish to descend to a lower elevation, we move toward the next contour line at a lower elevation. If the lines are far apart, we are on relatively flat ground and the ascent or descent will be easy. If the lines are very close together, we are facing a steep ascent or descent from where we stand.

If you can understand this basic overview of a contour map, then you can understand a great deal about Machine Learning which you might be wondering about. In particular, it can help you to understand various pitfalls and perhaps unexpected behaviors, and many other things that you may not know about, but which will be encountered when working with the usual approach to Machine Learning, which you are likely to be taught if you are a student.

Some Simple Mathematical Comments

And now for a few words about mathematics, for your general information. We have pointed out that a contour map carries information in three dimensions: the location on the surface of the earth where you are standing (two dimensions on the flat sheet of paper which is the map), and the vertical elevation of points on the map, shown as contour lines. The contour lines are the last piece of information and represent the third dimension, the elevation of each point on the map above sea level (actually, Mean Sea Level, an average value, since we don’t care about the height of waves in the ocean, or tides, etc.). Since the coordinates of any point on the earth’s surface are simply a pair of numbers, and since the elevation of each point on the earth’s surface is also a number, this numerical information which the map gives us can be stored in a digital computer and operated on using mathematical processes.
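
To make this concrete, here is a minimal sketch in Python of how the numerical content of a map might be stored and queried in a computer; the elevation values are invented for illustration only:

```python
import numpy as np

# A hypothetical 4 x 4 patch of terrain, stored as elevations (in feet
# above sea level), one number per grid point: exactly the information
# a contour map encodes with its contour lines.
elevation = np.array([
    [9400, 9380, 9350, 9300],
    [9420, 9390, 9340, 9280],
    [9450, 9410, 9330, 9250],
    [9480, 9430, 9320, 9200],
])

# "Where are we?" is a pair of coordinates; "how high are we?" is a lookup.
row, col = 1, 2
print(f"Elevation at ({row}, {col}): {elevation[row, col]} ft")
```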

Mathematicians have long worked on specific types of problems, often referred to as minimization or maximization problems. These include examples like finding the shortest-time path between two points in three-dimensional space for a particle moving under the influence of a potential field, such as a gravitational or electromagnetic field. There are powerful applications for answers derived from studies of this kind. In fact, many other common problems also involve minimizing or maximizing some quantity controlled by some external influence, including those listed above.

Minimization and maximization problems occur in essentially every area of inquiry, including physics, chemistry, biology, economics, and, in fact, almost anything you can think of. As an example, minimizing the cost or number of electrical components in a radio or TV can have a great impact on the profitability and even the sale price of the finished product. There are well-known procedures for finding equivalent electrical circuits, for example, which can reduce or even minimize the number of components required to achieve the performance level we seek in a commercial product of this kind.

There is no end of examples of methods to reach an optimum point in manufacturing and production which produce maximum profit at minimum cost. These are just a few examples of what mathematics can do for us; it plays an enormous, but often unseen, role in essentially every product made, with significant impact on what you pay for the product and how profitable it is to produce.  AI practitioners have adopted some of these ideas for use in what they call “machine learning”.

Machine learning, as commonly practiced, is a relatively simple application of a minimization problem, as we will discuss here. In common practice, it does not attempt to use higher-level methods; instead, the object has been to use simple, readily available procedures involving simple concepts which may or may not be appropriate for the purpose. In some sense, the idea, as far as AI is concerned, is that the magical neural network will figure it all out for us so that we don’t have to actually understand what we are doing. If this sounds harsh, it is, but someone needs to be honest about what is going on amid all the hype we hear on a daily basis.

We are now going to use the simple contour map as a way of understanding what is actually being done in what is called “Machine Learning”. We refer the reader to our First Dialogue if the concept of a neural network and the idea of minimizing a measure of error are new to you. This is really just about all that is involved, and the common contour map is a convenient vehicle for understanding what it is all about. We are going to make an analogy to the kind of computer algorithm which is commonly used, as we see how we might go about solving a simple minimization problem using a contour map. We will then discuss how the same procedure can be realized on a computer, as is done in most commercial applications.

Posing a Simple Minimization Problem and Comments on the Nature of Solutions

Using our simple contour map, we are now going to pose a very simple minimization problem which we would like to solve, beginning at our current position in the Rocky Mountains in North America. We shall describe how a solution can be obtained, what pitfalls may be encountered, and the nature of the solutions, including how many solutions there are, in a context which we can easily understand.

In discussions of Machine Learning, you will often see references to “Gradient” methods used to solve the “Learning” problem. We are going to consider a simple gradient method that we ourselves might use for solving our minimization problem using our contour map.

The Minimization Problem We Pose for Our Contour Map

The problem we will pose is a simple one. Given that we are at some location on the map, and that this location has a well-defined elevation above sea level, we wish to find a path which will lead, in a very specific way, from our present location and elevation to a new location which is at sea level (elevation zero). Think of this as a minimization problem in which we seek to minimize the error (difference) between our current elevation and our desired elevation, which is sea level (zero elevation).  Our error is therefore given simply by our current elevation above (or below) sea level.  To make the problem more precise, we will use the absolute value of the error, or the square of the error, etc., to ensure that the error measure is always a non-negative number.

For the sake of discussion, let us assume that we have arranged for our error to always be a non-negative number.  By doing this, we ensure that if and when we reach the minimum possible error (which is now zero), we are also at sea level (and not possibly below sea level), as required by our problem.
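
As a small illustration of the choice just described, here is a sketch of two common non-negative error measures; the function names are ours, not part of any standard library:

```python
def abs_error(elevation_ft: float) -> float:
    # Absolute value: zero exactly at sea level, positive above or below it.
    return abs(elevation_ft)

def squared_error(elevation_ft: float) -> float:
    # Square of the error: also non-negative, and smooth, which suits
    # gradient-based methods.
    return elevation_ft ** 2

print(abs_error(-15.0), squared_error(-15.0))   # prints: 15.0 225.0
```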

Machine learning simply attempts to solve an analogous problem by reducing a non-negative error, between the current state of a system and a desired state of the system, to zero (as explained in Dialogue 1).  In our case, moving along the earth’s surface is our system, and our present state is the physical location where we initially find ourselves in the Rocky Mountains.  Our desired final state is to reduce this elevation to sea level by moving through our contour map in some well-defined way (an algorithm), determined by a process of minimizing the non-negative error to zero, if possible, so that we arrive at our desired state.

We are free to move only on the surface that defines the topography of the earth around us, as represented by the contour map. We shall seek an algorithm using the map alone to guide us as we attempt to reduce the (non-negative) error, between our current elevation and sea level, to zero.

The Nature of Solutions to our Problem

Before developing an algorithm to solve our problem using the contour map, it is worthwhile to pause for a moment and consider the possible outcomes of the minimization problem we have posed for ourselves. First, we might ask: what constitutes a solution to our problem? The answer is straightforward, and is exactly as we have described it. Any location that we reach that is at elevation zero, or sea level, is a solution to our problem. We did not pose the problem with any more constraints than that, even though we could have done so.

For the simple problem we have posed, we can immediately agree that there exists an infinity of possible solutions. Any position on the surface of the earth that we might be able to reach from our present position in the Rocky Mountains, and which is at sea level, is a valid solution to our problem. This includes the entire coastline of North America bordering (touching) the Pacific, Atlantic, and Arctic Oceans, for example, as well as the Gulf of Mexico and even the coastlines of Mexico and Central America down to the Panama Canal. However, we did not pose a requirement that we wind up at a point on any of these large bodies of sea water. There are many places not on continental coasts that are at sea level, and all of these locations are also valid solutions to our problem. They include locations around places like Death Valley, for example, and there are many such places on our continent which we could realistically reach by walking from the Rocky Mountains, following our contour map. If we realize that we didn’t actually require that we be able to physically get there, and are only looking for an algorithm to get us there, we cannot rule out any point on our planet that is at sea level, as long as we can find an algorithm that could get us there.

Okay, not to belabor the point: there are infinitely many valid solutions to our current problem, all varying in location, but all at zero elevation (sea level).  We only need to find one of them in order to be successful in solving the problem. If we would not be happy with just any of the possible solutions, then we would have to consider how to change our algorithm to meet the requirement of finding a particular location at sea level.  We can discuss these issues later, but we need to understand that it is our algorithm, our initial location in the Rocky Mountains, and how we measure the error to be minimized that will ultimately determine which solution we might or might not find first.

Notice that, in our contour map problem, we are moving through a pre-determined system of existing topography, and we will not be adjusting any neural network parameters.  The neural network problem is solved using an algorithm like the one we are going to develop, but implemented entirely on a computer, using similar mathematics, with reference to a mathematical world defined by the current state of the parameters of the system or network we have chosen.  That is to say, the “topography” that we traverse in the computer implementation is determined entirely by the architecture (the parameters and how they are related) of the system (a neural network, for example) that we have chosen to use.  This topography is analogous to the earth’s topography which we are following in our contour map example.  If we change our starting location, then everything else (including the solution we might find first) may change accordingly.  The same comment applies to any system using our algorithm on a computer, when we change the initial parameters of the system (which define our initial location in the system topography).

Now, we should point out that if you wanted to reach something more specific, such as a particular location on the earth at elevation zero, then at this point you have not built that information into the problem we are going to solve. Hence, if you wanted more, it is up to you to reformulate the problem until it can be expected to provide the kind of result you really wanted. That, of course, requires understanding what you are actually doing. In any event, we are going to solve our problem exactly as we have posed it (that is exactly what the mathematics operating in the computer will do).  We will return to the more specific problem later.

Developing a Gradient Algorithm to Solve Our Problem, Using a Contour Map

Okay, so we might as well imagine ourselves as a body of water free to find a path from our present elevation all the way down to zero elevation, if we can. At this point, we should all have the idea of a river in mind, flowing from our present location in the Rocky Mountains downward along a very specific path to a point of zero elevation, if it can get there. Have you ever wondered what defines that specific path? Well, the answer is not as simple as we’d like.  Water responds to all sorts of influences, including natural barriers, its own momentum, and other factors not strictly related to the contours on a map.  Nonetheless, water will find some path from contour to contour so that it is always descending, if possible.  Some call it the “path of least resistance”, but there is no good all-encompassing term that simply describes how water actually flows to lower elevations.  Still, it provides a useful analogy, and if you look at how the creek exits Bierstadt Lake (on the contour map), which sits on a relatively flat shelf, you can see how many different routes water might follow, and, in very short order, they could lead to very different final locations.  For example, if the water were diverted toward the moraine, a very rapid descent to the valley below would be possible, but the actual creek, as shown on the map, goes elsewhere.

In order for us to make an algorithm for descent, we are forced to choose some criterion or rule for how we move from a higher elevation to a lower elevation using our contour map.  We shall do this with the concept of a “gradient” in mind.  There will be many possible paths that might serve this purpose, but we must choose only one and do so consistently, so that it becomes a rule, or algorithm, which we can apply at each new location as we descend.

The word “gradient” refers to a specific mathematical entity which can be computed at just about every point on the topographical (or mathematical) surface which our path is constrained to follow. The gradient is simply a vector (think of an arrow) which has a magnitude (a number) and a direction associated with it. For the sake of discussion, we will take it to be pointing in the direction of “steepest descent” from our current location (strictly speaking, the mathematical gradient points in the direction of steepest ascent, and we follow its negative).  We won’t worry about falling off a cliff, because we are only making an analogy.  When the problem is solved in a computer, these dangers are not a concern.
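
For readers who would like to see the idea in code, here is a sketch: a made-up elevation function standing in for real topography, with a numerical gradient computed by central differences (elementary calculus would give the same answer exactly):

```python
import numpy as np

def elevation(x, y):
    # A made-up smooth "terrain" (in feet) standing in for real topography.
    return 9000.0 + 400.0 * np.sin(x / 3.0) * np.cos(y / 5.0)

def gradient(f, x, y, h=1e-5):
    # Numerical gradient by central differences. Strictly, this vector
    # points in the direction of steepest ASCENT; its negative points
    # in the direction of steepest descent, which is the way we walk.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

g = gradient(elevation, 1.0, 2.0)
print("gradient:", g, " steepest descent direction:", -g)
```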

The greatest possible speed of descent is found at places like vertical cliffs, where we would actually fall (or rappel) vertically from one elevation to another. On the other hand, the gradient can also go to zero (in magnitude), meaning that there is no direction currently available in which we can immediately continue our descent. This can happen on a plateau, a horizontal shelf, or even at the bottom of a basin on the earth’s surface with no clear way to proceed downward from that point.  Thinking about how water continues to flow downward from such a flat place can suggest how we might alter our own algorithm to deal with this problem.  In general, there will be many alternatives, and we must create an algorithm that deals with these choices in some particular way.

Some Issues and Possible Pitfalls

If the gradient has gone to zero and we wish to continue to descend, we may have to search for a location from which further descent is possible, or we may even have to ascend to a higher elevation (climb out of the bowl, for example) before finding a new location from which further descent is again possible.  A computer following our suggested algorithm has to deal with the same problem.  A location from which no further immediate descent is possible is called either a local or a global minimum in mathematics (a local minimum if lower points exist elsewhere but cannot be reached by immediate descent, a global minimum if no lower point exists at all).  If a local minimum can be overcome, then descent toward a global minimum can proceed.  A very simple computer algorithm guided by a gradient will normally stop at any local or global minimum, and further programming is required to determine what it is to do next, if anything.

We begin to see that there can be pitfalls in using our simple gradient algorithm, as we have described it so far, and it will take special instructions for a computer version of our algorithm to overcome some of these pitfalls which we might encounter along our way.  We will save further comments for later, and will now describe how to arrive at our simple algorithm for descent, using a contour map.

How to Easily Create an Algorithm for Our Problem, Using the Contour Map

Taking the idea of steepest descent as our example, let us begin to create our simple algorithm that we will use with our contour map to discover how we might determine a steepest descent path to lower elevations, and eventually perhaps even arrive at sea level, if all goes well.

Looking at our contour map, it would appear that the immediate shortest path from our current location and elevation to a point on the contour line at the next lower elevation (depending on the contour interval) can be approximated by simply finding the point closest to our current location on the contour line at the next lower elevation. If we are at a location 9,000 feet above sea level and our contour map shows contour lines at intervals of 100 feet in elevation, then we want to find the closest point to our current location which lies on the 8,900-foot (elevation) contour line. We can, of course, just roughly mark our next destination point by eye, approximating the closest point to us on the next lower contour line. We could make a much more accurate descent plan if the contour lines were spaced more closely, but this approximation will work fine for our purposes: it will get us to the next lower contour line in something approximating the shortest possible distance (steepest descent).  When carried out mathematically in the computer, elementary calculus will accurately find the required gradient for us, without the use of contour lines.
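
Putting the pieces together, a minimal computer version of this descent algorithm might look like the following sketch, using an invented bowl-shaped terrain whose lowest point is at sea level; the step size and stopping threshold are arbitrary choices of ours:

```python
import numpy as np

def elevation(x, y):
    # A hypothetical bowl-shaped terrain whose lowest point sits exactly
    # at sea level (elevation 0), at the location (10, -5).
    return (x - 10.0) ** 2 + (y + 5.0) ** 2

def gradient(x, y, h=1e-5):
    # Central-difference approximation to the gradient at (x, y).
    dfdx = (elevation(x + h, y) - elevation(x - h, y)) / (2 * h)
    dfdy = (elevation(x, y + h) - elevation(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

pos = np.array([0.0, 0.0])   # our starting location on the "map"
step = 0.1                   # how far we move per iteration (a choice!)

for i in range(1000):
    g = gradient(*pos)
    if np.linalg.norm(g) < 1e-8:   # gradient ~ 0: a local or global minimum
        break
    pos = pos - step * g           # move in the steepest-descent direction

print(f"stopped at {pos} after {i} iterations, "
      f"elevation {elevation(*pos):.8f}")
```

Note that the loop stops whenever the gradient vanishes, whether that point is the global minimum we want or merely a local one; extra programming would be needed to tell the difference, just as discussed above.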

We wish to point out that, in our simple problem, since contour lines are curved, there may be multiple points on the next lower contour line which are equally close to our current position.  In that case, we must choose one of them to determine the direction of our next descent step.  How we make that choice has an impact on which solution we finally reach first, out of the infinity of possible solutions.

More Related Thoughts to Consider

We might pause here to think about an issue which now presents itself. If we make a consistently small error in the (steepest descent) direction we have chosen to travel, think about the possible consequences of that error. We are going to move some small distance down to the 8,900-foot level to a new location. When we arrive there, we simply perform the same operation again, using our contour map and our new location at the 8,900-foot elevation to chart our path to the next contour line at 8,800 feet, and so on, until we reach sea level. At each new location on a contour line, we determine our next descent direction, as we did above. This is now our complete algorithm for solving our simple problem.

Suppose that the direction of travel we determine at each level of the algorithm is in error by a degree or so. Can you imagine how great a change in our final destination this error might accumulate to, given that we might have to travel hundreds or perhaps even thousands of miles before reaching an elevation at sea level? This helps us see how very small changes in our initial conditions can significantly alter the location we finally reach at sea level (our solution). Recall that each time we apply the algorithm to descend to the next contour line, we have new initial conditions determined by our new starting location.

To see another example of what we are talking about, think of starting on the Continental Divide, as an extreme sort of case. In that case, we have at least two available descent directions, one pointing to the west and the other pointing to the east. We can choose either of these from our current location on the divide. However, once we choose, there is a pretty good chance that west will lead toward the Pacific Ocean and east will lead toward the Atlantic Ocean. A single small initial choice can have a significant effect on where we wind up! This is called sensitivity to initial conditions, and it can plague many algorithms of the type we are considering.  On a smaller scale, we make many similar decisions with our algorithm (as we move from each current location to the next).  Even if we start heading eastward, for example, there will still be an infinity of possible locations where we might actually wind up, depending on how accurately, or by which choice, we make our next move at each step (called an iteration) of our descent algorithm.
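
A tiny one-dimensional sketch can make this sensitivity vivid: an invented “divide” with two valleys, where two starting points a hair apart descend to entirely different solutions:

```python
def elevation(x):
    # A hypothetical "divide": two valleys at x = -1 and x = +1,
    # separated by a ridge at x = 0.
    return (x ** 2 - 1.0) ** 2

def descend(x, step=0.01, iters=5000, h=1e-6):
    # Simple gradient descent in one dimension.
    for _ in range(iters):
        g = (elevation(x + h) - elevation(x - h)) / (2 * h)
        x -= step * g
    return x

# Two starting points a hair's breadth apart, on either side of the divide:
print(descend(+0.001))   # ends near +1 (the "Atlantic" side, say)
print(descend(-0.001))   # ends near -1 (the "Pacific" side)
```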

Okay, so much for the algorithm.  We are simply going to do our best to follow the shortest, steepest path downward in elevation (as determined by our contour map) until we eventually reach sea level, applying our new algorithm iteratively at each new contour location.  We have described how to use a contour map to help us find such a path from our current location (no matter where it is) to the next lower elevation, and we have explained the role of a gradient and how we might approximate it without mathematics, using a contour map as an example.

We, as humans, actually have other senses to help us solve this problem, so the contour map would technically not be necessary; in any event, we ourselves could not always follow the steepest path from every intermediate location, since we would face dangers such as falling off a cliff.  A mathematical version of our simple algorithm, however, has no such worries, and can always follow the path the algorithm prescribes.  We are simply using the contour map as an understandable way to explain what the computer can do mathematically, and it amounts to the same kind of idea that we have been following in our discussion.

The Computer Version of Our Algorithm and Related Comments

We have hinted that the computer, when given enough numerical information to work with, can do mathematically what we have described doing with the aid of a contour map (which is a convenient and useful source of numerical information for us).  The computer will face the same kinds of obstacles that we have described, including encountering possible local minima, zero gradients, sensitivity to initial conditions, etc.  Many common mathematical issues are easily revealed by our simple example of trying to move to lower elevations using only the numerical information in a contour map as our guide.

We have pointed out that we have considered a minimization problem in which the error to be minimized is determined by elevation alone.  In many cases, this is not going to satisfy us.  For example, let’s suppose that our real objective is to reach sea level at a specific location like Galveston, Texas, on the Gulf of Mexico.  In that case, the sea shore in Galveston is only one of an infinity of possible solutions to our elevation problem, and the chance that our current algorithm will lead us to Galveston is very small, indeed.

However, the larger problem can still be solved using the same kind of algorithm described here.  The only difference is that we must add information relevant to arriving at Galveston, Texas.  That can be done by using an error measure which involves not only elevation, but also the geographical coordinates of Galveston.  In that case, our error measure will be a combination of three errors: one in elevation and two in geographical coordinates (latitude and longitude, for example) on the surface of the earth.  The mathematical approach remains essentially the same, with the appropriate modification for the new error measure.  In our discussion, we have ignored many related issues, such as the differentiability of the functions involved, the exact form and properties of the error measures, step size and learning rate, etc.; the informed reader may look further if these technical answers are needed and not already known.
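
A sketch of such a combined error measure might look like the following; the coordinates and weights are illustrative assumptions, not real survey data:

```python
# Hypothetical target: the approximate coordinates and elevation of the
# shoreline at Galveston, Texas (illustrative numbers only).
TARGET_LAT, TARGET_LON, TARGET_ELEV = 29.3, -94.8, 0.0

def combined_error(lat, lon, elev_ft, w_pos=1.0, w_elev=1.0):
    # Three non-negative error terms added together: the total is zero
    # only when we are at sea level AND at the desired coordinates.
    # The weights w_pos and w_elev are further choices that help define
    # the problem actually being solved.
    return (w_pos * ((lat - TARGET_LAT) ** 2 + (lon - TARGET_LON) ** 2)
            + w_elev * (elev_ft - TARGET_ELEV) ** 2)

print(combined_error(40.3, -105.6, 9416.0))  # large: our mountain start
print(combined_error(29.3, -94.8, 0.0))      # zero: the desired solution
```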

We might also notice that when we posed our original problem, it was simply a problem of reaching a location at sea level, beginning at our current position in the Rocky Mountains.  However, we chose an algorithm that determines a very specific path (one of steepest descent) that we will actually attempt to follow.  In that case, the first location at sea level that we reach following this path will be our (first) solution to the problem, wherever it may be. Using a steepest-descent criterion alone rules out easily reaching most of the possible solutions that we enumerated earlier.  This further clarifies the important role played by the algorithm itself in determining the solution that we actually encounter first. To find additional solutions, we can continue to use the algorithm (with whatever modifications are necessary to restart it) to go further, which can lead to a second, third, etc. solution, always at sea level.

Using a Computer Implementation of this kind of Algorithm to Solve Problems Involving Mathematically Defined Systems or Networks

In the corresponding computer implementation of this kind of algorithm, with the topography of the earth replaced by a similar problem posed for a parameterized system such as a neural network, the parameters and actual architecture of the neural network (or other system we may be using) define the (mathematical) topography we are descending through. Our starting location is determined by the initial values we give to the parameters of the system. In this case, descent simply means that the error is reduced at each iteration; similarly, ascent refers to a direction in which the error increases.  This means that the actual architecture of the system being used plays a very significant role in determining where (in our “topography”) we will first encounter a solution to our problem using the computer, if indeed we find a first solution at all.
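
Here is a minimal sketch of the same descent idea applied to a parameterized system rather than terrain; a toy two-parameter linear model stands in for a neural network, so the “topography” is the error surface over its two parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data the model should reproduce: y = 2x + 1, our "desired outputs".
x = rng.uniform(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

# The "terrain" is now the error surface over the two parameters of a
# toy model y = w0 + w1 * x; the initial parameter values below are our
# starting location in that (two-dimensional) parameter topography.
w = np.array([0.0, 0.0])

def error(w):
    pred = w[0] + w[1] * x
    return np.mean((pred - y) ** 2)

step = 0.1
for _ in range(500):
    pred = w[0] + w[1] * x
    # Gradient of the mean squared error with respect to w0 and w1.
    grad = np.array([2 * np.mean(pred - y), 2 * np.mean((pred - y) * x)])
    w -= step * grad   # "descent": the error is reduced at each iteration

print("final parameters:", w, " final error:", error(w))
```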

It is somewhat analogous to realizing that if we had started from a different location on the earth, using the same algorithm and contour map, that starting location would play a very significant role in where we actually find our first solution.  Generally, we would not go looking for more solutions once we have found one, since that involves many more issues, all of which would be important in determining where we would find the next solution, and so on.  We must be sure that we have formulated our problem with enough information to ensure that the first solution we find is the one we wanted or expected, if that is possible.  We must never assume that it is the one we wanted or expected just because it appears first.  The first solution we find is entirely determined by how we pose, start, and carry out the process of finding a solution, as explained earlier.  This relates in a very significant way to the architecture of the system we are moving through (our topography) and to the error measure and algorithm we are using for minimization.

We also note that, in this case, the “topography” through which we are moving is not likely to be three-dimensional, as it is in our contour map example.  The mathematical dimensionality of the problem is determined by the architecture and the number of independent parameters of the particular system or network under consideration, and it may be quite large in practice. We will look into this in future posts.  Suffice it to say, however, that the mathematical procedures involved in solving the problem can be valid in any number of dimensions. Depending on the actual architecture chosen, there may be no solutions, a unique solution (only one), or, as we saw in our contour map example, a large number of solutions. If there is no solution, then the mathematics will try to come as close as possible to one by minimizing the error measure to the extent that this is possible; in that case, zero error cannot be achieved.

Many practitioners of machine learning have commented on their surprise when the machine does something unexpected, often describing it with terminology such as “the machine has developed creativity”, or “the machine has developed intuition”, or similar phraseology, to justify the fact that the answer they thought they were going to get was not the one they actually got (surprise!).  We discuss this issue a little further with a related example from AI in the next few paragraphs.  However, the issues that we raise here relate in a very real way to why so many results from machine learning defy analysis or explanation (this is currently a key issue in machine learning, as we have mentioned in other posts, referencing XAI, or explainable AI).  We will address this specific problem further in later posts.

What Can Go Wrong?  An Example from AI

Please note that in the following we are simply citing an extreme simplification of a facial recognition problem for the purposes of discussion.  There are many variations on how the problem might be posed, but the basic approach will generally be based on minimization of an error measure, and a possible reduction of the output to a number between 0 and 1, analogous to the discussion below.

Let us now consider a simple example from AI.  We would like to train a neural network to recognize us in a photograph.  Think of a photograph as a two-dimensional set of pixels, with each pixel carrying information about the picture it represents.  For convenience, we’ll consider black-and-white (BW) photographs.  In that case, each pixel is associated with a number on what is called a gray scale, an intensity of light from darkest dark (black) to brightest bright (white).  If we use 256 shades of gray between black and white, then each shade of gray can be assigned a number (0 = black, 255 = white, etc.).  Such a photograph can be represented in the computer by a rectangular array of numbers (one for each pixel) giving the shade of gray for that pixel.  If you send the numbers to a suitably programmed printer, it will print the corresponding shade of gray at each pixel location, and the result will look like the original photograph.
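
For concreteness, here is a minimal sketch of this representation (the pixel values are invented):

```python
import numpy as np

# A hypothetical 3 x 4 black-and-white "photograph": one gray-scale value
# per pixel, where 0 = black and 255 = white.
photo = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [ 16,  80, 144, 208],
], dtype=np.uint8)

print(photo.shape)    # (3, 4): rows and columns of pixels
print(photo[1, 2])    # the shade of gray at row 1, column 2: prints 160
```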

In the case of our example, we are going to go for extreme simplicity. Recall that the idea behind “learning” by “training” a neural network is to determine the parameters of the network in such a way that a given type of input produces a desired type of output. We will assume that we are going to use a gradient-type algorithm to automatically adjust the neural network parameters to accomplish this. In our simple example, we are going to train a chosen neural network to produce a desired (single) output of 1 when the input is a photograph of you, and a desired output of 0 when the input is a photograph which is not of you. For training purposes, we will use what is called “supervised learning”, which means that for each training photograph we input to the neural network, we also specify the correct desired output (0 or 1). The gradient algorithm will try to adjust the parameters so as to minimize the error between the actual output from the network and the desired output in each case.

Now, in order to train our network, as described, we are going to assemble a large number of photographs of you, and we will also incorporate into this training set a large number of photographs which are not of you. For each photograph in the training set, we specify the desired output. The gradient algorithm is expected to adjust the neural network parameters in such a way that each input photograph in the training set produces the desired output (1 or 0), as appropriate, after the training is complete. After training, our hope is that the neural network has “learned” to distinguish between an input photograph which is of you and one which is not, by outputting the appropriate value in each case.
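
The following sketch shows the mechanics of such a supervised training loop in the simplest possible setting; a single layer of weights with a sigmoid output stands in for a real neural network, and random arrays stand in for actual photographs, so nothing meaningful is actually “learned” here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 100 tiny 8x8 "photographs" flattened into
# 64-pixel vectors, scaled to [0, 1], with desired output 1 ("you") or
# 0 ("not you"). Random numbers stand in for real photographs here.
X = rng.uniform(0.0, 1.0, size=(100, 64))
desired = rng.integers(0, 2, size=100).astype(float)

w = np.zeros(64)   # the network "parameters": our starting location
b = 0.0

def sigmoid(z):
    # The nonlinear s-shaped function that bounds the output between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

step = 0.5
for _ in range(1000):
    out = sigmoid(X @ w + b)                 # actual output for every photo
    err = out - desired                      # difference from desired 0/1
    w -= step * (X.T @ err) / len(desired)   # gradient step on parameters
    b -= step * err.mean()

out = sigmoid(X @ w + b)
print("mean output on desired-1 photos:", out[desired == 1].mean())
print("mean output on desired-0 photos:", out[desired == 0].mean())
```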

Using Our Neural Network Output to Make a Decision

Now suppose that we have adjusted the parameters in our neural network so that it operates, as described above, for photographs in the training set. In that case, the neural network should output a value near 1, for each photograph of you in the training set, and a value near 0 otherwise.  Note that there may not be a set of parameter values that will give a perfect output in every case.

Next, we might test our neural network by inputting a photograph of you which was not in the training set, and we’ll look at the corresponding output value. If we are using a nonlinear s-shaped function to produce the final output, as is usually done, then we’ll assume that we have chosen that function so that the output is always bounded between 0 and 1. Let’s suppose that the network is working pretty well in this case, and produces an output value near 1, say 0.76, for example.  In order to make a decision (whether or not the picture is of you), we could just make a simple rule that says: if the output number is greater than 0.5, the photograph is of you, and if less than 0.5, it is not (does this sound sufficiently accurate to you?).  This latter rule is a super simple analogy to what is actually being done in some cases.  Please keep in mind that this is an oversimplification for the sake of discussion only!
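
The decision rule just described amounts to nothing more than this:

```python
def decide(output: float, threshold: float = 0.5) -> str:
    # The arbitrary interpretation rule discussed above: an output above
    # the threshold is declared to be "you".
    return "you" if output > threshold else "not you"

print(decide(0.76))   # prints: you
print(decide(0.49))   # prints: not you
```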

Now, in practice, we should be aware that the single trained neural network in our example will not always give a high score (near 1) to every possible photograph of you, so even for the training set, many, if not all, of the output values will not be exactly 1.  Other techniques can be used to bias the results so that the average value over most of the training set will hopefully turn out to be greater than 0.5, for example, when the network is presented with a photograph of you.  This provides some kind of validation of our simple rule, in some (questionable) sense.  As pointed out earlier, the actual results you get depend very heavily on the problem you are actually solving.  This includes the error measure and algorithm being used, and the architecture and starting values of the system (neural network) parameters, for example.

As an additional comment: if we had computed the actual best parameter values to minimize a truly meaningful error for each photograph in the training set separately, we would generally expect somewhat different parameter values for each photograph. Realizing that we must produce one neural network to identify you, it is not unheard of to simply average all the corresponding best parameter values obtained for each photograph in the training set, and use those averaged values for the final version of the required neural network.  This might all seem like mathematical nonsense, but, lacking a scientific foundation, methods like this have sometimes actually been used.  Who can say what the averaged-parameter final neural network actually does, but this is what may be done in the simplest cases. There can be many variations on the basic idea of how to train a neural network using a large training data set together with the appropriate desired outputs, and we will not delve into those issues further in this post. There are many possibilities and very little way to know how they might or might not work without trying them (as pointed out in other posts, this is an entirely empirical process, devoid of a proper scientific foundation).

In any event, if you believe that such procedures are actually sufficiently accurate for identifying you, then good luck with that.  Even most practitioners would agree that it really isn’t as simple as this; fortunately, they realize it and try to do more to correct the problem, but it is all done in the absence of a proper scientific foundation, or even a good understanding of the effects of the mathematics involved.  Very simply put, in this simple example, an output number less than 0.5 for a photograph may serve to say you are not the person in the photograph, and a number greater than 0.5 may implicate you in a crime.  Beware of strangers bearing gifts!

A Comparison with a Recent AI Classification Experiment

How do we measure the difference between two BW photographs?  There are many ways to do this, but perhaps the simplest is to compare each pixel in the new photograph to its corresponding pixel in the reference photograph and form a numerical measure based on the gray scale explained earlier.  For the greatest simplicity, however, we could just average the brightness (determined by our gray scale, as mentioned above) over the pixels in each photograph.  In that case, our simple measure will just be the difference in average brightness (a single number for each photograph).  We wouldn’t actually want to do this, because it wouldn’t work well in most cases, as we shall see, but for the sake of discussion, it gives the idea.
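
A sketch of this deliberately crude measure (with invented pixel values) makes its weakness obvious: it responds to lighting, not content:

```python
import numpy as np

def average_brightness(photo: np.ndarray) -> float:
    # One number per photograph: the mean gray-scale value of its pixels.
    return float(photo.mean())

def brightness_difference(a: np.ndarray, b: np.ndarray) -> float:
    # The crude measure discussed above. Note that it compares overall
    # lighting, not the objects that appear in the photographs.
    return abs(average_brightness(a) - average_brightness(b))

sunny = np.full((8, 8), 200.0)   # hypothetical photo taken in bright light
shady = np.full((8, 8), 60.0)    # hypothetical photo taken in shade
print(brightness_difference(sunny, shady))   # 140.0: lighting dominates
```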

In a reported AI study, which we will only paraphrase, an attempt was made to classify (a task called “classification” or “clustering”) a set of photographs, each of which contained one of two different objects, say dogs and cats, for example. The objective was to separate the photos of dogs and cats into separate classifications or clusters. All of the photographs of the objects used for the classification had been taken either in shade or in bright sunlight.  Echoing our comments above on how we might compare BW photographs, the mathematical classification procedure determined that the most obvious division of the photographs related not to the presence of a cat or a dog, but to average brightness: the most compelling case for dividing the photographs into two categories was whether each photograph had been taken in bright sunlight or in shade.

Hence a proper classification was made between photographs taken in bright light and photographs taken in shade, but each category contained both cats and dogs, so there was no easy way to separate the cat and dog photographs in the result.  This was an unexpected outcome because not enough thought went into correctly posing the problem being solved, and into making sure that the information in the photographs and the error measure emphasized the differences between photographs of cats and dogs, and not the lighting conditions at the time each photograph was taken.

That is why we had better be sure that we understand what we are actually doing (what problem is actually being solved) when we pose and supposedly solve a problem.  Otherwise, we might announce, on reaching zero elevation in our original contour map example, that we had arrived in Galveston, Texas, if that was the expected answer!  We had not built any location-specific information into the problem we solved, so it was not likely to be present in any solution, unless a solution involving Galveston, Texas occurred by chance out of the infinity of possibilities.

We are simply issuing a warning that what we don’t understand well, or misuse because we didn’t understand what we were doing, can lead to unexpected consequences with severe implications.  Until something is done to correct the problems inherent in these approaches, it is necessary to conclude that there is no science in much of AI as practiced today (especially in “machine learning” using neural networks), and further, that there is often no intelligence behind much of what is being done; and, as we are sometimes told, it really isn’t necessary (the neural network will figure it all out for us!).

Recent Comments and Warnings from Members of the Medical Community

It is not inappropriate to mention various recent references in which medical AI practitioners have issued warnings to beware of decisions based on machine learning today.  One such warning relates to the fact that machine learning, as practiced today, must always give an answer (a recommendation or a prediction, for example).  Often the answer is wrong or unreliable, and possibly even life-threatening. One notable comment relates to the idea that we need a form of machine learning in which something like “I don’t know what to tell you” is a possible answer, thereby avoiding being forced to give an answer that can easily be wrong or misleading.  I am paraphrasing, of course, and this is a large and important topic. This and similar warnings have appeared in reports from recent conferences and symposia related to AI or medicine (or both).

It is probably also relevant to this discussion that one of our largest tech companies recently started re-assigning workers away from its self-driving car project, though no particular reason was cited for the action. We have mentioned other similar actions by large tech companies in other posts.

A Further Reference of Note Related to our Discussions

The interested reader who is following this commentary might like to read a recent posting entitled “AI’s Big Challenge” by Garrett Kenyon in Scientific American, February, 2019, for additional well-informed background and commentary relevant to our own prognostications, to come, concerning the current state of AI and its implications for the future.

What’s Next?

Though much more can be written, and will be, we will stop with this brief overview for now and hope that the contour map has helped to improve our understanding of at least the underlying basis for what is being called “Machine Learning”, today.  This discussion also points the way to further commentary on the often unexplainable nature of outputs from machine learning experiments.  We will continue this discussion in our next post.

Again, we thank you for visiting. We hope you have learned something of value that will help you better understand the poorly understood and often mysterious, hidden, and even misleading world of AI today. We hope to see you again soon as we begin to draw some of our initial questions related to artificial intelligence (AI) to a conclusion, and perhaps offer a prediction or two.