# Using Neural Networks with Talend DI and ESB

Data Integration projects often present situations where we have to analyze the data in order to come up with acceptance criteria for it. In many cases this is straightforward and can easily be written as simple rule-based logic. But in some situations it is not so cut and dried. There, a lot of people fall back on rule-of-thumb logic that isolates certain rows to be double-checked by a human. This works. It is time-consuming and requires human intervention, but it works. However, in many of those situations we can use Neural Networks to do that job for us.

In this tutorial, I will demonstrate how to use a Multilayer Perceptron Neural Network to learn Tic-Tac-Toe end game states. I have chosen this because it is an easy game to understand and the data set to learn is relatively small. I used a data set found here in this example. To implement the Neural Network, I am using a Java API from Neuroph. Neuroph is a lightweight Neural Network framework which allows you to make use of this powerful machine learning technique quickly and easily. I won't go into much detail about Multilayer Perceptrons, only explaining them where it is necessary to follow this tutorial. For information on Neural Networks in general, I recommend exploring the Neuroph site, where there are tutorials using their Neuroph Studio.

So, let's start. First up, I will talk about the training data.

## Training Data

For the training data in this tutorial, I have made use of data provided by the University of California, Irvine's Center for Machine Learning and Intelligent Systems. You can find it here. This data set holds all of the end game scenarios for the situation where X starts and X is the focus: if X wins the result is POSITIVE, and if X loses the result is NEGATIVE. Blank fields are represented by a "b". A few sample rows are shown below.

    b,b,b,o,o,b,x,x,x,positive
    b,b,b,o,b,o,x,x,x,positive
    b,b,b,b,o,o,x,x,x,positive
    x,x,o,x,x,o,o,b,o,negative
    x,x,o,x,x,o,b,o,o,negative
    x,x,o,x,x,b,o,o,o,negative
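To make the labels concrete, here is a minimal sketch that derives the label for a row by checking whether X owns a complete line. It is only an illustration of what "positive" means in this data; the class and method names are hypothetical and are not part of the Talend job, and it assumes that anything other than an X win is labeled negative.

```java
import java.util.Locale;

// Hypothetical helper, not part of the tutorial's routines.
final class TicTacToeLabelSketch {

    // The eight winning lines on a 3x3 board, as indexes into the nine cells
    private static final int[][] LINES = {
        {0, 1, 2}, {3, 4, 5}, {6, 7, 8},   // rows
        {0, 3, 6}, {1, 4, 7}, {2, 5, 8},   // columns
        {0, 4, 8}, {2, 4, 6}               // diagonals
    };

    // Returns true if the given player ("x" or "o") owns a complete line
    static boolean hasLine(String[] cells, String player) {
        for (int[] line : LINES) {
            if (cells[line[0]].equals(player)
                    && cells[line[1]].equals(player)
                    && cells[line[2]].equals(player)) {
                return true;
            }
        }
        return false;
    }

    // Derives the label from the first nine fields of a data row
    // (assumption: every row that is not an X win counts as negative)
    static String labelFor(String row) {
        String[] cells = row.replace(" ", "").toLowerCase(Locale.ROOT).split(",");
        return hasLine(cells, "x") ? "positive" : "negative";
    }

    public static void main(String[] args) {
        System.out.println(labelFor("b,b,b,o,o,b,x,x,x"));
        System.out.println(labelFor("x,x,o,x,x,o,o,b,o"));
    }
}
```

The first sample row has X across the bottom row of the board, while in the second O holds the right-hand column.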

The first thing we have to do before even thinking about our Neural Network is to make our data suitable for a Neural Network. For Neural Networks we need to standardize our data. In this situation, it is reasonably simple since we only have a choice of up to 3 alternatives for each value. However, it can be a lot more complicated. Take a look here for a good explanation of this with examples. For this tutorial, I chose to convert this data as follows....

    b = 0
    o = -1
    x = 1
    positive = 1
    negative = 0
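As a rough sketch of that conversion applied to a whole data row, the snippet below turns one line of the file into nine input values and one expected output. The class and method names are hypothetical; unlike the routine used in the job, this sketch throws an exception for unknown tokens rather than returning a sentinel value.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of the tutorial's routines.
final class BoardEncodingSketch {

    // Maps one cell or label token to its numeric value
    static double encode(String token) {
        switch (token.trim().toLowerCase(Locale.ROOT)) {
            case "x": return 1.0;
            case "o": return -1.0;
            case "b": return 0.0;
            case "positive": return 1.0;
            case "negative": return 0.0;
            default: throw new IllegalArgumentException("Unknown token: " + token);
        }
    }

    // Encodes the nine board cells of a data row as network inputs
    static double[] encodeInputs(String row) {
        String[] fields = row.split(",");
        double[] inputs = new double[9];
        for (int i = 0; i < 9; i++) {
            inputs[i] = encode(fields[i]);
        }
        return inputs;
    }

    // Encodes the tenth field (the label) as the expected output
    static double[] encodeTarget(String row) {
        String[] fields = row.split(",");
        return new double[] { encode(fields[9]) };
    }

    public static void main(String[] args) {
        String row = "b,b,b,o,o,b,x,x,x,positive";
        System.out.println(Arrays.toString(encodeInputs(row)));
        System.out.println(Arrays.toString(encodeTarget(row)));
    }
}
```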

## TicTacToeUtils Routine

Since the Neural Network will be used with data in the unconverted format, I have built the logic for converting the data into a Talend Routine. This is used by both the job training the Neural Network and the service using the trained Neural Network. The routine is shown below. It is also included with the job and service at the bottom of this tutorial.

    package routines;

    public class TicTacToeUtils {

        /**
         * Translates String values to int values to suit Neural Network requirements
         *
         * @param data - A String value to be changed to an int
         * @return - The corresponding int value
         */
        public static int translateStringValueToNumber(String data) {
            int returnVal = -9999;
            data = data.trim();

            if (data.compareToIgnoreCase("X") == 0) {
                returnVal = 1;
            } else if (data.compareToIgnoreCase("O") == 0) {
                returnVal = -1;
            } else if (data.compareToIgnoreCase("B") == 0) {
                returnVal = 0;
            } else if (data.compareToIgnoreCase("POSITIVE") == 0) {
                returnVal = 1;
            } else if (data.compareToIgnoreCase("NEGATIVE") == 0) {
                returnVal = 0;
            }

            return returnVal;
        }

        /**
         * Translate from an int value to a String TicTacToe value "X", "O", "B" (blank)
         *
         * @param data - an int value
         * @return - The corresponding String value
         */
        public static String translateNumberValueToString(int data) {
            String returnVal = "";

            if (data == 1) {
                returnVal = "X";
            } else if (data == -1) {
                returnVal = "O";
            } else if (data == 0) {
                returnVal = "B";
            }

            return returnVal;
        }

        /**
         * Translate the result value into a String value representing the result of the TicTacToe
         * game from player X's perspective.
         *
         * @param data - The double response from the Neural Network
         * @return - The String result
         */
        public static String translateResultValueToString(double data) {
            String returnVal = "";
            long tmpData = Math.round(data);

            if (tmpData == 1) {
                returnVal = "POSITIVE";
            } else if (tmpData == 0) {
                returnVal = "NEGATIVE";
            }

            return returnVal;
        }

        /**
         * A method for retrieving a section of a String according to its position. Used
         * to extract TicTacToe board data from value supplied to REST service
         *
         * @param data - The complete TicTacToe board in a String
         * @param position - an int representing the section of the String data to be returned
         * @return - A String section of the String data supplied
         */
        public static String getStringAtPosition(String data, int position) {
            String[] dataArray = data.split(",");
            String returnVal = "";

            // Only return a value when the position is inside the array bounds
            if (position < dataArray.length && position >= 0) {
                returnVal = dataArray[position].trim();
            }

            return returnVal;
        }
    }

## NeuralNetworkUtils Routine

In order to use the Neuroph API in a Talend job, I have built some methods to simplify the process. This is by no means the "perfect solution" for all Talend jobs, but it suits the requirements for this one. The routine I put together is shown below. It is also included with the job and service at the bottom of this tutorial.

    package routines;

    import java.util.ArrayList;
    import java.util.Arrays;
    import org.neuroph.core.NeuralNetwork;
    import org.neuroph.core.data.DataSet;
    import org.neuroph.core.data.DataSetRow;
    import org.neuroph.core.events.LearningEvent;
    import org.neuroph.core.events.LearningEventListener;
    import org.neuroph.nnet.MultiLayerPerceptron;
    import org.neuroph.nnet.learning.BackPropagation;
    import org.neuroph.nnet.learning.MomentumBackpropagation;
    import org.neuroph.util.NeuronProperties;
    import org.neuroph.util.TransferFunctionType;

    /*
     * A class making use of the Neuroph API (http://neuroph.sourceforge.net/javadoc/index.html). The methods here have been
     * written to demonstrate how this API can be used with Talend to enable Neural Network functionality in a Talend job or
     * Service.
     *
     */
    public class NeuralNetworkUtils {

        //Constants for use with TransferFunctionType - currently only SIGMOID, but can be extended
        public static final Enum SIGMOID = TransferFunctionType.SIGMOID;

        //Private Static variables shared by the Static methods
        private static DataSet trainingSet;
        private static NeuralNetwork loadedMlPerceptron;
        private static MultiLayerPerceptron myMlPerceptron;
        private static int numOfIterations;

        /**
         * Returns the number of iterations that took place training the
         * Neural Network
         *
         * @return - an int representing the number of iterations
         */
        public static int getNumOfIterations() {
            return numOfIterations;
        }

        /**
         * Creates a new training data set
         *
         * @param dataColumns - an int representing the number of input columns
         * @param resultColumns - an int representing the number of expected result columns
         */
        public static void createTrainingSet(int dataColumns, int resultColumns) {
            trainingSet = new DataSet(dataColumns, resultColumns);
        }

        /**
         * Adds data to the training data set created using "createTrainingSet" method
         *
         * @param dataColumns - a double array containing one row of input data
         * @param resultColumns - a double array containing one row of expected result data
         */
        public static void addTrainingData(double[] dataColumns,
                double[] resultColumns) {
            trainingSet.addRow(dataColumns, resultColumns);
        }

        /**
         * A method which creates a Multi-layer Perceptron Neural Network using backpropagation with momentum
         *
         * For a brief explanation of this see http://neuroph.sourceforge.net/tutorials/MultiLayerPerceptron.html and
         * https://en.wikipedia.org/wiki/Multilayer_perceptron
         *
         * @param learnRate - a double which sets the learning rate for the network (0 ...

            ... neuronsInLayersVector = new ArrayList<>();
            for (int i = 0; i < neuronsInLayers.length; i++) {
                neuronsInLayersVector.add(Integer.valueOf(neuronsInLayers[i]));
            }

            // create multi layer perceptron
            myMlPerceptron = new MultiLayerPerceptron(neuronsInLayersVector,
                    neuronProperties);

            // Set learning rules
            MomentumBackpropagation mbp = new MomentumBackpropagation();
            mbp.setLearningRate(learnRate);
            mbp.setMomentum(momentum);
            mbp.setMaxError(maxError);
            mbp.setMaxIterations(maxIterations);

            //Learning event listener to keep track of iterations
            mbp.addListener(new LearningEventListener() {

                @Override
                public void handleLearningEvent(LearningEvent arg0) {
                    // TODO Auto-generated method stub
                    BackPropagation bp = ((org.neuroph.nnet.learning.BackPropagation) arg0
                            .getSource());
                    numOfIterations = bp.getCurrentIteration();
                }
            });

            // learn using the training set
            myMlPerceptron.learn(trainingSet, mbp);

            // test neural network
            testNeuralNetwork(myMlPerceptron, trainingSet);

            //Used for outputting neuron configuration
            String neurons = "";

            for(int x=0; x ...

Since this routine makes use of third-party APIs, we need to link the related Jars to the Talend routine. The API can be downloaded from here.

To link the Jars to the Talend Routine, do the following:

1. Right click on the routine and select "Edit Routine Libraries"
2. Click "New"
3. Select "Browse a library file"
4. Click "Browse" and search for the required Jars

For this routine, the required Jars are:

- neuroph-core-2.92.jar
- slf4j-api-1.7.5.jar
- slf4j-nop-1.7.6.jar

## The TrainNeuralNetworkForTicTacToe Job

This job is used to train the Neural Network. It is a pretty straightforward Talend job and can be seen below...

There are two tLogRow components which are deactivated in the screenshot above. It is sometimes quite useful to add these and deactivate them so that you don't have to make major changes to your job in order to simply debug what goes in and comes out of a component. I use them a lot with tMap and tXMLMap components.

### Context Variables

For this job I only used two context variables: one for the training set file and one for the serialized neural network object. These can be seen below. If you download this job, you will need to change these to suit your system.

### 1) "Data" (tFileInputDelimited)

This component is used to read the data file (downloaded from here). You can see the configuration of the component below...

### 2) "Convert to suitable format" (tMap)

This component is used to simply convert the String input type of the column data to an Integer type. This can be seen below...

In order to carry out the conversion of the data, we are using the "translateStringValueToNumber" method from the TicTacToeUtils routine that is shown above. The code used is shown below. It is exactly the same for each column, with just a change in the column name supplied.

 routines.TicTacToeUtils.translateStringValueToNumber(row9.a1)

### 3) "Train Network" (tJavaFlex)

This component is where the magic happens. Since it is a tJavaFlex and only has 3 Java sections (Start Code, Main Code and End Code) I will not post a screenshot here. Instead I will go through each of the Java sections and explain what is happening.

Start Code

Below is the code in the Start Code section.

    //Create a training set object
    routines.NeuralNetworkUtils.createTrainingSet(9, 1);

Here we are creating a training set. This is an object for storing the training data which is made up of 9 input columns and 1 result column. The configuration of the training set depends on the data you will be working with. In this Tic-Tac-Toe tutorial we have 9 squares that make up the 3x3 board state and 1 result column which returns whether a positive or negative result has been obtained by the X player.
The Start Code section is only fired once at the beginning when the component is initialized.

Main Code

Below is the code in the Main Code section.

    //Add data to the training set object
    double[] inputData = new double[9];
    double[] resultData = new double[1];

    inputData[0] = row10.a1;
    inputData[1] = row10.a2;
    inputData[2] = row10.a3;
    inputData[3] = row10.b1;
    inputData[4] = row10.b2;
    inputData[5] = row10.b3;
    inputData[6] = row10.c1;
    inputData[7] = row10.c2;
    inputData[8] = row10.c3;

    resultData[0] = row10.result;

    routines.NeuralNetworkUtils.addTrainingData(inputData,resultData);

Here we are creating two double arrays. The inputData array is made up of 9 elements (1 for each of the squares in a Tic-Tac-Toe board) and the resultData is made up of 1 element. This is then added to the training set using the "addTrainingData" method. The Main Code section is fired for every row passed to the component.
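If the tJavaFlex sections are unfamiliar, the following plain-Java sketch may help to picture how they relate to the row flow: one block that runs before the first row, one that runs for each row, and one that runs after the last row. It is only a simulation with hypothetical names and uses no Talend or Neuroph classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tJavaFlex life cycle; not Talend-generated code.
final class JavaFlexLifecycleSketch {

    // Collects rows in the same order of events as the component
    static List<double[]> collect(List<double[]> incomingRows) {
        // "Start Code": runs once, before the first row arrives
        List<double[]> trainingRows = new ArrayList<>();

        // "Main Code": runs once for every incoming row
        for (double[] row : incomingRows) {
            trainingRows.add(row.clone());
        }

        // "End Code": runs once, after the last row has been processed
        System.out.println("Rows collected: " + trainingRows.size());
        return trainingRows;
    }

    public static void main(String[] args) {
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[] {0, 0, 0, -1, -1, 0, 1, 1, 1, 1});
        rows.add(new double[] {1, 1, -1, 1, 1, -1, -1, 0, -1, 0});
        collect(rows);
    }
}
```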

End Code

Below is the code in the End Code section.

    //Create the Neural Network
    routines.NeuralNetworkUtils.trainMultiLayerPerceptronWithMomentumBackProp(0.5, 0.7, 0.000001, 1000, routines.NeuralNetworkUtils.SIGMOID, 9,26,1);

    //Save trained Neural Network - The filename and path may need changing in your environment
    routines.NeuralNetworkUtils.saveMultiLayerPerceptron(context.neuralnet_filepath);

Here we use the "trainMultiLayerPerceptronWithMomentumBackProp" method to create a Neural Network and initiate the training. The important things here are the parameters that have been used. I will explain those below.

| Parameter | Value | Description |
| --- | --- | --- |
| learnRate | 0.5 | The learning rate applies a greater or lesser adjustment to the old weight based on the new result. The lower the value, the slower the learning that takes place. However, the higher the value, the more likely it is that the wrong thing will be learned when there is great variance in the input data. This value needs to be tweaked until you hit the sweet spot. For this data I have found that 0.5 is a good value. |
| momentum | 0.7 | The momentum simply adds a fraction of the previous weight update to the current one. The reason for this is that sometimes the functions being calculated are not smoothly moving in a constant direction or gradient. Imagine a ball rolling down a hill. During its descent, it might hit the occasional bump that hinders its progress. Momentum allows the ball to ride over the bump and continue rolling down the hill. Both learning rate and momentum are explained quite well here. |
| maxError | 0.000001 | The max error is the maximum total net error between the actual and desired outputs that we will allow over a training iteration before the network is considered trained. Since this data should be easily trained, I have set this to quite a low level of tolerance for errors. Usually this value will be much higher. |
| maxIterations | 1000 | The total number of iterations before we give up training. This is low compared to other environments you might wish to train. |
| transferFunctionType | routines.NeuralNetworkUtils.SIGMOID | Transfer function choice is a big question in Neural Networks. Without going into any detail, the choice here was somewhat arbitrary for this problem. For your Neural Networks you will want to experiment and research the function you use. However, for simple problems SIGMOID is a reasonable one to start with. |
| neuronsInLayers | 9,26,1 | The number of neurons at each level in the network. In this Neural Network I tried a few combinations and found that 26 hidden neurons worked best. The number of input neurons (9) is dictated by the number of input columns and the number of output neurons (1) is dictated by the expected result. |
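To give a feel for how learnRate, momentum, maxError and maxIterations interact, here is a toy sketch of gradient descent with momentum on a single weight. It is not Neuroph's MomentumBackpropagation implementation; the error function, the class name and the learning rate of 0.1 used in the example call are assumptions chosen purely for illustration.

```java
// Toy illustration only; not Neuroph's MomentumBackpropagation implementation.
final class MomentumSketch {

    // Minimizes error(w) = (w - 3)^2 and returns { final weight, iterations used }
    static double[] train(double learnRate, double momentum,
                          double maxError, int maxIterations) {
        double weight = 0.0;          // arbitrary starting weight
        double previousUpdate = 0.0;  // last weight change, reused as momentum
        int iteration = 0;

        while (iteration < maxIterations) {
            double error = (weight - 3.0) * (weight - 3.0);
            if (error <= maxError) {
                break; // error is small enough: stop before maxIterations is reached
            }
            double gradient = 2.0 * (weight - 3.0);
            // Step against the gradient, plus a fraction of the previous step
            double update = -learnRate * gradient + momentum * previousUpdate;
            weight += update;
            previousUpdate = update;
            iteration++;
        }
        return new double[] { weight, iteration };
    }

    public static void main(String[] args) {
        double[] result = train(0.1, 0.7, 0.000001, 1000);
        System.out.println("weight = " + result[0] + ", iterations = " + (int) result[1]);
    }
}
```

The loop stops as soon as the error drops to maxError or below, so the number of iterations reported is normally well under maxIterations, which is the same pattern as the "Iterations" figure the training job prints.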

The last thing that is done in the End Code section is to save the Neural Network to a file. Once trained (so long as your data doesn't change all that much) the Neural Network is able to be saved and used in jobs/services making use of the same sort of data.
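The "serialized neural network object" mentioned in the context variables is a Java object written to disk. The sketch below shows standard Java serialization with a hypothetical TinyModel class standing in for the network. It assumes the saved network file behaves like any other serialized Java object; it does not show Neuroph's own save or load code.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a trained network; not a Neuroph class.
final class TinyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] weights;

    TinyModel(double[] weights) {
        this.weights = weights.clone();
    }
}

final class ModelFileSketch {

    // Writes the model to disk as a serialized Java object
    static void save(TinyModel model, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the model back; its class must be available when loading
    static TinyModel load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (TinyModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Saves to a temporary file, loads the copy back, then removes the file
    static TinyModel roundTrip(TinyModel model) {
        try {
            Path file = Files.createTempFile("tinymodel", ".ser");
            try {
                save(model, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One practical consequence of this approach is that whatever loads the file needs the same classes (here, the same Jars) available as the job that wrote it.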

## The TicTacToeStateScore Service

To show how to use the trained Neural Network I decided to use a REST Service example. I could have used a DI job, but felt that a service might open up some ideas as to how Neural Networks can be used in real-time environments as well as batch. Also, a REST Service is pretty simple and quick to show this working. The Service can be seen below...

### Context Variables

Below are the context variables created for this Service. In this service we are using just 1 for the path to the Neural Network file.

### 1) "tRESTRequest_1" (tRESTRequest)

This component is where we configure the REST Service. The screenshot below shows how this has been configured...

We are using the "GET" verb and a relative path for the endpoint. When you run this through the Studio, it will use the port that is specified for your REST Service testing. When you run it in Apache Karaf, it will use Karaf's defaults.

In this example, we make use of a REST Service Query Parameter. The configuration for this is shown in the screenshot below...

First we open the Output Flow schema tool by clicking on the button circled in red.
Once the window appears we configure a column called "state" as a String and add "query" to the comment box. This is important. If this is not done, you will not be able to use it as a query parameter. Now that this is set, we can call this Service with a variation on the following URL.....

### 2) "Get data from state" (tMap)

This component is used simply to extract each of the 9 state positions from the "state" query parameter that is supplied in the URL, and output them to the next component as individual Strings. This can be seen below...

To extract the values we are using a method in the TicTacToeUtils routine called "getStringAtPosition". This extracts the section of the String indicated by the second parameter which is used to identify position. The use of this method can be seen below....

 routines.TicTacToeUtils.getStringAtPosition(row2.state,0)
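As an illustration of what the nine tMap expressions produce between them, the sketch below splits a state String and assigns each position to a column. The class is hypothetical, and the assumption that position 0 maps to a1, position 1 to a2 and so on up to c3 is inferred from the column names used elsewhere in this tutorial.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes positions 0..8 map to columns a1..c3 in order.
final class StateColumnsSketch {

    private static final String[] COLUMNS = {
        "a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"
    };

    // Returns the trimmed cell at a zero-based position, or "" if out of range
    static String cellAt(String state, int position) {
        String[] cells = state.split(",");
        if (position >= 0 && position < cells.length) {
            return cells[position].trim();
        }
        return "";
    }

    // Builds the nine output columns from the "state" query parameter
    static Map<String, String> toColumns(String state) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            columns.put(COLUMNS[i], cellAt(state, i));
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(toColumns("X,X,O,X,O,X,X,O,O"));
    }
}
```

Note that a state with fewer than nine values simply yields empty Strings for the missing columns rather than an error.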

### 3) "tJavaFlex_1" (tJavaFlex)

Like the last tJavaFlex that was used, this is where the magic happens. Also like the last one, I will not post screenshots of this. I will simply go through each of the code sections. In this tJavaFlex we only use the Start Code and Main Code sections.

Start Code

This is what is used in the Start Code section.

In this section we simply load the Neural Network that we want to use. This was the Neural Network trained in the last job.

Main Code

This is what is used in the Main Code section.

    //Convert state String values to numbers suitable for a Neural Network
    double[] inputData = new double[9];

    String a1 = out1.a1;
    String a2 = out1.a2;
    String a3 = out1.a3;
    String b1 = out1.b1;
    String b2 = out1.b2;
    String b3 = out1.b3;
    String c1 = out1.c1;
    String c2 = out1.c2;
    String c3 = out1.c3;

    inputData[0] = routines.TicTacToeUtils.translateStringValueToNumber(a1);
    inputData[1] = routines.TicTacToeUtils.translateStringValueToNumber(a2);
    inputData[2] = routines.TicTacToeUtils.translateStringValueToNumber(a3);
    inputData[3] = routines.TicTacToeUtils.translateStringValueToNumber(b1);
    inputData[4] = routines.TicTacToeUtils.translateStringValueToNumber(b2);
    inputData[5] = routines.TicTacToeUtils.translateStringValueToNumber(b3);
    inputData[6] = routines.TicTacToeUtils.translateStringValueToNumber(c1);
    inputData[7] = routines.TicTacToeUtils.translateStringValueToNumber(c2);
    inputData[8] = routines.TicTacToeUtils.translateStringValueToNumber(c3);

    //Calculate result using the previously trained Neural Network
    double[] output = routines.NeuralNetworkUtils.calcData(inputData);

    System.out.println("Actual Result:"+ routines.TicTacToeUtils.translateResultValueToString(output[0]));

    //Format state for Sys out
    String tictactoe = a1+","+a2+","+a3+"\n"+b1+","+b2+","+b3+"\n"+c1+","+c2+","+c3+"\n";

    System.out.println(tictactoe);

    //Pass result and board state to be formatted for the XML output
    row3.result = routines.TicTacToeUtils.translateResultValueToString(output[0]);
    row3.tictactoe = tictactoe;

In this section we create a double array called "inputData" to hold our state values.
We then use the "translateStringValueToNumber" method from the TicTacToeUtils routine to convert the String values to their corresponding numeric values.
We then use the "calcData" method to run that data through the trained Neural Network. This returns a double array with the result.
After some "System.out" calls to show what is happening in the output window, we pass the result (converted to a String using the "translateResultValueToString" method from TicTacToeUtils) and the Tic-Tac-Toe board state on to the next component.
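The network rarely returns exactly 0 or 1, so the result is rounded before it is turned into a label. Here is a minimal sketch of that rounding step, using a hypothetical class that mirrors the idea behind "translateResultValueToString". A sigmoid output always lies between 0 and 1, so in practice one of the two labels is returned.

```java
// Hypothetical helper; mirrors the rounding idea in translateResultValueToString.
final class OutputRoundingSketch {

    // Rounds the raw network output to the nearest class label
    static String toLabel(double networkOutput) {
        long rounded = Math.round(networkOutput);
        if (rounded == 1) {
            return "POSITIVE";
        } else if (rounded == 0) {
            return "NEGATIVE";
        }
        return ""; // outside the expected 0..1 range
    }

    public static void main(String[] args) {
        System.out.println(toLabel(0.0073486272363958655));
        System.out.println(toLabel(0.97));
    }
}
```

Because Math.round rounds half up, an output of exactly 0.5 is treated as POSITIVE. If you need a stricter decision boundary, compare against your own threshold instead of rounding.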

### 4) "Format the XML output" (tXMLMap)

This component is used to format the response into an XML output. It is really very basic and the configuration can be seen in the screenshot below.....

The reason we wrap the "tictactoes_state" element value with "<![CDATA[" and "]]>" is to allow the formatting carried out in the last component to be shown in the web browser (it works in some browsers, not in others). This isn't terribly important, but it allows you to easily see the board state as it would be written on a piece of paper.

The output to the browser looks like below....

 NEGATIVE

### 5) "tRESTResponse_1" (tRESTResponse)

This component simply allows us to return the XML to the browser. REST Services can be (and usually are) a lot more complicated than this one. I have chosen to put together a bare bones REST Service in this case and it will not handle incorrect formats being supplied as the "state". As such, this component is simply configured to return a 200 status and the XML. The config can be seen below...
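If you wanted to guard against badly formatted "state" values, one option would be a small check before the data reaches the Neural Network. The sketch below is a hypothetical helper, not part of the downloadable service. It only verifies the format (nine comma-separated values, each X, O or B), not whether the board is a legal end game state.

```java
import java.util.Locale;

// Hypothetical helper; not part of the tutorial's service.
final class StateValidationSketch {

    // Returns true if the state has exactly nine cells, each X, O or B
    static boolean isWellFormed(String state) {
        if (state == null) {
            return false;
        }
        String[] cells = state.split(",", -1); // -1 keeps trailing empty cells
        if (cells.length != 9) {
            return false;
        }
        for (String cell : cells) {
            String value = cell.trim().toUpperCase(Locale.ROOT);
            if (!value.equals("X") && !value.equals("O") && !value.equals("B")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("X,X,O,X,O,X,X,O,O"));
        System.out.println(isWellFormed("X,X,O"));
    }
}
```

A check like this could be used to return an error status instead of the 200 response when the format is wrong.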

## Running the TrainNeuralNetworkForTicTacToe Job

To run this job simply make sure the source file is downloaded and in the correct location (configured in the context variables), then click Run. If this runs successfully, you should see something like the following in the output window....

    Input: [1.0, -1.0, 1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0] Expected Output: [0.0] Output: [0.0073486272363958655]
    Input: [1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0, -1.0] Expected Output: [0.0] Output: [0.0010815233039514665]
    Input: [-1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0] Expected Output: [0.0] Output: [0.002757025113506054]
    Input: [-1.0, 1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0] Expected Output: [0.0] Output: [0.0022573148740257934]
    Input: [-1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0] Expected Output: [0.0] Output: [0.001865857943502293]
    Input: [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0, 1.0] Expected Output: [0.0] Output: [0.003494778673600675]
    Input: [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0] Expected Output: [0.0] Output: [0.020374120565192142]
    Input: [-1.0, 1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0] Expected Output: [0.0] Output: [0.008408965104047168]
    Input: [-1.0, -1.0, 1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0] Expected Output: [0.0] Output: [0.022969116290342522]
    LearnRate = 0.5| Momentum = 0.7|Neurons = 9,26,1| Iterations = 98
    [statistics] disconnected
    Job TrainNeuralNetworkForTicTacToe ended at 17:22 01/06/2016. [exit code=0]

## Running the TicTacToeStateScore Service

To run this Service simply make sure the path to the Neural Network file is set, that it has been trained, then click Run.
Once the Service is started, you will see a message like below in the output window....

    Starting job TicTacToeStateScore at 17:26 01/06/2016.
    [statistics] connecting to socket on port 3652
    [statistics] connected
    Jun 01, 2016 5:26:26 PM org.apache.cxf.endpoint.ServerImpl initDestination
    INFO: Setting the server's publish address to be http://127.0.0.1:9099/statescore

To work out how to call this Service from your web browser, look at the last line of the output copied above. It tells you the endpoint you need to use, minus the "state" query parameter. Be aware that the IP address above is just for localhost. If you want to use the service from another computer on your network, you will need the IP address of the machine that the service is running on. To call the above Service, the following endpoint should be used:

 http://127.0.0.1:9099/statescore?state=X,X,O,X,O,X,X,O,O

Remember that the "state" should be changed according to whatever state you want to assess. Since only legal states were trained, you can only get reliable results from legal states. The above call should result in the following XML response....

 POSITIVE
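The Service can also be called from code rather than a browser. The sketch below is one possible client, assuming Java 11 or later for the java.net.http classes; the class and method names are hypothetical. The state is URL-encoded, so the commas appear as %2C in the request.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client; assumes Java 11+ and a running TicTacToeStateScore service.
final class StateScoreClientSketch {

    // Builds the request URI with the URL-encoded "state" query parameter
    static URI buildUri(String baseUrl, String state) {
        String encoded = URLEncoder.encode(state, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "?state=" + encoded);
    }

    // Sends a GET request and returns the response body (the XML)
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        URI uri = buildUri("http://127.0.0.1:9099/statescore", "X,X,O,X,O,X,X,O,O");
        System.out.println(uri);
        // Calling fetch(uri) only works while the service is running
    }
}
```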

A copy of the completed tutorial can be found here. You will also need the Neuroph Jars which can be downloaded here (we are using Neuroph 2.92 in this tutorial). The Tic-Tac-Toe data can be downloaded here. This tutorial was built using Talend ESB 6.1.1 but can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it following the tutorial. You will need to set the Context variables according to your system before running it.

About the Author - Richard Hall

Richard comes from a background of over 10 years working in Data Integration and has moved his focus to Application Integration over the last few years. Throughout his career he has worked in high pressure, delivery driven environments. He has provided Data Integration and Application Integration consulting services in Africa, Asia, North America, Europe and Australia, to Banks, Telcos, Insurance companies, Finance companies, Media leaders and many other smaller entities.