
Making TensorFlowTransform trainable. #1063

Merged: 24 commits into dotnet:master on Oct 8, 2018

Conversation

@zeahmed (Contributor) commented Sep 26, 2018

This PR addresses #951, enabling users to retrain TensorFlow models using TensorFlowTransform in ML.NET.

The training functionality is an add-on feature on top of the current TensorFlowTransform implementation; model scoring using TensorFlowTransform is untouched.

Current implementation includes:

  • Retraining of un-frozen models (saved with the simple_save method in Python).
  • Support for batch training.
  • Serializing the retrained model into the ML.NET stream.

ToDo

  • [SOLVED] Serializing the retrained model into the ML.NET stream. This seemed like a big challenge because the TensorFlow C-API does not support serialization of TFSession objects, which contain all the retrained parameters.
  • Turn TensorFlowEstimator from a TrivialEstimator into a NonTrivialEstimator. This is because TensorFlowTransform now trains the model in the constructor call where TrainingArguments are passed (will be done as a separate PR: Convert TensorFlowEstimator into Non-TrivialEstimator. #1110).
  • Think about how we can get rid of data copying during creation of batches.
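The batching ToDo above comes down to buffering rows and flushing them as fixed-size batches. A minimal Python sketch of that idea (hypothetical names; the real implementation is the C# TensorValueGetter<T> shown below) preallocates the batch storage once so each row copy reuses the same buffer:

```python
# Sketch of batch buffering: rows are copied into a preallocated buffer and
# handed off once the buffer holds a full batch, avoiding per-row allocations.
class BatchBuffer:
    def __init__(self, batch_size, row_width):
        self.batch_size = batch_size
        self.row_width = row_width
        # Preallocate once; reusing this storage is what limits copying.
        self.data = [0.0] * (batch_size * row_width)
        self.count = 0

    def add(self, row):
        assert len(row) == self.row_width
        start = self.count * self.row_width
        self.data[start:start + self.row_width] = row
        self.count += 1
        return self.count == self.batch_size  # True when a full batch is ready

    def flush(self):
        batch = [self.data[i * self.row_width:(i + 1) * self.row_width]
                 for i in range(self.count)]
        self.count = 0
        return batch
```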

private class TensorValueGetter<T> : ITensorValueGetter
{
private readonly ValueGetter<T> _srcgetter;
private readonly List<T> _bufferedData;
@zeahmed (Contributor, Author) commented Sep 26, 2018

On `List`:

Maybe think about making it an array to make it faster. #Resolved

[Fact]
public void TensorFlowTransformMNISTTemplateTrainingTest()
{
var model_location = @"E:\Tensorflow\LogisticRegression";
@zeahmed (Contributor, Author) commented Sep 26, 2018

On `@"E:\Tensorflow\LogisticRegression"`:

Need to fix this hard-coded path... #Resolved

trans = new ConcatTransform(env, "Features", "Prediction").Transform(trans);
trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);

var trainer = new LightGbmMulticlassTrainer(env, "Label", "Features");
@zeahmed (Contributor, Author) commented Sep 27, 2018

On `LightGbmMulticlassTrainer`:

Note: a trainer on top is actually not a limitation. In the Estimator/Transformer world, the prediction pipeline will work without adding a trainer at the end. It is only used here because of the direct-instantiation case. #Resolved

@ericstj (Member) commented Sep 27, 2018

Serializing retrained model into ML.Net stream. It seems like a big challenge right now because TensorFlow C-API does not support serialization of TFSession objects which contains all the retrained parameter.

Is enough of this state exposed so that we can write serialization on top, or are things internal / not exposed to the C-API? At least the Python and JS APIs are able to save, so there must be a way, but I can imagine that the exports needed aren't exposed by the TF C-API (or the underlying structure of the objects isn't exposed in the headers). #Closed

@zeahmed (Contributor, Author) commented Sep 27, 2018

I found out a standard way of doing this through session. I will update the PR with my changes most probably today.


In reply to: 425152137

@zeahmed (Contributor, Author) commented Sep 27, 2018

For that I also included some other features from TF#. Your feedback will be valuable on that.


In reply to: 425159496

@abgoswam (Member) commented Sep 27, 2018

If we can save the re-trained model in the SavedModel format, we will be able to reuse the save & restore capabilities for the SavedModel format that we added in #784.


In reply to: 425171940

trans = CategoricalTransform.Create(env, trans, "Label");
trans = NormalizeTransform.CreateMinMaxNormalizer(env, trans, "Features", "Placeholder");

var args = new TensorFlowTransform.TrainingArguments()
@abgoswam (Member) commented Sep 27, 2018

On `TrainingArguments`:

So in this scenario we are re-training the entire graph. This is a good place to start for the PoC.

Two other patterns for transfer-learning we need to look into :

(a) Replace the last layer (or last few layers) with a layer that matches the shape of "Label" for the new dataset we want to re-train on.

I.e., re-train the "lr_model" above, but instead of the final layer, which currently has shape [-1, 10], I want to re-train on the iris dataset (say), which has shape [-1, 5].

(b) Support 'freeze_up_to_node=<node_name>' functionality, i.e. freeze the weights of nodes that <node_name> depends on, and only fine-tune the weights of nodes that depend on <node_name>.

#ByDesign
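The two transfer-learning patterns above can be sketched roughly in plain Python (an illustration only, with a hypothetical dict-based model representation, not TensorFlow or ML.NET code): pattern (a) swaps the head so its output dimension matches the new label count, and pattern (b) marks everything up to a cut-point node as non-trainable.

```python
# Pattern (a): replace the final layer so its output dimension matches the
# new dataset's label count. The old head weights are discarded; the body
# keeps its pre-trained weights.
def replace_head(model, new_label_dim):
    hidden_dim, _old_label_dim = model["head"]["shape"]
    model["head"] = {
        "shape": (hidden_dim, new_label_dim),
        "trainable": True,  # the new head is trained from scratch
    }
    return model

# Pattern (b): freeze everything the cut point depends on; fine-tune the rest.
# Freezing the cut-point node itself is a boundary choice assumed here.
def freeze_up_to(model, node_name, dependency_order):
    cut = dependency_order.index(node_name)
    for name in dependency_order[:cut + 1]:
        model[name]["trainable"] = False
    return model
```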

@zeahmed (Contributor, Author) replied:

Thanks @abhishek for pointing out these scenarios. Actually, these are mostly research-related scenarios used for experimentation purposes. They can be added on top of what has been implemented so far if required.

BTW, it is no longer a PoC...:)
I would suggest reviewing it as a real PR that will go into master.


In reply to: 221026509

@abgoswam (Member) replied:

Should we track those scenarios above as separate issues ?

To me they seem more relevant -- esp. (a). Otherwise we would have to tell customers that we can only do transfer learning if their label dimension is the same as the label dimension used in the pre-trained model.

(Maybe there is a simple fix for this -- I just want to make sure we keep track of the assumptions made in this PR.)


In reply to: 221084395

@zeahmed (Contributor, Author) commented Sep 27, 2018

Changing the label dimension means changing the cost function, prediction function, label input dimension, and whatnot. It's hard to identify the operations in the graph that use the label and adjust them to new dimensions/settings.
If you have a solution, then it would definitely be worth trying...:)


In reply to: 221099796

@abgoswam (Member) replied:

Ok. I can take a shot at this as a separate issue. We need to document the existing behaviour though, i.e. we only support transfer learning on models which have the same feature/label dimensions as the user data.


In reply to: 221102056

@zeahmed zeahmed changed the title WIP: Making TensorFlowTransform trainable. Making TensorFlowTransform trainable. Sep 27, 2018
@zeahmed zeahmed changed the title Making TensorFlowTransform trainable. WIP: Making TensorFlowTransform trainable. Sep 27, 2018
for (int j = 1; j < tfInputShapes[i].NumDimensions; j++)
newShape[j] = tfInputShapes[i][j];
tfInputShapes[i] = new TFShape(newShape);
}
@zeahmed (Contributor, Author) commented Sep 27, 2018

Need a bit more shape checking here due to missing dimensions... #Resolved
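The shape fix-up in the loop above can be sketched like this (a Python illustration of the idea, not the actual TFShape API): the unknown leading dimension (-1) is replaced by the batch size, and any remaining unknown dimensions are the ones the comment says still need checking.

```python
# Sketch: replace an unknown leading dimension (-1) with the batch size, and
# flag any other unknown dimensions, which cannot be inferred from the data.
def fix_input_shape(shape, batch_size):
    new_shape = list(shape)
    if new_shape[0] == -1:
        new_shape[0] = batch_size
    # Any other -1 is a dimension the graph leaves unspecified.
    unknown = [i for i, d in enumerate(new_shape[1:], start=1) if d == -1]
    return new_shape, unknown
```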

@zeahmed (Contributor, Author) commented Sep 27, 2018

Yup, did it the same way.


In reply to: 425178632

@zeahmed zeahmed changed the title WIP: Making TensorFlowTransform trainable. Making TensorFlowTransform trainable. Sep 27, 2018
/// })
/// </summary>
[Fact]
public void TensorFlowTransformMNISTTemplateTrainingTest()
@zeahmed (Contributor, Author) commented Sep 27, 2018

On `TensorFlowTransformMNISTTemplateTrainingTest`:

Write more tests... #Resolved

// Save the model on disk
var path = Path.Combine(modelDir, "mlnet_model");
Session.GetRunner().AddInput("save/Const", TFTensor.CreateString(Encoding.UTF8.GetBytes(path)))
.AddTarget("save/control_dependency").Run();
@abgoswam (Member) commented Sep 27, 2018

A few questions:

  1. Is this what serializes the Session to disk?

  2. It will save only the parameter weights, not the saved_model.pb graph itself... am I right?

  3. Is this a standard technique -- using nodes named "save/Const" and "save/control_dependency" to overwrite model weights? #Closed

@zeahmed (Contributor, Author) replied:

Yes to all three questions above.
TensorFlow saves/restores models through these operations on graph.


In reply to: 221098961

@abgoswam (Member) replied:

For #3, does TF document this somewhere, in code / other language bindings, etc.?

Just using a combination of the strings "save/Const" and "save/control_dependency" seems so arbitrary :) ..

I wonder if this is the same technique used when checkpointing models?


In reply to: 221102898 [](ancestors = 221102898,221098961)

@zeahmed (Contributor, Author) replied:

Yes, this is how TensorFlow saves/restores models to/from disk, both for checkpointing and for simple_save. There is a reason those operations are injected into the graph: TF prefers doing all operations through the graph instead of through Python or other language code. I will find proper documentation for that.

I have made it a parameter in the TrainingArguments class so that users can change it if they need to. However, it is highly unlikely to change.


In reply to: 221104033
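The save call discussed above follows the pattern below. This is a schematic mock of the runner call chain, not the real TensorFlowSharp API; only the node names "save/Const" (the checkpoint-path tensor) and "save/control_dependency" (the save op), which tf.train.Saver adds to the graph by default, come from the PR itself. Saving is just running those in-graph ops with the output path fed in.

```python
# Mock of the fluent runner pattern used to trigger TensorFlow's in-graph
# saver: feed the path into "save/Const", target "save/control_dependency".
class MockRunner:
    def __init__(self):
        self.inputs = {}
        self.targets = []

    def add_input(self, node_name, value):
        self.inputs[node_name] = value
        return self  # fluent chaining, mirroring AddInput(...).AddTarget(...)

    def add_target(self, node_name):
        self.targets.append(node_name)
        return self

    def run(self):
        # A real session would execute the save op here and write a checkpoint.
        return {"ran": self.targets, "fed": self.inputs}

result = (MockRunner()
          .add_input("save/Const", "/models/mlnet_model")
          .add_target("save/control_dependency")
          .run())
```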

///
/// # Set model weights
/// W = tf.Variable(tf.truncated_normal([784, 10]), name = "W")
/// b = tf.Variable(tf.constant(0.1, shape=[10], dtype=tf.float32), name = "b")
@abgoswam (Member) commented Sep 27, 2018

On `constant`:

Can we verify that after fine-tuning we actually end up with different values for the parameters?

E.g., verifying that the bias term we get after fine-tuning the model is different from the initialization value (0.1). #Closed

@zeahmed (Contributor, Author) replied:

Yes, I will do it.


In reply to: 221100728

@@ -38,7 +39,7 @@ namespace Microsoft.ML.Transforms
/// <include file='doc.xml' path='doc/members/member[@name="TensorflowTransform"]/*' />
public sealed class TensorFlowTransform : ITransformer, ICanSaveModel
{
public sealed class Arguments : TransformInputBase
public class Arguments : TransformInputBase
@Ivanidzo4ka (Contributor) commented Oct 1, 2018

On `public class Arguments`:

Can you make it abstract? Also, I would suggest calling it ArgumentsBase rather than Arguments, because it would be confusing to have both an Arguments class that actually holds the params and this class. #Closed

@zeahmed (Contributor, Author) replied:

It cannot be made abstract because an Arguments object is instantiated for the scoring scenario and a TrainingArguments object is instantiated for the training scenario.


In reply to: 221764289

@Ivanidzo4ka (Contributor) replied:

Why can't it remain sealed if you don't use it as a base class?
Also, maybe rename it to ScoringArguments?


In reply to: 221768356

{
float loss = 0;
float metric = 0;
var runner = Session.GetRunner();
@yaeldekel commented Oct 3, 2018

On `runner`:

I want to make sure I understand: the runner tunes the weights of the graph in runner.Run(), and the weights are stored somewhere in Session? #Closed

@zeahmed (Contributor, Author) commented Oct 4, 2018

Yes, line 451,

runner.AddTarget(args.OptimizationOperation);

adds the optimization operation to this runner, which updates the weights. The Session holds those weights.


In reply to: 222501506
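The point above can be illustrated with a toy stand-in for the session (pure Python, hypothetical names; in the real code the weights live in native TensorFlow state): adding the optimization op as a run target makes each Run() mutate weights held by the session, which is why retrained weights persist across batches without ever being returned to the caller.

```python
# Toy illustration: the "session" owns the weight state, and running the
# optimization target mutates it in place.
class ToySession:
    def __init__(self, w):
        self.w = w  # weights live inside the session, as in TF

    def run(self, grad, lr=0.1, targets=()):
        # Mirrors runner.AddTarget(args.OptimizationOperation): only when the
        # optimize target is present does the run update the stored weights.
        if "optimize" in targets:
            self.w = self.w - lr * grad
        return self.w

sess = ToySession(w=1.0)
sess.run(grad=2.0, targets=("optimize",))  # updates the stored weight
sess.run(grad=2.0)                         # no optimize target: unchanged
```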

{
if (isVector)
return new TensorValueGetterVec<T>(input, colIndex, tfShape);
else
@yaeldekel commented Oct 3, 2018

On `else`:

Nit: "else" not needed here. #Closed

@@ -268,7 +268,7 @@ internal class TFStatus : TFDisposable
{
// extern TF_Status * TF_NewStatus ();
[DllImport(NativeBinding.TensorFlowLibrary)]
private static extern unsafe TF_Status TF_NewStatus();
internal static extern unsafe TF_Status TF_NewStatus();
@yaeldekel commented Oct 4, 2018

On `internal`:

Instead of exposing the C-API directly, you can use the C# API defined on top of it (the ctor below). Same for the other C imports below. #Resolved

@@ -2238,4 +2244,19 @@ public TFTensor AsTensor()
return shape.AsTensor();
}
}

internal class TFString
@yaeldekel commented Oct 4, 2018

On `TFString`:

Consider importing these APIs directly in the TFTensor class; this way they can be private. #Resolved

/// The name of the column used as label for training.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Training labels.", ShortName = "label", SortOrder = 4)]
public string LabeLColumn = DefaultColumnNames.Label;
@yaeldekel commented Oct 4, 2018

On `LabeLColumn`:

This must be identical to the name of the node in the graph corresponding to the label, correct?
If this is the case, then I don't think we should have the default value be "Label". #Resolved

@yaeldekel replied:

Actually, I think it would make more sense to have two arguments: one for the name of the label column in the data view, and one for the name of the label node in the graph (unless it is possible to find it in the graph without user input?).


In reply to: 222725910

@zeahmed (Contributor, Author) replied:

This is the same way the InputColumns and OutputColumns are treated. Do you think we should treat Label differently? What I can do is make it required.


In reply to: 222726613

@yaeldekel replied:

I don't think we should treat it differently, I think we should also make that change for input/output columns :-).


In reply to: 222754435

@zeahmed (Contributor, Author) commented Oct 4, 2018

I see. I think we already have #769 for tracking this. I can add the work for the label column there as well. Do you agree to do it later?


In reply to: 222841987

@@ -316,7 +695,14 @@ public void Save(ModelSaveContext ctx)
{
ctx.SaveBinaryStream("TFSavedModel", w =>
{
string[] modelFilePaths = Directory.GetFiles(_savedModelPath, "*", SearchOption.AllDirectories);
// only these files need to be saved.
string[] modelFilePaths =
@yaeldekel commented Oct 4, 2018

On `modelFilePaths`:

Is this a change in the serialization format, or just a more explicit way of writing it? #Resolved

@zeahmed (Contributor, Author) replied:

No change in the format. Previously, our assumption was that the folder would contain only these files, so we were writing all the files in the folder. Now the folder may contain archive files due to retraining, so I am explicitly writing the required files.


In reply to: 222727569

if (args.LossOperation != null)
pch.SetHeader(new ProgressHeader("Average Loss"), (e) => e.SetProgress(1, loss / ((cursor.Position + 1) / args.BatchSize)));
if (args.MetricOperation != null)
pch.SetHeader(new ProgressHeader("Average Metric"), (e) => e.SetProgress(2, metric / ((cursor.Position + 1) / args.BatchSize)));
@Ivanidzo4ka (Contributor) commented Oct 4, 2018

/// A common interface for progress reporting.
/// It is expected that the progress channel interface is used from only one thread.
///
/// Supported workflow:
/// 1) Create the channel via .
/// 2) Call as many times as desired (including 0).
/// Each call to supersedes the previous one.
/// 3) Report checkpoints (0 or more) by calling .
/// 4) Repeat steps 2-3 as often as necessary.
/// 5) Dispose the channel.

See 2).

I think you need to do something like this:

pch.SetHeader(new ProgressHeader(metrics.ToArray(), new[] { "Epochs" }), entry => entry.SetProgress(0, epoch, _args.Epochs));

and after each training pass:

if (metrics.Count > 0)
    pch.Checkpoint(avgMetrics);

where metrics and avgMetrics depend on your arguments.

#Closed
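The suggested workflow boils down to: set the header once before training, then checkpoint averaged metrics after each pass. A Python sketch of that bookkeeping (hypothetical names; ML.NET's actual progress channel is a C# interface):

```python
# Sketch of the suggested progress-reporting pattern: one SetHeader up front,
# then one Checkpoint of averaged metrics per epoch.
class SketchProgressChannel:
    def __init__(self):
        self.header = None
        self.checkpoints = []

    def set_header(self, metric_names):
        # Called before training; each later call supersedes the previous one.
        self.header = metric_names

    def checkpoint(self, values):
        self.checkpoints.append(tuple(values))

def train_with_progress(pch, epoch_losses):
    pch.set_header(["Average Loss"])      # step 2 of the workflow: once
    for losses in epoch_losses:           # one list of per-batch losses per epoch
        avg = sum(losses) / len(losses)
        pch.checkpoint([avg])             # step 3: checkpoint after each epoch
```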

@Ivanidzo4ka (Contributor) left a review comment:

:shipit:

// Clear offset table
IntPtr dst = TF_TensorData(handle);
Marshal.WriteInt64(dst, 0);
var status = new TFStatus();
@yaeldekel commented Oct 4, 2018

On `status`:

using (var status ...) #Resolved

@@ -290,7 +290,7 @@ public TFStatus() : base(TF_NewStatus())

// extern void TF_DeleteStatus (TF_Status *);
[DllImport(NativeBinding.TensorFlowLibrary)]
private static extern unsafe void TF_DeleteStatus(TF_Status status);
internal static extern unsafe void TF_DeleteStatus(TF_Status status);
@yaeldekel commented Oct 4, 2018

On `internal`:

Can this one and TF_GetCode go back to being private? #Closed


private (int, bool, TFDataType, TFShape) GetInputMetaData(ISchema inputSchema, string columnName, string tfNodeName, int batchSize)
{
if (!inputSchema.TryGetColumnIndex(columnName, out int inputColIndices))
@yaeldekel commented Oct 5, 2018

On `inputColIndices`:

inputColIndex #Resolved

var index = inputsForTraining.Length - 1;
inputsForTraining[index] = args.TensorFlowLabel;
(inputColIndices[index], isInputVector[index], tfInputTypes[index], tfInputShapes[index]) =
GetInputMetaData(inputSchema, args.LabelColumn, inputsForTraining[index], args.BatchSize);
@yaeldekel commented Oct 5, 2018

On `LabelColumn`:

Check that this is not null. #Resolved

@zeahmed (Contributor, Author) replied:

Added that check in the CheckTrainingParameters method.


In reply to: 223085996


return (inputColIndices, isInputVector, tfInputType, tfInputShape);
}
private void TrainCore(Arguments args, string model, IDataView input)
@yaeldekel commented Oct 5, 2018

On `private`:

Add an empty line. #Resolved

var srcTensorGetters = GetTensorValueGetters(cursor, inputColIndices, isInputVector, tfInputTypes, tfInputShapes);

float loss=0;
float metric=0;
@yaeldekel commented Oct 5, 2018

Ctrl+k+d. #Resolved

@yaeldekel commented Oct 5, 2018

/// <include file='doc.xml' path='doc/members/member[@name="TensorflowTransform"]/*' />

Please add documentation about this new feature. #Resolved


Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:39 in 9c07a53.

}, new MultiFileSource(dataPath));

IDataView trans = CategoricalTransform.Create(env, loader, "OneHotLabel","Label");
trans = NormalizeTransform.CreateMinMaxNormalizer(env, trans, "Features", "Placeholder");
@yaeldekel commented Oct 5, 2018

On `CreateMinMaxNormalizer`:

Why do we add a normalizer here? #Resolved

@zeahmed (Contributor, Author) replied:

This is a simple TensorFlow LR model which needs normalization; otherwise the gradients blow up.


In reply to: 223087660
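Min-max normalization rescales each feature linearly onto [0, 1], which keeps raw pixel values like 0–255 from producing huge gradients in the LR model. A quick sketch of the idea in plain Python (an illustration, not ML.NET's NormalizeTransform):

```python
# Min-max normalization: map each feature onto [0, 1] using the observed
# min and max, which tames gradient magnitudes during training.
def min_max_normalize(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column: nothing to scale
    return [(x - lo) / (hi - lo) for x in column]
```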


var trainedTfDataView = TensorFlowTransform.Create(env, args, trans);
trans = new ConcatTransform(env, "Features", "Prediction").Transform(trainedTfDataView);
//trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);
@yaeldekel commented Oct 5, 2018

On `//trans = new CopyColumnsTransform(env, ("LabelOriginal", "Label")).Transform(trans);`:

Delete. #Closed

@zeahmed zeahmed merged commit d2ed0ad into dotnet:master Oct 8, 2018
@zeahmed zeahmed self-assigned this Oct 9, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
5 participants