\( \def\sc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}} \newcommand{\dee}{\mathrm{d}} \newcommand{\Dee}{\mathrm{D}} \newcommand{\In}{\mathrm{in}} \newcommand{\Out}{\mathrm{out}} \newcommand{\pdf}{\mathrm{pdf}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ve}[1]{\mathbf{#1}} \newcommand{\mrm}[1]{\mathrm{#1}} \newcommand{\etal}{{et~al.}} \newcommand{\sphere}{\mathbb{S}^2} \newcommand{\modeint}{\mathcal{M}} \newcommand{\azimint}{\mathcal{N}} \newcommand{\ra}{\rightarrow} \newcommand{\mcal}[1]{\mathcal{#1}} \newcommand{\X}{\mathcal{X}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\Z}{\mathcal{Z}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\z}{\mathbf{z}} \newcommand{\tr}{\mathrm{tr}} \newcommand{\sgn}{\mathrm{sgn}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\Real}{\mathbb{R}} \newcommand{\sseq}{\subseteq} \newcommand{\ov}[1]{\overline{#1}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \newcommand{\data}{\mathrm{data}} \newcommand{\N}{\mathcal{N}} \)

Understanding TensorFlow.js

I would like to run my "Talking Head Anime from a Single Image" network in web browsers and on mobile devices. My networks were trained with PyTorch. The current plan is to convert the model to TensorFlow.js (TFJS) and try to run it in web browsers. If this goes well, I can use the TFJS model to develop mobile applications with tools like Flutter or React Native, and desktop applications with Electron.

However, the main problem is that my network uses layers that are not yet implemented in TFJS: affine_grid, grid_sample, and instance_norm. I do not think I can rely on the TFJS team to implement them, so I will have to do it myself.

Implementing new layers requires an understanding of how the underlying software works, and these notes were written as I tried to gain that knowledge. Because I am interested in implementing inference, not training, I will not cover how gradients are computed in TFJS.


1   Related Terms

1.1   Engine

The engine is the object that implements all TFJS functionality. It manages memory for tensor objects and executes mathematical computations. There seems to be only one engine present at a time; it is accessible through the global variable ENGINE and also by calling tf.engine(). Nevertheless, end users of TFJS rarely use it directly.

1.2   Backends

A backend is the part of the engine that carries out two functions: storing tensor data, and executing kernels (backend-specific implementations of operations).

When our JavaScript program runs in a web browser, it can use two backends: the CPU backend and the WebGL backend, both covered later in this note. There are other backends that can be used when we run our program with Node.js. However, we do not care about them in this note.

A backend is represented by the KernelBackend class. It implements the TensorStorage interface, which allows one to read, write, and manage memory for tensor data.

1.3   Ops

An operation (op) is a function that takes a number of input tensors and produces a number of output tensors. It is an abstract operation that can have different implementations in different backends. Examples of ops include square, conv2d, and matMul. Ops are defined in the tfjs-core package.

1.4   Kernels

A kernel is a backend-specific implementation of an op. A kernel can be executed by calling the runKernel function of the engine.

To define a custom kernel, we first need to create a KernelConfig object, where we must specify the kernel's name, the name of the backend it runs on, and the function that performs the computation.

A kernel is identified by a lookup key made up of its name and its backend. TFJS maintains a global dictionary, called the kernelRegistry, that maps each key to the corresponding KernelConfig. The registerKernel function can be used to add new kernels to the registry.
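The name-plus-backend lookup can be pictured with a minimal sketch. The exact key format used by tfjs-core is not reproduced here; the `name_backend` scheme and the simplified config shape below are assumptions made for illustration.

```typescript
// Minimal sketch of a kernel registry keyed by (kernel name, backend name).
// The real KernelConfig carries a richer KernelFunc; this one is simplified.
interface KernelConfigSketch {
  kernelName: string;
  backendName: string;
  kernelFunc: (inputs: number[]) => number[];
}

const kernelRegistry = new Map<string, KernelConfigSketch>();

// Hypothetical key scheme: the actual one in tfjs-core may differ.
function makeKey(kernelName: string, backendName: string): string {
  return `${kernelName}_${backendName}`;
}

function registerKernel(config: KernelConfigSketch): void {
  const key = makeKey(config.kernelName, config.backendName);
  if (kernelRegistry.has(key)) {
    throw new Error(`Kernel ${key} is already registered`);
  }
  kernelRegistry.set(key, config);
}

function getKernel(kernelName: string, backendName: string):
    KernelConfigSketch | undefined {
  return kernelRegistry.get(makeKey(kernelName, backendName));
}

// The same op name can map to different implementations per backend.
registerKernel({
  kernelName: 'Square',
  backendName: 'cpu',
  kernelFunc: xs => xs.map(x => x * x),
});
```

Registering the same op under a different backend name would simply create a second entry with a different key, which is how one op gets per-backend implementations.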


2   The CPU Backend

2.1   Data Storage

Tensor data is represented by the TensorData generic class. It contains the raw values, the data type, and bookkeeping fields such as a reference count.

The data storage is implemented by the DataStorage<TensorData<DataType>> generic class, which contains a WeakMap from DataId (an arbitrary object) to TensorData<DataType>.

The backend maintains an increasing integer nextDataId, which is incremented every time new data is written to the backend through the write method.
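The write path described above can be sketched as follows. The class and type names here (CpuStorageSketch, DataIdSketch) are hypothetical, not the actual tfjs-backend-cpu API; the sketch only mirrors the counter-plus-map structure.

```typescript
// Sketch of backend data storage: an increasing nextDataId counter, and a
// WeakMap from data-id objects to the stored values.
type DataIdSketch = {id: number};

class CpuStorageSketch {
  private nextDataId = 0;
  private data =
      new WeakMap<DataIdSketch, {values: Float32Array, shape: number[]}>();

  // Each write mints a fresh DataId, mirroring the nextDataId counter
  // described above.
  write(values: Float32Array, shape: number[]): DataIdSketch {
    const dataId = {id: this.nextDataId++};
    this.data.set(dataId, {values, shape});
    return dataId;
  }

  read(dataId: DataIdSketch): Float32Array | undefined {
    return this.data.get(dataId)?.values;
  }
}
```

Using a WeakMap means that once nothing else holds a DataId object, its tensor data becomes eligible for garbage collection.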

2.2   Data Layout

From BatchMatMul.ts, it seems that the tensor data is stored in row major order.
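Row-major order means the last dimension varies fastest in memory. A small sketch of the flat-index computation, assuming this layout:

```typescript
// Row-major flat indexing: strides grow from the last dimension backwards,
// so the last index varies fastest in the flat array.
function flatIndex(indices: number[], shape: number[]): number {
  let index = 0;
  let stride = 1;
  for (let d = shape.length - 1; d >= 0; --d) {
    index += indices[d] * stride;
    stride *= shape[d];
  }
  return index;
}
```

For a tensor of shape [2, 3, 4], element [1, 2, 3] lands at flat position 1*12 + 2*4 + 3 = 23.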

For the 2D convolution operation, the input data can be stored in either the NCHW or the NHWC layout; this can be specified as an input to the Conv2D op. The filter weights, however, are stored in the [filterHeight, filterWidth, inDepth, outDepth] layout [LINK].

Nevertheless, the transposed convolution operation only supports input in the NHWC format [LINK]. It is, however, possible to invoke the underlying kernel, conv2DBackpropInput, with the 'NCHW' format.
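Because of mismatches like this, one sometimes needs to shuffle data between the two layouts. A sketch of an NCHW-to-NHWC conversion on a flat row-major array (the function name is mine, not a TFJS API):

```typescript
// Convert a flat row-major tensor from NCHW order to NHWC order. Useful
// when an op only accepts NHWC input but the data arrives as NCHW.
function nchwToNhwc(values: Float32Array,
                    [n, c, h, w]: [number, number, number, number]):
    Float32Array {
  const out = new Float32Array(values.length);
  for (let b = 0; b < n; ++b) {
    for (let ch = 0; ch < c; ++ch) {
      for (let y = 0; y < h; ++y) {
        for (let x = 0; x < w; ++x) {
          const src = ((b * c + ch) * h + y) * w + x;  // NCHW flat index
          const dst = ((b * h + y) * w + x) * c + ch;  // NHWC flat index
          out[dst] = values[src];
        }
      }
    }
  }
  return out;
}
```

For shape [1, 2, 1, 2], the NCHW data [0, 1, 2, 3] becomes [0, 2, 1, 3] in NHWC: channel values for the same pixel end up adjacent.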

2.3   Kernel Implementation

An example kernel implementation is given here. We will study it in more detail.

2.3.1   Understanding KernelConfig

First, we need to construct an instance of KernelConfig, whose type definition is given below:

/** Config object for registering a kernel in the global registry. */
export interface KernelConfig {
    kernelName: string;
    backendName: string;
    kernelFunc: KernelFunc;
    setupFunc?: KernelSetupFunc;
    disposeFunc?: KernelDisposeFunc;
}

We see that we must provide (1) the kernel name, (2) the backend name (i.e., "cpu"), and (3) the KernelFunc. The main bulk of work is in the KernelFunc, whose type definition is:

/** Specifies the code to run when executing a kernel. */
export type KernelFunc = (params: {
    inputs: NamedTensorInfoMap,
    backend: {},
    attrs?: NamedAttrMap,
}) => TensorInfo|TensorInfo[];

Let's go through the types involved in the KernelFunc.

2.3.2   Writing a KernelFunc

Now, let's look at the implementation of the Square kernel.

import {Square, SquareInputs} from '@tensorflow/tfjs-core';
import {KernelConfig} from '@tensorflow/tfjs-core';
import {MathBackendCPU} from '../backend_cpu';
import {assertNotComplex} from '../cpu_util';

export const squareConfig: KernelConfig = {
  kernelName: Square,
  backendName: 'cpu',
  kernelFunc: ({inputs, backend}) => {
    const {x} = inputs as SquareInputs;
    const cpuBackend = backend as MathBackendCPU;
    assertNotComplex(x, 'square');

    const values = cpuBackend.data.get(x.dataId).values as Float32Array;
    const newValues = new Float32Array(values.length);
    for (let i = 0; i < values.length; ++i) {
      const value = values[i];
      newValues[i] = value * value;
    }
    const dataId = cpuBackend.write(newValues, x.shape, x.dtype);
    return {dataId, shape: x.shape, dtype: x.dtype};
  }
};

We see that there are a number of steps to follow.

  1. Cast the inputs field of the input object to the specific input type of the kernel and then extract the relevant individual inputs.

    const {x} = inputs as SquareInputs;

    Here, we see that x is a TensorInfo object that represents the input tensor.
  2. Cast the backend field to the specific backend that the kernel works with.

    const cpuBackend = backend as MathBackendCPU;

  3. Retrieve the tensor data from the backend.

    const values = cpuBackend.data.get(x.dataId).values as Float32Array;

  4. Compute the output data.

    const newValues = new Float32Array(values.length);
    for (let i = 0; i < values.length; ++i) {
        const value = values[i];
        newValues[i] = value * value;
    }

  5. Write the new data to the backend.

    const dataId = cpuBackend.write(newValues, x.shape, x.dtype);

    Note that the write function of all backends is supposed to return a new DataId.
  6. Create an object satisfying the TensorInfo interface and return it as output.

    return {dataId, shape: x.shape, dtype: x.dtype};
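The six steps can be re-enacted end to end against a toy backend. All the names below (FakeCpuBackend, FakeTensorInfo, etc.) are hypothetical stand-ins; the real MathBackendCPU API differs, but the flow is the same: read input, compute, write output, return a TensorInfo-like object.

```typescript
// Toy re-enactment of the six kernel-writing steps with a fake backend.
type FakeDataId = {id: number};
interface FakeTensorInfo { dataId: FakeDataId; shape: number[]; dtype: string; }

class FakeCpuBackend {
  private nextId = 0;
  data = new Map<FakeDataId, Float32Array>();
  write(values: Float32Array): FakeDataId {
    const dataId = {id: this.nextId++};  // a write mints a new DataId
    this.data.set(dataId, values);
    return dataId;
  }
}

function squareKernelFunc(x: FakeTensorInfo,
                          backend: FakeCpuBackend): FakeTensorInfo {
  const values = backend.data.get(x.dataId)!;         // step 3: read input
  const newValues = new Float32Array(values.length);  // step 4: compute
  for (let i = 0; i < values.length; ++i) {
    newValues[i] = values[i] * values[i];
  }
  const dataId = backend.write(newValues);            // step 5: write output
  return {dataId, shape: x.shape, dtype: x.dtype};    // step 6: TensorInfo
}
```

Note how the output gets a fresh DataId while reusing the input's shape and dtype, just as in the real Square kernel above.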

2.4   Registering the Kernel

After creating the KernelConfig, we need to call the registerKernel function on it. For the CPU backend, all kernels are registered in register_all_kernels.ts.


3   The WebGL Backend

3.1   Data Storage

The WebGL backend is implemented by the MathBackendWebGL class. It has a field called texData of type DataStorage<TextureData>, meaning that the TextureData interface represents the storage of texture data. Here's its definition:

export interface TextureData {
    // Required.
    shape: number[];
    dtype: DataType;

    // Optional.
    values?: backend_util.BackendValues;
    texture?: WebGLTexture;
    // For complex numbers, the real and imaginary parts are stored as their own
    // individual tensorInfos, with a parent joining the two with the
    // complexTensors field. When this is defined, texture will be null.
    complexTensorInfos?: {real: TensorInfo, imag: TensorInfo};
    /** [rows, columns] shape of the texture. */
    texShape?: [number, number];
    usage?: TextureUsage;
    isPacked?: boolean;

    refCount: number;

    // Available when the tensor has been sliced.
    slice?: {
        // Offset in the 'flat index' space.
        flatOffset: number;
        // Used for counting how many sliced tensors point to the same texture.
        origDataId: DataId;
    };
}

Let's dig into some of the fields.

3.2   GPGPU Program

Computation is carried out by running shaders on the GPU. This is done through the runWebGLProgram method of the MathBackendWebGL class. Here's the method's signature.

runWebGLProgram(
    program: GPGPUProgram, 
    inputs: TensorInfo[], 
    outputDtype: DataType,
    customSetup?: (gpgpu: GPGPUContext, webGLProgram: WebGLProgram) => void,
    preventEagerUnpackingOfOutput = false): TensorInfo    

Here's how the GPGPUProgram interface is defined.

export interface GPGPUProgram {
    variableNames: string[];
    outputShape: number[];
    userCode: string;
    /** If true, this program expects packed input textures. Defaults to false. */
    packedInputs?: boolean;
    /** If true, this program produces a packed texture. Defaults to false. */
    packedOutput?: boolean;
    /**
     * Affects what type of texture we allocate for the output. Defaults to
     * `TextureUsage.RENDER`.
     */
    outTexUsage?: TextureUsage;
    /**
     * The type of scheme to use when packing texels for the output values.
     * See `PackingScheme` for details. Defaults to `PackingScheme.SHARED_BATCH`.
     */
    outPackingScheme?: PackingScheme;
} 

PackingScheme is defined as follows.

export enum PackingScheme {
    /**
    * All values in a single texel are densely packed without any constraints.
    *
    * This is how the shader encodes a tensor with shape = [2, 3, 4]
    * (indices are [batch, row, col]).
    *
    * 000|001   010|011   020|021
    * -------   -------   -------
    * 002|003   012|013   022|023
    *
    * 100|101   110|111   120|121
    * -------   -------   -------
    * 102|103   112|113   122|123
    *
    */
    DENSE,
    
    /**
    * Single texels contain only values from the same batch, and from adjacent
    * rows and columns.
    *
    * This is how the shader encodes a tensor with shape = [2, 3, 5]
    * (indices are [batch, row, col]).
    *
    * 000|001   002|003   004|xxx   020|021   022|023   024|xxx
    * -------   -------   -------   -------   -------   -------
    * 010|011   012|013   014|xxx   xxx|xxx   xxx|xxx   xxx|xxx
    *
    * 100|101   102|103   104|xxx   120|121   122|123   124|xxx
    * -------   -------   -------   -------   -------   -------
    * 110|111   112|113   114|xxx   xxx|xxx   xxx|xxx   xxx|xxx
    *
    */
    SHARED_BATCH
}

Now, I searched for SHARED_BATCH across the repository, and it seems that it is not used anywhere else, despite the comment documenting it as the default value.
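The SHARED_BATCH diagram above can be made concrete with a small sketch: within one batch, each texel holds a 2x2 block of adjacent rows and columns, with the 'xxx' cells padded by zeros. This is my own illustrative reading of the diagram, not code from TFJS.

```typescript
// Pack one batch of a [rows, cols] matrix into 2x2-per-texel blocks:
// channels [r, g] hold one row of the block, [b, a] hold the row below.
// Out-of-range entries (the 'xxx' cells) are padded with zero.
function pack2x2(matrix: number[][], rows: number, cols: number): number[][] {
  const texels: number[][] = [];
  for (let r = 0; r < rows; r += 2) {
    for (let c = 0; c < cols; c += 2) {
      const at = (y: number, x: number) =>
          (y < rows && x < cols) ? matrix[y][x] : 0;
      texels.push([at(r, c), at(r, c + 1), at(r + 1, c), at(r + 1, c + 1)]);
    }
  }
  return texels;
}
```

For a 3x5 matrix this yields six texels, matching the 3x2 grid of texels per batch in the diagram, including the zero-padded last column and last row.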

3.2.1   Concrete GPGPUProgram Examples

Let's see an example of a concrete instance of GPGPUProgram.

3.2.1.1   Unary Operation Programs

The simplest one is the UnaryOpProgram.


export class UnaryOpProgram implements GPGPUProgram {
    variableNames = ['A'];
    userCode: string;
    outputShape: number[];
    
    constructor(aShape: number[], opSnippet: string) {
        this.outputShape = aShape;
        this.userCode = `
        float unaryOperation(float x) {
            ${opSnippet}
        }
        void main() {
            float x = getAAtOutCoords();
            float y = unaryOperation(x);
            setOutput(y);
        }
        `;
    }
}

Note that the user code is not a complete program. It is only a snippet that must be combined with another piece of code in order to make a functioning fragment shader. In particular, two functions are not defined: getAAtOutCoords, which reads the value of input A at the current output coordinates, and setOutput, which writes the output value.

Here are some samples of opSnippet found within the code file.

export const CHECK_NAN_SNIPPET = `if (isnan(x)) return x;`;

export const LINEAR = `return x;`;

export const ABS = `return abs(x);`;

export function STEP(alpha = 0.0) {
    return CHECK_NAN_SNIPPET + `
    return x > 0.0 ? 1.0 : float(${alpha});
    `;
}

export const ELU = `return (x >= 0.0) ? x : (exp(x) - 1.0);`;

export const RELU = CHECK_NAN_SNIPPET + `
    return (x < 0.0) ? 0.0 : x;
`;

However, there's another version, the UnaryOpPackedProgram.

export class UnaryOpPackedProgram implements GPGPUProgram {
    variableNames = ['A'];
    userCode: string;
    outputShape: number[];
    packedInputs = true;
    packedOutput = true;
    
    constructor(aShape: number[], opSnippet: string) {
        this.outputShape = aShape;
        this.userCode = `
        vec4 unaryOperation(vec4 x) {
            ${opSnippet}
        }
        void main() {
            vec4 x = getAAtOutCoords();
            vec4 y = unaryOperation(x);
            setOutput(y);
        }
        `;
    }
}

This time, getAAtOutCoords returns a vec4, and setOutput also accepts a vec4 instead of a float. Here are examples of opSnippet found in the same file.

export const LINEAR = `return x;`;
    
export const ELU = `
    vec4 result;
    result.r = (x.r >= 0.0) ? x.r : (exp(x.r) - 1.0);
    result.g = (x.g >= 0.0) ? x.g : (exp(x.g) - 1.0);
    result.b = (x.b >= 0.0) ? x.b : (exp(x.b) - 1.0);
    result.a = (x.a >= 0.0) ? x.a : (exp(x.a) - 1.0);
    return result;
`;

export const RELU = `
    vec4 result = x * vec4(greaterThanEqual(x, vec4(0.0)));
    bvec4 isNaN = isnan(x);
    result.r = isNaN.r ? x.r : result.r;
    result.g = isNaN.g ? x.g : result.g;
    result.b = isNaN.b ? x.b : result.b;
    result.a = isNaN.a ? x.a : result.a;
    return result;
`;

So, it would seem that, if the texture is packed, a texel stores four values (RGBA); otherwise, it stores only a single float.
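The packed ELU snippet above applies the same scalar rule to each of the four lanes. A sketch of that lane-wise pattern in TypeScript (the names Vec4, applyPacked, and elu are mine, used only to mirror the GLSL):

```typescript
// Lane-wise application of a scalar op over a 4-lane texel, mirroring how
// UnaryOpPackedProgram applies the op to each of the r, g, b, a channels.
type Vec4 = [number, number, number, number];

function applyPacked(op: (x: number) => number, x: Vec4): Vec4 {
  return [op(x[0]), op(x[1]), op(x[2]), op(x[3])];
}

// The packed ELU snippet above, written as a scalar function per lane.
const elu = (x: number) => (x >= 0.0) ? x : (Math.exp(x) - 1.0);
```

This is why the packed variants of the snippets are four near-identical lines: there is no cross-lane interaction in a unary op, only the same scalar formula repeated per channel.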


Last modified: 2021/05/08