ComputeWorker

SPMD in the browser

Created by Fabien Cellier

Under Construction

Important

I don't care about any particular technology; all I want is a solution (a good one). If someone has a better idea (with arguments) or wants to discuss it, I am always available :-).

Plan

  • Why ComputeWorker?

    Because today's solutions don't address all the issues

  • Add SPMD to webworkers
  • Change asm.js for parallelism? (or another subset?)

  • Implementation details?

    GPGPU is never far away: a transcompiler from asm.js to OpenCL C

    Studies on doing the same directly in IonMonkey

Why ComputeWorker?

Parallelism?

Kinds of parallelism: data and task

  • Task parallelism
    -> WebWorkers
  • Data parallelism (often called SIMD)
    -> ParallelJS/River Trail (, SSE...)
  • Single program, multiple data (SPMD)
    -> WebCL?, ???

Why ComputeWorker?

Parallelism?

We need parallelism for two purposes:

  • avoid freezing web pages during long computations

  • efficiency on heavy computations:
    • simulations
    • A.I.
    • physics
    • video/audio processing or detection
    • ...

Why ComputeWorker?

Parallelism?

We have some requirements:

  • Simple fallbacks
  • Easy to implement
  • (Compatible with GPGPU in the future)*

* Interaction with WebGL

Why ComputeWorker?

Why not just WebWorkers?

  • They do not share data
  • Splitting SPMD and task parallelism is cleaner (imho)
  • We need a way to automatically pick the right number of "threads"

    (this depends on the device, the number of cores, ... -> it should be handled by the JavaScript VM, not by the user)

Why ComputeWorker?

Why not just ParallelJS?

  • Hard to implement SPMD with a "map/reduce" API (see the sketch after this list)
  • Some restrictions on the map function, no interaction with the DOM, ...
  • Was not designed to interact with WebGL
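
A rough sketch of the difference, assuming a River Trail-style ParallelArray map (the exact ParallelJS API may differ); getItemId and the shared/out arrays are only illustrative SPMD names:

// Data parallelism: each call sees exactly one element and no shared scratch memory.
var doubled = new ParallelArray([1, 2, 3, 4]).map(function (x) { return x * 2; });

// SPMD: every instance runs the same program, picks its own index,
// and can read its neighbours in a shared array (e.g. a 1D blur/stencil).
function kernel(shared, out) {
    var i = getItemId();                              // illustrative: work-item index
    var left  = shared[Math.max(i - 1, 0)];
    var right = shared[Math.min(i + 1, shared.length - 1)];
    out[i] = (left + shared[i] + right) / 3;          // needs neighbours: awkward with a pure map
}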

Why ComputeWorker?

Why not WebCL?

A big investment:

  • The transcompiler is hard to implement
  • It relies on OpenCL drivers
  • It is hard to provide a fallback when the API is not present in the browser (or in the VM)

The API is good (a little complex), but the cost is high

80% of the features of WebCL could be provided by ComputeWorkers, which are much easier to use

Why ComputeWorker?

Why not WebCL? (2) (still under consideration)

  • The transcompiler is hard to implement

    -> WebCL could compile to a safe language, but the compiler itself must also be safe!

  • It relies on OpenCL drivers

    -> with ComputeWorker we already have 80% of the fallback

  • It is hard to provide a fallback when the API is not present in the browser (or in the VM)

    -> with ComputeWorker we already have 80% of the fallback

ComputeWorker seems to be the best first step (the risk is limited, the investment too, and it will give WebCL a solid safety basis)

Why ComputeWorker?

Summing up the goals

  • add SPMD to browsers
  • be efficient (because SPMD is only used for efficiency)
  • be able to use many implementations (JIT, OpenCL, CUDA, DirectCompute)
  • have a fallback
  • be able to interact with WebGL (-> OpenGL, or DirectX/ANGLE)
  • be easy to maintain
  • introduce no changes to JavaScript (no new concepts / no concurrent accesses...)
  • allow Emscripten to use more kinds of threads

Plan

Add SPMD to webworkers

ComputeWorker

An API similar to WebWorkers

The code uses asm.js, with some limitations

Add SPMD to webworkers

Advantages

  • easy fallback (sequential: less than 100 lines of JavaScript; see the sketch below)
  • does not depend on a single technology

    could be implemented in OpenCL, DirectCompute, CUDA (, a JIT?)

  • can interact with WebGL directly
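
A minimal sketch of such a sequential fallback, assuming the ComputeWorker API shown on the next slides; the kernel is passed as a plain function and receives its work-item id as a parameter instead of through getItemId, purely for illustration:

// Runs the SPMD kernel n times in a loop instead of in parallel.
function SequentialComputeWorker(kernel) {
    this.onmessage = null;
    this.post = function (data, n) {
        var results = [];
        for (var id = 0; id < n; id++) {
            // Each "work-item" sees the shared data plus its own id.
            results.push(kernel(data, id));
        }
        if (this.onmessage) this.onmessage({ data: results });
    };
}

// Usage: same shape as the proposed API, but everything runs on one thread.
var pw = new SequentialComputeWorker(function (array, id) { array[id] += id; return array[id]; });
pw.onmessage = function (e) { /* e.data: one result per work-item */ };
pw.post(new Int32Array(4), 4);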

Add SPMD to webworkers

the API, main page code


var pw = new ComputeWorker("source.js");
pw.post(data, /* typedArray or scalar;
                 typedArray with a mutex if it's a sharedMemory (new type) */
        n,    /* number of tasks which could be launched in parallel */
        ownership);
pw.onmessage = function (oEvent) {
   oEvent.data; /* typedArray or scalar */
};
            

Add SPMD to webworkers

the API, main page code (2)


// interaction with WebGL
var pw = new ComputeWorker("source.js", webGLContext);
pw.post(webGLMemoryObject, n);
pw.onmessage = function (oEvent) {
   oEvent.data; /* typedArray or scalar */
};
            

Add SPMD to webworkers

the API, worker


// only the asm.js API; postMessage and onmessage are provided in the worker
"use pasm"; // parallel asm.js
"use GPU";  // hint to use GPGPU if possible

var priv = new Int32Array(16);         // private memory, accessible to only one "thread";
                                       // can be seen as the heap/stack,
                                       // no copy needed since we already know its accessibility
var sharedMem = new BufferArray(16|0); // global memory shared between threads
                                       // (transactional memory?)

function test(array, scalar) {
    scalar = scalar | 0;               // hint for the type
    // BufferArrays from the main page live in "global memory"
    // and are shared between "threads"
    array = new Int16Array(array);     // hint for the type
    var id = getItemId() | 0;          // id of the thread, between 0 and n-1, like OpenCL
    array[id] += id;
    return array;                      // no postMessage needed
}
onmessage = test;
            

Plan

Change asm.js for parallelism?

Restrictions

  • no FFI

    hard to interact with native binaries when using GPGPU drivers

  • no DataView on the heap

    not a problem because the heap is IN the worker; it can be sent to the main page if needed

That's why we are in a worker!

Change asm.js for parallelism?

Restrictions (from OpenCL/DirectCompute)

  • no function pointers, no nested function definitions

    possible

  • no recursion?

    but this allows more optimisation during compilation and no runtime checks on the heap

  • no label statements
  • Note that we can emulate recursion and function pointers in OpenCL and DirectCompute (see the sketch after this list)

    with a combination of a "frame pointer", break/continue and a switch to emulate addresses, but the heap has a static length
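
A minimal sketch of that emulation, written in JavaScript for readability (the factorial example and all names are only illustrative): the call stack becomes an explicit array of fixed length, the "return address" becomes a state number, and a loop with a switch replaces the recursive calls.

// factorial(n) without recursion: explicit stack + switch-based control flow
function factorial(n) {
    var stack = new Int32Array(32); // static-length storage for the frames
    var sp = 0;                     // "frame pointer"
    var state = 0;                  // emulated "address": 0 = call, 1 = return
    var result = 1;
    while (true) {
        switch (state) {
            case 0:                          // entering factorial(n)
                if (n <= 1) { state = 1; break; }
                stack[sp] = n; sp = sp + 1;  // push the frame
                n = n - 1;                   // the "recursive call"
                break;
            case 1:                          // returning from a call
                if (sp === 0) return result;
                sp = sp - 1;                 // pop the frame
                result = result * stack[sp];
                break;
        }
    }
}
// factorial(5) === 120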

Change asm.js for parallelism?

Additions

  • By default, all variables are "thread-local"

  • new functions in the stdlib (getItemId, to know the id of the thread within the task)
  • Only typed arrays can live in "global" memory and be shared between "threads"/work-items (see the sketch below)
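
A small sketch of these additions, with illustrative names only (the exact stdlib surface is still open):

"use pasm";
var scratch = new Float32Array(8);     // thread-local by default: each work-item gets its own copy
function kernel(shared) {
    shared = new Float32Array(shared); // typed array from the main page: "global", shared memory
    var id = getItemId() | 0;          // proposed stdlib addition: work-item id in [0, n-1]
    scratch[0] = shared[id] * 2.0;     // private writes never race
    shared[id] = scratch[0];           // each work-item writes only its own slot: no race either
    return shared[id];
}
onmessage = kernel;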

Plan

  • Why ComputeWorker?

  • Add SPMD to webworkers
  • Subset of asm.js for parallelism?

  • Implementation details?

    GPGPU is never far away: a transcompiler from asm.js to OpenCL C

    Studies on doing the same directly in IonMonkey

Implementation details?

Private memory should be as small as possible:

global memory is used when private memory is too large (and global memory is much slower)
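
A small illustration, with purely illustrative sizes (the actual spill threshold depends on the GPU and the driver):

"use pasm";
var small = new Float32Array(16);    // fits in private/register memory: fast
var big = new Float32Array(65536);   // too large for private memory: a GPGPU backend would
                                     // have to spill it to global memory, which is much slower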

Implementation details?