ComputeWorker

SPMD in the browser

Created by Fabien Cellier

Under Construction

Important

I don't care about any particular technology; all I want is a solution (a good one). If someone has a better idea (with arguments) or wants to discuss it, I am always available :-).

Plan

  • Why ComputeWorker?

    Because today's solutions don't address all the issues

  • Add SPMD to webworkers
  • Change asm.js for parallelism? (or another subset?)

  • Implementation details?

    GPGPU is never far away: a transcompiler from asm.js to OpenCL C

    Studies on doing the same directly in IonMonkey

Why ComputeWorker?

Parallelism?

Kinds of parallelism: data and task

  • Task parallelism
    -> WebWorkers
  • Data parallelism (often called SIMD)
    -> ParallelJS/River Trail (, SSE...)
  • Single program, multiple data (SPMD)
    -> WebCL?, ???

Why ComputeWorker?

Parallelism?

We need parallelism for two purposes:

  • avoid freezing web pages during long computations

  • efficiency on heavy computations:
    • simulations
    • A.I.
    • physics
    • video/audio processing or detection
    • ...

Why ComputeWorker?

Parallelism?

We have some requirements:

  • Simple fallbacks
  • Easy to implement
  • (Compatible with GPGPU in the future)*

* Interaction with WebGL

Why ComputeWorker?

Why not just WebWorkers?

  • They do not share data
  • Splitting SPMD and task parallelism is cleaner (imho)
  • We need a way to automatically pick the right number of "threads"

    (this depends on the device, the number of cores, ... -> it should be handled by the JavaScript VM, not by the user)

Why ComputeWorker?

Why not just ParallelJS?

  • Hard to implement SPMD with a "map/reduce" API (see the sketch after this list)
  • Some restrictions on the map function, no interaction with the DOM, ...
  • Was not designed to interact with WebGL
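
A rough sketch of the difference, assuming a River Trail-style ParallelArray map (the exact ParallelJS API may differ); getItemId and the shared/out arrays are only illustrative SPMD names:

// Data parallelism: each call sees exactly one element and no shared scratch memory.
var doubled = new ParallelArray([1, 2, 3, 4]).map(function (x) { return x * 2; });

// SPMD: every instance runs the same program, picks its own index,
// and can read its neighbours in a shared array (e.g. a 1D blur/stencil).
function kernel(shared, out) {
    var i = getItemId();                              // illustrative: work-item index
    var left  = shared[Math.max(i - 1, 0)];
    var right = shared[Math.min(i + 1, shared.length - 1)];
    out[i] = (left + shared[i] + right) / 3;          // needs neighbours: awkward with a pure map
}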

Why ComputeWorker?

Why not WebCL?

A big investment:

  • The transcompiler is hard to implement
  • It relies on OpenCL drivers
  • It is hard to provide a fallback when the API is not present in the browser (or in the VM)

The API is good (a little complex), but the cost is high

80% of the features of WebCL could be provided by ComputeWorkers, which are much easier to use

Why ComputeWorker?

Why not WebCL? (2) (still under consideration)

  • The transcompiler is hard to implement

    -> WebCL could compile to a safe language, but the compiler itself must also be safe!

  • It relies on OpenCL drivers

    -> with ComputeWorker we already have 80% of the fallback

  • It is hard to provide a fallback when the API is not present in the browser (or in the VM)

    -> with ComputeWorker we already have 80% of the fallback

ComputeWorker seems to be the best first step (the risk is limited, the investment too, and it will give WebCL a solid safety basis)

Why ComputeWorker?

Summing up the goals

  • add SPMD to browsers
  • be efficient (because SPMD is only used for efficiency)
  • be able to use many implementations (JIT, OpenCL, CUDA, DirectCompute)
  • have a fallback
  • be able to interact with WebGL (-> OpenGL, or DirectX/ANGLE)
  • be easy to maintain
  • introduce no changes to JavaScript (no new concepts / no concurrent accesses...)
  • allow Emscripten to use more kinds of threads

Plan

Add SPMD to webworkers

ComputeWorker

An API similar to WebWorkers

The code uses asm.js, with some limitations

Add SPMD to webworkers

Advantages

  • easy fallback (sequential: less than 100 lines of JavaScript; see the sketch below)
  • does not depend on a single technology

    could be implemented in OpenCL, DirectCompute, CUDA (, a JIT?)

  • can interact with WebGL directly
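
A minimal sketch of such a sequential fallback, assuming the ComputeWorker API shown on the next slides; the kernel is passed as a plain function and receives its work-item id as a parameter instead of through getItemId, purely for illustration:

// Runs the SPMD kernel n times in a loop instead of in parallel.
function SequentialComputeWorker(kernel) {
    this.onmessage = null;
    this.post = function (data, n) {
        var results = [];
        for (var id = 0; id < n; id++) {
            // Each "work-item" sees the shared data plus its own id.
            results.push(kernel(data, id));
        }
        if (this.onmessage) this.onmessage({ data: results });
    };
}

// Usage: same shape as the proposed API, but everything runs on one thread.
var pw = new SequentialComputeWorker(function (array, id) { array[id] += id; return array[id]; });
pw.onmessage = function (e) { /* e.data: one result per work-item */ };
pw.post(new Int32Array(4), 4);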

Add SPMD to webworkers

the API, main page code


var pw = new ComputeWorker("source.js");
pw.post(data, /* typedArray or scalar;
                 typedArray with a mutex if it's a sharedMemory (new type) */
        n,    /* number of tasks which could be launched in parallel */
        ownership);
pw.onmessage = function (oEvent) {
   oEvent.data; /* typedArray or scalar */
};
            

Add SPMD to webworkers

the API, main page code (2)


// interaction with WebGL
var pw = new ComputeWorker("source.js", webGLContext);
pw.post(webGLMemoryObject, n);
pw.onmessage = function (oEvent) {
   oEvent.data; /* typedArray or scalar */
};
            

Add SPMD to webworkers

the API, worker


// only the asm.js API; postMessage and onmessage are provided in the worker
"use pasm"; // parallel asm.js
"use GPU";  // hint to use GPGPU if possible

var priv = new Int32Array(16);         // private memory, accessible to only one "thread";
                                       // can be seen as the heap/stack,
                                       // no copy needed since we already know its accessibility
var sharedMem = new BufferArray(16|0); // global memory shared between threads
                                       // (transactional memory?)

function test(array, scalar) {
    scalar = scalar | 0;               // hint for the type
    // BufferArrays from the main page live in "global memory"
    // and are shared between "threads"
    array = new Int16Array(array);     // hint for the type
    var id = getItemId() | 0;          // id of the thread, between 0 and n-1, like OpenCL
    array[id] += id;
    return array;                      // no postMessage needed
}
onmessage = test;
            

Plan

Change asm.js for parallelism?

Restrictions

  • no FFI

    hard to interact with native binaries when using GPGPU drivers

  • no DataView on the heap

    not a problem because the heap is IN the worker; it can be sent to the main page if needed

That's why we are in a worker!

Change asm.js for parallelism?

Restrictions (from OpenCL/DirectCompute)

  • no function pointers, no nested function definitions

    possible

  • no recursion?

    but this allows more optimisation during compilation and no runtime checks on the heap

  • no label statements
  • Note that we can emulate recursion and function pointers in OpenCL and DirectCompute (see the sketch after this list)

    with a combination of a "frame pointer", break/continue and a switch to emulate addresses, but the heap has a static length
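
A minimal sketch of that emulation, written in JavaScript for readability (the factorial example and all names are only illustrative): the call stack becomes an explicit array of fixed length, the "return address" becomes a state number, and a loop with a switch replaces the recursive calls.

// factorial(n) without recursion: explicit stack + switch-based control flow
function factorial(n) {
    var stack = new Int32Array(32); // static-length storage for the frames
    var sp = 0;                     // "frame pointer"
    var state = 0;                  // emulated "address": 0 = call, 1 = return
    var result = 1;
    while (true) {
        switch (state) {
            case 0:                          // entering factorial(n)
                if (n <= 1) { state = 1; break; }
                stack[sp] = n; sp = sp + 1;  // push the frame
                n = n - 1;                   // the "recursive call"
                break;
            case 1:                          // returning from a call
                if (sp === 0) return result;
                sp = sp - 1;                 // pop the frame
                result = result * stack[sp];
                break;
        }
    }
}
// factorial(5) === 120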

Change asm.js for parallelism?

Additions

  • By default, all variables are "thread-local"

  • new functions in the stdlib (getItemId, to know the id of the thread within the task)
  • Only typed arrays can live in "global" memory and be shared between "threads"/work-items (see the sketch below)
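
A small sketch of these additions, with illustrative names only (the exact stdlib surface is still open):

"use pasm";
var scratch = new Float32Array(8);     // thread-local by default: each work-item gets its own copy
function kernel(shared) {
    shared = new Float32Array(shared); // typed array from the main page: "global", shared memory
    var id = getItemId() | 0;          // proposed stdlib addition: work-item id in [0, n-1]
    scratch[0] = shared[id] * 2.0;     // private writes never race
    shared[id] = scratch[0];           // each work-item writes only its own slot: no race either
    return shared[id];
}
onmessage = kernel;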

Plan

  • Why ComputeWorker?

  • Add SPMD to webworkers
  • Subset of asm.js for parallelism?

  • Implementation details?

    GPGPU is never far away: a transcompiler from asm.js to OpenCL C

    Studies on doing the same directly in IonMonkey

Implementation details?

Private memory should be as small as possible:

global memory is used when private memory is too large (and global memory is much slower)
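
A small illustration, with purely illustrative sizes (the actual spill threshold depends on the GPU and the driver):

"use pasm";
var small = new Float32Array(16);    // fits in private/register memory: fast
var big = new Float32Array(65536);   // too large for private memory: a GPGPU backend would
                                     // have to spill it to global memory, which is much slower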

Implementation details?