Avoiding the Blue-Bubble: Why VR Collaboration Needs an Open Protocol
TL;DR: If tomorrow’s workplaces depend on VR, we can’t repeat the phone-messaging mistake where features only work inside one vendor’s walled garden. “Zoom-for-VR” or “WhatsApp-for-VR” will always leave someone out—or lock everyone in. The fix isn’t another app; it’s a protocol: an open, interoperable way to discover sessions, exchange avatars, synchronize shared 3D objects, and protect privacy across any headset or runtime. Think RCS-for-VR, not another silo.
The problem: VR’s looming “blue bubble”
In VR, the value is being able to stand around the same object—a design mock, a 3D scan of an arena, a data viz—and collaborate in real time. If that only works when everyone buys the same headset and same meeting app, the tech fails the moment a teammate, client, or vendor uses different gear.
We’ve seen this movie before: modern messaging spent years clawing its way out of proprietary silos before the industry converged on RCS and standardized features like high-quality media, typing indicators, and (increasingly) interoperable E2EE. That change came from standards, not from yet another app. (GSMA)
VR must follow the same path. “Let’s just make a Zoom-for-VR” sounds sensible—until procurement asks whether your counterpart is on Teams, Zoom, or something else, and legal asks who holds the data keys. A protocol-first approach lets different apps, engines, and devices interoperate without forcing everyone into the same subscription or hardware.
Building blocks we can stand on
We already have strong building blocks:
- OpenXR gives apps a common API to talk to many XR runtimes and devices. It reduces per-device porting and is supported across major ecosystems. (The Khronos Group)
- glTF 2.0 (and OpenUSD) are widely adopted 3D asset formats; they’re efficient, extensible ways to ship meshes, materials, and animations. (Khronos Registry)
- WebRTC data channels provide low-latency, encrypted, reliable/ordered or unreliable/unordered transport for multi-peer state sync. (IETF Datatracker)
- Matrix demonstrates open, federated, E2EE-capable real-time messaging with room semantics and bridging—useful inspiration for session discovery and federation. (Matrix Specification)
- MPEG-I Scene Description (ISO/IEC 23090-14) is formalizing how complex immersive scenes reference media and interact—another anchor for consistency. (ISO)
- Industry groups are actively aligning USD & glTF and mapping the standards landscape for metaverse interoperability. (Metaverse Standards Forum)
These pieces don’t yet form a complete “collaboration protocol,” but they’re close.
Proposal: VRCSP — VR Collaboration Session Protocol (sketch)
A protocol, not a product. Vendors can compete on UX while interoperating at the wire level.
Design goals
- Hardware-agnostic sessions: any OpenXR/engine runtime can join. (The Khronos Group)
- Asset-format neutrality with glTF 2.0 / OpenUSD as first-class, and room for MPEG-I SD alignment. (Khronos Registry)
- Real-time, encrypted state sync over WebRTC data channels (with MLS-style group E2EE for control/state). (MDN Web Docs)
- Federated discovery: Matrix-like rooms so organizations can host their own servers yet collaborate across domains. (Matrix Specification)
- Conflict-free collaboration: CRDT-based scene graph edits so concurrent changes merge without foot-guns.
- Extensible capabilities: hand tracking, passthrough/AR, haptics negotiated per session.
- Privacy & enterprise posture: org-managed identity, audit, and key control.
High-level architecture
- Signaling & Discovery:
  - Federation layer (Matrix-style) to create/join rooms, exchange offers, publish capabilities, and manage access control. (Matrix Specification)
- Transport:
  - WebRTC for media + RTCDataChannel for state (unordered/unreliable for fast transforms; ordered/reliable for scene edits). (IETF Datatracker)
- Assets & Scene:
  - Shared scene graph references glTF/OpenUSD assets; optional mapping to MPEG-I Scene Description for media integration. (Khronos Registry)
- Runtime:
  - Clients render via their native engine/runtime (OpenXR where available). (The Khronos Group)
- Security:
  - DTLS/SRTP on WebRTC plus MLS group keys for control/state channels; enterprise can host the MLS/identity service (RCS’s movement toward interoperable E2EE shows the path). (MDN Web Docs)
Wire-level sketch
Below is a conceptual message model for VRCSP (JSON for readability). This is not a finalized spec—it shows how a fully open, vendor-neutral session might work.
1) Session discovery & join
// Matrix-like room create (federated) — discovery layer
{
  "type": "room.create",
  "room_id": "!abc123:example.com",
  "visibility": "private",
  "purpose": "design-review",
  "policy": { "invite_only": true, "orgs_allowed": ["example.com", "partner.org"] }
}
// Capability handshake over reliable control channel
{
  "type": "hello",
  "protocol": "VRCSP/0.1",
  "identity": { "user": "alice@example.com", "org": "example.com" },
  "runtime": {
    "api": "OpenXR",
    "version": "1.0",
    "features": { "handTracking": true, "passthrough": false, "eyeGaze": true }
  },
  "formats": { "assets": ["glTF2.0", "OpenUSD"], "scene": ["MPEG-I-SD"] },
  "transport": { "webrtc": { "sdp": "…", "iceServers": ["stun:stun.example.org"] } },
  "crypto": { "mls": { "cipherSuite": "MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519" } }
}
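On receipt of a peer’s hello, a client can intersect the two capability sets to settle what the session will actually use. A minimal Python sketch (the negotiate helper and its return shape are illustrative, not part of the wire format):

```python
def negotiate(local: dict, remote: dict) -> dict:
    """Intersect two VRCSP 'hello' messages into session-wide capabilities."""
    # Asset formats both sides can load, in the local peer's preference order.
    formats = [f for f in local["formats"]["assets"]
               if f in remote["formats"]["assets"]]
    # Optional runtime features are enabled only when both peers support them.
    features = {k: v and remote["runtime"]["features"].get(k, False)
                for k, v in local["runtime"]["features"].items()}
    return {"assets": formats, "features": features}

alice = {"formats": {"assets": ["glTF2.0", "OpenUSD"]},
         "runtime": {"features": {"handTracking": True, "passthrough": False}}}
bob = {"formats": {"assets": ["glTF2.0"]},
       "runtime": {"features": {"handTracking": True, "passthrough": True}}}

# glTF2.0 is the shared format; handTracking stays on, passthrough is off.
session = negotiate(alice, bob)
```

The same intersection logic extends naturally to transport and crypto parameters.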
2) Asset manifest & content-addressing
// What the session will render; URIs may be HTTPS, S3, IPFS, etc.
{
  "type": "asset.manifest",
  "scene": "usd://designs/lockerroom.usd",
  "assets": [
    { "id": "chair01", "uri": "https://cdn.example.com/chair.glb",
      "hash": "sha256-5e3d…", "format": "glTF2.0" },
    { "id": "flooring", "uri": "s3://bucket/flooring.usdz",
      "hash": "sha256-8a70…", "format": "OpenUSD" }
  ]
}
Clients cache/verify by hash for integrity; fall back to alternative formats if needed.
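Verification itself is straightforward: recompute the digest of the fetched bytes and compare it to the manifest entry. A sketch assuming the `sha256-<hex>` hash convention used in the example above:

```python
import hashlib

def verify_asset(data: bytes, expected: str) -> bool:
    """Check fetched bytes against a manifest 'sha256-<hex>' content hash."""
    algo, _, digest = expected.partition("-")
    if algo != "sha256":          # only sha256 in this sketch
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return hashlib.sha256(data).hexdigest() == digest

blob = b"glTF-binary-payload"
entry = {"id": "chair01", "hash": "sha256-" + hashlib.sha256(blob).hexdigest()}
ok = verify_asset(blob, entry["hash"])          # True: bytes match the manifest
bad = verify_asset(b"tampered", entry["hash"])  # False: reject and re-fetch
```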
3) Avatar description & presence
// Avatar descriptor (could also allow VRM or USD-based rigs)
{
  "type": "avatar.descriptor",
  "user": "alice@example.com",
  "humanoidRig": "humanoid-v1",
  "asset": { "uri": "https://cdn.example.com/avatars/alice.glb", "hash": "sha256-1c2d…" },
  "attachments": [{ "type": "badge", "label": "Architect" }]
}
4) Real-time pose & input (fast, unordered data channel)
{
  "type": "avatar.pose",
  "user": "alice@example.com",
  "t": 1726071584.234,
  "head": { "p": [0, 1.62, 0], "q": [0, 0, 0, 1] },
  "hands": {
    "left": { "visible": true, "joints": [[…], […]] },
    "right": { "visible": true, "joints": [[…], […]] }
  }
}
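Because this channel is unordered and lossy, a receiver should apply a sample only if its t timestamp is newer than the last one seen for that user; anything else is a stale or duplicate packet. A latest-wins sketch (the on_pose handler is hypothetical):

```python
latest_pose: dict[str, dict] = {}  # user -> newest avatar.pose message seen

def on_pose(msg: dict) -> bool:
    """Apply a pose only if it is newer than what we already have (latest-wins)."""
    prev = latest_pose.get(msg["user"])
    if prev is not None and msg["t"] <= prev["t"]:
        return False  # stale/duplicate packet from the unordered channel; drop it
    latest_pose[msg["user"]] = msg
    return True

applied = on_pose({"type": "avatar.pose", "user": "alice@example.com",
                   "t": 1726071584.234, "head": {"p": [0, 1.62, 0], "q": [0, 0, 0, 1]}})
stale = on_pose({"type": "avatar.pose", "user": "alice@example.com",
                 "t": 1726071584.100, "head": {"p": [0, 1.60, 0], "q": [0, 0, 0, 1]}})
```

No retransmission, no reordering buffer: the next 60–90 Hz sample makes any loss irrelevant.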
5) Shared scene edits (CRDT patch, reliable channel)
// Create a shared object
{
  "type": "scene.op",
  "doc": "scene-graph",
  "op_id": "d6c9-1",
  "crdt": {
    "op": "add",
    "path": "/objects/cube42",
    "value": {
      "kind": "mesh",
      "asset": "chair01",
      "transform": { "t": [2.0, 0.0, -1.5], "r": [0, 0.707, 0, 0.707], "s": [1, 1, 1] }
    }
  }
}
// Update transform concurrently — merges deterministically
{
  "type": "scene.op",
  "doc": "scene-graph",
  "op_id": "d6c9-2",
  "crdt": {
    "op": "set",
    "path": "/objects/cube42/transform/t",
    "value": [2.3, 0.0, -1.2]
  }
}
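To see why concurrent edits merge deterministically, here is a deliberately tiny last-writer-wins policy keyed by op_id. Production systems would use a real CRDT library (Automerge, Yjs) with proper logical clocks, but the convergence property is the same: every peer ends up with an identical document regardless of arrival order.

```python
def merge(doc: dict, ops: list[dict]) -> dict:
    """Apply scene.op messages in op_id order (last-writer-wins per path).
    Sorting by op_id before applying makes the result independent of the
    order in which ops arrived, so every peer converges. (Lexical op_ids
    are a toy tie-break; real CRDTs use Lamport clocks or similar.)"""
    for op in sorted(ops, key=lambda o: o["op_id"]):
        c = op["crdt"]
        if c["op"] in ("add", "set"):
            doc[c["path"]] = c["value"]
    return doc

ops = [
    {"op_id": "d6c9-2", "crdt": {"op": "set", "path": "/objects/cube42/transform/t",
                                 "value": [2.3, 0.0, -1.2]}},
    {"op_id": "d6c9-1", "crdt": {"op": "set", "path": "/objects/cube42/transform/t",
                                 "value": [2.0, 0.0, -1.5]}},
]
a = merge({}, ops)                   # one arrival order
b = merge({}, list(reversed(ops)))   # the opposite order, same result
```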
6) Media & annotations
// Attach a live video note to an object (ties to MPEG-I SD ideas)
{
  "type": "annotation.media",
  "target": "/objects/cube42",
  "media": {
    "kind": "video-note",
    "uri": "webrtc://stream/uuid-77",
    "thumbnail": "https://cdn.example.com/notes/77.jpg"
  }
}
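On the client side, rendering these notes is just an index over received messages by target path. A small sketch (annotations_for is illustrative):

```python
def annotations_for(messages: list[dict], target: str) -> list[dict]:
    """Collect annotation.media payloads attached to one scene-graph path."""
    return [m["media"] for m in messages
            if m.get("type") == "annotation.media" and m.get("target") == target]

notes = annotations_for(
    [{"type": "annotation.media", "target": "/objects/cube42",
      "media": {"kind": "video-note", "uri": "webrtc://stream/uuid-77"}}],
    "/objects/cube42",
)
```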
7) Permissions & export
{
  "type": "policy.update",
  "rules": [
    { "role": "viewer", "allow": ["read.pose", "read.scene"] },
    { "role": "editor", "allow": ["write.scene", "annotate"] }
  ]
}
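Enforcement can be a direct lookup against the most recent policy.update. A sketch (the allowed helper is hypothetical; role and action names follow the rules above):

```python
def allowed(rules: list[dict], role: str, action: str) -> bool:
    """Check whether a role's allow-list in a policy.update grants an action."""
    return any(r["role"] == role and action in r["allow"] for r in rules)

rules = [
    {"role": "viewer", "allow": ["read.pose", "read.scene"]},
    {"role": "editor", "allow": ["write.scene", "annotate"]},
]

can_edit = allowed(rules, "editor", "write.scene")  # granted
can_peek = allowed(rules, "viewer", "write.scene")  # denied: viewers are read-only
```

Clients that receive a message whose type maps to a disallowed action simply drop it and log the attempt.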
// Export a point-in-time bundle for audit/share
{
  "type": "export.request",
  "format": "usd",
  "include": ["sceneGraph", "annotations", "provenance"],
  "destination": "s3://org-archive/design-review-2025-09-11.usdz"
}
Why this beats “Teams-for-VR”
- Freedom of client: any app that speaks the protocol can join—Unity, Unreal, a web client, native engines—on any headset/runtime. OpenXR helps with the runtime side. (The Khronos Group)
- No asset traps: teams keep using glTF/OpenUSD pipelines and still render the same things together. (Khronos Registry)
- Network & security done right: real-time over WebRTC data channels, E2EE group keys via MLS-style schemes as RCS is moving toward—so no one vendor is the gatekeeper of your collaboration data. (MDN Web Docs)
- Ecosystem leverage: MPEG-I SD gives a cross-media reference model, and industry forums are already aligning the 3D stack. (ISO)
Implementation notes (for the adventurous)
- Signaling: start with Matrix rooms for federation, invites, and presence; bridge to enterprise identity. (Matrix Specification)
- Transport: one reliable control channel (ordered), one reliable scene channel (ordered), and one or more pose channels (unordered, partial reliability) over WebRTC. (IETF Datatracker)
- Scene sync: use a CRDT library (e.g., Automerge/Yjs) to represent a shared scene document; apply diffs to the local engine’s scene graph.
- Assets: prefer content-addressed URIs (hashes) + streaming decompression; support both glTF(.glb) and USD(.usd/.usdz). (Khronos Registry)
- Interop testing: build conformance scenes (materials, skeletal rigs, IK, physics), much like glTF’s sample models. (Khronos Registry)
- Privacy: org-scoped MLS groups with hardware-rooted key storage where available; auditable exports that don’t leak raw user biometrics.
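Putting the transport notes together, the channel split can be expressed as a small routing table. The channel names, the option keys (mirroring WebRTC’s data-channel init dictionary), and the message-type prefixes are assumptions drawn from the examples in this post:

```python
# RTCDataChannel init options for the three VRCSP channels (illustrative).
CHANNELS = {
    "control": {"ordered": True},                        # reliable, ordered: hello, policy
    "scene":   {"ordered": True},                        # reliable, ordered: CRDT ops, assets
    "pose":    {"ordered": False, "maxRetransmits": 0},  # fire-and-forget transforms
}

def channel_for(msg_type: str) -> str:
    """Route a VRCSP message type onto the channel with matching guarantees."""
    if msg_type.startswith("avatar.pose"):
        return "pose"
    if msg_type.startswith(("scene.", "asset.", "annotation.")):
        return "scene"
    return "control"  # hello, policy.update, export.request, room.*
```

A fourth, partially reliable channel (bounded `maxRetransmits`) is a reasonable middle ground for large but droppable payloads like voice-activity hints.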
Call to action
Vendors: compete on clients and UX, not on lock-in. Standards bodies: help converge on a minimal, testable VRCSP that rides OpenXR, glTF/OpenUSD, WebRTC, and MPEG-I SD. Enterprises: insist on protocol-level interoperability in RFPs—just as you did for messaging when the industry matured toward RCS.
If we get this right, collaborating in VR will feel like joining a meeting link—not switching phones just to get a blue bubble.