An application that detects and tracks objects from iPhone camera input, then projects them onto a ground plane to visualize world coordinates.

1. Problem and Goal

Typical object detection screens show what is visible, but they do not show how the same object moves over time or where it is in real space.
On iPhones without LiDAR, perspective and height errors combined with high resource usage make it difficult to build a practical pipeline.

The project has three goals.

  • Integrate real-time object detection, object tracking, and plane-based world coordinate interpretation and visualization into a single stable pipeline on a standalone iPhone.
  • Find a balance between achievable accuracy and latency under limited sensor and compute resources.
  • Record bottlenecks and error characteristics reproducibly through per-stage performance measurement and establish improvement baselines.

2. Demo

3. Key Features

  • Object detection:
    • On-device machine learning inference to compute object class and location
  • Object tracking:
    • Maintains object IDs to link the same object across frames
  • Plane-based world coordinate transform:
    • Projects the bottom-center point of each object onto the ground plane and converts it to plane-based world coordinates
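The plane-based transform described above can be sketched as a ray-plane intersection. This is a minimal illustration in Python, not the app's implementation (which relies on the augmented reality framework's ray casting). It assumes a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, a camera-to-world rotation matrix, a camera position, and a horizontal ground plane at `y = plane_y` in a y-up world frame. All function and parameter names are illustrative.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def bottom_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Bottom-center pixel of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)


def pixel_to_ground(
    pixel: Tuple[float, float],
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    cam_pos: Vec3,
    plane_y: float = 0.0,
) -> Optional[Vec3]:
    """Intersect the viewing ray through `pixel` with the plane y = plane_y.

    Assumptions (not from the original project): the camera frame is
    x right, y down, z forward; `rotation` is a 3x3 camera-to-world
    matrix given as rows; the world frame is y-up.
    Returns None when the ray is parallel to the plane or points away from it.
    """
    u, v = pixel
    fx, fy, cx, cy = intrinsics
    # Ray direction in the camera frame (pinhole model, unnormalized).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = tuple(
        sum(rotation[r][c] * d_cam[c] for c in range(3)) for r in range(3)
    )
    if abs(d_world[1]) < 1e-9:
        return None  # ray is parallel to the ground plane
    s = (plane_y - cam_pos[1]) / d_world[1]
    if s <= 0.0:
        return None  # intersection lies behind the camera (pixel above horizon)
    return tuple(cam_pos[i] + s * d_world[i] for i in range(3))
```

With the camera 1.5 m above the plane and looking horizontally, a bottom-center pixel whose ray drops 0.5 units per unit of forward travel lands 3 m ahead on the plane. A pixel above the horizon returns `None`, which a caller would treat as "no ground hit".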

4. Why This Approach Over Alternatives

SLAM and visual odometry–based approaches were also considered.
However, the primary goal of this project was to stably integrate real-time detection, tracking, and plane-based world coordinate interpretation and visualization on a standalone iPhone, so a lighter plane-based approach was chosen over them.

The key reasoning was as follows.

  • Computationally intensive alternatives such as SLAM or visual odometry could impose significant compute and memory overhead on mobile devices.
  • Additional spatial sensors (depth cameras, LiDAR) could reduce error and compute burden in some paths.

This project is largely experimental in nature, testing how far plane-based world coordinate interpretation can go under limited sensor and resource constraints.
The only test device was an iPhone 16e, which has no LiDAR, so LiDAR-based paths were excluded. The project relied on Apple’s augmented reality framework for pose estimation and plane ray casting.

5. Performance Optimization and Test Environment

Key optimizations applied:

  • Eliminated redundant plane-based world coordinate transform calls
  • Per-object ray casting cache with motion-based reuse
  • Configurable frame interval for re-verification
  • Scene update rate limiting
  • Latency sample synchronization improvements
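The ray casting cache and the re-verification interval from the list above can be sketched together. This is a hedged illustration, not the project's code: it assumes "motion" is measured as the pixel displacement of an object's anchor point since the last real ray cast, and that a cached result is refreshed after a fixed number of frames. The class name, thresholds, and the `raycast` callback are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]


@dataclass
class _Entry:
    pixel: Point2             # anchor pixel used for the last real ray cast
    world: Optional[Point3]   # result of that ray cast (None = no hit)
    frame: int                # frame index of that ray cast


class RaycastCache:
    """Per-object cache that reuses a ray cast while the object barely moves."""

    def __init__(
        self,
        raycast: Callable[[Point2], Optional[Point3]],
        max_pixel_motion: float = 4.0,   # assumed reuse threshold, in pixels
        reverify_interval: int = 15,     # assumed re-verification period, in frames
    ) -> None:
        self._raycast = raycast
        self._max_motion = max_pixel_motion
        self._interval = reverify_interval
        self._entries: Dict[int, _Entry] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, track_id: int, pixel: Point2, frame: int) -> Optional[Point3]:
        entry = self._entries.get(track_id)
        if entry is not None:
            moved = math.hypot(pixel[0] - entry.pixel[0], pixel[1] - entry.pixel[1])
            fresh = frame - entry.frame < self._interval
            if moved <= self._max_motion and fresh:
                self.hits += 1
                return entry.world  # reuse: skip the expensive ray cast
        # Object moved too far or the entry is stale: cast again and store.
        self.misses += 1
        world = self._raycast(pixel)
        self._entries[track_id] = _Entry(pixel, world, frame)
        return world

    def prune(self, live_ids: Iterable[int]) -> None:
        """Drop entries for tracks that no longer exist."""
        keep = set(live_ids)
        self._entries = {k: v for k, v in self._entries.items() if k in keep}
```

Displacement is measured against the pixel of the last real ray cast rather than the previous frame, so slow drift still triggers a refresh once it accumulates past the threshold. A reused result is slightly out of date by design; the two thresholds trade position accuracy against ray casting cost.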

Test environment:

  • Device: iPhone 16e (physical device)
  • OS: iOS 26
  • Build: Debug
  • Camera frame rate: 30 or 60 fps
  • Object detection frame rate: 10–60 fps
  • Measured across combinations of object tracking, ground plane computation, and world view mode

6. Results and Limitations

Real-time detection, object tracking, and plane-based world coordinate visualization are operational, and per-stage performance measurement enables bottleneck identification.
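Per-stage measurement of this kind can be sketched as a small timer that records a latency sample per stage and reports the slowest stage. This is an illustrative sketch only: the stage names, the injected `clock`, and the choice of mean and nearest-rank 95th percentile as summary statistics are assumptions, not the project's actual instrumentation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import Callable, Dict, List, Optional


class StageTimer:
    """Collects per-stage latency samples in milliseconds."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock  # injectable so the timer can be tested deterministically
        self._samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the sample even if the stage raised an exception.
            self._samples[name].append((self._clock() - start) * 1000.0)

    def summary(self) -> Dict[str, Dict[str, float]]:
        out = {}
        for name, samples in self._samples.items():
            ordered = sorted(samples)
            # Nearest-rank 95th percentile, computed with integer arithmetic.
            rank = (95 * len(ordered) + 99) // 100
            out[name] = {
                "count": len(ordered),
                "mean_ms": mean(ordered),
                "p95_ms": ordered[rank - 1],
            }
        return out

    def bottleneck(self) -> Optional[str]:
        """Name of the stage with the highest mean latency, if any."""
        if not self._samples:
            return None
        return max(self._samples, key=lambda n: mean(self._samples[n]))
```

A frame loop would wrap each stage, for example `with timer.stage("detect"): ...`, and log `timer.summary()` per test configuration so that runs with different frame rates and feature toggles can be compared.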

Key observations (based on current testing):

  • A model performance verification workflow was established for mobile environments.
  • iPhone-estimated object positions on the plane showed relatively small errors in some scenes.
  • Errors increased when real-world height changes were interpreted as in-plane movement.
  • Perspective inference and height estimation based on the ground plane were prone to error accumulation.
  • Augmented reality framework–based plane estimation, pose estimation, and world projection had high resource usage.
  • For top-down 2D views, homography-based transforms could be an alternative, though a definitive comparison requires further controlled experiments.
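The height-related error noted above follows from similar triangles. The sketch below is a geometric illustration under simplifying assumptions, not measured data: a camera at height `cam_height` above a flat ground plane, and a tracked point at horizontal distance `distance` that is actually `point_height` above the plane. The ray through that point meets the plane at `distance * cam_height / (cam_height - point_height)`, so lifting the point looks like movement away from the camera. Function and parameter names are hypothetical.

```python
from typing import Optional


def projected_ground_distance(
    distance: float, cam_height: float, point_height: float
) -> Optional[float]:
    """Distance at which the ray through an elevated point hits the ground.

    Returns None when the point is at or above camera height, because the
    ray then never reaches the plane in front of the camera.
    """
    if point_height >= cam_height:
        return None
    return distance * cam_height / (cam_height - point_height)


def in_plane_error(
    distance: float, cam_height: float, point_height: float
) -> Optional[float]:
    """Apparent in-plane displacement caused by ignoring the point's height."""
    projected = projected_ground_distance(distance, cam_height, point_height)
    if projected is None:
        return None
    return projected - distance
```

For a camera 1.5 m above the plane and a point 3 m away that is lifted 0.3 m, the projected distance is 3.75 m, an apparent shift of 0.75 m. The error grows with distance and rises sharply as the point's height approaches the camera's height.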

In summary, the system works even on devices like iPhone 16e without LiDAR, but there is a clear trade-off between perspective/height accuracy and resource usage.
On higher-end devices with LiDAR, different pose estimation and depth utilization paths could reduce errors.