
Gesture-Based Computer Control

Difficulty: Advanced
Team size: 2-3 people
Time: ~30-35 hours
Demo-ready by: Step 5
Prerequisites: Python, OpenCV basics, basic ML concepts
Built by: Leap Motion, Google MediaPipe, Ultraleap, Kinect

Skills you'll earn: Computer vision, MediaPipe hand tracking, gesture classification, OS input simulation, real-time inference

Start by detecting a hand in a webcam feed. End with a system that maps hand gestures to mouse movement, clicks, and keyboard shortcuts.

(Assumes basic Python knowledge. The time estimate includes learning OpenCV, MediaPipe, and OS input injection.)

Step 1: Capture webcam frames (~1 hour)

You need a live video feed before you can detect anything.

  • Open the default camera with OpenCV
  • Display frames in a window at 30fps
  • Print frame dimensions and FPS to confirm the pipeline works

You now have: A live video feed.
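
A minimal sketch of this step, assuming the default camera sits at index 0:

```python
# Open the default camera, show frames, and print dimensions plus measured FPS.
import time

import cv2

cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("Could not open webcam")

prev = time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    now = time.time()
    fps = 1.0 / (now - prev) if now > prev else 0.0
    prev = now
    h, w = frame.shape[:2]
    print(f"{w}x{h} @ {fps:.1f} FPS", end="\r")
    cv2.imshow("webcam", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```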

Step 2: Detect a hand (~2-3 hours)

A raw frame is just pixels. You need to find the hand in it.

  • Use MediaPipe Hands (or build a simple skin-color segmentation baseline)
  • Draw the 21 hand landmarks on each frame
  • Print landmark coordinates for the index finger tip

You now have: Real-time hand tracking.
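
A sketch of the MediaPipe path, extending the Step 1 loop; the confidence threshold here is just a starting point to tune:

```python
# Track one hand with MediaPipe, draw its 21 landmarks, print the index tip.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe wants RGB; OpenCV frames arrive as BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
                tip = hand.landmark[8]  # index finger tip, normalized to [0, 1]
                print(f"index tip: x={tip.x:.3f} y={tip.y:.3f}")
        cv2.imshow("hands", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```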

Step 3: Map finger position to mouse movement (~3-4 hours)

Landmark coordinates are in camera space. The mouse lives in screen space.

  • Map the index finger tip (landmark 8) to screen coordinates
  • Scale camera resolution to screen resolution
  • Move the system cursor using pyautogui or platform APIs
  • Add smoothing (rolling average of last N positions) to eliminate jitter

You now have: Gesture-controlled mouse movement.
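
A sketch of the mapping plus smoothing, assuming the normalized `tip` landmark from Step 2 and pyautogui for cursor control; the window size of 5 is an assumption to tune against jitter vs. lag:

```python
# Map a normalized landmark to screen coordinates with rolling-average smoothing.
from collections import deque

import pyautogui

screen_w, screen_h = pyautogui.size()
history = deque(maxlen=5)      # last N positions; bigger = smoother but laggier
pyautogui.FAILSAFE = False     # corner abort fires constantly while testing;
                               # re-enable once the mapping is stable

def move_cursor(tip):
    # Landmarks are normalized [0, 1], so scaling is a simple multiply.
    # Mirror x so moving your hand right moves the cursor right.
    x = (1.0 - tip.x) * screen_w
    y = tip.y * screen_h
    history.append((x, y))
    sx = sum(p[0] for p in history) / len(history)
    sy = sum(p[1] for p in history) / len(history)
    pyautogui.moveTo(sx, sy)
```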

Step 4: Detect gestures for clicks (~3-4 hours)

Moving the cursor is useless without clicking.

  • Define gestures: pinch (thumb + index close) = left click, pinch (thumb + middle) = right click
  • Measure distance between fingertips to detect a pinch
  • Add a cooldown period to prevent rapid-fire clicks
  • Visual feedback: draw a circle on the frame when a click triggers

You now have: Click detection.
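
A sketch of pinch detection on the Step 2 landmarks; the distance threshold and cooldown are guesses to tune for your camera and hand size:

```python
# Detect thumb+index and thumb+middle pinches, with a cooldown between clicks.
import math
import time

import pyautogui

PINCH_THRESHOLD = 0.05   # normalized fingertip distance; tune per setup
COOLDOWN = 0.5           # seconds between clicks
last_click = 0.0

def dist(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)

def check_clicks(hand):
    global last_click
    if time.time() - last_click < COOLDOWN:
        return None
    thumb, index, middle = hand.landmark[4], hand.landmark[8], hand.landmark[12]
    if dist(thumb, index) < PINCH_THRESHOLD:
        pyautogui.click(button="left")
        last_click = time.time()
        return "left"
    if dist(thumb, middle) < PINCH_THRESHOLD:
        pyautogui.click(button="right")
        last_click = time.time()
        return "right"
    return None
```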

Step 5: Scroll and drag (~3-4 hours)

  • Fist gesture = drag mode (hold left click while moving)
  • Two-finger spread/pinch = scroll up/down
  • Track gesture state transitions: idle → dragging → idle

You now have: Full mouse replacement.
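
A sketch of the drag state machine plus a scroll hook; `is_fist(hand)` is a hypothetical helper you would implement (e.g. all fingertips curled below their knuckles), and the scroll thresholds are assumptions:

```python
# Drag: hold the left button while the fist is closed. Scroll: spread/pinch
# of index and middle fingertips drives pyautogui.scroll.
import pyautogui

state = "idle"  # idle -> dragging -> idle

def update_drag(hand):
    global state
    fist = is_fist(hand)                       # hypothetical gesture test
    if state == "idle" and fist:
        pyautogui.mouseDown(button="left")     # enter drag mode: hold the button
        state = "dragging"
    elif state == "dragging" and not fist:
        pyautogui.mouseUp(button="left")       # fist opened: release the drag
        state = "idle"

def update_scroll(spread_delta):
    # spread_delta: frame-to-frame change in index-middle fingertip distance
    if abs(spread_delta) > 0.01:               # ignore noise below a threshold
        pyautogui.scroll(int(spread_delta * 500))  # sign sets scroll direction
```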

Step 6: Keyboard shortcuts via gestures (~4-5 hours)

  • Map specific hand poses to actions: open palm = play/pause, thumbs up = volume up
  • Build a gesture classifier: extract finger angles from landmarks, classify into N poses
  • Make the mapping configurable via a JSON file
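
A sketch of the pose-to-action plumbing: `finger_angle` computes one angle feature from the landmarks, while `classify_pose` (bucketing angle vectors into pose labels) is left hypothetical, and `gestures.json` is an assumed filename and format:

```python
# Map classified hand poses to pyautogui key presses, configured via JSON.
import json
import math

import pyautogui

with open("gestures.json") as f:
    # e.g. {"open_palm": "playpause", "thumbs_up": "volumeup"}
    pose_actions = json.load(f)

def finger_angle(hand, a, b, c):
    """Angle in degrees at landmark b, formed by landmarks a-b-c."""
    ax = hand.landmark[a].x - hand.landmark[b].x
    ay = hand.landmark[a].y - hand.landmark[b].y
    cx = hand.landmark[c].x - hand.landmark[b].x
    cy = hand.landmark[c].y - hand.landmark[b].y
    cos = (ax * cx + ay * cy) / (math.hypot(ax, ay) * math.hypot(cx, cy))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def handle_pose(hand):
    pose = classify_pose(hand)       # hypothetical: angle features -> label
    action = pose_actions.get(pose)
    if action:
        pyautogui.press(action)      # pyautogui key names like "volumeup"
```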

Step 7: Reduce latency and improve accuracy (~4-5 hours)

  • Run detection in a separate thread from rendering
  • Reduce camera resolution to 640x480 for speed
  • Add a deadzone in the center to prevent drift when the hand is still
  • Profile and optimize: target under 50ms per frame
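
A sketch of moving capture off the render thread, so the UI always draws the latest frame instead of blocking on the camera:

```python
# Background capture thread that keeps only the most recent frame.
import threading

import cv2

class CameraThread:
    def __init__(self, index=0, width=640, height=480):
        self.cap = cv2.VideoCapture(index)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
        self.frame = None
        self.lock = threading.Lock()
        self.running = True
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while self.running:
            ok, frame = self.cap.read()
            if ok:
                with self.lock:
                    self.frame = frame   # overwrite: keep only the latest frame

    def latest(self):
        with self.lock:
            return None if self.frame is None else self.frame.copy()
```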

Where to go from here

  • Two-hand gestures (zoom, rotate)
  • Custom gesture training (let users define their own)
  • OS integration (virtual touchpad driver instead of pyautogui hacks)
  • Presentation mode (next slide, laser pointer, annotations)