Autonomous Shopping Robot

Python · Neural Networks

Jul 2024 - Oct 2024

A fully autonomous robot capable of collecting items on a shopping list through heavy use of computer vision, SLAM and path planning.


The Autonomous Shopping Robot was built with a simple goal: given a shopping list, find and collect all the items in the shortest amount of time without crashing into the supermarket environment. The list is random, the supermarket layout changes every time, and the robot goes into the challenge with no human intervention. What follows is a lot of interesting maths, computer vision and machine learning used to navigate the supermarket and identify its fruits and vegetables.

Simultaneous Localisation & Mapping (SLAM)

SLAM is the process of positioning yourself in the world while simultaneously mapping the world around you, without a global source of truth. A combination of a physics model (predicting where the robot should be given the duty cycle applied to its motors over time) and sensor measurements provides the information used to estimate where the robot is relative to the objects around it. In this environment, the ARUCO markers and the target produce act as waypoints, although their pose and position need to be determined at runtime.

The robot simulation environment.
The physics of the real world is simulated to assist in rapidly testing and improving the SLAM algorithms. The image shows the simulated environment in the background and the robot's vision, along with ARUCO detections and the generated map. Note the covariance rings surrounding the robot and the sighted markers.
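The exact estimator isn't spelled out here, but the covariance rings point to a Kalman-style filter fusing the motion model with the marker measurements. Below is a minimal, illustrative EKF sketch under that assumption: for brevity it tracks only the robot pose and treats the marker position as already known, whereas the real SLAM state also estimates marker positions, and the noise matrices Q and R are placeholders.

```python
import numpy as np

def ekf_predict(x, P, v, omega, dt, Q):
    """Propagate the pose estimate x = (x, y, theta) and covariance P
    through the motion model."""
    theta = x[2]
    x = x + np.array([v * dt * np.cos(theta),
                      v * dt * np.sin(theta),
                      omega * dt])
    # Jacobian of the motion model with respect to the state
    F = np.array([[1.0, 0.0, -v * dt * np.sin(theta)],
                  [0.0, 1.0,  v * dt * np.cos(theta)],
                  [0.0, 0.0,  1.0]])
    P = F @ P @ F.T + Q
    return x, P

def ekf_update(x, P, z, landmark, R):
    """Correct the pose with a range/bearing measurement z to a landmark
    whose position is treated as known for this sketch."""
    dx, dy = landmark[0] - x[0], landmark[1] - x[1]
    q = dx ** 2 + dy ** 2
    z_hat = np.array([np.sqrt(q), np.arctan2(dy, dx) - x[2]])
    H = np.array([[-dx / np.sqrt(q), -dy / np.sqrt(q),  0.0],
                  [ dy / q,          -dx / q,          -1.0]])
    innovation = z - z_hat
    innovation[1] = (innovation[1] + np.pi) % (2 * np.pi) - np.pi  # wrap angle
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ innovation
    P = (np.eye(3) - K @ H) @ P
    return x, P
```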

As previously mentioned, there are two parts to SLAM: the physics model and the measurements. Given that the robot only has two wheels, the equations for the motion model are quite simple. If the wheels are driven at the same speed, we can assume linear motion with negligible acceleration, as the motors are electric and the acceleration period is brief compared to the time driven at speed. This effectively means that we can simply multiply the linear velocity by the time elapsed to find the distance travelled. Mapping between motor duty cycles and linear velocity is simply a matter of finding calibration constants (constants rather than a single constant, as friction is a non-linear force) for different linear velocities.

$$v=\frac{k(l+r)}{2}\qquad\omega=\frac{k(r-l)}{b}$$

Where $k$ is the scaling factor, $l$ and $r$ are the left and right wheel duty cycles, and $b$ is the base width, the distance between the two wheels.
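To make this concrete, here is a minimal sketch of the motion model in Python. It is illustrative rather than the project's actual code: the function names are made up, and the arc-motion branch is included for completeness even though straight-line driving is the case described above.

```python
import numpy as np

def duty_to_velocity(left: float, right: float, k: float, b: float):
    """Map left/right wheel duty cycles to linear and angular velocity,
    where k is the calibrated scale factor and b is the base width."""
    v = k * (left + right) / 2      # linear velocity
    omega = k * (right - left) / b  # angular velocity
    return v, omega

def integrate_pose(pose: np.ndarray, v: float, omega: float, dt: float) -> np.ndarray:
    """Advance the pose (x, y, theta) over a small time step dt, assuming
    acceleration is negligible so the velocity is constant over the step."""
    x, y, theta = pose
    if abs(omega) < 1e-9:
        # Both wheels at the same speed: straight-line motion
        x += v * dt * np.cos(theta)
        y += v * dt * np.sin(theta)
    else:
        # Differing wheel speeds: drive along an arc of radius v / omega
        r = v / omega
        x += r * (np.sin(theta + omega * dt) - np.sin(theta))
        y += r * (np.cos(theta) - np.cos(theta + omega * dt))
        theta += omega * dt
    return np.array([x, y, theta])
```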

Target Detection

The identification of targets (i.e. fruits and vegetables) is handled by the industry-standard object recognition and segmentation model, YOLOv8. The model was trained on 10,000 images of the fresh produce on a variety of backgrounds; however, these images weren't manually labelled, as that would be incredibly time-consuming. Instead, a custom dataset generator was built to superimpose cutouts of the fresh produce onto different backgrounds while applying random augmentations to improve the robustness of detections.

A variety of detections on the generated test dataset.
  1. Approximately 100 photos of the fresh produce were taken from the robot camera against a plain white background. Only the angle of the targets relative to the robot was varied in between shots.
  2. The backgrounds were automatically removed using a simple Python script, resulting in hundreds of labelled produce images with a transparent background.
  3. Various background photos were taken, ensuring that none of them included a target. The produce cutouts were then superimposed onto these backgrounds, producing images that resembled the authentic environment while allowing the class and bounding box to be assigned automatically (a sketch of this compositing step follows the list). While the images were superimposed, various augmentations were randomly applied, such as:
    • Brightness
    • Rotation
    • Flipping
    • Scaling
    • Translation
    • Blurring
    • Hue-shifting
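
Here is a minimal sketch of the compositing step described above, using Pillow. Only a subset of the listed augmentations is shown, and the file paths, augmentation ranges and output size are illustrative choices rather than the project's actual values.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def compose_sample(cutout_path: str, background_path: str, out_size=(640, 640)):
    """Paste a transparent-background produce cutout onto a background and
    return the image plus a YOLO-style (x_center, y_center, w, h) box,
    normalised to the image size. Assumes the cutout fits inside the frame."""
    bg = Image.open(background_path).convert("RGB").resize(out_size)
    item = Image.open(cutout_path).convert("RGBA")

    # Random scaling, rotation and flipping of the cutout
    scale = random.uniform(0.3, 0.8)
    item = item.resize((int(item.width * scale), int(item.height * scale)))
    item = item.rotate(random.uniform(-30, 30), expand=True)
    item = item.crop(item.getbbox())            # trim transparent padding
    if random.random() < 0.5:
        item = ImageOps.mirror(item)

    # Random translation: the paste position is chosen, so the label is known
    x = random.randint(0, max(0, out_size[0] - item.width))
    y = random.randint(0, max(0, out_size[1] - item.height))
    bg.paste(item, (x, y), item)                # alpha channel acts as the mask

    # Whole-image brightness augmentation
    bg = ImageEnhance.Brightness(bg).enhance(random.uniform(0.7, 1.3))

    box = ((x + item.width / 2) / out_size[0],   # x_center
           (y + item.height / 2) / out_size[1],  # y_center
           item.width / out_size[0],             # width
           item.height / out_size[1])            # height
    return bg, box
```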

Once trained on these generated images, the model was incredibly reliable, achieving >99% recall on an unseen test dataset and strong real-time performance in the trial environments.
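For reference, fine-tuning and running YOLOv8 through the Ultralytics API follows the pattern below. The dataset config, starting weights and hyperparameters are placeholders rather than the values used for this project.

```python
from ultralytics import YOLO

# Fine-tune pretrained weights on the generated dataset.
# "produce.yaml" is a hypothetical dataset config pointing at the
# generated images and labels; the nano model size is just an example.
model = YOLO("yolov8n.pt")
model.train(data="produce.yaml", epochs=100, imgsz=640)

# Run inference on a single camera frame
results = model("frame.jpg")
for box in results[0].boxes:
    cls_id = int(box.cls)                   # predicted class index
    confidence = float(box.conf)            # detection confidence
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding box in pixels
    print(cls_id, confidence, (x1, y1, x2, y2))
```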

Clustering & Pose Estimation

As explored earlier in the section on SLAM, we map out the produce as the robot drives around by combining estimates of the robot's position with measurements taken from the camera feed. The next question is: how do we measure the relative position of the produce? Assuming fixed-size produce, we can use similar triangles and the camera's intrinsic matrix to find the distance and angle relative to the robot. This measurement can then be projected from the robot frame into the world frame to provide a position estimate for the produce.
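A minimal sketch of that similar-triangles calculation under a pinhole camera model is below. The known produce height and the frame conventions (robot x axis forward, y to the left, camera aligned with the robot) are assumptions made for illustration.

```python
import numpy as np

def estimate_world_position(bbox, true_height, K, robot_pose):
    """Estimate a target's world (x, y) position from its bounding box.

    bbox: (x1, y1, x2, y2) in pixels; true_height: assumed produce height in
    metres; K: 3x3 camera intrinsic matrix; robot_pose: (x, y, theta) in the
    world frame, with the camera looking along the robot's x axis.
    """
    x1, y1, x2, y2 = bbox
    fx, fy, cx = K[0, 0], K[1, 1], K[0, 2]

    # Similar triangles: pixel_height / fy = true_height / depth
    depth = fy * true_height / (y2 - y1)

    # Lateral offset from the horizontal position of the box centre
    u = (x1 + x2) / 2
    lateral = depth * (cx - u) / fx          # positive to the robot's left

    # Robot-frame estimate, rotated and translated into the world frame
    rx, ry = depth, lateral
    x, y, theta = robot_pose
    wx = x + rx * np.cos(theta) - ry * np.sin(theta)
    wy = y + rx * np.sin(theta) + ry * np.cos(theta)
    return np.array([wx, wy])
```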

Over time, we gather hundreds of unique estimates as to where a class of produce is, but this doesn't mean that there are equally many instances of this item in the real world. In fact, we may only have 1 or 2 pieces of garlic in a given space, but 300 slightly different position estimates. To resolve this, we employ DBSCAN clustering to group these estimates into a handful of usable positions.
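A minimal sketch of that clustering step with scikit-learn is below; the eps and min_samples values are illustrative and would need tuning to the scale of the arena and the density of estimates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_estimates(positions: np.ndarray, eps: float = 0.3, min_samples: int = 10):
    """Collapse many noisy (x, y) estimates for one produce class into a few
    cluster centres. Points labelled -1 by DBSCAN are noise and are discarded."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(positions)
    centres = [positions[labels == lab].mean(axis=0)
               for lab in set(labels) if lab != -1]
    return np.array(centres)
```

DBSCAN is a natural fit here because the number of produce instances isn't known in advance, and isolated, spurious estimates are discarded as noise rather than forced into a cluster.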
