Tutorial: Create classification

iOS Machine Learning

Machine learning is a type of artificial intelligence where computers “learn” without being explicitly programmed. Instead of coding an algorithm, machine learning tools enable computers to develop and refine algorithms by finding patterns in huge amounts of data.

Deep Learning

Since the 1950s, AI researchers have developed many approaches to machine learning. Apple’s Core ML framework supports neural networks, tree ensembles, support vector machines, generalized linear models, feature engineering and pipeline models. However, neural networks have produced many of the most spectacular recent successes, starting with Google’s 2012 use of YouTube videos to train its AI to recognize cats and people. Only five years later, Google is sponsoring a contest to identify 5000 species of plants and animals. Apps like Siri and Alexa also owe their existence to neural networks.

A neural network tries to model human brain processes with layers of nodes, linked together in different ways. Each additional layer requires a large increase in computing power: Inception v3, an object-recognition model, has 48 layers and approximately 20 million parameters. But the calculations are basically matrix multiplication, which GPUs handle extremely efficiently. The falling cost of GPUs enables people to create multilayer deep neural networks, hence the term deep learning.
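
To make the matrix-multiplication point concrete, here is a toy sketch (plain Swift, not Core ML, and not part of this tutorial's project) of a single fully connected layer: a matrix multiply plus a ReLU activation. The function name and shapes are illustrative only.

// Toy illustration: one fully connected layer is a matrix multiply plus a nonlinearity.
// `weights` has one row per output node; each output is the dot product of a row with the input.
func denseLayer(input: [Double], weights: [[Double]], bias: [Double]) -> [Double] {
    return zip(weights, bias).map { row, b in
        max(0, zip(row, input).map { $0 * $1 }.reduce(0, +) + b) // ReLU: clamp negatives to zero
    }
}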

(A neural network, circa 2016)

Neural networks need a large amount of training data, ideally representing the full range of possibilities. The explosion in user-generated data has also contributed to the renaissance of machine learning.

Training the model means supplying the neural network with training data, and letting it calculate a formula for combining the input parameters to produce the output(s). Training happens offline, usually on machines with many GPUs.

To use the model, you give it new inputs, and it calculates outputs: this is called inferencing. Inference still requires a lot of computing, to calculate outputs from new inputs. Doing these calculations on handheld devices is now possible because of frameworks like Metal.

As you’ll see at the end of this tutorial, deep learning is far from perfect. It’s really hard to construct a truly representative set of training data, and it’s all too easy to over-train the model so it gives too much weight to quirky characteristics.

Integrating a Core ML Model Into Your App

This tutorial uses the Inception v3 model, which you can download from Apple’s Machine Learning page. Scroll down to Working with Models and download Inceptionv3.mlmodel. While you’re there, take note of the other models on the page, which classify the objects or scenes in an image.

Note: If you have a trained model created with a supported machine learning tool such as Caffe, Keras or scikit-learn, Converting Trained Models to Core ML describes how you can convert it to Core ML format.

Adding a Model to Your Project

After you download Inceptionv3.mlmodel, drag it from Finder into the Resources group in your project’s Project Navigator. Select this file, and wait for a moment. An arrow will appear when Xcode has generated the model class:

Click the arrow to see the generated class:

Xcode has generated input and output classes, and the main class Inceptionv3, which has a model property and two prediction methods.

Inceptionv3Input has an image property of type CVPixelBuffer. The Vision framework will take care of converting our familiar image formats into the correct input type.

The Vision framework also converts Inceptionv3Output properties into its own results type, and manages calls to prediction methods, so out of all this generated code, your code will use only the model property.
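
To get a feel for what that generated API looks like in use, here is a hedged sketch that calls the generated class directly, without Vision. The output property names (classLabel, classLabelProbs) come from the model's metadata and may differ in your Xcode version, so check the generated file rather than taking this literally.

import CoreML
import CoreVideo

// Sketch: using the Xcode-generated Inceptionv3 class directly, without Vision.
// Assumes the caller already has a 299x299 CVPixelBuffer in the format the model expects.
func classifyDirectly(_ pixelBuffer: CVPixelBuffer) {
    guard let output = try? Inceptionv3().prediction(image: pixelBuffer) else {
        print("Prediction failed")
        return
    }
    print("Most likely: \(output.classLabel)")
    print("Categories scored: \(output.classLabelProbs.count)")
}

In this tutorial you won't call the generated class this way; Vision handles the pixel-buffer conversion for you, which is exactly why the next section wraps the model in a VNCoreMLModel.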

Wrapping the Core ML Model in a Vision Model

In ViewController.swift, import the two frameworks, just below import UIKit:

import CoreML
import Vision

Next, add the following extension:

extension ViewController {
    
    func detectScene(image: CIImage) {
        // Load the ML model through its generated class
        guard let model = try? VNCoreMLModel(for: Inceptionv3().model) else {
            fatalError("Couldn't initialize Model")
        }
    }
}

The VNCoreMLModel(for:) initializer throws an error, so you must use try when creating it; here, try? returns nil on failure, which the guard statement handles.

VNCoreMLModel is simply a container for a Core ML model used with Vision requests.
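
If you would rather not crash with fatalError when the model fails to load, one possible variation (a sketch, not the tutorial's final code) handles the throwing initializer with do-catch instead of try?:

// Variation (sketch): handle the throwing initializer explicitly instead of try? + fatalError
func makeVisionModel() -> VNCoreMLModel? {
    do {
        return try VNCoreMLModel(for: Inceptionv3().model)
    } catch {
        print("Couldn't create VNCoreMLModel: \(error)")
        return nil
    }
}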

The standard Vision workflow is to create a model, create one or more requests, and then create and run a request handler. You’ve just created the model, so your next step is to create a request.

Add the following lines to the end of detectScene(image:):

// Create a Vision request with completion handler
let request = VNCoreMLRequest(model: model) { [weak self] request, error in
    guard let results = request.results as? [VNClassificationObservation],
        let topResult = results.first else {
            fatalError("unexpected result type from VNCoreMLRequest")
    }

    // Choose the indefinite article, then update the UI on the main queue
    let article = (self?.vowels.contains(topResult.identifier.first!) ?? false) ? "an" : "a"
    DispatchQueue.main.async { [weak self] in
        self?.InformationLabel.text = "\(Int(topResult.confidence * 100))% it's \(article) \(topResult.identifier.components(separatedBy: ",")[0])"
        print("\(Int(topResult.confidence * 100))% it's \(article) \(topResult.identifier)")
    }
}

VNCoreMLRequest is an image analysis request that uses a Core ML model to do the work. Its completion handler receives request and error objects.

You check that request.results is an array of VNClassificationObservation objects, which is what the Vision framework returns when the Core ML model is a classifier, rather than a predictor or image processor. And Inceptionv3 is a classifier, because it predicts only one feature: the image’s classification.

A VNClassificationObservation has two properties: identifier, a String, and confidence, a number between 0 and 1 giving the probability that the classification is correct. When using an object-detection model, you would probably look at only those objects with confidence greater than some threshold, such as 30%.
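
As a quick illustration of that kind of thresholding, here is a minimal sketch, assuming results is the [VNClassificationObservation] array from the completion handler and using the 30% cutoff mentioned above:

// Keep only observations the model is at least 30% confident about
let confidentResults = results.filter { $0.confidence > 0.3 }
for observation in confidentResults {
    print("\(observation.identifier): \(Int(observation.confidence * 100))%")
}

In this tutorial, though, you only use the top result.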

You then take the first result, which will have the highest confidence value, and set the indefinite article to “a” or “an”, depending on the identifier’s first letter. Finally, you dispatch back to the main queue to update the label.

The classification work happens off the main queue, because it can be slow.

Now, on to the third step: creating and running the request handler.

Add the following lines to the end of detectScene(image:):

// Run the Core ML Inceptionv3 classifier on global dispatch queue
let handler = VNImageRequestHandler(ciImage: image)
DispatchQueue.global(qos: .userInteractive).async {
    do {
        try handler.perform([request])
    } catch {
        print(error)
    }
}

VNImageRequestHandler is the standard Vision framework request handler; it isn’t specific to Core ML models. You give it the image that came into detectScene(image:) as an argument. And then you run the handler by calling its perform method, passing an array of requests. In this case, you have only one request.

The perform method can throw an error, so you wrap the call in a do-catch block.
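
To see the whole pipeline in action, here is one possible call site. This is only a sketch: the asset name "sample" and the idea of calling it from viewDidLoad or an image picker callback are assumptions, not part of the tutorial's starter project.

// Hypothetical call site inside ViewController, e.g. from viewDidLoad
// or an image picker callback. "sample" is an assumed asset name.
if let uiImage = UIImage(named: "sample"),
    let ciImage = CIImage(image: uiImage) {
    detectScene(image: ciImage)
} else {
    print("Couldn't load or convert the image")
}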