Tutorial: Create classification
iOS Machine Learning
Machine learning is a type of artificial intelligence where computers “learn” without being explicitly programmed. Instead of coding an algorithm, machine learning tools enable computers to develop and refine algorithms, by finding patterns in huge amounts of data.
Deep Learning
Since the 1950s, AI researchers have developed many approaches to machine learning. Apple’s Core ML framework supports neural networks, tree ensembles, support vector machines, generalized linear models, feature engineering and pipeline models. However, neural networks have produced many of the most spectacular recent successes, starting with Google’s 2012 use of YouTube videos to train its AI to recognize cats and people. Only five years later, Google is sponsoring a contest to identify 5000 species of plants and animals. Apps like Siri and Alexa also owe their existence to neural networks.
A neural network tries to model human brain processes with layers of nodes, linked together in different ways. Each additional layer requires a large increase in computing power: Inception v3, an object-recognition model, has 48 layers and approximately 20 million parameters. But the calculations are basically matrix multiplication, which GPUs handle extremely efficiently. The falling cost of GPUs enables people to create multilayer deep neural networks, hence the term deep learning.
(A neural network, circa 2016)
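To make the matrix-multiplication point concrete, here is a minimal sketch (plain Swift, no GPU) of what a single fully connected layer computes; a deep network is many of these stacked on top of each other:

```swift
// One fully connected layer: a matrix-vector multiply plus a bias,
// followed by a simple nonlinearity (ReLU). GPUs excel at exactly this.
func denseLayer(input: [Float], weights: [[Float]], bias: [Float]) -> [Float] {
  var output = [Float](repeating: 0, count: weights.count)
  for row in 0..<weights.count {
    var sum = bias[row]
    for col in 0..<input.count {
      sum += weights[row][col] * input[col]
    }
    output[row] = max(0, sum)  // ReLU activation
  }
  return output
}
```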
Neural networks need a large amount of training data, ideally representing the full range of possibilities. The explosion in user-generated data has also contributed to the renaissance of machine learning.
Training the model means supplying the neural network with training data, and letting it calculate a formula for combining the input parameters to produce the output(s). Training happens offline, usually on machines with many GPUs.
To use the model, you give it new inputs, and it calculates outputs: this is called inferencing. Inference still requires a lot of computing, to calculate outputs from new inputs. Doing these calculations on handheld devices is now possible because of frameworks like Metal.
As you’ll see at the end of this tutorial, deep learning is far from perfect. It’s really hard to construct a truly representative set of training data, and it’s all too easy to over-train the model so it gives too much weight to quirky characteristics.
Integrating a Core ML Model Into Your App
This tutorial uses the Inception v3 model, which you can download from Apple's Machine Learning page. Scroll down to Working with Models and download Inceptionv3.mlmodel. While you're there, take note of the other three models, which all detect objects (trees, animals, people, etc.) in an image.
Note: If you have a trained model created with a supported machine learning tool such as Caffe, Keras or scikit-learn, Converting Trained Models to Core ML describes how you can convert it to Core ML format.
Adding a Model to Your Project
After you download Inceptionv3.mlmodel, drag it from Finder into the Resources group in your project's Project Navigator. Select the file and wait a moment; an arrow will appear when Xcode has generated the model class:
Click the arrow to see the generated class:
Xcode has generated input and output classes, and the main class `Inceptionv3`, which has a `model` property and two `prediction` methods.
`Inceptionv3Input` has an `image` property of type `CVPixelBuffer`. The Vision framework will take care of converting our familiar image formats into the correct input type.
The Vision framework also converts `Inceptionv3Output` properties into its own `results` type, and manages calls to `prediction` methods, so out of all this generated code, your code will use only the `model` property.
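For reference, using the generated class directly (without Vision) would look roughly like this; it's a sketch that assumes you've already prepared a 299×299 `CVPixelBuffer`, which is exactly the work Vision saves you:

```swift
import CoreML

// Sketch: calling the generated Inceptionv3 class directly, bypassing Vision.
// `pixelBuffer` is assumed to be a 299x299 CVPixelBuffer prepared elsewhere.
func classifyDirectly(pixelBuffer: CVPixelBuffer) {
  let inception = Inceptionv3()
  guard let output = try? inception.prediction(image: pixelBuffer) else {
    print("Prediction failed")
    return
  }
  // classLabel is the most likely category; classLabelProbs maps every
  // category label to its probability.
  print(output.classLabel, output.classLabelProbs[output.classLabel] ?? 0)
}
```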
Wrapping the Core ML Model in a Vision Model
In ViewController.swift, import the two frameworks, just below `import UIKit`:
```swift
import CoreML
import Vision
```
Next, add the following extension:
```swift
extension ViewController {
  func detectScene(image: CIImage) {
    // Load the ML model through its generated class
    guard let model = try? VNCoreMLModel(for: Inceptionv3().model) else {
      fatalError("Couldn't initialize Model")
    }
  }
}
```
The `VNCoreMLModel(for:)` initializer throws an error, so you must use `try?` (or `try` inside a do-catch) when creating it.
`VNCoreMLModel` is simply a container for a Core ML model used with Vision requests.
The standard Vision workflow is to create a model, create one or more requests, and then create and run a request handler. You’ve just created the model, so your next step is to create a request.
Add the following lines to the end of `detectScene(image:)`:
```swift
// Create a Vision request with completion handler
let request = VNCoreMLRequest(model: model) { [weak self] request, error in
  guard let results = request.results as? [VNClassificationObservation],
    let topResult = results.first else {
      fatalError("unexpected result type from VNCoreMLRequest")
  }

  // Update UI on main queue
  let article = (self?.vowels.contains(topResult.identifier.first!))! ? "an" : "a"
  DispatchQueue.main.async { [weak self] in
    self?.InformationLabel.text = "\(Int(topResult.confidence * 100))% it's \(article) \(topResult.identifier.components(separatedBy: ",")[0])"
    print("\(Int(topResult.confidence * 100))% it's \(article) \(topResult.identifier)")
  }
}
```
`VNCoreMLRequest` is an image analysis request that uses a Core ML model to do the work. Its completion handler receives `request` and `error` objects.
You check that `request.results` is an array of `VNClassificationObservation` objects, which is what the Vision framework returns when the Core ML model is a classifier, rather than a predictor or image processor. And `Inceptionv3` is a classifier, because it predicts only one feature: the image's classification.
A `VNClassificationObservation` has two properties: `identifier`, a `String`, and `confidence`, a number between 0 and 1 that gives the probability the classification is correct. When using an object-detection model, you would probably look only at objects with `confidence` greater than some threshold, such as 30%.
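For instance, if you only wanted to keep reasonably confident classifications, a filter along these lines would work (a sketch; `results` is the `[VNClassificationObservation]` array unwrapped above):

```swift
// Keep only classifications with at least 30% confidence.
let confidentResults = results.filter { $0.confidence > 0.3 }
for observation in confidentResults {
  print("\(observation.identifier): \(Int(observation.confidence * 100))%")
}
```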
You then take the first result, which will have the highest confidence value, and set the indefinite article to “a” or “an”, depending on the identifier’s first letter. Finally, you dispatch back to the main queue to update the label.
The classification work happens off the main queue, because it can be slow.
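Note that the completion handler above assumes `ViewController` already declares the `vowels` collection and the `InformationLabel` outlet it references; if your project doesn't have them yet, declarations along these lines (names taken from the snippet above, the exact types are an assumption) are needed:

```swift
// Assumed to exist on ViewController, matching the names used in the
// completion handler above (illustrative declarations):
@IBOutlet weak var InformationLabel: UILabel!
let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
```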
Now, on to the third step: creating and running the request handler.
Add the following lines to the end of `detectScene(image:)`:
```swift
// Run the Core ML Inceptionv3 classifier on global dispatch queue
let handler = VNImageRequestHandler(ciImage: image)
DispatchQueue.global(qos: .userInteractive).async {
  do {
    try handler.perform([request])
  } catch {
    print(error)
  }
}
```
`VNImageRequestHandler` is the standard Vision framework request handler; it isn't specific to Core ML models. You give it the image that came into `detectScene(image:)` as an argument, then run the handler by calling its `perform` method, passing an array of requests. In this case, you have only one request. The `perform` method throws an error, so you wrap the call in a do-catch block.
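Finally, a call site for `detectScene(image:)` might look roughly like this; it's a sketch, and `pickedImage` is just an illustrative name for whatever `UIImage` your view controller is classifying:

```swift
// Hypothetical call site: convert a UIImage to a CIImage and run the classifier.
// `pickedImage` is an illustrative name, not from the original project.
func classify(_ pickedImage: UIImage) {
  guard let ciImage = CIImage(image: pickedImage) else {
    print("Couldn't convert UIImage to CIImage")
    return
  }
  detectScene(image: ciImage)
}
```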