Record Linkage

After practising the examples from this chapter, a few things stand out for production-level code:

  1. When reading a file into a DataFrame, define the schema up front with the StructType class (if the data types are known in advance). That saves a lot of time on mammoth-size data; otherwise Spark has to traverse the data twice, once to infer the schema and once to load it. See the first sketch after this list.
  2. Cache the DataFrame when it will be reused across several actions; that also saves time, because Spark reuses the in-memory copy instead of re-reading the source (also shown in the first sketch).
  3. Use a case class (when writing production-level code around any algorithm) instead of dynamically typed field names. That helps abstract away the Spark-specific model components and keeps field access compile-time checked. See the second sketch below.
  4. Wait for more to come... :)
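A minimal sketch of points 1 and 2, assuming a CSV layout loosely modelled on the record-linkage data: the column names, the `?` null marker, and the path `linkage/block_*.csv` are assumptions for illustration, not taken from this page.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object LoadLinkageData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RecordLinkage")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical subset of the record-linkage columns; declaring the
    // schema up front avoids the extra inferSchema pass over the file.
    val schema = StructType(Seq(
      StructField("id_1", IntegerType, nullable = false),
      StructField("id_2", IntegerType, nullable = false),
      StructField("cmp_fname_c1", DoubleType, nullable = true),
      StructField("is_match", BooleanType, nullable = false)
    ))

    val parsed = spark.read
      .option("header", "true")
      .option("nullValue", "?")   // assumed null marker in the raw files
      .schema(schema)
      .csv("linkage/block_*.csv") // assumed path
      .cache()                    // point 2: keep the parsed data in memory

    parsed.count()                            // first action materializes the cache
    parsed.groupBy("is_match").count().show() // later actions reuse the cached copy
  }
}
```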
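And a sketch of point 3: a hypothetical case class mirroring the assumed schema above, so fields are accessed through a typed Dataset instead of string column names. `MatchData`, its fields, and the tiny in-memory sample are illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the assumed schema above.
// Option[Double] covers the nullable comparison score.
case class MatchData(
  id_1: Int,
  id_2: Int,
  cmp_fname_c1: Option[Double],
  is_match: Boolean
)

object TypedLinkageData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RecordLinkage")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders needed for toDF and .as[MatchData]

    // Tiny in-memory stand-in for the parsed DataFrame from the previous sketch.
    val parsed = Seq(
      (1, 2, Option(0.95), true),
      (3, 4, Option.empty[Double], false)
    ).toDF("id_1", "id_2", "cmp_fname_c1", "is_match")

    // Converting to a typed Dataset replaces string column lookups
    // with compile-time checked field access.
    val matchData: Dataset[MatchData] = parsed.as[MatchData]
    println(matchData.filter(_.is_match).count())
  }
}
```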