data leakage - taoualiw/My-Knowledge-Base GitHub Wiki

Data Leakage

Data leakage is when information from outside the training dataset is used to create the model. Put another way, data leakage is the presence of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions.

Leakage is a pervasive challenge in applied machine learning, causing models to appear far more accurate in development than they are in practice, and often rendering them useless in the real world. It can be caused by human or mechanical error, and can be intentional or unintentional in either case.
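The most common mechanical form is preprocessing fitted on the full dataset before splitting. A minimal sketch of the leak and its fix, on synthetic data and assuming scikit-learn is available:

```python
# Sketch of test-to-train leakage via preprocessing (synthetic data,
# assumes scikit-learn). Fitting the scaler on ALL rows before splitting
# leaks test-set statistics into training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler sees the test rows before the split happens.
X_leaky = StandardScaler().fit_transform(X)

# Safe: split first, then fit all preprocessing inside a pipeline so it
# only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))
```

With plain standardization the leak is mild; with target encoding, imputation, or feature selection fitted on the full dataset, the inflation of validation scores can be dramatic.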

Competitions


Find challenges for every interest level.

Types of Competitions

Kaggle Competitions are designed to provide challenges for competitors at all different stages of their machine learning careers. As a result, they are very diverse, with a range of broad types.

Common Competition Types

Featured

Featured competitions are the types of competitions that Kaggle is probably best known for. These are full-scale machine learning challenges which pose difficult, generally commercially-purposed prediction problems. For example, past Featured competitions have included:

Allstate Claim Prediction Challenge - Use customers’ shopping history to predict which insurance policy they purchase

Jigsaw Toxic Comment Classification Challenge - Predict the existence and type of toxic comments on Wikipedia

Zillow Prize - Build a machine learning algorithm that can challenge Zestimates, the Zillow real estate price estimation algorithm

Featured competitions attract some of the most formidable experts, and offer prize pools going as high as a million dollars. However, they remain accessible to anyone and everyone. Whether you’re an expert in the field or a complete novice, featured competitions are a valuable opportunity to learn skills and techniques from the very best in the field.

Research

Research competitions are another common type of competition on Kaggle. Research competitions feature problems which are more experimental than Featured competition problems. For example, some past Research competitions have included:

Google Landmark Retrieval Challenge - Given an image, can you find all the same landmarks in a dataset?

Right Whale Recognition - Identify endangered right whales in aerial photographs

Large Scale Hierarchical Text Classification - Classify Wikipedia documents into one of ~300,000 categories

Research competitions do not usually offer prizes or points due to their experimental nature. They do, however, offer a slightly less competitive environment in which to work on problems that may not have a clean or easy solution but that are integral to a specific domain or area.

Getting Started

Getting Started competitions are the easiest, most approachable competitions on Kaggle. These are semi-permanent competitions that are meant to be used by new users just getting their foot in the door in the field of machine learning. They offer no prizes or points. Because of their long-running nature, Getting Started competitions are perhaps the most heavily tutorialized problems in machine learning - just what a newcomer needs to get started!

Digit Recognizer

Titanic: Machine Learning from Disaster - Predict survival on the Titanic

Housing Prices: Advanced Regression Techniques

Getting Started competitions have two-month rolling leaderboards. Once a submission is more than two months old, it will be invalidated and no longer count towards the leaderboard. This gives new Kagglers the opportunity to see how their scores stack up against a cohort of competitors rather than many tens of thousands of users.

Additionally, the Kaggle Learn platform has several tracks for beginners interested in free, hands-on data science learning, from pandas to deep learning. Lessons within a track are separated into easily digestible chunks and contain Kernel exercises for you to practice building models and trying new techniques. You’ll learn all the skills you need to dive into Kaggle Competitions.

Playground

Playground competitions are a “for fun” type of Kaggle competition, one step above Getting Started in difficulty. These are competitions which often provide relatively simple machine learning tasks, and are similarly targeted at newcomers or Kagglers interested in practicing a new type of problem in a lower-stakes setting. Prizes range from kudos to small cash prizes. Some examples of Playground competitions are:

Dogs versus Cats - Create an algorithm to distinguish dogs from cats

Leaf Classification - Can you see the random forest for the leaves?

New York City Taxi Trip Duration - Share code and data to improve ride time predictions

Other Competition Types

Recruitment

In Recruitment competitions, teams of size one compete to build machine learning models for corporation-curated challenges. At the competition’s close, interested participants can upload their resume for consideration by the host. The prize is (potentially) a job interview at the company or organization hosting the competition.

Some examples of recruiting competitions are:

Walmart Recruiting - Store sales forecasting

Airbnb Recruiting - New user booking prediction

Annual

While not a strict competition type per se, Kaggle maintains two annual competition traditions.

The first is the March Machine Learning Mania competition, which has been run during the US college basketball tournaments every year since 2014.

The second is a Santa-themed optimization competition that is run once per year around Christmas time.

Limited Participation

Kaggle rarely hosts competitions with limited participation. These competitions are either private or invite-only.

One example of a limited-participation competition is a Master’s competition: a private competition that limits visibility and submissions to invited users, generally Kaggle Masters and Grandmasters.

Competition Formats

In addition to the different categories of competitions (e.g., “Featured”), there are also a handful of different formats in which competitions are run.

Simple Competitions

Simple (or “classic”) competitions are those which follow the standard Kaggle format. In a simple competition, users can access the complete datasets at the beginning of the competition, after accepting the competition’s rules. As a competitor, you download the data, build models on it locally or in Kernels, generate a prediction file, then upload your predictions as a submission on Kaggle. The vast majority of competitions on Kaggle follow this format.

One example of a simple competition is the Porto Seguro Safe Driver Prediction Competition.

Two-stage Competitions

In two-stage competitions the challenge is split into two parts: Stage 1 and Stage 2, with the second stage building on the results teams achieved in Stage 1. Stage 2 involves a new test dataset that is released at the start of the stage. Eligibility for Stage 2 typically requires making a submission in Stage 1. In two-stage competitions, it’s especially important to read and understand the competition’s specific rules and timeline.

One example of such a competition is the Nature Conservancy Fisheries Monitoring Competition.

Kernels-only Competitions

Some competitions are Kernels-only, or code, competitions. In these competitions all submissions are made from inside a Kaggle Kernel, and it is not possible to upload submissions to the competition directly.

These competitions have two attractive features. First, the competition is more balanced, as all users have the same hardware allowances. Second, the winning models tend to be far simpler than the winning models in other competitions, as they must run within the compute constraints imposed by the Kernels platform.

Kernels-only competitions are configured with their own unique constraints on the kernels you can submit. These can be restricted by characteristics like: CPU or GPU runtime, ability to use external data, and access to the internet. To learn the constraints you must adhere to, review the Kernels Requirements for that specific competition.

An example of a Kernels-only competition is Quora Insincere Questions Classification.

Kernels-only FAQ

How do I submit using Kernels?

  1. Once you have an algorithm capable of making predictions, write the predictions generated by your code to a .csv file. Ensure this submission file conforms to the format reflected in the sample_submission.csv file and described on the competition evaluation page.

  2. Commit your Kernel. This saves your code, runs it, and creates a version of the code and output. Once your commit finishes, you will be presented a link to view the version in the Kernel viewer.

  3. In the viewer, navigate to the Output section, find the submission file you created, and click the "Submit to Competition" button.
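Step 1 can be sketched as follows, assuming pandas is available; “Id” and “Predicted” are hypothetical column names standing in for whatever the competition’s sample_submission.csv actually uses:

```python
# Sketch of building a submission file around sample_submission.csv
# (assumes pandas; "Id" and "Predicted" are hypothetical column names).
import pandas as pd

# Stand-in for the competition's sample_submission.csv.
sample = pd.DataFrame({"Id": [1, 2, 3], "Predicted": [0.0, 0.0, 0.0]})

preds = [0.12, 0.87, 0.45]  # your model's outputs, in test-row order

# Reuse the sample's layout so ids, column names, and order all match.
submission = sample.copy()
submission["Predicted"] = preds
submission.to_csv("submission.csv", index=False)
```

Copying the sample file’s layout, rather than building the frame from scratch, is the easiest way to avoid format rejections during processing.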

Can I upload external data?

Some competitions allow external data and some do not. If a competition allows external data, you can attach it to your kernel by adding it as a data source. If a competition does not allow external data, attaching it to your kernel will deactivate the "Submit to Competition" button on the associated commit.

What are the compute limits of Kernels?

The compute limits of the Kernels workers are constantly expanding. You can view the site-wide memory, CPU, runtime, and other limits from the Kernel editor.

Kernels-only competitions come in many shapes and sizes, and will often impose limits specific to a competition. You should view the competition description to understand if these limits are activated and what they are. Example variations include:

  • Specific runtime limits
  • Specific limits that apply to Kernels using GPUs
  • Internet access allowed or disallowed
  • External data allowed or disallowed
  • Custom package installs allowed or disallowed
  • Submission file naming expectations

How do I team up in a Kernels-only competition?

The setup is the same as for normal competitions, except that submissions are made only through Kernels. To team up, go to the "Team" tab and invite others.

How will winners be determined?

In some Kernels-only competitions, winners will be determined by re-running selected submissions’ associated Kernels on a private test set.

In such competitions, you will create your models in Kernels and make submissions based on the test set provided on the Data page. You will make submissions from your kernels using the above steps and select submissions for final judging from the “My Submissions” page, in the same manner as a regular competition.

Following the competition deadline, your kernel code will be rerun by Kaggle on a private test set that is not provided to you. Your model's score against this private test set will determine your ranking on the private leaderboard and final standing in the competition.

Joining a Competition

Kaggle runs a variety of different kinds of competitions, each featuring problems from different domains and having different difficulties. Before you start, navigate to the Competitions listing. It lists all of the currently active competitions.

If you click on a specific Competition in the listing, you will go to the Competition’s homepage.

The first element worth calling out is the Rules tab. This contains the rules that govern your participation in the sponsor’s competition. You must accept the competition’s rules before downloading the data or making any submissions. It’s extremely important to read the rules before you start. This is doubly true if you are a new user. Users who do not abide by the rules may have their submissions invalidated at the end of the competition, or may be banned from the platform. So please make sure to read and understand the rules before choosing to participate.

If anything is unclear or you have a question about participating, the competition’s forums are the perfect place to ask.

The information provided in the Overview tab will vary from Competition to Competition. Five elements which are almost always included and should be reviewed are the “Description”, “Data”, “Evaluation”, “Timeline”, and “Prizes” sections.

The description gives an introduction to the competition’s objective and the sponsor’s goal in hosting it.

The data tab is where you can download and learn more about the data used in the competition. You’ll use a training set to train models and a test set for which you’ll need to make your predictions. In most cases, the data or a subset of it is also accessible in Kernels.

The evaluation section describes how to format your submission file and how your submissions will be evaluated. Each competition employs a metric that serves as the objective measure for how competitors are ranked on the leaderboard.
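As an illustration of such a metric, here is log loss, one common choice for probabilistic classification (assumes scikit-learn; the metric actually used is always stated on the competition’s evaluation page):

```python
# Sketch: computing a leaderboard-style metric locally (assumes
# scikit-learn). Log loss penalizes confident wrong probabilities heavily.
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.7, 0.2]  # predicted probability of class 1
score = log_loss(y_true, y_prob)
print(round(score, 4))
```

Computing the competition’s metric locally on a held-out validation set is the standard way to estimate your leaderboard score before spending one of your daily submissions.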

The timeline has detailed information on the competition timeline. Most Kaggle Competitions include, at a minimum, two deadlines: a rules acceptance deadline (after which point no new teams can join or merge in the competition), and a submission deadline (after which no new submissions will be accepted). It is very, very important to keep these deadlines in mind.

The prizes section provides a breakdown of the prizes that will be awarded to the winners, if prizes are offered. These may come in the form of money, swag, or other perks. In addition to prizes, competitions may also award ranking points towards the Kaggle progression system. This is shown at the bottom of the Overview page.

Once you have chosen a competition, read and accepted the rules, and made yourself aware of the competition deadlines, you are ready to submit!

Forming a Team

Everyone who competes in a Competition does so as a team. A team is a group of one or more users who collaborate on the competition. Joining a team of other users around the same level as you in machine learning is a great way to learn new things, combine your different approaches, and generally improve your overall score.

It’s important to keep in mind that team size does not affect the limit on how many submissions you may make to a competition per day: whether you are a team of one or a team of five, you will have the same daily submission limit.

When you accept the rules and join a Competition, you automatically do so as part of a new team consisting solely of yourself. You can then adjust your team settings in various ways by visiting the “Team” tab on the Competition page:

You can perform a number of different team-related actions on this tab.

Types of Team Memberships

There are two team membership statuses. One person serves as the Team Leader. They are the primary point of contact when Kaggle needs to communicate with a team, and they also have some additional team modification privileges (discussed shortly). Every other person on the team is a Member.

If you are the Team Leader you will see a box next to every other team member’s name on the Team page that says “Make Leader”. You may click on this at any time to designate someone else on your team the Team Leader.

Changing your Team Name

The team name is distinct from the names of its members, even if the team consists of a single person (yourself). You can always change your team name to something custom, and other users will see that custom name when they visit the competition leaderboard. Most teams customize their names!

Anyone in the team can modify the team name by visiting the Team tab.

Merging Teams

You may invite another team to join yours or, reciprocally, accept a merge request from another team. If you propose a merger, it can be accepted or rejected by the other team’s Team Leader. If another team proposes a merger to you, your Team Leader may choose to accept or reject it.

There are some limits on when you can merge teams:

Most competitions have a team merger deadline: a point in time by which all teams must be finalized. No mergers may occur after this date

Some competitions specify a maximum team size; you will not be able to merge teams whose cumulative number of members exceeds this cap

You will not be able to merge teams whose combined daily submission count exceeds the daily submission limit

You must make at least one submission to the competition before you can merge teams

All of this can be managed through the Team tab.

Disbanding a Team

Choose your teammates wisely, as only teams that have not made any submissions can be disbanded. This can be done through the Team tab.

Making a Submission

You will need to submit your model predictions in order to receive a score and a leaderboard position in a Competition. How you go about doing so depends on the format of the competition.

Either way, remember that your team is limited to a certain number of submissions per day. This number is five, on average, but varies from competition to competition.

Leaderboard

One of the most important aspects of Kaggle Competitions is the leaderboard.

The Competition leaderboard has two parts.

The public leaderboard provides publicly visible submission scores calculated on a representative sample of the test data. This leaderboard is visible throughout the competition.

The private leaderboard, by contrast, tracks model performance on data unseen by participants. The private leaderboard thus has final say on whose models are best and, hence, on who the winners and losers of the Competition will be. Neither the subset of the test data used for the private leaderboard nor a submission’s performance on it is released to users until the competition has closed.

Many users watch the public leaderboard closely, as breakthroughs in the competition are announced by score gains on the leaderboard. These jumps in turn motivate other teams working on the competition in search of those advancements. But it’s important to keep the public leaderboard in perspective: it’s very easy to overfit to it, creating a model that performs very well on the public leaderboard but very badly on the private one.
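Why the public score is so easy to overfit can be seen with a toy simulation: on a small public split, the best of many submissions looks good purely by chance. A sketch with made-up sizes, assuming NumPy:

```python
# Sketch: 200 random "submissions" scored on a hypothetical 10% public
# split of 10,000 unguessable binary labels (assumes NumPy).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
y = rng.integers(0, 2, size=n)
public = np.arange(n) < 1_000  # first 10% plays the public test split

best_public, private_at_best = 0.0, 0.0
for _ in range(200):
    guess = rng.integers(0, 2, size=n)  # pure noise, nothing learned
    pub_acc = (guess[public] == y[public]).mean()
    if pub_acc > best_public:
        best_public = pub_acc
        private_at_best = (guess[~public] == y[~public]).mean()

# The best public accuracy drifts well above chance; its private
# accuracy stays near 0.5, because the public gain was pure luck.
print(round(best_public, 3), round(private_at_best, 3))
```

Selecting among many tweaks by public score is the same mechanism, just with real models: the gap between the selected public score and the private score is the part that was luck.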

Submitting Predictions

Submitting by Uploading a File

For most competitions, submitting predictions means uploading a set of predictions (known as a “submission file”) to Kaggle.

Any competition which supports this submission style will have “Submit Predictions” and “My Submissions” buttons in the Competition homepage header.

To submit a new prediction, use the Submit Predictions button. This opens a modal that allows you to upload your submission file. Kaggle will attempt to score this file, then add it to My Submissions once it is done being processed.

Note that to count, your submission must first pass processing. If your submission fails during the processing step, it will not be counted and not receive a score; nor will it count against your daily submission limit. If you encounter problems with your submission file, your best course of action is to ask for advice on the Competition’s discussion forum.

If you click on the My Submissions tab you will see a list of every submission you have ever made to this competition. You may also use this tab to select which submission file(s) to submit for scoring before the Competition closes. Your final score and placement at the end of the competition will be whichever selected submission performed best on the private leaderboard. If you do not select submission(s) to be scored before the competition closes, the platform will automatically select those which performed the highest on the public leaderboard, unless otherwise communicated in the competition.

Submitting by Uploading from a Kernel

In addition to the usual Competitions, Kaggle may also allow competition submissions from Kaggle Kernels. Kernels are an interactive in-browser code editing environment; to learn more about them, see the documentation sections on Kernels.

To build a Kernel-based model, start by initializing a new Kernel with the Competition Dataset as a data source. This is easily done by going to the “Kernels” tab within a competition’s page and then clicking “New Kernel.” That competition’s dataset will automatically be used as the data source. New Kernels will default as private but can be toggled to public or shared with individual users (for example, others on your team).

Build your model and test its performance using the interactive editor. Once you are happy with your model, use it to generate a solutions file within the Kernel, and write that solutions file to disk. Then click Commit & Run to build a new Kernel version using your code.

Once the new Kernel version is done (it must run top-to-bottom within the Kernels platform constraints), you will be able to see and click on a “Click to submit” button to submit your results to the Competition.

Leakage

What is Leakage?

Data Leakage is the presence of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions.

Leakage is a pervasive challenge in applied machine learning, causing models to appear far more accurate in development than they are in practice, and often rendering them useless in the real world. It can be caused by human or mechanical error, and can be intentional or unintentional in either case.

Some types of data leakage include:

Leaking test data into the training data

Leaking the correct prediction or ground truth into the test data

Leaking of information from the future into the past

Retaining proxies for removed variables a model is restricted from knowing

Reversing of intentional obfuscation, randomization or anonymization

Inclusion of data not present in the model’s operational environment

Distorting information from samples outside of scope of the model’s intended use

Any of the above present in third party data joined to the training set
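The “future into the past” item above is worth a concrete sketch: with time-ordered data, a random train/test split quietly trains on rows that postdate the rows being predicted. A sketch with made-up sizes, assuming NumPy:

```python
# Sketch of temporal leakage (assumes NumPy). A random split mixes
# future rows into training; a time-based split cannot.
import numpy as np

n = 1_000
t = np.arange(n)  # time index of each row, oldest first

# Leaky: a random permutation puts late (future) rows in the train set.
rng = np.random.default_rng(0)
idx = rng.permutation(n)
train_rand, test_rand = idx[:800], idx[800:]

# Safe: train strictly on the past, evaluate strictly on the future.
train_time, test_time = t[:800], t[800:]

print("random split trains on future rows:", bool((train_rand >= 800).any()))
print("time split trains on future rows:  ", bool((train_time >= 800).any()))
```

The same principle applies to cross-validation: for time-series problems, folds should respect time order rather than being drawn at random.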

Example

One concrete example we’ve seen occurred in a dataset used to predict whether a patient had prostate cancer. Hidden among hundreds of variables in the training data was a variable named PROSSURG. It turned out this represented whether the patient had received prostate surgery, an incredibly predictive but out-of-scope value.

The resulting model was highly predictive of whether the patient had prostate cancer but was useless for making predictions on new patients.
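A cheap sanity check that can catch this kind of leak is screening each feature’s individual predictive power: a single column that almost decides the target, like PROSSURG did, deserves scrutiny before modeling. A sketch on synthetic data, assuming scikit-learn; the column names are hypothetical:

```python
# Sketch: per-feature AUC screening for suspiciously predictive columns
# (assumes scikit-learn; data and column names are synthetic/hypothetical).
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)  # the target: has cancer, yes/no
df = pd.DataFrame({
    "age": rng.normal(60, 10, size=500),       # ordinary feature
    "psa_level": rng.normal(4, 2, size=500),   # ordinary feature
    # Leaky proxy: equals the label except for ~2% noise (like PROSSURG).
    "prossurg": np.where(rng.random(500) < 0.02, 1 - y, y),
})

for col in df.columns:
    auc = roc_auc_score(y, df[col])
    flag = "  <-- suspiciously predictive, investigate" if auc > 0.95 else ""
    print(f"{col}: AUC = {auc:.3f}{flag}")
```

A near-perfect single-feature score does not prove leakage, but it is a strong prompt to ask whether that feature would actually be available at prediction time.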
