Evolving Scoring
We have long wanted LBS to offer a richness of assessment and scoring that exceeds the commonplace. This design document will explore specific ideas to that effect.
1. The State of the Art
The quote below comes from a podcast interview with one of the developers at Khan Academy, talking about their new iPad app. It is, effectively, the crib sheet for what we need to do.
There's no grade in the iPad app. All of the assessment is really about the student growing their own skill, and understanding where they are.
The goal is to help the student take ownership of their own learning, to be aware of what they know and what they don't know. And to make it clear that with effort (there's actually these little growth mindset interventions throughout the app) they can turn all of those places where they still are needing practice, into places where they're practiced, and finally into places where they have mastered the material comfortably.
There's no place where you can go and see your "grade".
When we say mastered, that doesn't just mean that you studied for it once and got it right once. It means that you got it right, and then we waited a while, and came back to you and asked you again, and you got it right again. And then we waited a while, and came back to you, and you got it right again.
There are (at least) two fantastic ideas in here, both of which come from different interpretations of the notion: you don't have a grade.
2. You don't have a grade.
In traditional learning environments, your individual decisions and actions are reduced (averaged, aggregated) down to a single number, again and again.
This is how the "grade" is made. It's a hazy summary of your decisions, good only for a rough comparison. By the same process, let's compare steak and salmon. Add together the calories, protein, sodium, and fat, and you'll have a "score" you can use to compare them. But this is obviously meaningless, and a fruitless way to decide what to eat.
SCORM, by design, forces exactly this sort of meaningless comparison on all content, and on all learners.
Modules can be arbitrarily large or small, and about any topic. But no matter the size, complexity, or topic, each module reports only one score — one single number to represent all the thinking and learning that took place. A module that's 10 hours long with a hundred activities is weighted the same as a module that's 30 seconds long with no assessment. This makes it easy to compare different modules, and different people, but it's a vapid comparison.
We can easily do much better than that.
3. The Many-Grades Theorem
In LBS, you don't have a grade. You have many grades.
When you answer a question, that question is about one or more subjects. If you answer correctly, then you understood the subject. If you answer incorrectly, you didn't understand the subject of the question, and you didn't understand the subject of the (incorrect) answer you picked.
For LBS, quite simply, we will track all of this data: which subjects you did well with, and which ones you struggled with, on a per-answer (or per-action) basis.
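To make that concrete, here's a minimal sketch of what a per-answer record might look like. Every name and field here is hypothetical; the real schema is still to be designed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnswerEvent:
    """One answer to one question -- the atomic unit of scoring data."""
    learner_id: str
    question_id: str
    correct: bool
    subjects: list[str]  # subjects the question is about
    picked_subjects: list[str] = field(default_factory=list)  # subjects of the answer picked
    answered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def evidence(event: AnswerEvent) -> dict[str, int]:
    """+1 for each subject understood, -1 for each subject misunderstood."""
    if event.correct:
        return {s: +1 for s in event.subjects}
    # A wrong answer counts against the question's subjects, and against
    # the subjects of the chosen (incorrect) answer.
    return {s: -1 for s in set(event.subjects) | set(event.picked_subjects)}
```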
A 10-question quiz doesn't have exactly one topic. A quiz might have an overarching theme, but that's only relevant to the creator of the quiz. When you complete it, we'll show you the aspects where you did best, and the aspects where you did worst, regardless of how closely those aspects align with the broad theme of the quiz.
When we show you your grade, we're not going to show you a number. We're going to show you a multi-dimensional picture.
Then, we can use that fine-grained information to recommend what you should do next.
Now, the subject of each question isn't what we're interested in. Like the broad theme of a quiz, the subject of a question or answer is only relevant to the creator of the quiz. What we actually want to track are various qualities possessed by the learner. These qualities could be particular domains of knowledge, like pumps or pressure or physics, or they could be much broader notions like creativity, empathy, eagerness, or patience.
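Building on the sketch above, here's a hedged illustration of how per-answer evidence could roll up into per-quality grades. The subject-to-quality mapping is invented for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from question subjects to broader learner qualities.
SUBJECT_QUALITIES = {
    "pressure-reducing-valves": ["pumps-and-pressure", "systems-thinking"],
    "circuit-sketching": ["schematics", "creativity"],
}

def quality_profile(events: list[AnswerEvent]) -> dict[str, float]:
    """Fold a learner's answer events into one tally per quality --
    many grades, rather than a single aggregate number."""
    profile: dict[str, float] = defaultdict(float)
    for event in events:
        for subject, delta in evidence(event).items():
            # A subject with no mapping still counts as its own quality.
            for quality in SUBJECT_QUALITIES.get(subject, [subject]):
                profile[quality] += delta
    return dict(profile)
```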
A person's qualities evolve in parallel, weaving a unique tapestry of their knowledge and character. There's truly no such thing as an "A" student or a "C" student — that's a reduction to the point of meaninglessness for the sake of lazy comparison. When we measure them, we find that each person has a multidimensional shape; when comparing two people, we must do a multidimensional comparison. It's up to us to learn the shape of their knowledge and character, and show it to them (and their supervisors) in the clearest and richest way possible.
A single grade is a reduction in space, collapsing a multifaceted shape into a single point. Next, we're going to look at how a single grade is a reduction in time.
4. You don't have a grade.
Traditionally, when you get a "grade", it's a statement of how well you did on an individual activity. Then, all the activity grades are reduced into a single grade for the chapter. Those chapter grades are reduced to a single grade for the module. And the module grade is etched onto your tombstone.
Whether they realize it or not, most people treat a grade as a permanent statement. You never outgrow your high school transcript. You are defined by what you have done, not by what you can do.
In reality, a single grade should only matter in the instant it is achieved. If you want to understand the real knowledge or skill possessed by a person, you need to look across time. You need to see how their thoughts, or their actions, have been evolving as they learn and practice. Only with this temporally-motivated awareness can you make useful predictions about how they'll do in the future.
When a score is a single number, it captures only a single moment in time. A single score, regardless of how recent it is, tells you almost nothing about the worth of a person.
Quick — AAPL is trading at $129.62! Buy or sell?
It's impossible to decide whether that's a good price or a bad price. It depends entirely on whether the graph of its recent history shows a steady climb, or a plunge off a cliff.
Because knowledge and skill evolve over time, a single score is meaningless without historical context. Individual, absolute numbers make a statement of value that simply isn't truthful, or fair.
But again, that's exactly how traditional learning environments work, and that's especially the case for SCORM. Again, we can do much better.
5. I'll See It When I Believe It
In LBS, we will never show a numerical, absolute score value to a learner. We will only show them a relative change over time, and we will show that change visually. Your knowledge and understanding are a shape, and that shape evolves with time. How "good" or "bad" you are at something right now is only truly meaningful in the context of your own personal history, and of what is typical for people in your position.
Furthermore, our information about the progress of a learner comes from the thousands of individual actions they take on the site — buttons clicked, questions answered, games played. The further away we get from those individual moments in time, and the more we look at their learning as an evolving aggregate shape, the more important it is that we not reduce it to a number, and the more we must push ourselves to represent it truthfully as an approximation, not a concrete value. That means we need to reach for the best tools for making sense of approximate, chaotic, stochastic information.
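As one small example of such a tool, here's a minimal sketch (assumed names and parameters, not a design) of smoothing a quality's scoring history, so that what we display is an evolving curve and its recent slope, never one absolute number.

```python
from datetime import datetime, timedelta

def smoothed_history(points: list[tuple[datetime, float]],
                     half_life: timedelta = timedelta(days=30)) -> list[float]:
    """Accumulate scoring evidence with exponential decay.

    Old evidence fades with the given half-life, so the curve reflects
    where the learner is trending, not where they once were. We'd show
    the shape of this curve, not any single value on it.
    """
    smoothed: list[float] = []
    value, prev_time = 0.0, None
    for time, score in sorted(points):
        if prev_time is not None:
            value *= 0.5 ** ((time - prev_time) / half_life)
        value += score
        smoothed.append(value)
        prev_time = time
    return smoothed
```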
Good visualizations aren't just pretty. They're tools for presenting aggregated information in a way that leverages our visual/spatial abilities. When we reduce something to a numeric form, we constrain it to be reasoned about analytically, in terms of language. But when we present information visually, it can be understood by more powerful parts of the brain. The numerical values are still in there, but they're presented in a way that opens up intuitive and emotional understanding, and offers a far more powerful basis for comparison.
The richness of the data we collect is only as good as the richness with which it is presented. And when we present information about time, it should show evolution across time, and it should be alive (animated) itself.
Again, we see that a single grade is a reduction. It's a reduction in time, freezing a fluidly evolving shape, etching it onto a tombstone.
6. Short Ride In A Fast Machine
We will very soon have the technology to implement these ideas. But before that time comes, we need to make sure we know what to use it for. We will want to think deeply about...
- what qualities we want to track, and what they really mean
- activities that challenge or demand or stimulate those qualities
- new forms of assessment that measure how those qualities are exhibited
- ways of reporting the evolution of those qualities, both for individuals and in the aggregate
Which qualities do we care about?
Determining an effective set of qualities will be an ongoing struggle. "Actuators", "Systems Thinking", and "Creativity" are all very different kinds of things. Each will offer a very different view into the development of a learner. Some may be essential for us to stimulate, measure, and respond to. Others may be superfluous, or too tricky, and fall out of scope.
We may want to avoid notions like "Level 1" or "Basic", as those are judgements about the material, not about the learner. Or, we may wish to embrace them, since after all, we will be tagging our material for the qualities they stimulate and measure — and the ramifications for the learner follow from that. In other words, the qualities we select may be a reflection on the material, or on the learner, or both, or neither — we don't know yet.
As a starting point, our trades competency grid is a very well-thought-out assessment system that we should build on. It is performance- and rubric-based, making it a useful tool for us as we consider new forms of activities and assessment. Here are Carl's thoughts on it:
That was a bit of my CTS model already influencing where this was all going. If it is limited, I don't think it is so much the issue that I defined some levels (qualitatively), but rather that it is mostly two-dimensional, and doesn't yet make room for the soft skills and other abilities. And of course, once those dimensions are added, detecting improved performance levels may be a bit more threaded (weaving) and less stratified - which is the more multi-faceted progress reporting model you are advocating. Right now, in my Master Skill Level (quite an achievement if you have this bestowed upon you), one of the rubrics is: "Para-engineering skills are evident". Well, how do we assess this in LBS?
What sort of activities let us challenge these qualities?
An activity may be used to teach or to test — it's simply a matter of framing. Carl offers another great anecdote:
In John Taylor Gatto's book "Dumbing Us Down", a book which had a significant influence on me before the start of this company, he suggested that if the social studies curriculum required students to learn some basic economics and the system of commerce, then they should be sent out into the streets to start and run an enterprise of some sort for a few weeks, and then report back, and not sit and listen to a lecturer. Couldn't agree more. That's why Mark and I do less and less lecturing every year, to the point where what's left is really just a review discussion with the class.
Moving away from static, textual learning and assessment material should be our top priority. Textual lessons and multiple-choice quizzes work well for covering basic fact-based knowledge, and are comfortable for cultures that emphasize this formal/traditional style (e.g. India, the Middle East, the traditional "business" mindset). However, textbooks and tests — whether digital or not — only engage the learner through language and reading. Those skills bear little resemblance to our actual subject matter and the qualities we wish to stimulate and measure. For many people, text is a sedative. For all people, text is a narrow-bandwidth channel for information. We should be broadcasting our learning messages through as many media as possible, to reach people however they are best able to listen, and to communicate at maximum bandwidth and throughput:
- Text
- 2D/3D Imagery
- Animation
- Video
- Voiceover
- Photography
- Interaction
- Simulation
... and on, and on.
By using every communication medium at our disposal, we'll have great one-way or two-way interactions with our learners through our website. But their learning experience should extend past what they do on their computer, and happen in realtime, not just through canned transmissions of prerecorded media.
As LBS advances, we may offer learners opportunities to directly interact with another person as part of their learning process — either another learner on LBS, or an expert from CDIG. Peers may discuss specific questions or challenges on the site, grade one another, or collaborate on multiplayer activities. As they gain mastery, we should push them toward creative writing or media broadcasts of their own, which synthesize what they've come to understand and give them a leadership role in the learning of others.
Ultimately, we should pursue ideas like the one in the quote at the top of this section — sending the learner away from the computer, out to interact with their actual equipment or teammates. To truly engage the psychomotor (aka: skills) domain, they might...
- Sketch a schematic and describe its behaviour in a voiceover
- Photograph and upload an example of a pressure reducing valve
- Record a selfie video clip in front of a hydraulic machine, describing its function
- Complete a maintenance task on a hydraulic machine, with smartphone photo documentation of each step
These activities can tie in with our media / simulations, so that they're encouraged to build a connection between the material they're learning in our virtual environment, and the real work they perform.
To view and contribute to our growing collection of activity ideas, please go to the Idea Bucket.
Assessment
Assessing some of these activities will come easily. Some will require sophisticated software and design. Others will surely require manual effort, putting us in the subjective, hard-work mode of a teacher, carefully deliberating over and evaluating the student's work. Perhaps, with enough users in the system, there could be a showcasing and collegial assessment model, where learners get to critique one another's work. Of course, if we need to, we can always just apply a completion score of some kind.
In hydraulic system maintenance and troubleshooting, the ultimate test or assessment will always be:
- Can you maintain the system to the highest level of reliability and efficiency?
- Are you keeping a close eye out for any deviation from optimal performance?
- Are you taking preemptive action as needed?
- When the system does have a fault, did you resolve the issue in a safe way?
- Did the resolution use the least possible time and resources (parts, money, fluid)?
To cover all of these concerns, in addition to the simple fact/knowledge testing we require as a foundation, we would do well to devise a spectrum of activities that evenly spans the range from very primitive/traditional testing, through rich interactivity in simulation, and finally out onto the real equipment to perform and document actual on-the-job work. By assessing learners at every level up to and including on-the-job performance, we'll be able to show how they've grown, prove that they have progressed in their ability to meet the requirements listed above, and ultimately offer them meaningful rewards.
Motivation
The more sophisticated we'd like to make our material, the more it will demand of learners. Many of the ideas mentioned above will require significant buy-in from both the learners and their superiors. It will be important to make sure that the motivation and rewards are waiting for those who "play along". We will want to reward cooperative learners with elevated status and responsibility and, if possible, pay. We will want to reward cooperative companies with whatever they might prefer — perhaps, additional control over the material that gets developed for LBS, or on-site visits from members of our team for expert-level learning initiatives.
Everyone hates homework. Look at how hard it is for us to take the time, on the job, to brush up on our programming skills. We're asking for considerable buy-in, so we'd better figure out how to dangle a very juicy carrot in front of all involved.
Technology Roadmap
Achieving these goals will not be a one-step process. A number of new systems and services will need to be developed. We will need to design new activities, and implement them. We'll need new processes in place to ensure that the activities are created and deployed effectively — that they're actually testing and reporting with efficacy.
Here's an approximate ordering of sub-projects:
- Develop a database system that will effectively track learner history, including scoring information associated with arbitrary qualities (see the schema sketch after this list).
- Switch Quizzer over to this system, and away from the (fatally unreliable) Firebase.
- Improve Quizzer, both in content and functionality. Quizzer will live on as an effective means of testing Hydraulic Facts and Basic Reasoning, and may be favoured in certain cultures that emphasize rote learning.
- Develop new activities that test for advanced knowledge or more subtle qualities (Creativity, Innovation, Initiative). These activities may be "game"-like, or they may require documenting actual work on real equipment, or other initiatives that escape the confines of the computer.
- Design an effective means of visualizing the multidimensional, evolving shape of our learners.
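Regarding the first item above, here's a strawman of what the learner-history store might look like. Table and column names are placeholders, not a final schema; the point is that we store raw, timestamped, per-quality evidence, and only aggregate at display time.

```python
import sqlite3

conn = sqlite3.connect("lbs_history.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS qualities (
    id    INTEGER PRIMARY KEY,
    name  TEXT UNIQUE NOT NULL      -- e.g. 'Actuators', 'Creativity'
);
CREATE TABLE IF NOT EXISTS score_events (
    id          INTEGER PRIMARY KEY,
    learner_id  TEXT NOT NULL,
    activity_id TEXT NOT NULL,      -- quiz, game, documented field task, ...
    quality_id  INTEGER NOT NULL REFERENCES qualities(id),
    score       REAL NOT NULL,      -- signed evidence, never a final grade
    occurred_at TEXT NOT NULL       -- ISO 8601 timestamp
);
-- History queries are always per-learner, per-quality, ordered by time.
CREATE INDEX IF NOT EXISTS ix_history
    ON score_events (learner_id, quality_id, occurred_at);
""")
conn.commit()
```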
7. Conclusion
The above document may seem like an overblown way to explain the same old ideas we've been kicking around for years. However, it serves two other purposes. Reasoning through these ideas in depth will make them easier to formally design and implement — now's the time. And spending such an effort on expressing these ideas in a clear and potent form is the beginning of our introspection into how to market these features to customers and users.
This is a living document. Please issue suggestions for improvement to the document itself, and to the ideas contained herein. Thank you.
A. Comments & Feedback
Because knowledge and skill evolve over time, a single score is meaningless without historical context. Individual, absolute numbers make a statement of value that simply isn't truthful, or fair.
Carl: Absolutely true. In the construction of AB's CTS curriculum in the 90s, we made a huge effort to build an assessment system that was rubric/exemplar/growth based. I loved it, and used it, because it was a reflective and progressive process, conducted in collaboration with the student at a few junctures in time. Few other CTS teachers in the province ever used it, seeing it as a waste of time, either because it took an unusual amount of time to do, or because AB Ed, in the end, would force us to reduce all of the qualitative assessment work into a single report card mark (for the sake of the credits earned/failed database). I had to do some very perverse, contrived, and guilt-inducing calculations to convert/reduce all of the wonderful qualitative assessments into one mark.
When we reduce something to a numeric form, we constrain it to be reasoned about analytically, in terms of language. But when we present information visually, it can be understood by more powerful parts of the brain. The numerical values are still in there, but they're presented in a way that opens up intuitive and emotional understanding, and offers a far more powerful basis for comparison.
Carl: I like it. If it brings forth, replaces or possibly even improves upon, some of the beauty of the collaborative/conversation/rubric assessment system I used for CTS, then fantastic.
Chris: I agree with the idea of a visual, multi-dimensional scoring system. Not only does it offer a much deeper look into the student's skills, and keep them aware, it's easily understood too. Something like a 'player attribute web' (common in video games) came to mind right away while reading. Seeing that a 'skill bubble' is lacking in size, or that a piece of the 'web' is not extruded as far as it could be, is motivation in itself. Motivation that wouldn't exist if you were given only a single grade.
Carl: I like the notion on http://www.digitalchalk.com/ of the "assignment", where it says to do an essay, or to "ask them to complete a task". But when you click on it, the idea of sending the student out to do an assignment is gone; we're just talking about an essay, and purely catering to the cognitive (knowledge) domain only.
Quizzer Tags and Trades Competencies are a good place to start. But, we probably want to avoid notions like "Level 1" or "Beginner". Those are judgements about the material, not about the learner.
Carl: I'm not sure yet that I agree with your comment. I'll give that one some more thought.
Carl: If we're catering to corporate customers, perhaps we would give the corp some influence over assessment style and weighting? Maybe even some cultural factors to think about. India (traditionally a world where rote learning is revered) is oh so different from the average school in Seattle or San Francisco.
Ivan: Here's a talk from StrangeLoop last year that gives an overview of using space-filling curves as a way of reducing multidimensional data while preserving locality — useful for comparison (search, recommendation, etc). Here are a few links that were kicking around in my Evernote that might be related to collecting metrics:
- http://railscasts.com/episodes/407-activity-feed-from-scratch
- https://heapanalytics.com
- http://ankane.github.io/chartkick/
Here's a neat visualization (oscillating rotation of 3d point plot inside an axis cube): https://marcinciura.wordpress.com/2015/07/01/the-vector-space-of-the-polish-parliament-in-pictures/ As suggested by that article, doing a Principal Component Analysis might be a good way to manage the plurality of dimensions, reducing them fairly for the sake of comparison.
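For example, a minimal PCA-by-SVD sketch, assuming each learner's quality profile is a row in a matrix (the numbers are invented):

```python
import numpy as np

# Rows are learners, columns are qualities (values invented for illustration).
profiles = np.array([
    [0.9, 0.2, 0.7, 0.1],   # learner A
    [0.3, 0.8, 0.4, 0.9],   # learner B
    [0.6, 0.5, 0.6, 0.4],   # learner C
])

# Center the data, then use SVD to find the principal axes.
centered = profiles - profiles.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first two components: one (x, y) point per learner,
# fairly comparable in a flat plot despite the original dimensionality.
coords_2d = centered @ vt[:2].T
print(coords_2d)
```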
Ivan: For motivation, in addition to virtual points, we might want to award physical swag to top performers on the site. Certification is one idea we've discussed. Ball caps and jackets and other items of (tastefully) branded apparel have come up. I was listening to an episode of 99% Invisible where they were discussing challenge coins, and it occurred to me that these might be another way to offer a meaningful reward for achievement on the site. We could even start with virtual coins for simple achievements, and then award physical coins at a higher level.
While we're at it, here's another link: Vega (HN)
Ivan: Formative Assessment (or Evaluation) is an evidence-based teaching technique that we should study and implement if possible. It dovetails nicely with "just-in-time" learning and inquiry-based learning. We're already flirting with the technique, but only in a casual way, due to lack of proper awareness, terminology, and discussion.
Summative assessment evaluates what students know or have learned at the end of the teaching, after all is done. Formative evaluation refers to any activity used as an assessment of learning progress before or during the learning process itself. -Paraphrased from http://visible-learning.org/hattie-ranking-influences-effect-sizes-learning-achievement
With formative assessment, the first priority is to serve the purpose of promoting learning. It differs from assessment designed primarily to serve the purposes of accountability, or of ranking, or of certifying competence. An assessment activity can help learning if it provides information to be used as feedback by teachers, and by their pupils, in assessing themselves and each other, to modify the teaching and learning activities in which they are engaged. -Paraphrased from http://www.octm.org/files/8013/4379/6051/_Formative_Assessment_Brief_copy.pdf
Alexander Coward is (or was, by the time you read this) a Mathematics teacher at UC Berkeley. He tore a hole in their tired, stodgy, regressive approach to teaching math by applying formative assessment, marking homework based on the quality of the work done rather than the degree of completion, and other progressive techniques. Students came away with a statistically significant advantage in future math classes, an atypical appreciation for the material, and a tremendous fondness for the teacher. The department, unable to coerce him into using more "ordinary" techniques, fired him.
I include this information here so that I am reminded to follow-up on the path of his career down the road.
Here's yet another neat way of visualizing multidimensional data — the Munsell Color System
More display types:
https://en.wikipedia.org/wiki/Streamgraph
https://en.wikipedia.org/wiki/Chord_diagram
https://en.wikipedia.org/wiki/Treemapping
This is probably not useful, but it's related and cool and might be good for other things: https://en.wikipedia.org/wiki/Sankey_diagram
Here's a nice info-vis: http://onformative.com/work/skype-visualization
The "skill bubble" is known as a Radar Chart or a Spider Chart.
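Here's a quick sketch of one, using matplotlib, with invented quality names and values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented quality names and values, just to show the shape of the chart.
qualities = ["Pumps", "Actuators", "Schematics", "Creativity", "Initiative"]
values = [0.8, 0.55, 0.7, 0.4, 0.6]

# One angle per quality; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(qualities), endpoint=False).tolist()
values += values[:1]
angles += angles[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(qualities)
ax.set_ylim(0, 1)
plt.show()
```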
Here's a really nice idea for presenting learning records: http://www.mastery.org/a-new-model/
Carl recommended this site.
This might be a neat way to show the accumulation of points over time (spiralling outward, not inward, of course). The period could change depending on the age of the account, 1 month at first and then later showing 1 year (like this image). Source: https://www.economist.com/node/12798595
https://morphocode.com/location-time-urban-data-visualization/
Excellent time-series datavis examples in this article.
Catalog of data vis types: https://datavizcatalogue.com/index.html
https://stackoverflow.blog/2022/03/03/stop-aggregating-away-the-signal-in-your-data/
Found via: https://twitter.com/zanstrong/status/1499487433557098501
Beautiful blog post and imagery about preserving the full fidelity of data and not reducing everything to an average.