Week 02 Coding and Probability - pointOfive/stat130chat130 GitHub Wiki
Chance is intuitive, and AI ChatBots can make coding and understanding code easier
Tutorial/Homework: Topic
- python object types... tuple, list, dict
- another key data type... np.array (and np.random.choice)
- for loops... for i in range(n):
- print()
- for x in some_list:
- for i,x in enumerate(some_list):
- for key,val in dictionary.items() and dictionary.keys() and dictionary.values()
- logical flow control... if, elif, else
Tutorial/Homework: Lecture Extensions
- more object types... type()
- more indexing for "lists"
- more np.array with .dtype
- more "list" behavior with str and .split()
- text manipulation with .apply(lambda x: ...), .replace(), and re
- operator overloading
- What are pandas DataFrame objects?
- for word in sentence.split():
Lecture: New Topics
- from scipy import stats, stats.multinomial, and probability (and np.random.choice)
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as modular code design (with def based functions or classes)
- ...such as dictionary iteration (which has been removed from the above material)
- ...such as text manipulation with .apply(lambda x: ...), .replace(), re (which are introduced but are generally out of scope for STA130)
Tutorial/Homework: Topics
Types
A tuple is an object containing a "sequential collection" of "things" that is created and represented with round brackets.
Tuples are immutable, which means that after a tuple is created you cannot change, add, or remove items; so, tuples are ideal for representing data that shouldn’t really change, like birthday dates or geolocation coordinates of natural landmarks.
example_tuple = (1, 'apple', 3.14, 1) # tuples can contain duplicate elements
example_tuple
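As a minimal sketch of what immutability means in practice (the name `updated_tuple` is just an illustrative choice), the code below reads from a tuple and then "changes" it the only way that is possible, by building a new tuple.

```python
example_tuple = (1, 'apple', 3.14, 1)

print(example_tuple[1])        # access an element by its position
print(len(example_tuple))      # number of elements, duplicates included
print(example_tuple.count(1))  # how many times the value 1 appears

# example_tuple[1] = 'pear'    # uncommenting this line raises a TypeError

# "Changing" a tuple really means building a new one
updated_tuple = example_tuple + ('new item',)
print(updated_tuple)
```

The original `example_tuple` is left untouched by the last step; only the new name refers to the longer tuple.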
A list
is another kind of object containing a "sequential collection" of "things" that is created and represented with square brackets.
Unlike tuples, lists are mutable, which means they can be altered after their creation. If you don't want to recreate a tuple from scratch each time you need to change your collection of things, then you should use a list!
example_list = [1, 'banana', 7.77, 1] # lists can also contain duplicate elements like tuples
example_list.append('new item') # here we add a new element onto the list
# to do the same thing with a tuple you'd have to create a completely new tuple
# as `example_tuple_update = (1, 'banana', 7.77, 1, 'new item')`
example_list
A dict ("dictionary") is an object that organizes "things" in a "key-value" pairs "look up structure" rather than as a "sequential collection"; it is created and represented using "key-value" pairs inside of curly braces (as demonstrated below).
Dictionaries are mutable like lists, but each "key" of a dictionary is unique (so it uniquely references its corresponding "value"). Since a dict is based on a "look up" system, its items are accessed by "key" rather than by numerical position as with tuples and lists. Dictionaries were historically so-called unordered objects; since Python 3.7 a dict does preserve the insertion order of its keys, but it is still best thought of as a "look up" structure rather than a sequence.
example_dict = {'id': 1, 'name': 'orange', 'price': 5.99} # There cannot be duplicate "keys" but there could be duplicate "values"
example_dict['quantity'] = 10 # adds a new "key-value" pair
del example_dict['quantity'] # removes a "key-value" pair
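The "look up" behavior and the uniqueness of "keys" can be sketched as follows; this is only an illustration, and the replacement price `6.49` is an arbitrary made-up value.

```python
example_dict = {'id': 1, 'name': 'orange', 'price': 5.99}

print(example_dict['name'])       # look up a "value" by its "key"
print('price' in example_dict)    # check whether a "key" exists

# Assigning to an existing "key" replaces its "value" rather than adding a duplicate "key"
example_dict['price'] = 6.49
print(example_dict)
print(len(example_dict))          # still three "key-value" pairs
```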
The use of dictionaries to rename the columns of pandas DataFrame objects
was previously seen in the Variables and Observations section of Week 01 of the course wiki-textbook; and an example of a more elaborate dictionary object and its extension (again related to the pandas DataFrame objects
context) is given in the
"What are pd.DataFrame objects?" section below.
np.array
NumPy is a Python library that provides highly efficient implementations of standard numerical routines.
For example, a NumPy np.array
is preferred over a list for its speed and functionality in numerical tasks.
The NumPy library is imported and a np.array
is created from a list object as follows.
import numpy as np
example_array = np.array([1, 2, 3])
An example numerical task that can be done with NumPy is to select a random value from an np.array
object as follows.
random_element = np.random.choice(example_array)
random_element
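One way to see the extra "functionality in numerical tasks" is to compare how a list and an np.array respond to the same arithmetic. The sketch below uses hypothetical names (`plain_list`, `doubled_array`, `repeated_list`) chosen only for illustration.

```python
import numpy as np

plain_list = [1, 2, 3]
example_array = np.array(plain_list)

doubled_array = example_array * 2   # multiplies every element by 2
repeated_list = plain_list * 2      # repeats the list instead

print(doubled_array)          # [2 4 6]
print(repeated_list)          # [1, 2, 3, 1, 2, 3]
print(example_array.mean())   # 2.0
```

The array treats `*` as elementwise arithmetic, which is usually what numerical work calls for.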
for loops
The range(n) function in Python generates numbers from 0 to n-1.
If you run range(n) on its own you won't see these values, because a range object is "lazy" (it is often loosely described as a so-called generator): it only produces the actual values within the context of a looping structure (or something else) which sequentially requests the values. This is clever because it means the numbers themselves don't have to be stored in memory anywhere, and can instead just be sequentially produced one at a time as needed.
The print() function displays its object argument. Since (as discussed above) range(5) does not produce its values until they are requested, if you run the code print(range(5)), you will get the following output.
range(0, 5) # output from running `print(range(5))`
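If you do want to see the values themselves, one option is to request them all at once by wrapping the range in `list()`. The sketch below also shows the optional start and step arguments; the names `numbers` and `stepped` are illustrative only.

```python
numbers = range(5)
print(list(numbers))   # [0, 1, 2, 3, 4]

# range(start, stop, step): begin at 2, stop before 10, count by 3
stepped = list(range(2, 10, 3))
print(stepped)         # [2, 5, 8]
```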
The for i in range(n): template is the coding construct that is used to specify the repetition of a block of code n times.
The block of code that the for loop repeats will be executed "silently"; so, if you want to display anything inside of a for loop you need to explicitly use the print() function in the body of your for loop as demonstrated below.
for i in range(5): # "iterates" `i` over the values 0, 1, 2, 3, 4
    # the "body" of a `for` loop --
    # the "indented code block" below the `for` statement syntax
    print(i) # is executed sequentially for each value of the `i` "iterator"
Python code WILL NOT WORK unless properly indented... this is an interesting "feature" of Python coding that helps to make code more readable!
Here's a step by step break down of what the for
loop code above is doing.
- Initialization: the for loop starts with the keyword for
- Iterator Variable: i is the "iterator" variable that will sequentially change with each "iteration" of the for loop
- The range() function: range(5) will "iteratively" generate the sequence of numbers from 0 to 4 (since Python uses "0-indexing"), and these will be sequentially assigned to "iterator" i
- Loop Body: the code block indented under the for loop defines what happens during each "iteration" (in this case, the sequentially assigned values of i will be printed)
- Iteration Process:
  - In the first iteration, i is set to 0, the first number produced by the range generator.
  - The print(i) statement is executed, printing 0 to the screen.
  - The for loop now iterates by setting i to the next number produced by the range generator, which is 1.
  - print(i) is executed again, this time printing 1.
  - This process repeats until i has "iterated" through all of the values produced by the range generator.
- Termination: once i has reached 4 there are no more i "iterations", the loop ends, and the program continues with any code following the loop.
More for Loops
It is sometimes useful to iterate through a custom list rather than a range(n)
generator.
Below, instead of "iterator" i
, we denote the "iterator" as x
to emphasize that it's not a "numerical index iterator".
This is not strictly necessary, since you can name your "iterator" variable whatever you want to and then access it as such in the body of the for loop.
a_list = ['apple', 'banana', 'cherry'] # or, equivalently: `a_list = "apple banana cherry".split()`
for x in a_list: # note that we don't have to use the `range()` function here!
    print(x) # we can just "iterate" through the "iterable" list `a_list`!
It is additionally sometimes useful to iterate through a custom list while still having a "numerical index iterator" as well.
This is done by wrapping the enumerate() function around a list (or tuple) object.
for i,x in enumerate(a_list):
    print(f"Index: {i}") # this useful syntax pastes `i` into the displayed string
    print(f"Value: {x}") # this useful syntax pastes `x` into the displayed string
print("Iteration Completed")
The enumerate(a_list) call adds a "numerical index iterator" (i) to the "iterable" list (a_list) and returns it as an enumerate object which the for loop understands and unpacks into i and x at each "iteration" as indicated by the i,x syntax.
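To peek inside an enumerate object, one option is to wrap it in `list()`, which reveals the (index, value) pairs that the for loop unpacks. This is a small sketch; the name `pairs` is illustrative.

```python
a_list = ['apple', 'banana', 'cherry']

pairs = list(enumerate(a_list))
print(pairs)   # [(0, 'apple'), (1, 'banana'), (2, 'cherry')]

# Unpacking one pair by hand mirrors what `for i,x in enumerate(a_list):` does
i, x = pairs[1]
print(i, x)    # 1 banana
```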
One more for
loop structure that can sometimes be useful is "iterating" through dictionaries based on the .items()
, .keys()
, or .values()
methods of a dictionary.
my_dict = {'a': 1, 'b': 2, 'c': 3}
for key in my_dict.keys():
    print(key)
for key, value in my_dict.items():
    print(f"Key: {key}, Value: {value}")
for value in my_dict.values():
    print(value)
Logical Flow Control
FizzBuzz is a classic programming challenge that’s often used to teach basic programming concepts like loops and conditionals. In the FizzBuzz problem, you loop through a range of numbers, and then do the following for each number.
- If the number is divisible by 3, you print “Fizz”.
- If the number is divisible by 5, you print “Buzz”.
- If the number is divisible by both 3 and 5, you print “FizzBuzz”.
- If the number is not divisible by 3 or 5, you print the number itself.
Here’s an example FizzBuzz program that illustrates the use of if, elif, and else conditional statements for logical flow control with comments explaining each step.
for i in range(1, 101): # Loop from 1 to 100
    if i % 3 == 0 and i % 5 == 0: # Check if divisible by both 3 and 5
        print("FizzBuzz")
    elif i % 3 == 0: # Check if divisible by 3
        print("Fizz")
    elif i % 5 == 0: # Check if divisible by 5
        print("Buzz")
    else: # If not divisible by 3 or 5
        print(i)
- The for loop sets up the iteration (from 1 to 100 in this example).
- The first if statement checks for the first condition (divisible by both 3 and 5).
  - The modulus operation % returns the remainder of "i divided by 3"; so, it's just an operation like i+3, i-3, or i/3; but, if the remainder is 0 then it means that "i divides by 3 perfectly"
  - The and construction in i % 3 == 0 and i % 5 == 0 means that both i % 3 == 0 AND i % 5 == 0 must be True for the if statement to succeed
  - If the if statement succeeds (because its "argument" is True) then the "body" (the indented code block beneath the if statement) of the if statement is executed
- The next two elif ("else if") statements each sequentially check if their respective conditions (divisible by 3, and then 5) are true statements, and execute their code block "bodies" if so
  - Using elif instead of if "connects" the logical statements into a single logical control flow sequence in which at most one "body" is executed, and whose conditions can be understood in an "else if" manner that can help improve the clarity of the checks.
- The else statement covers the case where none of the above conditions are True and concludes the logical control flow sequence.
- The print() function outputs the result based on the condition that’s met.
This structure allows the program to make decisions at each iteration, using logical flow control structures within a for
loop to print out a mix of numbers and the words “Fizz”, “Buzz”, and “FizzBuzz” based on the divisibility of each number.
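Because only the first True condition in an if/elif/else sequence has its "body" executed, the order of the checks matters. The sketch below is a deliberately flawed variation (not the recommended solution) that places the "both" check last and collects results in a hypothetical list called `flawed_results`, so that the consequence can be inspected.

```python
# Same checks as FizzBuzz, but with the "both" check placed last (a deliberately flawed ordering)
flawed_results = []
for i in range(1, 16):
    if i % 3 == 0:
        flawed_results.append("Fizz")
    elif i % 5 == 0:
        flawed_results.append("Buzz")
    elif i % 3 == 0 and i % 5 == 0:   # never reached: 15 was already caught above
        flawed_results.append("FizzBuzz")
    else:
        flawed_results.append(i)

print(flawed_results[-1])   # Fizz (the entry for 15, which should have been FizzBuzz)
```

Checking the most specific condition first, as the program above does, avoids this problem.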
try-except blocks
A similar logical flow control structure to if/else statements is the try-except block. Rather than checking for True conditions, however, try-except blocks check for the presence of "run time errors", which are stored as a kind of Python object known as an Exception. In the code below, Python tries to run "one"/"two" but doesn't know how to do this; but, the except Exception as e construct allows the nature of the error to be captured as an Exception object and named e, which can then be printed out and examined without causing the code to fail (as it would if you tried to run "one"/"two" without wrapping a try-except block around it).
try:
    "one"/"two"
except Exception as e:
    print(f"An error occurred: {e}")
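Different kinds of errors are different kinds of Exception objects, and an except clause can name the specific kind it handles. The following is a minimal sketch under that idea; the list `values` and the messages are made up for illustration.

```python
values = [4, 0, "two"]
messages = []

for v in values:
    try:
        messages.append(f"10 / {v} = {10 / v}")
    except ZeroDivisionError as e:    # raised by 10 / 0
        messages.append(f"Cannot divide by zero: {e}")
    except TypeError as e:            # raised by 10 / "two"
        messages.append(f"Wrong type: {e}")

for m in messages:
    print(m)
```

The loop keeps going after each error because every failure is caught and recorded rather than stopping the program.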
Tutorial/Homework: Lecture Extensions
More types
While we will consider types of data (like numerical or categorical) to inform decisions about what kinds of analysis are going to be the most appropriate for a given dataset (as we saw when discussing .describe() and .value_counts()), this is somewhat different than considering the type of an object in Python. To determine the type of an object in Python we use the type() function, which returns the specific data structure (called the class) of the object in the Python programming language.
For example:
x = 10
print(type(x)) # <class 'int'>
y = "Hello"
print(type(y)) # <class 'str'>
z = [1, 2, 3]
print(type(z)) # <class 'list'>
import numpy as np
my_array = np.array([1, 2, 3, 4])
print(type(my_array)) # Output: <class 'numpy.ndarray'>
a = True
print(type(a)) # <class 'bool'>
b = (5 > 3) # This expression evaluates to True
print(type(b)) # <class 'bool'>
x2 = 3.14
print(type(x2)) # <class 'float'>
y = float(10) # Converting an integer to a float
print(type(y)) # <class 'float'>
Here, type()
returns the class of the object, which tells you what kind of data structure it is: int
, str
, etc. (and we've previously introduced tuple, list, dict object types). In the bool
examples, True
and the result of a comparison 5 > 3
are both of type bool
. In the float
examples, 3.14
is a numeric (floating-point) decimal number, and converting 10
to a float
gives a float
type as well. And the final line of code explicitly converts (as opposed to implicitly coercing) an int
type into a float
.
Something to consider would be what might happen the other way, if we tried to convert a float (such as x2 above) to an int (as in int(x2)). More generally, what different object types might naturally convert to other object types? If you recall how coercion automatically converted bool types to int types (because there was a well-known rule that made it obvious how this would be done), this might be a good example to start thinking about what possibilities make sense (or don't make sense).
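If you'd like to check your guesses, here is a small sketch of a few explicit conversions (the particular values are arbitrary examples).

```python
x2 = 3.14

print(int(x2))        # 3   (the decimal part is dropped, not rounded)
print(int(-3.99))     # -3  (truncation is toward zero)
print(int(True))      # 1
print(float("2.5"))   # 2.5
print(str(10) + "!")  # 10!

# Not every conversion makes sense: int("hello") raises a ValueError
```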
More indexing
Different object types (of course) serve different purposes. It may be intuitive to imagine potential uses of a list object; but, it's interesting to note that we index (and "negative" index) into a list (just as we iterate through for loops) using an int and can rely on bool objects when constructing logical conditionals for boolean selection as demonstrated below.
print(z[0]) # Output: 1
print(z[-1]) # Output: 3
print(z[1]) # Output: 2
print(z[-2]) # Output: 2
my_array[my_array<3] # Output: array([1, 2])
The indexing shown above is analogous to the row-based indexing and boolean selection indexing using logical conditionals that we previously saw for pandas DataFrames. Indeed, you will be able to slice index into list
and np.array
objects just as with pandas
DataFrames. Ask a ChatBot for examples of how to select elements from list
and np.array
objects if you are curious to see some of the options available for doing this.
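As a starting point, here is a brief sketch of slice indexing using the start:stop:step pattern, reusing the `z` and `my_array` objects defined earlier.

```python
import numpy as np

z = [1, 2, 3]
my_array = np.array([1, 2, 3, 4])

print(z[0:2])                  # [1, 2]   elements at positions 0 and 1
print(z[1:])                   # [2, 3]   from position 1 to the end
print(my_array[1:3])           # [2 3]
print(my_array[::2])           # [1 3]    every second element
print(my_array[my_array > 2])  # [3 4]    boolean selection
```

Note that the stop position of a slice is not included, just as range(n) stops at n-1.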
More np.array
While Python's built-in object types like list
and str
(as discussed below) are powerful, sometimes you need more specialized tools for handling numeric data. The np.array()
function allows you to create "arrays" specifically designed for efficient computation of mathematical operations with large amounts of numeric data compared to a Python list
(of numeric values). In addition to its extensive functionality, the computational speed benefits of numpy
are a big reason why numpy
is (indeed) a popular library for numerical computing. The reason np.array objects can offer faster computational performance is that the items in an np.array object must all have the same object type.
In the code above, my_array is an np.array object, and every element of my_array is of type int (or, technically, an int64 which explicitly indicates that numpy is using a 64-bit integer object type). This can be seen using the dtype attribute of the np.array object. Notice the similarity, but distinction between type() and .dtype (with the latter being the way to see the homogeneous object type of an np.array object).
print(type(my_array)) # Output: <class 'numpy.ndarray'>
print(my_array.dtype) # Output: int64 (or another integer type depending on your system)
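The requirement that all items share one type can be sketched as follows: mixing an int with a float makes numpy store everything as a float, and the type can also be converted explicitly. The names `int_array`, `float_array`, and `converted` are illustrative.

```python
import numpy as np

int_array = np.array([1, 2, 3, 4])
float_array = np.array([1, 2.5, 3])    # one float makes every element a float
converted = int_array.astype(float)    # explicit conversion of the whole array

print(float_array.dtype)   # float64
print(converted.dtype)     # float64
print(converted)           # [1. 2. 3. 4.]
```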
Also notice that the .dtype attribute of an np.array object serves exactly the same purpose as the analogous .dtypes attribute of a pd.DataFrame object introduced previously; and, similarly, that the type() of an object (and changing the object type as demonstrated above) should remind you of the .astype() method introduced alongside .dtypes for setting the type of data within a column of a pd.DataFrame object.
The numpy module and np.array objects are a key part of working with numeric data in Python, offering more advanced capabilities for mathematical operations and data manipulation than basic Python lists. So keep an eye on the numpy library. It will likely be something that you'll come across in different contexts relatively frequently in the future.
More list behavior with str and .split()
In Python, a string (str
object type) is a sequence of characters enclosed within (either single or double) quotes. Strings are one of the most commonly used object types for representing text (obviously), from single words to entire sentences or paragraphs. For example, sentence
below is a string object.
sentence = "Learning Python is fun!"
You can treat strings as sequences of individual characters, and each character has an index (starting from 0), just as if it was a list or an np.array! For instance, to access the first, sixth, eleventh, ninth, third to last, and last characters of the string we would use the following.
first_char = sentence[0]
print(first_char) # Output: 'L'
first_char = sentence[5]
print(first_char) # Output: 'i'
first_char = sentence[10]
print(first_char) # Output: 'y'
first_char = sentence[8]
print(first_char) # Output: ' '
first_char = sentence[-3]
print(first_char) # Output: 'u'
first_char = sentence[-1]
print(first_char) # Output: '!'
Strings, however, are immutable, which means that once a string is created, individual characters within the string cannot be changed. You may recall that this immutable behavior is what distinguishes a tuple
(which like a string is immutable) compared to a list
(which unlike a string is mutable). So, trying to modify a specific letter, like sentence[8] = '-'
will result in an error. Nonetheless, strings can still be manipulated in various ways to simply create new strings as needed. So, to "change" a string means to just create a new different string, rather than editing the original string.
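A short sketch of "changing" a string by creating a new one (the names `dashed` and `shouted` are illustrative):

```python
sentence = "Learning Python is fun!"

# sentence[8] = '-' would raise a TypeError, so build new strings instead
dashed = sentence[:8] + '-' + sentence[9:]
print(dashed)      # Learning-Python is fun!

shouted = sentence.upper()
print(shouted)     # LEARNING PYTHON IS FUN!
print(sentence)    # the original is unchanged: Learning Python is fun!
```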
One of the most useful operations on strings is to break them apart into a list. As you know, a list in Python is an ordered collection of items, each of which can be of any object type. As you might expect, if you break a string apart into a list, the resulting elements of the list will be the (sub) strings (or characters) of the original string. But since a list is mutable, the "items" of the "string" in the converted list form could then be changed. Then (as should be reasonably expected) the modified list could be converted back into a string, producing an "edited" version of the original string.
To convert a string into a list of words, you can use the split() method of the string object. By default, split() divides the string wherever there is whitespace, creating a list of words. Strings also have a join() method, the most useful construction of which is for the " " blank space string, which as demonstrated below can be used to reconstruct a string from a list.
words = sentence.split()
print(words) # Output: ['Learning', 'Python', 'is', 'fun!']
words[3] = 'not'
words.append('too') # the append method of a list modifies a list
words.append('bad') # by adding an item to the end of the list
words += ['once', 'you', 'get', 'the', 'hang', 'of', 'the', 'process'] # '+' operator overloading
" ".join(words)
As you can see above, converting a string to a list allows us to work with the words of a sentence individually, and leverage all the functionality of list operations. For example, we could now iterate over the words with a for loop, count the number of words with len(words), or access and modify individual words or extend the list "sentence" (as we've done above), etc.
Operator Overloading
In Python, the same operator can behave differently depending on the object types being operated on. This is called operator overloading and is a specific case of a broader concept known as polymorphism (meaning "many behaviors"). Polymorphism in the form of operator overloading allows different object types to respond to the same operation in ways that are most appropriate to their object type.
Let's take the +
operator as an example. In Python, +
behaves differently depending on whether it’s applied to numbers (like float
or int
), strings, or list
object types. For instance, +
concatenates (or joins) strings together. This is of course an example of operator overloading because +
is being used in a way that is specific to strings.
greeting = "Hello, " + "world!"
print(greeting) # Output: Hello, world!
Similarly, when +
is applied to two list
object types, it concatenates (or joins or combines) the two into one, extending the original list
with the elements of the second one. You may recognize that you've already seen this behavior above, but here it is again in a slightly different manner. We used +=
in the version of this example above, which is just shorthand for combining and assigning in one step.
words = ['Learning', 'Python', 'is', 'not', 'too', 'bad']
new_words = ['once', 'you', 'get', 'the', 'hang', 'of', 'the', 'process']
words = words + new_words # words += ['once', 'you', 'get', 'the', 'hang', 'of', 'the', 'process']
print(words)
# Output: ['Learning', 'Python', 'is', 'not', 'too', 'bad', 'once', 'you', 'get', 'the', 'hang', 'of', 'the', 'process']
It actually makes a lot of sense that both string and list object types behave in a similar way regarding concatenation. Especially when you remember that you can index into a string using an int just as you can with a list object.
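The same pattern holds for the `*` operator. The following side-by-side sketch (with arbitrary example values) shows how `+` and `*` each mean something different for numbers, strings, and lists.

```python
print(1 + 2)        # 3          numeric addition
print("1" + "2")    # 12         string concatenation
print([1] + [2])    # [1, 2]     list concatenation

print(3 * 2)        # 6          numeric multiplication
print("ab" * 3)     # ababab     string repetition
print([0] * 3)      # [0, 0, 0]  list repetition

# Mixing types that have no shared meaning for `+` raises a TypeError:
# "1" + 2
```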
To drive home the concatenation behavior of operator overloading when it comes to + and strings, recall how the construction df.isna().sum(axis=1) counts the number of missing values across rows by coercing True and False to 1 and 0 and summing them. What, then, would df[["string_column_1","string_column_1"]].sum(axis=1) do?
What are pd.DataFrame objects?
Week 02 formally introduced list, dict, np.array and str "object" types (as opposed to "data" types); but, you actually encountered str, list, and dict (dictionary) python object types in Week 01 (perhaps without particularly noticing) in the Missingness I, boolean values and coercion, and Pandas column data types sections of the course wiki-textbook where they were used to define pandas DataFrame objects.
import pandas as pd

# Python `dict` types can be defined with curly brackets "{" and "}"
data = {
    'age': [25, 32, 47, 51], # 'age' is a `str` "string" type; `[25, 32, 47, 51]` is a `list`
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'income': [50000, 60000, 70000, 80000],
    'has_pet': ['yes', 'no', 'no', 'yes']
}
df = pd.DataFrame(data)
So a pandas DataFrame object
is fundamentally a dictionary, with column names corresponding to the "keys" of the dictionary and the values in the rows of the column corresponding to the "values" in the dictionary which are lists of data (all having the same length).
Technically, pandas first transforms each list (or tuple) into an np.array and then further transforms this into a pd.Series, finally making the pandas DataFrame object a collection of columns of pd.Series objects which are accessed in the manner of a dictionary.
The fundamental dictionary nature of a pandas DataFrame object
is reflected in the way columns are referenced when working in pandas
, as seen in the Types I and Missingness II sections in Week 01:
df['age'] # returns the 'age' column
del df['name'] # removes the 'name' column from the df object
# both of which function analogously to how `dict` objects are managed
# And, unsurprisingly, data is added to a `pd.DataFrame` object
# in just the same analogous manner as for a `dict` object
df['city'] = ['New York', 'Los Angeles', 'Chicago', 'Houston']
# just like how the data would be added to the original dictionary object
data['city'] = ['New York', 'Los Angeles', 'Chicago', 'Houston']
Lecture: New Topics
scipy.stats
Probability is the mathematical framework that allows us to model chance (or uncertainty). In many real-world situations, we deal with events or outcomes that are not certain, and probability helps us quantify the likelihood of these events. Python, with libraries like numpy
and scipy
, provides powerful tools for handling probability distributions, statistical methods, and random events. The stats
module within the scipy
library (i.e., scipy.stats
) provides a wide range of statistical functions and probability distributions (such as the normal distribution, binomial distribution, and many others, some of which we will introduce later). These tools allow us to model different types of random events and calculate relevant probabilities of interest. To get started, we’ll import the stats
submodule from scipy
as follows, but you may sometimes see this functionality imported with alternative aliasing, such as import scipy.stats as ss
, etc.
from scipy import stats
Now that we have stats
, let's consider our first probability distribution. We'll start with the multinomial distribution. This models the probabilities of selecting n
things from k
options (potentially choosing each option more than once if n
is greater than k
). The simplest version of this would be if n=1
, then we'd just be choosing one of the k
options. An example of a multinomial distribution would be rolling a six-sided die. If we just roll once, n=1
and k=6
and we'll see the face up side of the die (which will be one of the outcomes 1 through 6 if we're talking about a normal die). If you roll the die multiple times, or roll multiple identical dice (like in Yahtzee where you start by rolling 5 dice), then n
changes but k
does not. So in Yahtzee where you roll five dice, n=5
and k=6
.
So far you've probably been imagining a "fair die" or "fair dice", meaning that the chance of each of the k
outcomes (or sides of a die in our ongoing example) is equally likely. But the multinomial distribution allows for some flexibility here. It has another aspect that we've not yet considered which is the "chance" of each of the k
outcomes, and we usually refer to this as p
. The p
needs to be a "list" of k
probabilities which sum to one (which makes intuitive sense if you think about it a bit). So, in our die example, p
will be six fractions (or decimal numbers) between 0 and 1 which together sum to 1. Here's how you use scipy.stats
to model the two examples we've considered so far, followed by one more example where we're rolling a die that is not "fair".
from scipy import stats
# Suppose we're rolling a single die and the probabilities of each face are equal (1/6).
one_fair_die = stats.multinomial(n=1, p = [1/6] * 6) # ready to roll
# `[1/6] * 6` above is another interesting example of *operator overloading*
# `[1/6] * 6` turns out to produce `[1/6,1/6,1/6,1/6,1/6,1/6]`... if you can't guess why, ask a ChatBot!
# notice that `k` is implied by the length of `p`
one_fair_die.rvs(size=1) # `rvs` stands for "random variates", i.e., a "random variable sample"
# `size` is the number of times to do a "random variable sample" of whatever the `one_fair_die` thing is
# so `size=1` here then means "roll one die" just one time (since `n=1`)
one_fair_die.rvs(size=5) # but here it means "roll five dice"
# Consider the difference in the output between the following
# stats.multinomial(n=1, p = [1/6] * 6).rvs(size=5)
# stats.multinomial(n=5, p = [1/6] * 6).rvs(size=1)
# and see if you can articulate what exactly the similarity and difference is between these two things
# and what then the following is
# stats.multinomial(n=5, p = [1/6] * 6).rvs(size=5)
one_UNfair_die = stats.multinomial(n=1, p=[0.05, 0.1, 0.15, 0.2, 0.25, 0.25]) # ready to roll
# We'll have to make sure we know which die face outcome corresponds to each probability...
one_UNfair_die.rvs(size=5) # roll the unfair die five times
# Or is it roll five "identically unfair dice"? Well... it's the same thing!
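One way to explore how n and size shape the output is to look at the dimensions of what `.rvs()` returns. In the sketch below, `random_state=130` is an arbitrary seed added only so the result is reproducible, and the names `fair_dice` and `samples` are illustrative.

```python
from scipy import stats

fair_dice = stats.multinomial(n=5, p=[1/6] * 6)
samples = fair_dice.rvs(size=10, random_state=130)

print(samples.shape)        # (10, 6): ten repetitions, six faces
print(samples.sum(axis=1))  # every row sums to n=5
print(samples[0])           # counts of faces 1 through 6 in the first repetition
```

Each row is one repetition, and each entry counts how many of the n selections landed on that option.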
If you've understood the sort of strange interchangeable similarity between n and size above, well done!
While they may seem redundant, they let us specify things like stats.multinomial(n=5, p = [1/6] * 6).rvs(size=10) which can be interpreted as "roll 5 dice ten times" (as in a standard Yahtzee game). This shows us that we can conceptualize the event of "picking n things from k choices" as something that can be hypothetically repeated over and over. That said, there's actually another, perhaps simpler and clearer way to create random samples from a multinomial distribution in Python. Consider the output of the code below and see if it makes sense to you. Then compare the nature of the output below to the nature of the output of the code above. Are you able to figure out how the output below is related to output from stats.multinomial(n,p).rvs(size) for different choices of n and size? If you can't quite tell, ask a ChatBot!
import numpy as np
# Roll a six-sided die 10 times
rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=10, p=[1/6] * 6)
print(rolls)
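As a hint toward that comparison, the individual faces produced by np.random.choice can be tallied into counts per face. This is only a sketch: the seed value is arbitrary, and `np.bincount` is one of several ways to do the tallying.

```python
import numpy as np

np.random.seed(130)   # arbitrary seed so the sketch is reproducible

rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=10, p=[1/6] * 6)
counts = np.bincount(rolls, minlength=7)[1:]   # how many times each face 1..6 appeared

print(rolls)    # ten individual faces
print(counts)   # six counts that sum to 10
```

The `counts` array has the same form as a single row of output from stats.multinomial(n=10, p=[1/6] * 6).rvs(size=1).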
Conditional Probability and Independence
The last things we want to consider here are the notions of conditional probability and independence. Let's start with conditional probability, which takes the notational form $\Pr(A|B)$. Here's a question: is there such a thing as a "hot streak" when rolling dice? Say you're trying to roll "sixes" on a die, and you've rolled three in a row already(!), do you think you're more likely than usual, or less likely than usual to get another "six" on your next roll? Let's ask this question in notation of conditional probability. Are these two equal?
$\Pr(\textrm{rolling a six}) = \Pr(\textrm{rolling a six} | \textrm{the last three rolls were sixes})?$
That is, does the next die roll depend on the previous die rolls, or is it independent of them? What we're asking here is if there's a relevant conditional probability or if the events being considered are independent (so there's really just a single probability and the idea of a conditional probability is not really necessary). So there is either independence between two events, or (as a sort of "opposite") there will be a meaningful conditional probability that changes the probability of one event based on the outcome of the other.
If you think there's no such thing as a "hot streak" and your next roll does not depend on your previous rolls then you're saying the equality above is true, which means you're saying the next die roll is independent of the previous die rolls and there's really no notion of a conditional probability (because it's just a probability). This is true, so long as you're really rolling a "fair die" randomly. So when a conditional probability statement can simplify, like if $\Pr(A|B) = \Pr(A)$ meaning that knowing $B$ does not change the probability of $A$ occurring, then this is when we say that $A$ and $B$ are independent. In the multinomial distribution, the n selections of the k different possible options are assumed to be independent. So, the chances that we'll choose each of the k different possible options could be different (depending on their relative probabilities given by p), but each time we choose an outcome (one of the n selections we make), this does not depend on which options we've previously chosen (if we're imagining choosing our n selections sequentially).
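The "hot streak" question can also be explored by simulation. The sketch below uses Python's built-in random module with an arbitrary seed and an arbitrary number of rolls (both are assumptions made for illustration) to estimate the proportion of sixes overall and the proportion of sixes on rolls that immediately follow three sixes in a row.

```python
import random

rng = random.Random(130)   # arbitrary seed for reproducibility
n_rolls = 400_000
rolls = []
for _ in range(n_rolls):
    rolls.append(rng.randint(1, 6))

sixes = 0
streaks = 0                # rolls that come right after three sixes in a row
sixes_after_streak = 0
for i in range(n_rolls):
    if rolls[i] == 6:
        sixes += 1
    if i >= 3 and rolls[i-3] == 6 and rolls[i-2] == 6 and rolls[i-1] == 6:
        streaks += 1
        if rolls[i] == 6:
            sixes_after_streak += 1

p_six = sixes / n_rolls
p_six_given_streak = sixes_after_streak / streaks
print(p_six, p_six_given_streak)   # both should land near 1/6
```

Because three sixes in a row are rare, the second estimate is based on far fewer rolls and so will be noisier than the first.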
This doesn't mean that we couldn't sequentially change our value of p in some sort of sequentially dynamic process that uses different multinomial distributions over time. But, it does mean that for n selections from k options drawn from a multinomial distribution with a fixed unchanging p, the n selections are independent and do not change in response to each other or affect each other in any way. And it's actually also interesting to consider again here the stats.multinomial(n,p).rvs(size) specification. The independence of the multinomial distribution means that the n choices for the k options related to stats.multinomial(n,p) do not depend on each other. But, owing to the definition of "random variable sample", the .rvs(size) notion of repeating the "n choices for the k options" game size times is itself also based on independence. This means that the outcomes of different repetitions of the "n choices for the k options" games also do not affect each other.
But there might be other examples where this is not true? Can you think of any? How about an example of drawing cards from a deck? Does the probability of drawing an Ace change if you've previously drawn and removed cards from the deck?
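For the card example, the relevant probabilities can be written down exactly. This sketch assumes a standard 52-card deck with 4 Aces and uses Python's built-in fractions module; the variable names are illustrative.

```python
from fractions import Fraction

p_ace_first = Fraction(4, 52)                 # 4 Aces in a full 52-card deck
p_ace_second_given_ace = Fraction(3, 51)      # one Ace already drawn and removed
p_ace_second_given_not_ace = Fraction(4, 51)  # a non-Ace already drawn and removed

print(p_ace_first)                  # 1/13
print(p_ace_second_given_ace)       # 1/17
print(p_ace_second_given_not_ace)   # 4/51
```

Since the conditional probabilities differ from each other, the second draw is not independent of the first when cards are removed from the deck.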