Here's a probability question I've been wondering about. Suppose there's a company that has a million customers. It is known that 55% of these customers are male and 45% are female. The task is to guess the sex of the next 100 (existing) customers who are going to visit the company. For every right guess a point is awarded. What's the best strategy to get the most correct answers? If we consider the customers one by one, it is a good plan to always guess the most probable answer and therefore guess that all 100 of the customers are male. However, if we take the hundred people as a group, isn't this task analogous to the situation where one litre of seawater in a container has the same salinity as seawater in general? Therefore we could guess that there are 55 males and 45 females among the group of 100 customers. Certainly, if instead of 100 people we took the whole million customers as a group, then the 55%/45% split would be the true and correct answer. My question is this: what changes the way of thinking between individuals and groups? Which is the correct way to think about this problem?

What you say about the individual problems is right: if I get a point for each right answer, then each time someone comes to the site, the best strategy is to guess that it's a man. (At least this is right if knowing the sex of an individual customer doesn't help predict whether s/he will visit the site or not.) This is the best strategy because if each individual visit is like a random selection of a customer from the population, the chance is greater that the selected customer will be a man.
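
To put numbers on the two strategies, here is a minimal sketch. It assumes, as above, that each visit is like a random draw from the customer base, so any single guess of "male" is right with probability 0.55 and any single guess of "female" with probability 0.45. The function name `expected_correct` and its parameters are my own, chosen for illustration.

```python
from fractions import Fraction

# Share of male customers, taken from the question (55%).
P_MALE = Fraction(55, 100)


def expected_correct(male_guesses, total=100, p_male=P_MALE):
    """Expected number of right answers when `male_guesses` of the
    `total` guesses are "male" and the rest are "female".

    Assumes each visitor is male with probability `p_male`, whatever
    we happen to guess for that visitor.
    """
    female_guesses = total - male_guesses
    return male_guesses * p_male + female_guesses * (1 - p_male)


always_male = expected_correct(100)   # guess "male" every time
split_55_45 = expected_correct(55)    # guess 55 males and 45 females

print(f"always male: {float(always_male)} expected points")
print(f"55/45 split: {float(split_55_45)} expected points")
```

Under these assumptions, always guessing "male" earns 55 points on average, while the 55/45 split earns only 50.5. The split gets the group total roughly right, but points are awarded per individual guess, and each "female" guess is the less likely bet.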

The analogy with seawater is problematic. After all, if I pick one customer, that customer won't be 55% male and 45% female. The salinity of small samples of seawater closely approximates the salinity of the sea (unless we get down to really small samples of a few molecules, and then your principle breaks down). By contrast, the make-up of a small sample from a population may depart markedly from the make-up of the population.

What's interesting is that once our samples get to be of even a moderate size, things become more like your seawater analogy. If we take a random sample of 100 from the population of customers, then, putting it roughly, there's a 95% chance that we'll find between 45 and 65 men. In other words, our sample will have a margin of error of about 10 percentage points. If we take a sample of 1000, then the margin of error shrinks: it's more like 3 percentage points. As our sample gets even larger, the margin of error gets smaller and smaller (although it shrinks not in inverse proportion to the sample size, but in inverse proportion to its square root).
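
For readers who want to check these figures, here is a rough sketch. The margin of error uses the usual normal approximation, z * sqrt(p(1 - p)/n) with z = 1.96 for 95%, which is my assumption about what "margin of error" means here. The second function sums exact binomial probabilities. The names `margin_of_error` and `prob_between` are mine.

```python
import math


def margin_of_error(n, p=0.55, z=1.96):
    """Approximate 95% margin of error for a sample proportion,
    using the normal approximation z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


def prob_between(lo, hi, n=100, p=0.55):
    """Exact binomial probability of seeing between `lo` and `hi`
    men (inclusive) in a random sample of `n` customers."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(lo, hi + 1)
    )


for n in (100, 1000, 10000):
    print(f"n={n}: about +/- {margin_of_error(n):.1%}")

print(f"P(45 <= men <= 65 | n=100) = {prob_between(45, 65):.3f}")
```

With p = 0.55 the margins come out at roughly 9.8%, 3.1% and 1.0%. Each tenfold increase in sample size divides the margin by about 3.16, the square root of ten, rather than by ten.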

The seawater case looks different superficially, but if we think of salinity as a matter of the fraction of molecules in the sample that are salt molecules, then we have to remember that even a drop of seawater contains a fantastically large number of molecules. What we have with the seawater case is a sample so large (in terms of numbers of molecules) that the margin of error isn't worth mentioning.
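
To give a feel for the scale, here is a back-of-the-envelope sketch. Both inputs are illustrative assumptions of mine rather than measurements: a drop is taken to hold on the order of 10^21 molecules, and the "salt" fraction is set to a placeholder value of 0.02. The formula is the standard normal-approximation margin of error, and it treats the molecules as if they were independent random draws, which is itself a simplification.

```python
import math


def margin_of_error(n, p, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative assumptions, not measured values.
MOLECULES_IN_DROP = 1e21   # order of magnitude for a single drop
SALT_FRACTION = 0.02       # placeholder fraction of "salt molecules"

customers_moe = margin_of_error(100, 0.55)
drop_moe = margin_of_error(MOLECULES_IN_DROP, SALT_FRACTION)

print(f"100 customers: +/- {customers_moe:.3g}")
print(f"one drop:      +/- {drop_moe:.3g}")
```

On these assumptions the margin of error for the drop is many orders of magnitude smaller than for the 100 customers, which is why nobody bothers to quote it.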

And so on individuals and groups: in the kind of case you describe, the "group fact"—the 55/45 ratio—is just the summation of the individual facts. There are roughly 550,000 men and 450,000 women in the customer base. In the problem you describe, we are assuming that noting the sex of a customer who comes to the site is analogous to randomly sampling from a list of the customers and noting the sex of the person picked. From there, it's just a matter of using probability and arithmetic (although once the sample gets big enough and limits start taking hold, calculus-based math makes life a lot simpler!)

Read another response by Allen Stairs