Discussion of procedures with colleagues reveals much about assumptions and leads to
better-understood methodology and clearer experimental design. The following
material is drawn from such discussions in private or public forums. Some or all of the
proposals for improvements will be integrated into the project as time permits.
Date: Tue, 10 Nov 1998 18:36:16 -0500
From: Richard Broughton
It indeed will be a challenge to deal with the [Global Event (GE)] situation, but there is
an approach from a different angle that might be considered. Back in the 1970s, Brian
Millar and I were urging the use of data-splitting techniques in experimental situations
where one really didn't know what effects might be found. It was not a new technique, but
the increasing availability of lab computers to record data blind made it much easier to
employ. We were touting this as a way of combining pilot and confirmatory experiments in a
single study. One would collect the data automatically and blindly, and then use some
method to split it into a nominally "pilot" part and a
"confirmatory" part. One could scrounge through the pilot data as much as one
wanted, but any findings could then be predicted to be found in the confirmatory part. If
they didn't confirm, then one couldn't blame changed experimental conditions.
With [Global Events] it would seem that you are in a similar situation. You don't know
what might result in an effect, what kind of an effect, and over what time period. Well,
setting aside experimenter effect issues, if there is a "Gaia effect" on an EGG
here or there that looks promising one would expect it to be distributed over the entire
data sample. Suppose you simply doubled your data collection, but then only looked at the
odd bits. Presumably, if Gaia affected the odd ones, she affected the even ones as well.
So if you found a presumably [GE]-linked effect in the odd bits, you should be able to
confirm that in the even bits as well. No?
That is just a crude example. Real data splitting can get pretty complicated in terms of
the best ratio of pilot to confirmation data, and there are lots of ways one could decide
to divide up the REG data. It runs a high risk of never finding anything that you can
really talk about. In the few times I used it I never got anything to confirm, but that
was before the greater awareness of effect sizes and power. On the other hand, if you did
find [GE] effects in part of the data and confirmations in the other part, then folks
would have to take notice.
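[As an illustration of the odd/even splitting idea, a minimal Python sketch follows; the
per-trial Z scores and the simple Stouffer-Z test for a mean shift are illustrative
assumptions, not part of the correspondence.]

import numpy as np

def odd_even_split(trial_z):
    # "Odd bits" (1st, 3rd, 5th, ... trials) serve as the pilot data;
    # "even bits" (2nd, 4th, ...) are held back for confirmation.
    trial_z = np.asarray(trial_z)
    return trial_z[0::2], trial_z[1::2]

def stouffer_z(z_scores):
    # Combined Z for an overall mean shift in a set of trial Z scores.
    return z_scores.sum() / np.sqrt(len(z_scores))

# One may scrounge through the pilot half freely; any effect found there
# is then tested, once, against the untouched confirmatory half.
pilot, confirm = odd_even_split(np.random.default_rng(0).normal(size=1000))
print(stouffer_z(pilot), stouffer_z(confirm))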
--- Response from Roger Nelson:
Yes. We can do this as an adjunct to the simple predictions based on intuition and media
intensity.
--- Date: Wed, 25 Nov 1998 11:28:51 -0500
From: Richard Broughton
An article on data splitting by Richard R. Picard and Kenneth N. Berk appeared in _The
American Statistician_, May 1990, Vol 44, No 2, pp 140-47. The abstract reads:
"Data splitting is the act of partitioning available data into two portions, usually
for cross-validatory purposes. One portion of the data is used to develop a predictive
model and the other to evaluate the model's performance. This article reviews data
splitting in the context of regression. Guidelines for splitting are described, and the
merits of predictive assessments derived from data splitting relative to those derived
from alternative approaches are discussed."
------- On Tue, 24 Nov 1998, James Spottiswoode wrote:
The Alternative
---------------
A way out of all these problems is as follows. Given the following hypothesis, which is, I
think, a corollary of the GCP hypothesis:
GCP - H2
--------
When an effective (in the sense of H1) event occurs, all the RNG's in the GCP will have
their means and/or variances modified from MCE. Given H2, we can test the N*(N - 1)/2
independent correlations between the N RNG's, where we take as the summary variable some
measure of each RNG's deviation from its mean, perhaps Z^2 or the variance, to be determined by trial.
If we found, say, that there were time periods during which the intercorrelations were
improbably large, we could look to see what events were occurring at those times which
might, by H1 and H2, be responsible. Having then run the experiment for a pilot phase
during which examples of "effective events" were gathered, one could then arrive
at a set of definitions which covered them. That set of definitions could then be used in
a prospective study to see if the associated deviations, and hence correlations, were
replicated with the same event definitions being used to select a new set of
"effective events."
The advantage of this method of working is that it sidesteps all the problems of whether a
flood in Bangladesh is more effective at altering RNG's behavior than a football game
watched by millions on TV. You get the data to tell you.
This is certainly data snooping in the pilot phase. But all is fine if the snoop leads to
crisp definitions which are faithfully applied to filter events in the prospective part of
the experiment.
I hope this is at least clear, even if wrong or silly.
Thanks. Much more clear, and it is useful, although I think the problems of identifying
the qualities of effective events will remain. None of the methods 1, 2, or this
alternate, 3, is likely to work out of the box, and all of them should improve with
repetition. Richard Broughton has suggested a closely related procedure, modeled on the
split-half reliability computations often used in psychology. I want to implement a
version of that, as a complement to the simple analysis.
With some serious thought and planning, your plan might be practical, but I have 'til now
rejected the general approach because I think it will be too easy to find corresponding
events that are in fact irrelevant chance matches if one starts with a search in the data
for big inter-correlations. Our pattern matching and wishful thinking propensities are
deadly liabilities in this situation. But let's try it -- should be a good learning tool,
one way or another.
Roger
On Wed, 25 Nov 1998, James Spottiswoode wrote:
[James responding to Ephraim Schechter] I'm not sure James' approach really is more
objective than Roger's. As Roger notes, it may require just as much subjective decision
making to identify events that seem associated with the RNG deviations. AND it requires a
huge database of RNG outputs to identify the key time periods in the first place. Yes, RNG
outputs may be less subject to "pattern matching and wishful thinking
propensities" than world events are, but I'm not sure whether I'm more comfortable
with using them as the dependent variable (Roger's approach) or as the starting place
(James' approach). I don't see either as clearly the right way to go. Relative risks:
Roger's approach risks picking wrong hypotheses about events in the first place, and
therefore getting demonstrations but not the kind of results that would really test those
hypotheses. James' approach might stand a better chance of zeroing in on the correct
hypothesis in the long run, but only after immense time/effort to gather that huge
database of RNG behaviors and world and experimenter events. If it's done with less data,
it may be no better than Roger's approach at identifying hypotheses worth following up.
I agree with you about the shortcomings of my approach. However, it has one distinct
advantage which you haven't mentioned, and which is why it was brought up in the first place.
With NO assumptions about the nature of "Effective Events (EE)" it tests the GCP
hypothesis that (all) RNG's get zapped by EE's. Simply testing as to whether there are
more extra-chance inter-correlations than there should be will tell you whether that
hypothesis is false. That is its real power. Finding the nature of EE's, if they are found
to exist, seems to be a secondary task for which the method would be rather inefficient.
James
Date: Sun, 29 Nov 1998 07:48:47 -0800
From: James Spottiswoode
Subject: Noosphere project - back to beginning
Folks, I see that my original posting on this topic contained the following question:
I am afraid that I take a very restricted, Popperian, view of science. There are many
reasons for so doing: it has been extraordinarily successful at understanding the world,
it embodies realism (suitably modified for QM), and it provides a clear, and practically
applicable, principle of demarcation between scientific theories and pseudoscience. A
corollary of this demarcation principle is that "the more a theory forbids the more
it tells us.(1)" Therefore, my next question is this. While I cannot understand the
theory being tested by the GCP project, perhaps I can have explained to me, simply, _what
does this theory forbid_? What should NOT be observed in the RNG data outputs?
I think I have now provided the answer to this myself:
H1: Certain kinds of events involving human mental states changing (currently their exact
definition is unknown) cause all RNG's to deviate from MCE.
(Comment: H1 seems to be the most general version of the GCP hypotheses).
Test for H1: Select a summary statistic for the RNG outputs, e.g. the variance over a
chosen period. Compute this variance as a function of time for each of the N RNG's as
var1(t), var2(t), ..., varN(t) for the recording interval. Then by H1 the N(N - 1)/2 Pearson
correlations between all pairings of RNG's are rIJ = r(varI(t), varJ(t)) (I = 1,...,N,
J = 1,...,I-1), where r(X,Y) is the Pearson correlation function. Then under the null (~H1)
hypothesis the rIJ's should have chance expectation statistics. (1) Conversely, under H1,
the rIJ's should have excessive positive excursions and these excursions should be seen
simultaneously in all the rIJ's. (2)
Note 1: Some think that H1 should be restricted to RNG's in a certain locale, say the U.S.
or a city. With this restriction the same test can be applied.
Note 2: There are two unspecified "parameters" in H1, the summary statistic to
be used (variance in the above example) and the time interval over which to compute that
statistic. Each choice of these gives rise to an individually testable version of H1.
(1) is what the GCP hypothesis forbids.
(2) is what it predicts.
Does everyone agree with this formulation? If not, why not?
James
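[A minimal Python sketch of the correlation test James proposes; the choice of variance over
10-minute blocks as the summary statistic and the data layout (one row of samples per RNG)
are illustrative assumptions.]

import numpy as np
from itertools import combinations

def summary_series(samples, window):
    # samples: array of shape (N, n_samples), one row per RNG.
    # Returns varI(t), the variance of each RNG over consecutive windows.
    n_rngs, n_samples = samples.shape
    n_win = n_samples // window
    blocks = samples[:, :n_win * window].reshape(n_rngs, n_win, window)
    return blocks.var(axis=2)

def pairwise_correlations(var_t):
    # The N(N - 1)/2 Pearson correlations rIJ between the summary series
    # of all RNG pairs; under the null (~H1) they should scatter about zero,
    # while under H1 they should show an excess of positive values.
    n_rngs = var_t.shape[0]
    return {(i, j): np.corrcoef(var_t[i], var_t[j])[0, 1]
            for i, j in combinations(range(n_rngs), 2)}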
------
Roger replied yes, and repeated his commitment to implement this correlation test as a
complementary analysis model, with James' help in its development. Subsequent commentary
by York Dobyns makes clear that this complementary model is not yet correctly formulated.
------
Date: Mon, 30 Nov 1998 20:03:35 -0500 (EST)
From: York H. Dobyns
I have followed the subsequent discussion on this thread, but have returned to James'
initial proposal for clarity.
On Sun, 29 Nov 1998, James Spottiswoode wrote:
[...]
H1: Certain kinds of events involving human mental states changing (currently their exact
definition is unknown) cause all RNG's to deviate from MCE. (Comment: H1 seems to be the
most general version of the GCP hypotheses).
So far, so good. One caveat required is that it is the (unknown) underlying distribution
of the RNG outputs that is presumed to have changed; there is no guarantee that any
specific instance (sample) from that distribution will be detectably different from null
conditions.
Test for H1: Select a summary statistic for the RNG outputs, e.g. the variance over a
chosen period. Compute this variance as a function of time for each of the N RNG's as
var1(t), var2(t), ..., varN(t) for the recording interval. Then by H1 the N(N - 1)/2 Pearson
correlations between all pairings of RNG's are:
rIJ = r(varI(t), varJ(t)) (I = 1,...,N, J = 1,...,I-1)
where r(X,Y) is the Pearson correlation function.
Then under the null (~H1) hypothesis the rIJ's should have chance expectation statistics.
(1)
Conversely, under H1, the rIJ's should have excessive positive excursions and these
excursions should be seen simultaneously in all the rIJ's. (2)
This is where I have problems. Why on earth is the extra step of the Pearson correlation
useful or desirable? The hypothesis as formulated for fieldREG hitherto has been simply to
compute a test statistic (not the variance, incidentally), and look for changes in its
value. An effect that changes the average value of the test statistic in active segments
need not have any effect on the correlation function; indeed, it will not, unless some
further structure is presumed.
As far as I can tell the Pearson correlation function is simply the quantity I learned to
call "the" correlation coefficient,
r_{xy} = (<xy> - <x><y>) / (s_x s_y),
where I am using <> to denote expectation values and s to denote the standard deviation. If
I am calculating rIJ between RNG I and RNG J, and the effect has manifested as a simple
increase in the mean value of the test statistic I am using, for either or both RNGs,
there is no change whatever in the value of r.
Even though there have been some disagreements about the appropriate test statistic to use
in fieldREG applications, the above problem applies to *any* test statistic. Suppose that
we have datasets x and y, and calculate their correlation coefficient as above. Suppose we
now look at x' = x+a and y'=y+b, a model for modified data where the values have been
changed by some uniform increment. We all know this has no effect on the standard
deviations. The numerator of r_{x'y'} becomes
<(x+a)(y+b)> - <x+a><y+b> = <xy> + a<y> + b<x> + ab - (<x> + a)(<y> + b)
= <xy> + a<y> + b<x> + ab - <x><y> - a<y> - b<x> - ab
= <xy> - <x><y>,
exactly as before, so the whole formula ends up unchanged. If the active condition in
fieldREG adds an increment to the test statistic -- whatever one uses for a test
statistic, whatever the value of the increment -- the correlation test will not show a
thing.
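[York's algebra is easily checked numerically; a minimal Python sketch with arbitrary
illustrative data and increments:]

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

a, b = 3.0, 5.0                                 # uniform increments
r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x + a, y + b)[0, 1]

# Adding constants shifts the means but leaves the correlation unchanged.
assert np.isclose(r_before, r_after)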
[...] Does everyone agree with this formulation? If not, why not?
I do not agree with this formulation, and have outlined my reasons above.
In a subsequent posting, James wrote:
But the _correlation_ between the summary stats of each RNG will be positive - that is the
point and why this is such a general test for _anything_ being modified in the RNG's
behaviour. All that is required is for the modification to be consistent across the RNG's
and simultaneous - which seems to be the essence of the GCP hypothesis.
Since there is a clear class of models in which this test *fails* to detect the presence
of a real effect, I do not see why James calls it "general". Since that class of
models includes those tests currently and successfully in use in single-source and
impromptu global field REG experiments, I am at a loss to understand the desirability of
this non-test as the proper approach for GCP.
Date: Tue, 1 Dec 1998 12:12:58 -0500 (EST)
From: York H. Dobyns
On Mon, 30 Nov 1998, James Spottiswoode wrote: York - perhaps we have misunderstood each
other here. Much of the early part of this thread revolved around the problems associated
with defining the events deemed by the FieldREG-ers to be effective. So I wrote H1 with
deliberate vagueness as to what these events might be. The point of my test is that it
should work _with no specification of the nature or timings of "effective
events"_ at all. Why? If, ex hypothese, all the RNG's deviate in sync when an
"effective event" occurs, then the correlations between suitable summary
statistics of their outputs will demonstrate this.
I was about to object vehemently to this form of the hypothesis, because it does not
correspond at all to the fieldREG protocol as exercised; but I then remembered that we are
talking in terms of a generalized and as-yet unspecified test statistic. Unfortunately
there remains a strong objection to this approach, but it's a different objection.
To deal with the straw-man first, the extant fieldREG protocol requires the identification
of "active" data segments by the experimenter, with all the subjectivity that
implies. The test statistic that PEAR has used is simply the Z score of the active
segments, and we test for the presence of (undirected) mean shifts by summing all the Z^2
figures to arrive at a chi-squared for the overall database. So our "hypothesis"
is *not*, as implied above, that all the sources deviate "in sync" during an
active segment, but rather that their output distribution develops a larger variance
(where "output" is taken as the mean of, typically, 10-40 minutes of data).
The reason that this is a straw-man is that, even though your use of the phrase
"deviate" above sounds like you're talking mean shifts, the phrase could be
applied to any test statistic. The squared Z of, say, 10-minute blocks should
"deviate in sync" more or less as you describe, even though it is less
efficient. (The test we're using for fieldREG is provably optimal *if* one assumes that
the experimenters can reliably identify active segments.)
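[A minimal Python sketch of the Z^2 chi-squared test York describes; the variable name
active_z and the use of scipy for the tail probability are illustrative assumptions.]

import numpy as np
from scipy.stats import chi2

def fieldreg_chisq(active_z):
    # active_z: Z scores of the experimenter-identified active segments.
    # Summing the squared Z's gives a chi-squared with one degree of freedom
    # per segment, testing for undirected mean shifts in the active data.
    active_z = np.asarray(active_z)
    chisq = float(np.sum(active_z ** 2))
    df = len(active_z)
    return chisq, df, chi2.sf(chisq, df)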
But the real objection is that we *cannot* do this test in a meaningful fashion without
having some identification of "special" time intervals in which an effect is
expected. Lacking some external condition to predict that deviations (on whatever test
statistic you care to use) are expected at some times and not at others, you can collect
and correlate data for as long as you please, and all you are proving is that your RNGs
are bogus.
This was the concern that (originally) caused me to be a bitter opponent of the GCP,
especially after Dick's talk in Schliersee. Unless the experimental design includes a
clear criterion for identifying active data periods that are distinct from a control
condition, there is no way of distinguishing the GCP or EGG protocol from a gigantic
multi-source calibration. There is no way whatever to distinguish the effect one is
looking for from a design flaw in the RNG sources, no matter how cleverly one designs a
test statistic. The inter-source correlation test merely defers the problem by forcing one
to consider the ensemble of all the RNGs as the "source" under consideration;
the answer comes out the same.
It was only Roger's assurance that active experimental periods would be identified and
compared with "control" sequences that finally persuaded me that EGG was worth
doing. Any effort to get away from that identification process in the name of
"objectivity" or "precision" simply means you're doing a
non-experiment. We are stuck with that identification process, subjective and flawed
though it may be, because without it we can never demonstrate an effect; we can only
demonstrate that our RNGs are broken.
Date: Fri, 4 Dec 1998 15:04:51 -0500 (EST)
From: York H. Dobyns
On Wed, 2 Dec 1998, James Spottiswoode wrote:
[...I wrote:]
To deal with the straw-man first, the extant fieldREG protocol requires the identification
of "active" data segments by the experimenter, with all the subjectivity that
implies. The test statistic that PEAR has used is simply the Z score of the active
segments, and we test for the presence of (undirected) mean shifts by summing all the Z^2
figures to arrive at a chi-squared for the overall database. So our "hypothesis"
is *not*, as implied above, that all the sources deviate "in sync" during an
active segment, but rather that their output distribution develops a larger variance
(where "output" is taken as the mean of, typically, 10-40 minutes of data).
[James responds]
I only used vague terms like "deviate" because I had understood from reading
Roger's posting that various stats, mean shift, variance and others, were under
consideration to use. Your portrayal of the project is quite different: your statistic has
been fixed a priori. Jolly good, one less degree of freedom to fudge.
Let us be careful here. My description was of the *fieldREG* protocol, of which GCP is a
generalization. While I believe Roger would be content to proceed with a field-REG type
analysis on GCP (he will have to speak for himself on this one), other GCP participants
have their own ideas about the optimal test statistic. I don't see this as any big deal;
at worst it means a Bonferroni correction for the number of analysts. It is my
understanding, however, that all of the proposed analyses share in common the concept of
distinguishing between active and non-active segments.
But there is something mysterious, to me, about the above para. You say that your
"hypothesis" is not that the RNG's vary "in sync" but that there Z^2,
summed over a time window that you also roughly specify, should increase. You then proceed
to argue that a correlation test is useless and that unless "effective events"
can be pre-specified the GCP could not distinguish between a real effect and faulty RNG's.
This argument I do not follow. What is wrong with the following:
H1b: Certain kinds of events involving human mental states changing (currently their exact
definition is unknown) cause the mean Z^2 (averaged across a 10 minute block for instance)
of all RNG's to increase.
With the caveat that what I was talking about was a Z^2 *for the entire block*, not some
kind of average within blocks, this is an accurate statement of the concept. The problem
is with the subsequent reasoning, see below.
You seem to agree, York, that:
The squared Z of, say, 10-minute blocks should "deviate in sync" more or less as
you describe, even though it is less efficient.
Yes.
Therefore, we presumably agree that the correlation between two RNG's Z^2's has an MCE of
zero and an ex hypothesi value > 0 during an "effective event."
*Here* is the problem. The hypothesis says that the Z^2 test statistic (which has an
expectation of 1 under the null hypothesis) should increase during "effective
events" and remain at its theoretical value at other times. This does *not* produce a
positive correlation during "effective events"; the expected correlation
coefficient within effective periods remains zero (a change in the expectation value does
not change the correlation). A positive correlation between sources is expected *only* if
one looks at intervals that include both "effective" and "inactive"
periods: then, and only then, are the correlated changes in the test statistic visible.
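[York's point can be illustrated with a small simulation; the effect model (a uniform
increment to the expected Z^2 of both RNGs during the active period) and all parameter
values are illustrative assumptions, chosen large for visibility.]

import numpy as np

rng = np.random.default_rng(1)
n_blocks, shift = 1000, 2.0
active = np.zeros(n_blocks, dtype=bool)
active[400:600] = True                          # one "effective event"

# Blockwise Z^2 for two independent RNGs, each elevated by the same
# increment during the active period.
z2_a = rng.chisquare(1, n_blocks) + shift * active
z2_b = rng.chisquare(1, n_blocks) + shift * active

r_within = np.corrcoef(z2_a[active], z2_b[active])[0, 1]   # expected ~ 0
r_mixed = np.corrcoef(z2_a, z2_b)[0, 1]                    # clearly positive
print(r_within, r_mixed)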
Equally, a set of N RNG's will give rise to N*(N-1)/2 inter-correlations between the Z^2
values which should all show a positive "blip" during an event.
Again, you *can't* get a positive blip on the correlation test: the positive blip is in
the raw value of the test statistic, and gets you only one data point in each source's
output (or N data points where N is (duration of event)/(duration of averaging block)).
*If* all the devices respond, and you run over both active and inactive periods, you see
that the correlation matrix for all sources is in fact net positive.
And even that is a big if. In the multiple-source fieldREG, even when a highly significant
result is attained, the significance is usually driven by large deviations in a small
minority of the sources. For a chi-squared distribution there is nothing wrong with this;
it is in fact the expected behavior given that a large chi-squared is attained. But this
means that any given session is unlikely to increase the test statistic in *all* of the
sources.
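[A small null simulation of York's remark about chi-squared behavior; the numbers of
sources and sessions are illustrative assumptions.]

import numpy as np

rng = np.random.default_rng(2)
n_sessions, n_sources = 20000, 40
z = rng.normal(size=(n_sessions, n_sources))
chisq = (z ** 2).sum(axis=1)

# Among sessions that happen to yield a large overall chi-squared, the three
# largest contributions carry a disproportionate share of it, so most sources
# need not have moved at all.
large = chisq > np.percentile(chisq, 95)
top3_share = np.sort(z[large] ** 2, axis=1)[:, -3:].sum(axis=1) / chisq[large]
print(top3_share.mean())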
[...]
You seem to have considered the above argument and declared, if I understood right, that
one could not distinguish between effective events and bust RNG's by the above method. But
why should a set of bust RNG's show a simultaneous increase in their variance (assuming
they're not all tied to the same dicky power supply, or similar SNAFU)?
Maybe they are sensitive to the same external parameter, e.g. geomagnetic activity. In
fact, the whole *premise* of GCP is that they are sensitive to a common external
parameter; what makes it a psi experiment is the assumption that said parameter is related
to human consciousness. I think it important to have an identification criterion that
justifies that conclusion rather than the presumption of some as-yet-unknown physical
confound.