#!/usr/bin/perl -w $VERSION= 0.31; use strict; # a perl directive to complain about undefined vars use optimize; # provides the simple goldensearch optimizer ################################################################ =pod =head1 NAME F --- 10 June 2001 =head1 SYNOPSIS F shows how to code the herding (imitation) estimation described in =over 4 =item Welch, Ivo. "Herding Among Security Analysts," I 58-3 (December 2000), p.369-396. =back Its most important econometric contribution was the development of an estimation method that is applicable to I data. Note that discrete choice is a necessary requirement for I. =head1 FILES The distribution archive contains five files (all in subdirectory F): =over 4 =item welch2000jfe.pl [Necessary] F is the main program and the reason why you should be interested in this archive. It implements the statistical techniques in the paper. You are now looking either at the program code itself (F), or the documentation (F) generated from it. =item optimize.pm [Necessary] F relies on an optimizer, such as the one provided by I. =item MSFT.dat [Useful] is a sample data set to demonstrate usage. =item docs.html [Icing] The documentation, which is generated from F via pod2html -noindex -title "welch2000jfe documentation" welch2000jfe.pl > docs.html You are now looking either at the program code itself (F), or the documentation (F) generated from it. =item webfrontend.html [Icing] The archive also contains F, which prints the form for passing information to the welch2000jfe.pl program. =back =head1 WEB INTERFACE There is now a web interface to this program at http://welch.econ.brown.edu/academics/welch2000jfe/webfrontend.html . This is basically a form, which simply calls F. Unfortunately, this also means that the program F must behave slightly different when it is run by a web server. It permits running without an input file name (it reads from STDIN), it produces less instructive debugging output (just the processed output), and it exits with code 0 if there is an error (so that the server does not announce a server error). =head1 DOWNLOADING The distribution can be downloaded from http://welch.econ.brown.edu/academics/welch2000jfe.tgz The individuals files are at http://welch.econ.brown.edu/academics/ It contains the code, an optimizer, some html documentation, and a sample input file. Then use an unpacking utility, such as pkzip (from http://www.pkware.com) or winzip (from http://www.winzip.com) Many browsers/computers already have this installed; if not, please download one using the above links. To just see some sample output, you do not need to install perl, because the F program's output is reproduced in this documentation. For the most part, the F code should be easy to read, even by someone who does not know perl. Still, the F program can immediately run any arbitrary data set (including your's) and tell you your herding parameter theta. (For small data sets, just use the provided web interface.) So, rather than reinventing the wheel, you may find that downloading the archive and installing perl may well be worth it. =head1 USAGE Invoking this program with argument B produces the output for the graph of figure1 in the paper. Invoking this program with argument B< "MSFT.dat" > uses the enclosed Microsoft dataset. Any such data set has to be in a particular format (fields are space, tab or comma separated): from to (any other information, such as yyyymmdd, identity, etc.) from to (any other information, such as yyyymmdd, identity, etc.) ... from to (any other information, such as yyyymmdd, identity, etc.) (For a format example, please see the microsoft dataset in the program itself). The program determines the range (domain) of feasible values by itself. However, it is about discrete herding, so it truncates each 'from' and/or 'to' into an integer. It also only makes sense if a consensus can be computed as the simple average of past "to's." Necessarily, this means that the domain must be ordered, and a difference between a '1' and a '2' should be similar to the difference between a '4' and '5', etc. The target in this program is assumed to be the prevailing average consensus (the average of all prior "to's"). Thus, to be able to compute a prevailing consensus, the estimation necessarily begins with a later observation number (usually # 11; this is a global parameter that can be changed at the top of the perl code). =head2 Sensible Use It only makes sense to run this if: =over 8 =item * the choices are a small number of discrete values (e.g., '1', '2', and '3'). =item * the choices are ordered, and one can compute a meaningful numerical consensus. =item * there are at least a couple of dozen observations. =item * there is a reason to believe that the prevailing consensus of earlier choices is hypothesized to be relevant. =item * the previous line (agent) is not primarily the same agent. otherwise, the program answers the question: how likely is an agent to imitate himself? =back Again, it may not make sense to ask if an agent is herding on himself. So, if you plan to use this program for an optimization for a paper that empirically tests for imitation, please do not forget that you should filter out multiple consecutive choices by the same agent F you run this program. In contrast to the JFE paper, the sample MSFT data ignores the fact that there may be consecutive recommendations by the same agent and even on the same day. The program simply assumes that each agent is different [even though the file tells us the identity and date]. This makes the code a lot easier here. The user is advised to filter undesirable observations before running the program! =head1 SAMPLE OUTPUT For the enclosed Microsoft MSFT.dat, the output is as follows: The NULL probability transition matrix is: to1 to2 to3 to4 to5 from1 0.4519 0.2212 0.3173 0.0000 0.0096 from2 0.2717 0.3913 0.3370 0.0000 0.0000 from3 0.3301 0.2913 0.3107 0.0485 0.0194 from4 0.1667 0.0000 0.6667 0.1667 0.0000 from5 0.0000 0.0000 0.6667 0.0000 0.3333 (This transition matrix is computed on the fly from the enclosed data and displayed to screen.) The program then shows The total log-probability for theta=0.5 given this data set is -335.129769269242 First Run: The optimal log-likelihood function is LL(theta=0.0278)= - 330.140 Second Run: The optimal log-likelihood function is LL(theta=0.0278)= - 330.140 =head1 Why PERL? Why This Code? PERL was chosen as a language because it is free, universally available for all common computers and operating systems (and these days comes pre-installed on most Unix distributions), concise, and still permits for readily understandable code. In fact, even this documentation is directly embedded in the perl code itself. =for html For more information, click on http://www.perl.org/; recommended books: Learning Perl by Schwartz and Christiansen and Perl Cookbook by Christiansen and Torkington.) =cut ################################################################ # Global Choice Variables ################################################################ our $FirstEstObs= 10; # we need to have a prevailing consensus for estimation our $MovingAverage= 0; # use past number of observations our $DecayParam= 1.0; # 1 means no decay. 0.90 means second-to-previous is decayed by 0.81. ################################################################ =head1 MAIN PROGRAM The main program begins here. It reads data of the form FROM TO (any other ignored information). FROM TO (any other ignored information). ... FROM TO (any other ignored information). then calls subroutines to do all the work. =cut #### some global variables used later. easier than passing them around all the time. our $MinChoice= 9999; # the lowest possible choice, a global variable, determined by the file our $MaxChoice= -9999; # the highest possible value, a global variable, determined by the file our $NoChoice= "NaN"; # the missing value our $DEBUGVERBOSE = 0; # used for debugging only!! ################ process the command line arguments, or see if this is a web run our $httpbackend= ((defined(%ENV)) && (defined($ENV{REQUEST_METHOD}))); my @INPUTLINES = readalldatainput(); ################ create the 'to' and 'from' vectors; my (@from, @to); # this is what we will be reading from the input file. for (my $lc=0; defined($INPUTLINES[$lc]); ++$lc) { # read every line from stdin or the file on the command line; if ((substr($INPUTLINES[$lc],0,1) eq "#")|| ($INPUTLINES[$lc] !~ /[0-9]/)) { next; } # ignore comment lines and empty lines my @fields= split(/[\,\s]/, $INPUTLINES[$lc] ); # a ',', space, or tab will do for field separation. my ($from, $to); # temporary storage $from= int($fields[0]+0.5); # make sure this is an integer; $to= int($fields[1]+0.5); # make sure this is an integer; if ($from =~ /[^0-9 ]/) { trapdie(" The input value '$from' on line $lc is not numeric. "); } if ($to =~ /[^0-9 ]/) { trapdie(" The input value '$to' on line $lc is not numeric. "); } if ($from > $MaxChoice) { $MaxChoice= $from; } if ($to > $MaxChoice) { $MaxChoice= $to; } if ($from < $MinChoice) { $MinChoice= $from; } if ($to < $MinChoice) { $MinChoice= $to; } push( @from, $from ); # and store it push( @to, $to ); } ($#from>=0) or trapdie("The input file could not be opened."); ($#from>=20) or trapdie("Sorry, we do not allow you to run this program with fewer than 20 observations."); ($#from >= $MaxChoice-$MinChoice) or trapdie("It makes no sense to compute choices from $MinChoice to $MaxChoice with only $#from observations"); print "[Working with $#from valid observations with range from '$MinChoice' to '$MaxChoice']\n"; ################ now compute the preliminaries: the prevailing consensi and transition matrices my @targets= computeprevailingconsensus( @to ); my @transitionmatrix = computetransitionmatrix ( [@from], [@to] ); for (my $c=0; $c<=$FirstEstObs; ++$c) { shift(@from); shift(@to); shift(@targets); } # delete the first couple of obs, for which no prevailing consensus exists printmatrix( [@transitionmatrix] ); # the unconditional probability transition matrix my @dataparameters= ( [@transitionmatrix], [@from], [@to], [@targets] ); # summarize all the stuff we shall need ################################################################ # the preliminaries are over: we have read the data and computed # the inputs. now let's produce some real output. ################################################################ # compute a single value of the objective function if (!$httpbackend) { { my $theta=0.5; print "\nThe total log-probability for theta=$theta given this data set is " . (-1*functionvalue($theta, [@dataparameters])). "\n"; } # find the maximum-likelihood estimate for theta, given the data { my ($f,$x)= optimize::goldensearch(-5.0, 0.5, +5.0, 1e-6, \&functionvalue, [@dataparameters] ); printf("First Run: The optimal log-likelihood function is LL(theta=%.4f)= - %.3f\n", $x, $f); ($f,$x)= optimize::goldensearch($x-0.2, $x, $x+0.2, 1e-8, \&functionvalue, [@dataparameters] ); printf("Second Run: The optimal log-likelihood function is LL(theta=%.4f)= - %.3f\n", $x, $f); } } else { my ($f,$x)= optimize::goldensearch(-5.0, 0.5, +5.0, 1e-6, \&functionvalue, [@dataparameters] ); printf("\nThe optimal log-likelihood function is LL(theta=%.4f)= - %.3f\n", $x, $f); } if ($httpbackend) { print "\nFor more information, refer to docs.html". "
\n© Ivo Welch, June 2001.
\n\n"; } exit 0; ################################################################ =pod =head1 SUBROUTINES =cut ################################################################ ################################################################ =pod =head2 figure1 I shows how we computed the figure on page 378 in the JFE. =cut sub figure1 { my @nullvector= ( "nan", 0.05, 0.20, 0.30, 0.05, 0.40 ); # as used in the paper my $target= 4.0; # as used in the paper print "\n****************************************************************\n". "\t Computation of Figure 1 in Welch(2000):\n". "****************************************************************\n\n"; # we are showing theta values of 0.5, 1.0, ..., 4.0 for (my $theta=0.0; $theta<=4.0; $theta+= 0.5) { my @altvector= &computealtvector(\@nullvector, $theta, $target); # see below print "The destination probability vector given Target=$target *if* Theta=$theta:\n\t"; for (my $i=$MinChoice; $i<=$MaxChoice; ++$i) { printf("%7.4f\t", $altvector[$i]); } print "\n"; } print "\n"; } ################################################################ =pod =head2 functionvalue I wraps the actual invokation of the likelihood function. The first argument is theta (possibly to be optimized), the second argument is a perl reference to the (immutable) data information provided by readdata. =cut sub functionvalue { my $theta= shift(@_); # all other arguments were passed through by the optimizer. # here we are only using the very next argument, which is itself # a reference to multiple data structures. my @probtransitionmatrix= @{$_[0]->[0]}; my @from= @{$_[0]->[1]}; my @to= @{$_[0]->[2]}; my @target= @{$_[0]->[3]}; my $totalprob=0.0; for (my $obs=$FirstEstObs; $obs<=$#to; ++$obs) { # skip the first $FirstEstObs observations: we need a reasonable prevailing consensus my @probvec= &computealtvector( $probtransitionmatrix[ $from[$obs] ], $theta, $target[$obs] ); my $probrealization= $probvec[ $to[$obs] ]; # we know where the analyst went, and what prob we had for this. $totalprob+=log($probrealization); } return -$totalprob; } ################################################################ =pod =head2 computealtvector I computes the probability under the alternative hypothesis, given a null probability vector, a theta, and a target. =cut sub computealtvector { my $nullvector= shift(); # first argument: the null prob vector my $theta= shift(); # second argument: a theta my $target= shift(); # third argument: a target # first compute D_i from eq 3 my $Di=0.0; for (my $j=$MinChoice; $j<=$MaxChoice; ++$j) { $Di= $Di + $nullvector->[$j]* ( (1.0+($j-$target)**2) ** (-$theta)) ; } # now compute p_i from eq 2 my @altvector= ( "nan" ); # we are counting from 1, while perl counts from 0; for (my $j=$MinChoice; $j<=$MaxChoice; ++$j) { $altvector[$j]= $nullvector->[$j]* ( (1.0+($j-$target)**2) **(-$theta) )/$Di; } return @altvector; } ################################################################ =pod =head2 computeprevailingconsensus accepts a vector of data, and computes prevailing consensi from past values up until the current observation. It returns a vector of equal dimensions to the input vector (N*1, where N is the number of observations). It would be very easy to modify this subroutine to use a decaying consensus or to use a moving average instead of a consensus computed from I past values. Note that this routine can be slow. It is only run once! =cut sub computeprevailingconsensus { my @consensus; sub computeavg { my $sum=0; my $sumn=0; my $decayed=1.0; my @list= reverse(@_); foreach my $v (@list) { if ($v ne $NoChoice) { $sumn+= $decayed; $sum+=$v * $decayed; $decayed*= $DecayParam; } } return ($sumn>=1) ? ($sum/$sumn) : $NoChoice; } if ($MovingAverage == 0) { $MovingAverage= 9999999999; } # all observations for (my $i=0; $i<=$#_; ++$i) { $consensus[$i]= ($i>=$FirstEstObs) ? computeavg( @_[$i-$MovingAverage..$i-1] ) : $NoChoice ; } return @consensus; } ################################################################ =pod =head2 computetransitionmatrix accepts a reference to the "from" vector and a reference to the "to" vector. Outputs a twodimensional array. Please note: $MinChoice, $MaxChoice, and $NoChoice are global variables that must have been set before! =cut sub computetransitionmatrix { my $from= shift(@_); # the first function parameter my $to = shift(@_); # the second function parameter my @probtransitionmatrix; # first clear the transition matrix for (my $row=$MinChoice; $row<= $MaxChoice; ++$row) { for (my $col=$MinChoice; $col<=$MaxChoice; ++$col) { $probtransitionmatrix[$row][$col]=0; } } # read in a sample data file and compute the unconditional # probability transition matrix and the prevailing consensus for (my $c=0; $c<@$from; ++$c) { ++ $probtransitionmatrix[ $from->[$c] ][ $to->[$c] ]; } # for each previously standing source recommendation, we need to normalize # the destination vector to be a probability vector (summing to 1.0); for (my $fromrow=$MinChoice; $fromrow<=$MaxChoice; ++$fromrow) { my $sum=0.0; for (my $col=$MinChoice; $col<=$MaxChoice; ++$col) { $sum+= $probtransitionmatrix[$fromrow][$col]; } for (my $col=$MinChoice; $col<=$MaxChoice; ++$col) { $probtransitionmatrix[$fromrow][$col]/= $sum; } } return @probtransitionmatrix; } ################################################################ =pod =head2 printmatrix I just prints a transition matrix from a reference. =cut sub printmatrix { # print a probability transition matrix print "\nThe NULL probability transition matrix is:\n\n "; for (my $col=$MinChoice; $col<=$MaxChoice; ++$col) { print " to$col "; } print "\n "; for (my $fromrow=$MinChoice; $fromrow<=$MaxChoice;++$fromrow) { print "from$fromrow "; for (my $col=$MinChoice; $col<=$MaxChoice; ++$col) { printf("%7.4f ", $_[0]->[$fromrow][$col]); } print "\n "; } print "\n"; if ($DEBUGVERBOSE) { print "VERBOSE DEBUGGING OUTPUT:\n"; for (my $c=0; $c<=$#from; ++$c) { print "$c\t$from[$c]\t->\t$to[$c]\tC=$targets[$c]\n"; } } } ################################################################ =pod =head2 readalldatainput is easier than it looks. It just needs to check that the command line arguments are appropriate, whether it is run in a web browser, and then returns all input files as a simple vector of strings (lines). If it is in web server mode, it also prints the HTML header. =cut sub readalldatainput { if (! $httpbackend) { if (!scalar(@ARGV)) { # there are no command line arguments trapdie("COMMAND LINE Usage:\tperl welch2000jfe.pl [DATAFILENAME | FIGURE1]\n". "\tFor documentation, please run \"pod2text welch2000jfe.pl\" or look at the program code documentation\n". "\t(Special case: To run with the sample Microsoft data, run \"perl welch2000jfe.pl MSFT\".)\n". "\t(Special case: To produce figure 1 from the paper, run \"perl welch2000jfe.pl FIGURE1\".)\n". "\t(The web interface requires no argument.)\n"); } if ($ARGV[0] eq "FIGURE1") { figure1(); exit 0; } # ok: let's escape everything @INPUTLINES= <>; # this reads the entire input file into the array @INPUTLINES } else { # unfortunately, web browsers pass data differently from forms my @httpform= split(/\=/, ); # read the single input line, which can be endless $httpform[1] =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg; # fix escaped http form characters @INPUTLINES= split(/\n/, $httpform[1]); # and put this into our array. print "Content-type: text/html\n\n welch2000jfe.pl web backend: Results

welch2000jfe.pl web backend: Results


	";
    

    ($ENV{REQUEST_METHOD} eq "POST") or
      trapdie("Sorry, but you need to resubmit your data fresh (or you are using a bad input form)!");
  }
  return @INPUTLINES;
}



################################################################
sub trapdie {
  if (!$httpbackend) { print "ERROR: $_[0]\n"; exit 1; }

  print "
\n"; print "

ERROR

\n"; print "$_[0]\n"; print "\n"; exit 0; } ################################################################ =pod =head1 WARNING F was written I the paper. Errors in either the paper or in this program have not necessarily propagated to the other. F is I the most efficient code for the problem at hand, because its intent is to demonstrate usage. Apologies for the repeated use of the phrase F in file namings. This does not (necessarily) indicate egocentrism as much as a desire to connect the filenames to the standard article citation name. =head1 COPYRIGHT (C) Ivo Welch, March 2001. Free Distribution Permitted, provided usage is cited and this notice remains in the text. The original paper (Welch, Ivo. "Herding Among Security Analysts," I 58-3 (December 2000), p.369-396.) is copyrighted by the Journal of Financial Economics (Elsevier). I am also very pleased to report that it won a JFE Fama prize. =head1 CHANGELOG 10 June 2001 complete overhaul. 13 July 2001 corrected a bug in computealtvector, which assumed that choices go from 1 through 5, rather than from $MinChoice to $MaxChoice. 15 July 2001 added a moving average and decaying. =cut