Things that I know
How having money changes how the stock market looks.

Possibly the most important aspect of technical trading is the concept of a trend.  A trend is most simply a series of prices that increase (or decrease) from start to finish.  At the end of the trend is a turning point at which the price starts a new trend.

There are many problems with this overly simplistic view.

The appearance of a trend depends on scale.  If you look at AMD over 5-years, you can see a 3-humped ‘M’ shape with some relatively insignificant noise along each leg.  

If we zoom into the second up-trend on the middle hump, we see something different.  Certainly, the price is moving up, but there is a distinct zig-zag to the price and no one date we can really call the turning point, though February 27-28, 2011 appears to be the maximum of this graph.  Couldn’t we make more money by breaking this single trend up into smaller trends?

The answer is not that simple because it depends on how much money you have to spend.

Determining the Trend

Normally, a trend is determined by applying a moving average to the closing price.  If the price is above the moving average, the trend is up.  If the price is below the moving average, the trend is down.  Unfortunately, this method is limiting, especially around large price movements.

Here is a simple method for identifying a trend:

    p = current_index - 1;
    while( p >= 0 && price[ p ] < price[ p+1 ] ) p--;

The trend is the maximum current_index that traces back to the same p.

Using this method, we quickly notice that there are small jitters in what we as humans would consider a trend.  Often these jitters are a single price not fitting the trend.

    p = current_index - 2;
    while( p >= 0 && ( price[ p ] < price[ p+1 ] || price[ p ] < price[ p+2 ] ) ) p--;

This method works better, but it can be generalized to take a time period and to detect downward trends as well.
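One way to generalize the two loops above is sketched below in Java.  This is only an illustration of the idea, not necessarily the algorithm used for the charts that follow.  The class name TrendStart, the window parameter (how many of the following prices a day is compared against) and the up flag are names made up for this example.

```java
final class TrendStart {
    /**
     * Walks backwards from currentIndex and returns the index at which the
     * trend containing currentIndex appears to begin.  A day still belongs to
     * the trend if at least one of the next `window` prices fits the trend
     * direction, so short runs of out-of-place prices are ignored.
     */
    static int find(double[] price, int currentIndex, int window, boolean up) {
        int p = currentIndex - window;
        while (p >= 0 && continues(price, p, window, up)) {
            p--;
        }
        // p is now the first day that does not fit, so the trend starts after it.
        return Math.max(p + 1, 0);
    }

    private static boolean continues(double[] price, int p, int window, boolean up) {
        for (int k = 1; k <= window; k++) {
            boolean fits = up ? price[p] < price[p + k] : price[p] > price[p + k];
            if (fits) {
                return true;
            }
        }
        return false;
    }
}
```

With a window of 1 this behaves like the first loop, and with a window of 2 it behaves like the second.  Passing false for up looks for a falling trend instead.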

Why Money Matters

I have highlighted the detected trends that would make money given an amount that can be spent per transaction.  I have used the last 5-years of AMD with an algorithm that ignores up to 10-day jitters as the example.

At $50 per transaction

There are only three possible trends, all of them up-trends, that allow a $50 transaction to make money.  This is assuming a $10 transaction fee for both buying and selling.  The trend had to gain at least 40% of the amount invested ($20 in fees on a $50 purchase) just to break even!
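The break-even arithmetic is easy to check.  The Java sketch below assumes a flat fee charged once on the buy and once on the sell, as described above; the class and method names are made up for this example.

```java
final class BreakEven {
    /**
     * Fraction the position must gain just to pay the fees, assuming the same
     * flat fee is charged once when buying and once when selling.
     */
    static double requiredGain(double amount, double feePerTrade) {
        return (2 * feePerTrade) / amount;
    }
}
```

BreakEven.requiredGain(50, 10) gives the 40% figure for a $50 transaction.  The same fees on $1000 require only a 2% gain, which is why more trends become profitable as the amount per transaction grows.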

At $100 available per transaction

There are 14 possible trends that will make money now, 5 down-trends and 9 up-trends (some of the green lines are very close together, but they are separate).  Notice that the middle hump we looked at before has no trends that would be profitable.

At $200 available per transaction

More of the chart is colored now.  Many of the obvious up-trends appear properly highlighted now, but there are still areas of the chart that are uncolored.  I need to do more work in assessing the risk of a transaction at any point on the graph.  We can consider these dangerous areas.

At $1000 per transaction the graph starts to look different

Notice the green interspersed along the down-trends and the red interspersed along the up-trends.  This means that the tiny jitters in price have become meaningful and potentially profitable.  The zig-zag non-movement of the price between Jan 2013 and May 2013 has potential for no fewer than 12 profitable transactions.  Notice there were only 14 possible profitable transactions in the entire 5-year run of AMD if only $100 were available per transaction.

Why is it “per transaction”?

It comes down to risk.  How much of your total money are you willing to lose on any given transaction?  Typically, this is 10%.  So, to have $1000 available per transaction, you would need to have $10000 in your account and you could have up to ten $1000 transactions active at any one time.

If you only have $2000 in your trading account, 10% risk only allows $200 per transaction.
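As a small sketch of that rule in Java (the 10% figure is the rule of thumb quoted above; the class and method names are made up for this example):

```java
final class PositionSize {
    /** Amount available per transaction when risking a fixed fraction of the account. */
    static double perTransaction(double accountBalance, double riskFraction) {
        return accountBalance * riskFraction;
    }
}
```

PositionSize.perTransaction(10000, 0.10) gives $1000 and PositionSize.perTransaction(2000, 0.10) gives $200, matching the two cases above.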


The difference between rich and poor is opportunity.  When they say “you need money to make money” it’s entirely and bitterly true.  Using the stock market as a retirement fund because it works for the rich is a cruel joke.  Unless you have enough to retire on, you can’t seriously participate in the stock market as an entity and make your own decisions.  Instead, you need to rely on other people’s decisions where they not only risk your retirement but the retirements of many others on a single transaction.

Unless you can reasonably make a profit off a transaction, the only reason to buy stock in a company is self-satisfaction.  Buying less than 100 shares just isn’t worth it.  The examples above use a stock valued between $2 and $10 over the last 5-years.  Consider what it would take to make a reasonable go at a stock priced around $30.

Rant Rebuttal

It’s easy to panic at numbers.  The graph of Microsoft looks about as colored in as the graph of AMD given $1000 per transaction.  Higher valued stocks do have larger price movements than lower valued stocks, so individual shares are effectively worth more on higher valued stocks.

MSFT at $1000 per transaction

What’s really important to take from all this is that trends are imaginary.  Trends may be real in the sense that the price does move up or down for extended periods of time, but that movement is still random.  Detection mechanisms need to be fudged to properly find a trend and even then, the trend can only be found after it has passed.  Detecting a trend that is happening and taking advantage of it is possible.  People do it all the time.

Parsing Stock Prices

For the last several months I have been trying to find useful features in stock price charts.  This is a surprisingly difficult task for a computer because most of the pattern recognition in technical analysis is intuitive.  You as the person buying and selling stock are making decisions that cannot be easily put into words, much less put into computer code.  I even went through a trading class where the teacher said something like “some people want to use 1% or 2%.  No.  Just put your stop price a little below the moving average.”  How much is a little?  What is the period of the moving average?  These things don’t matter.  It all comes down to experience and trusting your feelings.

A computer can’t do that so I can’t do that.

Let’s break down price movements a little so they’re easier to understand.  Maybe with some actual definitions we can make some useful determinations.

  • Up Trend: The price is generally increasing over time.  More to the point, the price is increasing enough to generate a profit if you bought near the low end of the movement and sold near the high end.  This is buying long.
  • Down Trend: The price is generally decreasing over time.  If you were to sell the stock short, that is, to sell stock you don’t own at the current price and buy the stock back (so you can return it to its owner) at a lower price, you can make just as much money off the deal as you would have buying long.
  • Sideways: An ugly truth about stock prices is that they don’t always go up or down.  A stock can trade in the same range for months at a time.  Below shows the last few months of AMD’s stock price.  Notice that the price stays between 3.90 and 4.20 from mid-May through early-July.

Technically, the price does move up and down during a sideways movement.  The problem is detecting that movement.

Moving Averages

Here is roughly the same time period with a 10-day exponential moving average.  Notice on the left how the stock price is completely above the moving average when the price is obviously going up.  Also notice that from May through July the price is basically on top of the moving average.  

Intuitively, we can see that if the price stays above a moving average for some period of time, it’s going up.  Correspondingly, if the price remains below the moving average, the price is going down.  This concept is simple enough even for a computer.  So what’s the difference between the price movement from April 29 through May 15 and the price movement from May 19 through July 10?
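For reference, an exponential moving average can be computed roughly as follows.  This is a minimal Java sketch, assuming the common smoothing factor 2 / (period + 1) and seeding the average with the first price; charting software may seed it differently.  The class name Ema is made up for this example.

```java
final class Ema {
    /** Returns the exponential moving average of price, one value per input price. */
    static double[] of(double[] price, int period) {
        double alpha = 2.0 / (period + 1);   // weight given to the newest price
        double[] ema = new double[price.length];
        if (price.length == 0) {
            return ema;
        }
        ema[0] = price[0];                   // assumption: seed with the first price
        for (int i = 1; i < price.length; i++) {
            ema[i] = alpha * price[i] + (1 - alpha) * ema[i - 1];
        }
        return ema;
    }
}
```

Comparing price[i] against Ema.of(price, 10)[i] on each day gives the above-or-below test described here.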

An alternative method uses two or more moving averages to determine when to buy and sell.  This method is the basis for the MACD indicator.

In this version, the lines should be ordered red-blue-green from bottom to top if the stock price is moving up and ordered green-blue-red if the stock price is moving down.  It should be noted that between February and May, using this method to determine when to buy or sell will always lose money.

When to Sell

When to sell is an easier problem to solve than when to buy.  In a strong up trend, selling when the price crosses the blue or red line will usually work.  If the price moves back above the green line and the trend is not sideways you can buy again.  Just remember that you will probably see fewer returns than the last purchase.

By selling when the price crosses a moving average, you gain money based on the slope of the moving average.  If the moving average has gone up, you make money.  If the moving average is flat, meaning the price is moving sideways, you don’t.

When to Buy

Short answer: I don’t know yet.

Buying a stock means making a prediction.  Is it reasonable to expect a stock price to rise (or fall) to a certain level?  The short answer is no.  Day-to-day price changes are almost totally random.  There is no algorithm that can predict day-to-day price changes.  The long answer is maybe.  Trends may contain a lot of noise but they do tend to move in a specific direction.  New prices tend to stay within a reasonable distance of the previous day’s price, except when they don’t.  

I see a lot of stock sites saying “trade the MACD.”  That sounds great until you implement a trading algorithm that uses the MACD indicator to get buy and sell signals.  When it works, the MACD produces very good, early buy signals and tends to pull out before major damage has occurred.  When it doesn’t work, the MACD fails spectacularly, repeatedly throwing buy signals when the stock price moves high and throwing sell signals as soon as the price moves below the purchase price.  These failures are called whipsaws.  A computer can’t tell the difference between whipsaws and real signals.

The problem is that MACD uses moving averages to determine a trend.  If the trend is sideways, the MACD, and any other indicator that uses moving averages in any form, fails.

Defining “sideways”

I’m working on a way to describe a sideways movement to a computer.  If I can recognize a sideways movement and label it, I should be able to use that to determine when to respect the MACD signals and when to ignore them.  That will go a long way toward making a robust trading system.

Linking SEC and Yahoo!

The Securities and Exchange Commission (SEC) has introduced a problem.  The stock exchange world uses a 1-5 character symbol to represent a company.  This “Ticker Symbol” was meant to reduce the number of characters needed to be printed on a ticker tape and yet be easily identifiable by a person.  The SEC does not include this symbol in its reports.  Instead, it uses a Central Index Key (CIK) to identify not only companies on the stock exchange but individual persons and other non-traded entities.  A list of unique companies from the SEC may contain many entries that a list of companies from Yahoo! does not and depending on whether the stock is traded in the United States, Yahoo! may contain companies that are not represented by the SEC.

Using the SEC’s Search Engine

The problem of mapping CIKs to Ticker Symbols has been the subject of academic research.  The principle is fairly straightforward.  The SEC includes a search facility as part of its website.  There are two main search engines:

I mentioned the first link in a previous post.  The second link was used by the research paper linked above.

The first link can be searched using Ticker Symbols.  The CIK can be found in a hidden input with the name “CIK”.  This method is effectively a lookup.  There is no potential source of error from the data.  By passing the stock symbols of each of the 27,443 companies found in Yahoo!, I was able to find the CIK for 5118.

The paper above uses a less precise method where the company name is normalized to make a better search term.  The CIK is selected from a list of possibilities.  This method has two possible sources of data error.  The company name could be incorrectly normalized so that the SEC’s search engine doesn’t find it.  The selection algorithm may select the CIK incorrectly from a list of possibilities.  I have no information about how many ticker symbols were matched to CIKs using this method.  The Java web application available on the paper’s website is not intuitive.


Using Obnoxious Amounts of Data

My next idea was to see whether I could identify some of the unmatched companies using street addresses.  Yahoo! makes a company’s profile available on its website.  This profile includes a street address and a phone number.  Every SEC report includes an SGML section tagged SEC-HEADER that includes this same information.  Since Yahoo! includes the ticker symbol and the SEC-HEADER tag includes the CIK, it seems reasonable to be able to match these addresses and build a symbol to CIK lookup table.

The key word is “seems” of course.  Company names must be normalized.  Street addresses must be normalized.  For example, the word “CORPORATION” is shortened to “CORP.”  The word “NORTH” is shortened to “N” in the address.  Special characters are removed.  The SEC uses a special state code to represent foreign countries.  To top it all off, the address presented to the SEC is not always the same address obtained by Yahoo!

I took one year of SEC reports, using forms 10-Q and 10-K to build my company list and I pulled as many profile pages as were available from Yahoo!  

After normalization, I used Levenshtein’s distance algorithm to find the difference in characters between each piece of the name, address and phone number.  I compared every company from the SEC to every company in Yahoo! and stored the numbers in an N x M grid.  Then I used the Hungarian Algorithm to find the best possible matches across this grid.  After stripping out any match that wasn’t the best possible match in a given row and column, I was left with the following:

SEC companies:  8698
Yahoo! companies: 6824
Total matches: 4874
Possible Bad matches: 66

Bad matches, in this case, are matches where some number of characters in each piece of the name-address-phone number comparison did not match.
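The character differences mentioned above come from Levenshtein distance: the number of single-character insertions, deletions and substitutions needed to turn one string into another.  Below is a standard two-row dynamic programming version in Java, given as an illustration rather than the exact code used for these numbers; the class name is made up for this example.

```java
final class Levenshtein {
    /** Number of single-character edits needed to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                      // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                      // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A distance of 0 for every piece of the name, address and phone number is an exact match.  An un-normalized pair such as “CORP” and “CORPORATION” is 7 edits apart, which is why normalization has to happen first.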

Visually inspecting these bad matches, I found that some of them were indeed bad matches.  Others were very close and probably represent a correct match.  This probably means that at least some of the companies I have listed as good matches are not really matches.  Only about 1800 companies matched exactly.  Over 3000 had between 1 and 69 characters difference.  

I have not made a comparison between the SEC search engine results and the address matching results because I am simply not comfortable with the address matching.  On the surface the address matching is less capable than the search engine results and 5118 is a lot to work with.


Here is a graph of the expected change in price of a stock over some number of days.  Note that the scale changes as the number of days gets longer.  This graph does not indicate exponential growth the longer you hold a stock.  It appears to indicate a linear increase in uncertainty as the number of days increases, with a very slight bias toward positive growth.

Randomness of Stock Prices

I recently performed several statistical tests on the roughly 33 million stock prices from over 5000 companies from the last 40 years that I have obtained from Yahoo!  The tests looked at:

  1. The number of times a stock price is above its 20-day moving average (the stock price is rising) versus the number of times the price is below the moving average (the stock price is falling).
  2. The length of an up-trend versus a down-trend.
  3. The change in price between one period and the next.

Price Relative to the Moving Average

In the first test, I was hoping to see a clear winner - that a stock price was more likely to be increasing or decreasing.  Instead I found an even 50-50 split in the price being above the average and the price being below the average.  The more I thought about it, the more sense that makes.  Of course it will be a 50-50 split.  The moving average is the expected value of the price at any time which means we should expect 50% of the prices to be above the line and 50% of the prices to be below the line.

Despite my initial failure to recognize how basic statistics work, the results do show that math works.  They also show that a moving average (the average price over the previous X days) can be useful in describing the state of a stock price.

A stock price has two states: trending and trading.  A trend means that the stock is continuously rising or falling for some period of time.  The moving average tends to move in the same direction as the trend.  If the price is above the moving average and the moving average is increasing, the stock is trending up.  If the price is below the moving average and the moving average is falling, the stock is trending down.  Trading is where the stock price bounces back and forth across the moving average line while the moving average is relatively flat.
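These rules translate almost directly into code.  The Java sketch below treats the moving average as rising or falling by comparing it with the previous period's value.  That comparison is my reading of “increasing” and “falling”, and the class and enum names are made up for this example.

```java
final class PriceState {
    enum State { TRENDING_UP, TRENDING_DOWN, TRADING }

    /**
     * Classifies one period from the closing price, the moving average for
     * that period and the moving average for the previous period.
     */
    static State classify(double price, double movingAverage, double previousMovingAverage) {
        if (price > movingAverage && movingAverage > previousMovingAverage) {
            return State.TRENDING_UP;
        }
        if (price < movingAverage && movingAverage < previousMovingAverage) {
            return State.TRENDING_DOWN;
        }
        // Everything else, including a flat average, counts as trading.
        return State.TRADING;
    }
}
```

Counting consecutive periods with the same state gives the length of a trend or of a trading period.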

Global Counts

  • mean = -0.0002
  • P[up]   0.4998949646117925
  • P[down] 0.5001050353882075

Note: this does show a tiny bias toward the price being below the moving average, but only at the 4th decimal place, which is not particularly significant and could change given more data.


Length of a Trend

I calculated the length of an up-trend as the number of consecutive periods any price was above the moving average line and the moving average was increasing.  A down-trend was calculated in much the same way.

Without having graphed the data, it appears that the length of a trend line can be described as an exponential distribution with the longest recorded trend lasting just over 150 working days.  The maximum trend length is somewhere between 60 and 80.  The mean length is 8 working days while both the mode and the minimum lengths are 1 day.  With the median length at 4 days, the greatest bulk of trend lengths are below the mean with the longer, more important trends being relatively rare.

Also, the probability of being in an up-trend is only 25.83%.  As stated above, the probability of the price being above the moving average line is 50%.  The probability of being in a down-trend is only 25.50%.  The other 48.7% of the time is spent trading.  Periods of trading typically do not last as long as trends, but can last much longer.

Global Length of Up trends

  • mean = 8.7089,
  • median = 4.0000,
  • mode = 1.0000,
  • variance = 106.3966,
  • stddev = 10.3149,
  • min = 1.0000,
  • max = 154.0000

Global Length of Down trends

  • mean = 8.2389,
  • median = 4.0000,
  • mode = 1.0000,
  • variance = 91.9654,
  • stddev = 9.5899,
  • min = 1.0000,
  • max = 150.0000

Global Length of Trading

  • mean = 3.0703,
  • median = 2.0000,
  • mode = 1.0000,
  • variance = 13.6157,
  • stddev = 3.6899,
  • min = 1.0000,
  • max = 1447.0000

Price Changes

Differences in prices between one period and the next were calculated as follows:

Difference = ( current closing price - previous closing price ) / previous closing price.

Edit: I messed up the mean and standard deviation (originally 0.0056 for the mean and 0.0124 for the standard deviation) here by taking too much data into account.  I had 588 outliers out of roughly 15 million where the stock price changed more than 10 times the original price in one day.  After removing those, the mean and standard deviation are more reasonable.

The typical change in stock price from one day to the next is 0.0017 x stock price with a standard deviation of 0.0717 x stock price.  Therefore, the change in a stock with a price of $5.25 will have:

  • a 64% chance to be in the range of  -$0.37 and $0.39
  • a 91% chance of being in the range of -$0.74 and $0.76
  • a 94% chance of being in the range of -$1.12 and $1.14.
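The ranges above are just the mean plus or minus some number of standard deviations, scaled by the stock price.  Here is a minimal Java sketch of that arithmetic together with the difference formula; the class and method names are made up for this example.

```java
final class ChangeRange {
    /** Relative change between two closing prices, as defined above. */
    static double relativeChange(double previousClose, double currentClose) {
        return (currentClose - previousClose) / previousClose;
    }

    /**
     * Dollar range {low, high} covered by mean +/- k standard deviations,
     * where mean and stdDev are expressed as fractions of the stock price.
     */
    static double[] dollars(double price, double mean, double stdDev, int k) {
        double low = price * (mean - k * stdDev);
        double high = price * (mean + k * stdDev);
        return new double[] { low, high };
    }
}
```

ChangeRange.dollars(5.25, 0.0017, 0.0717, 1) comes out at roughly -0.37 and 0.39, which is the first range in the list.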

More significant changes are not impossible, but they are unlikely.  So, the change from day to day is likely to be small.  This places an expectation on how quickly we can expect to make or lose money over a given time period.  For example, to lose its entire value, the stock would need to go down at a rate 3 standard deviations from the mean for 24 consecutive days or it would need to go down at a rate of 1 standard deviation from the mean for 132 consecutive days.

It also places a growth rate on expensive stocks versus inexpensive stocks.  A stock priced at $200 is expected to change ten times more quickly than a stock priced at $20.  However, we could buy 10 times more shares in the $20 stock.  

Please note that I have not run a test to determine if a $200 stock really can be expected to grow 10 times as fast as a $20 stock.  I’m basing this assumption on the mean and standard deviations found for all stocks of any price where the difference has been normalized by the original stock price.

Over 5 days, we can expect the stock price to go up as often as it goes down, so the expected change in stock price over 5 days is 0.0033 x stock price with a standard deviation of 0.1017 x stock price.  To map this out as above, the change in a $5.25 stock would have:

  • a 64% chance to be in the range of -$0.52 and $0.55
  • a 91% chance to be in the range of -$1.05 and $1.09
  • a 94% chance to be in the range of -$1.58 and $1.62

Over 30 days, the mean is 0.0149 x stock price and the standard deviation is 0.2137 x stock price, so the change in a $5.25 stock would have:

  • a 64% chance to be in the range of -$1.04 and $1.20
  • a 91% chance to be in the range of -$2.17 and $2.32
  • a 94% chance to be in the range of -$3.29 and $3.44

So, now what?

According to the tests above, we have a 25% chance of being in an up trend on any given day.  That up trend can be expected to last a little more than a week.  It could go longer and has, but the probability of that happening is low.  In a week, we can expect the stock price to go up as much as half the original price in that time.

If we catch a stock late in an up trend, the likelihood of the trend ending becomes high quickly.  It would be better to catch a trend near its beginning rather than hoping the trend will continue after buying into the middle.

What I have stated sounds like common sense.  Sometimes it helps to verify common sense.

Below are some tables showing the expected change in price for stock priced in the ranges of $1, $5, $10 or $30.  Remember that the 1st Standard Deviation (-1 to +1) has a probability of 64%, the 2nd has a probability of 91% and the 3rd has a probability of 94%.  It is possible, but unlikely, for a price to move more than this.

1-Day Change

Price           1       5      10      30
-3-StdDev   -0.21   -1.07   -2.15   -6.46
-2-StdDev   -0.14   -0.72   -1.43   -4.30
-1-StdDev   -0.07   -0.36   -0.72   -2.15
Mean         0.00    0.01    0.02    0.05
+1-StdDev    0.07    0.36    0.72    2.15
+2-StdDev    0.15    0.72    1.44    4.31
+3-StdDev    0.22    1.08    2.15    6.46

7-Day Change

Price           1       5      10      30
-3-StdDev   -0.35   -1.78   -3.56  -10.69
-2-StdDev   -0.23   -1.18   -2.37   -7.12
-1-StdDev   -0.11   -0.59   -1.18   -3.56
Mean         0.00    0.02    0.05    0.15
+1-StdDev    0.12    0.60    1.19    3.57
+2-StdDev    0.24    1.19    2.38    7.13
+3-StdDev    0.36    1.79    3.57   10.70

14-Day Change

Price           1       5      10      30
-3-StdDev   -0.45   -2.31   -4.62  -13.88
-2-StdDev   -0.30   -1.54   -3.08   -9.25
-1-StdDev   -0.15   -0.76   -1.54   -4.62
Mean         0.01    0.04    0.08    0.25
+1-StdDev    0.16    0.78    1.55    4.64
+2-StdDev    0.32    1.55    3.10    9.27
+3-StdDev    0.47    2.32    4.64   13.90

Securities and Exchange Commission reports

This is a little about getting data out of the SEC reports 10-K and 10-Q.  Form 10-K is an annual earnings report while 10-Q is a quarterly report.  Both contain data that can be used in the fundamental analysis of stocks and both are confusing as anything.  Hopefully, I can demystify them a bit.

The first step in obtaining an SEC report, as stated in my last post, is to obtain an index file.  These files live at specific locations on the EDGAR file system.  For example, the index for quarter 1 of 2013 can be found at

This can be easily pulled up in a web browser.  The file is delimited by vertical bars (|) and easily parseable with String.split().  The format is as follows:

  • Central Index Key (CIK)
  • Company name
  • Form Type
  • Filing date
  • FTP path
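A minimal Java sketch of parsing one index line into these five fields follows.  The class and field names are made up for this example.  Note that String.split() takes a regular expression, so the vertical bar has to be escaped.

```java
final class IndexEntry {
    final String cik;
    final String companyName;
    final String formType;
    final String filingDate;
    final String path;

    private IndexEntry(String[] f) {
        cik = f[0].trim();
        companyName = f[1].trim();
        formType = f[2].trim();
        filingDate = f[3].trim();
        path = f[4].trim();
    }

    /** Parses one "CIK|Company name|Form type|Filing date|Path" line. */
    static IndexEntry parse(String line) {
        // "\\|" is a literal vertical bar; -1 keeps empty trailing fields.
        String[] fields = line.split("\\|", -1);
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + line);
        }
        return new IndexEntry(fields);
    }
}
```

Filtering the index down to the reports of interest is then a matter of checking formType for “10-K” or “10-Q”.  Header lines at the top of the file do not have five fields and need to be skipped before parsing.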

The Central Index Key (CIK)

The CIK is a number used by the SEC to identify a company internally.  Notice that no stock symbol is included and the name of the company is not the same as the name presented by Yahoo! and others.  Fortunately, the SEC has also provided a search page that can be used to look up a company’s CIK based on their ticker symbol.

The CIK can be found on the resulting HTML page in a hidden input with the name “CIK.”

An alternate CIK lookup exists at the following link, but I had worse luck with the results.

Using the company search page I have found 5207 companies with a CIK out of 27443 companies listed in Yahoo!  Note that most of the companies in Yahoo! are foreign to the U.S. and will not have a CIK.

File Structure

As stated in my previous post, obtaining the actual report can be as simple as prefixing the file name in the EDGAR index with “” or as complex as building an FTP client to download the entire list in one session.

I was able to download all the 10-K and 10-Q reports for quarter 1 of 2005 through quarter 1 of 2013 over the course of several weeks for a total of 290600 files using 73.5 GB of storage, compressed with GZip.  GZip tends to get about 10:1 compression on these files so I downloaded around 735 GB of data.  I don’t need or want all that.

Each report file contains many separate documents.  The overall structure of the file is SGML.  The document format is as follows:

  <TYPE>Data type (XML, EXCEL, ZIP) or special name
  <SEQUENCE>A number
  <FILENAME>A file name without a path.
  <DESCRIPTION>An identifying name
     … Content …

Binary files are encoded base64.  XML and HTML files are included inline.

Trying to use an XML parser on the file fails because the embedded HTML does not conform to the stricter rules of XML.  Instead, the documents must first be broken up so each one can be handled individually.  I did this by reading the entire file into memory and splitting the text on the regular expression /(</?DOCUMENT>\s*)+/.  The header information may be read from each block of text and the content block can be extracted.
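In Java the split looks roughly like this.  It is a sketch that assumes the whole report is already in memory as a String; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ReportSplitter {
    // One or more opening or closing DOCUMENT tags, with any whitespace after each.
    private static final Pattern DOCUMENT_TAGS = Pattern.compile("(</?DOCUMENT>\\s*)+");

    /** Splits a report into the text before the first document and each document body. */
    static List<String> split(String report) {
        List<String> blocks = new ArrayList<>();
        for (String block : DOCUMENT_TAGS.split(report)) {
            if (!block.trim().isEmpty()) {
                blocks.add(block);
            }
        }
        return blocks;
    }

    /** Returns the rest of the line starting with <tag>, or null if there is none. */
    static String headerValue(String block, String tag) {
        String prefix = "<" + tag + ">";
        for (String line : block.split("\\R")) {
            if (line.startsWith(prefix)) {
                return line.substring(prefix.length()).trim();
            }
        }
        return null;
    }
}
```

Each block can then be routed by ReportSplitter.headerValue(block, "TYPE") to whatever handles that kind of content.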


Extensible Business Reporting Language (XBRL)

In February 2005, the SEC began a voluntary filing program and released the initial US Generally Accepted Accounting Principles (US-GAAP) Taxonomy defining a business oriented XML format called eXtensible Business Reporting Language or XBRL.   In June 2009, the SEC began mandating that companies file their reports using XBRL.  This means a standards-based container for all the company data, ideally making that data as easy to find as document.getElementsByTagName().  Unfortunately, nothing is quite that simple.

For more information on XBRL, start here:

Facts comprise much of an XBRL document.  Earnings Per Share, for example, may be listed in the <us-gaap:EarningsPerShareBasic> tag.  Net Income may be listed in the <us-gaap:NetIncomeLoss> tag.  I say “may be listed” because things are not that simple.  Financial reporting is often very specific.  Numbers that may be used in the same way for the same purposes may be named different things.  Net Income may be <us-gaap:NetIncomeLoss> or <us-gaap:NetIncomeLossAvailableToCommonStockholdersBasic> or <us-gaap:ProfitLoss> depending on the company.  Each of these numbers may have a distinct, real world meaning but they are each used for the same purpose, effectively making them the same number.

To further complicate the issue, companies are required to report the previous year as well as the current year.  They may also report a 6 or 9-month range in the current and previous years along with the required 3-month range.  This means that a node containing, say, Earnings Per Share may appear many times.

XBRL uses contexts to organize the data.  Contexts include a date or date range and potentially a set of explicit members.  These explicit members make it possible for multiple contexts to fall into the same date range but have different meanings.  Contexts with a single date are called “instant” and describe the value of a fact as of a certain point in time.  Contexts with a date range (two dates) may indicate that the facts describe average values such as the weighted average number of shares over a period as opposed to shares issued on a given date.

XBRL also uses units to further differentiate facts.  Units are often used to denote a currency type such as USD, CAD or GBP.

The sheer number of possibilities makes looking up a value in XBRL difficult.  The best method I have to find data in XBRL is to select the contexts for the current year with a 3-month date range (in Form 10-Q) or a 12-month date range (in Form 10-K).  Then select the context containing the most facts.  One will be the obvious winner.
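That selection can be sketched in Java as below.  Everything about the data model is an assumption made for the example: the Context class, its fields, and the idea of matching the date range by a day count with a tolerance (fiscal quarters and years are rarely exactly 3 or 12 calendar months long).

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;

final class ContextPicker {
    /** Hypothetical summary of one XBRL context: its dates and how many facts use it. */
    static final class Context {
        final String id;
        final LocalDate start;
        final LocalDate end;
        final int factCount;

        Context(String id, LocalDate start, LocalDate end, int factCount) {
            this.id = id;
            this.start = start;
            this.end = end;
            this.factCount = factCount;
        }
    }

    /**
     * Among contexts ending on reportEnd whose length is within toleranceDays
     * of targetDays, returns the one with the most facts, or null if none fit.
     */
    static Context pick(List<Context> contexts, LocalDate reportEnd,
                        int targetDays, int toleranceDays) {
        Context best = null;
        for (Context c : contexts) {
            long days = ChronoUnit.DAYS.between(c.start, c.end) + 1;
            if (!c.end.equals(reportEnd) || Math.abs(days - targetDays) > toleranceDays) {
                continue;
            }
            if (best == null || c.factCount > best.factCount) {
                best = c;
            }
        }
        return best;
    }
}
```

For a 10-Q the target would be about 91 days and for a 10-K about 365 days.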


Common Facts

Despite the overwhelming number of available facts, only a small number of facts are commonly used across a wide range of companies.  We can use the node names of these facts as a starting point, but we still need to deal with the strangeness of the financial world.

Facts listed under the us-gaap namespace are most commonly used because these are the facts defined by the SEC.  Most if not all companies will add their own namespace and their own node names to the list.  The dei namespace includes metadata like the company name.


Financial Ratios

Certain financial ratios are used in fundamental analysis that are not found directly in the SEC reports.  Fortunately, we have math.  A nice set of financial ratios can be found here:

Return on Equity is a ratio that is not available in the XBRL data, but it can be calculated.  From the wiki above, I see that I need two values: Net Income and Stockholders Equity.  Looking at the XBRL data, Net Income typically appears in a date range context while Stockholders Equity appears in an instant context.

Return on Equity = Net Income / Average Stockholders Equity

I would expect to find the data listed under us-gaap:NetIncomeLoss and us-gaap:StockholdersEquity because those nodes are very common.  The example below illustrates how this isn’t always a correct assumption.

Choosing a random 10-K report, Sandisk in 2011 had the following values:

Period = 01/03/2011 to 01/01/2012 (12M), id = “D2011Q4YTD”

  • 290000 (usd) us-gaap:NetIncomeLossAttributableToNoncontrollingInterest
  • 986990000 (usd) us-gaap:NetIncomeLossAvailableToCommonStockholdersDiluted

Period = 01/01/2012, id = “I2011Q4”

  • 7064358000 (usd) us-gaap:StockholdersEquity
  • 7060839000 (usd) us-gaap:StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest

As a human, I look at this data and see two possible results

  • ROE = 290000 / 7060839000 = 0.00004107
  • ROE = 986990000 / 7064358000 = 0.1397

Where the first ROE is attributable to non-controlling Interests and the second is attributable to common stockholders.  The second appears to be the more relevant of the two so I would use that one.
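The calculation itself is one division.  The Java sketch below offers both the single-value form used in the example above and the averaged form from the formula; the class and method names are made up for this example.

```java
final class ReturnOnEquity {
    /** ROE using a single Stockholders Equity value (one instant context). */
    static double of(double netIncome, double equity) {
        return netIncome / equity;
    }

    /** ROE using the average of the starting and ending Stockholders Equity. */
    static double ofAverage(double netIncome, double equityStart, double equityEnd) {
        return netIncome / ((equityStart + equityEnd) / 2.0);
    }
}
```

For the Sandisk numbers listed above, the common stockholder figure is ReturnOnEquity.of(986990000, 7064358000.0).  The averaged form needs the Stockholders Equity from the previous year's instant context as well.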

Unfortunately, beyond heuristics and gut feelings, I don’t have a good way to identify which numbers apply to what ratio.


This post was just a walk through the SEC data available for free on the Internet.  I mentioned financial ratios, but I didn’t describe their purpose or how to use them.  The reason for this is that ratios need to be compared to an industry average.  I have not discovered how that average is obtained, though Yahoo! makes industry and even sector averages available for the current quarter.

The SEC data only goes back to 2010 with XBRL, but that gives me 13 quarters of data for testing.

Java and FTP

Java’s relationship with FTP is somewhat vague.  The code to handle an FTP connection exists in the bowels of the Java runtime, but it can’t be directly accessed.  Instead, if you give an FTP address to a URL you can download files over FTP using a URLConnection.  There’s a problem with this.

The Securities and Exchange Commission (SEC) receives regular quarterly reports from companies in the United States.  It makes these reports available to the public through an anonymous FTP site called EDGAR.  EDGAR does not allow listing in the data directories.  The preferred method of getting a file list is to download an index file that can be filtered based on company, date and report type.  For example, there were 35606 reports made available by the SEC for the first quarter of 2013 (this is not a complete list because the quarter is not over!).  Once you have this, just append the file name to the root URL and you can download that file.

The problem is that you make a separate FTP connection for every file you download, and EDGAR will run out of connections quickly if you try to download a big list of files at once.  This happens not because you have opened so many connections simultaneously, but because each connection takes a while to time out.  After a while, EDGAR just stops handing out files altogether.  The solution is to write an FTP client for Java.

I had done this back in 2004 to automate uploading new comics to my Keenspace (now Comic Genesis) site.  This code was originally based on the MatzSoft Java FTP-Client, a GPLed, Java-based FTP client.  I found their implementation to be overly complex for my purposes and ended up rewriting most of the back end.  I rewrote it again on Feb 20 2013 to make use of Java’s exception handling.

MatzSoft Java FTP-Client Project on Sourceforge
User: Mathias Menzel-Nielsen

Please note that I will not be talking about FTP over SSL or sFTP or any of the other variants.  My code only handles vanilla, unsecured FTP, good for connecting to anonymous sites.  Always remember that FTP uses plain text and unsecured connections.  Any passwords you send over FTP are plainly visible to anyone capturing packets anywhere along the route.  Don’t use your FTP password anywhere else.

Writing an FTP client is not particularly difficult.  The complete specification can be found in RFC 959.  The client makes use of two Sockets, the Control Socket and the Data Socket.  The Control Socket connects to the main FTP site on port 21.  How the Data Socket connects depends on a few things.

Normal (active mode) data connections happen when an FTP server expects the client to open a port locally.  This only works properly if the client is not separated from the server by a firewall.  Most DSL and cable services, as well as university and Internet cafe locations, use private network addresses (192.168.* or 10.*) to connect users instead of giving them real Internet IP addresses.  If you have direct access to the router, you can use port forwarding to connect the router’s port numbers to your local machine.  Also, the data port is usually a random 16-bit number above 10000 so you will need to open a wide range of ports.  That’s a lot of work for little gain.

Instead, we need to look at passive mode.  A passive connection means that the client sends a request to the server to open a data port.  The server responds on the Control Socket with the IP address and port of the open connection.  Note that the server handling data connections does not need to be the same server handling control connections, so be sure to use the IP address sent!  The client then connects to the IP address/port given and copies the file.  Depending on the transfer method, this connection may be closed when the transfer is complete or left open for more files.

I’ve mentioned sending requests and receiving responses on the Control Socket.  I won’t list the entire FTP language for you here.  If you want to see it, check out the RFC.  I will list some of the important key words and how to use them.

USER <user name>

Send the user name.  This can be “anonymous.”  If you send a real username, the server will respond with “331 Password required for <user name>.”  Otherwise the server will respond with “331 Guest login ok, send your complete e-mail address as password.”  Use PASS to respond to the server’s request.

PASS <password>

Send the password in plain text.  If your username is “anonymous” send your email address.  Note that “a@b.c” is probably as valid as anything.

CWD <directory path>

Change working directory.  This is not necessary if you already know the directory structure of the FTP server.  It can be helpful if you just want to browse.  If successful, the server will respond with “250 CWD command successful.”


PASV

Enter passive mode.  This must be done for each request involving the Data Socket.  The server will respond with something like “227 Entering Passive Mode (127,0,0,1,129,210)”.  The numbers in parentheses are the 4 bytes of an IPv4 address followed by the high and low bytes of the port number.  The regular expression “^227.*\(([0-9]+,[0-9]+,[0-9]+,[0-9]+),([0-9]+),([0-9]+)\).*$” will separate this response into three groups like so:

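// Requires java.util.regex.Pattern and java.util.regex.Matcher
Pattern ipPattern = Pattern.compile( "^227.*\\(([0-9]+,[0-9]+,[0-9]+,[0-9]+),([0-9]+),([0-9]+)\\).*$" );
Matcher matcher = ipPattern.matcher( response );
if( matcher.matches() ) {
    // Group 1 is the comma-separated address; groups 2 and 3 are the port bytes
    String ipAddress = matcher.group( 1 ).replace( ',', '.' );
    int highPort = Integer.parseInt( matcher.group( 2 ));
    int lowPort = Integer.parseInt( matcher.group( 3 ));
    int port = (highPort << 8) + lowPort;
}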

TYPE <representation type>

We are interested in two data types: ASCII (A) and Binary (I).  ASCII is used for any transfers involving only text such as getting a file list.  Binary is useful for downloading files.

LIST [<path>]

Return a file list over the Data Socket.  Note that TYPE A must be sent first.  It should also be noted that no information about the data on the FTP server can be retrieved using only the Control Socket, not even a list of files.

RETR <file name>

Retrieve a file from the server.  Notice that only one file can ever be transferred using a single RETR command.  No wildcards are allowed.

STOR <file name>

Send a file to the server.  Again, only one file can be sent at a time.


QUIT

Tell the server you are done and want to close the connection.

Notice that the commands used by FTP are not the same as the ones used by a command line FTP client.  A lot more work is being done in the background that the user does not need to know about.

Here is an example FTP session connecting to a local FTP server:

  • Client connects to the server (ftp://localhost:21/)
  • server> 220 sepia FTP server (Version 6.4/OpenBSD/Linux-ftpd-0.17) ready.
  • client> USER user
  • server> 331 Password required for apple.
  • client> PASS password
  • server> 230- Welcome to Ubuntu 12.10 (GNU/Linux 3.5.0-25-generic x86_64)
    230-  * Documentation:
    230 User user logged in.
  • client> CWD test/data
  • server> 250 CWD command successful.
  • client> PASV
  • server> 227 Entering Passive Mode (127,0,0,1,129,210)
  • Client opens a Data Socket to
  • client> LIST
  • server> 150 Opening ASCII mode data connection for ‘/bin/ls’.
  • Client receives the file list:
    total 32
    -rw-rw-r-- 1 apple apple   99 Feb 20 19:52
    -rw-rw-r-- 1 apple apple   95 Feb 20 19:02
    -rw-rw-r-- 1 apple apple 3986 Feb 22 08:56
    -rw-rw-r-- 1 apple apple 6426 Feb 22 08:56
    -rw-rw-r-- 1 apple apple 6663 Feb 21 08:59
    -rw-rw-r-- 1 apple apple 1510 Feb 20 22:20
  • server> 226 Transfer complete.
  • Client closes the Data Socket.
  • client> TYPE I
  • server> 200 Type set to I.
  • client> PASV
  • server> 227 Entering Passive Mode (127,0,0,1,234,235)
  • Client opens a Data Socket to
  • client> RETR
  • server> 150 Opening BINARY mode data connection for ‘’ (3986 bytes).
  • Client downloads the file over the Data Socket.  How this file is handled locally depends on what you want to do with it.  The OutputStream does not need to go to a file.
  • server> 226 Transfer complete.
  • Client closes the Data Socket.
  • client> QUIT
  • server> 221 Goodbye.
  • Client closes the Control Socket.

Notice that every reply from the server happens over the Control Socket, including the replies that indicate a file transfer is complete.  Only one file gets downloaded at a time because the Control Socket is used to identify when a file has completed sending.

I mentioned that the Data Socket can be left open in certain transfer modes.  The example above uses Stream mode.  The files are made available as a file stream.  This is very simple to implement and is the only mode I have actually implemented.  Block mode sends the file in blocks along with descriptor blocks that show the boundaries of a file.  Block mode is not required and may not be implemented in a given server.  Stream mode is required and must be implemented.

The above example uses a local FTP server on my Ubuntu box.  The text returned by any given FTP server will not necessarily be the same, so look for the return code at the beginning of a line followed by a space to determine when the server is done sending response text.  The return code followed by a dash (-) indicates more response text to come.
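The rule about the space and the dash can be sketched as a small reader for the Control Socket.  This is only an illustration: the class and method names are hypothetical, and it assumes the Control Socket’s input has already been wrapped in a BufferedReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative reply reader; the names here are hypothetical. */
final class FtpReplySketch {
    /**
     * Reads one complete reply.  A first line such as "230-..." starts a
     * multi-line reply, which ends at the first line that begins with the
     * same code followed by a space.
     */
    static List<String> readReply(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line = in.readLine();
        if (line == null) {
            throw new IOException("connection closed before a reply arrived");
        }
        lines.add(line);
        boolean multiLine = line.length() > 3 && line.charAt(3) == '-';
        if (multiLine) {
            String terminator = line.substring(0, 3) + " ";
            while (true) {
                line = in.readLine();
                if (line == null) {
                    throw new IOException("connection closed in the middle of a reply");
                }
                lines.add(line);
                if (line.startsWith(terminator)) {
                    break;
                }
            }
        }
        return lines;
    }

    /** The reply code is the first three characters of the last line. */
    static int replyCode(List<String> reply) {
        return Integer.parseInt(reply.get(reply.size() - 1).substring(0, 3));
    }
}
```

Matching on the code rather than on the response text is what lets the same client talk to servers that word their replies differently.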

Here is an example of connecting to the SEC:

  • Client connects to the server
  • server> 220 FTP server ready.
  • client> USER anonymous
  • server> 331 Anonymous login ok, send your complete email address as your password
  • client> PASS
  • server> 230-Anonymous access granted, restrictions apply
     Please read the file README.txt
    230    it was last modified on Tue Aug 15 14:29:31 2000 - 4573 days ago
  • client> CWD edgar
  • server> 250-CWD command successful
     Please read the file README.txt
    250    it was last modified on Tue Jul 10 10:31:45 2007 - 2054 days ago
  • client> PASV
  • server> 227 Entering Passive Mode (162,138,177,36,164,46).
  • Client connects Data Socket to
  • client> LIST
  • server> 150 Opening ASCII mode data connection for file list
  • Client downloads the contents of the file list.  Unfortunately, no indication from the server is made that the file list has completed downloading and the client crashes because the Socket timed out.

Note the differences in the response text between the SEC’s FTP server and my local Ubuntu FTP server.

I did receive the following file list before the timeout happened:

drwxr-xr-x   4 1019     bin          4096 Feb  9 02:26 2000
-rw-r--r--   1 1024     bin       1508388 Aug 21  2009 2009-03-23.rss.xml
drwxr-xr-x  22 1019     bin          8192 Feb  9 02:36 Feed
drwxr-xr-x  21 1019     bin          8192 Feb  9 02:42 Oldloads
-rw-r--r--   1 root     bin          7208 Jul 10  2007 README.txt
drwxr-xr-x   2 1019     bin          4096 Oct  7  2004 Tools
drwxr-xr-x 102 1019     bin          8192 Feb 10 11:30 containers
drwxr-xr-x  22 1019     bin         28672 Feb 22 03:01 daily-index
drwxr-xr-x  44 1019     bin      37363712 Feb 22 16:57 data
drwxr-xr-x   2 1019     bin          4096 Oct  7  2004 docs
drwxr-xr-x   2 1019     bin          4096 Oct  7  2004 forms
drwxr-xr-x  23 1019     bin          4096 Feb 22 03:02 full-index
lrwxrwxrwx   1 root     root           33 Oct  7  2004 index.htm -> /usr/local/web/index.nobrowse.htm
drwxr-xr-x   2 1024     bin          8192 Feb  2 03:15 monthly
-rw-r--r--   1 1019     bin        102478 Feb 22 03:02 sitemap.xml
-rw-r--r--   1 1024     106       1202034 Feb 22 16:48 usgaap.rss.xml
-rw-r--r--   1 1024     106        437347 Aug 13  2010 usgaap.rss.xml.20100812
drwxr-xr-x  29 1024     bin          4096 Jan 10 02:17 vprr
-rw-r--r--   1 1024     106         88773 Feb 22 16:48 xbrl-rr-vfp.rss.xml
-rw-r--r--   1 1024     106        634833 Feb 22 16:48 xbrl-rr.rss.xml
lrwxrwxrwx   1 root     root           44 Jun 10  2009 xbrl.html -> /usr/local/web/info/edgar/ednews/xbrlrss.htm
-rw-r--r--   1 root     bin      79934154 Mar 24  2009 xbrldata.tar.gz
-rw-r--r--   1 root     bin      78763738 Mar 24  2009
-rw-r--r--   1 1024     106       1104085 Feb 22 16:48 xbrlrss.all.xml
-rw-r--r--   1 1024     106        425863 Aug 13  2010 xbrlrss.all.xml.20100812
-rw-r--r--   1 1024     bin          7733 Apr 20  2009 xbrlrss.idea.xml
-rw-r--r--   1 1024     bin         92840 Apr  8  2010 xbrlrss.risk-return.xml
-rw-r--r--   1 1024     106         92840 Apr  8  2010 xbrlrss.risk-return.xml.20100812
-rw-r--r--   1 1024     bin           816 Apr  8  2010 xbrlrss.rr2008.xml
-rw-r--r--   1 1024     106           816 Apr  8  2010 xbrlrss.rr2008.xml.20100812
-rw-r--r--   1 1024     106        644547 Feb 22 16:48 xbrlrss.xml
-rw-r--r--   1 1024     106        359839 Aug 13  2010 xbrlrss.xml.20100812

I would also like to note that I have retrieved a master index from the edgar/full-index/2013/QTR1/ directory, parsed it and downloaded all the Form 10-Q files listed.  I was able to download 750 files over the course of 2 hours and 45 minutes with no errors or problems.
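Filtering the master index for one report type might look roughly like the sketch below.  I am assuming that each data row of the index is pipe-delimited in the order CIK, company name, form type, date filed and file name; check the header of the index you actually download before relying on that.  The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative filter for an EDGAR master index; names are hypothetical. */
final class MasterIndexSketch {
    /**
     * Returns the file names of all rows whose form type matches exactly.
     * Assumed row layout: CIK|Company Name|Form Type|Date Filed|Filename
     */
    static List<String> filesForForm(List<String> indexLines, String formType) {
        List<String> files = new ArrayList<>();
        for (String line : indexLines) {
            String[] fields = line.split("\\|");
            if (fields.length != 5) {
                continue; // descriptive header text or separator row
            }
            if (!fields[0].trim().matches("[0-9]+")) {
                continue; // the row of column titles
            }
            if (fields[2].trim().equals(formType)) {
                files.add(fields[4].trim());
            }
        }
        return files;
    }
}
```

Each returned file name can then be appended to the root URL and downloaded over a single, reused FTP connection.  An exact match on the form type keeps amended filings such as 10-Q/A out of the list; loosen the comparison if you want those too.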

In a later post, I will show some results of parsing the Form 10-Q files and how to extract data from them.

Hacking the Stock Market

Since my last post, I have put aside my EPUB work and focused more on the stock market.  This is partly due to money concerns and the belief that I will not have enough money to retire on and partly due to the (crazy) belief that I can make a program that will generate money.  I have found only that making money on the stock market is not a simple thing.  I have also found that making money is not an unreasonable or impossible task.

I don’t remember the exact quote or who said it, but “if you know something you can do it, but if you truly understand something you can write a program to do it.”  That’s what I want to do with this project.


My goals are pretty simple.

  1. Analyze a set of companies and weed out the ones least likely to make money.  (Fundamental Analysis)
  2. Analyze the stock price of a company to determine the best time to buy stock.  (Technical Analysis)
  3. Analyze the stock price of a company to determine the best time to sell stock I have bought.  (more Technical Analysis)
  4. Profit.  (Underpants Gnomes)

To do all this, I need a few things.

  1. An account that will allow me to buy and sell stock.
  2. A model of this account so that I can perform “paper trading” on historical stock data.
  3. A model of the stock market that I can analyze.
  4. An analysis of the movement of stock prices versus the various fundamental analysis numbers to get an idea of the probabilities involved.
  5. An analysis of the movement of stock prices versus the various technical analysis numbers to get an idea of probabilities there as well.

Model of the Stock Market

The stock market groups individual companies into industries according to which companies make the same types of products or services.  These industries are then grouped into sectors.  There are roughly 27000 companies grouped into 215 industries and 9 sectors according to Yahoo! Finance.  Of those, just over 5000 companies make their earnings available to the Securities and Exchange Commission (SEC) here in the United States.  I will focus mainly on these 5000 companies because I know I can readily obtain stock prices for them as well as earnings reports if I am so inclined.

So far, I see four tables: Sectors, Industries, Companies and Stock Prices.  The data for these tables is easily obtained from Yahoo! Finance.

The list of sectors can be found here:

This page then links to the various industry lists:<sector number>conameu.html

And these pages link to the various company lists:<sector number><industry number>conameu.html

Parsing these pages is fairly straightforward.  I use NekoHTML to create a DOM tree in Java and then use getElementsByTagName to find all the hyperlinks on the page.

Sectors and Industries only have a name.  Companies have a name and a symbol.  The symbol is important as it is used to retrieve the stock prices.  Fortunately, the symbol is included in the hyperlink of the company name.

Historical Stock Prices can be found using:<symbol>&d=<month>&e=<day>&f=<year>&g=d&a=0&b=1&c=1970&ignore=.csv

The stock prices are returned in comma-delimited format and can extend back 40 years or more of daily prices.  While some day traders may need intraday prices for their trading, I intend to work with relatively small amounts of money.  Therefore, daily prices are just fine.  I will show why in a later post.
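Reading the comma-delimited prices into Java objects might look something like this.  The column order (Date, Open, High, Low, Close, Volume, Adj Close, with one header row) is an assumption on my part, and the class names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative price-row parser; the column layout is an assumption. */
final class PriceCsvSketch {
    static final class DailyPrice {
        final String date; // kept as text, e.g. "2013-02-04"
        final double open;
        final double high;
        final double low;
        final double close;
        final long volume;

        DailyPrice(String date, double open, double high, double low,
                   double close, long volume) {
            this.date = date;
            this.open = open;
            this.high = high;
            this.low = low;
            this.close = close;
            this.volume = volume;
        }
    }

    /** Parses CSV text, skipping the header row and any blank lines. */
    static List<DailyPrice> parse(String csv) {
        List<DailyPrice> prices = new ArrayList<>();
        String[] lines = csv.split("\\r?\\n");
        for (int i = 1; i < lines.length; i++) { // i = 1 skips the header
            if (lines[i].trim().isEmpty()) {
                continue;
            }
            String[] f = lines[i].split(",");
            prices.add(new DailyPrice(f[0],
                    Double.parseDouble(f[1]), Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4]),
                    Long.parseLong(f[5])));
        }
        return prices;
    }
}
```

The sketch does no error handling; a malformed row will throw, which is probably what you want while loading data into a database for the first time.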

Downloading the historical data for 5000 companies will require about 2 gigabytes of disk space in a database including indices and can take several hours.  I was able to download that data in roughly 2.5 hours over a 10 Mbit DSL connection using a threaded URL reader.

My results were (as of ~Feb 4, 2013):

  • 9 sectors
  • 215 industries
  • 26998 companies
  • 17,990,511 prices

Fundamental Analysis

In “Fundamental Analysis the Easy Way” they talk about using Return on Equity as a simple method for highlighting good companies.  If you looked at any of the links above, you may have noticed that ROE% is one of the values provided for sectors, industries and ultimately companies.  These numbers are also available as a download in comma delimited form.  The links to that data are: (sectors)<sector number>conameu.csv (industries)<sector number><industry number>conameu.csv (companies)

"Fundamental Analysis the Easy Way" can be found here:

While I don’t have any further data to show on this method (I changed my database and can’t run that particular piece of code just now), I will say that I got a substantially reduced number of companies to look over (it was either 100 or 1000.  I don’t remember.)  This means that I can use the results of a fundamental analysis technique, follow the companies for several days or weeks and see how their prices actually change.  I can then use that change to determine the probability that a particular fundamental analysis method returns the correct results.  So far, I have one method that I can test.


I reran the code I mentioned above and got 199 companies inside my price range, $1 - $25.  When I reran it with the price range set at $1 - $10, I got 88 companies.  This is a much more manageable number than 5000.

Technical Analysis

I have played a bit with technical analysis, but I can’t say that I have determined when a good time to buy and sell is.  Given a particular company, I find it very difficult to get a consistent indicator of when to buy or sell. (I use AMD as a test because I once made a small amount of money on that.)  I have tried various methods and the best one for me consistently lost money.  I will come back to this later.


In future posts, I will go into more depth about my methods.  This will, I hope, consolidate my ideas so that I can work out the problems I’m having and maybe give someone else some insight into a similar project.

Using Java’s XML DOM

Since EPUB is almost entirely based on XML, I have been working with the Java XML DOM for both parsing and now for producing the XML text.  This work is based on my own tinkering.  It should be stated that I do not have an in-depth knowledge of the XML DOM or the various (and there are many) implementations and implications of them.


Getting started with XML parsing is as simple as doing a Google search.  The code is fairly easy to come by as is the explanation of why it is used.  I don’t want to write a complete tutorial about parsing XML.  Instead, I’ll just throw some of the code I commonly use for parsing at you.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder db = factory.newDocumentBuilder();
InputStream in = new ByteArrayInputStream( xmlText.getBytes( "UTF-8" ));
Document doc = db.parse(  in );

Notice that the factory is namespace aware.  This is not so important for parsing, but it may save you some steps here and there.  

The DOM may now be accessed in a fairly standard way:

Node node = doc.getFirstChild();
while( node != null ) {
    // getAttributes() returns null for anything that is not an element
    if( node.getAttributes() != null ) {
        for( int i = 0; i < node.getAttributes().getLength(); i++ ) {
            Node attr = node.getAttributes().item( i );
            String name = attr.getNodeName();
            String value = attr.getNodeValue();
        }
    }
    String text = node.getTextContent();
    node = node.getNextSibling();
}


NodeList foos = doc.getElementsByTagName( "foo" );
for( int i = 0; i < foos.getLength(); i++ ) {
    Node node = foos.item( i );
}

Any conversion from XML to a standard Java class is done manually with data extracted from getNodeValue() and assigned based on getNodeName().  
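As a rough illustration of that manual conversion, the sketch below fills a small Java object from the children of the root element.  The BookInfoSketch class, its fields and the element names it looks for are all made up for this example.  It reads element text with getTextContent() because getNodeValue() returns null for element nodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Illustrative target class; the fields and XML layout are made up. */
final class BookInfoSketch {
    String title;
    String creator;

    static BookInfoSketch fromXml(String xmlText) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder db = factory.newDocumentBuilder();
        Document doc = db.parse(
                new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8)));

        BookInfoSketch info = new BookInfoSketch();
        Node child = doc.getDocumentElement().getFirstChild();
        while (child != null) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Assign by element name; unrecognized elements are ignored.
                String name = child.getNodeName();
                if (name.equals("title")) {
                    info.title = child.getTextContent();
                } else if (name.equals("creator")) {
                    info.creator = child.getTextContent();
                }
            }
            child = child.getNextSibling();
        }
        return info;
    }
}
```

For documents that use prefixes, comparing getLocalName() and getNamespaceURI() instead of getNodeName() avoids depending on whichever prefix the author happened to pick.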

For example, when reading an OPF file into my OPF object, I parse the document into a DOM Document and step through the children, building the data structure based on 

Producing XML:

Producing XML is a bit stranger.  It requires knowing a bit more about the DOM and how the different classes communicate with each other.  I should mention that the XML is not produced by the DOM itself.  Whichever serializer you decide to use will do that.  I have tried two with differing results.

The first method of producing XML uses JAXP and can be found in the javax.xml.transform package of Java 1.6.

TransformerFactory factory = TransformerFactory.newInstance();
Transformer trans = factory.newTransformer();
trans.setOutputProperty( OutputKeys.INDENT, "yes" );
trans.transform( new DOMSource( document ), new StreamResult( System.out ));

The second method uses the DOM level 3 Load and Save (LS) method and can be found in the org.w3c.dom.ls package of Java 1.6 (the registry class lives in org.w3c.dom.bootstrap), though I currently have Apache Xerces added to my test project and can’t remember which is doing the work.

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation( "LS" );
LSSerializer writer = impl.createLSSerializer();
String xml = writer.writeToString( node );

The above method does not include white space that would make the XML human readable.  Pretty printing options are available by setting the DOM configuration of the writer.

writer.getDomConfig().setParameter( "format-pretty-print", Boolean.TRUE );

Working with namespaces:

OPF metadata uses a number of different namespaces:

  • OPF:
  • Dublin Core Metadata Elements v 1.1:
  • Dublin Core Metadata Terms:
  • Marc Relators (used for Creator and Contributor roles):
  • and several others….

Below is an example of building part of an OPF document using the XML DOM.

String opfNamespace = "";
String dcNamespace = ""; 

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder db = factory.newDocumentBuilder();
Document doc = db.newDocument();

Node packageNode = doc.createElementNS( opfNamespace, "package" );
((Element)packageNode).setAttribute( "unique-id", "pub-id" );
((Element)packageNode).setAttribute( "xml:lang", "en" );
((Element)packageNode).setAttribute( "version", "3.0" );

Node metadataNode = doc.createElementNS( opfNamespace, "metadata" );

Node identifierNode = doc.createElementNS( dcNamespace, "dc:identifier" );
((Element)identifierNode).setAttribute( "id", "pub-id" );
identifierNode.setTextContent( "urn:uuid:" + UUID.randomUUID().toString() );
metadataNode.appendChild( identifierNode );

Node titleNode = doc.createElementNS( dcNamespace, "dc:title" );
titleNode.setTextContent( "Alice's Adventures in Wonderland" );
metadataNode.appendChild( titleNode );

Node languageNode = doc.createElementNS( dcNamespace, "dc:language" );
languageNode.setTextContent( "en-us" );
metadataNode.appendChild( languageNode );

Node dateNode = doc.createElementNS( opfNamespace, "meta" );
((Element)dateNode).setAttribute( "property", "dcterms:modified" );
SimpleDateFormat sdf = new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSSZ" );
dateNode.setTextContent( sdf.format( new java.util.Date() ));
metadataNode.appendChild( dateNode );

Node spineNode = doc.createElementNS( opfNamespace, "spine" );
Node guideNode = doc.createElementNS( opfNamespace, "guide" );

packageNode.appendChild( metadataNode );
packageNode.appendChild( spineNode );
packageNode.appendChild( guideNode );
doc.appendChild( packageNode );

The XML produced using the DOM level 3 Load and Save method is as follows:

<package xmlns="" unique-id="pub-id" version="3.0" xml:lang="en">
      <dc:identifier xmlns:dc="" id="pub-id">urn:uuid:74784855-cf69-4f7b-801e-191f3208c83f</dc:identifier>
      <dc:title xmlns:dc="">Alice's Adventures in Wonderland</dc:title>
      <dc:language xmlns:dc="">en-us</dc:language>
      <meta property="dcterms:modified">2011-12-31T07:22:10.448-0600</meta>

Notice the namespace declarations in dc:title, dc:language and dc:identifier.  These should be propagated into metadata but they aren’t, at least not automatically.

The following statement will fix that:

((Element)metadataNode).setAttributeNS( XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:dc", dcNamespace );

This produces a much more reasonable XML document:

<package xmlns="" unique-id="pub-id" version="3.0" xml:lang="en">
   <metadata xmlns:dc="">
      <dc:identifier id="pub-id">urn:uuid:e004acd1-aedb-4406-9222-bc113e076caa</dc:identifier>
      <dc:title>Alice's Adventures in Wonderland</dc:title>
      <meta property="dcterms:modified">2011-12-31T08:11:57.171-0600</meta>

The above is a fairly simple, two namespace problem where the namespace is attached to the element.  Watch what happens when the DOM Level 3 method is presented with attributes that have namespaces.


<metadata xmlns="">
   <dc:identifier xmlns:dc="" id="pub-id">urn:uuid:2e8c7bdc-4886-40dc-bb7a-d7d3827a037c</dc:identifier>
   <dc:title xmlns:dc="">Alice's Adventures in Wonderland</dc:title>
   <dc:creator xmlns:dc="" xmlns:opf="" opf:role="aut">Lewis Carroll</dc:creator>
   <dc:language xmlns:dc="">en-us</dc:language>
   <dc:date xmlns:dc="" xmlns:opf="" opf:event="modification">2011-12-31T08:22:56.186-06:00</dc:date>


<metadata xmlns="" xmlns:dc="" xmlns:opf="">
   <dc:identifier id="pub-id">urn:uuid:dfd0bdf0-cf9c-417a-989b-dffbc5e15600</dc:identifier>
   <dc:title>Alice's Adventures in Wonderland</dc:title>
   <dc:creator xmlns:NS1="" NS1:role="aut">Lewis Carroll</dc:creator>
   <dc:date xmlns:NS1="" NS1:event="modification">2011-12-31T08:21:55.912-06:00</dc:date>

Notice that the opf: prefix in the first version changed to NS1: in the second.  The JAXP transformer method produced these results with the same DOM:

<metadata xmlns="" xmlns:dc="" xmlns:opf="">
<dc:identifier id="pub-id">urn:uuid:230839f5-5fef-427b-bde3-e888634ada08</dc:identifier>
<dc:title>Alice's Adventures in Wonderland</dc:title>
<dc:creator opf:role="aut">Lewis Carroll</dc:creator>
<dc:date opf:event="modification">2011-12-31T08:24:29.202-06:00</dc:date>

Basically, the point of this post is two-fold.  First, I wanted to show how to propagate XML namespace declarations into parent tags.  This is demonstrated by the setAttributeNS() call.  Second, I wanted to show that serializer choice matters depending on how complicated your DOM is.
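For anyone who wants to try the setAttributeNS() call in isolation, here is a small self-contained sketch that uses the JAXP transformer to serialize to a String.  The two namespace URIs are placeholders that I made up for the example; substitute the real OPF and Dublin Core namespaces.  The class name is also just for illustration.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NamespaceHoistSketch {
    // Placeholder URIs for illustration only; not the real namespaces.
    static final String OPF_NS = "http://example.org/opf";
    static final String DC_NS = "http://example.org/dc";

    static String build() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element metadata = doc.createElementNS(OPF_NS, "metadata");
        // Declare the dc prefix explicitly on the parent element.
        metadata.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:dc", DC_NS);

        Element title = doc.createElementNS(DC_NS, "dc:title");
        title.setTextContent("Alice's Adventures in Wonderland");
        metadata.appendChild(title);
        doc.appendChild(metadata);

        Transformer trans = TransformerFactory.newInstance().newTransformer();
        trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        trans.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Printing the result of build() with and without the setAttributeNS() line is a quick way to see where a given serializer chooses to put the xmlns:dc declaration.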

Hacking Project Gutenberg Part 5

Unit Testing

The last 2 (3?) months have been spent generating unit tests to determine whether I am, in fact, correctly finding the chapter breaks.  This culminated in a 4962-line JUnit TestCase that covered 97 books.  Granted, 97 out of 36478 books is not statistically significant but it takes a long time to go through every book to copy and paste chapter headings.  Out of those 97, 36 failed to correctly find all the headings.  

My code relies on consistency to work properly.  When finding numbered chapters, I need each level in the hierarchy to appear every time.  If multiple books are present, the sections will be named differently and have different internal hierarchies.  One book will use “Chapter number.”  The next will use only Roman numerals.  For this reason, the following books failed:

  • Redemption and Two Other Plays, by Leo Tolstoy et al
  • What Men Live By and Other Tales, by Leo Tolstoy
  • The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle
  • The Forged Coupon and Other Stories, by Leo Tolstoy

Another reason for failure is inconsistency in the layout of the original document.  The following books are very bad.  I have no hope of fixing these without significantly altering the source document:

  • The Strange Case of Dr. Jekyll and Mr. Hyde, by Robert Louis Stevenson
  • The Return of Sherlock Holmes, by Arthur Conan Doyle
  • Ulysses, by James Joyce

The following aren’t too bad but still suffer from consistency of layout issues.

  • The Federalist Papers.
  • Aesop’s Fables

The other 27 failures are each missing 1 or 2 unnumbered section breaks, e.g. Preface, Transcriber’s Notes, Contents, End of the Project Gutenberg EBook…, etc.  These are due to the spacing around the heading not looking like a break because the chapter breaks had significantly more spacing around them.

Complaints about Project Gutenberg Etext Formatting

I understand that Project Gutenberg’s content is produced entirely by volunteers with only the barest of minimal direction to define just how the documents should be laid out.  However, human nature seems to generate a fairly consistent layout scheme.

  1. Guidelines exist for the formatting of the prefatory material, but these are often ignored.  The Strange Case of Dr. Jekyll and Mr. Hyde, by Robert Louis Stevenson, was so bad that I could not get the title from the prefatory material.
  2. Paragraphs in the same chapter should be separated by one blank line.  
  3. Breaks within a chapter should be separated by two blank lines.
  4. Only verse text should have leading spaces on the line.  These spaces should format the poem as well as possible in text.
  5. Chapter headings should be preceded by more than two blank lines, preferably three or more.  Chapter headings without subheadings should be followed by more than one blank line.  This visibly (and programmatically) separates the heading text from the chapter body.
  6. Chapter subheadings should be followed by more than one blank line.  When this is not the case, identifying subheadings becomes difficult.
  7. ALL CAPS and _underscores_ are used to emphasize text.  In GutenMark, these are converted to italics.  I have not yet handled these and need to. 
  8. GutenMark has a heuristic system for identifying diacritical marks and converting them to their HTML equivalents.  I have not yet handled these.
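The blank-line convention in item 5 is easy to check mechanically.  The sketch below is not my actual heading detector, just a minimal illustration of that one rule: it returns every non-blank line preceded by at least a given number of blank lines.  The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; covers the blank-line rule and nothing else. */
final class HeadingSpacingSketch {
    /**
     * Returns every non-blank line that is preceded by at least
     * minBlankLines consecutive blank lines.
     */
    static List<String> candidates(String text, int minBlankLines) {
        List<String> found = new ArrayList<>();
        int blankRun = 0;
        for (String line : text.split("\\r?\\n", -1)) {
            if (line.trim().isEmpty()) {
                blankRun++;
            } else {
                if (blankRun >= minBlankLines) {
                    found.add(line.trim());
                }
                blankRun = 0;
            }
        }
        return found;
    }
}
```

Calling candidates( text, 3 ) on a well-formatted book should return the chapter headings and little else.  On a book where a Preface is set off by the same spacing as a paragraph break, it will miss that heading.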

HTML Generation

I have generated some HTML pages by splitting the document at each chapter heading.  Currently, I am comparing my output to GutenMark’s to see where both programs fail.