<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[data visualizations - ]]></title><description><![CDATA[data visualizations - ]]></description><link>http://carltonmatthews.com/</link><generator>Ghost 0.5</generator><lastBuildDate>Wed, 22 Apr 2026 16:32:11 GMT</lastBuildDate><atom:link href="http://carltonmatthews.com/tag/data-visualizations/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Berkshire Hathaway Quantitative Text Analysis]]></title><description><![CDATA[<p>In my continuing effort to #showYourWork, here is my final project for my Data Management and Visualizations course.  Hopefully this doesn't trigger a plagiarism flag, since <a href="http://turnitin.com/">Turn It In</a> crawls the web as part of its process.</p>

<p>TO: Senior Director Business Analysis <br>
FROM: Carlton Matthews <br>
DATE: 24 April 2016 <br>
SUBJECT: Using Python for Textual Analysis  </p>

<hr>

<h3 id="introduction">Introduction</h3>

<p>At our last information exchange meeting, you asked the team to find ways to expand our analytical toolkit.   As a financial services business, much of what we do involves numbers and other hard data points. However, during our discussion, we wondered whether there were ways to identify sound investments based on more than just hard financial data.</p>

<p>I recommended that we look at other sources of information related to the companies we are investigating.   Specifically, I wondered whether we could examine the CEO's letter in each annual report to identify any interesting patterns.   Over the last week I have developed a program in Python to analyze a series of CEO letters from Berkshire Hathaway Incorporated, a publicly traded company headed by Warren Buffett.   Every year Mr. Buffett writes a letter to the shareholders as part of the company's annual report.</p>

<h3 id="processingtherawtext">Processing the Raw Text</h3>

<p>I pulled CEO letters at roughly 10-year intervals, starting with the 1977 letter and ending with the 2015 letter, which was published in 2016.  This gave me a total word count of 53,219.  Using Python to extract the text was a simple exercise; however, turning that data into something meaningful was a challenge.  First, the most commonly used words in the English language should be excluded from the analysis.  These words are known as stop words, and collections of them are available online for this purpose.  Here is part of the list that I used for my analysis.</p>

<blockquote>
  <p>'beside', 'besides', 'between', 'beyond', 
  'bill', 'both', 'bottom', 'but', 'by',
  'call', 'can',    'cannot', 'cant', 'during', 'each'</p>
</blockquote>

<p>After processing the text files across the time periods I was given a list of over 200 words.  Each of these words was used at least 10 times within that year’s letter.  To further reduce the number of words, I began to combine root words with their variant forms; for example, I combined business and businesses.  While this still left us with a list of 227 words, some interesting patterns were beginning to emerge. </p>
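
<p>The filtering and merging steps described above can be sketched in Python.  This is a minimal illustration and not the code from the project: the function names <code>singular_of</code> and <code>count_words</code>, the abbreviated stop word set, and the naive rule for folding plurals into their singular form are all assumptions made for the example.</p>

<pre><code class="language-python">import re
from collections import Counter

# Hypothetical, abbreviated stop word set; a real list is much longer.
STOP_WORDS = {"beside", "besides", "between", "beyond", "both", "but", "by",
              "can", "cannot", "during", "each", "the", "a", "of", "and"}


def singular_of(word, vocabulary):
    """Return a singular form of word if that form also occurs in vocabulary."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # companies -> company
    if word.endswith("es"):
        candidates.append(word[:-2])         # businesses -> business
    if word.endswith("s"):
        candidates.append(word[:-1])         # letters -> letter
    for candidate in candidates:
        if candidate in vocabulary:
            return candidate
    return word


def count_words(text, min_count=10):
    """Count non-stop words, merge simple plurals, keep frequent words only."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    merged = Counter()
    for word, n in counts.items():
        merged[singular_of(word, counts)] += n
    # Keep only words used at least min_count times in this letter.
    return {word: n for word, n in merged.items() if n >= min_count}
</code></pre>

<p>Calling <code>count_words</code> once per letter gives one frequency table per year, which is the shape of data the charts below are built from.  The plural rule is deliberately simple; a stemming library would handle more cases.</p>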

<h3 id="quantitativetextualanalysis">Quantitative Textual Analysis</h3>

<p><img src="http://carltonmatthews.com/content/images/2016/04/BH-word-cloud.png" alt="Berkshire Hathaway Word Cloud">
The above figure shows the top 50 words used within the CEO letters.  I could have removed Berkshire, as you would expect it to be used multiple times, but it serves as a visual descriptor of the data.  From here we can see that the most used words are business, company and earnings.  This makes sense, as Berkshire Hathaway grows by buying portions of companies. From their “Owners Manual”, one of their stated goals is, “… directly owning a diversified group of businesses that generate cash and consistently earn above-average returns on capital.” <a href="http://www.berkshirehathaway.com/owners.html">(Buffett, Berkshire Hathaway Inc. - An Owners Manual, 1999)</a></p>

<p>Another way to represent this data can be seen below.  This chart shows the top 20 words stacked by year.</p>

<p><img src="http://carltonmatthews.com/content/images/2016/04/BH-Top-20-stacked-bar.png" alt="Top 20 Words Stacked Bar Chart">
As with the first visualization, the top words are business, company and earnings.  Later on I will examine these three terms, but I want to note that from 1977 to 1987 there was significant growth in both the company and the length of the CEO letter.  The 1977 letter is a little more than 2,300 words while the 1987 letter is just over 12,000.  During that time period Berkshire Hathaway also increased its stock value from $138 in 1977 to $2,950 in 1987, primarily through increasing its business holdings. <a href="http://allfinancialmatters.com/2008/04/02/a-look-at-berkshire-hathaways-annual-market-returns-from-1968-2007/">(Pritchard, 2008)</a></p>

<h3 id="mostusedwords">Most Used Words</h3>

<p>This leads us to the most important set of words in our dataset.  These three words are used more often than any other words within the CEO letters.  The words, which have already been discussed, are business, company and earnings.  In the chart below we see one representation of how these words have been used over time.  The dark blue area of each bar represents 1977, followed by 1987 in orange, 1997 in green, 2007 in red and 2015 in purple.  Business includes both the singular and plural usage of the word.  For a company that is in the business of holding stocks of other businesses, you expect to see multiple instances.  From 1977 to 1987, as mentioned previously, a lot of businesses were added to the Berkshire Hathaway portfolio.  That shows in the spike in usage from 1977 to 1987. <br>
<img src="http://carltonmatthews.com/content/images/2016/04/BH-Top-3-stacked-bar.png" alt="Top 3 Words Stacked Bar Chart">
In 1977 there was a lot of discussion of the company and the companies that make up its portfolio.  As more companies were added and Berkshire itself grew, usage of the word increased, peaking in the most recent letter.  As expected, as the company grew the earnings also grew, which is seen in the usage of the word earnings from 1977 to 2015.  Below we see the same data represented differently.  This time we are looking at each word and its word count over time.  It paints the same picture as described above.  This style of graph best illustrates the spike in the usage of the word business. <br>
<img src="http://carltonmatthews.com/content/images/2016/04/BH-Top-3-line-graph.png" alt="Top 3 Words Line Graph"></p>

<h3 id="conclusion">Conclusion</h3>

<p>After walking through this process, I think that we can use this type of analysis to supplement financial models to better understand our investments.  Berkshire Hathaway was a good choice for this analysis because it provides a long history and publicly accessible data to glean from.  If you think that this is a worthwhile endeavor, I can begin enhancing the program to ingest and process even larger volumes of data.  Berkshire Hathaway could be used as a test case again, as they have all their CEO letters publicly available.</p>

<h3 id="references">References</h3>

<p>Buffett, W. E. (1978, March 14). 1977 Shareholder Letter From the CEO. Retrieved April 18, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/letters/1977.html">http://www.berkshirehathaway.com/letters/1977.html</a></p>

<p>Buffett, W. E. (1988, February 29). 1987 Shareholder Letter From the CEO. Retrieved April 20, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/letters/1987.html">http://www.berkshirehathaway.com/letters/1987.html</a></p>

<p>Buffett, W. E. (1998, February 27). 1997 Shareholder Letter From the CEO. Retrieved April 20, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/letters/1997.html">http://www.berkshirehathaway.com/letters/1997.html</a></p>

<p>Buffett, W. E. (2008, February 27). 2007 Shareholder Letter From the CEO. Retrieved April 20, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/letters/2007ltr.pdf">http://www.berkshirehathaway.com/letters/2007ltr.pdf</a></p>

<p>Buffett, W. E. (2016, February 27). 2015 Shareholder Letter From the CEO. Retrieved April 20, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/letters/2015ltr.pdf">http://www.berkshirehathaway.com/letters/2015ltr.pdf</a></p>

<p>Buffett, W. E. (1999, January 30). Berkshire Hathaway Inc. - An Owners Manual. Retrieved April 20, 2016, from Berkshire Hathaway Inc.: <a href="http://www.berkshirehathaway.com/owners.html">http://www.berkshirehathaway.com/owners.html</a></p>

<p>Pritchard, J. (2008, April 02). A Look at Berkshire Hathaway’s Annual Market Returns From 1968 – 2007. Retrieved April 20, 2016, from All Financial Matters: <a href="http://allfinancialmatters.com/2008/04/02/a-look-at-berkshire-hathaways-annual-market-returns-from-1968-2007/">http://allfinancialmatters.com/2008/04/02/a-look-at-berkshire-hathaways-annual-market-returns-from-1968-2007/</a></p>

<h3 id="appendixpythoncode">Appendix: Python Code</h3>

<p><a href="https://github.com/cbmatthews/python-text-analysis">See project on GitHub</a></p>]]></description><link>http://carltonmatthews.com/berkshire-hathaway-quantitive-text-analysis/</link><guid isPermaLink="false">1678ccc4-0e80-45e0-b491-2defb3bcf618</guid><category><![CDATA[show_your_work]]></category><category><![CDATA[data visualizations]]></category><category><![CDATA[umuc]]></category><category><![CDATA[text analysis]]></category><category><![CDATA[bershire hathaway]]></category><category><![CDATA[python]]></category><dc:creator><![CDATA[Carlton Matthews]]></dc:creator><pubDate>Mon, 25 Apr 2016 12:45:56 GMT</pubDate></item><item><title><![CDATA[My Quantified Self]]></title><description><![CDATA[<p>I am currently in the 9th week of my <a href="http://umuc.edu/academic-programs/course-information.cfm?course=data620">Data Management and Visualization</a> class at <a href="http://umuc.edu/">University of Maryland University College</a>.  For this assignment I needed to create and record a Time Series Presentation using a dataset of my choosing.  I uploaded my recording to YouTube.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/DJE37iC3J1g" frameborder="0" allowfullscreen></iframe>

<p>Here is the transcript: <br>
<img src="http://carltonmatthews.com/content/images/2016/04/Slide01.png" alt="The Quantified Self"></p>

<blockquote>
  <p>Good Evening and welcome to my time series presentation.  As you can see by the title I will be adding to what has been titled The Quantified Self movement.  The tagline over on the website quantifiedself.com is self knowledge through numbers.  </p>
  
  <p>My name is Carlton Matthews and I will be your guide through some of my personal data.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide02.png" alt="Why The Quantified Self"></p>

<blockquote>
  <p>So how did I become interested in this type of data tracking?  For me it started after listening to a TED talk given by <a href="https://youtu.be/TDCYJ3_gx2w">Dr. Talithia Williams entitled, Show Me the Data: Becoming an Expert in Yourself in Feb 2014</a>.  She gave this talk at Claremont College in 2014 and she explained how knowing your body's data can help you make better decisions.  It was a fascinating talk that pushed me to further my own data collecting habit.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide03.png" alt="Fitbit and the Search for Data"></p>

<blockquote>
  <p>In the summer of 2014 I purchased my first Fitbit Flex and started tracking daily activity like the number of steps I was taking as well as the number of hours I slept and the number of calories I would eat.  Even though I have been tracking this data for almost 2 years, I have never exported it and looked for any trends. </p>
  
  <p>I decided to extract a small slice from my data and look at the first 2 months of 2016 and see what I can predict going through the rest of the year.  From 2 January 2016 through 2 March we will look primarily at how the number of steps I take fluctuates from day to day.  We will also look at hours slept and calories eaten and see if either of them impacts the steps.</p>
  
  <p>We will also be looking at the high temperature for each day to see if that has any influence on my daily activity.</p>
  
  <p>I expect to see that warmer temperatures lead to more steps taken.  I also think that on days when I am more active I eat fewer calories.  So, as Dr. Williams says in her video, show me the data.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide04.png" alt="Steps and Sleep Line Graph"></p>

<blockquote>
  <p>In this slide we take a look at both number of steps and hours of sleep laid out together.  The blue line shows the number of steps taken while the yellowish green line shows hours of sleep.  There are 2 reference lines that show the 2 goals I have with this data.  My step goal for each day is 10,000 steps and my sleep goal is 8 hours per night.  As you can see, I consistently do not meet either goal.</p>
  
  <p>The interesting thing that I see here is that periods of higher activity seem to indicate a period of longer sleep.  I guess this is to be expected, but I did not think the lines would mirror each other as much as they do.  </p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide05.png" alt="Steps and Calories"></p>

<blockquote>
  <p>Here we see steps and calories shown similarly to the previous slide.  This time however I do not see the number of calories impacting the number of steps as much.  My assumption was that on days when I was more focused on what I ate I would likewise be conscious of the amount of activity I would participate in.  Again we see that I have not been hitting my goals here denoted by the reference lines for my step and calorie goals.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide06.png" alt="Steps by Temperature"></p>

<blockquote>
  <p>When I decided to see if the daily temperature impacted the number of steps, I thought that it would be a simple task to pull in the necessary data.  That was not the case.  I eventually found the Quality Controlled Local Climatological Data on the National Oceanic and Atmospheric Administration page.  This allowed me to extract the data from the local weather station near the BWI Airport.  </p>
  
  <p>With all of that introduction we see the step counts plotted against the daily temperatures for my area.  As I expected we see based on the trend line that there is an increase in the step count as the warmer weather hits.  I am glad to see this confirmed.</p>
</blockquote>
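
<p>For readers who want to reproduce a trend line like the one on this slide, here is a minimal least-squares sketch in Python.  The function name <code>trend_slope</code> and the sample numbers are assumptions for illustration only; they are not taken from my Fitbit export or the NOAA data.</p>

<pre><code class="language-python">def trend_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up values for illustration only, not real measurements.
high_temps_f = [30, 40, 50, 60]
steps = [4000, 5000, 6000, 7000]

# A positive slope means more steps on warmer days.
slope = trend_slope(high_temps_f, steps)
</code></pre>

<p>With these invented numbers the slope is 100 additional steps per degree Fahrenheit.  A positive slope on the real data would support the expectation stated in the transcript.</p>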

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide07.png" alt="Steps and Sleep By Day"></p>

<blockquote>
  <p>This last chart is the most interesting to me.  I decided to look at which days I was the most active and got the most sleep.  Here is the breakdown.  I am the most active on the weekends.  This makes sense since I work in an office and am at my desk for most of the day.  I also get the most sleep on the weekends.  Again this makes sense due to family, work and school commitments.  I do the least amount of moving and get the least amount of sleep on Thursdays.  I did not expect that.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide08.png" alt="Predictions"></p>

<blockquote>
  <p>Now the question is what we can predict from the data that we have just seen.  Looking at the data I expect to see an increase in the number of steps taken.  I also expect this to result in a greater amount of sleep.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide09.png" alt="Overall Observations"></p>

<blockquote>
  <p>Here are some additional observations that I made from this exercise.  Having good data is key.  This means that the data set should have consistency.  I had to exclude a larger range of data because there were gaps in my logs.  Even in this dataset there were days of missed food logging, step counts and sleep records.  </p>
  
  <p>Sadly I also observed that I am eating too much and not moving as much as I should.  Going forward I need to correct that.</p>
</blockquote>

<p><img src="http://carltonmatthews.com/content/images/2016/04/Slide10.png" alt="Reference Slide">
Thank you for joining me on this look at my quantified self.  Start tracking your own data.  Who knows what you might learn.</p>]]></description><link>http://carltonmatthews.com/my-quantified-self/</link><guid isPermaLink="false">37a03888-667b-4fc7-aa5f-c80a56a97ef4</guid><category><![CDATA[show_your_work]]></category><category><![CDATA[data visualizations]]></category><category><![CDATA[umuc]]></category><dc:creator><![CDATA[Carlton Matthews]]></dc:creator><pubDate>Tue, 05 Apr 2016 02:59:26 GMT</pubDate></item><item><title><![CDATA[Flat File vs Relational Database System Assignment]]></title><description><![CDATA[<p>My latest assignment for my Data Visualizations class gave me this case:  </p>

<blockquote>
  <p><em>"You have some concerns about moving your entire airline operations out of SQL to this flat file format.  Write your boss a memo, outlining any concerns or hesitations you have about moving to this format for management of your data.  Include the pros and cons of the relational database format and the flat file format.  Be sure to think critically, and include any problematic use case scenarios."</em></p>
</blockquote>

<p>Sir,</p>

<p>I know from the last few staff meetings that we are looking to change our database configuration from a traditional relational database management system (RDBMS) to a flat file storage based system.   </p>

<h2 id="advantagesofaflatfilesystem">Advantages of a Flat File System</h2>

<p>Two of the main advantages of a flat file system are the simplicity of record storage and the ease of use of the data.</p>

<h3 id="recordstorage">Record Storage</h3>

<p>One of the main advantages of a flat file based system is having all of the available data in the same location.  This means that all the data available is within any given record.  In the case of our data a flat file system would look like this.</p>

<p><code>Flight_ID    Airport_Code_Origin Airport_Code_Destination    Departure_DateTime  Arrival_DateTime    Airport_Code    Airport_Location    Year_Opened Num_of_Terminals    Manufacturer    Model_Num   Original_Purchase_Date  Last_Service    Number_of_Seats Carrier_Name <br>
1    MIA JFK 2/20/16 23:26   2/21/16 4:04    JFK New York, New York  1943    12  Boeing  737-900 12/2/09 1/26/16 500 Virgin Atlantic <br>
2    MIA SFO 2/21/16 8:55    2/21/16 9:11    SFO San Francisco, CA   1927    8   Boeing  737-900 12/2/09 1/26/16 500 Southwest Airlines <br>
3    LAS PHL 2/21/16 12:51   2/21/16 15:18   PHL Philadelphia, PA    1927    6   Embraer RJ-45   11/25/08    1/27/16 550 Delta <br>
4    SFO PIT 2/21/16 21:45   2/21/16 23:18   PIT Pittsburgh, PA  1946    4   Boeing  747-400 10/25/01    1/3/16  250 Southwest Airlines <br>
5    IAH PIT 2/22/16 19:34   2/22/16 22:27   PIT Pittsburgh, PA  1946    4   Airbus  A330    12/2/01 12/16/15    400 Virgin Atlantic</code></p>

<p>Each flight record would include all the necessary fields to describe the flight.  This setup makes the data very readable and understandable to anyone who has access to the file.</p>

<h3 id="easeofuse">Ease of Use</h3>

<p>The second advantage we will examine is the ease of use of the database.  A flat file system can be viewed from any number of applications, making it very accessible.  Users will also have very little difficulty understanding the data because each record contains all available data about a given flight.  Simple queries and sorting should be no problem for most flat file based systems.</p>

<h2 id="disadvantagesofflatfilesystems">Disadvantages of Flat File Systems</h2>

<p>While a flat file based system can work for some datasets, I do not think it is appropriate for us.  There are several disadvantages to using a flat file system for our data.  These include data duplication, difficulty of updating, and data security.</p>

<h3 id="dataduplication">Data Duplication</h3>

<p>In the example data extract shown above, records 1 and 2 have flights originating from the MIA airport.  Each record includes the data about that airport.  This data duplication causes the size of the flat file system to increase with unnecessary data.  Not only is the originating airport's data duplicated, the aircraft data is also duplicated for every flight on that day's route.  This duplication does not exist in our RDBMS because the airport data is only stored once.  The same is true for aircraft and carrier data.  This leads us into the next disadvantage, difficulty updating records.</p>

<h3 id="updatedifficulty">Update Difficulty</h3>

<p>As was mentioned in the last section, there is a lot of duplicate data in the flat file.  Imagine when an update needs to be made to a piece of data.  In the case of an aircraft, we keep track of the last service date.  When an aircraft is serviced we will need to update every instance of that aircraft within the flat file.  That would mean traversing the entire data file and updating the last service date.  Even with an automated update script it would be a time-consuming and error-prone process.  This is just one use case where updates could be difficult.  The same process would need to be applied for airport and carrier data.  There would need to be multiple users accessing the data to process the amount of updates we would generate.  Who would have access and how we would control it is the last disadvantage I would like to expand upon.</p>
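
<p>To make the contrast concrete, here is a small sketch using Python's built-in <code>sqlite3</code> module.  It is an illustration only and not our production schema: the table layout, the <code>aircraft_id</code> key, the helper name <code>service_aircraft</code>, the ISO date format, and the new service date are all assumptions.  The two flights loosely follow records 1 and 2 of the extract above.</p>

<pre><code class="language-python">import sqlite3


def service_aircraft(conn, aircraft_id, service_date):
    """Record a service date with a single UPDATE on the aircraft table."""
    cursor = conn.execute(
        "UPDATE aircraft SET last_service = ? WHERE aircraft_id = ?",
        (service_date, aircraft_id),
    )
    return cursor.rowcount


# In-memory database with aircraft data stored once and referenced by flights.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aircraft (
        aircraft_id  INTEGER PRIMARY KEY,
        manufacturer TEXT,
        model_num    TEXT,
        last_service TEXT
    );
    CREATE TABLE flight (
        flight_id   INTEGER PRIMARY KEY,
        origin      TEXT,
        destination TEXT,
        aircraft_id INTEGER REFERENCES aircraft(aircraft_id)
    );
    INSERT INTO aircraft VALUES (1, 'Boeing', '737-900', '2016-01-26');
    INSERT INTO flight VALUES (1, 'MIA', 'JFK', 1);
    INSERT INTO flight VALUES (2, 'MIA', 'SFO', 1);
""")

# One row changes, yet every flight that uses the aircraft sees the new date.
rows_changed = service_aircraft(conn, 1, "2016-02-22")
flights_seeing_new_date = conn.execute(
    "SELECT COUNT(*) FROM flight JOIN aircraft USING (aircraft_id) "
    "WHERE last_service = '2016-02-22'"
).fetchone()[0]
</code></pre>

<p>In this sketch one row is updated and both flights report the new service date through the join.  In the flat file, the same change would have to be written into every flight record that mentions the aircraft.</p>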

<h3 id="datasecurity">Data Security</h3>

<p>Data housed within flat file systems is hard to restrict. Only file level restrictions can be applied in this type of configuration.  Individual records cannot be protected.  A user who has access to the file has access to all of the records.  In our system we need to restrict the level of access on a per flight basis.  This would not be possible in a flat file system.</p>

<h3 id="recommendation">Recommendation</h3>

<p>My recommendation is that we keep our system as it is currently configured.  Since we are using an RDBMS, our data can be accessed in a variety of different ways.  From screens within the airport terminals showing read-only listings of flights to gate agents updating flight records, our data is much more customizable to our needs in an RDBMS.</p>]]></description><link>http://carltonmatthews.com/flat-file-vs-rdbms/</link><guid isPermaLink="false">393bcb6a-9845-46fa-ab06-a25b58d1ef43</guid><category><![CDATA[show_your_work]]></category><category><![CDATA[data visualizations]]></category><category><![CDATA[RDBMS]]></category><category><![CDATA[Flat File Data System]]></category><dc:creator><![CDATA[Carlton Matthews]]></dc:creator><pubDate>Mon, 22 Feb 2016 04:20:33 GMT</pubDate></item></channel></rss>