Data Analysis

From TeleCafeWiki
Revision as of 12:08, 5 May 2016 by Dave (talk | contribs) (→‎PDF Conversion: Updated pdftotext.org to pdftotext.github.io.)

Business Intelligence

An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

BI Tools

A section from the Creative Commons book Getting the Most Out of Information Systems: A Manager's Guide.
These MySQL reporting tools fall into two broad camps – business intelligence suites where reporting is a major component, and tools that are specifically aimed at reporting. Many of them are also free.
Free Open Source Business Intelligence Solutions
Pentaho Community Edition, OpenText Actuate Information Hub, Free Edition, ReportServer, JasperReports Business Intelligence, Jedox Base, SpagoBI, ART, Pentaho Reporting, JMagallanes, OpenReports, Seal Report, Openi, NextReports, RapidMiner, Mondrian, KNIME.
Free Cloud Business Intelligence Solutions
Watson Analytics, SAP Lumira Cloud, Power BI, MicroStrategy Analytics Express and Birst Express for NetSuite.
Free Proprietary Business Intelligence Solutions
EspressReport Lite, SAP Lumira, QlikView Personal Edition, InetSoft, Qlik Sense Desktop, icCube, Tableau Public.
Open Source Commercial Business Intelligence Solutions
Pentaho, Jaspersoft, Palo, Actuate Corporation, TACTIC.

Data Analysis Tools

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
Take control of your R code. RStudio is the premier integrated development environment for R. It is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or over the web with RStudio Server. Download RStudio (for Windows, Mac, or Linux).
OpenRefine is a standalone open source desktop application for data cleanup and transformation to other formats, the activity known as data wrangling. It is similar to spreadsheet applications (and can work with spreadsheet file formats); however, it behaves more like a database.
Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, ...
HTSQL is designed for data analysts and other accidental programmers who have complex business inquiries to solve and need a productive tool to write and share database queries. HTSQL is free and open source software.
Jigsaw is a visual analytics system to help analysts and researchers better explore, analyze, and make sense of document collections.

Learn Data Science

Also See: Computer Productivity Hacks#Educate_Yourself

Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.

We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available. Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well.

And there’s yet another trend that will alleviate any talent gap: the democratization of data science. Autodidacts – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.
 
Start analyzing real data today, for free. Join 50,000 other learners from around the world.
The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to make data useful.
The Internet is Your Oyster: With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education?
Numerous "flash card"-style data science lessons.

Regular Expressions

In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations.
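
As a rough illustration of the "find and replace"-like operations described above, here is a minimal sketch using Python's standard `re` module. The sample sentence, the dates in it, and the variable names are made up for this example; the pattern assumes dates are written as YYYY-MM-DD.

```python
import re

# Hypothetical sample text; the dates are invented for illustration.
text = "Invoices dated 2016-05-05 and 2016-04-30 are overdue."

# Pattern: four digits, two digits, two digits, separated by hyphens.
# Parentheses create capture groups that the replacement can reuse.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# "Find": collect every substring that matches the pattern.
found = [m.group(0) for m in date_pattern.finditer(text)]

# "Replace": reorder the captured groups into DD/MM/YYYY.
replaced = date_pattern.sub(r"\3/\2/\1", text)

print(found)
print(replaced)
```

The same pattern can usually be pasted into one of the online testers listed under RegEx Tools to see which parts of a string it matches before using it in code.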

RegEx Tutorials

Includes:
- Interactive Tutorial › Learn to use Regular Expressions
- Practical Examples › Practice your Regular Expressions
- RegEx Cheatsheet › Regular Expressions in PHP & More
Any non-trivial regex looks daunting to anybody not familiar with them. But with just a bit of experience, you will soon be able to craft your own regular expressions like you have never done anything else.
Regex is the gift that keeps giving. Once you learn it, you discover it comes in handy in many places where you hadn't planned to use it.
This is an in-progress book that quickly teaches you regular expressions.

RegEx Tools

Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, JavaScript and Python. The website also features a community where you can share useful expressions.
Regular expression tester with syntax highlighting, contextual help, video tutorial, reference, and searchable community patterns.
Pythex is a real-time regular expression editor for Python, a quick way to test your regular expressions.
JavaScript regex tester. Highlights matches on the fly.

Small Data

Small Data is everything Big Data is not.

You Might Not Be Big Data If

- You were generated through human data entry. (Big Data came about in order to handle the exponential growth of machine-generated data, because we humans aren’t fast enough to outpace a good old relational database).
- You are an operational database. For instance, CRM is never Big Data, and ERP is never Big Data.
- You fit just fine in a MySQL database. Even if you have to put a lot of RAM in it, it’s still not Big Data.[1]

For the majority of small and medium businesses, Big Data is the technology of the future, not the reality they experience today. To be blunt, most SMBs don’t even have a handle on the Small Data they’re already creating and collecting themselves. (And if many enterprise organizations are being honest, neither do they.) According to Forrester Research, most companies are analyzing a mere 12% of their existing data. That leaves a whopping 88% of data that businesses are flat out ignoring. Can you imagine the potential of actually leveraging that existing data to derive data-driven business insights? Instead of chasing the Big Data dream, SMBs should consider picking up the dollars that are effectively lying on the floor, and invest first in leveraging their Small Data.
Just as we now find it ludicrous to talk of "big software" – as if size in itself were a measure of value – we should, and will one day, find it equally odd to talk of "big data". Size in itself doesn't matter – what matters is having the data, of whatever size, that helps us solve a problem or address the question we have.
What is small data, you ask? Small data is a dataset that contains very specific attributes. Small data is used to determine current states and conditions or may be generated by analyzing larger data sets.

SQL

Cheat Sheets

Simple-Talk's free wallchart of the most important SSMS keyboard shortcuts aims to help find all those curiously forgettable key combinations within SQL Server Management Studio that unlock the hidden magic that is available for editing and executing queries.

SQL & PowerShell

Scripting is very powerful. And for me, one of the best scripting languages is PowerShell (PoSH). Yes, PoSH takes a bit of getting used to, but once you pass the initial learning curve, you end up with a powerful tool in your hands.
Multiple examples; approaches.
Download the latest version of this PowerShell™ wallchart and read the accompanying in-depth article from Simple-Talk.

Text Extraction

PDF Conversion

Capture2Text enables users to do the following:
  1. Optical Character Recognition (OCR)
  2. Speech Recognition
Detexter is an app designed to extract text from PDF files.
I used this service to successfully convert the .US Locality Domain Name Registration Terms and Conditions form from PDF to Word format.
Lists several options.
Convert PDF to HTML without losing text or format.
Note: A domain squatter has apparently taken over the original pdftotext.org domain, but the service is still available at http://pdftotext.github.io/
pdftotext.org is the best online service for easily extracting text from your PDF files. Conversion from PDF to TXT is really fast thanks to our in-browser conversion architecture. Your PDF files are never uploaded to the Internet, so even private PDF files are safe to convert with this service. The conversion is done locally in your browser – you can even convert when you are offline! There is no need for any registration or sign-up, and the service will always be free to use.
Extracts plain text from documents in all popular formats.
Tabula is a tool for liberating data tables locked inside PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.

Data Scrape

Tools and tips compiled by journalists from PBS and Omaha World-Herald.
Watch (video) how easy it is to import data from a Web page into R.
Scrapinghub's list of open source scraping projects.
Monitor website links with ease. Sitestalker supervises websites and notifies you when your desired content hits the web. Stop wasting your time constantly refreshing websites.
Sitestalker is great for:
- Finding jobs
- Searching for an apartment
- Getting the best bargains
- Clipping
Tools for gathering data from public sources.
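
For readers who want to see what pulling a table out of a Web page involves, the following is a minimal sketch using only Python's standard `html.parser` module. It is not how any of the tools listed above work internally. The `TableScraper` class and the inline `page` string are hypothetical; a real scraper would first download the page and would need to cope with far messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content with placeholder values.
page = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>apples</td><td>3</td></tr>
  <tr><td>pears</td><td>5</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

The resulting list of rows can then be written out with the standard `csv` module or loaded into an analysis tool.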

Text Search

Makes tools to search text content, including:
  1. FALCON - Text Search Java Project: JSON based text search Java Project
  2. HAWK - PDF Text Search Java Project: Taking initiative for Document Text Search
Xpdf is an open source viewer for Portable Document Format (PDF) files.
Windows installer: Short Programs/Scripts (Look for the xpdf3.exe / poppler.exe links in left sidebar.)

Text Transformation

Do you have a list of text strings whose format you want to change? Copy and paste the list into the box, then provide an example of how you want each text string formatted. The hope is that Transformy will do the rest.
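
When the desired format is already known, the same kind of bulk reformatting can be scripted by hand. The sketch below does not infer the rule from an example the way Transformy is described as doing; it uses an explicit regular expression instead. The input names, the `rule` pattern, and the `reformat` helper are all hypothetical.

```python
import re

# Hypothetical input list, as it might be pasted into such a tool.
names = ["Smith, John", "Doe, Jane", "Lovelace, Ada"]

# Explicit rule standing in for the "example" a user would give:
# "Last, First" becomes "First Last".
rule = re.compile(r"^\s*([^,]+?)\s*,\s*(.+?)\s*$")


def reformat(line: str) -> str:
    """Return "First Last" for "Last, First"; leave other lines unchanged."""
    match = rule.match(line)
    if match is None:
        return line
    last, first = match.groups()
    return f"{first} {last}"


transformed = [reformat(n) for n in names]
print(transformed)
```

Lines that do not fit the pattern are passed through untouched, so a mixed list is not damaged by the rule.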

See Also

References