Python, one of the most popular scripting languages, is also one of the most preferred tools for data analysis and visualization. In addition to the broader Python developer community, there is also a significant group that uses Python to analyze data, draw actionable insights, and make decisions.
With its extensive collection of helper libraries and platforms, Python is a great tool for quick, iterative data exploration. Python’s set of libraries includes everything from visualization to statistical analysis, making it convenient for its users to jump into the data and begin identifying patterns.
Together with the ability to iterate quickly in data and statistical analysis, there are great open source tools on managing data pipelines and workflows. A growing community of analysts are finding new ways of using Python to crunch numbers and understand their data.
We had a chance to catch up with the Chief Analyst at Mode, Benn Stancil, and ask him about the importance of Python, how to use it in day-to-day analysis, and to tell us about some key features of Mode’s new product, Mode Python Notebooks. Mode Python Notebooks is a hosted solution that allows analysts to use Python for exploratory analysis.
I’m interested in Python for the same reasons I like SQL: It gives me the power and flexibility to answer any question. The community is great and adoption is on the rise.
There are many easy to use Python libraries to make data exploration convenient and immediate. This allows for iterative data analysis. With Python, you can really chase your curiosities down the rabbit hole.
Lastly, Python’s utility and flexibility allows it to be used for a variety of tasks within the data science stack. For example, Luigi and Airflow both allow for managing data pipelines and workflows in Python. By completing exploratory analysis in Python, there can be times where the work carries over into production.
What are the most popular use cases for SQL as opposed to Python?
SQL is designed to query and extract data from a database. It’s a necessary first step to get the data into a usable format. For instance, SQL allows you to easily join several data sets to create a table that you can explore further.
SQL isn’t really designed for manipulating or transforming data in certain ways. Higher level data manipulation that is common with data science, such as statistical analysis, regressions, trend lines, and working with time series data, isn’t easy in SQL.
Despite these limitations, because SQL is necessary for extracting data, it’s still commonly used for complex operations. The query below, which calculates quantiles for different series in a data, is something I’ve used versions of many times.
WITH details AS ( SELECT series, value, ROW_NUMBER() OVER (PARTITION BY series ORDER BY value) AS row_number, SUM(1) OVER (PARTITION BY series) AS total FROM dataset ), quartiles AS ( SELECT series, value, AVG(CASE WHEN row_number >= (FLOOR(total/2.0)/2.0) AND row_number <= (FLOOR(total/2.0)/2.0) + 1 THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS q1, AVG(CASE WHEN row_number >= (total/2.0) AND row_number <= (total/2.0) + 1 THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS median, AVG(CASE WHEN row_number >= (CEIL(total/2.0) + (FLOOR(total/2.0)/2.0)) AND row_number <= (CEIL(total/2.0) + (FLOOR(total/2.0)/2.0) + 1) THEN value/1.0 ELSE NULL END ) OVER (PARTITION BY series) AS q3 FROM details ) SELECT series, MIN(CASE WHEN value >= q1 - ((q3-q1) * 1.5) THEN value ELSE NULL END) AS minimum, AVG(q1) AS q1, AVG(median) AS median, AVG(q3) AS q3, MAX(CASE WHEN value <= q3 + ((q3-q1) * 1.5) THEN value ELSE NULL END) AS maximum FROM quartiles GROUP BY 1
When would Python come in?
Python has a ton of libraries (e.g. Pandas, StatsModel, and SciPy) that are designed for statistical and mathematical analysis. The libraries also do a great job of abstracting away the details so that you don’t need to calculate all the underlying math by hand. Moreover, you can get your results immediately, so you can use Python iteratively to explore your data.
Rather than saying “I want to do a regression analysis” and sitting down for half an hour figuring out where to begin in SQL, the Python libraries make it so that you can just run the analysis, see the results, and continue exploring the path your curiosity takes you down. With Python, there is not much lag between inspiration and action. With SQL, on the other hand, I often think twice before going down a path that may or may not be fruitful.
For example, I’d only write the query above if I really knew I wanted to present the quantiles of that dataset. Because the entire thing can be accomplished with the one line of Python below, I’d do it much earlier in my analytical process–and may discover something I wasn’t looking for as a result.
Another way to think about the difference between Python and SQL is that Python allows you to start with one large table, from which you branch off different analyses in different directions. One avenue of inspiration can bring you to another avenue and to another avenue. The speed and flexibility of analysis makes it easy to go down many exploratory paths.
Those kinds of analyses sound very different. Why combine SQL and Python in one place?
Because SQL and Python each have individual strengths and weaknesses. Tying the languages together gives analysts the best of both worlds.
First, SQL is needed to build the data set into a final table that has all of the necessary attributes. Then, from this large data set, you can use Python to spin off deeper analysis.
What would it take for a SQL analyst to learn Python?
Like many skills, the best way to learn how to use Python for analysis is by diving in to work on a problem you’re interested in, care about, and with which you’re somewhat familiar.
When you work on something that you’re interested in, you tend to go deeper. You uncover something in the core problem that piques your interest, and you want to learn more by analyzing the data set in a different way. You begin asking more and more questions. This curiosity can push you further than you would go otherwise, and a is a source of a lot of real learning.
You also should work on data that you’re somewhat familiar with, so you know when you do something wrong. You’ll have better instincts about what’s going on and what to expect. Compare this to when you’re working on data about which you know nothing—like flower petal sizes (an unusually popular data set found in many Python examples). If your analysis concludes that “all of these flowers have two centimeter-long petals” and you have no idea whether that is reasonable, you may just assume it’s right and move on.
Mode is also releasing a new Python tutorial that aims to help SQL users learn how and where to integrate Python into their workflow. In addition, the Python tutorials provide problems that are familiar to those in business settings, instead of academic problems.
How would you expect learning Python to help benefit analysts in their job and their careers?
Learning Python definitely can augment an analyst’s skill set.
An analyst needs to communicate the business value through data. One part of the job is to find the insights from the data, but the more effective job is also to include the right context and narrative around the insights that can compel your teammates towards action. And since using data and analytics to make decisions is becoming more important in the workplace, the role of the analyst to deliver comprehensive analysis is more important than ever.
Is it easy to get started with Python? What kind of setup and tooling do you need?
Until recently, getting started with Python for analysis requires installing a few things—Python, several main statistics and data analysis libraries, and Notebooksoftware that’ll run the analysis locally on your computer. Then, you’d have to run the Notebook, starting a local server to execute your Python commands.
The results generated from your commands on Notebooks will exist on your desktop. In order to share it, you basically download a Python Notebook HTML file (which has a mix of code and results) and send that around. In order to easily parse the results from the HTML, your colleagues would have to open that file in a browser. And at that point, unless they have Python set up too, the code isn’t re-executable.
When you run Python locally, you’re limited to the power of your computer, you have to leave your computer open to run it, and running scripts can slow other things down. By running it remotely, you can run it from any machine and you can run something, close your computer and walk away, and still have your results waiting when you get back.
With Mode Python Notebooks, all of that setup and hosting is taken care of for you. If you’re using Mode to query an existing database, there’s a tab for a Python Notebook. You can open it up and the query results table will be automatically populated. And after generating plots, time series analysis, summary statistics, etc., in the Notebook, it’s easy to curate the results into an instantly shareable report anyone can re-run. Moreover, you don’t have to run it on your local machine, which saves your computer’s processing power and memory. Finally, setting it all up is as easy as setting up a database connection.
Thanks for your time, Benn!
If you’re interested in learning how to run data analysis in Python, check out Mode’s new Python tutorial. And, if you’re a Segment customer looking to run Python scripts on your event and cloud app data check out Mode Python Notebooks.