Python Data Analysis Techniques for Beginners

Let’s begin with pandas. This library makes working with tabular data in Python much easier. It lets us read and manipulate CSV files (a very common format for datasets) and run some basic analysis on the data.

To get started, let’s download a dataset from Kaggle. We’ll be using the “Titanic” dataset for this tutorial. You can find it here: https://www.kaggle.com/c/titanic/data

Once you have the data downloaded and saved in your working directory, let’s load it into pandas with the following code:

# Import the pandas library and assign it to the variable "pd"
import pandas as pd

# Use the read_csv function from pandas to load the "Titanic" dataset into a dataframe and assign it to the variable "df"
# Note: The path to the dataset should be specified within the parentheses
df = pd.read_csv('path/to/titanic/dataset')

# Use the head() function to display the first 5 rows of the dataframe
print(df.head())

This will print out the first five rows of your dataset, which is a great way to make sure everything loaded correctly.
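
Beyond head(), a few other quick checks can help confirm the load went as expected. The sketch below is only an illustration: it reads a handful of made-up rows from an in-memory string (the `sample_csv` name and its values are assumptions, not the real Kaggle file) so that it runs on its own. With your real file you would call the same methods on your own `df`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle file: a few made-up rows with Titanic-style columns
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.28\n"
    "3,1,3,female,,7.93\n"
    "4,0,2,male,35,13.00\n"
)
df = pd.read_csv(sample_csv)

# Number of rows and columns, as a (rows, columns) tuple
print(df.shape)

# The data type pandas inferred for each column
print(df.dtypes)

# How many values are missing in each column
missing = df.isna().sum()
print(missing)
```

If the shape or the missing-value counts look surprising, that is usually a sign the wrong file or the wrong separator was used.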

Now that we have our data in pandas, let’s look at some basic analysis techniques. One of the most common things you might want to do with your data is filter it based on certain criteria. For example, if you only care about passengers who survived the Titanic disaster, you can use the following code:

# Import pandas library
import pandas as pd

# Load data into a pandas dataframe
df = pd.read_csv('titanic_data.csv')

# Create a new dataframe called "survivors" by filtering the original dataframe "df" based on the "Survived" column where the value is equal to 1 (indicating the passenger survived)
survivors = df[df['Survived'] == 1]

# Print the first 5 rows of the "survivors" dataframe
print(survivors.head())

# Output (abridged; the exact layout depends on your terminal width).
# The unlabeled first column is the original row index, which is why it skips numbers.
#    PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
# 1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0  1      0      PC 17599  71.2833  C85  C
# 2  3            1         3       Heikkinen, Miss. Laina  female  26.0  0      0      STON/O2. 3101282  7.9250  NaN  S
# 3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1      0      113803  53.1000  C123  S
# 8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0      2      347742  11.1333  NaN  S
# 9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)  female  14.0  1      0      237736  30.0708  NaN  C

# The survivors dataframe now only contains data for passengers who survived the Titanic disaster. This can be useful for further analysis or visualization.

The expression df['Survived'] == 1 produces a True or False value for every row, and putting it inside df[...] keeps only the rows marked True. The result is a new DataFrame called ‘survivors’ that contains only the passengers who survived. Pretty cool, right?
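
Filters can also be combined. Here is a minimal sketch, assuming a tiny made-up DataFrame with the same column names as the Titanic data (the rows are invented for illustration), that keeps only first-class women who survived:

```python
import pandas as pd

# Hypothetical sample rows (made up for illustration, not the real Kaggle data)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "male"],
})

# Each condition goes in its own parentheses; & means "and", | means "or"
first_class_women = df[
    (df["Survived"] == 1) & (df["Sex"] == "female") & (df["Pclass"] == 1)
]
print(first_class_women)

# The same filter written with query(), which some people find easier to read
same_rows = df.query("Survived == 1 and Sex == 'female' and Pclass == 1")
print(same_rows)
```

The parentheses matter: without them Python applies & before ==, and the filter raises an error or returns the wrong rows.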

Another common analysis technique is grouping your data by certain columns. For example, let’s say you want to see how many passengers of each gender and class survived:

# Import pandas library
import pandas as pd

# Create a dataframe from the same csv file used in the previous example
df = pd.read_csv('titanic_data.csv')

# Create a new dataframe called 'survivors' that contains only the rows where the 'Survived' column is equal to 1 (which means they survived)
survivors = df[df['Survived'] == 1]

# Group the data by 'Sex' and 'Pclass' columns and count the number of survivors in each group
grouped = survivors.groupby(['Sex', 'Pclass'])['Survived'].count()

# Print the grouped data
print(grouped)

# Output:
# Sex     Pclass
# female  1         91
#         2         70
#         3         72
# male    1         45
#         2         17
#         3         47
# Name: Survived, dtype: int64

# The 'groupby' function groups the data by the specified columns and the 'count' function counts the number of survivors in each group.
# The output shows the number of female and male survivors in each passenger class.

This will create a new Series called ‘grouped’ (a single column of values rather than a full DataFrame) that holds the count of passengers who survived for each combination of gender and class.
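
Raw counts can be misleading when the groups have different sizes, so it is often worth looking at the survival rate as well. Because ‘Survived’ holds only 0s and 1s, its mean within a group is the share of that group who survived. The sketch below uses a small made-up DataFrame (an assumption for illustration, so the numbers are not real Titanic results):

```python
import pandas as pd

# Hypothetical sample rows (made up for illustration, not the real Kaggle data)
df = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 1, 3, 1, 3],
    "Sex":      ["female", "female", "female", "female",
                 "male", "male", "male", "male"],
})

# Mean of a 0/1 column = fraction of rows equal to 1, i.e. the survival rate per group
rates = df.groupby(["Sex", "Pclass"])["Survived"].mean()
print(rates)

# unstack() moves the inner index level (Pclass) into columns for a compact table
table = rates.unstack()
print(table)
```

Note that this groups the full DataFrame, not the ‘survivors’ subset, since a rate needs the non-survivors in the denominator.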

Finally, let’s talk about visualization, because what good is data if you can’t see it? One of my favorite libraries for this is matplotlib. It lets us create charts and graphs with just a few lines of code. Let’s say we want to plot the number of passengers who survived by class:

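# Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Loading the data (same file as in the filtering example)
df = pd.read_csv('titanic_data.csv')

# Grouping the data by passenger class and adding up the "Survived" column
# "Survived" is 1 for survivors and 0 otherwise, so the sum is the number of survivors
# (count() would count every passenger in the class, whether or not they survived)
grouped = df.groupby('Pclass')['Survived'].sum()

# Creating a bar plot with the number of survivors on the y-axis and passenger class on the x-axis
plt.bar(grouped.index, grouped.values)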

# Adding a label for the x-axis
plt.xlabel("Class")

# Adding a label for the y-axis
plt.ylabel("Number of Survivors")

# Adding a title for the plot
plt.title("Titanic Survivor Count by Class")

# Displaying the plot
plt.show()

# This script groups the data by passenger class, works out the number of survivors in each class,
# and draws the result as a bar plot with axis labels and a title.

This will create a bar chart that shows the number of passengers who survived for each class (1 is first class, 2 is second, and 3 is third).
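
The same approach works for other summaries. As a rough sketch, here is a chart of the survival rate per class instead of the raw count, saved to an image file rather than shown in a window. The data is a small made-up sample, and the output file name and the off-screen "Agg" backend are assumptions chosen so the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample rows (made up for illustration, not the real Kaggle data)
df = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
})

# Mean of the 0/1 "Survived" column = share of each class that survived
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Draw one bar per class, using the class numbers as text labels on the x-axis
fig, ax = plt.subplots()
bars = ax.bar(rate_by_class.index.astype(str), rate_by_class.values)
ax.set_xlabel("Class")
ax.set_ylabel("Share of passengers who survived")
ax.set_title("Survival rate by class (made-up sample)")
ax.set_ylim(0, 1)

# Write the chart to a file (hypothetical name) instead of opening a window
fig.savefig("survival_rate_by_class.png")
```

In an interactive session you can drop the matplotlib.use line and call plt.show() as in the example above.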

Of course, this is just scratching the surface, but hopefully it’s enough to get you started!
