Python  - Panda - Basic Tutorial                                     Home : www.sharetechnote.com

 

 

 

 

Panda - Basic Tutorial

 

In this page, I will give you some of the basic operations for Panda. I will not go very deep into Panda functions in this page. This page is just for you to try some basic things without getting scared of too complicated operation right after you installed panda package.

 

The data for this tutorial is from https://support.spatialkey.com/spatialkey-sample-csv-data/ (If this link does not work, try this).  

Download the file, unzip it and put it into the folder where Python is installed. Of course, you can put the file into other places. However, if you want to run the command in this tutorial as it is without modifying the data path, put the csv file in your Python installation folder.

 

The data structure of the csv file is as shown below.

 

 

Following is the list of basic steps that you may would go through most commonly.

 

 

 

 

Importing Packages

 

In this tutorial, multiple packages listed below will be used. You can import all of the these package at the start or you can import them in the middle as you need it.

 

>>> import pandas as pd

>>> import numpy as np

>>> import matplotlib.pyplot as plt

>>> import random as rd

 

 

 

Read csv file

 

First, read the whole csv file and store the data into a variable. There are many options for read_csv() function so that you can restructure the data or read only a subset of the data. But I used the simplest way to use this function in this tutorial.

NOTE : it is assumed that you put the following csv file in c:\tmp folder.

 

>>> df = pd.read_csv('C:\\tmp\\FL_insurance_sample.csv')

 

 

 

Extract Row

 

Now let's take a specific row from the read csv file. For example, I want to take out the first row of the data (table) using the command as shown below.

 

>>> df.ix[0]

policyID                   119736

statecode                      FL

county                CLAY COUNTY

eq_site_limit              498960

hu_site_limit              498960

fl_site_limit              498960

fr_site_limit              498960

tiv_2011                   498960

tiv_2012                   792149

eq_site_deductible              0

hu_site_deductible         9979.2

fl_site_deductible              0

fr_site_deductible              0

point_latitude            30.1023

point_longitude          -81.7118

line                  Residential

construction              Masonry

point_granularity               1

Name: 0, dtype: object

 

 

You can extract the multiple consecutive rows from the data (table) as shown below.

 

>>> df.ix[3:5]

   policyID statecode       county  eq_site_limit  hu_site_limit  \

3    333743        FL  CLAY COUNTY            0.0       79520.76

4    172534        FL  CLAY COUNTY            0.0      254281.50

5    785275        FL  CLAY COUNTY            0.0      515035.62

 

   fl_site_limit  fr_site_limit   tiv_2011   tiv_2012  eq_site_deductible  \

3            0.0            0.0   79520.76   86854.48                 0.0

4            0.0       254281.5  254281.50  246144.49                 0.0

5            0.0            0.0  515035.62  884419.17                 0.0

 

   hu_site_deductible  fl_site_deductible  fr_site_deductible  point_latitude

3                 0.0                 0.0                   0       30.063236

4                 0.0                 0.0                   0       30.060614

5                 0.0                 0.0                   0       30.063236

 

   point_longitude         line construction  point_granularity

3       -81.707703  Residential         Wood                  3

4       -81.702675  Residential         Wood                  1

5       -81.707703  Residential      Masonry                  3

 

 

You can extract multiple / non-consecutive rows using a array as shown below.

 

>>> df.ix[[3,5,9]]

 

   policyID statecode       county  eq_site_limit  hu_site_limit  \

3    333743        FL  CLAY COUNTY            0.0       79520.76

5    785275        FL  CLAY COUNTY            0.0      515035.62

9    142071        FL  CLAY COUNTY       705600.0      705600.00

 

   fl_site_limit  fr_site_limit   tiv_2011    tiv_2012  eq_site_deductible  \

3            0.0            0.0   79520.76    86854.48                 0.0

5            0.0            0.0  515035.62   884419.17                 0.0

9       705600.0       705600.0  705600.00  1010842.56             14112.0

 

   hu_site_deductible  fl_site_deductible  fr_site_deductible  point_latitude  \

3                 0.0                 0.0                   0       30.063236

5                 0.0                 0.0                   0       30.063236

9             35280.0                 0.0                   0       30.100628

 

   point_longitude         line construction  point_granularity

3       -81.707703  Residential         Wood                  3

5       -81.707703  Residential      Masonry                  3

9       -81.703751  Residential      Masonry                  1

 

 

You can get the exactly same result as above using the following command.

 

>>> r = [3,5,9]

>>> df.ix[r]

 

 

 

Extract Column

 

You can extract a certain column from the csv file using the column title as shown below.

 

>>> df_eq = df['eq_site_limit']

 

You can print out the extracted data using print(). Don't worry even if the number of the extracted data is too big. I noticed that the print() print only a subset of the data as shown below.

 

>>> print(df_eq)

0         498960.0

1        1322376.3

2         190724.4

3              0.0

4              0.0

5              0.0

6              0.0

7         328500.0

8         315000.0

9         705600.0

10        831498.3

11             0.0

12             0.0

13             0.0

14             0.0

15             0.0

16             0.0

17             0.0

18             0.0

19             0.0

20             0.0

21             0.0

22             0.0

23             0.0

24             0.0

25             0.0

26             0.0

27             0.0

28             0.0

29             0.0

           ...

36604          0.0

36605          0.0

36606          0.0

36607          0.0

36608          0.0

36609    1917403.2

36610    1392346.8

36611    1175310.0

36612      47250.9

36613      18630.0

36614       9450.0

36615          0.0

36616          0.0

36617          0.0

36618          0.0

36619          0.0

36620          0.0

36621          0.0

36622          0.0

36623          0.0

36624          0.0

36625          0.0

36626          0.0

36627          0.0

36628          0.0

36629          0.0

36630    1297057.5

36631     173286.9

36632    1499781.6

36633     373488.3

Name: eq_site_limit, dtype: float64

 

 

You can extract the multiple columns as using the array (list) of column titles as shown below.

 

>>> df_eq = df[['eq_site_limit','eq_site_deductible']]

>>> print(df_eq)

       eq_site_limit  eq_site_deductible

0           498960.0                 0.0

1          1322376.3                 0.0

2           190724.4                 0.0

3                0.0                 0.0

4                0.0                 0.0

5                0.0                 0.0

6                0.0                 0.0

7           328500.0                 0.0

8           315000.0                 0.0

9           705600.0             14112.0

10          831498.3                 0.0

11               0.0                 0.0

12               0.0                 0.0

13               0.0                 0.0

14               0.0                 0.0

15               0.0                 0.0

16               0.0                 0.0

17               0.0                 0.0

18               0.0                 0.0

19               0.0                 0.0

20               0.0                 0.0

21               0.0                 0.0

22               0.0                 0.0

23               0.0                 0.0

24               0.0                 0.0

25               0.0                 0.0

26               0.0                 0.0

27               0.0                 0.0

28               0.0                 0.0

29               0.0                 0.0

...              ...                 ...

36604            0.0                 0.0

36605            0.0                 0.0

36606            0.0                 0.0

36607            0.0                 0.0

36608            0.0                 0.0

36609      1917403.2                 0.0

36610      1392346.8                 0.0

36611      1175310.0                 0.0

36612        47250.9                 0.0

36613        18630.0                 0.0

36614         9450.0                 0.0

36615            0.0                 0.0

36616            0.0                 0.0

36617            0.0                 0.0

36618            0.0                 0.0

36619            0.0                 0.0

36620            0.0                 0.0

36621            0.0                 0.0

36622            0.0                 0.0

36623            0.0                 0.0

36624            0.0                 0.0

36625            0.0                 0.0

36626            0.0                 0.0

36627            0.0                 0.0

36628            0.0                 0.0

36629            0.0                 0.0

36630      1297057.5                 0.0

36631       173286.9                 0.0

36632      1499781.6                 0.0

36633       373488.3                 0.0

 

[36634 rows x 2 columns]

 

 

 

Extracting a Subset (Sub Table)

 

Using the combination of Row Extraction and Column Extraction functionalities, you can cut out a subset (a subtable) from the csv file as shown in the following examples.

 

>>> df_subset = df.ix[3:10][['eq_site_limit','eq_site_deductible']]

>>> print(df_subset)

   eq_site_limit  eq_site_deductible

3            0.0                 0.0

4            0.0                 0.0

5            0.0                 0.0

6            0.0                 0.0

7       328500.0                 0.0

8       315000.0                 0.0

9       705600.0             14112.0

10      831498.3                 0.0

 

 

>>> df_subset = df.ix[3:10,['eq_site_limit','eq_site_deductible']]

>>> print(df_subset)

   eq_site_limit  eq_site_deductible

3            0.0                 0.0

4            0.0                 0.0

5            0.0                 0.0

6            0.0                 0.0

7       328500.0                 0.0

8       315000.0                 0.0

9       705600.0             14112.0

10      831498.3                 0.0

 

 

>>> df_subset = df.ix[[3,5,10]][['eq_site_limit','eq_site_deductible']]

>>> print(df_subset)

    eq_site_limit  eq_site_deductible

3             0.0                 0.0

5             0.0                 0.0

10       831498.3                 0.0

 

 

>>> df_subset = df.ix[[3,5,10],['eq_site_limit','eq_site_deductible']]

>>> print(df_subset)

    eq_site_limit  eq_site_deductible

3             0.0                 0.0

5             0.0                 0.0

10       831498.3                 0.0

 

 

If you use a random array as row index array as shown below, you can get random sampling of the csv file.

 

>>> rndRow = [rd.randint(0,100) for i in range(10)]

>>> df_subset = df.ix[rndRow][['eq_site_limit','eq_site_deductible']]

>>> print(df_subset)

    eq_site_limit  eq_site_deductible

34            0.0                 0.0

43       213876.0                 0.0

1       1322376.3                 0.0

52            0.0                 0.0

3             0.0                 0.0

83        42300.0                 0.0

15            0.0                 0.0

86            0.0                 0.0

11            0.0                 0.0

21            0.0                 0.0

 

 

 

Plot Data

 

You can easily visualize (e.g, plot) the data using functions in matplotlib as shown in this section.

 

First, you need to import the matplotlib package as shown below.

 

>>> import matplotlib.pyplot as plt  

    #  in manu tutorial, they say 'import matplotlib as plt' but this didn't work in my setup. When I tried plot() function, it gave me an errror as shown below. So I tried 'import matplotlib.pyplot as plt' and it worked.

      >>> plt.plot(df_eq)

      Traceback (most recent call last):

        File "<stdin>", line 1, in <module>

      AttributeError: module 'matplotlib' has no attribute 'plot'

 

Then extract a dataset as in the following example.

 

>>> df_eq = df['eq_site_limit']

 

Then plot it as shown below. However, just running a plot() function would not display any graph. It just create a plot in the memory without displaying it on the screen.

 

>>> plt.plot(df_eq)

[<matplotlib.lines.Line2D object at 0x0612E770>]

 

In order to display the plot, you need to run show() function as shown below.

 

>>> plt.show()

 

 

 

If you want to run all of the above steps in a file, write the following file and try it.

 

panda_BasicCheck.py

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import random as rd

 

df = pd.read_csv('C:\\tmp\\FL_insurance_sample.csv')

 

df.ix[0]

df.ix[3:5]

df.ix[[3,5,9]]

r = [3,5,9]

 

df.ix[r]

df_eq = df['eq_site_limit']

print(df_eq)

 

df_eq = df[['eq_site_limit','eq_site_deductible']]

print(df_eq)

 

df_subset = df.ix[3:10][['eq_site_limit','eq_site_deductible']]

print(df_subset)

 

df_subset = df.ix[3:10,['eq_site_limit','eq_site_deductible']]

print(df_subset)

 

df_subset = df.ix[[3,5,10]][['eq_site_limit','eq_site_deductible']]

print(df_subset)

 

df_subset = df.ix[[3,5,10],['eq_site_limit','eq_site_deductible']]

print(df_subset)

 

rndRow = [rd.randint(0,100) for i in range(10)]

df_subset = df.ix[rndRow][['eq_site_limit','eq_site_deductible']]

print(df_subset)

 

df_eq = df['eq_site_limit']

plt.plot(df_eq)

plt.show()