Generating data sets

First, import the emipy package and read the database.

The program stores the path set during project initialisation, automatically searches for the data there, and loads it. You can also read a specific database. To do so, pass the path to read_db() as a string argument.

import emipy as ep

db = ep.read_db()
db.head()

	FacilityReportID	PollutantReleaseAndTransferReportID	FacilityID	NationalID	ParentCompanyName	FacilityName	StreetName	BuildingNumber	City	PostalCode	...	PollutantName	PollutantGroupCode	PollutantGroupName	PollutantCAS	MethodBasisCode	MethodBasisName	TotalQuantity	UnitCode	UnitName
0	1856	1	5763	1013410312	Lenzing AG	Lenzing AG	Werkstraße 1	NaN	Lenzing	4860	...	Particulate matter (PM10)	INORG	Inorganic substances	NaN	E	Estimated	68200.0	KGM	kilogram
1	1856	1	5763	1013410312	Lenzing AG	Lenzing AG	Werkstraße 1	NaN	Lenzing	4860	...	Sulphur oxides (SOx/SO2)	OTHGAS	Other gases	NaN	M	Measured	420000.0	KGM	kilogram
2	1856	1	5763	1013410312	Lenzing AG	Lenzing AG	Werkstraße 1	NaN	Lenzing	4860	...	Carbon dioxide (CO2)	GRHGAS	Greenhouse gases	124-38-9	E	Estimated	182000000.0	KGM	kilogram
3	1856	1	5763	1013410312	Lenzing AG	Lenzing AG	Werkstraße 1	NaN	Lenzing	4860	...	Nitrogen oxides (NOx/NO2)	OTHGAS	Other gases	NaN	M	Measured	818000.0	KGM	kilogram
4	1857	1	5764	1013410313	Lenzing AG	Wasserreinhalteverband Lenzing - Lenzing AG	Werkstraße 1	NaN	Lenzing	4860	...	Zinc and compounds (as Zn)	HEVMET	Heavy metals	NaN	M	Measured	3210.0	KGM	kilogram

5 rows × 73 columns

A list of possible column names to filter for is displayed with:

db.columns

Index(['FacilityReportID', 'PollutantReleaseAndTransferReportID', 'FacilityID',
       'NationalID', 'ParentCompanyName', 'FacilityName', 'StreetName',
       'BuildingNumber', 'City', 'PostalCode', 'CountryCode', 'CountryName',
       'Lat', 'Long', 'RBDGeoCode', 'RBDGeoName', 'NUTSRegionGeoCode',
       'NUTSRegionGeoName', 'RBDSourceCode', 'RBDSourceName',
       'NUTSRegionSourceCode', 'NUTSRegionSourceName',
       'NACEMainEconomicActivityCode', 'NACEMainEconomicActivityName',
       'CompetentAuthorityName', 'CompetentAuthorityAddressStreetName',
       'CompetentAuthorityAddressBuildingNumber',
       'CompetentAuthorityAddressCity', 'CompetentAuthorityAddressPostalCode',
       'CompetentAuthorityAddressCountryCode',
       'CompetentAuthorityAddressCountryName',
       'CompetentAuthorityTelephoneCommunication',
       'CompetentAuthorityFaxCommunication',
       'CompetentAuthorityEmailCommunication',
       'CompetentAuthorityContactPersonName', 'ProductionVolumeProductName',
       'ProductionVolumeQuantity', 'ProductionVolumeUnitCode',
       'ProductionVolumeUnitName', 'TotalIPPCInstallationQuantity',
       'OperatingHours', 'TotalEmployeeQuantity', 'WebsiteCommunication',
       'PublicInformation', 'ConfidentialIndicator',
       'ConfidentialityReasonCode', 'ConfidentialityReasonName',
       'ProtectVoluntaryData', 'MainIASectorCode', 'MainIASectorName',
       'MainIAActivityCode', 'MainIAActivityName', 'MainIASubActivityCode',
       'MainIASubActivityName', 'ReportingYear', 'CoordinateSystemCode',
       'CoordinateSystemName', 'CdrReleased', 'Published',
       'PollutantReleaseID', 'ReleaseMediumCode', 'ReleaseMediumName',
       'PollutantCode', 'PollutantName', 'PollutantGroupCode',
       'PollutantGroupName', 'PollutantCAS', 'MethodBasisCode',
       'MethodBasisName', 'TotalQuantity', 'AccidentalQuantity', 'UnitCode',
       'UnitName'],
      dtype='object')

If you are interested in, e.g., the countries that occur in your database, you can get a list of them with the get_CountryList() function. There are more get_xy() functions to access the information in your database. For more information, take a look at the processdata module description.

ep.get_CountryList(db)

['Austria',
 'Belgium',
 'Cyprus',
 'Czech Republic',
 'Germany',
 'Denmark',
 'Estonia',
 'Spain',
 'Finland',
 'France',
 'Greece',
 'Hungary',
 'Ireland',
 'Italy',
 'Lithuania',
 'Luxembourg',
 'Latvia',
 'Malta',
 'Netherlands',
 'Norway',
 'Poland',
 'Portugal',
 'Sweden',
 'Slovenia',
 'Slovakia',
 'United Kingdom',
 'Iceland',
 'Serbia',
 'Romania',
 'Bulgaria',
 'Switzerland',
 'Croatia']
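Because db is a pandas DataFrame, a comparable list can be sketched with plain pandas. The miniature table below is invented for illustration, and the snippet is not emipy's implementation of get_CountryList(); it only shows the idea of collecting the distinct values of one column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the database (values invented for illustration).
toy = pd.DataFrame({
    "CountryName": ["Austria", "Austria", "Germany", "Belgium", "Germany"],
    "ReportingYear": [2014, 2015, 2014, 2016, 2016],
})

# Collect the distinct values of one column, in order of first appearance.
countries = toy["CountryName"].unique().tolist()
print(countries)
```

The same pattern works for any other column listed in db.columns.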

The actual filtering is done with the function f_db(). You have to specify the database that you want to filter, along with the column names and column values that you want to filter for.

Note

The following lines only create the DataFrame and do not display it. To display the data table, execute e.g. data1.head().

For a better overview, you can use data = ep.row_reduction(db). The new DataFrame is reduced to a selected list of columns, and this list can be adjusted.

Let’s filter for pollution in Germany:

data1 = ep.f_db(db, CountryName='Germany')

If you want to filter for multiple values in one column, pass them as a list.

data2 = ep.f_db(db, CountryName=['Germany', 'Switzerland', 'Austria'])

You can filter for multiple columns at the same time:

CountryName = ['Germany', 'Austria', 'Switzerland']
ReportingYear = [2014, 2015, 2016, 2017]
PollutantName = ['Carbon dioxide (CO2)', 'Methane (CH4)']

data3 = ep.f_db(db, CountryName=CountryName, ReportingYear=ReportingYear, PollutantName=PollutantName)
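To make the logic of such a combined filter concrete, here is a rough pandas-only sketch on an invented miniature table. It assumes that the values given for one column are alternatives and that the conditions on different columns must all hold at once. This is an assumption about the behaviour, not a copy of how f_db() is implemented.

```python
import pandas as pd

# Hypothetical miniature stand-in for the database (values invented for illustration).
toy = pd.DataFrame({
    "CountryName": ["Germany", "Germany", "France", "Austria"],
    "ReportingYear": [2014, 2013, 2014, 2016],
    "PollutantName": ["Carbon dioxide (CO2)", "Methane (CH4)",
                      "Carbon dioxide (CO2)", "Zinc and compounds (as Zn)"],
})

# Assumed semantics: a row is kept only if it matches one of the allowed
# values in every filtered column.
mask = (
    toy["CountryName"].isin(["Germany", "Austria", "Switzerland"])
    & toy["ReportingYear"].isin([2014, 2015, 2016, 2017])
    & toy["PollutantName"].isin(["Carbon dioxide (CO2)", "Methane (CH4)"])
)
selected = toy[mask]
print(selected)
```

In this toy table only the first row satisfies all three conditions.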

Note

Keep in mind that numbers are not of type string and therefore do not need quotation marks around them.
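The effect of a type mismatch can be seen in a small pandas-only sketch. The series below is invented for illustration and stands in for a numeric column such as ReportingYear; whether f_db() compares values in exactly this way is an assumption.

```python
import pandas as pd

# Hypothetical numeric column (values invented for illustration).
years = pd.Series([2014, 2015, 2016])

# Comparing against an integer finds the matching entry ...
matches_int = years.isin([2014]).tolist()

# ... while the string "2014" is a different value and matches nothing.
matches_str = years.isin(["2014"]).tolist()

print(matches_int, matches_str)
```

If a filter unexpectedly returns no rows, checking the column type with db['ReportingYear'].dtype is a quick first step.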

To find the precise values, use the get_xy() functions or, alternatively, take a look at the parameter table.
You can also filter step by step. To do so, pass the already filtered database to the filter function again.

You can adjust two more arguments in f_db().
If you want to look only at the European continent, you have to exclude exclaves that belong to European countries, such as French Guiana.

data4 = ep.f_db(db, ExclaveExclude=True)

If you set ReturnUnknown to True, the function returns a data table containing all entries that would be sorted out in the filter process only because they do not contain enough information to pass the filter. An empty table is a good sign.

data5 = ep.f_db(db, CountryName='Germany', ReturnUnknown=True)
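The idea behind such an "unknown" table can be illustrated with plain pandas. The sketch below assumes that "not enough information" means a missing value in the filtered column. That reading, and the invented miniature table, are assumptions for illustration and not a description of emipy's internals.

```python
import pandas as pd

# Hypothetical miniature table; the second facility has no country recorded.
toy = pd.DataFrame({
    "FacilityName": ["A", "B", "C"],
    "CountryName": ["Germany", None, "France"],
})

# Rows that clearly pass the filter.
matched = toy[toy["CountryName"] == "Germany"]

# Rows that can be neither confirmed nor ruled out, because the value is missing.
unknown = toy[toy["CountryName"].isna()]

print(matched)
print(unknown)
```

Here facility C is clearly rejected, while facility B ends up in the unknown table because its country is not recorded.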

Now you can generate your own data set of interest with a few lines of code. Since db is a pandas DataFrame, you can also use all pandas functions to customize your data generation.
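As one possible example of such a pandas step, the following sketch sums the reported quantities per country and year. The miniature table and its numbers are invented for illustration; only the column names are taken from the database shown above.

```python
import pandas as pd

# Hypothetical miniature stand-in for a filtered data set (numbers invented).
toy = pd.DataFrame({
    "CountryName": ["Germany", "Germany", "Austria", "Austria"],
    "ReportingYear": [2014, 2014, 2014, 2015],
    "TotalQuantity": [100.0, 50.0, 30.0, 20.0],
})

# Sum TotalQuantity for each combination of country and reporting year.
totals = (
    toy.groupby(["CountryName", "ReportingYear"])["TotalQuantity"]
    .sum()
    .reset_index()
)
print(totals)
```

When applying this to real data, it is worth checking that all rows in a group share the same UnitCode before adding them up.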